CanonicalGroundingService
CanonicalGroundingService bridges the gap between RAG search results (which contain short text chunks) and full ground-truth documents stored in external systems (S3, SharePoint, GCS). It resolves KnowledgeSource references from retrieval back to their canonical source, fetches the complete documents, and optionally expands markdown documents through their includes[] frontmatter via the CompositionLoader.
Import
Constructor
const grounding = new CanonicalGroundingService(
  fileSearchStores: GeminiFileSearchStores,
  fetcher: CanonicalDocumentFetcher,
  cfg?: CanonicalGroundingConfig,
);
| Parameter | Type | Description |
|---|---|---|
| `fileSearchStores` | `GeminiFileSearchStores` | File Search Stores client for resolving document metadata. |
| `fetcher` | `CanonicalDocumentFetcher` | Fetches documents from canonical storage (S3, SharePoint, GCS) and prepares them for Gemini. |
| `cfg` | `CanonicalGroundingConfig` | Configuration for document limits, source types, preparation, and composition. |
CanonicalGroundingConfig
| Property | Type | Default | Description |
|---|---|---|---|
| `maxDocuments` | `number` | `3` | Maximum number of canonical documents to resolve per request. |
| `allowedSourceTypes` | `CanonicalSourceType[]` | `["sharepoint", "s3"]` | Only resolve canonical pointers of these source types. |
| `prepareDocumentOptions` | `PrepareDocumentOptions` | `{ forceContentType: "attachment", useFilesApi: "auto" }` | How to prepare documents for Gemini (inline vs Files API, content type forcing). |
| `compositionPolicy` | `CompositionPolicy` | `undefined` | When set, enables markdown `includes[]` expansion via BFS traversal. See Document Composition. |
Creating the service
import {
  GeminiClient,
  CanonicalGroundingService,
  CanonicalDocumentFetcher,
  DocumentAttachment,
} from "@modernpath/agent-framework";

const client = new GeminiClient({ apiKey: process.env.GEMINI_API_KEY! });
const stores = client.fileSearchStores();

// spAuth, spGraph, and parser are assumed to be constructed elsewhere.
const attachment = new DocumentAttachment(spAuth, spGraph, parser, client);
const fetcher = new CanonicalDocumentFetcher(attachment);

const grounding = new CanonicalGroundingService(stores, fetcher, {
  maxDocuments: 3,
  allowedSourceTypes: ["s3", "sharepoint"],
  prepareDocumentOptions: {
    forceContentType: "attachment",
    useFilesApi: "auto",
  },
  compositionPolicy: {
    maxDepth: 2,
    maxDocs: 8,
    maxIncludeBytes: 5 * 1024 * 1024,
    timeoutMs: 10_000,
  },
});
Methods
prepareFromSources()
Resolve RAG sources to canonical documents, fetch them from ground-truth storage, and optionally expand includes.
async prepareFromSources(
  sources: KnowledgeSource[],
  ctx: { userId: number; auditingId: number },
  overrides?: Partial<CanonicalGroundingConfig>,
): Promise<CanonicalGroundingResult>
| Parameter | Type | Description |
|---|---|---|
| `sources` | `KnowledgeSource[]` | Sources from `RetrievalService.search()`. |
| `ctx` | `{ userId, auditingId }` | Authentication context for fetching documents from SharePoint or other authenticated sources. |
| `overrides` | `Partial<CanonicalGroundingConfig>` | Per-call config overrides (merged with constructor config). |
Processing phases:
flowchart TD
A["Phase 1: Build store catalog"] --> B["Phase 2: Resolve sources to pointers"]
B --> C["Phase 3: Fetch canonical documents"]
C --> D{"Composition\npolicy set?"}
D -- Yes --> E["Phase 4: Expand includes"]
D -- No --> F["Return flat result"]
E --> G["Return composed result"]

Phase 1 -- Build store catalog: Lists all documents in referenced stores (paginated, max 20 per page) and builds an in-memory catalog mapping display names to resource names and metadata. This avoids per-document API calls.
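Phase 1 can be pictured as a paginated listing folded into a map keyed by display name. The following is a minimal sketch under stated assumptions: `CatalogEntry`, `StorePage`, `ListPage`, and `buildCatalog` are hypothetical names for illustration, and the framework's actual listing call is not shown on this page.

```typescript
// Hypothetical shapes for illustration; the framework's real types may differ.
interface CatalogEntry {
  name: string; // resource name of the document in the store
  displayName: string;
  customMetadata: Record<string, string>;
}

interface StorePage {
  documents: CatalogEntry[];
  nextPageToken?: string;
}

// Stand-in for a paginated "list documents" call on a store (assumed signature).
type ListPage = (
  store: string,
  pageSize: number,
  pageToken?: string,
) => Promise<StorePage>;

async function buildCatalog(
  stores: string[],
  listPage: ListPage,
  pageSize = 20, // page-size cap mentioned in Phase 1
): Promise<Map<string, CatalogEntry>> {
  const catalog = new Map<string, CatalogEntry>();
  for (const store of stores) {
    let token: string | undefined;
    do {
      const page = await listPage(store, pageSize, token);
      // Index each document by display name for later lookups.
      for (const doc of page.documents) catalog.set(doc.displayName, doc);
      token = page.nextPageToken;
    } while (token);
  }
  return catalog;
}
```

One listing pass per store replaces a metadata request per retrieved source, which is the saving the phase description refers to.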
Phase 2 -- Resolve sources to canonical pointers: For each KnowledgeSource, looks up its customMetadata from the catalog (or falls back to individual getDocument() calls), parses the canonical pointer, and filters by allowedSourceTypes. Stops at maxDocuments.
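The resolution loop in Phase 2 might look roughly like the sketch below. It assumes that `customMetadata` is a flat string map whose keys mirror the pointer fields, which this page does not confirm, and it handles only S3 pointers for brevity. `parseS3Pointer` and `resolvePointers` are hypothetical helpers.

```typescript
type SourceTypeName = "sharepoint" | "s3" | "gcs";

interface S3PointerLite {
  source_type: "s3";
  s3_bucket: string;
  s3_key: string;
}

// Assumption: metadata keys mirror the pointer field names.
function parseS3Pointer(meta: Record<string, string>): S3PointerLite | undefined {
  if (meta["source_type"] !== "s3") return undefined;
  const s3_bucket = meta["s3_bucket"];
  const s3_key = meta["s3_key"];
  if (!s3_bucket || !s3_key) return undefined; // missing metadata: unresolvable
  return { source_type: "s3", s3_bucket, s3_key };
}

function resolvePointers(
  sourceTitles: string[],
  catalog: Map<string, Record<string, string>>,
  allowedSourceTypes: SourceTypeName[],
  maxDocuments: number,
): S3PointerLite[] {
  const pointers: S3PointerLite[] = [];
  for (const title of sourceTitles) {
    if (pointers.length >= maxDocuments) break; // stop at the configured limit
    const meta = catalog.get(title);
    if (!meta) continue; // the service would fall back to getDocument() here
    const pointer = parseS3Pointer(meta);
    if (pointer && allowedSourceTypes.includes(pointer.source_type)) {
      pointers.push(pointer);
    }
  }
  return pointers;
}
```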
Phase 3 -- Fetch canonical documents: Fetches full documents from the canonical source (S3, SharePoint, GCS) and prepares them as PreparedDocument objects ready for Gemini attachment.
Phase 4 -- Composition expansion (optional): If compositionPolicy is configured, markdown documents are expanded via the CompositionLoader. A shared visited set across all root documents prevents duplicate fetches.
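The shared visited set in Phase 4 can be illustrated with a small breadth-first traversal. This is a sketch, not the `CompositionLoader` implementation: `expandIncludes` and `IncludesOf` are hypothetical, documents are reduced to URI strings, and `maxDocs` is assumed here to cap the number of included documents per root.

```typescript
interface ExpansionPolicy {
  maxDepth: number;
  maxDocs: number; // assumed: cap on included documents per root
}

// Hypothetical lookup: document URI -> URIs listed in its includes[] frontmatter.
type IncludesOf = (uri: string) => string[];

function expandIncludes(
  root: string,
  includesOf: IncludesOf,
  policy: ExpansionPolicy,
  visited: Set<string>, // shared across all root documents
): string[] {
  const included: string[] = [];
  visited.add(root);
  let frontier = [root];
  // Walk one level of includes per iteration, up to maxDepth levels.
  for (let depth = 1; depth <= policy.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const uri of frontier) {
      for (const child of includesOf(uri)) {
        if (visited.has(child)) continue; // already fetched for this or another root
        if (included.length >= policy.maxDocs) return included;
        visited.add(child);
        included.push(child);
        next.push(child);
      }
    }
    frontier = next;
  }
  return included;
}
```

Passing the same `visited` set to every root means a document included by two roots is fetched only once, at the cost of appearing in only the first root's result.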
Basic usage
// After retrieval:
const answer = await retrieval.search("cold complaint procedure", "hvac-kb");

// Resolve to canonical documents:
const result = await grounding.prepareFromSources(
  answer.sources,
  { userId: 123, auditingId: 456 },
);
console.log(`Resolved ${result.pointers.length} canonical documents`);
console.log(`Total attachments: ${result.allDocumentsForAttachment.length}`);

// Use with Gemini:
const geminiResult = await client.generateContent(prompt, {
  documents: result.allDocumentsForAttachment
    .filter(d => d.geminiDocument)
    .map(d => d.geminiDocument!),
  fileReferences: result.allDocumentsForAttachment
    .filter(d => d.geminiFileReference)
    .map(d => d.geminiFileReference!),
});
With per-call overrides
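The parameter table says that overrides are merged with the constructor config. A minimal sketch of one plausible merge follows, assuming a shallow merge in which override keys win; the actual merge strategy is not specified on this page, and `mergeConfig` and `GroundingConfigLite` are hypothetical names.

```typescript
type GroundingSourceType = "sharepoint" | "s3" | "gcs";

// Reduced config shape for illustration only.
interface GroundingConfigLite {
  maxDocuments: number;
  allowedSourceTypes: GroundingSourceType[];
}

// Defaults taken from the CanonicalGroundingConfig table.
const DEFAULT_CONFIG: GroundingConfigLite = {
  maxDocuments: 3,
  allowedSourceTypes: ["sharepoint", "s3"],
};

// Hypothetical helper: later objects win key by key (shallow merge, assumed).
function mergeConfig(
  constructorCfg: Partial<GroundingConfigLite> = {},
  overrides: Partial<GroundingConfigLite> = {},
): GroundingConfigLite {
  return { ...DEFAULT_CONFIG, ...constructorCfg, ...overrides };
}

// Constructor raised the limit; this call restricts sources to S3 only.
const effective = mergeConfig({ maxDocuments: 5 }, { allowedSourceTypes: ["s3"] });
console.log(effective);
```

With the service itself, the equivalent call would pass the overrides as the third argument, for example `grounding.prepareFromSources(answer.sources, ctx, { maxDocuments: 1 })`.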
Types
CanonicalGroundingResult
interface CanonicalGroundingResult {
  /** Resolved canonical pointers (parallel to preparedDocuments). */
  pointers: CanonicalDocumentPointer[];
  /** Fetched and prepared root documents (parallel to pointers). */
  preparedDocuments: PreparedDocument[];
  /**
   * Composition bundles for documents with expanded includes.
   * Indexed parallel to pointers/preparedDocuments.
   * `undefined` entries indicate no composition was performed for that document.
   * Only populated when compositionPolicy is set.
   */
  compositionBundles?: Array<CompositionBundle | undefined>;
  /**
   * All documents for attachment (root docs + included docs), deduplicated
   * by canonical URI, in recommended order (roots first, then included
   * documents in BFS order).
   * When composition is disabled, equals preparedDocuments.
   */
  allDocumentsForAttachment: PreparedDocument[];
}
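The ordering and deduplication described for `allDocumentsForAttachment` can be sketched as follows. `AttachmentDoc` and `flattenForAttachment` are hypothetical; the sketch assumes each document exposes a canonical URI string and that each bundle reduces to a list of included documents already in BFS order.

```typescript
// Hypothetical minimal document shape: only the canonical URI matters here.
interface AttachmentDoc {
  uri: string;
}

function flattenForAttachment(
  roots: AttachmentDoc[],
  includedPerRoot: Array<AttachmentDoc[] | undefined>, // parallel to roots
): AttachmentDoc[] {
  const seen = new Set<string>();
  const out: AttachmentDoc[] = [];
  const add = (doc: AttachmentDoc): void => {
    if (seen.has(doc.uri)) return; // first occurrence of a URI wins
    seen.add(doc.uri);
    out.push(doc);
  };
  roots.forEach(add); // roots first
  for (const included of includedPerRoot) {
    (included ?? []).forEach(add); // undefined entry: no composition for that root
  }
  return out;
}
```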
CanonicalDocumentPointer
A pointer to a document in canonical storage. Stored in Gemini File Search Store customMetadata.
type CanonicalDocumentPointer =
  | SharePointDocumentPointer
  | S3DocumentPointer
  | GcsDocumentPointer;

// S3 example:
interface S3DocumentPointer {
  source_type: "s3";
  s3_bucket: string;
  s3_key: string;
  s3_region?: string;
  s3_endpoint?: string; // For MinIO/S3-compatible stores
  s3_force_path_style?: boolean;
  display_name?: string;
  mime_type?: string;
}
CanonicalDocumentFetcher
Fetches documents from canonical storage and prepares them for Gemini.
class CanonicalDocumentFetcher {
  constructor(attachment: DocumentAttachment);
  async fetchPreparedDocument(
    pointer: CanonicalDocumentPointer,
    ctx: { userId: number; auditingId: number },
    prepareOptions?: PrepareDocumentOptions,
  ): Promise<PreparedDocument>;
}
Routing by source_type:
| Source Type | Fetch Method |
|---|---|
| `sharepoint` | `DocumentAttachment.downloadAndPrepareDocument()` via Graph API |
| `s3` | AWS SDK `GetObjectCommand`, then `DocumentAttachment.prepareInlineDocument()` |
| `gcs` | `@google-cloud/storage` download, then `DocumentAttachment.prepareInlineDocument()` |
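Routing on `source_type` maps naturally onto a discriminated union with an exhaustive `switch`. The sketch below only returns a label for the route rather than performing a fetch. `RoutablePointer` and `fetchRoute` are hypothetical, and the SharePoint and GCS variants carry only the discriminant because their fields are not listed on this page.

```typescript
type RoutablePointer =
  | { source_type: "sharepoint" }
  | { source_type: "s3"; s3_bucket: string; s3_key: string }
  | { source_type: "gcs" };

// Returns a label naming the fetch path from the routing table.
function fetchRoute(pointer: RoutablePointer): string {
  switch (pointer.source_type) {
    case "sharepoint":
      return "graph-download";
    case "s3":
      return "s3-get-object";
    case "gcs":
      return "gcs-download";
    default: {
      // Compile-time exhaustiveness check: adding a new variant fails here.
      const unreachable: never = pointer;
      throw new Error(`Unsupported source type: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` assignment makes the compiler flag any pointer variant added later without a matching case.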
Graceful Degradation
The service is designed for resilient operation:
- Catalog failures are logged but do not abort the pipeline. Individual `getDocument()` calls are used as a fallback.
- Individual document fetch failures are logged and skipped. The remaining documents are still returned.
- Composition failures for individual documents are logged and skipped. The root document is still included.
- Unresolvable sources (missing metadata, unsupported source types) are silently skipped.
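The log-and-skip behaviour for individual fetch failures can be sketched as below. `fetchAllTolerant` is a hypothetical helper; it fetches sequentially for simplicity, and whether the service fetches sequentially or in parallel is not stated here.

```typescript
async function fetchAllTolerant<P, D>(
  pointers: P[],
  fetchOne: (pointer: P) => Promise<D>,
  log: (message: string) => void = console.warn,
): Promise<{ pointers: P[]; documents: D[] }> {
  const kept: P[] = [];
  const documents: D[] = [];
  for (const pointer of pointers) {
    try {
      documents.push(await fetchOne(pointer));
      kept.push(pointer); // only reached on success, so the arrays stay parallel
    } catch (err) {
      log(`Fetch failed, skipping document: ${String(err)}`);
    }
  }
  return { pointers: kept, documents };
}
```

Dropping the failed pointer together with its document keeps the two arrays parallel, which matches how `pointers` and `preparedDocuments` are described in `CanonicalGroundingResult`.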
Debug Logging
Enable with:
Logs include detailed per-phase output: catalog building, pointer resolution, metadata parsing, and composition expansion.
Related Pages
- Knowledge Base Overview -- full RAG pipeline
- Document Composition -- `includes[]` expansion
- Grounding Policy -- controlling what reaches the LLM
- Ingestion -- setting up canonical metadata during ingestion
- Retrieval -- generating the `KnowledgeSource[]` input