CanonicalGroundingService

CanonicalGroundingService bridges the gap between RAG search results (which contain short text chunks) and full ground-truth documents stored in external systems (S3, SharePoint, GCS). It resolves KnowledgeSource references from retrieval back to their canonical source, fetches the complete documents, and optionally expands markdown documents through their includes[] frontmatter via the CompositionLoader.

Import

import { CanonicalGroundingService } from "@modernpath/agent-framework";

Constructor

constructor(
  fileSearchStores: GeminiFileSearchStores,
  fetcher: CanonicalDocumentFetcher,
  cfg?: CanonicalGroundingConfig,
)

Parameter	Type	Description
`fileSearchStores`	`GeminiFileSearchStores`	File Search Stores client for resolving document metadata.
`fetcher`	`CanonicalDocumentFetcher`	Fetches documents from canonical storage (S3, SharePoint, GCS) and prepares them for Gemini.
`cfg`	`CanonicalGroundingConfig`	Optional. Configuration for document limits, source types, preparation, and composition.

CanonicalGroundingConfig

Property	Type	Default	Description
`maxDocuments`	`number`	`3`	Maximum number of canonical documents to resolve per request.
`allowedSourceTypes`	`CanonicalSourceType[]`	`["sharepoint", "s3"]`	Only resolve canonical pointers of these source types; pointers of any other type (including `gcs` under the default) are skipped.
`prepareDocumentOptions`	`PrepareDocumentOptions`	`{ forceContentType: "attachment", useFilesApi: "auto" }`	How to prepare documents for Gemini (inline vs Files API, content type forcing).
`compositionPolicy`	`CompositionPolicy`	`undefined`	When set, enables markdown `includes[]` expansion via BFS traversal. See Document Composition.

Creating the service

import {
  GeminiClient,
  CanonicalGroundingService,
  CanonicalDocumentFetcher,
  DocumentAttachment,
} from "@modernpath/agent-framework";

const client = new GeminiClient({ apiKey: process.env.GEMINI_API_KEY! });
const stores = client.fileSearchStores();

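// spAuth, spGraph, and parser are not defined in this snippet;
// construct them elsewhere before this line.
const attachment = new DocumentAttachment(spAuth, spGraph, parser, client);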
const fetcher = new CanonicalDocumentFetcher(attachment);

const grounding = new CanonicalGroundingService(stores, fetcher, {
  maxDocuments: 3,
  allowedSourceTypes: ["s3", "sharepoint"],
  prepareDocumentOptions: {
    forceContentType: "attachment",
    useFilesApi: "auto",
  },
  compositionPolicy: {
    maxDepth: 2,
    maxDocs: 8,
    maxIncludeBytes: 5 * 1024 * 1024,
    timeoutMs: 10_000,
  },
});

Methods

`prepareFromSources()`

Resolve RAG sources to canonical documents, fetch them from ground-truth storage, and optionally expand includes.

async prepareFromSources(
  sources: KnowledgeSource[],
  ctx: { userId: number; auditingId: number },
  overrides?: Partial<CanonicalGroundingConfig>,
): Promise<CanonicalGroundingResult>

Parameter	Type	Description
`sources`	`KnowledgeSource[]`	Sources from `RetrievalService.search()`.
`ctx`	`{ userId, auditingId }`	Authentication context for fetching documents from SharePoint or other authenticated sources.
`overrides`	`Partial<CanonicalGroundingConfig>`	Per-call config overrides (merged with constructor config).
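
How per-call overrides combine with the constructor config can be sketched as below. This is a minimal illustration that assumes a shallow merge, where any key present in `overrides` replaces the constructor value wholesale; the source does not say whether nested objects such as `compositionPolicy` are merged deeply. `SketchConfig`, `DEFAULTS`, and `mergeConfig` are hypothetical names, not part of the framework.

```typescript
// Hypothetical, simplified stand-in for CanonicalGroundingConfig.
interface SketchConfig {
  maxDocuments: number;
  allowedSourceTypes: string[];
  compositionPolicy?: { maxDepth?: number; maxDocs?: number };
}

// Defaults copied from the CanonicalGroundingConfig table.
const DEFAULTS: SketchConfig = {
  maxDocuments: 3,
  allowedSourceTypes: ["sharepoint", "s3"],
};

// Assumption: shallow merge, so later objects win key by key.
function mergeConfig(
  ctorConfig: Partial<SketchConfig> = {},
  overrides: Partial<SketchConfig> = {},
): SketchConfig {
  return { ...DEFAULTS, ...ctorConfig, ...overrides };
}

const effective = mergeConfig(
  { maxDocuments: 3, compositionPolicy: { maxDepth: 2, maxDocs: 8 } },
  { maxDocuments: 5, compositionPolicy: { maxDepth: 1 } },
);
```

Under this shallow-merge assumption, passing `compositionPolicy: { maxDepth: 1 }` replaces the whole policy object, so limits set in the constructor (such as `maxDocs`) would not carry over. If the real service merges nested policies, those limits would be retained instead.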

Processing phases:

flowchart TD
    A["Phase 1: Build store catalog"] --> B["Phase 2: Resolve sources to pointers"]
    B --> C["Phase 3: Fetch canonical documents"]
    C --> D{"Composition\npolicy set?"}
    D -- Yes --> E["Phase 4: Expand includes"]
    D -- No --> F["Return flat result"]
    E --> G["Return composed result"]

Phase 1 -- Build store catalog: Lists all documents in referenced stores (paginated, max 20 per page) and builds an in-memory catalog mapping display names to resource names and metadata. This avoids per-document API calls.
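
Phase 1 can be pictured with the following sketch, which pages through each store and indexes documents by display name. The `StoreDocument` and `DocumentPage` shapes and the `listPage` callback are assumptions made for illustration; the actual File Search Stores client API may look different.

```typescript
// Hypothetical shapes; the real File Search Stores client may differ.
interface StoreDocument {
  name: string; // resource name
  displayName: string;
  customMetadata?: Record<string, string>;
}

interface DocumentPage {
  documents: StoreDocument[];
  nextPageToken?: string;
}

// Hypothetical paging callback standing in for the store client.
type ListPage = (
  store: string,
  pageSize: number,
  pageToken?: string,
) => Promise<DocumentPage>;

const PAGE_SIZE = 20; // page limit stated in the Phase 1 description

async function buildCatalog(
  storeNames: string[],
  listPage: ListPage,
): Promise<Map<string, StoreDocument>> {
  const catalog = new Map<string, StoreDocument>();
  for (const store of storeNames) {
    let pageToken: string | undefined = undefined;
    do {
      const page: DocumentPage = await listPage(store, PAGE_SIZE, pageToken);
      for (const doc of page.documents) {
        // Later lookups go through this map instead of per-document calls.
        catalog.set(doc.displayName, doc);
      }
      pageToken = page.nextPageToken;
    } while (pageToken);
  }
  return catalog;
}
```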

Phase 2 -- Resolve sources to canonical pointers: For each KnowledgeSource, looks up its customMetadata from the catalog (or falls back to individual getDocument() calls), parses the canonical pointer, and filters by allowedSourceTypes. Stops at maxDocuments.
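
The Phase 2 filtering logic might look roughly like this. The sketch assumes pointer fields are stored as flat string key/value pairs in `customMetadata` and omits the `getDocument()` fallback; `SketchPointer`, `parsePointer`, and `resolvePointers` are hypothetical names.

```typescript
type SourceType = "sharepoint" | "s3" | "gcs";

// Hypothetical minimal pointer: only source_type is relied on here.
interface SketchPointer {
  source_type: SourceType;
  [key: string]: unknown;
}

// Assumption: customMetadata carries the pointer as flat string fields.
function parsePointer(
  metadata: Record<string, string> | undefined,
): SketchPointer | undefined {
  if (!metadata) return undefined;
  const type = metadata.source_type;
  if (type !== "sharepoint" && type !== "s3" && type !== "gcs") return undefined;
  return { ...metadata, source_type: type };
}

function resolvePointers(
  sourceNames: string[], // display names taken from the KnowledgeSource list
  catalog: Map<string, Record<string, string>>, // display name -> customMetadata
  allowed: SourceType[],
  maxDocuments: number,
): SketchPointer[] {
  const pointers: SketchPointer[] = [];
  for (const name of sourceNames) {
    if (pointers.length >= maxDocuments) break; // stop at the limit
    const pointer = parsePointer(catalog.get(name));
    // Skip sources with missing metadata or a disallowed source type.
    if (!pointer || allowed.indexOf(pointer.source_type) === -1) continue;
    pointers.push(pointer);
  }
  return pointers;
}
```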

Phase 3 -- Fetch canonical documents: Fetches full documents from the canonical source (S3, SharePoint, GCS) and prepares them as PreparedDocument objects ready for Gemini attachment.

Phase 4 -- Composition expansion (optional): If compositionPolicy is configured, markdown documents are expanded via the CompositionLoader. A shared visited set across all root documents prevents duplicate fetches.
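
The role of the shared visited set in Phase 4 can be illustrated with a simplified, synchronous BFS. This is a sketch under assumptions, not the CompositionLoader itself: `getIncludes` is a hypothetical callback returning a document's `includes[]` URIs, `maxDocs` is treated as a cap on included documents per root, and the `maxIncludeBytes` and `timeoutMs` limits are left out.

```typescript
interface SketchPolicy {
  maxDepth: number;
  maxDocs: number;
}

// Hypothetical loader: returns the includes[] URIs declared by a document.
type GetIncludes = (uri: string) => string[];

function expandIncludes(
  rootUri: string,
  getIncludes: GetIncludes,
  policy: SketchPolicy,
  visited: Set<string>, // shared across all root documents
): string[] {
  const included: string[] = [];
  let frontier: string[] = [rootUri];
  visited.add(rootUri);
  for (let depth = 1; depth <= policy.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const uri of frontier) {
      for (const child of getIncludes(uri)) {
        if (visited.has(child)) continue; // already fetched via another path or root
        if (included.length >= policy.maxDocs) return included;
        visited.add(child);
        included.push(child);
        next.push(child);
      }
    }
    frontier = next; // move one level deeper
  }
  return included;
}
```

Passing the same `Set` to every root's expansion is what keeps a document that two roots both include from being fetched twice.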

Basic usage

// After retrieval:
const answer = await retrieval.search("cold complaint procedure", "hvac-kb");

// Resolve to canonical documents:
const result = await grounding.prepareFromSources(
  answer.sources,
  { userId: 123, auditingId: 456 },
);

console.log(`Resolved ${result.pointers.length} canonical documents`);
console.log(`Total attachments: ${result.allDocumentsForAttachment.length}`);

// Use with Gemini:
const geminiResult = await client.generateContent(prompt, {
  documents: result.allDocumentsForAttachment
    .filter(d => d.geminiDocument)
    .map(d => d.geminiDocument!),
  fileReferences: result.allDocumentsForAttachment
    .filter(d => d.geminiFileReference)
    .map(d => d.geminiFileReference!),
});

With per-call overrides

const result = await grounding.prepareFromSources(
  answer.sources,
  ctx,
  {
    maxDocuments: 5,
    allowedSourceTypes: ["s3"],
    compositionPolicy: { maxDepth: 1 }, // shallow expansion only
  },
);

Types

CanonicalGroundingResult

interface CanonicalGroundingResult {
  /** Resolved canonical pointers (parallel to preparedDocuments). */
  pointers: CanonicalDocumentPointer[];

  /** Fetched and prepared root documents (parallel to pointers). */
  preparedDocuments: PreparedDocument[];

  /**
   * Composition bundles for documents with expanded includes.
   * Indexed parallel to pointers/preparedDocuments.
   * `undefined` entries indicate no composition was performed for that document.
   * Only populated when compositionPolicy is set.
   */
  compositionBundles?: Array<CompositionBundle | undefined>;

  /**
   * All documents for attachment (root docs + included docs), deduplicated
   * by canonical URI, in recommended order (roots first, then included
   * documents in BFS order).
   * When composition is disabled, equals preparedDocuments.
   */
  allDocumentsForAttachment: PreparedDocument[];
}
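
Because `pointers`, `preparedDocuments`, and `compositionBundles` are index-parallel, one way to consume the result is to walk them together by index. The sketch below uses simplified stand-in types; the `fileName` and `included` fields are hypothetical, since the actual fields of `PreparedDocument` and `CompositionBundle` are not listed here.

```typescript
// Simplified, hypothetical stand-ins for the framework types.
interface SketchResult {
  pointers: { source_type: string }[];
  preparedDocuments: { fileName: string }[];
  compositionBundles?: Array<{ included: string[] } | undefined>;
}

function summarize(result: SketchResult): string[] {
  return result.pointers.map((pointer, i) => {
    const doc = result.preparedDocuments[i];
    // compositionBundles is absent when no compositionPolicy is set,
    // and individual entries may be undefined.
    const bundle = result.compositionBundles ? result.compositionBundles[i] : undefined;
    const includes = bundle ? bundle.included.length : 0;
    const name = doc ? doc.fileName : "(missing)";
    return `${pointer.source_type}:${name} (+${includes} includes)`;
  });
}
```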

CanonicalDocumentPointer

A pointer to a document in canonical storage. Stored in Gemini File Search Store customMetadata.

type CanonicalDocumentPointer =
  | SharePointDocumentPointer
  | S3DocumentPointer
  | GcsDocumentPointer;

// S3 example:
interface S3DocumentPointer {
  source_type: "s3";
  s3_bucket: string;
  s3_key: string;
  s3_region?: string;
  s3_endpoint?: string;        // For MinIO/S3-compatible stores
  s3_force_path_style?: boolean;
  display_name?: string;
  mime_type?: string;
}

CanonicalDocumentFetcher

Fetches documents from canonical storage and prepares them for Gemini.

class CanonicalDocumentFetcher {
  constructor(attachment: DocumentAttachment);

  async fetchPreparedDocument(
    pointer: CanonicalDocumentPointer,
    ctx: { userId: number; auditingId: number },
    prepareOptions?: PrepareDocumentOptions,
  ): Promise<PreparedDocument>;
}

Routing by source_type:

Source Type	Fetch Method
`sharepoint`	`DocumentAttachment.downloadAndPrepareDocument()` via Graph API
`s3`	AWS SDK `GetObjectCommand`, then `DocumentAttachment.prepareInlineDocument()`
`gcs`	`@google-cloud/storage` download, then `DocumentAttachment.prepareInlineDocument()`
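
Routing on `source_type` maps naturally onto a TypeScript discriminated union. The sketch below only returns a label for each route rather than performing any download; the `SketchRoutePointer` type is a trimmed-down assumption (SharePoint and GCS fields are omitted) and `describeRoute` is a hypothetical helper.

```typescript
// Trimmed-down, hypothetical pointer union; only the S3 fields are shown.
type SketchRoutePointer =
  | { source_type: "sharepoint" }
  | { source_type: "s3"; s3_bucket: string; s3_key: string }
  | { source_type: "gcs" };

function describeRoute(pointer: SketchRoutePointer): string {
  switch (pointer.source_type) {
    case "sharepoint":
      return "graph-download";
    case "s3":
      // Narrowed to the S3 variant, so bucket and key are available.
      return `s3-get-object:${pointer.s3_bucket}/${pointer.s3_key}`;
    case "gcs":
      return "gcs-download";
    default: {
      // Compile-time exhaustiveness check: fails if a variant is added
      // to the union without a matching case.
      const unreachable: never = pointer;
      throw new Error(`Unsupported pointer: ${JSON.stringify(unreachable)}`);
    }
  }
}
```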

Graceful Degradation

The service is designed for resilient operation:

- Catalog failures are logged but do not abort the pipeline. Individual getDocument() calls are used as fallback.
- Individual document fetch failures are logged and skipped. The remaining documents are still returned.
- Composition failures for individual documents are logged and skipped. The root document is still included.
- Unresolvable sources (missing metadata, unsupported source types) are silently skipped.
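
The "log and skip" behavior for individual fetch failures can be sketched as follows. This is an illustration only: it fetches sequentially, which is an assumption, and `fetchAllSkippingFailures` with its `fetchOne` and `log` callbacks are hypothetical names. Pointers and documents are kept index-parallel, matching the result type.

```typescript
async function fetchAllSkippingFailures<P, D>(
  pointers: P[],
  fetchOne: (pointer: P) => Promise<D>,
  log: (message: string) => void = () => {},
): Promise<{ pointers: P[]; documents: D[] }> {
  const kept: P[] = [];
  const documents: D[] = [];
  for (const pointer of pointers) {
    try {
      // Push the pointer only after the fetch succeeds, so the two
      // arrays stay index-parallel.
      documents.push(await fetchOne(pointer));
      kept.push(pointer);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      log(`fetch failed, skipping document: ${reason}`);
    }
  }
  return { pointers: kept, documents };
}
```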

Debug Logging

Enable with:

DEBUG=canonical-grounding node app.js

Logs include detailed per-phase output: catalog building, pointer resolution, metadata parsing, and composition expansion.

See also:

- Knowledge Base Overview -- full RAG pipeline
- Document Composition -- includes[] expansion
- Grounding Policy -- controlling what reaches the LLM
- Ingestion -- setting up canonical metadata during ingestion
- Retrieval -- generating the KnowledgeSource[] input