CanonicalGroundingService

CanonicalGroundingService bridges the gap between RAG search results (which contain short text chunks) and full ground-truth documents stored in external systems (S3, SharePoint, GCS). It resolves KnowledgeSource references from retrieval back to their canonical source, fetches the complete documents, and optionally expands markdown documents through their includes[] frontmatter via the CompositionLoader.

Import

import { CanonicalGroundingService } from "@modernpath/agent-framework";

Constructor

constructor(
  fileSearchStores: GeminiFileSearchStores,
  fetcher: CanonicalDocumentFetcher,
  cfg?: CanonicalGroundingConfig,
)

| Parameter | Type | Description |
| --- | --- | --- |
| fileSearchStores | `GeminiFileSearchStores` | File Search Stores client for resolving document metadata. |
| fetcher | `CanonicalDocumentFetcher` | Fetches documents from canonical storage (S3, SharePoint, GCS) and prepares them for Gemini. |
| cfg | `CanonicalGroundingConfig` | Optional. Configuration for document limits, source types, preparation, and composition. |

CanonicalGroundingConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| maxDocuments | `number` | `3` | Maximum number of canonical documents to resolve per request. |
| allowedSourceTypes | `CanonicalSourceType[]` | `["sharepoint", "s3"]` | Only resolve canonical pointers of these source types. |
| prepareDocumentOptions | `PrepareDocumentOptions` | `{ forceContentType: "attachment", useFilesApi: "auto" }` | How to prepare documents for Gemini (inline vs Files API, content type forcing). |
| compositionPolicy | `CompositionPolicy` | `undefined` | When set, enables markdown includes[] expansion via BFS traversal. See Document Composition. |

Creating the service

import {
  GeminiClient,
  CanonicalGroundingService,
  CanonicalDocumentFetcher,
  DocumentAttachment,
} from "@modernpath/agent-framework";

const client = new GeminiClient({ apiKey: process.env.GEMINI_API_KEY! });
const stores = client.fileSearchStores();

// spAuth, spGraph, and parser are assumed to be created elsewhere in your application.
const attachment = new DocumentAttachment(spAuth, spGraph, parser, client);
const fetcher = new CanonicalDocumentFetcher(attachment);

const grounding = new CanonicalGroundingService(stores, fetcher, {
  maxDocuments: 3,
  allowedSourceTypes: ["s3", "sharepoint"],
  prepareDocumentOptions: {
    forceContentType: "attachment",
    useFilesApi: "auto",
  },
  compositionPolicy: {
    maxDepth: 2,
    maxDocs: 8,
    maxIncludeBytes: 5 * 1024 * 1024,
    timeoutMs: 10_000,
  },
});

Methods

prepareFromSources()

Resolve RAG sources to canonical documents, fetch them from ground-truth storage, and optionally expand includes.

async prepareFromSources(
  sources: KnowledgeSource[],
  ctx: { userId: number; auditingId: number },
  overrides?: Partial<CanonicalGroundingConfig>,
): Promise<CanonicalGroundingResult>

| Parameter | Type | Description |
| --- | --- | --- |
| sources | `KnowledgeSource[]` | Sources from RetrievalService.search(). |
| ctx | `{ userId, auditingId }` | Authentication context for fetching documents from SharePoint or other authenticated sources. |
| overrides | `Partial<CanonicalGroundingConfig>` | Per-call config overrides (merged with constructor config). |

Processing phases:

flowchart TD
    A["Phase 1: Build store catalog"] --> B["Phase 2: Resolve sources to pointers"]
    B --> C["Phase 3: Fetch canonical documents"]
    C --> D{"Composition\npolicy set?"}
    D -- Yes --> E["Phase 4: Expand includes"]
    D -- No --> F["Return flat result"]
    E --> G["Return composed result"]

Phase 1 -- Build store catalog: Lists all documents in referenced stores (paginated, max 20 per page) and builds an in-memory catalog mapping display names to resource names and metadata. This avoids per-document API calls.
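
The catalog step can be pictured roughly as follows. This is a minimal sketch and not the framework's implementation: `StoreDocument`, `DocumentPage`, `ListPageFn`, and `buildCatalog` are hypothetical names, and the paging contract (a page token in, an optional `nextPageToken` out) is an assumption.

```typescript
// Hypothetical shapes for illustration only, not the framework's actual types.
interface StoreDocument {
  name: string; // resource name
  displayName: string;
  customMetadata: Record<string, string>;
}

interface DocumentPage {
  documents: StoreDocument[];
  nextPageToken?: string;
}

// Assumed paging contract: pass the previous page's token to get the next page.
type ListPageFn = (
  store: string,
  pageSize: number,
  pageToken?: string,
) => Promise<DocumentPage>;

const PAGE_SIZE = 20; // the per-page maximum mentioned above

// Walk every page of every referenced store once and index documents by display name.
// If two stores share a display name, the later one wins in this sketch.
async function buildCatalog(
  stores: string[],
  listPage: ListPageFn,
): Promise<Map<string, StoreDocument>> {
  const catalog = new Map<string, StoreDocument>();
  for (const store of stores) {
    let pageToken: string | undefined = undefined;
    do {
      const page: DocumentPage = await listPage(store, PAGE_SIZE, pageToken);
      for (const doc of page.documents) {
        catalog.set(doc.displayName, doc);
      }
      pageToken = page.nextPageToken;
    } while (pageToken !== undefined);
  }
  return catalog;
}
```

With a catalog like this, each later lookup is a local `Map.get()` rather than a network round trip.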

Phase 2 -- Resolve sources to canonical pointers: For each KnowledgeSource, looks up its customMetadata from the catalog (or falls back to individual getDocument() calls), parses the canonical pointer, and filters by allowedSourceTypes. Stops at maxDocuments.
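
The filter-and-limit logic of this phase might look like the sketch below. It assumes the pointer is stored as flat string fields in `customMetadata` and leaves out the `getDocument()` fallback; `PointerLike`, `SourceLike`, `parsePointer`, and `resolvePointers` are hypothetical names introduced for illustration.

```typescript
// Hypothetical, simplified shapes for illustration.
type SourceTypeName = "sharepoint" | "s3" | "gcs";

interface PointerLike {
  source_type: SourceTypeName;
  [key: string]: unknown;
}

interface SourceLike {
  displayName: string;
}

// Assumption: the pointer lives in customMetadata as flat string fields.
function parsePointer(
  metadata: Record<string, string> | undefined,
): PointerLike | undefined {
  if (!metadata) return undefined;
  const type = metadata["source_type"];
  if (type !== "sharepoint" && type !== "s3" && type !== "gcs") return undefined;
  return { ...metadata, source_type: type };
}

function resolvePointers(
  sources: SourceLike[],
  catalog: Map<string, Record<string, string>>,
  allowedSourceTypes: SourceTypeName[],
  maxDocuments: number,
): PointerLike[] {
  const pointers: PointerLike[] = [];
  for (const source of sources) {
    if (pointers.length >= maxDocuments) break; // stop once the limit is reached
    const pointer = parsePointer(catalog.get(source.displayName));
    if (!pointer) continue; // missing or unparsable metadata
    if (!allowedSourceTypes.includes(pointer.source_type)) continue; // filtered out
    pointers.push(pointer);
  }
  return pointers;
}
```

Sources that fail any check are skipped rather than raising an error, so the limit counts only pointers that were actually resolved.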

Phase 3 -- Fetch canonical documents: Fetches full documents from the canonical source (S3, SharePoint, GCS) and prepares them as PreparedDocument objects ready for Gemini attachment.

Phase 4 -- Composition expansion (optional): If compositionPolicy is configured, markdown documents are expanded via the CompositionLoader. A shared visited set across all root documents prevents duplicate fetches.
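
To illustrate how a shared visited set avoids duplicate fetches across roots, here is a minimal breadth-first sketch. It is not the CompositionLoader itself: `expandIncludes` and `GetIncludes` are hypothetical, documents are identified by plain URI strings, the lookup is synchronous for brevity (a real loader would fetch asynchronously), and treating `maxDocs` as a per-root cap on included documents is an assumption.

```typescript
// Subset of the policy fields shown in the configuration example above.
interface PolicySketch {
  maxDepth: number;
  maxDocs: number;
}

// Assumed lookup: returns the URIs listed in a document's includes[] frontmatter.
type GetIncludes = (uri: string) => string[];

// Breadth-first expansion of one root. `visited` is shared by the caller across
// all roots, so a document reached from two roots is only collected once.
function expandIncludes(
  root: string,
  getIncludes: GetIncludes,
  policy: PolicySketch,
  visited: Set<string>,
): string[] {
  const included: string[] = [];
  visited.add(root);
  let frontier: string[] = [root];
  for (let depth = 1; depth <= policy.maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const uri of frontier) {
      for (const child of getIncludes(uri)) {
        if (visited.has(child)) continue; // already collected elsewhere
        if (included.length >= policy.maxDocs) return included; // cap reached
        visited.add(child);
        included.push(child);
        next.push(child);
      }
    }
    frontier = next;
  }
  return included;
}
```

Calling this once per root with the same `Set` instance is what keeps an include shared by several roots from being loaded twice.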

Basic usage

// After retrieval:
const answer = await retrieval.search("cold complaint procedure", "hvac-kb");

// Resolve to canonical documents:
const result = await grounding.prepareFromSources(
  answer.sources,
  { userId: 123, auditingId: 456 },
);

console.log(`Resolved ${result.pointers.length} canonical documents`);
console.log(`Total attachments: ${result.allDocumentsForAttachment.length}`);

// Use with Gemini:
const geminiResult = await client.generateContent(prompt, {
  documents: result.allDocumentsForAttachment
    .filter(d => d.geminiDocument)
    .map(d => d.geminiDocument!),
  fileReferences: result.allDocumentsForAttachment
    .filter(d => d.geminiFileReference)
    .map(d => d.geminiFileReference!),
});

With per-call overrides

const result = await grounding.prepareFromSources(
  answer.sources,
  ctx,
  {
    maxDocuments: 5,
    allowedSourceTypes: ["s3"],
    compositionPolicy: { maxDepth: 1 }, // shallow expansion only
  },
);
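
The source does not say whether overrides are merged shallowly or deeply. The sketch below shows what a shallow merge would do, as one possible reading: `ConfigSketch`, `DOCUMENTED_DEFAULTS`, and `mergeConfig` are hypothetical, and the default values are taken from the configuration table above.

```typescript
// Hypothetical, trimmed-down config shape for illustration.
interface CompositionPolicySketch {
  maxDepth?: number;
  maxDocs?: number;
}

interface ConfigSketch {
  maxDocuments: number;
  allowedSourceTypes: string[];
  compositionPolicy?: CompositionPolicySketch;
}

// Defaults as listed in the CanonicalGroundingConfig table.
const DOCUMENTED_DEFAULTS: ConfigSketch = {
  maxDocuments: 3,
  allowedSourceTypes: ["sharepoint", "s3"],
};

// Shallow merge: later objects win property by property.
// Nested objects such as compositionPolicy are replaced whole, not merged.
function mergeConfig(
  ctorCfg: Partial<ConfigSketch> = {},
  overrides: Partial<ConfigSketch> = {},
): ConfigSketch {
  return { ...DOCUMENTED_DEFAULTS, ...ctorCfg, ...overrides };
}
```

Under a shallow merge, an override of `{ maxDepth: 1 }` would drop the constructor's other policy limits. If that matters for your setup, passing the full policy object in the override is the safer choice until the framework's actual merge behavior is confirmed.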

Types

CanonicalGroundingResult

interface CanonicalGroundingResult {
  /** Resolved canonical pointers (parallel to preparedDocuments). */
  pointers: CanonicalDocumentPointer[];

  /** Fetched and prepared root documents (parallel to pointers). */
  preparedDocuments: PreparedDocument[];

  /**
   * Composition bundles for documents with expanded includes.
   * Indexed parallel to pointers/preparedDocuments.
   * `undefined` entries indicate no composition was performed for that document.
   * Only populated when compositionPolicy is set.
   */
  compositionBundles?: Array<CompositionBundle | undefined>;

  /**
   * All documents for attachment (root docs + included docs), deduplicated
   * by canonical URI, in recommended order (roots first, then included
   * documents in BFS order).
   * When composition is disabled, equals preparedDocuments.
   */
  allDocumentsForAttachment: PreparedDocument[];
}
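
The ordering and deduplication described for `allDocumentsForAttachment` could be produced along these lines. This is a sketch under assumptions: the `uri` field on a document and the `included` field on a bundle are hypothetical stand-ins, since the real `PreparedDocument` and `CompositionBundle` shapes are not shown here.

```typescript
// Hypothetical stand-ins for PreparedDocument and CompositionBundle.
interface DocSketch {
  uri: string; // canonical URI used as the deduplication key
}

interface BundleSketch {
  included: DocSketch[]; // assumed to already be in BFS order
}

// Roots first, then included documents, keeping the first occurrence of each URI.
function flattenForAttachment(
  roots: DocSketch[],
  bundles: Array<BundleSketch | undefined> = [],
): DocSketch[] {
  const seen = new Set<string>();
  const out: DocSketch[] = [];
  const pushOnce = (doc: DocSketch): void => {
    if (seen.has(doc.uri)) return;
    seen.add(doc.uri);
    out.push(doc);
  };
  roots.forEach((doc) => pushOnce(doc));
  bundles.forEach((bundle) => bundle?.included.forEach((doc) => pushOnce(doc)));
  return out;
}
```

When no bundles are passed, the result contains exactly the root documents, which matches the documented behavior with composition disabled.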

CanonicalDocumentPointer

A pointer to a document in canonical storage. Stored in Gemini File Search Store customMetadata.

type CanonicalDocumentPointer =
  | SharePointDocumentPointer
  | S3DocumentPointer
  | GcsDocumentPointer;

// S3 example:
interface S3DocumentPointer {
  source_type: "s3";
  s3_bucket: string;
  s3_key: string;
  s3_region?: string;
  s3_endpoint?: string;        // For MinIO/S3-compatible stores
  s3_force_path_style?: boolean;
  display_name?: string;
  mime_type?: string;
}
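
Because pointers are read back from stored metadata, it can help to validate them at runtime before use. The type guard below is a sketch, not part of the framework: `isS3Pointer` and `S3PointerSketch` are hypothetical, and the guard checks only the required fields from the interface above.

```typescript
// Local copy of the required S3 pointer fields, for a self-contained example.
interface S3PointerSketch {
  source_type: "s3";
  s3_bucket: string;
  s3_key: string;
  s3_region?: string;
}

// Narrow an unknown value (for example, parsed metadata) to an S3 pointer.
function isS3Pointer(value: unknown): value is S3PointerSketch {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const bucket = record["s3_bucket"];
  const key = record["s3_key"];
  return (
    record["source_type"] === "s3" &&
    typeof bucket === "string" && bucket.length > 0 &&
    typeof key === "string" && key.length > 0
  );
}
```

Inside an `if (isS3Pointer(x))` branch, TypeScript treats `x.s3_bucket` and `x.s3_key` as strings without further casts.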

CanonicalDocumentFetcher

Fetches documents from canonical storage and prepares them for Gemini.

class CanonicalDocumentFetcher {
  constructor(attachment: DocumentAttachment);

  async fetchPreparedDocument(
    pointer: CanonicalDocumentPointer,
    ctx: { userId: number; auditingId: number },
    prepareOptions?: PrepareDocumentOptions,
  ): Promise<PreparedDocument>;
}

Routing by source_type:

| Source Type | Fetch Method |
| --- | --- |
| sharepoint | DocumentAttachment.downloadAndPrepareDocument() via Graph API |
| s3 | AWS SDK GetObjectCommand, then DocumentAttachment.prepareInlineDocument() |
| gcs | @google-cloud/storage download, then DocumentAttachment.prepareInlineDocument() |
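
Routing on a discriminated union like this is commonly written as an exhaustive `switch`. The sketch below shows only the dispatch pattern: `RoutablePointer`, `Route`, and `routeFor` are hypothetical, only `source_type` is modelled, and the returned labels stand in for the real fetch calls listed above.

```typescript
// Only source_type is modelled here; the real pointer types carry more fields.
type RoutablePointer =
  | { source_type: "sharepoint" }
  | { source_type: "s3" }
  | { source_type: "gcs" };

// Labels standing in for the real fetch paths.
type Route = "graph-api" | "aws-sdk" | "gcs-client";

function routeFor(pointer: RoutablePointer): Route {
  switch (pointer.source_type) {
    case "sharepoint":
      return "graph-api";
    case "s3":
      return "aws-sdk";
    case "gcs":
      return "gcs-client";
    default: {
      // Compile-time exhaustiveness check: adding a new source_type without
      // a case above makes this assignment a type error.
      const unreachable: never = pointer;
      throw new Error(`Unsupported source_type: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` assignment means a newly added pointer variant cannot be forgotten silently, while the `throw` still covers malformed data at runtime.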

Graceful Degradation

The service is designed for resilient operation:

  • Catalog failures are logged but do not abort the pipeline. Individual getDocument() calls are used as fallback.
  • Individual document fetch failures are logged and skipped. The remaining documents are still returned.
  • Composition failures for individual documents are logged and skipped. The root document is still included.
  • Unresolvable sources (missing metadata, unsupported source types) are silently skipped.
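
The "log and skip" behavior for individual fetch failures can be sketched as below. This is an illustration rather than the service's code: `fetchAllTolerant` is a hypothetical helper, the fetch function is injected, and fetching sequentially is an assumption made to keep the example short.

```typescript
// Fetch each pointer in turn; a failure is logged and that pointer is dropped,
// so the returned pointer/document pairs stay aligned with each other.
async function fetchAllTolerant<P, D>(
  pointers: P[],
  fetchOne: (pointer: P) => Promise<D>,
  log: (message: string) => void = console.warn,
): Promise<Array<{ pointer: P; document: D }>> {
  const results: Array<{ pointer: P; document: D }> = [];
  for (const pointer of pointers) {
    try {
      const document = await fetchOne(pointer);
      results.push({ pointer, document });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      log(`Document fetch failed, skipping: ${reason}`);
    }
  }
  return results;
}
```

Returning pairs instead of two separate arrays makes it hard to break the parallel indexing that `pointers` and `preparedDocuments` rely on.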

Debug Logging

Enable with:

DEBUG=canonical-grounding node app.js

Logs include detailed per-phase output: catalog building, pointer resolution, metadata parsing, and composition expansion.