DocumentParser

DocumentParser parses CSV, Excel (.xlsx, .xls), and plain text files into a structured ParsedDocumentData object. It detects the file type from the MIME type, the file extension, and the content bytes (magic-byte detection), identifies the delimiter in CSV files, and analyzes document structure to classify content as tabular data rows or a structured/nested format.

Import

import { DocumentParser, ParsedDocumentData } from "@modernpath/agent-framework";

Constructor

new DocumentParser()

No configuration required. The parser is stateless and can be reused across multiple documents.

Methods

parseDocument

async parseDocument(
  content: Buffer,
  mimeType: string,
  fileName: string,
): Promise<ParsedDocumentData>

Parses a document from raw bytes into structured data.

Parameter	Type	Description
`content`	`Buffer`	Raw file content as a Node.js Buffer.
`mimeType`	`string`	MIME type of the file (e.g. `"text/csv"`, `"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"`).
`fileName`	`string`	Original file name including extension (used for type detection and structure analysis).

Returns: Promise<ParsedDocumentData> -- the parsed document with headers, data rows, summary, and structure analysis.

Throws: Error if the file type is unsupported.
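Because unsupported files raise an Error, callers may want a cheap check before reading or uploading a file. The sketch below is one way to do that, assuming the MIME types and extensions listed in the Supported File Types table. `looksSupported` is a hypothetical helper, not part of DocumentParser, and the parser's own content-based detection still has the final say.

```typescript
// Hypothetical pre-flight check; not part of DocumentParser's API.
// The values mirror the Supported File Types table in this page.
const SUPPORTED_MIME_TYPES = new Set<string>([
  "text/csv",
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "application/vnd.ms-excel",
  "text/plain",
]);

const SUPPORTED_EXTENSIONS = new Set<string>([".csv", ".xlsx", ".xls", ".txt"]);

function looksSupported(mimeType: string, fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  const extension = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  // Assumption: either signal is enough to attempt a parse.
  return (
    SUPPORTED_MIME_TYPES.has(mimeType.toLowerCase()) ||
    SUPPORTED_EXTENSIONS.has(extension)
  );
}
```

A caller could skip files for which `looksSupported` returns false and still wrap `parseDocument` in a try/catch for files that pass the check but fail content detection.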

Supported File Types

File Type	MIME Types	Extensions	Notes
CSV	`text/csv`	`.csv`	Auto-detects `,` and `;` delimiters
Excel	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`, `application/vnd.ms-excel`	`.xlsx`, `.xls`	Multi-sheet support; first sheet is primary
Text	`text/plain`	`.txt`	Parsed as line-separated data
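The CSV row above mentions automatic detection of `,` and `;` delimiters. The parser's actual algorithm is not documented here; the following is a minimal sketch of one common approach, counting each candidate outside quoted fields over the first few lines. `guessDelimiter` and its `maxLines` parameter are hypothetical names used only for illustration.

```typescript
// Illustrative only: guess whether a CSV sample uses "," or ";".
// Hypothetical helper, not DocumentParser's actual implementation.
function guessDelimiter(sample: string, maxLines = 10): "," | ";" {
  const lines = sample
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .slice(0, maxLines);

  let commas = 0;
  let semicolons = 0;
  for (const line of lines) {
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
      const ch = line[i];
      if (ch === '"') {
        inQuotes = !inQuotes; // delimiters inside quoted fields are ignored
      } else if (!inQuotes && ch === ",") {
        commas++;
      } else if (!inQuotes && ch === ";") {
        semicolons++;
      }
    }
  }
  // Ties (including no delimiters at all) fall back to a comma.
  return semicolons > commas ? ";" : ",";
}
```

Raw counting can be misled by unquoted decimal commas in semicolon-delimited files, so a production detector would likely also check that the field count is consistent across lines.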

File type is determined by a combination of MIME type, file extension, and content-based magic byte detection. Excel files are detected by ZIP (PK) and OLE signatures. CSV is detected by text-character density heuristics.
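As a rough illustration of the content-based step, the sketch below checks for the ZIP local-file-header signature (`50 4B 03 04`, used by .xlsx) and the OLE compound-file signature (`D0 CF 11 E0 A1 B1 1A E1`, used by legacy .xls), then falls back to a text-density check. The function names, the 512-byte sample size, and the 95% threshold are assumptions made for this example and are not taken from the library.

```typescript
// Illustrative sketch of content-based type sniffing; the parser's real
// heuristics and thresholds may differ.
type SniffedType = "excel" | "text" | "unknown";

const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const OLE_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

function startsWith(content: Uint8Array, signature: number[]): boolean {
  if (content.length < signature.length) return false;
  return signature.every((byte, i) => content[i] === byte);
}

function sniffType(content: Uint8Array, sampleSize = 512): SniffedType {
  if (startsWith(content, ZIP_SIGNATURE) || startsWith(content, OLE_SIGNATURE)) {
    return "excel";
  }
  const sample = content.subarray(0, sampleSize);
  if (sample.length === 0) return "unknown";

  let textLike = 0;
  for (let i = 0; i < sample.length; i++) {
    const byte = sample[i];
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    const isPrintableAscii = byte >= 0x20 && byte <= 0x7e;
    const isHighByte = byte >= 0x80; // possibly part of a UTF-8 sequence
    if (isWhitespace || isPrintableAscii || isHighByte) textLike++;
  }
  // Assumed threshold: treat the sample as text if 95% of bytes look textual.
  return textLike / sample.length >= 0.95 ? "text" : "unknown";
}
```

A Node.js Buffer can be passed directly, since Buffer is a Uint8Array subclass. The ZIP signature only shows that the file is a ZIP container, which is presumably why the MIME type and extension are considered as well.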

convertToJSONString

convertToJSONString(parsedData: ParsedDocumentData): string

Converts parsed data to a JSON array of objects, using headers as keys. Handles duplicate header names by appending a suffix. Useful for passing structured data to an LLM.

Parameter	Type	Description
`parsedData`	`ParsedDocumentData`	Previously parsed document data.

Returns: string -- JSON-formatted string of row objects.
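To make the header-to-key mapping concrete, here is a minimal sketch of the general technique: de-duplicate the headers, then zip each row with the resulting keys. The `_2`, `_3` suffix scheme, the null padding for short rows, and the function names are assumptions for illustration; the suffix format used by `convertToJSONString` itself is not specified here.

```typescript
// Illustrative sketch, not the library's implementation.
// Assumption: repeated headers get "_2", "_3", ... appended.
function dedupeHeaders(headers: string[]): string[] {
  const used = new Set<string>();
  return headers.map((header) => {
    let candidate = header;
    let n = 2;
    // Keep bumping the suffix until the key is unique.
    while (used.has(candidate)) {
      candidate = `${header}_${n}`;
      n++;
    }
    used.add(candidate);
    return candidate;
  });
}

function rowsToJSONString(headers: string[], rows: unknown[][]): string {
  const keys = dedupeHeaders(headers);
  const objects = rows.map((row) => {
    const record: Record<string, unknown> = {};
    keys.forEach((key, i) => {
      record[key] = row[i] ?? null; // pad short rows with null
    });
    return record;
  });
  return JSON.stringify(objects, null, 2);
}
```

Without de-duplication, a later column with the same header would silently overwrite the earlier one in each row object.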

ParsedDocumentData

interface ParsedDocumentData {
  type: "csv" | "excel" | "text" | "unknown";
  sheets?: { name: string; data: any[][] }[];
  data?: any[][];
  headers?: string[];
  rowCount: number;
  columnCount: number;
  summary: string;
  structureType?: "data-rows" | "structured";
  structureConfidence?: number;
  contentDescription?: string;
}

Field	Type	Description
`type`	`string`	Detected document type.
`sheets`	`array`	For Excel files, array of `{ name, data }` per sheet.
`data`	`any[][]`	Two-dimensional array of cell values. The first row holds the headers when `headers` is populated.
`headers`	`string[]`	Column header names. Columns with an empty header are named `Column1`, `Column2`, etc.
`rowCount`	`number`	Number of data rows (excludes header row).
`columnCount`	`number`	Number of columns.
`summary`	`string`	Human-readable summary including row/column counts and sample data.
`structureType`	`string`	Heuristic classification: `"data-rows"` for tabular data suitable for SQL queries, or `"structured"` for nested/hierarchical formats (e.g. financial statements).
`structureConfidence`	`number`	Confidence score (0--1) for the structure classification.
`contentDescription`	`string`	Description of column names, row counts, and suitability for querying.
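Since `structureType` and `structureConfidence` are optional heuristics, downstream code usually needs a fallback when they are missing or weak. The sketch below shows one possible routing decision, using a local subset of the interface shown above. The `chooseRoute` helper, the route names, and the 0.7 threshold are hypothetical choices for this example, not library defaults.

```typescript
// Local subset of the ParsedDocumentData interface shown above.
interface StructureInfo {
  structureType?: "data-rows" | "structured";
  structureConfidence?: number;
  rowCount: number;
}

// Hypothetical route names for this example.
type Route = "sql-query" | "full-text";

// Hypothetical helper; 0.7 is an assumed threshold, not a library default.
function chooseRoute(doc: StructureInfo, minConfidence = 0.7): Route {
  // A missing confidence score is treated as zero confidence.
  const confident = (doc.structureConfidence ?? 0) >= minConfidence;
  const tabular = doc.structureType === "data-rows" && doc.rowCount > 0;
  return tabular && confident ? "sql-query" : "full-text";
}
```

Treating a missing score as zero keeps the behavior conservative: a document is only sent down the SQL path when the parser has positively classified it as tabular.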

Code Example

import { DocumentParser } from "@modernpath/agent-framework";
import * as fs from "node:fs";

const parser = new DocumentParser();

// Parse a CSV file
const csvBuffer = fs.readFileSync("./data/employees.csv");
const csvResult = await parser.parseDocument(csvBuffer, "text/csv", "employees.csv");

console.log(csvResult.summary);
// "CSV file with 150 rows and 5 columns. Columns: Name, Department, Salary, Start Date, Role"
console.log(csvResult.structureType); // "data-rows"
console.log(csvResult.headers);       // ["Name", "Department", "Salary", "Start Date", "Role"]

// Parse an Excel file
const xlsxBuffer = fs.readFileSync("./data/financials.xlsx");
const xlsxResult = await parser.parseDocument(
  xlsxBuffer,
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "financials.xlsx",
);

console.log(xlsxResult.sheets?.length); // Number of sheets
console.log(xlsxResult.structureType);  // "structured" for financial statements

// Convert to JSON for LLM consumption
const jsonString = parser.convertToJSONString(csvResult);
console.log(jsonString);
// [{ "Name": "Alice", "Department": "Engineering", ... }, ...]

DocumentQuery -- SQL-based querying of parsed documents
DocumentAttachment -- attaching parsed documents to LLM prompts
SharePoint Integration -- downloading documents for parsing