
DocumentParser

DocumentParser parses CSV, Excel (.xlsx, .xls), and plain text files into a structured ParsedDocumentData object. It automatically detects file types from content bytes (magic-byte detection), identifies delimiters in CSV files, and analyzes document structure to classify content as tabular data or structured/nested formats.

Import

import { DocumentParser, ParsedDocumentData } from "@modernpath/agent-framework";

Constructor

new DocumentParser()

No configuration required. The parser is stateless and can be reused across multiple documents.

Methods

parseDocument

async parseDocument(
  content: Buffer,
  mimeType: string,
  fileName: string,
): Promise<ParsedDocumentData>

Parses a document from raw bytes into structured data.

| Parameter | Type | Description |
| --- | --- | --- |
| content | Buffer | Raw file content as a Node.js Buffer. |
| mimeType | string | MIME type of the file (e.g. "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"). |
| fileName | string | Original file name including extension (used for type detection and structure analysis). |

Returns: Promise<ParsedDocumentData> -- the parsed document with headers, data rows, summary, and structure analysis.

Throws: Error if the file type is unsupported.
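Because parseDocument rejects on unsupported input, callers that accept arbitrary uploads may prefer to turn the thrown Error into a value they can branch on. The following is a minimal sketch, not part of the library: ParserLike, ParseOutcome, and tryParse are hypothetical names, and the content parameter is typed as Uint8Array (Buffer is a subclass of it) so the snippet compiles without Node type definitions.

```typescript
// Local stand-in for the documented parseDocument signature (assumption:
// only the method shape matters for this wrapper).
interface ParserLike<T> {
  parseDocument(content: Uint8Array, mimeType: string, fileName: string): Promise<T>;
}

// Hypothetical result type: either the parsed value or the error message.
type ParseOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Hypothetical helper: catches the rejection and returns it as data.
async function tryParse<T>(
  parser: ParserLike<T>,
  content: Uint8Array,
  mimeType: string,
  fileName: string,
): Promise<ParseOutcome<T>> {
  try {
    const value = await parser.parseDocument(content, mimeType, fileName);
    return { ok: true, value };
  } catch (err) {
    // The docs only promise an Error; anything else is stringified defensively.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, reason };
  }
}
```

A caller would then check outcome.ok before reading outcome.value, and could log outcome.reason or show it to the user otherwise.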

Supported File Types

| File Type | MIME Types | Extensions | Notes |
| --- | --- | --- | --- |
| CSV | text/csv | .csv | Auto-detects comma (,) and semicolon (;) delimiters |
| Excel | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel | .xlsx, .xls | Multi-sheet support; first sheet is primary |
| Text | text/plain | .txt | Parsed as line-separated data |
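The delimiter auto-detection mentioned for CSV files is not specified in detail here. One common approach, sketched below as an assumption rather than the library's actual algorithm, is to count unquoted commas and semicolons in the first few lines and pick the more frequent one. guessDelimiter and its sampleLines parameter are hypothetical names.

```typescript
// Hypothetical helper, not part of DocumentParser's public API.
// Counts delimiter candidates outside double-quoted fields in a small sample.
function guessDelimiter(text: string, sampleLines = 5): "," | ";" {
  const lines = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .slice(0, sampleLines);

  let commas = 0;
  let semicolons = 0;
  for (const line of lines) {
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
      const ch = line[i];
      if (ch === '"') {
        // Toggle on every quote; an escaped "" toggles twice and cancels out.
        inQuotes = !inQuotes;
      } else if (!inQuotes && ch === ",") {
        commas++;
      } else if (!inQuotes && ch === ";") {
        semicolons++;
      }
    }
  }
  // Ties (including empty input) fall back to the comma.
  return semicolons > commas ? ";" : ",";
}
```

Ignoring quoted characters matters for files such as European exports, where a semicolon-delimited row can contain decimal commas inside values.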

File type is determined by a combination of MIME type, file extension, and content-based magic byte detection. Excel files are detected by ZIP (PK) and OLE signatures. CSV is detected by text-character density heuristics.
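The content-based part of this detection can be illustrated with a short sketch. It checks the well-known ZIP local-file-header and OLE compound-file signatures, then falls back to a printable-byte ratio for text. The function names, the 1024-byte sample size, and the 0.95 threshold are assumptions chosen for illustration; the library's actual heuristics and thresholds are not documented here.

```typescript
type SniffedType = "excel" | "text" | "unknown";

// "PK\x03\x04": ZIP local file header, the container format used by .xlsx.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];
// OLE compound file header, the container format used by legacy .xls.
const OLE_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((b, i) => bytes[i] === b);
}

// Hypothetical sniffer. Note that a ZIP or OLE signature only identifies the
// container, so a real implementation would still need MIME type or extension
// to rule out other ZIP/OLE-based formats.
function sniffType(bytes: Uint8Array, minTextRatio = 0.95): SniffedType {
  if (startsWith(bytes, ZIP_SIGNATURE) || startsWith(bytes, OLE_SIGNATURE)) {
    return "excel";
  }
  if (bytes.length === 0) return "unknown";

  const sample = bytes.subarray(0, 1024);
  let textLike = 0;
  for (let i = 0; i < sample.length; i++) {
    const b = sample[i];
    // Tab, LF, CR, printable ASCII, and bytes >= 0x80 (possible UTF-8) count as text.
    if (b === 9 || b === 10 || b === 13 || (b >= 32 && b !== 127)) textLike++;
  }
  return textLike / sample.length >= minTextRatio ? "text" : "unknown";
}
```

Since a Node.js Buffer is a Uint8Array, the result of fs.readFileSync can be passed to sniffType directly.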

convertToJSONString

convertToJSONString(parsedData: ParsedDocumentData): string

Converts parsed data to a JSON array of objects, using headers as keys. Handles duplicate header names by appending a suffix. Useful for passing structured data to an LLM.

| Parameter | Type | Description |
| --- | --- | --- |
| parsedData | ParsedDocumentData | Previously parsed document data. |

Returns: string -- JSON-formatted string of row objects.
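To make the conversion concrete, here is a rough sketch of how rows could be turned into keyed objects with duplicate headers disambiguated. It is an illustration only: TableLike, uniqueKeys, and toJsonString are hypothetical names, the "_2", "_3" suffix scheme is an assumption (the exact suffix the library appends is not stated), and filling missing cells with null is likewise a choice made for this example.

```typescript
// Minimal subset of ParsedDocumentData needed for this sketch.
interface TableLike {
  headers?: string[];
  data?: unknown[][];
}

// Assumed suffix scheme: "Name", "Name_2", "Name_3", ...
function uniqueKeys(headers: string[]): string[] {
  const used = new Set<string>();
  return headers.map((header) => {
    let key = header;
    // Keep bumping the suffix until the key is unused.
    for (let n = 2; used.has(key); n++) key = `${header}_${n}`;
    used.add(key);
    return key;
  });
}

function toJsonString(parsed: TableLike): string {
  const headers = parsed.headers ?? [];
  const keys = uniqueKeys(headers);
  // Per the ParsedDocumentData field descriptions, data[0] is the header row
  // when headers is populated, so it is skipped here.
  const rows = headers.length > 0 ? (parsed.data ?? []).slice(1) : [];
  const objects = rows.map((row) => {
    const obj: Record<string, unknown> = {};
    keys.forEach((key, i) => {
      obj[key] = row[i] ?? null; // short rows are padded with null
    });
    return obj;
  });
  return JSON.stringify(objects, null, 2);
}
```

Without the suffixing step, a second "Name" column would silently overwrite the first when both are written to the same object key.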

ParsedDocumentData

interface ParsedDocumentData {
  type: "csv" | "excel" | "text" | "unknown";
  sheets?: { name: string; data: any[][] }[];
  data?: any[][];
  headers?: string[];
  rowCount: number;
  columnCount: number;
  summary: string;
  structureType?: "data-rows" | "structured";
  structureConfidence?: number;
  contentDescription?: string;
}

| Field | Type | Description |
| --- | --- | --- |
| type | string | Detected document type. |
| sheets | array | For Excel files, array of { name, data } per sheet. |
| data | any[][] | Two-dimensional array of cell values. First row is headers if headers is populated. |
| headers | string[] | Column header names. Empty columns are named Column1, Column2, etc. |
| rowCount | number | Number of data rows (excludes header row). |
| columnCount | number | Number of columns. |
| summary | string | Human-readable summary including row/column counts and sample data. |
| structureType | string | Heuristic classification: "data-rows" for tabular data suitable for SQL queries, or "structured" for nested/hierarchical formats (e.g. financial statements). |
| structureConfidence | number | Confidence score (0 to 1) for the structure classification. |
| contentDescription | string | Description of column names, row counts, and suitability for querying. |
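Since structureType is a heuristic with an accompanying confidence score, downstream code may want to gate SQL-style processing on both fields. The sketch below shows one possible policy; chooseRoute, the Route labels, and the 0.7 default threshold are all hypothetical and not defined by the library.

```typescript
// Minimal subset of ParsedDocumentData needed for this sketch.
interface StructureInfo {
  structureType?: "data-rows" | "structured";
  structureConfidence?: number;
}

// Hypothetical downstream destinations.
type Route = "sql" | "llm-context";

// Treat a document as queryable only when it is classified as "data-rows"
// with sufficient confidence; everything else is passed along as raw context.
function chooseRoute(doc: StructureInfo, minConfidence = 0.7): Route {
  const confident = (doc.structureConfidence ?? 0) >= minConfidence;
  return doc.structureType === "data-rows" && confident ? "sql" : "llm-context";
}
```

Both fields are optional on the interface, so a missing classification or score falls through to the conservative route.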

Code Example

import { DocumentParser } from "@modernpath/agent-framework";
import * as fs from "node:fs";

const parser = new DocumentParser();

// Parse a CSV file
const csvBuffer = fs.readFileSync("./data/employees.csv");
const csvResult = await parser.parseDocument(csvBuffer, "text/csv", "employees.csv");

console.log(csvResult.summary);
// "CSV file with 150 rows and 5 columns. Columns: Name, Department, Salary, Start Date, Role"
console.log(csvResult.structureType); // "data-rows"
console.log(csvResult.headers);       // ["Name", "Department", "Salary", "Start Date", "Role"]

// Parse an Excel file
const xlsxBuffer = fs.readFileSync("./data/financials.xlsx");
const xlsxResult = await parser.parseDocument(
  xlsxBuffer,
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "financials.xlsx",
);

console.log(xlsxResult.sheets?.length); // Number of sheets
console.log(xlsxResult.structureType);  // "structured" for financial statements

// Convert to JSON for LLM consumption
const jsonString = parser.convertToJSONString(csvResult);
console.log(jsonString);
// [{ "Name": "Alice", "Department": "Engineering", ... }, ...]