DocumentParser¶
DocumentParser parses CSV, Excel (.xlsx, .xls), and plain text files into a structured ParsedDocumentData object. It automatically detects file types from content bytes (magic-byte detection), identifies delimiters in CSV files, and analyzes document structure to classify content as tabular data or structured/nested formats.
Import¶
import { DocumentParser } from "@modernpath/agent-framework";
Constructor¶
const parser = new DocumentParser();
No configuration required. The parser is stateless and can be reused across multiple documents.
Methods¶
parseDocument¶
async parseDocument(
  content: Buffer,
  mimeType: string,
  fileName: string,
): Promise<ParsedDocumentData>
Parses a document from raw bytes into structured data.
| Parameter | Type | Description |
|---|---|---|
| content | Buffer | Raw file content as a Node.js Buffer. |
| mimeType | string | MIME type of the file (e.g. "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"). |
| fileName | string | Original file name including extension (used for type detection and structure analysis). |
Returns: Promise<ParsedDocumentData> -- the parsed document with headers, data rows, summary, and structure analysis.
Throws: Error if the file type is unsupported.
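Because parseDocument rejects for unsupported file types, callers that process user uploads may want to turn that rejection into a value. The following is a minimal sketch, not part of the library: ParseOutcome and tryParse are hypothetical names, and the helper works with any async function.

```typescript
// Hypothetical helper (not part of DocumentParser): wraps any async parse
// call so that a thrown Error becomes a plain value the caller can inspect.
type ParseOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

async function tryParse<T>(parse: () => Promise<T>): Promise<ParseOutcome<T>> {
  try {
    return { ok: true, value: await parse() };
  } catch (err) {
    // parseDocument is documented to throw an Error for unsupported types;
    // anything else that is thrown is stringified defensively.
    const reason = err instanceof Error ? err.message : String(err);
    return { ok: false, reason };
  }
}
```

With a parser instance, this could be called as tryParse(() => parser.parseDocument(buffer, mimeType, fileName)), after which the ok flag tells the caller whether to continue or report the reason.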
Supported File Types¶
| File Type | MIME Types | Extensions | Notes |
|---|---|---|---|
| CSV | text/csv | .csv | Auto-detects , and ; delimiters |
| Excel | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel | .xlsx, .xls | Multi-sheet support; first sheet is primary |
| Text | text/plain | .txt | Parsed as line-separated data |
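Delimiter auto-detection for CSV can be approximated by counting candidate delimiters per line and preferring the one that appears consistently. The sketch below only illustrates the idea; detectDelimiter is a hypothetical function, the scoring rule is an assumption, and the parser's actual algorithm is not documented here.

```typescript
// Illustrative heuristic only: not the library's implementation.
// Scores each candidate by how often it appears on the first line and
// doubles the score when every sampled line has the same count.
// Quoted fields are not handled in this sketch.
function detectDelimiter(text: string, candidates: string[] = [",", ";"]): string {
  const lines = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .slice(0, 10); // sample the first ten non-empty lines

  let best = candidates[0];
  let bestScore = -1;
  for (const candidate of candidates) {
    const counts = lines.map((line) => line.split(candidate).length - 1);
    const first = counts.length > 0 ? counts[0] : 0;
    const consistent = counts.every((n) => n === first);
    const score = consistent ? first * 2 : first;
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best; // ties and empty input fall back to the first candidate
}
```

The consistency bonus matters for semicolon-delimited files that use a decimal comma, where commas appear in data rows but not in the header.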
File type is determined by a combination of MIME type, file extension, and content-based magic byte detection. Excel files are detected by ZIP (PK) and OLE signatures. CSV is detected by text-character density heuristics.
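To make the content-based checks concrete, here is a hedged sketch of signature sniffing. The ZIP local-file-header bytes (PK\x03\x04) and the OLE compound-file header are standard signatures; the function names, the 1024-byte sample, and the 0.9 density threshold are assumptions for illustration and may differ from what DocumentParser does internally.

```typescript
// Sketch under stated assumptions: not the library's implementation.
type SniffedKind = "zip" | "ole" | "text" | "unknown";

const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04" (.xlsx is a ZIP container)
const OLE_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]; // legacy .xls

function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((value, index) => bytes[index] === value);
}

// Fraction of sampled bytes that are printable ASCII, tab, LF, or CR.
// Non-ASCII UTF-8 text lowers this ratio, which a real heuristic would handle.
function textDensity(bytes: Uint8Array, sampleSize = 1024): number {
  const sample = bytes.subarray(0, sampleSize);
  if (sample.length === 0) return 0;
  let printable = 0;
  for (let i = 0; i < sample.length; i++) {
    const byte = sample[i];
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    const isPrintableAscii = byte >= 0x20 && byte <= 0x7e;
    if (isWhitespace || isPrintableAscii) printable++;
  }
  return printable / sample.length;
}

function sniff(bytes: Uint8Array): SniffedKind {
  if (startsWith(bytes, ZIP_SIGNATURE)) return "zip";
  if (startsWith(bytes, OLE_SIGNATURE)) return "ole";
  if (textDensity(bytes) >= 0.9) return "text"; // assumed threshold
  return "unknown";
}
```

A Node.js Buffer is a Uint8Array subclass, so sniff accepts one directly. A ZIP signature alone does not prove the file is a spreadsheet, which is one reason to combine it with the MIME type and extension.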
convertToJSONString¶
convertToJSONString(parsedData: ParsedDocumentData): string
Converts parsed data to a JSON array of objects, using headers as keys. Handles duplicate header names by appending a suffix. Useful for passing structured data to an LLM.
| Parameter | Type | Description |
|---|---|---|
| parsedData | ParsedDocumentData | Previously parsed document data. |
Returns: string -- JSON-formatted string of row objects.
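The header-to-key conversion can be pictured with the sketch below. It is an illustration only: uniqueKeys and rowsToJson are hypothetical helpers, and the _2, _3 suffix scheme is an assumption, since the exact suffix the library appends is not specified here.

```typescript
// Hypothetical helpers showing one way to build row objects from headers.
// The "_2", "_3" suffix format is assumed, not taken from the library.
function uniqueKeys(headers: string[]): string[] {
  const used = new Set<string>();
  return headers.map((header) => {
    let key = header;
    let n = 2;
    // Keep bumping the suffix until the key is unused.
    while (used.has(key)) {
      key = `${header}_${n}`;
      n++;
    }
    used.add(key);
    return key;
  });
}

function rowsToJson(headers: string[], rows: unknown[][]): string {
  const keys = uniqueKeys(headers);
  const objects = rows.map((row) => {
    const record: Record<string, unknown> = {};
    keys.forEach((key, index) => {
      record[key] = row[index] ?? null; // missing cells become null
    });
    return record;
  });
  return JSON.stringify(objects, null, 2);
}
```

Without deduplication, a later column with the same header would silently overwrite the earlier one in each row object.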
ParsedDocumentData¶
interface ParsedDocumentData {
  type: "csv" | "excel" | "text" | "unknown";
  sheets?: { name: string; data: any[][] }[];
  data?: any[][];
  headers?: string[];
  rowCount: number;
  columnCount: number;
  summary: string;
  structureType?: "data-rows" | "structured";
  structureConfidence?: number;
  contentDescription?: string;
}
| Field | Type | Description |
|---|---|---|
| type | string | Detected document type. |
| sheets | array | For Excel files, array of { name, data } per sheet. |
| data | any[][] | Two-dimensional array of cell values. First row is headers if headers is populated. |
| headers | string[] | Column header names. Empty columns are named Column1, Column2, etc. |
| rowCount | number | Number of data rows (excludes header row). |
| columnCount | number | Number of columns. |
| summary | string | Human-readable summary including row/column counts and sample data. |
| structureType | string | Heuristic classification: "data-rows" for tabular data suitable for SQL queries, or "structured" for nested/hierarchical formats (e.g. financial statements). |
| structureConfidence | number | Confidence score (0--1) for the structure classification. |
| contentDescription | string | Description of column names, row counts, and suitability for querying. |
Code Example¶
import { DocumentParser } from "@modernpath/agent-framework";
import * as fs from "node:fs";
const parser = new DocumentParser();
// Parse a CSV file
const csvBuffer = fs.readFileSync("./data/employees.csv");
const csvResult = await parser.parseDocument(csvBuffer, "text/csv", "employees.csv");
console.log(csvResult.summary);
// "CSV file with 150 rows and 5 columns. Columns: Name, Department, Salary, Start Date, Role"
console.log(csvResult.structureType); // "data-rows"
console.log(csvResult.headers); // ["Name", "Department", "Salary", "Start Date", "Role"]
// Parse an Excel file
const xlsxBuffer = fs.readFileSync("./data/financials.xlsx");
const xlsxResult = await parser.parseDocument(
  xlsxBuffer,
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  "financials.xlsx",
);
console.log(xlsxResult.sheets?.length); // Number of sheets
console.log(xlsxResult.structureType); // "structured" for financial statements
// Convert to JSON for LLM consumption
const jsonString = parser.convertToJSONString(csvResult);
console.log(jsonString);
// [{ "Name": "Alice", "Department": "Engineering", ... }, ...]
Related Pages¶
- DocumentQuery -- SQL-based querying of parsed documents
- DocumentAttachment -- attaching parsed documents to LLM prompts
- SharePoint Integration -- downloading documents for parsing