1.0.0 • Published 9 months ago
@racsodev/cv-pdf-to-json v1.0.0
cv-pdf-to-json
A TypeScript library for extracting and processing CV/resume data from PDF files with Claude AI integration. This package provides a complete pipeline for:
- PDF text extraction
- Text sanitization
- Structured CV data extraction
- Configurable output formats (JSON/TXT)
Installation
npm install cv-pdf-to-json
Usage
Basic Usage
import { createPdfExtractor } from "cv-pdf-to-json";

const processor = createPdfExtractor({
  anthropicApiKey: "your-claude-api-key",
  debug: false,
  outputPaths: {
    jsonDir: "./output/json",
    txtDir: "./output/text",
  },
  saveJson: true,
  saveTxt: true,
});

// Process a single PDF
const result = await processor.process("./path/to/cv.pdf");
console.log(result.data); // Extracted CV data in structured format

// Process a directory of PDFs
const results = await processor.processDirectory("./cv-directory", {
  recursive: true, // Include subdirectories
});
Advanced Usage
The package exports all core components for custom implementations:
import {
  PdfParseExtractor,
  ClaudeProcessor,
  DocumentProcessor,
  sanitizePrompt,
  ocrPrompt,
} from "cv-pdf-to-json";

// Create custom extractor
const extractor = new PdfParseExtractor({ debug: true });

// Create custom Claude processor
const llmProcessor = new ClaudeProcessor({
  apiKey: "your-claude-api-key",
  debug: true,
});

// Create custom document processor
const processor = new DocumentProcessor({
  extractor,
  llmProcessor,
  debug: true,
  saveJson: true,
  saveTxt: true,
  outputPaths: {
    pdfParse: {
      jsonDir: "./output/json",
      txtDir: "./output/text",
    },
  },
});
Configuration
PdfExtractorConfig
interface PdfExtractorConfig {
  anthropicApiKey: string; // Claude API key
  debug?: boolean; // Enable debug logging
  outputPaths?: {
    jsonDir?: string; // Directory for JSON output
    txtDir?: string; // Directory for text output
  };
  saveJson?: boolean; // Save JSON output
  saveTxt?: boolean; // Save text output
}
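Hard-coding the API key, as the usage examples do for brevity, is rarely what you want outside a demo. The following is a minimal sketch of building a PdfExtractorConfig from environment variables instead. The helper configFromEnv and the variable names ANTHROPIC_API_KEY, CV_DEBUG, and CV_OUTPUT_DIR are assumptions made for illustration; the package itself does not define them.

```typescript
// Local copy of the config shape documented above, so the sketch compiles on its own.
interface PdfExtractorConfig {
  anthropicApiKey: string;
  debug?: boolean;
  outputPaths?: { jsonDir?: string; txtDir?: string };
  saveJson?: boolean;
  saveTxt?: boolean;
}

// Hypothetical helper (not part of the package): builds a config from an
// environment-like object. The variable names used here are assumptions.
function configFromEnv(
  env: Record<string, string | undefined>
): PdfExtractorConfig {
  const apiKey = env.ANTHROPIC_API_KEY;
  if (!apiKey) {
    // Fail early rather than sending requests with an empty key.
    throw new Error("ANTHROPIC_API_KEY is not set");
  }
  const baseDir = env.CV_OUTPUT_DIR ?? "./output";
  return {
    anthropicApiKey: apiKey,
    debug: env.CV_DEBUG === "true",
    outputPaths: { jsonDir: `${baseDir}/json`, txtDir: `${baseDir}/text` },
    saveJson: true,
    saveTxt: true,
  };
}
```

In a Node.js project this would typically be called as createPdfExtractor(configFromEnv(process.env)).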
Output Format
JSON Structure
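Extraction and processing results use the following types. ExtractionResult holds the raw text pulled from the PDF, and ProcessorResult wraps the structured CV data: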
interface ExtractionResult {
  text: string; // Raw extracted text
  pages: Array<{
    pageNumber: number;
    text: string;
  }>;
}

interface ProcessorResult {
  success: boolean;
  data?: any; // Structured CV data
  error?: string;
}
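Because data and error are both optional, callers should check success before reading data. The sketch below shows one way to do that. It assumes that process() resolves to a ProcessorResult and that processDirectory() resolves to an array of them, which this README suggests but does not state outright. The helpers unwrapResult and summarize are hypothetical and not exported by the package.

```typescript
// Local copy of the result shape documented above.
interface ProcessorResult {
  success: boolean;
  data?: any;
  error?: string;
}

// Hypothetical helper: returns the structured data or throws with the reported error.
function unwrapResult(result: ProcessorResult): unknown {
  if (!result.success || result.data === undefined) {
    throw new Error(result.error ?? "CV processing failed without an error message");
  }
  return result.data;
}

// Hypothetical helper: splits a batch into extracted data and failure messages.
function summarize(results: ProcessorResult[]): {
  succeeded: unknown[];
  errors: string[];
} {
  const succeeded: unknown[] = [];
  const errors: string[] = [];
  for (const r of results) {
    if (r.success && r.data !== undefined) {
      succeeded.push(r.data);
    } else {
      errors.push(r.error ?? "unknown error");
    }
  }
  return { succeeded, errors };
}
```

Returning unknown rather than any forces the caller to validate the shape before using it, which is a deliberate choice given that the data comes from an LLM.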
The structured CV data includes:
- Personal information (name, contact details)
- Professional experiences
- Education history
- Skills (hard and soft)
- Languages
- Publications
- Distinctions
- Hobbies
- References
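Since ProcessorResult types data as any, a runtime check is useful before relying on specific fields. The README lists the categories above but not the exact field names, so the shape below is purely illustrative: CvDataSketch, its property names, and looksLikeCvData are all assumptions that should be adjusted against the JSON the package actually writes.

```typescript
// Hypothetical shape: the field names are assumptions, since the README
// lists categories only. Covers a subset of the categories for brevity.
interface CvDataSketch {
  personalInfo: { name: string; email?: string };
  experiences: unknown[];
  education: unknown[];
  skills: { hard: string[]; soft: string[] };
  languages: unknown[];
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Type guard: narrows an unknown value to the sketched shape.
function looksLikeCvData(value: unknown): value is CvDataSketch {
  if (!isRecord(value)) return false;
  const { personalInfo, experiences, education, skills, languages } = value;
  return (
    isRecord(personalInfo) &&
    typeof personalInfo.name === "string" &&
    Array.isArray(experiences) &&
    Array.isArray(education) &&
    isRecord(skills) &&
    isStringArray(skills.hard) &&
    isStringArray(skills.soft) &&
    Array.isArray(languages)
  );
}
```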
Features
- PDF text extraction with page-level granularity
- Text sanitization using Claude AI
- Structured CV data extraction
- Support for single files and directories
- Configurable output formats (JSON/TXT)
- TypeScript support
- Extensible architecture
- Debug mode for troubleshooting
Requirements
- Node.js >= 18.4.2
- Claude API key
- TypeScript (for TypeScript projects)
Architecture
The package uses a pipeline architecture:
1. PDF Extraction (PdfParseExtractor)
   - Extracts raw text from PDF files
   - Preserves page structure
2. Text Processing (ClaudeProcessor)
   - Sanitizes extracted text
   - Extracts structured CV data
3. Document Processing (DocumentProcessor)
   - Orchestrates the extraction pipeline
   - Handles file I/O
   - Manages output formats
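The stages above can be pictured as an extractor and an LLM processor composed by an orchestrating function. The sketch below illustrates that pattern only; it is not the package's internal code. The interfaces TextExtractor and LlmProcessor, their method names, and runPipeline are assumptions, and the stubs stand in for real PDF parsing and Claude calls so the example runs without network access.

```typescript
// Result shapes copied from the Output Format section.
interface ExtractionResult {
  text: string;
  pages: Array<{ pageNumber: number; text: string }>;
}
interface ProcessorResult {
  success: boolean;
  data?: any;
  error?: string;
}

// Hypothetical stage interfaces; the package's real class signatures may differ.
interface TextExtractor {
  extract(pdfPath: string): Promise<ExtractionResult>;
}
interface LlmProcessor {
  sanitize(text: string): Promise<string>;
  structure(cleanText: string): Promise<unknown>;
}

// Orchestrates extract -> sanitize -> structure and maps failures to a result object.
async function runPipeline(
  pdfPath: string,
  extractor: TextExtractor,
  llm: LlmProcessor
): Promise<ProcessorResult> {
  try {
    const extracted = await extractor.extract(pdfPath);
    const clean = await llm.sanitize(extracted.text);
    const data = await llm.structure(clean);
    return { success: true, data };
  } catch (err) {
    return {
      success: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}

// Stubs standing in for PDF parsing and the Claude API.
const stubExtractor: TextExtractor = {
  async extract() {
    const text = "  Jane Doe  ";
    return { text, pages: [{ pageNumber: 1, text }] };
  },
};
const stubLlm: LlmProcessor = {
  async sanitize(text) {
    return text.trim();
  },
  async structure(cleanText) {
    return { name: cleanText };
  },
};
```

Keeping each stage behind a small interface is what makes the pipeline extensible: a different extractor (for example, an OCR-based one) can be swapped in without touching the orchestration.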
License
ISC