1.0.0 • Published 9 months ago

@racsodev/cv-pdf-to-json v1.0.0

cv-pdf-to-json

A TypeScript library that extracts and processes CV/resume data from PDF files using Claude AI. The package provides a complete pipeline for:

  • PDF text extraction
  • Text sanitization
  • Structured CV data extraction
  • Configurable output formats (JSON/TXT)

Installation

npm install @racsodev/cv-pdf-to-json

Usage

Basic Usage

import { createPdfExtractor } from "@racsodev/cv-pdf-to-json";

const processor = createPdfExtractor({
  anthropicApiKey: "your-claude-api-key",
  debug: false,
  outputPaths: {
    jsonDir: "./output/json",
    txtDir: "./output/text",
  },
  saveJson: true,
  saveTxt: true,
});

// Process a single PDF
const result = await processor.process("./path/to/cv.pdf");
console.log(result.data); // Extracted CV data in structured format

// Process a directory of PDFs
const results = await processor.processDirectory("./cv-directory", {
  recursive: true, // Include subdirectories
});

Advanced Usage

The package exports all core components for custom implementations:

import {
  PdfParseExtractor,
  ClaudeProcessor,
  DocumentProcessor,
  sanitizePrompt,
  ocrPrompt,
} from "@racsodev/cv-pdf-to-json";

// Create custom extractor
const extractor = new PdfParseExtractor({ debug: true });

// Create custom Claude processor
const llmProcessor = new ClaudeProcessor({
  apiKey: "your-claude-api-key",
  debug: true,
});

// Create custom document processor
const processor = new DocumentProcessor({
  extractor,
  llmProcessor,
  debug: true,
  saveJson: true,
  saveTxt: true,
  outputPaths: {
    pdfParse: {
      jsonDir: "./output/json",
      txtDir: "./output/text",
    },
  },
});

Configuration

PdfExtractorConfig

interface PdfExtractorConfig {
  anthropicApiKey: string; // Claude API key
  debug?: boolean; // Enable debug logging
  outputPaths?: {
    jsonDir?: string; // Directory for JSON output
    txtDir?: string; // Directory for text output
  };
  saveJson?: boolean; // Save JSON output
  saveTxt?: boolean; // Save text output
}
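
As a minimal sketch of how such a config might be assembled without hard-coding the API key, the helper below reads it from an environment-style object and fails fast when it is missing. The helper buildConfig, the variable names ANTHROPIC_API_KEY and CV_DEBUG, and the default values are assumptions made for illustration; none of them are part of the package.

```typescript
// Mirrors the PdfExtractorConfig interface documented above.
interface PdfExtractorConfig {
  anthropicApiKey: string;
  debug?: boolean;
  outputPaths?: { jsonDir?: string; txtDir?: string };
  saveJson?: boolean;
  saveTxt?: boolean;
}

// Hypothetical helper (not part of the package): builds a config from
// environment-style variables and throws early if no API key is available.
// ANTHROPIC_API_KEY and CV_DEBUG are assumed variable names.
function buildConfig(
  env: Record<string, string | undefined>,
  overrides: Partial<PdfExtractorConfig> = {},
): PdfExtractorConfig {
  const anthropicApiKey = overrides.anthropicApiKey ?? env["ANTHROPIC_API_KEY"];
  if (!anthropicApiKey) {
    throw new Error("Missing Claude API key: set ANTHROPIC_API_KEY");
  }
  return {
    // Illustrative defaults; explicit overrides take precedence.
    debug: env["CV_DEBUG"] === "true",
    saveJson: true,
    saveTxt: false,
    outputPaths: { jsonDir: "./output/json", txtDir: "./output/text" },
    ...overrides,
    anthropicApiKey,
  };
}
```

In a Node.js project the result of buildConfig(process.env) could then be passed to createPdfExtractor in place of an inline object.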

Output Format

JSON Structure

Extraction and processing results use the following types:

interface ExtractionResult {
  text: string; // Raw extracted text
  pages: Array<{
    pageNumber: number;
    text: string;
  }>;
}

interface ProcessorResult {
  success: boolean;
  data?: any; // Structured CV data
  error?: string;
}
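
Because data and error are both optional, callers need to check success before reading either field. The sketch below shows one way this might be done. The helpers unwrapResult and partitionResults are hypothetical, and the assumption that a directory run yields an array of ProcessorResult objects is not confirmed by the source.

```typescript
// Mirrors the ProcessorResult interface documented above, but uses
// `unknown` instead of `any` so callers have to narrow the data.
interface ProcessorResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Hypothetical helper: returns the data or throws with the reported error.
function unwrapResult(result: ProcessorResult): unknown {
  if (!result.success || result.data === undefined) {
    throw new Error(result.error ?? "CV processing failed without an error message");
  }
  return result.data;
}

// Hypothetical helper: splits a batch of results into extracted data and
// error messages (assumes a batch is a plain array of ProcessorResult).
function partitionResults(results: ProcessorResult[]): {
  succeeded: unknown[];
  failed: string[];
} {
  const succeeded: unknown[] = [];
  const failed: string[] = [];
  for (const r of results) {
    if (r.success && r.data !== undefined) {
      succeeded.push(r.data);
    } else {
      failed.push(r.error ?? "unknown error");
    }
  }
  return { succeeded, failed };
}
```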

The structured CV data (the data field of ProcessorResult) includes:

  • Personal information (name, contact details)
  • Professional experiences
  • Education history
  • Skills (hard and soft)
  • Languages
  • Publications
  • Distinctions
  • Hobbies
  • References
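
Since data is typed as any, it may be worth checking which sections are actually present before using them. The sketch below does that against a list of key names. Those key names are purely hypothetical: the source lists the sections but not the JSON keys, so they should be adjusted to match the real output.

```typescript
// Hypothetical key names, one per section listed above. The package's
// actual JSON keys are not documented here and may differ.
const CV_SECTIONS = [
  "personalInformation",
  "experiences",
  "education",
  "skills",
  "languages",
  "publications",
  "distinctions",
  "hobbies",
  "references",
] as const;

type CvSection = (typeof CV_SECTIONS)[number];

// Returns the sections that are absent (undefined or null) in the data.
function missingSections(data: unknown): CvSection[] {
  if (typeof data !== "object" || data === null) {
    return [...CV_SECTIONS];
  }
  const record = data as Record<string, unknown>;
  return CV_SECTIONS.filter(
    (key) => record[key] === undefined || record[key] === null,
  );
}
```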

Features

  • PDF text extraction with page-level granularity
  • Text sanitization using Claude AI
  • Structured CV data extraction
  • Support for single files and directories
  • Configurable output formats (JSON/TXT)
  • TypeScript support
  • Extensible architecture
  • Debug mode for troubleshooting

Requirements

  • Node.js >= 18.4.2
  • Claude API key
  • TypeScript (for TypeScript projects)

Architecture

The package uses a pipeline architecture:

  1. PDF Extraction (PdfParseExtractor)
    • Extracts raw text from PDF files
    • Preserves page structure
  2. Text Processing (ClaudeProcessor)
    • Sanitizes extracted text
    • Extracts structured CV data
  3. Document Processing (DocumentProcessor)
    • Orchestrates the extraction pipeline
    • Handles file I/O
    • Manages output formats
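
The three stages above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the package's implementation: the Extractor and LlmProcessor contracts, the method names extract, sanitize and extractCv, and the two stubs are all hypothetical, and file I/O is left out.

```typescript
interface ExtractionResult {
  text: string;
  pages: Array<{ pageNumber: number; text: string }>;
}

interface ProcessorResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

// Assumed stage contracts; the package's real interfaces may differ.
interface Extractor {
  extract(filePath: string): Promise<ExtractionResult>;
}
interface LlmProcessor {
  sanitize(text: string): Promise<string>;
  extractCv(text: string): Promise<unknown>;
}

// Orchestrates extraction, sanitization and structuring, and converts
// any thrown error into a failed ProcessorResult.
async function runPipeline(
  filePath: string,
  extractor: Extractor,
  llm: LlmProcessor,
): Promise<ProcessorResult> {
  try {
    const extraction = await extractor.extract(filePath);
    const clean = await llm.sanitize(extraction.text);
    const data = await llm.extractCv(clean);
    return { success: true, data };
  } catch (err) {
    return {
      success: false,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}

// Stub standing in for a PDF extractor: returns two fixed pages.
const stubExtractor: Extractor = {
  async extract(_filePath) {
    const pages = [
      { pageNumber: 1, text: "Jane  Doe\n" },
      { pageNumber: 2, text: "Skills: TypeScript" },
    ];
    return { text: pages.map((p) => p.text).join(""), pages };
  },
};

// Stub standing in for the LLM stage: collapses runs of spaces and tabs,
// then reports how many lines the cleaned text has.
const stubLlm: LlmProcessor = {
  async sanitize(text) {
    return text.replace(/[ \t]+/g, " ").trim();
  },
  async extractCv(text) {
    return { lineCount: text.split("\n").length };
  },
};
```

Keeping each stage behind a narrow interface is what makes it possible to swap in a different extractor or processor, as the Advanced Usage section does with its own instances.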

License

ISC
