@racsodev/cv-pdf-to-json NPM

cv-pdf-to-json

A TypeScript library for extracting and processing CV/resume data from PDF files with Claude AI integration. This package provides a complete pipeline for:

- PDF text extraction
- Text sanitization
- Structured CV data extraction
- Configurable output formats (JSON/TXT)

Installation

npm install cv-pdf-to-json

Usage

Basic Usage

import { createPdfExtractor } from "cv-pdf-to-json";

const processor = createPdfExtractor({
  anthropicApiKey: "your-claude-api-key",
  debug: false,
  outputPaths: {
    jsonDir: "./output/json",
    txtDir: "./output/text",
  },
  saveJson: true,
  saveTxt: true,
});

// Process a single PDF
const result = await processor.process("./path/to/cv.pdf");
console.log(result.data); // Extracted CV data in structured format

// Process a directory of PDFs
const results = await processor.processDirectory("./cv-directory", {
  recursive: true, // Include subdirectories
});
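
How `processDirectory` discovers files is not documented here. As a rough illustration of what a `recursive` option typically implies, the sketch below collects `.pdf` paths using Node's built-in `fs` and `path` modules. `collectPdfPaths` is a hypothetical helper, not part of this package's API.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper (not exported by cv-pdf-to-json): list the PDF files
// in a directory, optionally descending into subdirectories.
function collectPdfPaths(dir: string, recursive = false): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      // Only descend when the caller asked for subdirectories.
      if (recursive) found.push(...collectPdfPaths(fullPath, true));
    } else if (entry.isFile() && path.extname(entry.name).toLowerCase() === ".pdf") {
      found.push(fullPath);
    }
  }
  // Sort so the processing order is stable across platforms.
  return found.sort();
}
```

A list like this could be fed to `processor.process` one file at a time when finer control over ordering or error handling is needed than a single directory call offers.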

Advanced Usage

The package exports all core components for custom implementations:

import {
  PdfParseExtractor,
  ClaudeProcessor,
  DocumentProcessor,
  sanitizePrompt,
  ocrPrompt,
} from "cv-pdf-to-json";

// Create custom extractor
const extractor = new PdfParseExtractor({ debug: true });

// Create custom Claude processor
const llmProcessor = new ClaudeProcessor({
  apiKey: "your-claude-api-key",
  debug: true,
});

// Create custom document processor
const processor = new DocumentProcessor({
  extractor,
  llmProcessor,
  debug: true,
  saveJson: true,
  saveTxt: true,
  outputPaths: {
    pdfParse: {
      jsonDir: "./output/json",
      txtDir: "./output/text",
    },
  },
});

Configuration

PdfExtractorConfig

interface PdfExtractorConfig {
  anthropicApiKey: string; // Claude API key
  debug?: boolean; // Enable debug logging
  outputPaths?: {
    jsonDir?: string; // Directory for JSON output
    txtDir?: string; // Directory for text output
  };
  saveJson?: boolean; // Save JSON output
  saveTxt?: boolean; // Save text output
}
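
To keep the API key out of source control, the configuration can be assembled from environment variables. The following is a minimal sketch under stated assumptions: `configFromEnv` is a hypothetical helper, and the variable names `ANTHROPIC_API_KEY`, `CV_DEBUG`, and `CV_OUTPUT_DIR` are illustrative choices, not names the package reads on its own.

```typescript
// Copied from the interface above so the sketch is self-contained.
interface PdfExtractorConfig {
  anthropicApiKey: string;
  debug?: boolean;
  outputPaths?: { jsonDir?: string; txtDir?: string };
  saveJson?: boolean;
  saveTxt?: boolean;
}

// Hypothetical helper: build a config from environment variables.
// The variable names used here are assumptions made for this example.
function configFromEnv(env: Record<string, string | undefined>): PdfExtractorConfig {
  const apiKey = env["ANTHROPIC_API_KEY"];
  if (!apiKey) {
    // Fail early rather than sending requests with a missing key.
    throw new Error("ANTHROPIC_API_KEY is not set");
  }
  const outputDir = env["CV_OUTPUT_DIR"] ?? "./output";
  return {
    anthropicApiKey: apiKey,
    debug: env["CV_DEBUG"] === "true",
    outputPaths: {
      jsonDir: `${outputDir}/json`,
      txtDir: `${outputDir}/text`,
    },
    saveJson: true,
    saveTxt: true,
  };
}
```

In a Node.js script this would typically be called as `configFromEnv(process.env)`, with the result passed to `createPdfExtractor`.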

Output Format

JSON Structure

Extraction and processing results have the following shapes:

interface ExtractionResult {
  text: string; // Raw extracted text
  pages: Array<{
    pageNumber: number;
    text: string;
  }>;
}

interface ProcessorResult {
  success: boolean;
  data?: any; // Structured CV data
  error?: string;
}
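
Because `data` and `error` are both optional, callers should check `success` before reading either. A minimal sketch of that check follows; `unwrapResult` is a hypothetical helper, not something the package exports.

```typescript
// Copied from the interface above so the sketch is self-contained.
interface ProcessorResult {
  success: boolean;
  data?: any;
  error?: string;
}

// Hypothetical helper: return the data of a successful result, or throw
// with the reported error so failures are not silently ignored.
function unwrapResult(result: ProcessorResult): unknown {
  if (!result.success) {
    throw new Error(result.error ?? "CV processing failed without an error message");
  }
  if (result.data === undefined) {
    throw new Error("CV processing succeeded but returned no data");
  }
  return result.data;
}
```

Returning `unknown` rather than `any` forces the caller to narrow the value before use, which helps when the structured output varies between CVs.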

The structured CV data includes:

- Personal information (name, contact details)
- Professional experiences
- Education history
- Skills (hard and soft)
- Languages
- Publications
- Distinctions
- Hobbies
- References
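
Since `data` is typed as `any`, consumers may want to narrow it before use. The sketch below shows one possible shape covering the categories listed above, plus a minimal runtime check. Every field name here is hypothetical: the package's actual keys may differ, so inspect a real output file before relying on them.

```typescript
// Hypothetical shape: field names are assumptions, not the package's schema.
interface CvData {
  personalInformation: { name: string; email?: string; phone?: string };
  experiences: Array<{ title: string; company?: string; period?: string }>;
  education: Array<{ degree: string; institution?: string }>;
  skills: { hard: string[]; soft: string[] };
  languages: string[];
  publications?: string[];
  distinctions?: string[];
  hobbies?: string[];
  references?: string[];
}

// Minimal type guard: checks only the fields this sketch treats as required.
function isCvData(value: unknown): value is CvData {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const personal = v["personalInformation"] as Record<string, unknown> | undefined;
  const skills = v["skills"] as Record<string, unknown> | undefined;
  return (
    typeof personal === "object" && personal !== null &&
    typeof personal["name"] === "string" &&
    Array.isArray(v["experiences"]) &&
    Array.isArray(v["education"]) &&
    typeof skills === "object" && skills !== null &&
    Array.isArray(skills["hard"]) &&
    Array.isArray(skills["soft"]) &&
    Array.isArray(v["languages"])
  );
}
```

A guard like this only verifies top-level structure. For stricter validation, a schema library would be a more thorough option.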

Features

- PDF text extraction with page-level granularity
- Text sanitization using Claude AI
- Structured CV data extraction
- Support for single files and directories
- Configurable output formats (JSON/TXT)
- TypeScript support
- Extensible architecture
- Debug mode for troubleshooting

Requirements

- Node.js >= 18.4.2
- Claude API key
- TypeScript (for TypeScript projects)

Architecture

The package uses a pipeline architecture:

1. PDF Extraction (PdfParseExtractor)
   - Extracts raw text from PDF files
   - Preserves page structure
2. Text Processing (ClaudeProcessor)
   - Sanitizes extracted text
   - Extracts structured CV data
3. Document Processing (DocumentProcessor)
   - Orchestrates the extraction pipeline
   - Handles file I/O
   - Manages output formats
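
The three stages above can be sketched as a small orchestrator over two interfaces. This illustrates the pipeline pattern only. The interface and method names (`Extractor.extract`, `LlmProcessor.sanitize`, `LlmProcessor.structure`, `runPipeline`) are assumptions and may not match the real classes, and the stubs stand in for pdf-parse and the Claude API so the example runs offline.

```typescript
// Hypothetical interfaces mirroring the stages; the real classes may differ.
interface ExtractionResult {
  text: string;
  pages: Array<{ pageNumber: number; text: string }>;
}

interface ProcessorResult {
  success: boolean;
  data?: unknown;
  error?: string;
}

interface Extractor {
  extract(filePath: string): Promise<ExtractionResult>;
}

interface LlmProcessor {
  sanitize(text: string): Promise<string>;
  structure(text: string): Promise<unknown>;
}

// Orchestrator: extract, sanitize, then structure. Failures become error results.
async function runPipeline(
  filePath: string,
  extractor: Extractor,
  llm: LlmProcessor,
): Promise<ProcessorResult> {
  try {
    const extraction = await extractor.extract(filePath);
    const cleanText = await llm.sanitize(extraction.text);
    const data = await llm.structure(cleanText);
    return { success: true, data };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Stub stages so the sketch needs neither a PDF file nor an API key.
const stubExtractor: Extractor = {
  async extract() {
    const text = "  Jane   Doe \n Engineer ";
    return { text, pages: [{ pageNumber: 1, text }] };
  },
};

const stubLlm: LlmProcessor = {
  async sanitize(text) {
    // Collapse runs of whitespace as a stand-in for LLM-based cleanup.
    return text.replace(/\s+/g, " ").trim();
  },
  async structure(text) {
    return { rawSummary: text };
  },
};
```

Keeping each stage behind an interface is what makes the design extensible: a different extractor or model client can be swapped in without touching the orchestration logic.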

License

ISC

Keywords

pdf, cv, resume, extract, claude, ai, typescript

Dependencies

@anthropic-ai/sdk, pdf-parse

Version

1.0.0
