libpdf v0.2.1
LibPdf
LibPdf is a fast and efficient Node.js library for converting PDF files to text. This open-source project aims to simplify the process of extracting text content from PDFs, making it easier for developers to work with PDF data in their applications.
Features
- Fast PDF to text conversion
- Easy-to-use API
Installing libpdf
To install libpdf and its dependencies, ensure you have a supported version of Node installed. You can install the project with npm. In the project directory, run:
$ npm install libpdf --save
This command installs the project together with all of its dependencies.
Building libpdf
To build libpdf, you need to have Rust installed. If you have already installed the project and only want to run the build, use:
$ npm run build
Exploring libpdf
After building libpdf, you can explore its exports at the Node REPL.
Using Node Shell
Install libpdf:
$ npm install libpdf --save
Open Node REPL:
$ node
Execute the following commands:
> const pdfFile = require("fs").readFileSync("doc.pdf");
> const doc = require("libpdf").document(pdfFile);
> console.log(doc);
Using a JavaScript file
Create a file named index.js with the following content:
const pdfFile = require("fs").readFileSync("doc.pdf");
const doc = require("libpdf").document(pdfFile);
console.log(doc);
Run the file with Node:
$ node index.js
This setup ensures you can easily install, build, and explore the capabilities of libpdf.
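The snippets above read doc.pdf and pass its bytes straight to document(). A slightly more defensive variant is sketched below. The helper name convertPdfFile is hypothetical and not part of libpdf, and the converter is passed in as an argument so the helper makes no assumptions about what document() returns.

```javascript
const fs = require("fs");

// Hypothetical helper (not part of libpdf): reads a file, checks that it
// looks like a PDF, and hands the bytes to the supplied converter function.
function convertPdfFile(pdfPath, convert) {
  const bytes = fs.readFileSync(pdfPath); // throws if the file cannot be read
  // A well-formed PDF file starts with the marker "%PDF-".
  if (bytes.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error(`${pdfPath} does not look like a PDF file`);
  }
  return convert(bytes);
}

// Usage with libpdf, assuming it is installed and doc.pdf exists:
//   const doc = convertPdfFile("doc.pdf", (bytes) => require("libpdf").document(bytes));
//   console.log(doc);
```

Checking the header first turns a wrong input file into a clear error message before any conversion is attempted.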
Benchmark Result
Conclusion
- Best for Small and Medium PDFs: libPdf consistently performs the fastest for small and medium PDF files, showing significant speed advantages over pdf-lib and pdf-parse.
- Balanced Performance: pdf-parse offers a middle-ground performance across all file sizes but is generally slower than libPdf for smaller files and pdf-lib for medium files.
- Inefficiency with Complex PDFs: libPdf shows a notable drop in performance with complex PDFs, taking significantly longer compared to pdf-parse and pdf-lib.
- Library Efficiency: pdf-lib excels with small and medium PDFs but struggles significantly with large and complex documents, making it less suitable for those cases.
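To compare libraries on your own files, a small timing helper is enough for a first impression. The sketch below is not the harness behind the conclusions above. The name meanRuntimeMs is hypothetical, and the commented usage assumes that document() is synchronous, as the usage examples suggest.

```javascript
// Hypothetical timing helper: calls fn `runs` times and returns the mean
// wall-clock time per call in milliseconds.
function meanRuntimeMs(fn, runs = 10) {
  const start = process.hrtime.bigint(); // monotonic clock, in nanoseconds
  for (let i = 0; i < runs; i++) {
    fn();
  }
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / 1e6 / runs;
}

// Usage, assuming libpdf is installed and doc.pdf exists:
//   const bytes = require("fs").readFileSync("doc.pdf");
//   const libpdf = require("libpdf");
//   console.log(meanRuntimeMs(() => libpdf.document(bytes)).toFixed(2), "ms per run");
```

Reading the file once outside the timed function keeps disk I/O out of the measurement, so only the conversion itself is timed.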
Future Steps (TODO)
- Run Benchmark
- Add support for extracting text from specific pages
- Improve text extraction accuracy for complex PDFs
- Implement batch processing for multiple PDFs
- Add CLI support for direct command-line usage
- Create detailed documentation and examples
Known issues
- PDFs that use Identity-H encoding are not supported
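A rough pre-check for this limitation is to look for the name /Identity-H in the raw file bytes before converting. The function below is a hypothetical helper, not part of libpdf, and it is only a heuristic: a PDF that stores its font dictionaries in compressed object streams can use Identity-H without the name appearing in plain bytes, so a false result does not prove the file is safe.

```javascript
// Hypothetical pre-check (not part of libpdf): reports whether the raw PDF
// bytes contain the name "/Identity-H" in uncompressed form.
function mentionsIdentityH(pdfBytes) {
  // Buffer.prototype.includes accepts a string, a byte offset and an encoding.
  return pdfBytes.includes("/Identity-H", 0, "latin1");
}

// Usage, assuming doc.pdf exists:
//   const bytes = require("fs").readFileSync("doc.pdf");
//   if (mentionsIdentityH(bytes)) {
//     console.warn("doc.pdf appears to use Identity-H encoding");
//   }
```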
Contributing
We welcome contributions to improve LibPdf! Feel free to submit issues and pull requests on our GitHub repository.
License
This project is licensed under the Apache-2.0 license.