@pdftron/data-extraction

This package is meant to be used in conjunction with @pdftron/pdfnet-node to support IDP data extraction from Apryse. For more information on usage, follow this guide: https://docs.apryse.com/documentation/core/guides/intelligent-data-extraction/

For further reading, check out our blog post on the project: https://apryse.com/blog/introducing-automated-data-extraction-pdf-idp

Supported platform, Node.js, and Electron versions

This package depends on unmanaged add-on binaries, which are not cross-platform. The following are currently supported:

OS: Linux (excluding Alpine), Windows (x64)
Node.js version: 8 - 22
Electron version: 6 - 30

Installation will fail if your OS, Node.js or Electron version is not supported.
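
Because an unsupported environment only shows up as a failed install, it can help to check the environment beforehand. The following is a minimal sketch, not part of this package: the helper names (checkSupport, majorOf) are hypothetical, the ranges are copied from the list above, and the check does not detect Alpine Linux.

```javascript
// Hypothetical pre-install check; not part of @pdftron/data-extraction.
// Supported ranges copied from the list above; update them if the list changes.
const SUPPORTED = {
  node: { min: 8, max: 22 },
  electron: { min: 6, max: 30 },
};

// Extracts the major version number from strings such as "v18.17.0" or "30.0.1".
function majorOf(version) {
  return parseInt(String(version).replace(/^v/, ''), 10);
}

// Returns a list of human-readable problems; an empty list means no problem was found.
// Note: Alpine Linux is NOT detected here, although it is listed as unsupported.
function checkSupport(env) {
  const found = [];

  const platformOk =
    env.platform === 'linux' || (env.platform === 'win32' && env.arch === 'x64');
  if (!platformOk) {
    found.push('Unsupported platform: ' + env.platform + ' (' + env.arch + ')');
  }

  const node = majorOf(env.nodeVersion);
  if (node < SUPPORTED.node.min || node > SUPPORTED.node.max) {
    found.push('Unsupported Node.js version: ' + env.nodeVersion);
  }

  // Assumption: the Electron range only matters when running inside Electron.
  if (env.electronVersion !== undefined) {
    const electron = majorOf(env.electronVersion);
    if (electron < SUPPORTED.electron.min || electron > SUPPORTED.electron.max) {
      found.push('Unsupported Electron version: ' + env.electronVersion);
    }
  }

  return found;
}

const problems = checkSupport({
  platform: process.platform,
  arch: process.arch,
  nodeVersion: process.versions.node,
  electronVersion: process.versions.electron, // undefined outside Electron
});
console.log(problems.length === 0 ? 'Environment looks supported' : problems.join('\n'));
```

Running this script with plain Node.js before installing gives a readable message instead of a failed install.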

Usage

Add the @pdftron/data-extraction package as a dependency in your package.json, for example by running npm install @pdftron/data-extraction.

In your @pdftron/pdfnet-node code, add the following line after initialization:

await PDFNet.addResourceSearchPath("./node_modules/@pdftron/data-extraction/lib")

Here is an example of data extraction that uses this line:

const fs = require('fs'); // needed for fs.writeFileSync below
const { PDFNet } = require('@pdftron/pdfnet-node');
const licenseKey = "Insert license key here";
const inputFile = "Insert input file location here";

async function main() {
    // This is where we import data-extraction
    await PDFNet.addResourceSearchPath("./node_modules/@pdftron/data-extraction/lib");

    // Extract document structure as a JSON file
    console.log('Extract document structure as a JSON file');

    let outputFile = 'out/paragraphs_and_tables.json';
    await PDFNet.DataExtractionModule.extractData(inputFile, outputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);

    console.log('Result saved in ' + outputFile);

    // Extract document structure as a JSON string
    console.log('Extract document structure as a JSON string');

    outputFile = 'out/tagged.json';
    const json = await PDFNet.DataExtractionModule.extractDataAsString(inputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);

    fs.writeFileSync(outputFile, json);
}

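PDFNet.runWithCleanup(main, licenseKey).catch(function (error) {
    // JSON.stringify on an Error object yields "{}", so prefer its message when present
    console.log('Error: ' + (error && error.message ? error.message : JSON.stringify(error)));
}).then(function () { return PDFNet.shutdown(); });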

A larger code sample can be found here

To get started, please see the documentation at https://www.pdftron.com/documentation/nodejs/get-started/integration