fresh-tabula-js v2.0.0
fresh-tabula-js
Convert tables inside PDFs to CSV via
tabula-javausing JavaScript.
This is a maintained fork of the tabula-js package,
with changes such as:
- Non-stream asynchronous extraction (use
async/await)
Please submit any issues (or e-mail me).
Contents
Getting Started
Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.
Requirements
- Java Development Kit (JDK) with
javaavailable via command-line - Node.js/npm
Installing
To install as a dependency via npm:
$ npm install --save fresh-tabula-jsUsage
Import the module:
// 1. Import the module
const Tabula = require('fresh-tabula-js');
const extractData = async () => {
// 2. Instantiate a table via passing a path to a PDF (this can be relative or absolute)
const table = new Tabula('data/foobar.pdf');
// 3. Call an extraction method
return await table.getData();
};
// 4. Call the method!
const data = extractData();API
First, an instance of Tabula must be instantiated via calling tabula
with a path (relative or absolute) to a valid PDF.
Example:
const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuffOptions
All extraction methods support the same set of options.
Options are passed through to tabula-java with some exceptions, such as the inability to write the output to file (-o). Extracted data is available through callbacks, streams, and return values.
Options are structured as a plain object.
| Key | Type | Default | Description |
|---|---|---|---|
area | String or Array | Entire page | Co-ordinates of the portion(s) of the page to analyze, formatted in strings in the following format top,left,bottom,right. For example, 269.875,12.75,790.5,561 or ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"]. |
columns | String | none | X coordinates of column boundaries. Example "10.1,20.2,30.3" |
debug | Boolean | false | Print detected table areas instead of processing them. |
guess | Boolean | true | Guess the portion(s) of the page to analyze and process. |
silent | Boolean | false | Suppresses all stderr output from the tabula-java JAR only. JavaScript errors will still be logged. |
noSpreadsheet | Boolean | false | Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
pages | String | 1 | Comma separated list of ranges, or all. E.g. 1-3,5-7, 3, all. |
spreadsheet | Boolean | false | Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
password | String | empty | Password used to decrypt/access the document. |
useLineReturns | Boolean | false | Use embedded line returns in cells (only in spreadsheet mode). |
Methods
Tabula.getData
Use this method to process extracted data from PDF asynchronously using async/await.
It returns an object in the following format:
{
output: <String>,
error: <String>,
}Example:
const Tabula = require('fresh-tabula-js');
const data = async () => {
const table = new Tabula('dir/foobar.pdf');
return await table.getData();
};Tabula.streamSections
Use this method to process extracted data in sections (separate tables).
Callbacks will be executed for each parsed section of the PDF.
Extracted data is a string representing an array of all rows (in CSV format) found, including headers.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));We can use the area option to analyze specific portions of the document.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));Tabula.stream
This is used to process data from PDFs via streams.
Example:
const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
.stream()
.pipe(process.stdout);The underlying library is built on streams using Highland.js.
This means the returned stream can perform highland-js-style transformations and operations.
Example:
const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
.stream();
stream.split()
.doto(console.log)
.done(() => console.log('All done!'));Developing
Introduction
Branch Usage
Development is done in the develop branch.
When master changes (e.g. via pull request), Travis CI
will build and deploy a new version of the package using semantic versioning based on commit messages
to determine the version type.
Commit Message Convention
Commit messages must be formatted according to the conventional commits Angular spec:
<type>[optional scope]: <description>
[optional body]
[optional footer]The following types are supported:
- build: Changes that affect the build system or npm dependencies
- ci: Changes to CI config (e.g. Travis CI config changes)
- docs: Documentation-only changes
- feat: New features
- fix: Bug fix
- perf: Code change related to performance
- refactor: A code change that neither fixes a bug nor adds a feature
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
Rules configuration is found in in release.config.js.
Installing
- Clone the repository.
Switch to the
developbranch:git checkout developInstall dependencies:
$ npm install
Testing
To run tests:
$ npm run testTo run tests in watch mode:
$ npm run test:watchTo run test coverage:
$ npm run test:covBuilding
To run deployment builds:
$ npm run buildDeploying
- Push the changes to
develop. - Merge to
mastervia pull request.
Travis CI will build and deploy the new version of the package (based on semantic commits) to NPM.
Acknowledgements
- Ezo Saleh, original author of this package
- The tabula-java team
- tabula