Convert tables inside PDFs to CSV via tabula-java using JavaScript.
THIS IS NO LONGER MAINTAINED, BUT I WILL ACCEPT PRs.
This is a fork of the tabula-js package, with changes such as:
- Non-stream asynchronous extraction (use async/await)
Please submit any issues (or e-mail me).
Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.
- Java Development Kit (JDK) with java available via command-line
- Node.js/npm
To install as a dependency via npm:
$ npm install --save fresh-tabula-js
Import the module:
// 1. Import the module
const Tabula = require('fresh-tabula-js');
const extractData = async () => {
// 2. Instantiate a table via passing a path to a PDF (this can be relative or absolute)
const table = new Tabula('data/foobar.pdf');
// 3. Call an extraction method
return await table.getData();
};
// 4. Call the method! It returns a Promise, so wait for the resolved value
extractData().then((data) => console.log(data.output));
First, create an instance by calling the Tabula constructor with a path (relative or absolute) to a valid PDF.
Example:
const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuff
All extraction methods support the same set of options.
Options are passed through to tabula-java with some exceptions, such as the inability to write the output to file (-o). Extracted data is available through callbacks, streams, and return values.
Options are structured as a plain object, passed as the second argument to the constructor.
Key | Type | Default | Description |
---|---|---|---|
area | String or Array | Entire page | Co-ordinates of the portion(s) of the page to analyze, formatted in strings in the following format: top,left,bottom,right. For example, "269.875,12.75,790.5,561" or ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"]. |
columns | String | none | X coordinates of column boundaries. Example: "10.1,20.2,30.3" |
debug | Boolean | false | Print detected table areas instead of processing them. |
guess | Boolean | true | Guess the portion(s) of the page to analyze and process. |
silent | Boolean | false | Suppresses all stderr output from the tabula-java JAR only. JavaScript errors will still be logged. |
noSpreadsheet | Boolean | false | Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
pages | String | 1 | Comma separated list of ranges, or all. E.g. 1-3,5-7, 3, all. |
spreadsheet | Boolean | false | Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet). |
password | String | empty | Password used to decrypt/access the document. |
useLineReturns | Boolean | false | Use embedded line returns in cells (only in spreadsheet mode). |
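The pages format described in the table can be checked before a value is handed to the library. The following is a minimal sketch, not part of fresh-tabula-js: the helper name isValidPagesOption is hypothetical, and the rules it enforces (positive integers, no spaces, ranges that do not run backwards) are assumptions drawn from the examples in the table rather than from tabula-java itself.

```javascript
// Hypothetical helper (not part of fresh-tabula-js): checks that a `pages`
// value looks like "all", a single page ("3"), or comma-separated pages and
// ranges ("1-3,5-7"), following the examples in the options table.
function isValidPagesOption(value) {
  if (typeof value !== 'string') return false;
  if (value === 'all') return true;

  // Each comma-separated part must be "N" or "N-M" with positive integers.
  return value.split(',').every((part) => {
    const match = /^([1-9]\d*)(?:-([1-9]\d*))?$/.exec(part);
    if (!match) return false;
    // Assumption: a range must not run backwards, e.g. "7-5".
    return match[2] === undefined || Number(match[1]) <= Number(match[2]);
  });
}

console.log(isValidPagesOption('1-3,5-7')); // true
console.log(isValidPagesOption('7-5'));     // false
```

A value that passes this check could then be used as the pages key of the options object.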
Use getData() to process data extracted from a PDF asynchronously using async/await.
It returns an object in the following format:
{
output: <String>,
error: <String>,
}
Example:
const Tabula = require('fresh-tabula-js');
const data = async () => {
const table = new Tabula('dir/foobar.pdf');
return await table.getData();
};
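Once the promise resolves, the result object usually needs to be turned into rows. The sketch below shows one way to do that without extra dependencies. It rests on two assumptions that the text above does not confirm: that output holds CSV text, and that error is empty when extraction succeeded. The helper names parseCsv and rowsFromResult are hypothetical and not part of fresh-tabula-js.

```javascript
// Hypothetical helpers, not part of fresh-tabula-js.

// Split CSV text into rows of fields, honouring double-quoted fields
// (which may contain commas, line breaks, and "" as an escaped quote).
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < text.length; i += 1) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"' && text[i + 1] === '"') {
        field += '"'; // escaped quote inside a quoted field
        i += 1;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n') {
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Assumption: `error` is empty (or missing) when extraction succeeded.
function rowsFromResult(result) {
  if (result.error) throw new Error(result.error);
  return parseCsv(result.output);
}

// Stand-in for a resolved getData() result, used only for illustration.
const sample = { output: 'name,notes\n"Doe, Jane","said ""hi"""\n', error: '' };
console.log(rowsFromResult(sample));
```

In real use, the object passed to rowsFromResult would be the value awaited from table.getData() rather than the hand-written sample.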
Use streamSections() to process extracted data in sections (separate tables).
The callback is executed for each parsed section of the PDF.
Extracted data is a string representing an array of all rows (in CSV format) found, including headers.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));
We can use the area option to analyze specific portions of the document.
const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));
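Writing area strings by hand makes it easy to swap coordinates. As a sketch, a small helper can build the top,left,bottom,right string (or an array of them) from named values. The name formatArea is hypothetical and not part of fresh-tabula-js, and the check that bottom exceeds top and right exceeds left is an assumption about what a sensible rectangle looks like, not a documented rule.

```javascript
// Hypothetical helper (not part of fresh-tabula-js): turns one rectangle, or
// an array of rectangles, into the "top,left,bottom,right" form used by `area`.
function formatArea(rects) {
  const toAreaString = ({ top, left, bottom, right }) => {
    const values = [top, left, bottom, right];
    if (!values.every(Number.isFinite)) {
      throw new TypeError('area coordinates must be finite numbers');
    }
    // Assumption: a usable rectangle has bottom > top and right > left.
    if (bottom <= top || right <= left) {
      throw new RangeError('area must have bottom > top and right > left');
    }
    return values.join(',');
  };
  return Array.isArray(rects) ? rects.map(toAreaString) : toAreaString(rects);
}

console.log(formatArea({ top: 269.875, left: 150, bottom: 690, right: 545 }));
// prints 269.875,150,690,545
```

The returned value could be passed as the area key of the options object, for example { area: formatArea({ top: 269.875, left: 150, bottom: 690, right: 545 }) }.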
Use stream() to process data from PDFs via streams.
Example:
const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
.stream()
.pipe(process.stdout);
The underlying library is built on streams using Highland.js.
This means the returned stream supports Highland.js-style transformations and operations.
Example:
const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
.stream();
stream.split()
.doto(console.log)
.done(() => console.log('All done!'));
Development is done in the develop branch.
When master changes (e.g. via pull request), Travis CI will build and deploy a new version of the package, using semantic versioning based on commit messages to determine the version type.
Commit messages must be formatted according to the conventional commits Angular spec:
<type>[optional scope]: <description>
[optional body]
[optional footer]
The following types are supported:
- build: Changes that affect the build system or npm dependencies
- ci: Changes to CI config (e.g. Travis CI config changes)
- docs: Documentation-only changes
- feat: New features
- fix: Bug fix
- perf: Code change related to performance
- refactor: A code change that neither fixes a bug nor adds a feature
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
Rules configuration is found in release.config.js.
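The header format and the supported types listed above can be expressed as a quick local check. This is a sketch only: isValidCommitHeader is a hypothetical helper, not part of this repository's tooling, and it does not read release.config.js. It checks only the first line of a message, and it assumes the optional scope is written in parentheses as in the conventional commits format.

```javascript
// Hypothetical helper, not part of this repository's tooling.
const COMMIT_TYPES = ['build', 'ci', 'docs', 'feat', 'fix', 'perf', 'refactor', 'style', 'test'];

// Matches "<type>[optional scope]: <description>" on the first line,
// where the scope, if present, is wrapped in parentheses.
const HEADER_PATTERN = new RegExp(`^(${COMMIT_TYPES.join('|')})(\\([^()\\s]+\\))?: \\S.*$`);

function isValidCommitHeader(message) {
  // Only the header line is checked; body and footer are ignored.
  const [header] = message.split('\n');
  return HEADER_PATTERN.test(header);
}

console.log(isValidCommitHeader('feat(stream): add section callbacks')); // true
console.log(isValidCommitHeader('update readme'));                       // false
```

Breaking-change markers and footer conventions are deliberately left out of this sketch.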
- Clone the repository.
- Switch to the develop branch: git checkout develop
- Install dependencies:
$ npm install
To run tests:
$ npm run test
To run tests in watch mode:
$ npm run test:watch
To run test coverage:
$ npm run test:cov
To run deployment builds:
$ npm run build
- Push the changes to develop.
- Merge to master via pull request.
Travis CI will build and deploy the new version of the package (based on semantic commits) to NPM.
- Ezo Saleh, original author of this package
- The tabula-java team
- tabula