Skip to content

cdtinney/fresh-tabula-js

 
 

Repository files navigation

fresh-tabula-js

Convert tables inside PDFs to CSV via tabula-java using JavaScript.

Build Status Coverage Status semantic-release

THIS IS NO LONGER MAINTAINED, BUT I WILL ACCEPT PRs.

This is a fork of the tabula-js package, with changes such as:

  • Non-stream asynchronous extraction (use async/await)

Please submit any issues (or e-mail me).

Contents

Getting Started

Only Node.js environments are supported due to file-system usage requirements. The package is exported as a CommonJS module.

Requirements

  • Java Development Kit (JDK) with java available via command-line
  • Node.js/npm

Installing

To install as a dependency via npm:

$ npm install --save fresh-tabula-js

Usage

Import the module:

// 1. Import the module
const Tabula = require('fresh-tabula-js');
const extractData = async () => {
  // 2. Instantiate a table via passing a path to a PDF (this can be relative or absolute)
  const table = new Tabula('data/foobar.pdf');
  // 3. Call an extraction method
  return await table.getData();
};
// 4. Call the method!
const data = extractData();

API

First, an instance of Tabula must be instantiated via calling tabula with a path (relative or absolute) to a valid PDF.

Example:

const Tabula = require('fresh-tabula-js');
const table = new Tabula('path/to/pdf/foobar.pdf');
// Do stuff

Options

All extraction methods support the same set of options.

Options are passed through to tabula-java with some exceptions, such as the inability to write the output to file (-o). Extracted data is available through callbacks, streams, and return values.

Options are structured as a plain object.

Key Type Default Description
area String or Array Entire page Co-ordinates of the portion(s) of the page to analyze, formatted in strings in the following format top,left,bottom,right. For example, 269.875,12.75,790.5,561 or ["269.875,12.75,790.5,561", "132.45,23.2,256.3,534"].
columns String none X coordinates of column boundaries. Example "10.1,20.2,30.3"
debug Boolean false Print detected table areas instead of processing them.
guess Boolean true Guess the portion(s) of the page to analyze and process.
silent Boolean false Suppresses all stderr output from the tabula-java JAR only. JavaScript errors will still be logged.
noSpreadsheet Boolean false Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
pages String 1 Comma separated list of ranges, or all. E.g. 1-3,5-7, 3, all.
spreadsheet Boolean false Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet).
password String empty Password used to decrypt/access the document.
useLineReturns Boolean false Use embedded line returns in cells (only in spreadsheet mode).

Methods

Tabula.getData

Use this method to process extracted data from PDF asynchronously using async/await.

It returns an object in the following format:

{
  output: <String>,
  error: <String>,
}

Example:

const Tabula = require('fresh-tabula-js');
const data = async () => {
  const table = new Tabula('dir/foobar.pdf');
  return await table.getData();
};

Tabula.streamSections

Use this method to process extracted data in sections (separate tables).

Callbacks will be executed for each parsed section of the PDF.

Extracted data is a string representing an array of all rows (in CSV format) found, including headers.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf');
table.streamSections((err, data) => console.log(data));

We can use the area option to analyze specific portions of the document.

const Tabula = require('fresh-tabula-js');
const table = new Tabula('dir/foobar.pdf', {
  area: "269.875,150,690,545",
});
table.streamSections((err, data) => console.log(data));

Tabula.stream

This is used to process data from PDFs via streams.

Example:

const Tabula = require('fresh-tabula-js');
new Tabula('dir/foobar.pdf')
  .stream()
  .pipe(process.stdout);

The underlying library is built on streams using Highland.js.

This means the returned stream can perform highland-js-style transformations and operations.

Example:

const Tabula = require('fresh-tabula-js');
const stream = new Tabula('dir/foobar.pdf')
  .stream();
stream.split()
  .doto(console.log)
  .done(() => console.log('All done!'));

Developing

Introduction

Branch Usage

Development is done in the develop branch.

When master changes (e.g. via pull request), Travis CI will build and deploy a new version of the package using semantic versioning based on commit messages to determine the version type.

Commit Message Convention

Commit messages must be formatted according to the conventional commits Angular spec:

<type>[optional scope]: <description>

[optional body]

[optional footer]

The following types are supported:

  • build: Changes that affect the build system or npm dependencies
  • ci: Changes to CI config (e.g. Travis CI config changes)
  • docs: Documentation-only changes
  • feat: New features
  • fix: Bug fix
  • perf: Code change related to performance
  • refactor: A code change that neither fixes a bug nor adds a feature
  • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
  • test: Adding missing tests or correcting existing tests

Rules configuration is found in in release.config.js.

Installing

  1. Clone the repository.

  2. Switch to the develop branch:

    git checkout develop
    
  3. Install dependencies:

    $ npm install
    

Testing

To run tests:

$ npm run test

To run tests in watch mode:

$ npm run test:watch

To run test coverage:

$ npm run test:cov

Building

To run deployment builds:

$ npm run build

Deploying

  1. Push the changes to develop.
  2. Merge to master via pull request.

Travis CI will build and deploy the new version of the package (based on semantic commits) to NPM.

Acknowledgements

About

Extract CSV data from PDF tables using tabula-java.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • JavaScript 100.0%