GitHub - srinivasaraov/pdfparser: Python binding to libpoppler with focus on text extraction

pdfparser

Python binding for libpoppler, focused on text extraction from PDF documents.

Intended as an easy-to-use replacement for pdfminer that provides much better performance (see below for a short comparison) and is Python 3 compatible.

See this article for some comparisons with pdfminer and other approaches.

The binding is written in Cython.

Requires libpoppler >= 0.40. I'd recommend building it from source to get the latest library, but it also works with the libpoppler shipped in common Linux distributions (the dev package is then required to build). See below for installation instructions.
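To check whether a distribution-provided libpoppler meets the version requirement above, a small sketch like the following may help. It assumes that `pkg-config` is installed and that the poppler dev package registers itself under the module name `poppler`. The helper names are hypothetical and not part of pdfparser.

```python
import subprocess


def parse_version(text):
    """Turn a version string such as '0.62.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def poppler_is_recent(version_text, minimum=(0, 40)):
    """Return True if the version meets the minimum stated in this README."""
    return parse_version(version_text) >= minimum


def installed_poppler_version():
    """Ask pkg-config for the poppler version (assumed module name: 'poppler').

    Returns None if pkg-config is missing or does not know the module.
    """
    try:
        result = subprocess.run(
            ["pkg-config", "--modversion", "poppler"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


version = installed_poppler_version()
if version is None:
    print("poppler not found via pkg-config; consider building it from source")
elif poppler_is_recent(version):
    print("poppler", version, "should be recent enough")
else:
    print("poppler", version, "is older than 0.40")
```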

Available under the GPL v3 or any later version (libpoppler is also GPL-licensed).

How to install

Below are some instructions to install this package:

git clone --depth 1 https://github.com/rossumai/pdfparser.git
cd pdfparser
sudo ./install_fonts.sh
sudo ./build_poppler.sh
sudo apt-get install -y coreutils g++ gcc git libcairo2 libcairo2-dev libfontconfig1 libopenjpeg5 libtiff5 libzip4 pkg-config python-dev
# If not in virtualenv, run install_pdfparser.sh with sudo
./install_pdfparser.sh
# Test that it works
python tests/dump_file.py test_docs/test1.pdf
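As a quick check from Python that the build put the module on the import path, something like the sketch below could be used. The module name `pdfparser.poppler` is taken from the example code in this README. The helper `module_available` is hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if the named module can be located.

    For a dotted name, the parent package is imported as part of the lookup.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


if module_available("pdfparser.poppler"):
    print("pdfparser.poppler is importable")
else:
    print("pdfparser.poppler was not found; check the build output")
```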

Speed comparisons

Document	pdfparser	pdfminer	speed-up factor
tiny document (half page)	0.033s	0.121s	3.6 x
small document (5 pages)	0.141s	0.810s	5.7 x
medium document (55 pages)	1.166s	10.524s	9.0 x
large document (436 pages)	10.581s	108.095s	10.2 x
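The speed-up factor is the pdfminer time divided by this package's time. A minimal sketch that recomputes it from the timings in the table is shown below. Recomputed values can differ in the last digit from the table (the tiny document gives roughly 3.67), presumably because the table was derived from unrounded measurements.

```python
# Timings in seconds, copied from the table: (this package, pdfminer).
timings = {
    "tiny document (half page)": (0.033, 0.121),
    "small document (5 pages)": (0.141, 0.810),
    "medium document (55 pages)": (1.166, 10.524),
    "large document (436 pages)": (10.581, 108.095),
}


def speedup(fast, slow):
    """Return how many times faster the 'fast' timing is than the 'slow' one."""
    return slow / fast


for name, (fast, slow) in timings.items():
    print(f"{name}: {speedup(fast, slow):.1f} x")
```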

pdfparser code used in test

import sys

import pdfparser.poppler as pdf

# Open the PDF given as the first command-line argument.
d = pdf.PopplerDocument(sys.argv[1])

print('No of pages', d.no_of_pages)
for p in d:  # pages
    print('Page', p.page_no, 'size =', p.size)
    for f in p:  # text flows
        print(' ' * 1, 'Flow')
        for b in f:  # blocks
            print(' ' * 2, 'Block', 'bbox=', b.bbox.as_tuple())
            for l in b:  # lines
                print(' ' * 3, l.text.encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)' % l.bbox.as_tuple())
                # assert l.char_fonts.comp_ratio < 1.0
                # Per-character bounding box and font details.
                for i in range(len(l.text)):
                    print(l.text[i].encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)' % l.char_bboxes[i].as_tuple(),
                          l.char_fonts[i].name, l.char_fonts[i].size, l.char_fonts[i].color)
                print()
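When only the plain text is needed, the same page / flow / block / line iteration can be folded into a small helper. The sketch below relies only on the iteration order and the `text` attribute used in the code above. The function names are hypothetical and not part of pdfparser.

```python
def page_lines(page):
    """Yield the text of every line on a page, in flow/block/line order."""
    for flow in page:
        for block in flow:
            for line in block:
                yield line.text


def document_text(document):
    """Return a list with one string per page, lines joined by newlines."""
    return ["\n".join(page_lines(page)) for page in document]


def extract(path):
    """Open a PDF with pdfparser and return its text page by page."""
    # Imported lazily so the helpers above work without pdfparser installed.
    import pdfparser.poppler as pdf
    return document_text(pdf.PopplerDocument(path))
```

Calling `extract('test_docs/test1.pdf')` from the repository root should then return one string per page, assuming the package was installed as described above.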
