v0.6.0
See CHANGELOG.md for a full list of additions, changes, and fixes. In some (hopefully) rare cases, this version may introduce breaking changes, which is why we're bumping to `v0.6.0`. Highlights from the changelog include:
- Upgrade `pdfminer.six` from `20200517` to `20211012`; see that library's changelog for details, but a key difference is an improvement in how it assigns `line`, `rect`, and `curve` objects. (Diagonal two-point lines, for instance, are now `line` objects instead of `curve` objects.) (#515)
- Add `.extract_text(layout=True)`, an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
- Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by `pdfminer.six` (#346 + #520)
- `.extract_text(...)` returns `""` instead of `None` when character list is empty. (#482 + cb9900b) [h/t @tungph]
- Add `--precision` argument to CLI (#520)
- Add `snap_x_tolerance` and `snap_y_tolerance` to table extraction settings. (#51 + #475) [h/t @dustindall]
- Add `join_x_tolerance` and `join_y_tolerance` to table extraction settings. (cbb34ce)
- `.extract_words(...)` now includes `doctop` among the attributes it returns for each word. (66fef89)
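
As a rough illustration of how several of these highlights fit together, here is a minimal sketch. The helper names `summarize_page` and `summarize_first_page`, the `TABLE_SETTINGS` dict, and the tolerance values are assumptions made up for this example; only `pdfplumber.open`, `.extract_text(layout=True)`, `.extract_words(...)`, `.extract_table(...)`, and the four setting names come from the library.

```python
# Illustrative values only; tune them for your own documents.
TABLE_SETTINGS = {
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
}


def summarize_page(page):
    """Collect a few v0.6.0-era outputs from a pdfplumber page (hypothetical helper)."""
    # Experimental: tries to mimic the page's visual layout in the returned text.
    # Returns "" rather than None when the page has no characters.
    text = page.extract_text(layout=True)

    # Each word dict now carries a "doctop" attribute alongside the others.
    doctops = [word["doctop"] for word in page.extract_words()]

    # Per-axis snap/join tolerances are passed through the table settings.
    table = page.extract_table(table_settings=TABLE_SETTINGS)

    return {"text": text, "doctops": doctops, "table": table}


def summarize_first_page(path):
    """Open a PDF and summarize its first page (hypothetical helper)."""
    # Imported here so summarize_page can be exercised without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return summarize_page(pdf.pages[0])
```

Because parsed attributes are no longer converted to `Decimal`, values such as `doctop` come back at whatever precision `pdfminer.six` reports, so code that compared them against `Decimal` constants may need adjusting.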
And many thanks to @samkit-jain for his feedback and review of contributions to this release. 🎉