Releases: jsvine/pdfplumber

v0.6.0

21 Dec 14:06

See CHANGELOG.md for a full list of additions, changes, and fixes. In some (hopefully) rare cases, this version may introduce breaking changes, which is why we're bumping to v0.6.0. Highlights from the changelog include:

  • Upgrade pdfminer.six from 20200517 to 20211012; see that library's changelog for details, but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.) (#515)
  • Add .extract_text(layout=True), an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
  • Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by pdfminer.six (#346 + #520)
  • .extract_text(...) returns "" instead of None when character list is empty. (#482 + cb9900b) [h/t @tungph]
  • Add --precision argument to CLI (#520)
  • Add snap_x_tolerance and snap_y_tolerance to table extraction settings. (#51 + #475) [h/t @dustindall]
  • Add join_x_tolerance and join_y_tolerance to table extraction settings. (cbb34ce)
  • .extract_words(...) now includes doctop among the attributes it returns for each word. (66fef89)
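Callers upgrading across this release may want small guards around the changes to `.extract_text(...)` and to attribute types. The following is a minimal sketch, not part of pdfplumber: `page_text` and `as_float` are hypothetical helper names, `page` is assumed to be any object exposing `.extract_text(...)`, and the numeric tolerance values are placeholders rather than recommended settings.

```python
from decimal import Decimal


def page_text(page, **kwargs):
    """Return page text as a string, whichever version produced it.

    Hypothetical helper: `page` is assumed to expose .extract_text(...).
    Keyword arguments (for example layout=True on v0.6.0) are passed through.
    """
    text = page.extract_text(**kwargs)
    # Before v0.6.0 an empty character list yielded None; v0.6.0 yields "".
    return text or ""


def as_float(value):
    """Normalize a coordinate that may be a Decimal (pre-0.6.0) or a float."""
    return float(value)


# Table settings named in this release; the values are illustrative only.
table_settings = {
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
}
```

With pdfplumber installed, `page_text(page, layout=True)` would request the experimental layout mode, and `table_settings` would be passed to `page.extract_table(table_settings)`.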

And many thanks to @samkit-jain for his feedback and review of contributions to this release. 🎉

v0.5.28

08 May 21:50

From CHANGELOG.md:

Added

  • Add --laparams flag to CLI. (#407)

Changed

  • Change .convert_csv(...) to order objects first by page number, rather than object type. (#407)
  • Change .convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)
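The ordering change can be pictured with a small sort. This is an illustrative sketch only, not pdfplumber's implementation: `order_for_export` is a hypothetical name, the records are made-up data, and using the object type as a secondary sort key is an assumption (the note above only says page number comes first).

```python
def order_for_export(objects):
    """Sort object records by page number first, then by object type.

    Illustrative only: the secondary sort on "object_type" is an assumption.
    """
    return sorted(objects, key=lambda o: (o["page_number"], o["object_type"]))


# Made-up records standing in for parsed PDF objects.
records = [
    {"object_type": "char", "page_number": 2},
    {"object_type": "rect", "page_number": 1},
    {"object_type": "char", "page_number": 1},
]
ordered = order_for_export(records)
```

Sorting on a `(page_number, object_type)` tuple keeps every object from page 1 ahead of any object from page 2, whatever its type.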

Fixed

  • Fix .extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
  • Fix page-parsing so that LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
  • Fix Page.extract_table(...) so that it honors text tolerance settings (#415) [h/t @trifling]
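To see why generator input needs special handling, consider a function that checks whether its input is empty before using it. The sketch below is hypothetical (`join_chars` is not a pdfplumber function, and real text extraction does far more than concatenate characters); it only shows the general pattern of materializing an iterable once before inspecting it.

```python
def join_chars(chars):
    """Concatenate the "text" of each char record from any iterable.

    Hypothetical helper, shown only to illustrate handling generator input.
    """
    # A generator supports neither len() nor a second pass, so copy it first.
    chars = list(chars)
    if len(chars) == 0:
        return ""
    return "".join(c["text"] for c in chars)


# Works for a list and for a generator expression alike.
from_list = join_chars([{"text": "a"}, {"text": "b"}])
from_generator = join_chars(c for c in [{"text": "a"}, {"text": "b"}])
```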

v0.5.27

28 Feb 19:37

From CHANGELOG.md:

Fixed

  • Fix regression (introduced in 0.5.26/b1849f4) in closing files opened by PDF.open
  • Reinstate access to higher-level layout objects (such as textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)

Development Changes

  • Add a python setup.py build sdist test to main GitHub action. (#365)

v0.5.26

11 Feb 02:54

See CHANGELOG.md for details.

v0.5.25

09 Dec 14:22

See CHANGELOG.md for details.

v0.5.24

20 Oct 13:50

See CHANGELOG.md for details.

v0.5.23

15 Aug 17:08

See changelog for details.

v0.5.22

25 Jul 11:59

[0.5.22] — 2020-07-18

Changed

Added

  • Add support for non_stroking_color attribute on char objects (0254da3) [h/t @idan-david]

v0.5.15

06 Jan 02:38

Primarily: Upgrades pinned requirements for pdfminer.six and pillow.

v0.6.0-alpha

06 Feb 14:39
Pre-release

This release is a preview/alpha for pdfplumber v0.6.0. Among the more notable changes:

  • Revamps the table-extraction methods, to simplify them and make them more flexible.
  • Adds font size and font name to results of Page/utils.extract_words(...), based on @jsfenfen's suggestions in #28. (Thanks!)

Goals before v0.6.0-beta:

  • Add Page.find_text_gutters feature, bringing back that table-finding strategy from earlier versions of pdfplumber.
  • Attempt to fix/address as many extant GitHub issues as possible.
  • Update the example notebooks, so that they work.

Goals before v0.6.0 full release:

  • Reach full test coverage.
  • Add more robust documentation.
  • Add more/better docstrings.