Prevent iteration on pages with no tables (#22)
* get fields 5 thru 14

* consider edge cases for section 14

* adjust table settings to visually join separated tables

* write csvs for all fields

* refactor writing the headers for similar dicts

* prevent iteration on pages with no tables

* use list comprehension

---------

Co-authored-by: Xavier Medrano <[email protected]>
xmedr and Xavier Medrano authored Mar 1, 2024
1 parent 79ba9de commit b484a22
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion scrapers/financial_disclosure/parse_pdf.py
@@ -85,7 +85,8 @@ def parse_pdf(pdf: pdfplumber.PDF) -> dict[str, dict[str, str | None]]:
table_settings = {
"intersection_tolerance": 6, # minimum allowable tolerance to grab all tables
}
-rows = [tuple(row) for page in pdf.pages for row in page.extract_table(table_settings=table_settings)] # type: ignore[union-attr]
+
+rows = [tuple(row) for page in pdf.pages if (table := page.extract_table(table_settings=table_settings)) for row in table] # type: ignore[union-attr]

grouped_rows = SubstringDict(_group_rows(rows))
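
The change above relies on `page.extract_table(...)` returning `None` for a page where no table is found, which the old comprehension then tried to iterate. A minimal sketch of the guard follows, assuming a hypothetical `FakePage` stand-in so it runs without pdfplumber or a PDF file. `FakePage`, `collect_rows`, and `collect_rows_unguarded` are illustrative names, not part of the repository.

```python
from typing import Optional

Row = list[Optional[str]]


class FakePage:
    """Hypothetical stand-in for a PDF page; not a pdfplumber class."""

    def __init__(self, table: Optional[list[Row]]) -> None:
        self._table = table

    def extract_table(self, table_settings: Optional[dict] = None) -> Optional[list[Row]]:
        # Mirrors the behaviour the commit guards against: None when no table exists.
        return self._table


def collect_rows_unguarded(pages, table_settings):
    # Shape of the removed line: iterates the result directly, so None raises TypeError.
    return [tuple(row) for page in pages for row in page.extract_table(table_settings=table_settings)]


def collect_rows(pages, table_settings):
    # Shape of the added line: the walrus binds the result once and the
    # `if` clause skips pages whose result is None (or an empty table).
    return [
        tuple(row)
        for page in pages
        if (table := page.extract_table(table_settings=table_settings))
        for row in table
    ]


pages = [
    FakePage([["a", "b"], ["c", None]]),
    FakePage(None),  # a page with no tables
    FakePage([["d", "e"]]),
]
settings = {"intersection_tolerance": 6}
rows = collect_rows(pages, settings)
```

Binding with `:=` inside the filter keeps the single-comprehension form while calling `extract_table` only once per page; an explicit `for` loop with an `if table is None: continue` check would behave the same for pages without tables.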
