#need to figure out a way to convert pdf to txt before grep #3

Open
vr00n opened this issue Oct 30, 2016 · 1 comment

Comments

vr00n commented Oct 30, 2016

Try pdfgrep. With it I was able to extract the schema PDF as fairly structured text.

From there you can use grep's context options "-A" and "-B" to include n lines after or before each pattern match.

Here are my results from a simple pdfgrep command:

pdfgrep " " schema_alphabetic.pdf | uniq | more
State of California
Civil Service Pay Scale - Alpha by Class Title
  Schem Class
          Code   Full Class Title
                           Compensation              SISA Footnotes         AR Crit  MCR Prob. Mo. WWG NT   CBID
  CU70     1733  ACCOUNT CLERK II
                      $2,471.00 - $3,097.00           SISA                             1        6   2       R 04
  ME10     4915  ACCOUNT MANAGER, CALIFORNIA EXPOSITION AND STATE FAIR
                      $5,553.00 - $6,901.00                01 43                       1       12   E       S 01
  JL32     4177  ACCOUNTANT I (SPECIALIST)
                 A    $3,000.00 - $3,757.00                                285         1        6   2       R 01
                 L    $3,000.00 - $3,757.00                                285         1        6   2       R 01
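
As a rough sketch of the grep context idea, the snippet below saves a few lines of the output above to a scratch file and pulls out one class title together with the compensation row that follows it. The file name schema_sample.txt is hypothetical and exists only for this illustration; in practice the input would be the saved pdfgrep output, and the right number of context lines depends on how many pay ranges a class has.

```shell
# Sample lines copied from the pdfgrep output above, saved to a scratch file.
# schema_sample.txt is a hypothetical name used only for this sketch.
cat > schema_sample.txt <<'EOF'
  CU70     1733  ACCOUNT CLERK II
                      $2,471.00 - $3,097.00           SISA                             1        6   2       R 04
  ME10     4915  ACCOUNT MANAGER, CALIFORNIA EXPOSITION AND STATE FAIR
                      $5,553.00 - $6,901.00                01 43                       1       12   E       S 01
EOF

# -A 1 prints each matching line plus the one line after it, so the
# class title and its compensation row are printed together.
result=$(grep -A 1 "ACCOUNT CLERK II" schema_sample.txt)
printf '%s\n' "$result"
```

Classes with several pay ranges, such as ACCOUNTANT I (SPECIALIST) above, span more than one following line, so a larger -A value would be needed there.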

@josephlei
Owner

Thank you for the recommendation. I will definitely check this out!

From our discussions, this data will soon be provided openly by the source system in a machine-readable format, possibly with an API. I wasn't aware of this package, and I think it will be useful in other applications in the future as well. Thanks again.
