Improving text extraction of PDFs #21
Replies: 8 comments 18 replies
-
@scambier I forked the repo and tried to run it locally on my M1, but I am unable to build the lib package.
-
Thanks for contributing :) It looks like
-
I somehow managed to run the project. Now I'm stuck on converting the Python code to JavaScript. Can you help me with that?
-
@scambier I am sharing some sample code that extracts text from a PDF using pdfminer.six from Node.js. Can you review it and tell me whether it will work for our use case? I'm not very good with JavaScript and Node.js.
Some of the packages use hard-coded file paths, and I'm unable to import some of them because of a resolution problem. But if the solution is usable, I will work on fixing it.
-
@scambier any comments on the above solution?
-
There's another plugin that does it with
-
I quickly tried it again today with this Omnisearch PR; it still crashes the same way.
-
Unfortunately, most of the textbooks and documents I was hoping to have extracted are failing. I even tried pre-OCRing all of my PDFs with ocrmypdf. I then cleared my cache, hoping for better results with the cleaned-up, pre-OCR'd files, but they still weren't being extracted :( Here is one file of many that were not working:
If this plugin worked on all of my PDFs, Obsidian would be close to perfect for me.
-
As of today, Text Extractor works reasonably well with images (at least with Latin characters), but PDFs leave a lot to be desired.
The library used for PDFs is https://github.com/jrmuizel/pdf-extract; it's easily compiled to wasm, and I wrote a small PR for it. Unfortunately, it's quite unreliable: it fails on many files (#7), obviously can't read images (#20), sometimes crashes, and even when it works, the extracted text often has glitchy whitespace (ref).
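To illustrate the whitespace problem, here is a minimal sketch of a post-processing step that could be applied to extracted text. It is not part of Text Extractor or pdf-extract; `normalizeExtractedText` is a hypothetical helper name. It only removes redundant whitespace and cannot repair spaces inserted inside words or missing between them.

```typescript
// Hypothetical helper (an assumption, not an existing Text Extractor API).
// Cleans up redundant whitespace in text returned by a PDF extractor.
function normalizeExtractedText(raw: string): string {
  return raw
    // Unify Windows and old Mac line endings to "\n"
    .replace(/\r\n?/g, "\n")
    // Turn tabs and non-breaking spaces into regular spaces
    .replace(/[\t\u00a0]+/g, " ")
    // Collapse runs of spaces into a single space
    .replace(/ {2,}/g, " ")
    // Drop spaces that surround a line break
    .replace(/ *\n */g, "\n")
    // Keep at most one blank line between paragraphs
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Running this on the extractor output before indexing would at least keep search snippets readable, though it does not address the crashes or the files that fail outright.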
So what are possible improvements?
The greatest constraint with this plugin is that I want it to be self-contained, meaning the user must not have to install another third-party tool on top of it. So the work must be done in JavaScript or compiled to wasm. Or we find a simple way to bundle binaries that work on all desktop OSes.
Unfortunately, I can't dedicate a lot of time to Text Extractor, but I'll gladly onboard and mentor new contributors.
Addendum: if another plugin does the job better (with or without third-party tools), I'm also totally open to integrating it within Omnisearch.
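As a rough sketch of the "bundle binaries" option, the plugin could pick a bundled executable from the current platform and CPU architecture. Everything here is an assumption for illustration: the binary file names are placeholders, and `pickBundledBinary` is a hypothetical function, not existing plugin code.

```typescript
// Hypothetical lookup table: the file names are placeholders, not real artifacts.
// Keys follow Node's process.platform and process.arch values.
const BUNDLED_BINARIES = new Map<string, string>([
  ["darwin-arm64", "extractor-macos-arm64"],
  ["darwin-x64", "extractor-macos-x64"],
  ["linux-x64", "extractor-linux-x64"],
  ["win32-x64", "extractor-windows-x64.exe"],
]);

// Returns the bundled binary name for a platform/arch pair,
// or null when no binary is shipped for that combination.
function pickBundledBinary(platform: string, arch: string): string | null {
  const name = BUNDLED_BINARIES.get(`${platform}-${arch}`);
  return name === undefined ? null : name;
}

// In a desktop (Node/Electron) context this would be called as:
//   pickBundledBinary(process.platform, process.arch)
```

A `null` result would be the signal to fall back to the wasm path, so unsupported platforms still get some extraction instead of an error.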