[Feature request] Run OCR on images in PDFs to extract text #20
Comments
Another alternative could be converting the PDF into an image and running OCR on that.
This is the ideal solution, as it would greatly improve PDF text extraction. The problem is that it looks really hard to do in a pure JS/WASM context without external dependencies. The only robust solution I've found is pdf.js, but it scales awfully and eats all the RAM after a few files. It's probably worth trying again, though.
I played around with PDF.js and it works really well, in my opinion. It might not scale like a dream, but operations like this may not be feasible to scale anyway. To conserve memory, prioritizing small files in the vault first and only processing files one at a time above a certain size should be satisfactory (I imagine).
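The scheduling idea in the comment above could be sketched roughly as follows. This is only an illustration, not the plugin's code: the `PdfFile` shape, the 5 MB threshold, the concurrency of 3, and the `extract` callback are all assumptions made for the example (the comment only says "a certain size").

```typescript
// Hypothetical shape: only the path and the size in bytes matter for scheduling.
interface PdfFile {
  path: string;
  sizeBytes: number;
}

// Assumed values; the thread does not specify any numbers.
const LARGE_FILE_BYTES = 5 * 1024 * 1024;
const SMALL_FILE_CONCURRENCY = 3;

// Sort smallest first, then split into batches: small files are grouped
// up to `concurrency` per batch, large files each get a batch of their own.
function planBatches(
  files: PdfFile[],
  largeThreshold: number = LARGE_FILE_BYTES,
  concurrency: number = SMALL_FILE_CONCURRENCY,
): PdfFile[][] {
  const limit = Math.max(1, concurrency);
  const sorted = [...files].sort((a, b) => a.sizeBytes - b.sizeBytes);
  const batches: PdfFile[][] = [];
  let current: PdfFile[] = [];
  for (const file of sorted) {
    if (file.sizeBytes >= largeThreshold) {
      // Flush any pending small files before handling a large one alone.
      if (current.length > 0) {
        batches.push(current);
        current = [];
      }
      batches.push([file]);
    } else {
      current.push(file);
      if (current.length >= limit) {
        batches.push(current);
        current = [];
      }
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Run batches one after another; files inside a batch run concurrently.
// `extract` stands in for whatever does the actual PDF work.
async function processAll<T>(
  files: PdfFile[],
  extract: (file: PdfFile) => Promise<T>,
): Promise<T[]> {
  const results: T[] = [];
  for (const batch of planBatches(files)) {
    results.push(...(await Promise.all(batch.map(extract))));
  }
  return results;
}
```

Because batches are awaited in sequence, at most one large file is in flight at any time, which is the memory-saving part of the suggestion. Whether this is enough to keep PDF.js memory use in check would still need to be measured.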
Integrating
Is your feature request related to a problem? Please describe.
It would be nice to have the ability to extract text from images embedded in PDFs.
Describe the solution you'd like
Ability to extract text from images in PDFs, for example when the PDF is a slide deck made of images. This could be configured with a toggle switch or a list so that it isn't run by default, since doing both text extraction and OCR will likely be computationally expensive.
Describe alternatives you've considered
https://evermap.com/Tutorial_ABM_OCR.asp describes a way to OCR documents with Adobe Acrobat. I believe you can also do this with tools like Readiris, which can OCR in multiple languages.
Additional context
Some PDFs may contain diagrams or other images with text in them that can be useful to extract. We already have OCR support for images, so one option is to extract the images from the PDF, run OCR on them, and then combine the output with the existing text extraction results.
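As a rough sketch of the "combine" step only, assuming image extraction and OCR have already produced strings per page: the `PageExtraction` and `OcrSettings` types and the `ocrPdfImages` flag below are hypothetical names made up for this example, not part of the plugin or of any library.

```typescript
// Hypothetical per-page result of the two extraction passes.
interface PageExtraction {
  pageNumber: number;
  textLayer: string; // text from the existing PDF text extraction
  imageOcrTexts: string[]; // OCR output for each image found on the page
}

// Hypothetical setting modelling the opt-in toggle from the request.
interface OcrSettings {
  ocrPdfImages: boolean;
}

// Off by default, since OCR is expected to be expensive.
const DEFAULT_SETTINGS: OcrSettings = { ocrPdfImages: false };

// Text layer first, then OCR text from images, skipping empty pieces.
function combinePageText(
  page: PageExtraction,
  settings: OcrSettings = DEFAULT_SETTINGS,
): string {
  const parts: string[] = [page.textLayer.trim()];
  if (settings.ocrPdfImages) {
    for (const ocrText of page.imageOcrTexts) {
      parts.push(ocrText.trim());
    }
  }
  return parts.filter((part) => part.length > 0).join("\n");
}

// Join pages in page order, separated by a blank line.
function combineDocumentText(
  pages: PageExtraction[],
  settings: OcrSettings = DEFAULT_SETTINGS,
): string {
  return [...pages]
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map((page) => combinePageText(page, settings))
    .filter((text) => text.length > 0)
    .join("\n\n");
}
```

With the toggle off, the result is just the existing text layer, so current behaviour would be unchanged unless the user opts in.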