[Feature request] Run OCR on images in PDFs to extract text #20

gwillcox-r7 · 2023-03-15T21:01:08Z

Is your feature request related to a problem? Please describe.
Would be nice to have the ability to extract text from images embedded in PDFs.

Describe the solution you'd like
Ability to extract text from images in PDFs, such as when the PDF is a slide deck of images. This might be something we could configure with a toggle switch or a list so that it isn't run by default, since it will likely be computationally expensive to do both text extraction and OCR.
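
A rough sketch of what such an opt-in setting could look like. Every name here (`ExtractionSettings`, `ocrPdfImages`, `ocrPdfFolders`, `shouldOcrPdf`) is hypothetical, and reading "a list" as an allow-list of folder prefixes is an assumption, not something decided in this issue.

```typescript
// Hypothetical settings shape; none of these names come from the plugin.
interface ExtractionSettings {
  /** Off by default because OCR is expensive. */
  ocrPdfImages: boolean;
  /** Optional allow-list of path prefixes; empty means "all PDFs". */
  ocrPdfFolders: string[];
}

const DEFAULT_SETTINGS: ExtractionSettings = {
  ocrPdfImages: false,
  ocrPdfFolders: [],
};

// Decide whether OCR should run for a given file path.
function shouldOcrPdf(path: string, settings: ExtractionSettings): boolean {
  if (!settings.ocrPdfImages) return false;
  if (!path.toLowerCase().endsWith(".pdf")) return false;
  if (settings.ocrPdfFolders.length === 0) return true;
  return settings.ocrPdfFolders.some((prefix) => path.startsWith(prefix));
}
```

With the defaults, nothing is OCR'd until the user flips the toggle, so regular text extraction keeps its current cost.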

Describe alternatives you've considered
https://evermap.com/Tutorial_ABM_OCR.asp describes a way to make OCR documents with Adobe Acrobat. I believe you can also do this with tools like Readiris that OCR in multiple languages.

Additional context
Some PDFs may contain diagrams or other images with text in them that can be useful to extract. We already have OCR support for images so it may be an idea to extract the images from the PDF and run OCR on them, then combine this with the existing text extraction results.
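
To illustrate the "combine" step, here is a minimal sketch that merges per-page embedded text with OCR output from that page's images. The `PageExtraction` shape and both function names are assumptions made for this example; the actual extraction results may be structured differently.

```typescript
// Hypothetical per-page result: the existing text extraction plus
// the OCR output of each image found on that page.
interface PageExtraction {
  pageNumber: number;
  embeddedText: string;
  imageTexts: string[];
}

// Merge one page: embedded text first, then OCR text, skipping
// empty strings and exact duplicates.
function combinePageText(page: PageExtraction): string {
  const parts = [page.embeddedText, ...page.imageTexts]
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const part of parts) {
    if (!seen.has(part)) {
      seen.add(part);
      unique.push(part);
    }
  }
  return unique.join("\n");
}

// Merge the whole document in page order, one blank line between pages.
function combineDocumentText(pages: PageExtraction[]): string {
  return [...pages]
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map(combinePageText)
    .filter((s) => s.length > 0)
    .join("\n\n");
}
```

The duplicate check only catches exact matches; OCR text that nearly matches the embedded text would still appear twice.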

khesed · 2023-08-02T03:54:37Z

Another alternative could be converting the PDF into an image and running OCR on that.
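
Roughly, that could look like the sketch below. `RenderPage` and `Recognize` are hypothetical adapter types standing in for whatever PDF renderer and OCR engine end up being used; they are not the API of any real library.

```typescript
// Hypothetical adapters: a real implementation would wrap a PDF
// renderer and an OCR engine behind these two signatures.
type RenderPage = (pageNumber: number) => Promise<Uint8Array>;
type Recognize = (image: Uint8Array) => Promise<string>;

// Render and OCR one page at a time, so only a single page image
// needs to be alive at any moment.
async function ocrPdfByPage(
  pageCount: number,
  renderPage: RenderPage,
  recognize: Recognize,
): Promise<string[]> {
  const texts: string[] = [];
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    const image = await renderPage(pageNumber);
    const text = await recognize(image);
    texts.push(text.trim());
  }
  return texts;
}
```

Processing pages sequentially is slower than rendering them all up front, but it keeps peak memory tied to one page rather than the whole document.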

scambier · 2023-08-02T08:59:34Z

This is the ideal solution, as it would greatly improve PDF text extraction. The problem is that it looks really hard to do in a pure JS/WASM context without external dependencies. The only robust solution I've found is pdf.js, but it scales awfully and eats all the RAM after a few files. It's probably worth trying again, though.

khesed · 2023-08-05T02:37:53Z

> This is the ideal solution, as it would greatly improve PDF text extraction. The problem is that it looks really hard to do in a pure JS/WASM context without external dependencies. The only robust solution I've found is pdf.js, but it scales awfully and eats all the RAM after a few files. It's probably worth trying again, though.

I played around with PDF.js and it works really well, in my opinion. It might not scale like a dream, but I think it might not be feasible for operations like this to scale anyway. To save memory, prioritizing small files in the vault first and processing files one at a time above a certain size should be satisfactory (I imagine).
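
A minimal sketch of that ordering, assuming file sizes are known up front. `PdfFile`, `planBatches`, the size threshold, and the batch size are all hypothetical and would need tuning against real vaults.

```typescript
// Hypothetical description of a PDF waiting to be processed.
interface PdfFile {
  path: string;
  sizeBytes: number;
}

// Order files smallest first. Files below the threshold are grouped
// into batches that may run together; files at or above it each get
// a batch of their own, so they are processed one at a time.
function planBatches(
  files: PdfFile[],
  thresholdBytes: number,
  smallBatchSize: number,
): PdfFile[][] {
  const batchSize = Math.max(1, smallBatchSize);
  const sorted = [...files].sort((a, b) => a.sizeBytes - b.sizeBytes);
  const batches: PdfFile[][] = [];
  let current: PdfFile[] = [];
  for (const file of sorted) {
    if (file.sizeBytes >= thresholdBytes) {
      if (current.length > 0) {
        batches.push(current);
        current = [];
      }
      batches.push([file]);
    } else {
      current.push(file);
      if (current.length === batchSize) {
        batches.push(current);
        current = [];
      }
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch would be awaited before the next one starts, which caps how many documents are open at once.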

khesed · 2023-08-06T19:16:42Z

Integrating ImageMagick could solve this:
#21 (comment)

gwillcox-r7 changed the title ~~[Feature request] Extract data from Images in PDFs~~ [Feature request] Run OCR on images in PDFs to extract text Mar 15, 2023
