Improving text extraction of PDFs #21

scambier · 2023-03-18T09:09:56Z

scambier
Mar 18, 2023
Maintainer

As of today, Text Extractor works reasonably well with images (at least with Latin characters), but PDFs leave a lot to be desired.

The library used for PDFs is https://github.com/jrmuizel/pdf-extract; it's easily compiled to wasm, and I wrote a small PR for it. Unfortunately it's quite unreliable. It fails to work with many files (#7), obviously can't read images (#20), sometimes crashes, and when it works, the extracted text often has glitchy whitespace (ref).

So what are possible improvements?

Find and integrate another library to do the job
OCR the PDFs: convert the pages to images and run the result through Tesseract. This might use far more resources than the plugin already does.
Re-evaluate PDF.js. The results are correct, but it did not scale well at all, and crashed Obsidian after working on a dozen files.
?

The greatest constraint with this plugin is that I want it to be self-contained, as in the user must not have to install another third-party tool on top of it. So the work must be done in JavaScript or compiled to wasm. Or we find a simple way to bundle binaries that work for all desktop OSes.

Unfortunately, I can't dedicate a lot of time to Text Extractor, but I'll gladly onboard and mentor new contributors.

Addendum: if another plugin does the job better (with or without 3rd party tools), I'm also totally open to integrating it within Omnisearch.

ConfuzedCoder · 2023-03-18T18:46:42Z

ConfuzedCoder
Mar 18, 2023

@scambier I forked the repo and tried to run it locally on my M1, but I am unable to build the lib package.

0 replies

scambier · 2023-03-18T19:58:48Z

scambier
Mar 18, 2023
Maintainer Author

Thanks for contributing :) It looks like "bad cpu type in executable (os error 86)", an error that is specific to macOS. Does this help you? https://apple.stackexchange.com/questions/408375/zsh-bad-cpu-type-in-executable

0 replies

ConfuzedCoder · 2023-04-28T17:31:58Z

ConfuzedCoder
Apr 28, 2023

I somehow managed to run the project. Now I'm stuck on converting the python code to javascript. Can you help me with that?

3 replies

scambier Apr 28, 2023
Maintainer Author

I'm not familiar with python, but I can tell you that if you're using some OCR library, that won't work. Your python code depends on python libraries, and more than likely C bindings. You can't just convert the code to javascript, because you won't be able to use your libs.

ConfuzedCoder Apr 28, 2023

It's hardly 3 lines of code and as I mentioned in the forum (https://forum.obsidian.md/t/search-inside-handwritten-pdf-imported-from-goodnote/56468/7?u=dipaktandel, https://forum.obsidian.md/t/search-inside-handwritten-pdf-imported-from-goodnote/56468/4?u=dipaktandel), I am not doing any OCR, just extracting embedded text from PDF, which is not working currently.

The get_pdf_text function needs to be invoked from the javascript, but I don't know how to do it.

from pdfminer.high_level import extract_text

def get_pdf_text(file_path):
    text = extract_text(file_path)
    return text

scambier Apr 29, 2023
Maintainer Author

> The get_pdf_text function needs to be invoked from the javascript, but I don't know how to do it.

Like I said, that's not how it works, you can't just call python code from javascript. Being 3 lines of code is irrelevant, there's a whole library under those 3 lines.

There are a few ways to make it work:

Rewrite the python code in javascript. Not just your code, all the code.
Compile your python program into wasm, and call it from the javascript: https://pythondev.readthedocs.io/wasm.html
Build binaries of your python code for the different platforms (win, linux, osx), bundle them with text extractor, and call those binaries from Electron.

None of these solutions is ideal or easy, though; Python doesn't work well with the web.
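To make the third option a little more concrete, here is a minimal sketch of picking a per-platform binary and handing it a PDF path. Everything named here is hypothetical: the `pdf-text-*` file names, the `bin/` folder, and the `extractWithBinary` helper do not exist in Text Extractor. The function that actually launches the process is injected, so the sketch does not assume a particular runtime; in Electron it could wrap Node's `child_process.execFile`.

```typescript
// Signature of whatever launches the process and returns its stdout.
// Injected so this sketch stays independent of the runtime.
type RunBinary = (binaryPath: string, args: string[]) => Promise<string>;

// Map a Node-style platform/arch pair to a (hypothetical) bundled binary name.
function binaryNameFor(platform: string, arch: string): string {
  switch (platform) {
    case "win32":
      return "pdf-text.exe";
    case "linux":
      return "pdf-text-linux";
    case "darwin":
      // Intel and Apple Silicon Macs need different builds (or a universal one).
      return `pdf-text-macos-${arch}`;
    default:
      throw new Error(`No bundled binary for platform "${platform}"`);
  }
}

// Build the path to the bundled binary and run it on one PDF.
async function extractWithBinary(
  pluginDir: string,
  pdfPath: string,
  platform: string,
  arch: string,
  run: RunBinary,
): Promise<string> {
  const binaryPath = `${pluginDir}/bin/${binaryNameFor(platform, arch)}`;
  return run(binaryPath, [pdfPath]);
}
```

The cost of this route is visible even in the sketch: every supported platform and CPU architecture needs its own build, and each one has to be shipped with the plugin.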

ConfuzedCoder · 2023-05-20T12:20:26Z

ConfuzedCoder
May 20, 2023

@scambier I am sharing some sample code that is able to extract text from a PDF using pdfminer.six in Node.js. Can you review whether it will work for our use case? I'm not good at understanding JavaScript and Node.js.

const { loadPyodide } = require("pyodide");

async function hello_python() {
  const pyodide = await loadPyodide();
  await pyodide.loadPackage("cryptography")
  await pyodide.loadPackage("https://files.pythonhosted.org/packages/18/36/7ae10a3dd7f9117b61180671f8d1e4802080cca88ad40aaabd3dad8bab0e/charset_normalizer-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl")
  await pyodide.loadPackage("https://files.pythonhosted.org/packages/46/68/b3fb5f073bcd3df9143a3520289c147351bfa3c1b096d44081f38fd1c247/pdfminer.six-20221105-py3-none-any.whl")
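  // Mount the current working directory into Pyodide's virtual filesystem
  // so that the Python code can read local files under /mnt.
  const mountDir = "/mnt";
  pyodide.FS.mkdir(mountDir);
  pyodide.FS.mount(pyodide.FS.filesystems.NODEFS, { root: "." }, mountDir);
  // The value of the last Python expression (text) is returned to JavaScript.
  return pyodide.runPythonAsync(`
    from pdfminer.high_level import extract_text
    text = extract_text("/mnt/resume.pdf")
    text
  `);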
}

hello_python().then((result) => {
  console.log("Embedded text is ", result);
});

Some of the packages use hard-coded file paths, and I'm unable to import some of them because of a resolution problem. But if the solution is usable, I will work on fixing it.
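If mounting the Node filesystem turns out not to be usable in the target environment, one possible variation is to copy the PDF bytes into Pyodide's in-memory filesystem with `FS.writeFile` and read them from there. The sketch below is untested against the plugin. It assumes that pdfminer.six is already loaded, that the caller has the file contents as a `Uint8Array`, and that `/tmp` exists in the virtual filesystem. `PyodideLike` and `extractEmbeddedText` are hypothetical names; the interface only lists the two Pyodide members used here.

```typescript
// Minimal subset of the Pyodide API used below, declared locally so the
// sketch type-checks without the pyodide package installed.
interface PyodideLike {
  FS: { writeFile(path: string, data: Uint8Array): void };
  runPythonAsync(code: string): Promise<unknown>;
}

// Copy the PDF into Pyodide's in-memory filesystem, then let pdfminer read it.
// Assumes pdfminer.six has already been loaded into this Pyodide instance.
async function extractEmbeddedText(
  pyodide: PyodideLike,
  pdfBytes: Uint8Array,
): Promise<string> {
  // Assumption: /tmp exists, as in Emscripten's default filesystem layout.
  const virtualPath = "/tmp/input.pdf";
  pyodide.FS.writeFile(virtualPath, pdfBytes);

  // The value of the last Python expression is returned to JavaScript.
  const result = await pyodide.runPythonAsync(`
from pdfminer.high_level import extract_text
extract_text(${JSON.stringify(virtualPath)})
`);
  return String(result);
}
```

This avoids the `NODEFS` mount entirely, at the price of holding a second copy of each PDF in memory while it is processed.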

0 replies

ConfuzedCoder · 2023-05-23T17:26:27Z

ConfuzedCoder
May 23, 2023

@scambier any comment on the above solution?

4 replies

scambier May 23, 2023
Maintainer Author

Sorry I haven't had the time to look yet

ConfuzedCoder May 23, 2023

No issue, take your time

ConfuzedCoder Jun 5, 2023

@scambier any comment on this?

scambier Jun 5, 2023
Maintainer Author

Just quickly tried to run it, but Pyodide treats Electron as a Node environment. It relies on this to use the Node filesystem API, which is not available in Electron.
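For readers unfamiliar with this kind of check, here is a rough sketch of how a library can end up classifying Electron as Node. It is an assumption that Pyodide's own detection looks like this; the sketch only shows the common pattern of testing `process.versions.node`, which Electron also defines, alongside `process.versions.electron`. `ProcessLike` and `detectRuntime` are hypothetical names.

```typescript
// Shape of the parts of the global `process` object inspected below.
interface ProcessLike {
  versions?: { node?: string; electron?: string };
}

type Runtime = "browser" | "node" | "electron";

// Classify the runtime from a process-like object (or its absence).
function detectRuntime(proc: ProcessLike | undefined): Runtime {
  // A check that stops here would report "node" for Electron as well,
  // because Electron exposes process.versions.node too.
  if (typeof proc?.versions?.node !== "string") {
    return "browser";
  }
  // Electron additionally sets process.versions.electron.
  return typeof proc?.versions?.electron === "string" ? "electron" : "node";
}
```

In a real environment the argument would be the global `process` object, when it exists.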

khesed · 2023-08-05T02:41:59Z

khesed
Aug 5, 2023

There's another plugin that does it with imagemagick and I have found it to work well:
https://github.com/MohrJonas/obsidian-ocr

imagemagick is available purely in js: https://github.com/manuels/unix-toolbox.js-imagemagick

0 replies

scambier · 2023-09-16T09:12:47Z

scambier
Sep 16, 2023
Maintainer Author

> Re-evaluate PDF.js. The results are correct, but it did not scale well at all, and crashed Obsidian after working on a dozen files.

Quickly tried it again today with this Omnisearch PR; it still crashes the same way.

7 replies

figadore Feb 1, 2024

Around omnisearch.ts#L104-L108 I'm trying to use https://github.com/rxaviers/async-pool (@v1.3.0) to limit the number of promises that try to resolve at once

    let documents = await asyncPool(3, paths, (path) => cacheManager.getDocument(path))

but I just don't know enough TypeScript to get this approach to compile; the syntax is harder to pick up than I anticipated coming from JS.

Am I at least on the right track to test this as you described above? If so, maybe I can go learn typescript for a while and come back to this later.
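For reference, the idea of capping how many promises run at once can also be written without a dependency. The following is a minimal typed sketch, not the async-pool implementation; `mapWithLimit` is a hypothetical helper name.

```typescript
// Run `worker` over `items`, with at most `limit` calls in flight at a time.
// Results are returned in the same order as the input items.
async function mapWithLimit<T, R>(
  limit: number,
  items: readonly T[],
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item that has not been claimed yet

  // Each lane repeatedly claims the next unprocessed item until none are left.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

With the names from the snippet above, a call might look like `await mapWithLimit(3, paths, (path) => cacheManager.getDocument(path))`, assuming `getDocument` takes a single path and returns a promise.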

scambier Feb 1, 2024
Maintainer Author

You shouldn't have to modify Omnisearch, the changes should be restricted to Text Extractor. The function you're trying to modify will affect all of Omnisearch indexing, but we only want to alter how PDFs are read, which is Text Extractor's job.

Asking for a PDF text that is not yet in cache (through Omnisearch or through the right click menu from TE) will call this function, which just adds the text extraction job to the pdfProcessQueue. This queue will process x number of jobs in parallel.

What I suggest is

Limit the number of concurrent PDF jobs to 1
Replace the current rust library with a call to PDF.js (the actual extraction is done here)
Add a delay of a few seconds after a PDF is extracted to hopefully let the GC do some cleaning and avoid a crash of Obsidian. This could be done by e.g. adding a setTimeout() before resolving the promise here (edit: I think there is a convenient awaitable wait(ms: number) function instead of the setTimeout)
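The three steps above could be sketched roughly as follows. This is not Text Extractor's code. `PdfPageLike` and `PdfDocumentLike` are hypothetical interfaces that list only the PDF.js members used here (`numPages`, `getPage`, `getTextContent`). `wait` is a local stand-in for the awaitable helper mentioned above, and `extractText` and `processOneByOne` are made-up names. With the real library, each document would come from something like `pdfjsLib.getDocument({ data }).promise`.

```typescript
// Minimal subset of the PDF.js document API used below, declared locally so
// the sketch does not depend on the pdfjs-dist package being installed.
interface PdfPageLike {
  getTextContent(): Promise<{ items: Array<{ str?: string }> }>;
}
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PdfPageLike>;
}

// Awaitable delay (stand-in for the plugin's own helper).
const wait = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Concatenate the text items of every page, one line per page.
async function extractText(doc: PdfDocumentLike): Promise<string> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) { // PDF.js page numbers start at 1
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    // Some items carry no `str` (marked-content markers), hence the fallback.
    pages.push(content.items.map((item) => item.str ?? "").join(" "));
  }
  return pages.join("\n");
}

// Process PDFs strictly one at a time, pausing after each one in the hope
// that the garbage collector gets a chance to run.
async function processOneByOne(
  openers: Array<() => Promise<PdfDocumentLike>>,
  pauseMs: number,
): Promise<string[]> {
  const texts: string[] = [];
  for (const open of openers) {
    texts.push(await extractText(await open()));
    await wait(pauseMs);
  }
  return texts;
}
```

Whether a pause is actually enough to avoid the crash is an open question; the sketch only shows where the delay would sit.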

Text Extractor might seem more complex, but it's actually just 2 packages under the same repo. Execute pnpm run dev or pnpm run build in both projects and it should just run. If you want to build the project as-is, you'll need to install Rust. But since the goal is to replace it with PDF.js, you can scrap it all and remove the wasm-pack calls in package.json.

gavinwright-engr Mar 15, 2024

@figadore if you are still trying to work on this, would you mind also trying to use this pdf as a testfile? I'm pretty out of my depth here as a Mechanical Engineer... lol
Supermicro X10DRH-C Manual PDFA.pdf

I used ocrmypdf to pre-OCR all of my files with these flags:
--output-type pdfa --redo-ocr --optimize 1 --rotate-pages-threshold 3 --tesseract-timeout 75000 --color-conversion-strategy RGB

figadore Mar 19, 2024

Yes, working on this is on my backlog. I hope to get to it soon

(Coincidentally, I'm also a mechanical engineer, and I'm shopping for servers, and this motherboard is one of the ones on my short list. What are the odds?)

gavinwright-engr Mar 19, 2024

That's awesome!! Small world no way haha. I'm building a NAS with it but the first board I got (that one) went up in a puff of smoke... hoping to buy another soon and try to get UnRaid working. What type of ME are you? I'm a recent grad in product development/consumer tech.

Here is my configuration:

gavinwright-engr · 2024-03-14T17:29:46Z

gavinwright-engr
Mar 14, 2024

Most of the textbooks and documents I am hoping to have extracted are failing, unfortunately. I even tried using ocrmypdf, pre-ocr'd all of my PDFs. After clearing my cache hoping for better results with the newly cleaned up and pre-OCR'd files, they still weren't being extracted :(

Here is one file of many that were not working:
Supermicro X10DRH-C Manual.pdf

If this plugin were to work successfully on all of my PDFs, Obsidian would be close to perfect for me.

4 replies

gavinwright-engr Mar 14, 2024

@scambier are you still working on this problem?

The imagemagick plugin can't install for me (others in the github have the same issue), so I can't speak to its capabilities. Even if it were working, I'd prefer to be able to use the Omnisearch UI and general functionality, just with working PDF text extraction.

scambier Mar 14, 2024
Maintainer Author

> @scambier are you still working on this problem?

No, sorry

Reliably extracting text from PDFs is hard, actually way harder than I anticipated, even when a PDF looks as clean as possible. And it's even harder when you're limited to JavaScript and wasm... I unfortunately don't have the time at all for this, as I personally won't benefit from it.

gavinwright-engr Mar 14, 2024

Gotcha, I understand that it's essentially volunteer work.

Do you have any hunches about any of the modifiers that may be useful for me to try to change within the pre-ocr process? List of modifiers here.

The ones I use:
--redo-ocr --optimize 1 --invalidate-digital-signatures --rotate-pages-threshold 3 --tesseract-timeout 750

I'm thinking it might be which type of PDF it is? Not sure.

scambier Mar 14, 2024
Maintainer Author

I can't say, I've never used this program, but I think the issue lies more with the library I use for Text Extractor (https://github.com/jrmuizel/pdf-extract), which doesn't seem to exploit the correct fields.
