Read pdf asynchronously #1789
Comments
Can you please provide standalone code and the error that was reported? |
Actually, I managed to make it work like so:
But when I print the text from this PDF, https://arxiv.org/pdf/2304.01202v1.pdf, this is what I get for the beginning:
Why is it so bad? |
You are using an extremely outdated version. Uninstall |
I'm closing this issue as there seems nothing to do here. |
For completeness, I've re-run the test without text scrambling. |
Thanks. I compared it with pdfminer.six, which I currently use (but which doesn't work in async), and the formulas extracted by pypdf are not as good. The plain text is also better with pdfminer.six. Am I doing something wrong, or is pdfminer.six just better? |
You are doing something wrong:
|
No, I meant that I have updated the code to the most recent pypdf, and I still find it not as good as pdfminer. Here is the new code: import aiofiles
|
@summarizepaper What you write is pretty confusing. pypdf's output is far better than the excerpt you shared. This is what I get:
It's also highly unlikely that there are any issues with reading PDFs asynchronously. Also, I have no idea what error message you mean.
Comparison of pypdf and pdfminer
I looked at the difference between pdfminer and pypdf. Left is pypdf, right is pdfminer:
What pdfminer does better:
What pypdf does better:
@pubpub-zz Do you want to investigate that further? |
I did a quick comparison with the extraction from Acrobat Reader.
In my opinion, pypdf extracts the text as it is described within the PDF. |
@MartinThoma I propose to close it |
Acrobat Reader reads the ff correctly. If pdfminer is able to get the ff right, there should be a way. These arXiv PDFs are compiled from LaTeX, which produces a nicely formatted ff (with correct typography), and that is different from having two f characters next to each other. Since pdfminer can do it, it would be odd for pypdf not to. I'd like to use pypdf, but if it cannot read ff and equations correctly, that's a problem.
For the equations, here is what I get from pypdf for the first two equations: and from pdfminer: 0 = gab∇a∇bφ. (1) ∇bφ = ∂bφ. (2)
Basically, pypdf is replacing delta by r and phi by : and nabla by @. Does it not handle Greek letters? |
You have to compare the output from copy/paste, not what is displayed.
Again: compare the output from the clipboard. |
Indeed, a copy-paste from Acrobat Reader gives what pypdf gives, but the PDF reader on the Mac and the one in the browser display what pdfminer gives. I'm reading the PDFs and showing them on my website, so I really need a library that can do the job as well as pdfminer. I don't think Acrobat is the reference anymore, and at least it shows the right ff and equations in the reader. In my opinion, the pypdf "python" reader should do the same and give text that can be read and understood by humans. |
I'm closing this as "not planned" as we simply don't have anybody to work on this. However, I will add a new benchmark looking at math text extraction: py-pdf/benchmarks#6 |
Explanation
I tried to read a PDF file with PyPDF2 in asynchronous mode, but it didn't work.
I tried using the open-source aiofiles library (available on GitHub) with async with aiofiles.open(pdf_filename, 'rb') as file:
but then the PyPDF2 functions returned an error.
Am I doing something wrong, or is async not implemented?
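For readers who land here with the same question, the following is a minimal sketch of one possible workaround, not the maintainers' recommended method. It assumes the error comes from handing an asynchronous file handle to a reader that expects an ordinary, seekable binary stream. The sketch reads the raw bytes without blocking the event loop, wraps them in `io.BytesIO`, and only then passes them to pypdf's `PdfReader`. The helper names `load_pdf_bytes`, `extract_text`, and `read_pdf_text` are hypothetical, and `asyncio.to_thread` (Python 3.9+) is swapped in for aiofiles so that the I/O part needs only the standard library.

```python
import asyncio
import io
from pathlib import Path


async def load_pdf_bytes(pdf_filename: str) -> io.BytesIO:
    """Read the whole file in a worker thread and return a seekable in-memory stream.

    Hypothetical helper; with aiofiles one could instead `await file.read()`
    and wrap the resulting bytes in io.BytesIO the same way.
    """
    data = await asyncio.to_thread(Path(pdf_filename).read_bytes)
    return io.BytesIO(data)


def extract_text(stream: io.BytesIO) -> str:
    """Parse the in-memory PDF synchronously and join the text of all pages."""
    # Imported lazily so the I/O helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(stream)
    return "\n".join(page.extract_text() for page in reader.pages)


async def read_pdf_text(pdf_filename: str) -> str:
    """Load the file asynchronously, then parse it off the event loop."""
    stream = await load_pdf_bytes(pdf_filename)
    # Parsing is CPU-bound, so it is also pushed to a worker thread.
    return await asyncio.to_thread(extract_text, stream)
```

With pypdf installed, calling `asyncio.run(read_pdf_text("paper.pdf"))` would return the extracted text for a file named `paper.pdf` (a placeholder name). The parsing itself stays synchronous; only the file read and the scheduling are asynchronous.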