Read pdf asynchronously #1789

Closed
summarizepaper opened this issue Apr 14, 2023 · 15 comments
@summarizepaper

Explanation

I tried to read a PDF file with PyPDF2 asynchronously, but it didn't work.

I opened the file with the open-source aiofiles library (available on GitHub), using async with aiofiles.open(pdf_filename, 'rb') as file:

but the PyPDF2 functions then raised an error.

Am I doing something wrong, or is async not implemented?

@pubpub-zz
Collaborator

Can you please provide standalone code and the error reported?

@summarizepaper
Author

summarizepaper commented Apr 14, 2023

Actually, I managed to make it work like so:

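import asyncio
import io

import aiofiles
import PyPDF2


async def read_pdf_text(pdf_filename):
    # Only the file read is asynchronous; PyPDF2 itself parses synchronously.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
    pdf_stream = io.BytesIO(pdf_data)
    pdf_reader = PyPDF2.PdfFileReader(pdf_stream)  # legacy PyPDF2 API
    text = ""
    for page in range(pdf_reader.getNumPages()):
        text += pdf_reader.getPage(page).extractText()
    return text


# pdf_filename: path to the downloaded PDF
print('text', asyncio.run(read_pdf_text(pdf_filename)))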

But when I print the text extracted from this PDF: https://arxiv.org/pdf/2304.01202v1.pdf

the beginning of the output looks like this:

text manuscriptNo.
(willbeinsertedbytheeditor)
WaveMechanics,Interference,andDecoherenceinStrongGravitational
Lensing
CalvinLeung

DylanJow

PrasenjitSaha

LiangDai

MasamuneOguri

Léon
V.E.Koopmans
Abstract
Wave-mechanicale

ectsingravitationallensing
havelongbeenpredicted,andwiththediscoveryofpopula-
tionsofcompacttransientssuchasgravitationalwaveevents
andfastradiobursts,maysoonbeobserved.Wepresentan
observer'sreviewoftherelevanttheoryunderlyingwave-
mechanicale

ectsingravitationallensing.Startingfromthe
curved-spacetimescalarwaveequation,wederivetheFresnel-
Kircho

di

ractionintegral,andanalyzeitintheeikonal
andwaveopticsregimes.Weanswerthequestionofwhat
makesinterferencee

ectsobservableinsomesystemsbut
notinothers,andhowinterferencee

ectsallowforcomple-
mentaryinformationtobeextractedfromlensingsystems
ascomparedtotraditionalmeasurements.Weendbydis-
cussinghowdi

Why is the result so bad?

@MartinThoma
Member

You are using an extremely outdated version. Uninstall PyPDF2. Use pypdf.

@MartinThoma
Member

I'm closing this issue as there seems to be nothing to do here.

@pubpub-zz
Collaborator

pubpub-zz commented Apr 14, 2023

For completeness: I re-ran the test and did not get any text scrambling.

@summarizepaper
Author

summarizepaper commented Apr 14, 2023

Thanks. I compared with pdfminer, which I currently use (although it doesn't work in async), and the formulas extracted by pypdf are not as good. The text from pdfminer.six is also better. Am I doing something wrong, or is pdfminer simply better?

@MartinThoma
Member

You are doing something wrong:

  1. You're using PyPDF2 and not pypdf
  2. You're additionally using a super outdated version, probably PyPDF2<=1.26.0. That is more than 7 years old.
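
To check which of the two packages is actually installed, and in which version, something like the following can be used. This is a minimal sketch using only the standard library; the helper name `installed_pdf_libs` is made up for illustration.

```python
from importlib import metadata


def installed_pdf_libs(names=("pypdf", "PyPDF2")):
    """Return {distribution name: installed version, or None if absent}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_pdf_libs().items():
        print(f"{name}: {version or 'not installed'}")
```

If PyPDF2 shows up here, `pip uninstall PyPDF2` followed by `pip install pypdf` is the switch suggested above.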

@summarizepaper
Author

No, I meant that I have updated the code to the most recent pypdf and I find it not as good as pdfminer. Here is the new code:

import io

import aiofiles
import pypdf


async def read_pdf_text(pdf_filename):
    # Only the file read is asynchronous; pypdf itself parses synchronously.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
    pdf_reader = pypdf.PdfReader(io.BytesIO(pdf_data))
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text(0)
    return text

# Call from another coroutine: text = await read_pdf_text(pdf_filename)

@MartinThoma
Member

MartinThoma commented Apr 15, 2023

@summarizepaper What you write is pretty confusing. pypdf is way better than what you have shared in the excerpt. This is what I get:

manuscript No.
(will be inserted by the editor)
Wave Mechanics, Interference, and Decoherence in Strong Gravitational
Lensing
Calvin Leung �Dylan Jow �Prasenjit Saha �Liang Dai �Masamune Oguri �Léon
V . E. Koopmans
Abstract Wave-mechanical e �ects in gravitational lensing
have long been predicted, and with the discovery of popula-
tions of compact transients such as gravitational wave events
and fast radio bursts, may soon be observed. We present an
observer’s review of the relevant theory underlying wave-
mechanical e �ects in gravitational lensing. Starting from the
curved-spacetime scalar wave equation, we derive the Fresnel-
Kircho �di�raction integral, and analyze it in the eikonal
and wave optics regimes. We answer the question of what
makes interference e �ects observable in some systems but
not in others, and how interference e �ects allow for comple-
mentary information to be extracted from lensing systems
as compared to traditional measurements. We end by dis-
cussing how di �raction e �ects a �ect optical depth forecasts
and lensing near caustics, and how compact, low-frequency
transients like gravitational waves and fast radio bursts pro-
vide promising paths to open up the frontier of interferomet-
ric gravitational lensing.
Keywords Gravitational lensing, wave optics, gravitational
waves, transients
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 The curved-spacetime scalar wave equation . . . . . . . . . 2
3 Di �erent Regimes in Wave Optical Gravitational Lensing . . 5
3.1 Beyond Scalar Wave Optics . . . . . . . . . . . . . . . 7
4 Eikonal Optics . . . . . . . . . . . . . . . . . . . . . . . . . 8
[cropped a lot more]

It's also highly unlikely that there are any issues with reading PDFs asynchronously. Also, I have no idea what error message you mean.
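
For reference, pypdf parses synchronously, so only the file read can really be awaited. One way to keep the event loop responsive is to push both the read and the parse into a worker thread. The sketch below assumes Python 3.9+ for `asyncio.to_thread`; `extract_text_async` and `parse_with_pypdf` are illustrative names, not part of pypdf.

```python
import asyncio
import io
from pathlib import Path
from typing import Callable


def parse_with_pypdf(data: bytes) -> str:
    """Synchronous extraction; pypdf is imported lazily so the sketch loads without it."""
    import pypdf

    reader = pypdf.PdfReader(io.BytesIO(data))
    return "".join(page.extract_text() for page in reader.pages)


async def extract_text_async(
    path: str, parse: Callable[[bytes], str] = parse_with_pypdf
) -> str:
    # Read the file in a worker thread so the event loop is not blocked.
    data = await asyncio.to_thread(Path(path).read_bytes)
    # Parsing is synchronous as well, so it also runs in a worker thread.
    return await asyncio.to_thread(parse, data)
```

From a coroutine this would be called as `text = await extract_text_async("paper.pdf")`, where the file name is a placeholder. A thread does not make the parsing itself faster; it only lets other coroutines run while it is in progress.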

Comparison of pypdf and pdfminer

I looked at the difference between pdfminer and pypdf.

Left is pypdf, right is pdfminer:

image

What pdfminer does better:

  • Math mode ("formula") is extracted better
  • It handles ligatures better (ff)

What pypdf does better:

  • Extracting the "arxiv" text on the side
  • Extracting the "contents" section

@pubpub-zz Do you want to investigate that further?

@MartinThoma MartinThoma reopened this Apr 15, 2023
@py-pdf py-pdf deleted a comment from summarizepaper Apr 16, 2023
@pubpub-zz
Collaborator

pubpub-zz commented Apr 16, 2023

@pubpub-zz Do you want to investigate that further?

I did a quick comparison with the extraction from Acrobat Reader.
image
The results are similar:

  • the ff ligature is not extracted as ff there either (same as pypdf) => the PDF does not include the correct translation of the ligature character
  • the "." between the authors is extracted as a special character (same as pypdf) => the PDF does not include the correct translation of that character either
  • the arXiv text is extracted at the same position and as a single sentence.

My opinion is that pypdf is extracting the text as it is described within the pdf.

@pubpub-zz
Collaborator

@MartinThoma I propose to close it

@summarizepaper
Author

summarizepaper commented Apr 16, 2023

Acrobat reader is reading the ff correctly. If pdfminer is able to get the ff right, there should be a way. These arXiv PDFs are compiled from LaTeX, which typesets a proper ff ligature (with correct typography) rather than two separate f characters next to each other. Since pdfminer can do it, it would be odd for pypdf not to. I'd like to use pypdf, but if it cannot read ff and equations correctly, that's a problem.
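
Whether an extractor can return "ff" depends on what the PDF gives it for the ligature glyph. If the text comes back with the Unicode ligature code point (U+FB00), compatibility normalization expands it to two plain letters. This is a minimal sketch with the standard library; it assumes the extractor emitted real ligature code points, and it does not help when the glyph has no usable Unicode mapping and comes out as a placeholder character, as in the pypdf output quoted earlier. The helper name `expand_ligatures` is made up for illustration.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB00 (ff ligature) via NFKC."""
    return unicodedata.normalize("NFKC", text)


# "\ufb00" is the single-character ff ligature.
sample = "Wave-mechanical e\ufb00ects in gravitational lensing"
print(expand_ligatures(sample))
```

Note that NFKC also folds other compatibility characters (superscript digits, for example), which may be unwanted for text that contains formulas.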

For the equations, here is what I get from pypdf for the first two equations:
0=gabrarb: (1)
rb=@b: (2)

and from pdfminer:

0 = gab∇a∇bφ.

(1)

∇bφ = ∂bφ.

(2)

Basically, pypdf is replacing delta by r and phi by : and nabla by @. It doesn't handle greek letters?
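
A character-level diff of the two outputs makes the substitutions explicit. The sketch below uses `difflib` from the standard library on equation (2) as quoted above, with whitespace removed; the helper name `char_substitutions` is made up for illustration.

```python
import difflib


def char_substitutions(reference: str, candidate: str):
    """List (reference_chunk, candidate_chunk) pairs where the strings differ."""
    # Drop all whitespace so spacing differences do not show up as changes.
    ref = "".join(reference.split())
    cand = "".join(candidate.split())
    matcher = difflib.SequenceMatcher(None, ref, cand, autojunk=False)
    return [
        (ref[i1:i2], cand[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


pdfminer_eq2 = "∇bφ = ∂bφ."
pypdf_eq2 = "rb=@b:"
for ref_chunk, cand_chunk in char_substitutions(pdfminer_eq2, pypdf_eq2):
    print(repr(ref_chunk), "->", repr(cand_chunk))
```

Each printed pair shows a chunk of the pdfminer text next to what pypdf produced in its place; an empty string on the right means the chunk is missing from the pypdf output.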

@pubpub-zz
Collaborator

Acrobat reader is reading the ff correctly.

You have to compare output from copy/paste not display

Basically, pypdf is replacing delta by r and phi by : and nabla by @. It doesn't handle greek letters?

again: compare output from clipboard.

@summarizepaper
Author

Indeed, copy-pasting from Acrobat Reader gives the same text as pypdf, but the PDF readers on the Mac and in the browser give what pdfminer gives. I'm extracting text from PDFs and showing it on my website, so I really need a library that does the job as well as pdfminer. I don't think Acrobat is a reference anymore, and at least it displays the ff and the equations correctly in the reader. In my opinion, the pypdf "python" reader should do the same and return text that humans can read and understand.

@MartinThoma
Member

I'm closing this as "not planned" as we simply don't have anybody to work on this.

However, I will add a new benchmark looking at math text extraction: py-pdf/benchmarks#6

@MartinThoma MartinThoma closed this as not planned Apr 16, 2023