Read pdf asynchronously #1789

summarizepaper · 2023-04-14T14:13:44Z

Explanation

I tried to read a PDF file with PyPDF2 in asynchronous mode, but it didn't work.

I tried opening the file with the open-source aiofiles library:

async with aiofiles.open(pdf_filename, 'rb') as file:

but then the PyPDF2 functions returned an error.

Am I doing something wrong, or is async not implemented?
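The thread never shows the error, but one plausible cause is that the async file handle was passed straight to the parser. Async file wrappers expose read() as a coroutine, while a synchronous parser expects read() to return bytes. The sketch below illustrates this with a hypothetical AsyncFile stand-in (it is not aiofiles itself) and a hypothetical sync_header function playing the role of the parser.

```python
import asyncio
import inspect
import io


class AsyncFile:
    """Hypothetical stand-in for an async file wrapper: read() is a coroutine."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    async def read(self, n: int = -1) -> bytes:
        return self._buf.read(n)


def sync_header(stream):
    """Synchronous consumer: calls read() the way a blocking parser would."""
    return stream.read(5)


async def main():
    f = AsyncFile(b"%PDF-1.7 ...")
    wrong = sync_header(f)            # a coroutine object, not bytes
    is_coro = inspect.iscoroutine(wrong)
    wrong.close()                     # discard it to avoid a "never awaited" warning
    data = await f.read()             # await the read, then parse from memory
    right = sync_header(io.BytesIO(data))
    return is_coro, right


result = asyncio.run(main())
print(result)  # (True, b'%PDF-')
```

Awaiting the read first and wrapping the bytes in io.BytesIO gives the parser the synchronous stream it expects.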

pubpub-zz · 2023-04-14T15:03:23Z

Can you please provide standalone code and the reported error?

summarizepaper · 2023-04-14T15:41:47Z

Actually, I managed to make it work like so:

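import asyncio
import io

import aiofiles
import PyPDF2

async def read_pdf_text(pdf_filename):
    # Read the raw bytes asynchronously, then let PyPDF2 parse them from memory.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
    pdf_stream = io.BytesIO(pdf_data)
    pdf_reader = PyPDF2.PdfFileReader(pdf_stream)  # legacy camelCase PyPDF2 API
    text = ""
    for page in range(pdf_reader.getNumPages()):
        text += pdf_reader.getPage(page).extractText()
    return text

# pdf_filename is the path to the PDF file.
text = asyncio.run(read_pdf_text(pdf_filename))
print('text', text)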

However, when I print the text extracted from this PDF: https://arxiv.org/pdf/2304.01202v1.pdf

the output begins like this:

text manuscriptNo.
(willbeinsertedbytheeditor)
WaveMechanics,Interference,andDecoherenceinStrongGravitational
Lensing
CalvinLeung

DylanJow

PrasenjitSaha

LiangDai

MasamuneOguri

Léon
V.E.Koopmans
Abstract
Wave-mechanicale

ectsingravitationallensing
havelongbeenpredicted,andwiththediscoveryofpopula-
tionsofcompacttransientssuchasgravitationalwaveevents
andfastradiobursts,maysoonbeobserved.Wepresentan
observer'sreviewoftherelevanttheoryunderlyingwave-
mechanicale

ectsingravitationallensing.Startingfromthe
curved-spacetimescalarwaveequation,wederivetheFresnel-
Kircho

di

ractionintegral,andanalyzeitintheeikonal
andwaveopticsregimes.Weanswerthequestionofwhat
makesinterferencee

ectsobservableinsomesystemsbut
notinothers,andhowinterferencee

ectsallowforcomple-
mentaryinformationtobeextractedfromlensingsystems
ascomparedtotraditionalmeasurements.Weendbydis-
cussinghowdi

Why is it so bad?

MartinThoma · 2023-04-14T17:42:06Z

You are using an extremely outdated version. Uninstall PyPDF2. Use pypdf.

MartinThoma · 2023-04-14T17:42:34Z

I'm closing this issue as there seems to be nothing to do here.

pubpub-zz · 2023-04-14T18:37:13Z

For completeness, I re-ran the test and did not get any text scrambling.

summarizepaper · 2023-04-14T20:00:57Z

Thanks. I compared it to pdfminer, which I currently use (but which doesn't work in async), and the formulas extracted by pypdf are not as good. The text from pdfminer.six is also better. Am I doing something wrong, or is pdfminer just better?

MartinThoma · 2023-04-15T08:22:22Z

You are doing something wrong:

1. You're using PyPDF2 and not pypdf.
2. You're additionally using a super outdated version, probably PyPDF2<=1.26.0. That is more than 7 years old.
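A quick way to see which of the two distributions is actually installed, and in which version, is to ask the package metadata. This is a minimal sketch using only the standard library; installed_versions is a hypothetical helper name, not part of pypdf or PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("PyPDF2", "pypdf")):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```

If PyPDF2 shows a version and pypdf shows None, the environment is still on the old package.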

summarizepaper · 2023-04-15T19:28:45Z

No, I meant that I have updated the code to the most recent pypdf, and I still find it not as good as pdfminer. Here is the new code:

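import asyncio
import io

import aiofiles
import pypdf

async def read_pdf_text(pdf_filename):
    # Read the raw bytes asynchronously, then let pypdf parse them from memory.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
    pdf_stream = io.BytesIO(pdf_data)
    pdf_reader = pypdf.PdfReader(pdf_stream)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text(0)
    return text

# pdf_filename is the path to the PDF file.
text = asyncio.run(read_pdf_text(pdf_filename))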

MartinThoma · 2023-04-15T20:14:08Z

@summarizepaper What you write is pretty confusing. pypdf is way better than what you have shared in the excerpt. This is what I get:

manuscript No.
(will be inserted by the editor)
Wave Mechanics, Interference, and Decoherence in Strong Gravitational
Lensing
Calvin Leung �Dylan Jow �Prasenjit Saha �Liang Dai �Masamune Oguri �Léon
V . E. Koopmans
Abstract Wave-mechanical e �ects in gravitational lensing
have long been predicted, and with the discovery of popula-
tions of compact transients such as gravitational wave events
and fast radio bursts, may soon be observed. We present an
observer’s review of the relevant theory underlying wave-
mechanical e �ects in gravitational lensing. Starting from the
curved-spacetime scalar wave equation, we derive the Fresnel-
Kircho �di�raction integral, and analyze it in the eikonal
and wave optics regimes. We answer the question of what
makes interference e �ects observable in some systems but
not in others, and how interference e �ects allow for comple-
mentary information to be extracted from lensing systems
as compared to traditional measurements. We end by dis-
cussing how di �raction e �ects a �ect optical depth forecasts
and lensing near caustics, and how compact, low-frequency
transients like gravitational waves and fast radio bursts pro-
vide promising paths to open up the frontier of interferomet-
ric gravitational lensing.
Keywords Gravitational lensing, wave optics, gravitational
waves, transients
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 The curved-spacetime scalar wave equation . . . . . . . . . 2
3 Di �erent Regimes in Wave Optical Gravitational Lensing . . 5
3.1 Beyond Scalar Wave Optics . . . . . . . . . . . . . . . 7
4 Eikonal Optics . . . . . . . . . . . . . . . . . . . . . . . . . 8
[cropped a lot more]

It's also highly unlikely that there are any issues with reading PDFs asynchronously. Also, I have no idea what error message you mean.
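For reference, here is one way to keep a blocking extraction out of the event loop without an async file library: run both the read and the parse in worker threads. This is a sketch under assumptions, not a pypdf feature. extract_with_pypdf and extract_async are hypothetical helper names, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import io
from pathlib import Path


def extract_with_pypdf(data: bytes) -> str:
    """Blocking extraction from in-memory bytes."""
    import pypdf  # imported lazily so extract_async has no hard dependency on it

    reader = pypdf.PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() for page in reader.pages)


async def extract_async(path, extract=extract_with_pypdf):
    """Run the blocking file read and the blocking parse in worker threads."""
    data = await asyncio.to_thread(Path(path).read_bytes)
    return await asyncio.to_thread(extract, data)


# Usage with pypdf installed (the file name is a placeholder):
# text = asyncio.run(extract_async("paper.pdf"))
```

The extract parameter accepts any blocking callable that takes bytes and returns text, so the same wrapper would work for another extraction library.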

Comparison of pypdf and pdfminer

I looked at the difference between pdfminer and pypdf.

Left is pypdf, right is pdfminer:

What pdfminer does better:

- Math mode ("formula") is extracted better
- It handles ligatures better (ff)

What pypdf does better:

- Extracting the "arxiv" text on the side
- Extracting the "contents" section
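A comparison like the one above can be reproduced with a line-by-line diff of the two outputs. The sketch below assumes both pypdf and pdfminer.six are installed when the extraction helpers are called. extract_pypdf, extract_pdfminer and diff_text are hypothetical helper names, and the file name in the usage comment is a placeholder.

```python
import difflib


def extract_pypdf(path):
    import pypdf  # lazy import: only needed when this function is called

    reader = pypdf.PdfReader(path)
    return "\n".join(page.extract_text() for page in reader.pages)


def extract_pdfminer(path):
    from pdfminer.high_level import extract_text  # provided by pdfminer.six

    return extract_text(path)


def diff_text(a, b, a_name="pypdf", b_name="pdfminer"):
    """Return a unified diff of two extracted texts, compared line by line."""
    return list(
        difflib.unified_diff(a.splitlines(), b.splitlines(), a_name, b_name, lineterm="")
    )


# Usage (placeholder file name):
# for line in diff_text(extract_pypdf("paper.pdf"), extract_pdfminer("paper.pdf")):
#     print(line)
```

Lines prefixed with "-" come from pypdf and lines prefixed with "+" come from pdfminer.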

@pubpub-zz Do you want to investigate that further?

pubpub-zz · 2023-04-16T06:43:54Z

> @pubpub-zz Do you want to investigate that further?

I did a quick comparison with the extraction from Acrobat Reader.

The results are similar:

- The ff digraph is not extracted as ff there either (same as pypdf), so the PDF does not include the correct translation for the digraph character.
- The "." between the authors is extracted as a special character (same as pypdf), so the PDF does not include the correct translation for that character either.
- The arXiv text is extracted at the same position and as a single sentence.

In my opinion, pypdf extracts the text as it is described within the PDF.
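As a side note on ligatures: when an extractor does return the Unicode ligature code point (U+FB00 for ff), it can be expanded afterwards with NFKC normalization. This is a sketch of that post-processing step only, and expand_ligatures is a hypothetical helper name. It cannot help when the extracted character is already a replacement character, as in the pypdf output above.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC normalization, which decomposes compatibility ligatures."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("e\ufb00ects"))                   # effects
print(expand_ligatures("e\ufffdects") == "e\ufffdects")  # True: nothing to recover
```

NFKC also rewrites other compatibility characters (for example superscript digits), so it is worth checking whether that is acceptable for the text at hand.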

pubpub-zz · 2023-04-16T08:02:28Z

@MartinThoma I propose to close it

summarizepaper · 2023-04-16T13:53:17Z

Acrobat reader is reading the ff correctly. If pdfminer is able to get ff right, there should be a way. These arXiv PDFs are compiled from LaTeX, which creates a nicely formatted ff (with correct typography) that is different from two f characters next to each other. Since pdfminer can do it, it would be odd for pypdf not to. I'd like to use pypdf, but if it cannot read ff and equations correctly, then it's a problem.

Here is what I get from pypdf for the first two equations:
0=gabrarb: (1)
rb=@b: (2)

and from pdfminer:

0 = gab∇a∇bφ.

(1)

∇bφ = ∂bφ.

(2)

Basically, pypdf is replacing delta by r and phi by : and nabla by @. It doesn't handle greek letters?

pubpub-zz · 2023-04-16T15:08:34Z

> Acrobat reader is reading the ff correctly.

You have to compare the output from copy/paste, not the display.

> Basically, pypdf is replacing delta by r and phi by : and nabla by @. It doesn't handle greek letters?

Again: compare the output from the clipboard.
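The gap between display and clipboard can be pictured with a toy model: a viewer draws glyphs by their code in the font, while text extraction needs a mapping from those codes to Unicode. The sketch below is an illustration only, not pypdf's decoding logic. GLYPH_TO_UNICODE is a hypothetical table whose slots mirror the classic TeX math font layouts, which is an assumption about this particular PDF.

```python
# Hypothetical glyph-code table for illustration (assumed, not read from the PDF).
GLYPH_TO_UNICODE = {0x72: "∇", 0x40: "∂", 0x3A: "."}


def decode(codes, to_unicode=None):
    """Decode glyph codes; without a mapping, fall back to plain character codes."""
    to_unicode = to_unicode or {}
    return "".join(to_unicode.get(code, chr(code)) for code in codes)


codes = [0x72, 0x62, 0x3D, 0x40, 0x62, 0x3A]

print(decode(codes))                    # rb=@b:
print(decode(codes, GLYPH_TO_UNICODE))  # ∇b=∂b.
```

Without the mapping, the fallback yields the same kind of output as the pypdf excerpt for equation (2). With it, the symbols come out as intended.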

summarizepaper · 2023-04-16T15:25:30Z

Indeed, a copy/paste from Acrobat Reader gives what pypdf gives, but the PDF readers on Mac and in the browser display what pdfminer gives. I'm reading the PDFs and showing them on my website, so I really need a library that does the job as well as pdfminer. I don't think Acrobat is a reference anymore, and at least it shows the right ff and equations in the viewer. In my opinion, the pypdf "python" reader should do the same and produce text that humans can read and understand.

MartinThoma · 2023-04-16T18:08:48Z

I'm closing this as "not planned" as we simply don't have anybody to work on this.

However, I will add a new benchmark looking at math text extraction: py-pdf/benchmarks#6

summarizepaper assigned MartinThoma Apr 14, 2023

MartinThoma closed this as completed Apr 14, 2023

MartinThoma reopened this Apr 15, 2023

py-pdf deleted a comment from summarizepaper Apr 16, 2023

MartinThoma mentioned this issue Apr 16, 2023

Add "math extraction" benchmark py-pdf/benchmarks#6

MartinThoma closed this as not planned Apr 16, 2023
