Text extraction issue with Inter v4.1 in XeLaTeX-generated PDFs #774

Open
igrmk opened this issue Dec 1, 2024 · 14 comments

@igrmk

igrmk commented Dec 1, 2024

Text copied from a XeLaTeX-produced PDF using Inter v4.1 contains unexpected characters, while version 3.19 works flawlessly.

To Reproduce

  1. Install XeTeX, Poppler, curl, unzip

  2. Run the script below in a dedicated directory. Both produced PDFs are attached for reference:

    mkdir -p fonts
    curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v3.19/Inter-3.19.zip"
    curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v4.1/Inter-4.1.zip"
    unzip -q -o -d fonts/Inter-3.19 fonts/Inter-3.19.zip
    unzip -q -o -d fonts/Inter-4.1 fonts/Inter-4.1.zip
    
    cat <<EOF > inter-3.19.tex
    \documentclass{article}
    \pagestyle{empty}
    \usepackage{fontspec}
    
    \setmainfont{Inter}[
        Path           = ./fonts/Inter-3.19/Inter Desktop/,
        Extension      = .otf,
        UprightFont    = *-Regular,
        BoldFont       = *-Bold,
        ItalicFont     = *-Italic,
        BoldItalicFont = *-BoldItalic
    ]
    
    \begin{document}
    (C++) (100\%)
    \end{document}
    EOF
    
    cat <<EOF > inter-4.1.tex
    \documentclass{article}
    \pagestyle{empty}
    \usepackage{fontspec}
    
    \setmainfont{Inter}[
        Path           = ./fonts/Inter-4.1/extras/otf/,
        Extension      = .otf,
        UprightFont    = *-Regular,
        BoldFont       = *-Bold,
        ItalicFont     = *-Italic,
        BoldItalicFont = *-BoldItalic
    ]
    
    \begin{document}
    (C++) (100\%)
    \end{document}
    EOF
    
    mkdir -p pdfs
    xelatex -interaction=batchmode -output-directory pdfs inter-3.19.tex > /dev/null
    xelatex -interaction=batchmode -output-directory pdfs inter-4.1.tex > /dev/null
    
    pdftotext pdfs/inter-3.19.pdf - | grep -v $'\f' | grep -v '^$'
    pdftotext pdfs/inter-4.1.pdf - | grep -v $'\f' | grep -v '^$'
    
  3. It outputs the following, even though both PDFs appear fine visually:

    (C++) (100%)
    ?C?????100%? <redacted due to smileys that cannot be pasted>
    

Expected behavior
I expect it to output:

(C++) (100%)
(C++) (100%)

Environment

  • OS: macOS 15.1.1, M2
  • XeTeX 3.141592653-2.6-0.999996 (TeX Live 2024)
  • Inter Regular 4.1

Additional notes
You can reproduce the issue by copying text from the provided PDFs. The problem is evident at least in macOS Preview.

inter-3.19.pdf
inter-4.1.pdf

@kenmcd

kenmcd commented Dec 1, 2024

Your PDFs appear to be corrupt, or infected, or both.
Regardless, I cannot download and open them.
Please put the PDFs inside a ZIP, and attach the ZIP file here.

@igrmk
Author

igrmk commented Dec 1, 2024

@kenmcd I highly doubt they are either corrupt or infected. It's more likely that some of your protection tools are giving false positives. Anyway, a zip archive is attached.

inter-pdfs.zip

@kenmcd

kenmcd commented Dec 2, 2024

Appears the encoding is wrong in the v4.1 PDF.
For some reason the (, ), and + are being substituted with the tabular figures alternate glyphs - which have code-points assigned up in the Unicode PUA (Private Use Area).
( EE4E parenleft.case.tf
) EE4F parenright.case.tf
+ EE6A plus.case.tf
If the display font does not have those code-points (like here) - then the .notdef glyph appears.
So the problem is in how the PDF is being created.

@igrmk
Author

igrmk commented Dec 2, 2024

@kenmcd I discovered that, for example, c++ produces a readable mapping, while C++ shifts the + higher and uses an alternate glyph. This is great, as it looks much better. If Inter v3.19 doesn't have these alternate glyphs, it would explain why it works without issues (UPD: It turns out Inter v3.19 does have these alternate glyphs after all).

I can confirm that the same input produces correct mappings in a LibreOffice-generated PDF while using both glyphs. Unfortunately, I lack detailed knowledge about how mappings in PDFs work. However, it's clear that all the necessary information to map alternate glyphs correctly exists in the font, as LibreOffice handles it successfully.

This is likely an issue with XeTeX itself or the LaTeX packages it relies on. I will report this to their team and am closing this issue, as it is no longer relevant.

Thank you for such a great font!

@igrmk igrmk closed this as completed Dec 2, 2024
@kenmcd

kenmcd commented Dec 2, 2024

I just realized what is going on.
The Contextual Alternates (calt) feature is substituting the case alternate glyphs (which are a little higher).
So parenleft becomes parenleft.case, etc.
But I do not know why Tabular Figures (tnum) is then also applied - so parenleft.case becomes parenleft.case.tf.
tnum is not On by default - so something is enabling it.
calt is On by default, so you would need to disable it if desired.
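
A minimal sketch of how calt could be switched off for a test, reusing the `RawFeature` key that fontspec accepts (the same key used with `+tnum` elsewhere in this thread). The file name `inter-4.1-nocalt.tex` is an assumption, the font path is taken from the reproduction script, and whether this actually removes the PUA code points from the extracted text has not been verified here.

```shell
# Write a test document with contextual alternates (calt) disabled.
# Assumes the fonts were unpacked to ./fonts/Inter-4.1/ as in the report.
cat <<'EOF' > inter-4.1-nocalt.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

\setmainfont{Inter}[
    Path           = ./fonts/Inter-4.1/extras/otf/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic,
    RawFeature     = -calt
]

\begin{document}
(C++) (c++) (100\%)
\end{document}
EOF

# Compile and extract text only if both tools are installed.
if command -v xelatex >/dev/null 2>&1 && command -v pdftotext >/dev/null 2>&1; then
    mkdir -p pdfs
    xelatex -interaction=batchmode -output-directory pdfs inter-4.1-nocalt.tex > /dev/null
    pdftotext pdfs/inter-4.1-nocalt.pdf - | tr -d '\f' | grep -v '^$'
fi
```

Note that disabling calt also gives up the raised, case-sensitive punctuation next to capitals, so this is a trade-off rather than a fix.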

@igrmk
Author

igrmk commented Dec 2, 2024

@kenmcd Sorry, did I rush to close the issue? Feel free to reopen it if you believe it is related to the font.

@igrmk
Author

igrmk commented Dec 2, 2024

@kenmcd This completely blew my mind, as the following produces correct mappings by enabling the tnum feature, not disabling it:

\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

\setmainfont{Inter}[
    Path           = ./fonts/Inter-4.1/extras/otf/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic,
    RawFeature     = +tnum
]

\begin{document}
(C++) (c++) (100\%)
\end{document}

@igrmk igrmk reopened this Dec 2, 2024
@kenmcd

kenmcd commented Dec 2, 2024

No, I do not think this is an issue (error) with the font.
The automatic calt replacements often confuse users.

According to OpenType specs...
calt default should be On
tnum default should be Off

Just to be sure, I checked Inter v4.1 Regular OTF - and calt and tnum appear to be working as expected.
And as you mention, it works correctly in LibreOffice.

So the tnum being On by default appears to be a problem with XeLaTeX.
Something appears to be broken there.
You should probably file a bug with XeLaTeX about the odd tnum behavior.

@khaledhosny

The font’s cmap table maps PUA code points to alternate glyphs. This is an outdated, and IMO wrong, practice.

Some PDF producers, like XeTeX here, use the cmap mappings for PDF text extraction; others, like LibreOffice here, use the respective code point(s) from the input text regardless of the cmap mapping.

Try using \XeTeXgenerateactualtext=1, it might fix the text extraction issue with XeTeX.
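
As a sketch of where that setting could go, the script below writes a variant of the v4.1 test document with the primitive set in the preamble. The file name `inter-4.1-actualtext.tex` is an assumption and the font path is the one from the reproduction script. PDF viewers and extraction tools may differ in whether they honor `/ActualText`, so the result should be checked with each tool that matters.

```shell
# Write a test document that asks XeTeX to emit /ActualText entries.
# Assumes the fonts were unpacked to ./fonts/Inter-4.1/ as in the report.
cat <<'EOF' > inter-4.1-actualtext.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

% Record the original input text alongside the glyphs in the PDF.
\XeTeXgenerateactualtext=1

\setmainfont{Inter}[
    Path           = ./fonts/Inter-4.1/extras/otf/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic
]

\begin{document}
(C++) (c++) (100\%)
\end{document}
EOF

# Compile and extract text only if both tools are installed.
if command -v xelatex >/dev/null 2>&1 && command -v pdftotext >/dev/null 2>&1; then
    mkdir -p pdfs
    xelatex -interaction=batchmode -output-directory pdfs inter-4.1-actualtext.tex > /dev/null
    pdftotext pdfs/inter-4.1-actualtext.pdf - | tr -d '\f' | grep -v '^$'
fi
```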

@khaledhosny

Seeing #541, it seems unlikely that PUA mappings are going away.

@igrmk
Author

igrmk commented Dec 3, 2024

@khaledhosny I can confirm that \XeTeXgenerateactualtext=1 works. I will write a PR for README.md, as this is an important peculiarity of the font that can take hours of debugging for those using XeTeX.

@igrmk
Author

igrmk commented Dec 3, 2024

The solution with \XeTeXgenerateactualtext=1 fixes only part of the problem. When set to 1, the /ActualText entry is added to the output PDF, improving copy/paste and search functionality in PDF viewers. However, some tools like pdftotext respect this entry, while others, like macOS Preview, do not. Since my goal is to maximise accessibility for my document, my current solution is to fall back to Inter v3.19 for now.

@igrmk
Author

igrmk commented Dec 3, 2024

Just a wild guess: could it be worth mapping these glyphs to both the Private Use Area (PUA) and the actual text? I mean having the calt feature produce actual text mappings while also making these glyphs accessible in the PUA through Unicode codes for those who need them there. I'm not a font expert and have no idea if this is feasible. However, if it is, XeTeX is a significant tool, and there are only 32 glyphs involved, as noted in #541.

@khaledhosny

The proper code points are already mapped to the default glyphs, and it is not possible to map the same code point to different glyphs in cmap.
