Text extraction issue with Inter v4.1 in XeLaTeX-generated PDFs #774

igrmk · 2024-12-01T17:27:10Z

Text copied from a XeLaTeX-produced PDF using Inter v4.1 contains unexpected characters, while version 3.19 works flawlessly.

To Reproduce

Install XeTeX, Poppler, curl, unzip

Run the script below in a dedicated directory. Both produced PDFs are attached for reference:

mkdir -p fonts
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v3.19/Inter-3.19.zip"
curl -s -L -O --output-dir fonts "https://github.com/rsms/inter/releases/download/v4.1/Inter-4.1.zip"
unzip -q -o -d fonts/Inter-3.19 fonts/Inter-3.19.zip
unzip -q -o -d fonts/Inter-4.1 fonts/Inter-4.1.zip

cat <<EOF > inter-3.19.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

\setmainfont{Inter}[
    Path           = ./fonts/Inter-3.19/Inter Desktop/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic
]

\begin{document}
(C++) (100\%)
\end{document}
EOF

cat <<EOF > inter-4.1.tex
\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

\setmainfont{Inter}[
    Path           = ./fonts/Inter-4.1/extras/otf/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic
]

\begin{document}
(C++) (100\%)
\end{document}
EOF

mkdir -p pdfs
xelatex -interaction=batchmode -output-directory pdfs inter-3.19.tex > /dev/null
xelatex -interaction=batchmode -output-directory pdfs inter-4.1.tex > /dev/null

pdftotext pdfs/inter-3.19.pdf - | grep -v $'\f' | grep -v '^$'
pdftotext pdfs/inter-4.1.pdf - | grep -v $'\f' | grep -v '^$'

It outputs the following, even though both PDFs appear fine visually:

(C++) (100%)
?C?????100%? <redacted due to smileys that cannot be pasted>

Expected behavior
I expect it to output:

(C++) (100%)
(C++) (100%)

Environment

OS: macOS 15.1.1, M2
XeTeX 3.141592653-2.6-0.999996 (TeX Live 2024)
Inter Regular 4.1

Additional notes
You can reproduce the issue by copying text from the provided PDFs. The problem is evident at least in macOS Preview.

inter-3.19.pdf
inter-4.1.pdf


kenmcd · 2024-12-01T20:34:34Z

Your PDFs appear to be corrupt, or infected, or both.
Regardless, I cannot download and open them.
Please put the PDFs inside a ZIP, and attach the ZIP file here.

igrmk · 2024-12-01T20:49:40Z

@kenmcd I highly doubt they are either corrupt or infected. It's more likely that some of your protection tools are giving false positives. Anyway, a zip archive is attached.

inter-pdfs.zip

kenmcd · 2024-12-02T19:39:16Z

Appears the encoding is wrong in the v4.1 PDF.
For some reason the (, ), and + are being substituted with the tabular figures alternate glyphs - which have code-points assigned up in the Unicode PUA (Private Use Area).
( EE4E parenleft.case.tf
) EE4F parenright.case.tf
+ EE6A plus.case.tf
If the display font does not have those code-points (like here) - then the .notdef glyph appears.
So the problem is in how the PDF is being created.
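To see which code points actually end up in the extracted text, it can help to list every Private Use Area character instead of looking at the placeholder symbols. The following is only a rough sketch: the three-entry table is copied from the list above, the helper name `describe_pua` is made up for illustration, and only the Basic Multilingual Plane PUA range (U+E000 to U+F8FF) is checked.

```python
# Glyph names for the three PUA code points listed in this comment.
# Any other PUA code point is reported as "unknown".
KNOWN_PUA = {
    0xEE4E: ("(", "parenleft.case.tf"),
    0xEE4F: (")", "parenright.case.tf"),
    0xEE6A: ("+", "plus.case.tf"),
}


def describe_pua(text):
    """Return one line per PUA character found in `text` (hypothetical helper)."""
    report = []
    for index, char in enumerate(text):
        code = ord(char)
        # Basic Multilingual Plane Private Use Area: U+E000..U+F8FF.
        if 0xE000 <= code <= 0xF8FF:
            intended, glyph = KNOWN_PUA.get(code, ("?", "unknown"))
            report.append(f"{index}: U+{code:04X} {glyph} (intended {intended!r})")
    return report


if __name__ == "__main__":
    # Hypothetical stand-in for what an extractor might return for "(C++)".
    sample = "\uee4eC\uee6a\uee6a\uee4f"
    print("\n".join(describe_pua(sample)))
```

In practice the `text` argument would be whatever the extraction tool returned; plain ASCII input yields an empty report.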

igrmk · 2024-12-02T20:18:05Z

@kenmcd I discovered that, for example, c++ produces a readable mapping, while C++ shifts the + higher and uses an alternate glyph. This is great, as it looks much better. If Inter v3.19 doesn't have these alternate glyphs, it would explain why it works without issues (UPD: It turns out Inter v3.19 does have these alternate glyphs after all).

I can confirm that the same input produces correct mappings in a LibreOffice-generated PDF while using both glyphs. Unfortunately, I lack detailed knowledge about how mappings in PDFs work. However, it's clear that all the necessary information to map alternate glyphs correctly exists in the font, as LibreOffice handles it successfully.

This is likely an issue with XeTeX itself or the LaTeX packages it relies on. I will report this to their team and am closing this issue, as it is no longer relevant.

Thank you for such a great font!

kenmcd · 2024-12-02T20:31:26Z

I just realized what is going on.
The Contextual Alternates (calt) feature is substituting the case alternate glyphs (which are a little higher).
So parenleft becomes parenleft.case, etc.
But I do not know why Tabular Figures (tnum) is then also applied - so parenleft.case becomes parenleft.case.tf.
tnum is not On by default - so something is enabling it.
calt is On by default, so you would need to disable it if desired.

igrmk · 2024-12-02T20:36:57Z

@kenmcd Sorry, did I rush to close the issue? Feel free to reopen it if you believe it is related to the font.

igrmk · 2024-12-02T21:01:54Z

@kenmcd This completely blew my mind, as the following produces correct mappings by enabling the tnum feature, not disabling it:

\documentclass{article}
\pagestyle{empty}
\usepackage{fontspec}

\setmainfont{Inter}[
    Path           = ./fonts/Inter-4.1/extras/otf/,
    Extension      = .otf,
    UprightFont    = *-Regular,
    BoldFont       = *-Bold,
    ItalicFont     = *-Italic,
    BoldItalicFont = *-BoldItalic,
    RawFeature     = +tnum
]

\begin{document}
(C++) (c++) (100\%)
\end{document}

kenmcd · 2024-12-02T21:15:56Z

No, I do not think this is an issue (error) with the font.
The automatic calt replacements often confuse users.

According to OpenType specs...
calt default should be On
tnum default should be Off

Just to be sure, I checked Inter v4.1 Regular OTF - and calt and tnum appear to be working as expected.
And as you mention, it works correctly in LibreOffice.

So the tnum being On by default appears to be a problem with XeLaTeX.
Something appears to be broken there.
You should probably file a bug with XeLaTeX about the odd tnum behavior.

khaledhosny · 2024-12-03T08:03:54Z

The font’s cmap table maps PUA code points to alternate glyphs. This is an outdated, and IMO wrong, practice.

Some PDF producers like XeTeX here will use the cmap mappings for PDF text extraction, others like LibreOffice here, will use the respective code point(s) from input text regardless of the cmap mapping.

Try using \XeTeXgenerateactualtext=1, it might fix the text extraction issue with XeTeX.
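The difference between the two strategies can be shown with a toy model. This is a sketch under loud assumptions, not the actual code of XeTeX or LibreOffice: the `CMAP` and `ALTERNATES` tables are tiny stand-ins, the glyph names are taken from the listing earlier in this thread, and the "shaping" step simply always picks the alternate glyph.

```python
# Toy cmap (code point -> glyph name). The PUA rows use the names listed
# earlier in this thread; the real font has far more entries.
CMAP = {
    0x0028: "parenleft",
    0x0029: "parenright",
    0x002B: "plus",
    0x0043: "C",
    0xEE4E: "parenleft.case.tf",
    0xEE4F: "parenright.case.tf",
    0xEE6A: "plus.case.tf",
}

# ASSUMPTION: stand-in for OpenType shaping; always picks the alternate.
ALTERNATES = {
    "parenleft": "parenleft.case.tf",
    "parenright": "parenright.case.tf",
    "plus": "plus.case.tf",
}


def shape(text):
    """Return (glyph name, source character) pairs for `text`."""
    shaped = []
    for char in text:
        glyph = CMAP[ord(char)]
        shaped.append((ALTERNATES.get(glyph, glyph), char))
    return shaped


def extract_via_cmap(shaped):
    """Strategy 1: look each glyph up in the reversed cmap."""
    reverse = {glyph: code for code, glyph in CMAP.items()}
    return "".join(chr(reverse[glyph]) for glyph, _ in shaped)


def extract_via_input(shaped):
    """Strategy 2: keep the code points of the original input text."""
    return "".join(char for _, char in shaped)


if __name__ == "__main__":
    shaped = shape("(C++)")
    print(ascii(extract_via_cmap(shaped)))
    print(extract_via_input(shaped))
```

Because the alternate glyphs are reachable from the cmap, the first strategy finds a PUA code point for them, while the second never consults the cmap at extraction time.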

khaledhosny · 2024-12-03T08:07:07Z

Seeing #541, it seems unlikely that PUA mappings are going away.

igrmk · 2024-12-03T13:25:45Z

@khaledhosny I can confirm that \XeTeXgenerateactualtext=1 works. I will write a PR for README.md, as this is an important peculiarity of the font that can take hours of debugging for those using XeTeX.

igrmk · 2024-12-03T13:45:35Z

The solution with \XeTeXgenerateactualtext=1 fixes only part of the problem. When set to 1, the /ActualText entry is added to the output PDF, improving copy/paste and search functionality in PDF viewers. However, some tools like pdftotext respect this entry, while others, like macOS Preview, do not. Since my goal is to maximise accessibility for my document, my current solution is to fall back to Inter v3.19 for now.
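A check like this can be automated. The sketch below assumes Poppler's `pdftotext` is on `PATH` and uses the same `pdftotext FILE -` invocation as the reproduction script; the function names are hypothetical. Since pdftotext honours /ActualText, a clean result for a PDF built with \XeTeXgenerateactualtext=1 says nothing about viewers that ignore the entry, so running it on a PDF built without that setting should give a closer approximation of what such viewers extract.

```python
import subprocess


def pua_characters(text):
    """Return the distinct BMP Private Use Area characters in `text`, sorted."""
    return sorted({c for c in text if 0xE000 <= ord(c) <= 0xF8FF})


def check_pdf(pdf_path):
    """Extract text with pdftotext and list the PUA code points it contains.

    ASSUMPTION: `pdftotext` (Poppler) is installed; "-" sends text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return [f"U+{ord(c):04X}" for c in pua_characters(text)]
```

For example, `check_pdf("pdfs/inter-4.1.pdf")` returning an empty list would mean pdftotext saw no PUA characters in that file.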

igrmk · 2024-12-03T14:13:57Z

Just a wild guess: could it be worth mapping these glyphs to both the Private Use Area (PUA) and the actual text? I mean having the calt feature produce actual text mappings while also making these glyphs accessible in the PUA through Unicode codes for those who need them there. I'm not a font expert and have no idea if this is feasible. However, if it is, XeTeX is a significant tool, and there are only 32 glyphs involved, as noted in #541.

khaledhosny · 2024-12-03T14:20:56Z

The proper code points are already mapped to the default glyphs, and it is not possible to map the same code point to different glyphs in cmap.
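One way to picture this constraint is to treat the cmap as a dictionary keyed by code point. This is a simplified model rather than the binary table format, and `add_mapping` is a hypothetical helper written only for this illustration.

```python
def add_mapping(cmap, code_point, glyph):
    """Add code_point -> glyph, refusing to remap an occupied code point."""
    if cmap.get(code_point, glyph) != glyph:
        raise ValueError(
            f"U+{code_point:04X} already maps to {cmap[code_point]!r}"
        )
    cmap[code_point] = glyph


# Toy cmap holding the default glyph and its PUA-mapped alternate.
cmap = {0x0028: "parenleft", 0xEE4E: "parenleft.case.tf"}

try:
    # The proposal: give the alternate glyph the real code point as well.
    add_mapping(cmap, 0x0028, "parenleft.case.tf")
except ValueError as error:
    print(error)
```

The lookup goes from code point to glyph, so U+0028 can name only one glyph, and that slot is already taken by the default parenthesis.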

igrmk closed this as completed Dec 2, 2024

igrmk reopened this Dec 2, 2024
