Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integrity check for LaTeX special characters #8712

Open
3 tasks
ThiloteE opened this issue Apr 22, 2022 · 27 comments
Open
3 tasks

Add integrity check for LaTeX special characters #8712

ThiloteE opened this issue Apr 22, 2022 · 27 comments

Comments

@ThiloteE
Copy link
Member

ThiloteE commented Apr 22, 2022

This issue was inspired by inspecting and working on issues #8673, plurimath/unicode2latex#19, #8650, #8682 and #3644

Problem:

  1. Special characters do special things in LaTeX (or when being compiled by Biber).
  2. Special characters are hard to convert from LaTeX to Unicode and vice versa, because context is relevant! (e.g. being in Text Mode or Math Mode or another special LaTeX environment). Using special characters may necessitate user action.

Desired solution:

  1. Users should be able to easily find existing "special characters" in their database.
  2. Create integrity check(s) for having the following symbols in your entries:
  • From Table 1: LaTeX escapable "special" characters:

    • %, $, _, # and &
    • Not necessary: Detect \%, \$, \_, \# and \&, because this should be handled at text paste.
  • From Table 2: Predefined LATEX 2𝜀 Text-mode Commands:

    • See Table 2. I am too lazy to copy / paste them all here.
    • Comment: I tried pasting a few of these symbols into JabRef. NOT ALL could be entered into the JabRef title field. Feel free to experiment and implement an integrity check for characters that can easily be used with JabRef. Would be nice to create a list of Symbols from Table 2 that JabRef can't easily deal with.
  • Table 3: Commands Defined to Work in Both Math and Text Mode:

    • ... - PS. not yet sure about this one. Maybe the ... should be added to the Unicode2LaTeX converter instead. Please Exclude this character for now. Work on the others first!
    • All the other characters in table 3 can be converted from Unicode to LaTeX and from LaTeX to Unicode (except _ and \_), so no need to implement an integrity check for them! :-)

Additional info:

  • I believe an integrity check for # is already implemented. Maybe this integrity check can be used as a template.
@ThiloteE
Copy link
Member Author

ThiloteE commented Apr 22, 2022

The workflow for users is as follows:

  1. Decide if Database should be exported to LaTeX/OpenOffice/LibreOffice/MicrosoftOffice (Each of them may require entries to be formatted in a specific way (e.g. LaTeX, Unicode, HTML etc.) to be rendered correctly).
  2. Use integrity check
  3. Decide if characters found by integrity check are correct or not.
  4. If not, change characters
    • Either manually
    • Or via "cleanup actions"

@github-project-automation github-project-automation bot moved this to Normal priority in Features & Enhancements Apr 16, 2023
@github-project-automation github-project-automation bot moved this to Free to take in Good First Issues Apr 16, 2023
@Siedlerchr Siedlerchr added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 16, 2023
@koppor koppor moved this from Free to take to Under Review in Candidates for University Projects Apr 24, 2023
@koppor koppor removed the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 24, 2023
@systemoperator
Copy link
Contributor

@Siedlerchr @koppor With this new feature, I receive now tons of LaTeX Parsing Errors, when using "Check integrity", which irritated me for a while and made me think:

Field: File
Message: LaTeX Parsing Error: Command \: cannot be used in PARAG...

This results because of using the JabRef Browser Extension, which adds linked files as follows:

file = {Full Text PDF:https\://www.jair.org/index.php/jair/article/download/11675/26513:application/pdf;Snapshot:https\://www.jair.org/index.php/jair/article/view/11675:text/html;:Jauhiainen2019 - Automatic Language Identification in Texts_ a Survey.pdf:PDF},

@koppor
Copy link
Member

koppor commented Sep 19, 2023

@systemoperator Thank you for letting us know! CC @Zylence

@koppor
Copy link
Member

koppor commented Sep 19, 2023

@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.

@systemoperator
Copy link
Contributor

@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.

Awesome! I'm gonna check this out.

@Zylence
Copy link
Contributor

Zylence commented Sep 19, 2023

This results because of using the JabRef Browser Extension

I am terribly sorry, I did not check for this usecase. ( Honestly I only tested manual input. 😐)

@koppor
Copy link
Member

koppor commented Sep 21, 2023

No worries at all - and thank @systemoperator for the quick report!

Background: This is one consequence of our "merge early" development strategy. Instead of asking users to try out PRs, we "just" merge and fix ASAP if something goes wrong. We get more velocity into JabRef - and I think, this good for the tool as whole. -- Otherwise, we had many PRs piling up, because some details were missing. Then, they got abandoned and never got revived. OK, most of the non-merged PRs were 20% solutions. These currently pile up at https://github.com/koppor/jabref/pulls

@systemoperator
Copy link
Contributor

systemoperator commented Sep 21, 2023

@koppor The mentioned error has gone now. :)

I've noticed other things for imported entries, using the JabRef Browser Extension:

For DOI fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
Examples: 10.1007/0-387-34805-0_39, 10.1007/3-540-49264-X_24

For Comment, Comment-username, Abstract, Title and Journaltitle fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
LaTeX Parsing Error: Undefined command \textgreater
LaTeX Parsing Error: Undefined command \textbackslash
LaTeX Parsing Error: Undefined command \textbar
LaTeX Parsing Error: Undefined command \textless

@koppor
Copy link
Member

koppor commented Sep 21, 2023

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

@koppor
Copy link
Member

koppor commented Sep 21, 2023

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

@koppor
Copy link
Member

koppor commented Sep 21, 2023

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

@systemoperator
Copy link
Contributor

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores.
e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

@Zylence
Copy link
Contributor

Zylence commented Sep 21, 2023

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

In that case it would be expected to fail, as _ and ^ are only allowed in latex math environments.

@systemoperator
Copy link
Contributor

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Sure. It's nothing, which breaks something, it's just informative.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Or only allowing URL characters. (maybe excluding &, ? and #?)

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

I guess, there would be more commands to consider, which can show up.

@systemoperator
Copy link
Contributor

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Concerning the title, some authors use extraordinary characters there. 🤔

@Zylence
Copy link
Contributor

Zylence commented Sep 21, 2023

LaTeX Parsing Error: Undefined command \textgreater
LaTeX Parsing Error: Undefined command \textbackslash
LaTeX Parsing Error: Undefined command \textbar
LaTeX Parsing Error: Undefined command \textless

I guess, there would be more commands to consider, which can show up.

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: JabRef#646 (comment) (takeaways section)

The latter is probably faster.

@systemoperator
Copy link
Contributor

I have pasted URLs into the URL field, where also some parameters exist, which are concatenated with the &.
Resulting in: Found 2 unescaped '&'.

The same applies to the File field, where the JabRef Browser Extension sends the Links, which can also contain "&" characters.
Resulting in: Found 2 unescaped '&'.

Examples:

@Zylence
Copy link
Contributor

Zylence commented Sep 21, 2023

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Ideas ranked from most to least favorite:

  1. As you said, exclude fields (at least those that are filled in automatically by for example the browser extension)
  2. Separate the latex integrity check (but that would be bad style)
  3. Enforce url encoding as in anything other than a url you should simply be able to escape the problematic characters. (but that's not userfriendly and may cause other similar errors)

Can not really come up with a better idea then the one proposed first. May need a little more time to think.

@Zylence
Copy link
Contributor

Zylence commented Sep 21, 2023

Found 2 unescaped '&'

Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier.

As the problem seems to be predominantly induced by urls, url-encoding as solution now ranks higher in my opinion. We could intercept paste and additionally warn the user.

So for example:
> https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115

becomes:
https%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115

which is manageable.

[Edit]: Just realized that won't make much sense for that case, since the ampersands are not part of the queryparameter.

@Siedlerchr
Copy link
Member

Siedlerchr commented Sep 22, 2023 via email

@koppor
Copy link
Member

koppor commented Sep 22, 2023

Please don't mess with url fields. Rather exclude them. As far as I know
url is also a verbatim field

Then the ampersand checker needs to be updated to exclude verbatim fields. #quickwin

@koppor
Copy link
Member

koppor commented Sep 27, 2023

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: koppor#646 (comment) (takeaways section)

As usual, I would do both ^^ -- PR to https://github.com/davemckain/snuggletex (reference is more for me)

However, I am not 100% sure about the error "Undefined command". We can never be package-complete. On the one hand, I assume the sources come from publishers, which "should" use correct commands. On the other hand, we should add the most common commands to avoid "typos" (\textsupsrcipt vs. \textsuperscript (correct)).

@systemoperator
Copy link
Contributor

@koppor I've checked the latest build.

I would recommend, allowing underscores in DOI fields, since they are not infrequent:

image

In the comment-{username} field, I would recommend allowing any character.

Furthermore, in my opinion, it would make sense to adapt the messages in the integrity check so that "LaTeX Parsing Error" is only printed, if there is REALLY some LaTeX parsing error. (Initially, I thought, something is broken in my file, but actually, everything was fine.) I think, a message like e.g. "LaTeX Parsing Warning" or something different (like "LaTeX Inconsistency", or even "LaTeX Improvement") would be more appropriate, so that a user does not panic, when reading it. Basically any message in the integrity check suggests improvements in terms of inconsistencies, without anything being broken. (Probably, if something is really broken, like a wrongly inserted "}" or "{", in the bib-file, I guess, even Jabref cannot properly parse it anymore.

@koppor
Copy link
Member

koppor commented Oct 9, 2023

@systemoperator I fully agree. Then, my PR #10436 does not work. I need to re-check. Sorry for that many iterations on this. I will also work on the text. I think, LaTeX Warning could be good.

On the one hand, we could classify the errors at https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/resources/uk/ac/ed/ph/snuggletex/core-error-messages.properties between warning and error.

On the other hand, always "Warning" feels currently right.

The issue with the missing } needs to be checked. JabRef currently drops too much entries if too much } are contained. Our parser needs to improved with that regard.

@Siedlerchr
Copy link
Member

@systemoperator I was testing this again today and cannot reproduce your issue. For me, no error related to DOI is raised
e..g DOI: 10.1007/978-3-642-01190-0_17

@koppor
Copy link
Member

koppor commented Oct 20, 2023

@Siedlerchr After three iterations, we managed. 😅🙈🎉🎉

@systemoperator
Copy link
Contributor

I've checked it with my reference library, with more than 500 entries. Everything is printed properly now. 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Normal priority
Development

No branches or pull requests

5 participants