charset of trained models #4
Comments
Thank you for your questions:
I hope this answers your questions.
Hi,
@jannevanderloop thanks for the explanation!
That's because Unicode cannot represent all diacritic glyphs as a precomposed codepoint; some need to be represented as two codepoints (base character + combining 'character'). But IIUC @seuretm answered this one, thx. (BTW, if you choose to represent combining marks as an independent output channel, your decoder should try to enforce Unicode rules.)
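To illustrate the point about precomposed versus combining sequences, here is a minimal Python sketch using only the standard `unicodedata` module. The helper `enforce_combining_rules` is hypothetical and not part of the model's code; it assumes one possible policy, namely that a combining mark is kept only when it directly follows a non-space character, and that the result is then NFC-normalized.

```python
import unicodedata

# Base + combining diaeresis has a precomposed form, so NFC merges it into one codepoint.
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"

# Base + combining small e (U+0364) has no precomposed form; NFC keeps two codepoints.
assert len(unicodedata.normalize("NFC", "a\u0364")) == 2


def is_mark(ch):
    """True for characters in the Unicode general categories Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def enforce_combining_rules(chars):
    """Join decoder outputs, dropping combining marks that lack a base character.

    Assumed policy (hypothetical): a mark at the start of the text or directly
    after whitespace is discarded; everything else is kept and NFC-normalized.
    """
    out = []
    for ch in chars:
        if is_mark(ch) and (not out or out[-1].isspace()):
            continue  # no base character to attach to
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

A decoder that emits marks as separate classes could run its raw output through such a filter before writing text; whether stray marks should be dropped, reattached, or flagged is a design choice this sketch does not settle.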
Fantastic! So you even trained with a reject class. IMO we should make use of this in the OCR-D wrapper: although PAGE-XML does not directly allow representing a gap, we can by convention use an empty TextEquiv (on the Glyph level) here. Or some non-printable character like (Discerning this in the output representation is especially useful for post-correction and indexing BTW.)
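The empty-TextEquiv convention could look roughly like the following sketch, written with Python's standard `xml.etree.ElementTree`. The helper names `make_glyph` and `is_gap` are hypothetical, the namespace is one published PAGE schema version chosen as an assumption, and this is not how the OCR-D wrapper is known to do it.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace version; real documents may declare a different schema date.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def make_glyph(glyph_id, points, text, conf=None):
    """Build a PAGE-XML Glyph element; an empty `text` marks a rejected (gap) position."""
    glyph = ET.Element(f"{{{PAGE_NS}}}Glyph", {"id": glyph_id})
    ET.SubElement(glyph, f"{{{PAGE_NS}}}Coords", {"points": points})
    equiv = ET.SubElement(glyph, f"{{{PAGE_NS}}}TextEquiv")
    if conf is not None:
        equiv.set("conf", str(conf))
    unicode_el = ET.SubElement(equiv, f"{{{PAGE_NS}}}Unicode")
    unicode_el.text = text  # "" stands for the reject class under the assumed convention
    return glyph


def is_gap(glyph):
    """True if the glyph carries an empty TextEquiv/Unicode, i.e. it was rejected."""
    unicode_el = glyph.find(f"{{{PAGE_NS}}}TextEquiv/{{{PAGE_NS}}}Unicode")
    return unicode_el is not None and not (unicode_el.text or "")
```

Post-correction or indexing code could then call `is_gap` to skip or flag such positions instead of treating them as recognized characters.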
ok!
Good to know – always astonishing to learn about these bygone traditions.
Again, fascinating!
A new character set is now handled; you can take a look at it in the charmap.json file.
Thanks! Looks good. I can still see U+0303 (combining superscript e) combined to U+0020 (space), but I gather this is intentional? (I would not expect it to be correct...)
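When checking a report like this, the Unicode name of each codepoint can be looked up with Python's standard `unicodedata` module. In the Unicode character database, U+0303 is COMBINING TILDE, while the combining superscript e is U+0364 (COMBINING LATIN SMALL LETTER E). The sketch below also includes a hypothetical helper, `is_mark_on_space`, for spotting charset entries that attach marks to a space; the entry format (one string per class) is an assumption.

```python
import unicodedata


def describe(codepoint):
    """Format a codepoint as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{codepoint:04X} {unicodedata.name(chr(codepoint))}"


def is_mark_on_space(entry):
    """True if a charset entry is a space followed only by combining marks."""
    return (
        len(entry) > 1
        and entry[0] == " "
        and all(unicodedata.category(ch).startswith("M") for ch in entry[1:])
    )


print(describe(0x0303))
print(describe(0x0364))
print(is_mark_on_space(" \u0364"))
```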
If I load the provided model and dump its character set, I can see a number of combining codepoints which were assigned a code by themselves:
Thus, with
́ ̃ ̍ ͤ
contains 4 combining codepoints. What's the rationale for this? (Why do you represent them independent of their base codepoints in the model, e.g. aͤ oͤ uͤ?)

Also, I can see € in here, which I find odd for historical texts.

Next, ű looks like a mistake (double acute instead of diaeresis/umlaut), but I could be wrong.

Moreover, the single Greek ζ is strange, too.

Finally, is ʒ used for Fraktur z here by any chance? (Is that correct in GT transcription level 3?)
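A character set can be audited for questions like these with a short script. The following is a sketch under stated assumptions: `audit_charset` is a hypothetical helper, the charset is assumed to be a plain list of strings (the model's actual dump format is not shown in this thread), and the sample list is reconstructed from the characters quoted above, with the codepoints of the four combining marks being a best guess.

```python
import unicodedata


def audit_charset(charset):
    """Return (codepoints, names, standalone_mark) for each charset entry.

    `standalone_mark` is True when the entry starts with a combining mark,
    i.e. the mark was given a class of its own without a base character.
    """
    report = []
    for entry in charset:
        cps = [f"U+{ord(ch):04X}" for ch in entry]
        names = [unicodedata.name(ch, "<unnamed>") for ch in entry]
        standalone = bool(entry) and unicodedata.category(entry[0]).startswith("M")
        report.append((cps, names, standalone))
    return report


# Hypothetical excerpt; the first four codepoints are assumed, not confirmed.
sample = ["\u0301", "\u0303", "\u030d", "\u0364",
          "\u20ac", "\u0171", "\u03b6", "\u0292", "a"]

for cps, names, standalone in audit_charset(sample):
    flag = "  <- standalone combining mark" if standalone else ""
    print(" ".join(cps), "|", "; ".join(names) + flag)
```

Printing the names makes look-alikes easy to tell apart, for example U+0171 (double acute) versus U+00FC (diaeresis), or U+0292 (ezh) versus a plain z.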