charset of trained models #4

Open
bertsky opened this issue Mar 4, 2023 · 5 comments

@bertsky
Contributor

bertsky commented Mar 4, 2023

If I load the provided model and dump its character set, I can see a number of combining codepoints which were assigned a code by themselves:

' '.join(f.converter.letters.keys())
'  ! " & \' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | § ° ² ´ ¶ · ½ Æ È É Ó Ü ß à á â ä æ ç è é ê ë ì í î ï ò ó ô ö ù ú û ü ÿ đ ę ł Œ œ ů ű ſ ʒ ́ ̃ ̍ ͤ ; ζ ᵱ “ ” „ † ‡ ⁊ € ↄ ꝓ ꝗ ꝛ Ꝝ ꝟ ꝰ /PAD/'

Thus, with ́ ̃ ̍ ͤ the set contains 4 combining codepoints. What's the rationale for this? (Why do you represent them independently of their base codepoints in the model, e.g. aͤ oͤ uͤ?)
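For readers who want to reproduce this check on their own model, here is a minimal sketch that lists the stand-alone combining marks in a character set. The `sample` list is a hypothetical excerpt, not the real model's full set, and `combining_entries` is an illustrative helper name, not part of the project's API.

```python
import unicodedata

def combining_entries(letters):
    """Return (char, codepoint, Unicode name) for every single-codepoint
    entry whose general category is a mark (Mn, Mc or Me)."""
    found = []
    for entry in letters:
        # Multi-character entries such as '/PAD/' are skipped.
        if len(entry) == 1 and unicodedata.category(entry).startswith("M"):
            name = unicodedata.name(entry, "<unnamed>")
            found.append((entry, f"U+{ord(entry):04X}", name))
    return found

# Hypothetical excerpt of a charset (assumption, for illustration only).
sample = ["a", "e", "\u0301", "\u0303", "\u030d", "\u0364", "ſ", "/PAD/"]
for _, code, name in combining_entries(sample):
    print(code, name)
```

With the real model, the list passed in would presumably be `f.converter.letters.keys()` as in the snippet above.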

Also, I can see € in here, which I find odd for historical texts.

Next, ű looks like a mistake (double acute instead of diaeresis/umlaut), but I could be wrong.

Moreover, the single Greek ζ is strange, too.

Finally, is ʒ used for Fraktur z here by any chance? (Is that correct in GT transcription level 3?)

@jannevanderloop

Thank you for your questions:

  • Since I am a book historian, I am not sure about this question. I can only tell you that the characters are made of a diacritic + letter, as suggested by the transcription rules. I don't know why some characters act as two different ones, like ũ, and some act as just one, like è.

  • In our transcriptions we use the € sign to 'transcribe' characters we cannot transcribe or that would turn out like [] on most people's screens - the users of the transcriptions will be book historians and not computer scientists.

  • ű is not a mistake; it refers to a different character than ü, uͤ or ũ.

  • The Greek ζ is weird if you look at it as a Greek ζ. From a different point of view, it is an abbreviation used for -is.

  • No, ʒ is not used for a Fraktur z. A z would be transcribed as such; ʒ is used for words that have this character in the prints themselves. It is another way to write an m at the end of a word. We transcribe to GT transcription level 2, not level 3.

I hope this answers your questions.

@seuretm

seuretm commented Mar 13, 2023

Hi,
At first, we used string iterators to split our data. Besides making the code simpler, this has the advantage of decreasing the number of network outputs. However, it might cause some issues, such as adding diacritic symbols to glyphs that cannot be combined with them, or leading to a wrong CER. So, we have recently started migrating to mapping one glyph to one output, and might still have some inconsistencies in the json file.
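To make the difference concrete, here is a minimal sketch contrasting the two ways of splitting a transcription, and showing how the choice changes the error count that goes into the CER. The function names are hypothetical and this is not the project's actual code; `split_glyphs` simply attaches each combining mark to the character before it, which is an approximation of "one glyph".

```python
import unicodedata

def split_codepoints(text):
    """What plain string iteration yields: one item per codepoint."""
    return list(text)

def split_glyphs(text):
    """Attach each combining mark to the preceding character (approximation)."""
    glyphs = []
    for ch in text:
        if glyphs and unicodedata.combining(ch):
            glyphs[-1] += ch
        else:
            glyphs.append(ch)
    return glyphs

def edit_distance(a, b):
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

reference = "fu\u0364r"   # f, u + combining small e, r
hypothesis = "f\u00fcr"   # f, precomposed u with diaeresis, r

for split in (split_codepoints, split_glyphs):
    ref, hyp = split(reference), split(hypothesis)
    errors = edit_distance(ref, hyp)
    print(split.__name__, f"{errors} error(s) over {len(ref)} symbols")
```

The same single misreading counts as two errors over four symbols at the codepoint level, but as one error over three symbols at the glyph level.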

@bertsky
Contributor Author

bertsky commented Mar 13, 2023

@jannevanderloop thanks for the explanation!

  • I don't know why some characters act as two different ones like ũ and some act as just one like è.

That's because Unicode cannot represent all glyphs with diacritics as a single precomposed codepoint; some need to be represented as two codepoints (base character + combining 'character').
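A quick way to see this is to run NFC normalization over a few base + mark pairs and count the codepoints that remain. This is only an illustrative sketch using Python's standard `unicodedata` module; `nfc_codepoints` is a hypothetical helper name.

```python
import unicodedata

def nfc_codepoints(text):
    """Normalize to NFC and return the resulting codepoints as U+XXXX strings."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(c):04X}" for c in composed]

pairs = {
    "e + combining grave": "e\u0300",
    "u + combining tilde": "u\u0303",
    "u + combining small e": "u\u0364",
}
for label, decomposed in pairs.items():
    print(label, "->", nfc_codepoints(decomposed))
```

Here e + grave and u + tilde each collapse to one precomposed codepoint, while u + combining small e stays as two because Unicode defines no precomposed form for it. Since ũ does have a precomposed form (U+0169), whether it shows up as one or two codepoints depends on how the transcriptions were normalized.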

But IIUC @seuretm answered this one, thx.

(BTW, if you choose to represent combining characters as independent output channels, your decoder should try to enforce Unicode rules.)
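One possible reading of this suggestion, sketched as a post-processing step on the decoded string: drop any combining mark that has no base character to attach to, then compose what can be composed. The policy (a mark is "stray" at the start of the string or directly after whitespace) and the name `drop_stray_marks` are assumptions made for illustration, not something the project is known to implement.

```python
import unicodedata

def drop_stray_marks(decoded):
    """Remove combining marks without a base character, then apply NFC.

    Assumed policy: a mark is stray if it is the first character or if the
    character kept just before it is whitespace.
    """
    out = []
    for ch in decoded:
        is_mark = unicodedata.combining(ch) != 0
        has_base = bool(out) and not out[-1].isspace()
        if is_mark and not has_base:
            continue  # nothing to attach to: skip this output
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

raw = "\u0303ab \u0364cu\u0364"   # stray tilde at start, stray small e after space
print(repr(drop_stray_marks(raw)))
```

A stricter variant could also restrict which marks are allowed on which base letters, but that needs knowledge of the transcription rules rather than of Unicode alone.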

  • In our transcriptions we use the € sign to 'transcribe' characters we cannot transcribe or that would turn out like [] on most people's screens - the users of the transcriptions will be book historians and not computer scientists.

Fantastic! So you even trained with a reject class.

IMO we should make use of this in the OCR-D wrapper: although PAGE-XML does not directly allow representing gaps, we can by convention use an empty TextEquiv (on the Glyph level) here. Or some non-printable character like \a (ASCII bell), ASCII SUB or the unit separator?

(Discerning this in the output representation is especially useful for post-correction and indexing BTW.)
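A minimal sketch of the first convention, assuming the recognizer already yields one string per glyph and that € is the reject symbol as described above. The function name and parameters are hypothetical and not part of any OCR-D API; the returned strings would be what ends up in each Glyph's TextEquiv.

```python
def glyph_text_equivs(glyphs, reject="€", placeholder=""):
    """Map per-glyph predictions to output texts, turning the reject
    symbol into a placeholder (empty string by default)."""
    return [placeholder if glyph == reject else glyph for glyph in glyphs]

predicted = ["a", "b", "€", "c"]      # hypothetical per-glyph output
print(glyph_text_equivs(predicted))    # gap becomes an empty text
```

One point in favour of the empty-string convention: BEL (U+0007), SUB (U+001A) and the unit separator (U+001F) all fall outside the character range that XML 1.0 permits, so they could not be written into a PAGE-XML file as literal characters.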

  • ű is not a mistake; it refers to a different character than ü, uͤ or ũ.

ok!

  • The Greek ζ is weird if you look at it as a Greek ζ. From a different point of view, it is an abbreviation used for -is.

Good to know – always astonishing to learn about these bygone traditions.

  • No, ʒ is not used for a Fraktur z. A z would be transcribed as such; ʒ is used for words that have this character in the prints themselves. It is another way to write an m at the end of a word.

Again, fascinating!

@GemCarr
Collaborator

GemCarr commented Mar 14, 2023

A new character set is now handled; you can take a look at it in the charmap.json file.

@bertsky
Contributor Author

bertsky commented Mar 14, 2023

A new character set is now handled; you can take a look at it in the charmap.json file.

Thanks! Looks good.

I can still see U+0303 (combining superscript e) combined with U+0020 (space), but I gather this is intentional? (I would not expect it to be correct...)
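A check like this can be automated. The sketch below assumes, without having verified it against the repository, that charmap.json is a JSON object whose keys are the character strings. The inline `sample_json` is a made-up stand-in for the real file, and `marks_on_whitespace` is a hypothetical helper.

```python
import json
import unicodedata

def marks_on_whitespace(entries):
    """Return the entries in which a combining mark directly follows whitespace."""
    flagged = []
    for entry in entries:
        for prev, ch in zip(entry, entry[1:]):
            if prev.isspace() and unicodedata.combining(ch):
                flagged.append(entry)
                break
    return flagged

# Made-up stand-in for charmap.json (assumed layout: string -> integer code).
sample_json = '{"a": 1, "u\\u0364": 2, " \\u0303": 3}'
charmap = json.loads(sample_json)

for entry in marks_on_whitespace(charmap):
    print([f"U+{ord(c):04X}" for c in entry])
```

For the real file, `charmap` would be loaded with `json.load` from an opened charmap.json instead of the inline string.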
