charset of trained models #4

bertsky · 2023-03-04T17:10:50Z

If I load the provided model and dump its character set, I can see a number of combining codepoints which were assigned a code by themselves:

' '.join(f.converter.letters.keys())
'  ! " & \' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z | § ° ² ´ ¶ · ½ Æ È É Ó Ü ß à á â ä æ ç è é ê ë ì í î ï ò ó ô ö ù ú û ü ÿ đ ę ł Œ œ ů ű ſ ʒ ́ ̃ ̍ ͤ ; ζ ᵱ “ ” „ † ‡ ⁊ € ↄ ꝓ ꝗ ꝛ Ꝝ ꝟ ꝰ /PAD/'

Thus, with ́ ̃ ̍ ͤ the set contains 4 combining codepoints. What's the rationale for this? (Why do you represent them independently of their base codepoints in the model, e.g. aͤ oͤ uͤ?)
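For anyone who wants to reproduce this check, here is a minimal sketch using only the standard `unicodedata` module. The helper name `combining_codepoints` and the `sample` list are made up for illustration; with the real model one would pass `f.converter.letters.keys()` instead.

```python
import unicodedata

def combining_codepoints(charset):
    """Return (codepoint, name) for each single-codepoint entry that is a combining mark."""
    found = []
    for entry in charset:
        # General categories Mn/Mc/Me all start with "M" (marks).
        if len(entry) == 1 and unicodedata.category(entry).startswith("M"):
            found.append((f"U+{ord(entry):04X}", unicodedata.name(entry, "<unnamed>")))
    return found

# Hypothetical excerpt of the dumped keys, not the full character set.
sample = ["a", "ſ", "\u0301", "\u0303", "\u030d", "\u0364", "/PAD/"]
for code, name in combining_codepoints(sample):
    print(code, name)
```

Multi-character entries such as `/PAD/` are skipped on purpose, since they are special tokens rather than codepoints.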

Also, I can see € in here, which I find odd for historical texts.

Next, ű looks like a mistake (double acute instead of diaeresis/umlaut), but I could be wrong.

Moreover, the single Greek ζ is strange, too.

Finally, is ʒ used for Fraktur z here by any chance? (Is that correct in GT transcription level 3?)


jannevanderloop · 2023-03-13T13:10:52Z

Thank you for your questions:

Since I am a book historian, I am not sure about this question. I can only tell you, that the characters are made of a diacritic + letter as suggested by the transcription rules. I don't know why some characters act as two different ones like ũ and some act as just one like è.
In our transcriptions we use the € sign to 'transcribe' characters we cannot transcribe or that would turn out like [] on most people's screens - the users of the transcriptions will be book historians and not computer scientists.
ű is not a mistake, these refer to a different character than ü, uͤ or ũ.
The Greek ζ is weird if you look at it as a Greek ζ. From a different point of view it is an abbreviation used for -is.
No, ʒ is not used for a Fraktur z. A z would be transcribed as such; ʒ is used for words that have this character in the printings themselves. It is another way to write an m at the end of a word. We transcribe to GT transcription level 2, not level 3.

I hope this answers your questions.

seuretm · 2023-03-13T14:05:43Z

Hi,
At first, we used string iterators to split our data. Besides making the code simpler, this has the advantage of decreasing the number of network outputs. However, it can cause issues, such as adding diacritic symbols to glyphs that cannot be combined with them, or leading to a wrong CER. So we have recently started migrating to mapping one glyph to one output, and the json file might still contain some inconsistencies.
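To illustrate the difference between the two splitting strategies, here is a rough sketch; it is not the project's actual code. The function `split_glyphs` is a hypothetical helper that attaches each combining mark to the preceding codepoint. It is a simplification and not a full grapheme cluster segmentation.

```python
import unicodedata

def split_glyphs(text):
    """Group each combining mark with the codepoint before it (simplified clustering)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # extend the previous glyph
        else:
            clusters.append(ch)     # start a new glyph
    return clusters

word = "gu\u0364t"                  # "gut" with a superscript e on the u
per_codepoint = list(word)          # what a plain string iterator yields
per_glyph = split_glyphs(word)

print(len(per_codepoint), len(per_glyph))
```

With per-codepoint splitting the word has four symbols, with per-glyph splitting three, so the same recognition error is weighted differently when computing CER.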

bertsky · 2023-03-13T17:15:23Z

@jannevanderloop thanks for the explanation!

> I don't know why some characters act as two different ones like ũ and some act as just one like è.

That's because Unicode cannot represent all diacritic glyphs as a precombined codepoint; some need to be represented as two codepoints (base character + combining 'character').
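A small sketch of this using NFC normalization from the standard library (`nfc_length` is just an illustrative helper name). Whether a given transcription actually stores the composed form depends on how the text was normalized, which this thread does not state.

```python
import unicodedata

def nfc_length(base, mark):
    """Number of codepoints left after composing base + combining mark under NFC."""
    return len(unicodedata.normalize("NFC", base + mark))

# e + combining grave accent composes to the single codepoint U+00E8.
print(nfc_length("e", "\u0300"))

# u + combining small letter e (U+0364) has no precomposed form and stays as two codepoints.
print(nfc_length("u", "\u0364"))
```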

But IIUC @seuretm answered this one, thx.

(BTW, if you choose to represent combining codepoints as independent output channels, your decoder should try to enforce Unicode rules.)
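One possible reading of that, as a hedged sketch: a post-decoding filter that drops combining marks without a plausible base. Treating only letters as valid bases is an assumption made here for illustration (Unicode itself permits marks on other bases), and `drop_orphan_marks` is a hypothetical name, not part of any existing decoder.

```python
import unicodedata

def drop_orphan_marks(text):
    """Remove combining marks that do not follow a letter (heuristic, see assumption above)."""
    out = []
    base_ok = False                  # True while the current base is a letter
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            if base_ok:
                out.append(ch)       # keep the mark; stacked marks stay allowed
            # otherwise: the mark has no valid base, so it is dropped
        else:
            out.append(ch)
            base_ok = cat.startswith("L")
    return "".join(out)

decoded = " \u0303a\u0364"           # tilde on a space, superscript e on "a"
cleaned = drop_orphan_marks(decoded)
```

A stricter variant could reject the hypothesis during beam search instead of filtering afterwards.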

> In our transcriptions we use the € sign to 'transcribe' characters we cannot transcribe or that would turn out like [] on most people's screens - the users of the transcriptions will be book historians and not computer scientists.

Fantastic! So you even trained with a reject class.

IMO we should make use of this in the OCR-D wrapper: although PAGE-XML does not directly allow representing a gap, we could by convention use an empty TextEquiv (on the Glyph level) here. Or some non-printable character like \a (ASCII bell), ASCII SUB, or the unit separator?

(Discerning this in the output representation is especially useful for post-correction and indexing BTW.)
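A minimal sketch of the idea in plain Python, without touching any OCR-D or PAGE-XML API. The function `mark_gaps` is hypothetical, and using ASCII SUB as the default placeholder is just one of the options floated above.

```python
def mark_gaps(text, reject="€", placeholder="\x1a"):
    """Replace the reject symbol by a placeholder and report where the gaps are."""
    positions = [i for i, ch in enumerate(text) if ch == reject]
    return text.replace(reject, placeholder), positions

line, gaps = mark_gaps("ab€c€")
print(gaps)
```

Passing `placeholder=""` would correspond to the empty-TextEquiv convention; the returned positions then refer to the original string, not the shortened one.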

> ű is not a mistake, these refer to a different character than ü, uͤ or ũ.

ok!

> The Greek ζ is weird if you look at it as a Greek ζ. From a different point of view it is an abbreviation used for -is.

Good to know – always astonishing to learn about these bygone traditions.

> No, ʒ is not used for a Fraktur z. A z would be transcribed as such; ʒ is used for words that have this character in the printings themselves. It is another way to write an m at the end of a word.

Again, fascinating!

GemCarr · 2023-03-14T13:55:18Z

A new character set is now handled; you can take a look at it in the charmap.json file.

bertsky · 2023-03-14T14:37:20Z

> A new character set is now handled; you can take a look at it in the charmap.json file.

Thanks! Looks good.

I can still see U+0303 (combining superscript e) combined to U+0020 (space), but I gather this is intentional? (I would not expect it to be correct...)
