IMO we still lack a convention for representing illegible substrings. DTABf (TEI) uses the `<gap>` element for this.
Since there is a dependency chain from GT to OCR training to OCR inference to OCR postcorrection, we should make this as concrete as possible without breaking existing habits. For example, public GT datasets often use € or £ directly in the string to mark an illegible span. The obvious downside is that these substitutes can be confused with genuine occurrences of the same characters.
If possible, we could also try to enforce a non-printable character such as ASCII bell (BEL, U+0007), substitute (SUB, U+001A) or unit separator (US, U+001F). In the simplest form, we just use the empty string, but that only works when transcribing on character level, and OCR is trained on line level.
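To make the non-printable option more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that ASCII SUB (U+001A) is the agreed marker for one illegible span. The names `ILLEGIBLE`, `has_illegible`, `to_tei` and `to_visible` are hypothetical, and the `reason="illegible"` attribute is just one plausible way to fill in a TEI `<gap>`, not something this issue has settled on.

```python
from xml.sax.saxutils import escape

# Assumption: ASCII SUB (U+001A) marks one illegible span in a GT line.
ILLEGIBLE = "\x1a"


def has_illegible(line: str) -> bool:
    """Return True if the line contains at least one illegible marker."""
    return ILLEGIBLE in line


def to_tei(line: str) -> str:
    """Escape XML special characters, then turn each marker into a <gap/>."""
    return escape(line).replace(ILLEGIBLE, '<gap reason="illegible"/>')


def to_visible(line: str, symbol: str = "\u241a") -> str:
    """Swap the marker for a visible symbol (default: SYMBOL FOR SUBSTITUTE)."""
    return line.replace(ILLEGIBLE, symbol)


line = "Die \x1a Stadt & ihr Umland"
print(has_illegible(line))
print(to_tei(line))
print(to_visible(line))
```

Because the marker cannot occur in ordinary text, a training pipeline could filter or mask such lines without the ambiguity that € or £ would introduce. Whether OCR engines and editors handle a control character gracefully would still need to be checked.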