Validation of ucto output fails due to space character in FoLiA output from Piereling #83
Hmm, this is "interesting". We should think about the best strategy for ucto here. @pirolen I don't see a Linebreak problem in these files...?
I think ucto generated a declaration like this:
about which I got an error message (either from foliavalidator or when using foliapy; I cannot reproduce it right now).
By the way, can one simply call python-ucto on a folia.Paragraph and then access sentences as well as tokens with foliapy?
Right, Piereling first invokes pandoc to convert docx to rst and then it uses rst2folia to convert the rst to FoLiA, so it'd be a two-step process on the command line.
That's not what I reproduced here:
@kosloot So it rejects the ucto output, but the input is valid. That would make it a ucto issue. I see you have even identified the problem already.
Yes, the wrapper should support FoLiA input and output (though the input should be a full document, I think).
Indeed, I stand corrected. So it boils down to determining what to do with the Soft Hyphen. ucto discards those, which always seemed a good plan. But do we want to remove them from the original Alternatively we could choose to
NOTE: libfolia maps a lot of 'space-like' characters to a normal space, so in general ucto will NOT receive the Soft Hyphen at all. But I have to check this. Still, the best solution might be to remove them at an earlier stage, as they are a big pita.
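Which characters count as 'space-like' depends on how a library classifies them. As a quick check that is independent of libfolia (it uses only Python's standard `unicodedata` module and says nothing about libfolia's internals), the sketch below reports the code point, name and general category of the Soft Hyphen. The helper name `describe` is made up for illustration.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


# Compare the Soft Hyphen with an ordinary space.
print(describe(SOFT_HYPHEN))
print(describe(" "))
```

Python reports the general category `Cf` (format) for U+00AD rather than `Cc` (control). A normalization step that treats both categories as control-like would still turn it into a space.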
Addition: Opening this file in Emacs shows a hyphen (-) symbol. more and vi will display a space. less will show
@proycon, in libfolia you added a parameter to the normalize_spaces() function to replace all Control Characters with a single space. The Soft Hyphen IS a Control Character, hence the normalization to a space. So one conclusion is already that the
So I propose the following solution:
This means that, as far as I can see, libfolia needs a small change to exempt the Soft Hyphen from normalize_spaces(). As a consequence of this libfolia adaptation, tools like ucto and frog will create output which seemingly contains spaces (as some tools show a space for a Soft Hyphen). Example, for:
<p xml:id="MWG.p.2">
<t>des Hammer klaviers</t>
</p>
With a Soft Hyphen between 'Hammer' and 'klavier', ucto will create:
<p xml:id="MWG.p.2">
<t>des Hammerklaviers</t>
<s xml:id="MWG.p.2.s.1">
<w xml:id="MWG.p.2.s.1.w.1" class="WORD">
<t>des</t>
</w>
<w xml:id="MWG.p.2.s.1.w.2" class="WORD">
<t>Hammer klaviers</t>
</w>
</s>
</p>
and Frog will produce:
So 'Hammer klavier' is just ONE word (which seems right, in fact). As tools like Mbt, MBMA, MBLEM and such are NOT trained on data containing Soft Hyphens, it is quite possible that the processing of those words is not optimal. Therefore, avoiding them is still the best way. If this becomes a really big issue, we could still decide to adapt Frog to remove them.
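The effect of exempting the Soft Hyphen from space normalization can be illustrated with a small Python sketch. This is NOT libfolia's normalize_spaces(); it is a hypothetical stand-in which assumes that whitespace plus characters of Unicode categories Cc and Cf are mapped to a single space, and the `exempt` parameter is invented here purely for illustration.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"


def normalize_spaces(text, exempt=frozenset()):
    """Hypothetical sketch (not libfolia's code): map whitespace and
    control/format characters to one space, except those in `exempt`."""
    out = []
    for ch in text:
        if ch in exempt:
            out.append(ch)  # keep exempted characters untouched
        elif ch.isspace() or unicodedata.category(ch) in ("Cc", "Cf"):
            out.append(" ")  # assumed behaviour: control-like -> space
        else:
            out.append(ch)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" +", " ", "".join(out)).strip()


word = "Hammer" + SOFT_HYPHEN + "klaviers"

# Without an exemption the word is split by a space.
current = normalize_spaces("des " + word)

# With the Soft Hyphen exempted the word stays in one piece.
proposed = normalize_spaces("des " + word, exempt={SOFT_HYPHEN})

print(repr(current))
print(repr(proposed))
```

Under these assumptions, the first call yields two space-separated parts, while the second keeps the Soft Hyphen inside a single token, which some viewers will still display as a space.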
After converting a document from docx to FoLiA using Piereling (@proycon: I did not find a command line option for such a conversion), the FoLiA document contains (hidden/small) space characters; these in turn cause a validation error in the document that ucto produces.
How could one get rid of these spaces, or make ucto ignore them?
I attach both docs; there are two words with such space characters ('Bologna' and 'Hammerklaviers'), i.e., two validation errors.
The original document is much longer with more of these spaces.
I suspect the spaces might be the result of converting a conditional linebreak or pagebreak character in MS Word. (I cannot open the docx file right now in MS Word, unfortunately.)
Foliavalidator also complained about the Linebreak not being declared.
VALIDATION ERROR on full parse by library (stage 2/3), in mwg-digital-doku/dataextraction-infrastructure/processes-lamachine/seminterpret_docstructure/registers/bla_ucto.folia.xml ParseError: FoLiA exception in handling of <div> @ line 48 (in parent <text> @ parent line 47) : [InconsistentText] Text for <Paragraph at 140596983260272 id=MWG-Gesamtpersonenverzeichnis_2019-09-25.text.div.1.div.2.p.75 set=None class=None>, is inconsistent: EXPECTED (deep text after normalization) *****> Corelli , Arcangelo (I/14) (17.2.1653–8.1.1713). Komponist, Violinvirtuose. Wurde 17jährig in die Academia filarmonica in Bo logna aufgenommen, 1687 „Maestro di Musica“ des Kardinals Benedetto Panfili in Rom. 1700 von Kardinal Pietro Ottoboni, dem Neffen des Papstes Alexander VIII., zum Haupt der Instrumentisten der „Academia di Santa Cecilia“ (somit zum ersten Instrumental-Komponisten Roms) ernannt. Er liegt im Pantheon links neben Raffael begraben. Die Zeitgenossen verehrten ihn als „Princeps musicorum“, „Maestro dei Maestri“ und „Virtuosissimo di Violino e vero Orfeo di nostri tempi“. Kompositionsgeschichtlich bedeutend sind seine Concerti grossi, Trio- und Violinsonaten. ****> BUT FOUND (strict text after normalization) ****> Corelli , Arcangelo (I/14) (17.2.1653–8.1.1713). Komponist, Violinvirtuose. Wurde 17jährig in die Academia filarmonica in Bologna aufgenommen, 1687 „Maestro di Musica“ des Kardinals Benedetto Panfili in Rom. 1700 von Kardinal Pietro Ottoboni, dem Neffen des Papstes Alexander VIII., zum Haupt der Instrumentisten der „Academia di Santa Cecilia“ (somit zum ersten Instrumental-Komponisten Roms) ernannt. Er liegt im Pantheon links neben Raffael begraben. Die Zeitgenossen verehrten ihn als „Princeps musicorum“, „Maestro dei Maestri“ und „Virtuosissimo di Violino e vero Orfeo di nostri tempi“. Kompositionsgeschichtlich bedeutend sind seine Concerti grossi, Trio- und Violinsonaten. 
******* DEVIATION POINT: nica in Bo<*HERE*>logna auf (also checked against older rules prior to FoLiA v2.4.1)
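Because the offending character is invisible in many viewers, it can help to make it visible and, if desired, to strip it before tokenization. The following is a minimal sketch in plain Python, assuming the hidden character is the Soft Hyphen (U+00AD); the function names and the `<SHY>` marker are made up for illustration and are not part of ucto, libfolia or foliapy.

```python
SOFT_HYPHEN = "\u00ad"


def mark_soft_hyphens(text, marker="<SHY>", context=10):
    """Return one snippet per Soft Hyphen, with the character replaced
    by a visible marker and `context` characters on either side."""
    snippets = []
    pos = text.find(SOFT_HYPHEN)
    while pos != -1:
        left = text[max(0, pos - context):pos]
        right = text[pos + 1:pos + 1 + context]
        snippets.append(left + marker + right)
        pos = text.find(SOFT_HYPHEN, pos + 1)
    return snippets


def strip_soft_hyphens(text):
    """Remove every Soft Hyphen so the surrounding word is joined again."""
    return text.replace(SOFT_HYPHEN, "")


sample = "Academia filarmonica in Bo" + SOFT_HYPHEN + "logna aufgenommen"
for snippet in mark_soft_hyphens(sample):
    print(snippet)
print(strip_soft_hyphens(sample))
```

The same `strip_soft_hyphens` call could be applied to the contents of a FoLiA file (read and written as UTF-8) before it is passed to ucto. Whether removing the character from the source text is acceptable is exactly the question discussed in this thread.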
bla.folia.xml.txt
bla_ucto.folia.xml.txt