Valency: saving sentence text, token spans #960
Labels: backend, enhancement
Currently we store only the token texts of sentences in valency data. When we need to reconstruct the sentence text, we have no option but to use the tokens themselves, which results in imperfect reconstruction.
E.g. in verb valency instance approval at /valency, we reconstruct sentences by simply joining tokens with spaces, which is not quite right around periods and commas: we get "кутске ай ." instead of "кутске ай." and "пуке , гуртэз" instead of "пуке, гуртэз".
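A minimal illustration of the problem, using the two token sequences quoted above. This is only a sketch of the space-joining behaviour, not the actual code used at /valency:

```python
# Tokens as stored today: punctuation is a separate token,
# and no whitespace information is kept.
tokens_a = ["кутске", "ай", "."]
tokens_b = ["пуке", ",", "гуртэз"]

# Joining by spaces puts a space before every punctuation token.
joined_a = " ".join(tokens_a)
joined_b = " ".join(tokens_b)

print(joined_a)  # кутске ай .
print(joined_b)  # пуке , гуртэз
```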
Or in verb valency case analysis, at perspective view -> Tool -> Verb valency cases, sentences are reconstructed by an ad-hoc algorithm written by Mikhail, see https://github.com/ispras/lingvodoc/blob/b644a0a4256af4fd613b3c2fbf72203e0bed8eb6/lingvodoc/scripts/valency_verb_cases.py#L41.
We should save full sentence texts, making such imperfect reconstructions unnecessary. To do that, we would probably need to modify valency data extraction at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/scripts/export_parser_result.py and in process_eaf() at https://github.com/ispras/lingvodoc/blob/heavy_refactor/lingvodoc/schema/query.py#L16591. Perhaps by adding follow-up text to tokens?
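One way the "follow-up text" idea could look, as a rough sketch. The `Token` class, its `follow_up` field, and both helper functions are hypothetical names for illustration and do not exist in lingvodoc. The sketch assumes the original sentence text is available at extraction time and that tokens occur in it in order:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    follow_up: str = ""  # hypothetical: raw text between this token and the next


def attach_follow_up(sentence: str, token_texts: List[str]) -> Tuple[str, List[Token]]:
    """Return (leading text, tokens), each token carrying the text that follows it."""
    starts = []
    pos = 0
    for text in token_texts:
        start = sentence.index(text, pos)  # raises ValueError if a token is missing
        starts.append(start)
        pos = start + len(text)

    tokens = []
    for i, text in enumerate(token_texts):
        end = starts[i] + len(text)
        next_start = starts[i + 1] if i + 1 < len(token_texts) else len(sentence)
        tokens.append(Token(text, sentence[end:next_start]))

    leading = sentence[:starts[0]] if starts else sentence
    return leading, tokens


def rebuild(leading: str, tokens: List[Token]) -> str:
    """Exact reconstruction: concatenate each token with its follow-up text."""
    return leading + "".join(t.text + t.follow_up for t in tokens)


leading, tokens = attach_follow_up("пуке, гуртэз", ["пуке", ",", "гуртэз"])
print(rebuild(leading, tokens))  # пуке, гуртэз
```

With this layout the sentence text is never stored twice, but it can only be recovered by walking all tokens of the sentence.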
If we store full sentence texts, we should probably also store token spans to indicate where each token is located in the sentence.
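A possible shape for the span data, sketched under assumptions: `token_spans` is a hypothetical helper, spans are half-open `(start, end)` character offsets into the stored sentence text, and tokens are assumed to occur in the sentence in order.

```python
from typing import List, Tuple


def token_spans(sentence: str, token_texts: List[str]) -> List[Tuple[int, int]]:
    """Find each token in order and return its half-open (start, end) offsets."""
    spans = []
    pos = 0
    for text in token_texts:
        start = sentence.index(text, pos)  # raises ValueError if a token is missing
        end = start + len(text)
        spans.append((start, end))
        pos = end
    return spans


sentence = "пуке, гуртэз"
spans = token_spans(sentence, ["пуке", ",", "гуртэз"])
print(spans)  # [(0, 4), (4, 5), (6, 12)]

# Each span slices back to exactly its token.
print([sentence[s:e] for s, e in spans])  # ['пуке', ',', 'гуртэз']
```

Offsets here count Python string characters (code points); if spans are consumed by other components, the unit would need to be agreed on.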