More amy/anysplit modifications #1337

ampli · 2022-08-12T16:01:57Z

This patch makes anysplit.c potentially grapheme-aware. It works only with the PCRE2 library.
I added a new definition to the affix file, atomic-unit, which defines a sequence that should not be split. Initially I defined it as \X. However, I noticed that a word sometimes starts with mark characters, which may render badly because they have no base character to modify. I don't know the reason for this (I have no idea whether it is a bug or a feature in PCRE2 or in Unicode). So
I changed it to \X\pM*. (A better name than atomic-unit may be needed to prevent confusion with the atomspace...).

I also simplified the regexes in amy/4.0.regex. I have left there the non-PCRE2 regexes, commented out.
In the affix file (through any/affix-punc) I added [[:punct:]] regexes for RPUNC/LPUNC/MPUNC, to strip off all types of punctuation (of course, this splits numbers and times too). I don't know how good an idea this is, but it is optional and can be modified. I have left the multi-character punctuation (e.g. ...).

Due to the multi-character punctuation, I added the following, which is actually a bug fix to my previous affix-related modifications:

afdict_init(): Validate affixes w/dictionary_word_is_known().

Note that my new ANY_PUNCT accepts subscripted punctuation, and if subscripts are used, this makes it possible to know from which
side they were stripped. However, I now see that due to my syntax decision for affix regexes, this cannot be used with them.
So maybe it would be an improvement to change their syntax to /regex/\1/ after all (then /regex/\1.y/ could be used to add a subscript and, for example, /regex/\1/a could specify splitting as an alternative, instead of as a replacement like now).

Refs: Issues #1334, #1333, #1315; PRs #1334, #1329, #1321.

linas · 2022-08-13T09:15:25Z

Thank you!

ampli added 23 commits August 12, 2022 17:20

amy/4.0.affix,amy/4.0.regex: Simplify the regexes

5825dd2

This doesn't work yet for splitting on grapheme boundaries, because ^X matches at least one codepoint, so it matches a split initial morpheme in a part. This change is needed for the upcoming new code to split at grapheme boundaries.

anysplit(): Move the sanity checks to the start

36901d0

free_anysplit(): Move it to be near its usage

f3c7d8a

morpheme_match(): Rename prefix_string to word_part

8df81c5

anysplit(): Remove commented-out code line

b22706c

anysplit.c: Fix a comment rot

97fd0c7

anysplit.c: Rename p_start to p_end

73eaff6

morpheme_match(): Update description

af8e481

anysplit.c: Define D_ANYS as the verbosity level for this file

1576152

anysplit(): Move 0 length check to the start

128fccd

anysplit.c: Include pcre2.h

53dd29f

anysplit.c: Add data structure for grapheme separation

aa3e7f1

anysplit.c: Add functions for grapheme separation

3193e89

anysplit.c: Add ability to split on grapheme boundaries

cf6a498

anysplit.c: Add #define atomic-unit instead of a hardcoded value

c295754

amy/4.0.affix: Remove regexes for REG*

74a1468

No need for them after the grapheme-aware separation modification.

amy/4.0.regex: Include trailing mark codepoints in atomic-unit

54423ab

This way morpheme candidates (split parts) do not start with marks. This looks nicer and gives fewer splits. I don't know whether it is more useful.

anysplit.c: Change p_end to unsigned int

e236030

any/affix_punc: Replace one-char affixes by [[:punct:]]

82a97b0

any/affix_punc: Add comments

ff92f6a

amy/4.0.regex: Accept subscripted punctuation

8b9f32e

amy/4.0.regex: Update the comments

9fb80f0

afdict_init(): Validate affixes w/dictionary_word_is_known()

56247cb

...instead of dict_has_word(), to allow punctuation that matches a regex.

linas merged commit 1744562 into opencog:master Aug 13, 2022
