More amy/anysplit modifications #1337

Merged: 23 commits, Aug 13, 2022

Commits
5825dd2 amy/4.0.affix,amy/4.0.regex: Simplify the regexes (ampli, Jul 8, 2022)
36901d0 anysplit(): Move the sanity checks to the start (ampli, Jul 8, 2022)
f3c7d8a free_anysplit(): Move it to be near its usage (ampli, Jul 8, 2022)
8df81c5 morpheme_match(): Rename prefix_string to word_part (ampli, Jul 8, 2022)
b22706c anysplit(): Remove commented-out code line (ampli, Jul 8, 2022)
97fd0c7 anysplit,c: Fix a comment rot (ampli, Jul 9, 2022)
73eaff6 anysplit.c: Rename p_start to p_end (ampli, Jul 11, 2022)
af8e481 morpheme_match(): Update description (ampli, Jul 9, 2022)
1576152 anysplit.c: Define D_ANYS as the verbosity level for this file (ampli, Jul 10, 2022)
128fccd anysplit(): Move 0 length check to the start (ampli, Jul 11, 2022)
53dd29f anysplit.c: Include pcre2.h (ampli, Jul 11, 2022)
aa3e7f1 anysplit.c: Add data structure for grapheme separation (ampli, Jul 11, 2022)
3193e89 anysplit.c: Add functions for grapheme separation (ampli, Jul 11, 2022)
cf6a498 anysplit.c: Add ability to split on grapheme boundaries (ampli, Jul 11, 2022)
c295754 anysplit.c: Add #define atomic-unit instead of a hardcoded value (ampli, Jul 11, 2022)
74a1468 amy/4.0.affix: Remove regexes for REG* (ampli, Jul 11, 2022)
54423ab amy/4.0.regex: Include trailing mark codepoints in atomic-unit (ampli, Jul 11, 2022)
e236030 anysplit.c: Change p_end to unsigned int (ampli, Jul 11, 2022)
82a97b0 any/affix_punc: Replace one-char affixes by [[:punct:]] (ampli, Aug 10, 2022)
ff92f6a any/affix_punc: Add comments (ampli, Aug 10, 2022)
8b9f32e amy/4.0.regex: Accept subscripted punctuation (ampli, Aug 10, 2022)
9fb80f0 amy/4.0.regex: Update the comments (ampli, Aug 10, 2022)
56247cb afdict_init(): Validate affixes w/dictionary_word_is_known() (ampli, Aug 12, 2022)
16 changes: 10 additions & 6 deletions data/amy/4.0.affix
@@ -17,6 +17,13 @@

 % Anysplit parameters

+% A PCRE2 regex defining a character sequence that shouldn't get split.
+% The LG library must be configured with PCRE2 in order to use it. If not,
+% or if this definition is missing, a single UTF-8 codepoint is used as a
+% byte sequence that should not get split.
+%#define atomic-unit "\X"; % split at grapheme boundaries.
+#define atomic-unit "\X\pM*"; % ... but include trailing mark codepoints.
+
 % Maximum number of word partitions
 % FYI: 3 barely works, 4 and higher mostly do not work.
 % 6: REGPARTS+;
@@ -47,16 +54,13 @@
 % For ASCII input, the empty regexes can be used.
 % See the comments in 4.0.affix.

-%"" : REGPRE+;
-"^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*$" : REGPRE+;
+"" : REGPRE+;

 % Regex to match the middle parts.
-%"" : REGMID+;
-"^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*$" : REGMID+;
+"" : REGMID+;
 %".{2,}": REGMID+;

 % Regex to match the suffix.
-%"" : REGSUF+;
-"^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*$" : REGSUF+;
+"" : REGSUF+;

 % End of Anysplit parameters.
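
The fallback described in the atomic-unit comment (a single UTF-8 codepoint as the unit that must not be split) can be illustrated with a minimal C sketch. The helper name `codepoint_ends` is hypothetical and is not taken from anysplit.c; it assumes valid UTF-8 input and only shows where split points could fall when no PCRE2 atomic-unit regex is in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from anysplit.c): store in ends[] the byte
 * offset at which each UTF-8 codepoint of s ends.
 * Returns the number of codepoints, or 0 if they do not fit in max_ends
 * (an empty string also yields 0). Assumes s is valid UTF-8. */
size_t codepoint_ends(const char *s, size_t *ends, size_t max_ends)
{
    assert(s != NULL && ends != NULL);

    size_t n = 0;
    size_t len = strlen(s);

    for (size_t i = 1; i <= len; i++)
    {
        /* A codepoint ends at offset i when the byte at i is not a
         * continuation byte (10xxxxxx). The terminating NUL qualifies. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
        {
            if (n == max_ends) return 0;
            ends[n++] = i;
        }
    }
    return n;
}
```

For the input `"e\xCC\x81"` (the letter e followed by U+0301, a combining acute accent) this reports two boundaries, so a codepoint-level split could separate the accent from its base letter. A grapheme-based atomic unit such as `"\X"` is meant to avoid that case.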
56 changes: 27 additions & 29 deletions data/amy/4.0.regex
@@ -8,38 +8,36 @@
 % The regexes here use the PCRE2 pattern syntax.
 % The LG library must be configured with PCRE2 in order to use them.

-% \X matches any Unicode grapheme.
-% (?:(?=\p{Xan}) specifies that it should start with a letter or number.
-% Similarly, \pM allows it to start with a mark character.
-% Since most of the script-specific punctuation characters are not in
-% the affix-punc file, they are allowed here to join to the end word/parts
-% Most probably these regexes still reject valid word graphemes in some languages.
+% \X matches any Unicode grapheme. \x03 matches the internal representation
+% of the dot in STEMSUBSCR (See 4.0.affix).
 %
 % For information on graphemes see: http://www.unicode.org/reports/tr29/

-% Want to match apostrophes, for abbreviations (I'm I've, etc.) since
-% these cannot be auto-split with the current splitter.
-% Hyphenated words, and words with underbars in them, get split.
-ANY-WORD: /^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*$/
-ANY-PUNCT: /^[[:punct:]]+$/
-
-% Multi-part random morphology: match any string as prefix, stem, or
-% suffix.
-% \x03 matches the internal representation of the dot in STEMSUBSCR
-% (See 4.0.affix).
-
-MOR-STEM: /^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*\x03=$/
-MOR-PREF: /^(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*=$/
-MOR-SUFF: /^=(?=\p{Xan})\X(?:(?=\p{Xan}|\pM|\p{Po})\X)*/
-
-% For ASCII input, the following is enough (and it works even if the
-% LG library is configured with a regex library other then PCRE2).
-% To use it, uncomment it out and comment out the previous definitions.
-% ANY-WORD: /^[[:alnum:]']+$/
-% ANY-PUNCT: /^[[:punct:]]+$/
-% MOR-PREF: /^[[:alnum:]']+=$/
-% MOR-STEM: /^[[:alnum:]']+.=$/
-% MOR-SUFF: /^=[[:alnum:]']+$/
+% Punctuation characters get stripped from the start and end of words,
+% and words that contain punctuation get split at them. See the
+% "any/affix-punc" file.
+% These punctuation characters will match here. The \x03 is to match
+% subscripted punctuation that may be specified in this file.
+ANY-PUNCT: /^[[:punct:]]+(?:\x03|$)/
+
+% Multi-part random morphology: match any string as prefix, stem, or suffix.
+
+MOR-STEM: /^\X+\x03=$/
+MOR-PREF: /^\X+=$/
+MOR-SUFF: /^=\X+/
+
+% Reject anything that contains punctuation, so that the tokenizer will
+% have a chance to split them off as affixes.
+% Most of the script-dependent punctuation characters are not mentioned in
+% the "any/affix-punc" file and thus may be included in words.
+ANY-WORD: /^[^[:punct:]]+$/
+
+% For ASCII input and non-PCRE2 regex libraries you can use these instead:
+% ANY-WORD: /^[[:alnum:]]+$/
+% ANY-PUNCT: /^[[:punct:]].*$/ % The .* is to match an optional subscript.
+% MOR-PREF: /^[[:alnum:]]+=$/
+% MOR-STEM: /^[[:alnum:]]+.=$/
+% MOR-SUFF: /^=[[:alnum:]]+$/

 % Match anything that doesn't match the above.
 % Match anything that isn't white-space.
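
The commented-out ASCII alternatives use only POSIX character classes, so they can be tried with the POSIX regex API. The following is a minimal sketch, assuming `<regex.h>` is available. The helper name `matches` is hypothetical, and this is not how the LG library loads or applies its regex file.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if word matches the POSIX extended
 * regular expression in pattern, false otherwise (including when the
 * pattern fails to compile). */
bool matches(const char *pattern, const char *word)
{
    assert(pattern != NULL && word != NULL);

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return false;

    bool ok = (regexec(&re, word, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With these patterns, `matches("^[[:alnum:]]+$", "hello")` holds while `matches("^[[:alnum:]]+$", "hello,")` does not, which is the behavior the comment describes: a word that still carries punctuation is rejected so that the tokenizer can split the punctuation off first.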
26 changes: 13 additions & 13 deletions data/any/affix-punc
@@ -1,13 +1,13 @@
-")" "}" "]" ">" » 〉 ) 〕 》 】 ] 』」 """ "’’" "’" ''.y '.y
-"%" "," "." 。 ":" ";" "?" "!" ‽ ؟ ?! ….y ....y "”"
-_ - ‐ ‑ ‒ – — ― ~ ━ ー 、
-¢ ₵ ™ ℠ : RPUNC+;
-
-"(" "{" "[" "<" « 〈 ( 〔 《 【 [ 『 「 """ `` „ “ ‘ ''.x '.x ….x ....x
-¿ ¡ "$"
-_ - ‐ ‑ ‒ – — ― ━ ー ~
-£ ₤ € ¤ ₳ ฿ ₡ ₢ ₠ ₫ ৳ ƒ ₣ ₲ ₴ ₭ ₺ ℳ ₥ ₦ ₧ ₱ ₰ ₹ ₨ ₪ ﷼ ₸ ₮ ₩ ¥ ៛ 호점
-† †† ‡ § ¶ © ® ℗ № "#": LPUNC+;
-
--- ‒ – — ― - _ "(" ")" "[" "]" ... … "," ";" ":"
-': MPUNC+;
+% Affixes get stripped off the left and right sides of words,
+% i.e. spaces are inserted between the affix and the word itself.
+
+% An LPUNC/RPUNC/MPUNC token can be specified as "/regex/.\N", where \N is
+% the capture group that should match the affix (the whole pattern is
+% capture group 0). Regardless of the position in which they appear, they
+% are checked last - but in the same order. (Experimental.)
+
+"’’" ''.y ….y ....y "/[[:punct:]]$/.\0": RPUNC+;
+
+`` ''.x ….x ....x †† "/^[[:punct:]]/.\0": LPUNC+;
+
+-- ... … "/[[:punct:]]/.\0" ': MPUNC+;
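
The effect of the `"/^[[:punct:]]/.\0"` (LPUNC) and `"/[[:punct:]]$/.\0"` (RPUNC) entries can be approximated for ASCII input with a minimal C sketch. The helper name `strip_punct` is hypothetical and the loop is an assumption about the net result, not the tokenizer's actual algorithm. It ignores MPUNC (punctuation inside a word) and the subscripted and multi-character affixes listed in the file.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate the core of an ASCII token by skipping
 * leading and trailing punctuation characters.
 * Sets *start to the byte offset of the core and returns its length
 * (0 if the token consists only of punctuation). */
size_t strip_punct(const char *token, size_t *start)
{
    assert(token != NULL && start != NULL);

    size_t begin = 0;
    size_t end = strlen(token);

    /* Left side: what an LPUNC entry like /^[[:punct:]]/ would strip. */
    while (begin < end && ispunct((unsigned char)token[begin])) begin++;

    /* Right side: what an RPUNC entry like /[[:punct:]]$/ would strip. */
    while (end > begin && ispunct((unsigned char)token[end - 1])) end--;

    *start = begin;
    return end - begin;
}
```

For the token `(hello),` the core starts at offset 1 and has length 5. The apostrophe in `don't` is left alone by this sketch, since punctuation in the middle of a word is the job of the MPUNC entries.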
2 changes: 1 addition & 1 deletion link-grammar/dict-common/dict-impl.c
@@ -805,7 +805,7 @@ bool afdict_init(Dictionary dict)

 	for (int n = 0; n < ac->length - ac->Nregexes; n++)
 	{
-		if (!dict_has_word(dict, ac->string[n]))
+		if (!dictionary_word_is_known(dict, ac->string[n]))
 		{
 			if (!not_in_dict)
 			{