Strippable affix class regexes #1333

Open
ampli opened this issue Jul 31, 2022 · 3 comments

Comments

@ampli
Member

ampli commented Jul 31, 2022

I finished implementing and testing it, and here are the examples I used:

% TODO: this list should be expanded with other "typical"(?) junk
% that is commonly (?) in broken texts.
-- ‒ – — ― "(" ")" "[" "]" ... ";" ±: MPUNC+;
% Split on comma's, but be careful with numbers:
% "The enzyme has a weight of 125,000 to 130,000"
% Also split on colons, but be careful not to mess up time
% expressions: "The train arrives at 13:42"
"/(?<!\d)[,:]|[,:](?!\d)/": MPUNC+;

In corpus-fixes.batch:

% Test tokenization by affix regexes.
% Sentence that should not be affected.
The enzyme has a weight of 125,000 to 130,000
The train arrives at 13:42
% Sentences that use punctuation without a trailing whitespace.
We used the same colors (red,blue,yellow).
The price of this item:$100
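
To see where the lookaround pattern matches in the four test sentences, here is a minimal Python sketch. It uses Python's `re` module rather than PCRE2, and the helper name `mpunc_split` is hypothetical. It only illustrates which positions the regex selects, not how the tokenizer actually performs the split.

```python
import re

# The MPUNC pattern from the affix file, without the surrounding slashes.
MPUNC_RE = re.compile(r"(?<!\d)[,:]|[,:](?!\d)")

def mpunc_split(word):
    """Split a word at every MPUNC match, keeping the punctuation as tokens.

    Hypothetical helper for illustration only.
    """
    parts = []
    pos = 0
    for m in MPUNC_RE.finditer(word):
        if m.start() > pos:
            parts.append(word[pos:m.start()])
        parts.append(m.group())
        pos = m.end()
    if pos < len(word):
        parts.append(word[pos:])
    return parts

# Numbers and times stay whole, other commas and colons are split off.
for w in ["125,000", "13:42", "(red,blue,yellow).", "item:$100"]:
    print(w, "->", mpunc_split(w))
```

A comma or colon is selected unless it has a digit on both sides, which is why `125,000` and `13:42` are left alone.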

LPUNC and RPUNC also support regexes, and I tested them with /^[[:punct:]]/ and /[[:punct:]]$/ (respectively) in amy.

However, there is a problem: this is supported only when configured with PCRE2. When configured with the C++ regex library, compilation of the lookbehind regex fails (lookbehind is not supported by C++ regexes). POSIX regexes (C library and TRE) also fail. (This is not really a problem for amy etc., since we don't need to support other regex libraries there.)

Possible solutions:

  1. Distribute it with commented-out affix regexes and that's all.
  2. Use autoconf to enable PCRE2 regexes if configured with PCRE2.
  3. Add configuration file support for '#if SOMETHING' when SOMETHING is HAVE_PCRE2_H.
  4. Only support PCRE2 on POSIX systems. (BTW, it is now easy for me to add PCRE2 support on MS-Windows too.)
  5. Add support for regex library specification (easy to implement):
    "/(?<!\d)[,:]|[,:](?!\d)/PCRE2" (or even flag "e" for "extended").

I am for (5) and otherwise for (2) or (1).
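
As a rough illustration of option (5), the sketch below separates a trailing library tag from a regex affix written as `/pattern/TAG`. The function name `parse_affix_regex` and the rule "everything after the last slash is the tag" are assumptions made for this example. They are not taken from the link-grammar code.

```python
def parse_affix_regex(spec):
    """Split '/pattern/TAG' into (pattern, tag); tag is None when absent.

    Hypothetical helper: the last '/' is taken as the closing delimiter,
    so the pattern itself may contain slashes but the tag may not.
    """
    if not spec.startswith("/"):
        raise ValueError("not a regex affix: %r" % spec)
    end = spec.rfind("/")
    if end == 0:
        raise ValueError("missing closing '/': %r" % spec)
    pattern = spec[1:end]
    tag = spec[end + 1:] or None
    return pattern, tag

print(parse_affix_regex(r"/(?<!\d)[,:]|[,:](?!\d)/PCRE2"))
print(parse_affix_regex("/[[:punct:]]$/"))
```

With a tag like this, a build without the named library could skip the entry instead of failing on regex compilation.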

@ampli
Member Author

ampli commented Jul 31, 2022

I am for (5) and otherwise for (2) or (1).

EDIT: Fix the POSIX regex.

I found a better solution, one that all the regex libraries support:
Instead of lookahead/lookbehind, use a capture group for the matching part.
E.g., instead of:
"/(?<!\d)[,:]|[,:](?!\d)/"
use a POSIX regex:
"/\d([,:]|[,:])\d/"

I will change the code to support this too.

EDIT yet again:
"/\D([,:]|[,:])\D/"

EDIT:
\D didn't work for me, but [^^d] did.
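
The capture-group idea can be sketched in Python as follows. The pattern here is an assumption chosen for illustration (it spells the non-digit class as `[^0-9]` and uses one group per alternative) and is not the regex that went into the code. The surrounding character is matched as context, and only the offset of the capture group is used as the split point.

```python
import re

# Context is matched explicitly; the capture group marks the punctuation.
# Illustrative pattern only, not the one used in link-grammar.
CONTEXT_RE = re.compile(r"[^0-9]([,:])|([,:])[^0-9]")

def split_points(word):
    """Return the offsets of punctuation selected by the capture groups."""
    points = []
    for m in CONTEXT_RE.finditer(word):
        group = 1 if m.group(1) is not None else 2
        points.append(m.start(group))
    return points

for w in ["125,000", "13:42", "red,blue,yellow", "item:$100"]:
    print(w, "->", split_points(w))
```

One difference from lookaround is that the context characters are consumed by the match. In `1,a,2`, for example, this sketch reports only the first comma, because the `a` is already used as context for it.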

@ampli
Member Author

ampli commented Jul 31, 2022

@linas,
To solve the split problem you pointed out in your comment on MPUNC, I implemented an MPUNC regex mechanism that uses lookahead/lookbehind (directly or indirectly) in an attempt to avoid splitting numbers with commas or times with colons. It works.

However, there seems to be a simpler solution that doesn't use a regex affix: use : and , in MPUNC, and just don't MPUNC-split words that match a regex (in contrast to morpheme-split, which is done before trying a regex).
I will try to implement that and, for now, keep the MPUNC regex for the sake of any/ady/amy (as a simple split on [[:punct:]]).

Another thing:
The corpus test sentences I used are not good enough: if they are not split as intended, they still parse fine, because the word with an internal colon or comma is looked up as UNKNOWN-WORD. It is hard to find sentences that fail to parse in that case. This is a general problem: sentences with junk in them get parsed just fine.
Does this need a solution?
If so, should we just have a regex category JUNK with no possible linkage, for words with junk in them?

@ampli
Member Author

ampli commented Aug 2, 2022

I said above:

just don't MPUNC-split words that match a regex [...]
I will try to implement that, [...]

If a word contains two kinds of punctuation, one that has to be separated and one that should not, this approach wouldn't work, since the word as a whole either matches a regex or doesn't. So I will send the PR that splits by affix regexes.

EDIT:

just don't MPUNC-split words that match a regex [...]

This was a bad idea since in general such matches have nothing to do with word splits.
