Stripping affix-class tokens #1330
I implemented an affix-class token dict check, and I get the following:
These tokens will be classified as UNKNOWN-WORD and appear with subscripts
EDIT: I will of course fix the buggy argument order in this error message.
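For reference, a minimal sketch of what such an existence check could look like; the types and the dict_has_word() lookup are stand-ins invented for illustration, not the actual link-grammar internals:

```c
/* Hedged sketch of an affix-class token existence check; dict_has_word()
 * is a stand-in for the real dictionary lookup. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

extern bool dict_has_word(const char *token);  /* assumption */

/* Warn about affix-class tokens (LPUNC/MPUNC/RPUNC/...) missing from the dict. */
static void check_affix_class(const char *class_name,
                              const char *const *tokens, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (!dict_has_word(tokens[i]))
            fprintf(stderr,
                    "Warning: %s token \"%s\" is not in the dict; "
                    "it will be classified as UNKNOWN-WORD\n",
                    class_name, tokens[i]);
    }
}
```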
Question: see also link-grammar/link-grammar/dict-common/regex-morph.c, lines 239 to 244 in 46a2d31.
Here it currently issues an error, doesn't remove it from the list, but ignores it if it matches. I'm for (2) or (3).
Yes.
OK
No opinion.
Option 3 is OK. I can fix the existing dictionaries. I'm surprised by these errors ... I'm looking now.
Since I added regex support to affix stripping (an upcoming PR), there is not much need to check the any/amy affixes.
What happens if an affix appears twice in the dict, e.g. ...?
It seems QUOTES and BULLETS are only used in ...
You already answered it above, sorry...
I just patched English in #1331 -- it's a minimalist fix, I didn't get fancy.
I don't understand this remark. BTW -- they're "bullets", like "bullet points". I'm not sure, but I think that gun bullets are named after the typographical mark (??), from the French/Latin "bull", or stamp ("papal bull").
Consider this (see QUOTES and BULLETS at the end of link-grammar/link-grammar/tokenize/tokenize.c, lines 1550 to 1583 in ffc6529). Similarly, ...
---> Proposal 1: Convert QUOTES/BULLETS to a list of tokens.
Or we can ditch QUOTES and BULLETS altogether and use a single name, say CAPSTART.
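For concreteness, a sketch of how such a token-list match might look in C. The capstart_tokens contents and the helper name are invented for illustration; this is not the current tokenize.c code:

```c
#include <string.h>

/* Illustrative token list; the real QUOTES/BULLETS contents live in the
 * per-language 4.0.affix files. */
static const char *const capstart_tokens[] = { "\"", "«", "“", "•", "--" };

/* Return the byte length of a leading CAPSTART-style token, or 0 if none.
 * Unlike a per-character scan, this can match multi-char tokens like "--". */
static size_t leading_capstart_token(const char *word)
{
    for (size_t i = 0; i < sizeof capstart_tokens / sizeof *capstart_tokens; i++)
    {
        size_t len = strlen(capstart_tokens[i]);
        if (strncmp(word, capstart_tokens[i], len) == 0)
            return len;
    }
    return 0;
}
```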
Ah OK. Yes to all of the above. From what I can tell, there is no difference between ...
There is an interesting theoretical problem behind the idea of capitalization. I will think about it some more...
It seems to me it would be a major achievement if an unsupervised algo could find the equivalence of words that start with capital letters to those that do not, without being specifically programmed for that. I mean that it would just be designed to find patterns in the input, and would "automagically" indicate this equivalence. The current code contains my initial implementation of capitalization parsing using dict definitions, but I stopped developing it when it seemed to me it would need disjunct manipulation, because I understood - maybe wrongly - that you dislike this idea (so I just continued with my very long LG todo list).
Is it reasonable to check the existence of the affix tokens in the case of a DB/Atomese dict?
There is a 4.0.affix used for splitting. I'll remove the subscripts in there.
I think that it sometimes does this, but not consistently. It can't deduce this as a general rule right now. To find the general rule, there would need to be work on general tokenization. I'm thinking about this. Deducing the simple regexes is also desirable. I think it's doable in principle; setting up the machinery in practice is ... a lot of work.
I still get this with the ...
I guess they should be added to the dict. With most of the rest of the dictionaries there is of course a long list of such errors, as most of the tokens are not found in the dict.
My proposals:
---> Which one is desired? Comments:
Please tell me what is desired and I will send the PR.
I'll do 1 or 2 on a case-by-case basis. I don't like 3 because it just adds complexity and hides a real problem.
In the PR I'm finishing now, I made the message on non-existent strippable affixes only a warning. I noted that you made some minor changes to most/all 4.0.affix files (e20556a).
'th': In the dict, they define several Thai punctuation marks, but they added only 3 of them to 4.0.affix (in RPUNC)!
Since it is an implementation that is more than just a pure demo, I think we have to preserve the list of punctuation and just define the non-existent ones with a null expression. However, the dict file is a generated one, and I guess that such a definition should be added somewhere else (maybe as a new words file). Maybe a comment should be added to all these modified affix files, saying that most of the punctuation is not handled by the respective dict, and "see en/4.0.affix for a more complete list of strippable affixes".
Besides needing your input on this post (on what I have changed and also on things that still need to be fixed), this PR is ready. It also has code commits. Alternatively, I can just submit it and send fixes according to your comments. It may be convenient to apply it first because it implements the affix existence check.
Go ahead and make the dict changes as you propose. I'll fix the Russian dictionary. For the Thai dictionary, let it emit warnings for now; maybe by tagging @kaamanita we'll get his attention and a pull request containing an appropriate fix.
Hmm. The Russian dict does not offer any handling at all for quotation marks ... I guess they'll need to be unknown-word ... you can add that, if desired. The current punctuation is at line 104; the unknown-word is at the very bottom.
I looked at it and it seems I can add them in 4.0.dict since only the files it includes are generated.
I will try to suppress them in ...
I forgot to include the warnings from ...
While working on the affix stripping code I noted that subscripted LPUNC tokens are mishandled. For example, using "any":
Note that in any/affix-punc, the token "..." appears as "....x" (at the end of line 6):
link-grammar/data/any/affix-punc, lines 6 to 10 in 46a2d31
Several years ago, when I modified the affix stripping code, I also removed the subscripts from the affixes in en/4.0.affix, as the original code didn't use them for dict lookup. At some point (I didn't check when) the LPUNC code (in strip_left()) didn't check for subscripts anymore, but this went unnoticed because en/4.0.affix didn't include LPUNC subscripts. Similarly, the later-written MPUNC code also doesn't handle subscripts. The RPUNC code (in strip_right()) still handles subscripts because it shares common code with UNITS, and units may be subscripted.
As a result, the following languages have some mishandled LPUNC tokens: id, th, demo-sql, he, vn, demo-atomese, and the aforementioned any.
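To illustrate the distinction, here is a hedged sketch of a subscript-aware affix comparison. The helper and the marker value are illustrative assumptions, not the actual strip_left()/strip_right() code:

```c
#include <string.h>
#include <stdbool.h>

/* At dict-load time the subscript dot of an affix entry is replaced by a
 * dedicated marker byte, so "....x" becomes "..." + MARK + "x" while a
 * plain "..." stays unambiguous.  The marker value is an assumption. */
#define SUBSCRIPT_MARK '\3'

/* Compare a sentence substring against a (possibly subscripted) affix
 * entry, ignoring the subscript.  A subscript-unaware comparison never
 * matches "..." against a "....x" entry -- the LPUNC/MPUNC bug above. */
static bool affix_matches(const char *text, size_t len, const char *affix)
{
    const char *mark = strchr(affix, SUBSCRIPT_MARK);
    size_t alen = (mark != NULL) ? (size_t)(mark - affix) : strlen(affix);
    return (len == alen) && (strncmp(text, affix, len) == 0);
}
```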
When I fixed LPUNC to re-consider subscripts, another problem happened in the case of ''.y and ''.x (twice single quote): mangled results. The problem is that '' is not in the dict (especially not ''.y, as lookups are now done with the subscript), and it is thus resolved as UNKNOWN-WORD. It then gets subscripted by a subscript of UNKNOWN-WORD, and the code doesn't expect this double subscript. Was it an unfinished attempt to add '' as a synonym for double quotes? (For that, a change is needed in the QUOTES handling; see (1) below.)
---> My proposed fix is to disallow subscripted affixes which are not in the dict.
Regarding unsubscripted affixes that are not in the dict, maybe they should be allowed if they match a regex (in that case the code should be modified to add a regex check). For example, '' currently matches EMOTICON (but there is no point in stripping it as EMOTICON unless we strip everything that matches EMOTICON, a thing that can be done with affix regexes; see (2) below).
---> I will fix these bugs, add subscripts, and send a PR.
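A minimal sketch of such a regex check, using a POSIX regex for brevity; the helper name is invented, and the caller would pass a pattern taken from the language's 4.0.regex (e.g. the EMOTICON definition, not shown here):

```c
#include <regex.h>
#include <stdbool.h>

/* Return true if a token matches the given regex class pattern. */
static bool matches_regex_class(const char *token, const char *pattern)
{
    regex_t re;
    bool ok = false;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0)
    {
        ok = (regexec(&re, token, 0, NULL, 0) == 0);
        regfree(&re);
    }
    return ok;
}
```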
While looking at the affix list, I got these questions/ideas:
1. BULLETS includes --. However, the relevant code may handle single characters only. If desired, the BULLETS list can be changed to a list of tokens like RPUNC etc. (code modification is needed for that).
2. For ... (amy), I added the ability to specify LPUNC and RPUNC regexes (as /regex/). I used this feature only for amy. However, I think it may be a good idea to add that also to MPUNC (mostly copy/paste) so it will be able to split on commas/colons, using regexes with lookahead/lookbehind to prevent the mentioned pitfalls.
link-grammar/data/en/4.0.affix, lines 30 to 39 in 46a2d31
It needs PCRE2/C++ regexes, but with a POSIX regex library the lookahead/lookbehind expressions would be character sequences that just "never" match.
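As a standalone illustration of the lookbehind/lookahead idea with PCRE2 (the pattern and strings here are mine, not a proposed 4.0.affix entry): split on a comma only when it is not part of a number.

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void)
{
    /* Match a comma not preceded and not followed by a digit. */
    PCRE2_SPTR pattern = (PCRE2_SPTR)"(?<!\\d),(?!\\d)";
    PCRE2_SPTR subject = (PCRE2_SPTR)"1,234 apples,oranges";
    int errcode;
    PCRE2_SIZE erroffset;

    pcre2_code *re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
                                   &errcode, &erroffset, NULL);
    if (re == NULL) return 1;

    pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
    int rc = pcre2_match(re, subject, PCRE2_ZERO_TERMINATED, 0, 0, md, NULL);
    if (rc >= 0)
    {
        PCRE2_SIZE *ov = pcre2_get_ovector_pointer(md);
        /* Finds the comma in "apples,oranges" but not the one in "1,234". */
        printf("split point at offset %zu\n", (size_t)ov[0]);
    }
    pcre2_match_data_free(md);
    pcre2_code_free(re);
    return 0;
}
```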