Fix multiple deduplication bugs #1396
According to comment opencog#1378 (comment), num_linkages_alloced might have changed, and thus cannot be used for the comparison.
There is still a similar silence problem: In addition to fixing the problem, I propose to add an assert() that at least one linkage remains. Questions: link-grammar/link-grammar/parse/parse.c Lines 406 to 411 in 21df959
link-grammar/link-grammar/dict-file/dictionary.c Lines 268 to 271 in 21df959
For any there is a special check for not sorting. What is the reason? Also, in that case there is no deduplication. Is it intended? Now that we have dict #define , maybe it's better to define shuffle_linkages in the any dict?
|
Yeah ... There's still something weird going on. I reordered the code to mark during sorting, and am now finding many many more duplicates. And when this happens, somehow something doesn't work. Yes, the |
Ahhh, there are more duplicates than valid linkages. Then it fails... because I dedupe the bad linkages too. Perhaps the bad linkages shouldn't be sorted or deduplicated, anyway!? |
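For readers following along, here is a minimal sketch of the failure mode described above, not the actual parse.c code. It assumes a simplified, hypothetical `Toy_linkage` record in which `key` stands in for everything a real comparison would look at (links, words, connectors). Duplicates are marked in a pass after sorting (a simplification), only valid linkages are counted and marked, and the proposed assert checks that at least one valid linkage survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a linkage record. */
typedef struct
{
    int  violations;  /* 0 means a valid linkage */
    int  key;         /* stands in for "links, words and connectors" */
    bool dupe;        /* set when an identical linkage sorts earlier */
} Toy_linkage;

/* Order by number of violations first, so valid linkages come first. */
static int toy_compare(const void *a, const void *b)
{
    const Toy_linkage *la = a, *lb = b;
    if (la->violations != lb->violations)
        return (la->violations > lb->violations) - (la->violations < lb->violations);
    return (la->key > lb->key) - (la->key < lb->key);
}

/* Sort, then mark duplicates among the valid linkages only.
 * Returns the number of valid linkages that are not duplicates. */
size_t toy_dedup(Toy_linkage *lk, size_t n)
{
    qsort(lk, n, sizeof(Toy_linkage), toy_compare);

    size_t num_valid = 0, num_dupes = 0;
    for (size_t i = 0; i < n; i++)
    {
        lk[i].dupe = false;
        if (lk[i].violations != 0) continue;  /* leave bad linkages alone */
        num_valid++;
        if (i > 0 && toy_compare(&lk[i - 1], &lk[i]) == 0)
        {
            lk[i].dupe = true;
            num_dupes++;
        }
    }

    /* If there was any valid linkage, at least one must remain. */
    assert(num_valid == 0 || num_dupes < num_valid);
    return num_valid - num_dupes;
}
```

Because bad linkages are never counted as duplicates, the number of duplicates can never exceed the number of valid linkages, which is the condition that went wrong above.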
The The Certainly, if num linkages < limit, then all planar linkages will be enumerated, one each, there won't be duplicates, and so these are by definition "uniformly distributed". If num linkages > limit, then ... we sample ... does that sampling result in a uniform distribution? I don't really know. I assumed it did ... Would removing duplicates make it "more uniform" if it is not? How many duplicates might there be, anyway? The reason for the shuffle is that when I sample, I set limit=1000 but then take only the first 24 linkages, ever. I'd tripped over a bug, where, no matter what the sentence, I always got back exactly the same 24 parses. So that wasn't uniform at all, over thousands of sentences. The shuffle guarantees that a different set of 24 appears each time. |
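As an illustration of the shuffle idea (a sketch only, not the library's implementation), the following does a Fisher-Yates shuffle over an array of linkage indices, so that taking the first 24 entries yields a different subset on each run. The function names and the small xorshift generator are assumptions made for this example; link-grammar uses its own random number routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tiny PRNG (xorshift32); the state must be nonzero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fisher-Yates shuffle of an index array, in place. */
void shuffle_indices(size_t *idx, size_t n, uint32_t *state)
{
    assert(*state != 0);  /* xorshift never leaves the zero state */
    for (size_t i = n; i > 1; i--)
    {
        size_t j = next_rand(state) % i;  /* 0 <= j < i (slight modulo bias) */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}
```

With limit=1000, one would fill `idx` with 0..999, shuffle, and then read only `idx[0]` through `idx[23]`. Whether the underlying sample of 1000 is itself uniform is a separate question that the shuffle does not answer.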
I won't be fixing the |
#1397 fixes the too-many-duplicates bug mentioned above. |
(EDITED to fix linkage number typos.) My test sentence is: In the current release (5.12.0) I get: In the repository release (9c071f7, just before your recent dict changes): I still don't understand why So something in the deduplication algo apparently doesn't detect all the duplicates. Some other issues:
(I cannot find There are some problems in using desc:string:
In addition, connector address comparisons are still done, as in: Last thing here: |
This idea is not as simple as I first thought, and anyway I recalled it had already been used in the old code: In case there is no linkage overflow, say we found N linkages when L=linkage_limit. |
In #1378 you said
There are still linkage differences in some sentences with Since the linkages without deduplication are the same, I think that it is a bug. |
Note that it may not be easy to check that
But Another observation is that with |
I found the reason for the linkage order difference if tracon compression (connector address sharing) is used (the default). link-grammar/link-grammar/parse/parse.c Lines 291 to 303 in 62711eb
This is due to the connector address checks and the following However, I still don't understand why some duplicates are not detected. |
You are confusing me.
Anyway, for my statistical sampling, I'm starting to think that duplicates are not such a bad thing, after all, so I'm open to disabling this code, by default. |
I found the bug about failing to detect differences. It's obvious and a result of late-night cloudy thinking. Fix coming shortly. |
I apologize for changing the dictionary. I should not have done that. |
Let's start with one point I partially agree with you about:
Such a check is needed too. However, sentences with the same disjuncts may have different linkages.
|
I fixed the bad linkage compare in e35ea9f and pushed directly upstream. |
Hmm. I won't be doing that check. If two different disjuncts give the same links and use the same dict words, then, from the point of view of the linkage, they are the same, even if the dict managed to include two different disjuncts, for the same word (i.e. one with and one without a multi-connector, but otherwise being the same). Hmmm. At least, I think I want to ignore such differences. But for dict debugging, they do make a difference... From what I can tell, it is enough to compare pointers for the |
How can they possibly be different? We've just finished loops that checked everything there is to check, and found that everything is the same. |
What is supposed to ensure that they are the same? Let's see. Loop 1: // Compare link endpoints Loop 2: Compare link names. Loop 3. // Compare words. Loop 4. // Compare connector And indeed, when I added for (uint32_t wi=0; wi<lpv->num_words; wi++)
assert(lpv->chosen_disjuncts[wi] == lnx->chosen_disjuncts[wi], "XXX"); just above This proves that the disjuncts can be different before loop 4 when loop 4 normally finishes with no |
right before the
prints
Casual examination shows that neither of these disjunct strings is correct. WTF
|
Oh never mind. The printed parse is not the parse being shown |
It should have been this one:
So the disjuncts For linkage deduplication, it is the de facto links that I'm interested in, and not the dictionary entries that lead to them. The reason for this is because ... (next post) |
The reason for this ... just evaporated. I was going to write something, and I see it won't support my argument. One of my dicts has expressions entirely of the form So when I get a linkage, I am only interested in the actual connectors used, and not the hypothetical multi's that were never used. |
FYI, if I turn on deduplication, I find that linkages 9, 12, 35, 39, 47, 50, 63 and 67 are duplicates. Presumably, one linkage corresponds to the This raises the question: should deduplication be turned on in general? Should it be a run-time flag? |
Previously I said (multiple times) that the default is However, I think it will be better to just collapse "@x @x" sequences to a single |
At least, there should be a |
I just found some old code in which I added duplicate |
Proceed as you wish; I'm done, here. I'm now off to fry my explosively large dictionaries... FYI, I'd like to publish version 5.12.1 in a few days or weeks or whenever it's ready, before too long. |
The existence of I said:
I wrote the code to do that, and debug printing also found things like So I propose another solution to my problem of automatically comparing linkage results after making a change:
Providing such a flag (e.g. `-test=no-linkage-dedup`) will solve the above problem. This can also serve us if we suspect that a linkage is wrongly suppressed, e.g. after changes in the sorting function. So I will implement such a flag. |
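A rough sketch of how such a flag could be checked, assuming the test option arrives as a comma-separated string of feature names. The helper `toy_feature_enabled` is hypothetical and written only for this illustration; it is not the library's own lookup routine.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true if `feature` appears as a whole
 * item in the comma-separated `list`. */
bool toy_feature_enabled(const char *list, const char *feature)
{
    size_t flen = strlen(feature);
    if (flen == 0) return false;

    const char *p = list;
    while (*p != '\0')
    {
        size_t len = strcspn(p, ",");  /* length of the current item */
        if (len == flen && strncmp(p, feature, flen) == 0) return true;
        p += len;
        if (*p == ',') p++;            /* skip the separator */
    }
    return false;
}
```

The deduplication pass would then be skipped when `toy_feature_enabled(test_string, "no-linkage-dedup")` is true, where `test_string` is an assumed name for wherever the option value is stored.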
Some confusion above...
This just says "connect to zero or more
This means "connect to one or more FWIW, + and - directions commute, so if there are expressions like
I don't see a question. |
Does
It is an implied one, whether the "it will be wrong to collapse ..." sentence is correct. |
Heh.
Yes it does. I was wrong; I misremembered. I was so used to seeing The following rewrites are wrong:
Thus, it would appear that my earlier insistence that "they are the same" (in #1396 (comment) and other comments) is wrong. And now that I see the error of my ways ... what to do? The disjuncts differ, but they give the same parse ... Hmmm .... perhaps you are correct, these should be considered to be different. .. Should I patch this? Do you want to patch this? |
I opened #1417 because otherwise, this conversation will get lost in history. |
I suppose that the last two loops (including the DOUBLE_CHECK loop) are to be removed. The question is how to sort the linkages (that happen to look the same) in that case. |
??? I must be getting tired. The second-from-last loop lines 352 to 372, bails only if there are differences, so we want to have it. link-grammar/link-grammar/parse/parse.c Lines 345 to 371 in e746230
We want to keep the above. The last loop, commented out with |
I now think that you are right.
Yes. My proposed connector comparison (string + multi) seems more direct. |
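To make the "string + multi" idea concrete, here is a minimal sketch that compares two connectors by value instead of by address. The `Toy_connector` struct is a hypothetical simplification and not link-grammar's `Connector`; the real code may well compare descriptor pointers instead, as discussed elsewhere in this thread.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified connector for illustration only. */
typedef struct
{
    const char *string;  /* connector name, e.g. "Ss" */
    bool multi;          /* true for a multi-connector such as "@A" */
} Toy_connector;

/* Two connectors are equal if they have the same name and the same
 * multi flag, regardless of where they are stored in memory. */
bool toy_connector_equal(const Toy_connector *a, const Toy_connector *b)
{
    if (a == b) return true;                 /* same object, trivially equal */
    if (a->multi != b->multi) return false;  /* "@A" differs from "A" */
    return strcmp(a->string, b->string) == 0;
}
```

A value comparison like this gives the same answer whether or not connector addresses are shared, so the result should not depend on tracon compression being enabled.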
I said:
|
If the descs are the same, then we know the strings cannot differ. This resolves opencog#1396 (comment)
Read the code more carefully. It uses |
I added a comment on that in your last PR. |
Click on "finish review" else I don't see comments. |
I added it directly on the commit (in the repository commit list). I think such comments are immediate. |
I again forgot to do that... |
This fixes multiple issues reported for #1378, including: