-
It should be exactly equivalent to McCallum's canopy algorithm. How do you believe it differs in output?
-
For the difference between the Canopy and Search predicates, imagine that we have a corpus of five documents with the following distances:
Assume that the radius of the canopy is 0.3. Then the following sets of canopies are possible:
Partition 1
Partition 2
Notice that for each canopy partition, a document can be in one and only one canopy.
Now for the search predicates, assume that our indexed set contains the same documents, and that we have a search set of documents A', B', C', D', and E' that are identical to their counterparts in the indexed set. Then the search predicates will return the following records to compare:
See how the documents in the search index can appear more than once.
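
The contrast can be sketched in a few lines of Python. The positions below are made-up points on a line, not the distances from the example above, and `canopy_partition` and `search_neighbors` are hypothetical helper names rather than anything from the library:

```python
def canopy_partition(points, radius, order):
    """Assign every document to exactly one canopy, trying centers in `order`."""
    assigned = {}
    for center in order:
        if center in assigned:
            continue  # already belongs to an earlier canopy
        for doc, pos in points.items():
            if doc not in assigned and abs(pos - points[center]) <= radius:
                assigned[doc] = center
    return assigned


def search_neighbors(points, radius):
    """For each query document, list every indexed document within `radius`."""
    return {
        query: sorted(doc for doc, pos in points.items() if abs(pos - qpos) <= radius)
        for query, qpos in points.items()
    }


# Hypothetical 1-D positions, chosen only for illustration.
points = {"A": 0.0, "B": 0.25, "C": 0.5, "D": 0.75, "E": 1.0}

partition_1 = canopy_partition(points, 0.3, order="ABCDE")
# {'A': 'A', 'B': 'A', 'C': 'C', 'D': 'C', 'E': 'E'}
partition_2 = canopy_partition(points, 0.3, order="BACDE")
# {'A': 'B', 'B': 'B', 'C': 'B', 'D': 'D', 'E': 'D'}
neighbors = search_neighbors(points, 0.3)
# {'A': ['A', 'B'], 'B': ['A', 'B', 'C'], 'C': ['B', 'C', 'D'],
#  'D': ['C', 'D', 'E'], 'E': ['D', 'E']}
```

In both partitions each document has exactly one canopy, and which one depends on the order in which centers are picked. In the search result, `B` shows up in the lists for `A`, `B`, and `C`.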
-
Got it. In my implementation, r1 = r2.
-
The non-partitioning version increases complexity. Let r1 = 0; then the number of searches will be N. For r1 = r2, the number of searches is about N / (average number of elements returned in an r2 search). You can think of r1 as a parameter that controls that trade-off. I would welcome making the change if there were evidence that it was worthwhile. In my experience, r2 was much more consequential than r1.
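
A rough sketch of that trade-off, assuming 1-D points and absolute difference as the distance. `build_canopies` is a hypothetical name, and this is a toy version of the two-threshold canopy procedure, not the library's code:

```python
def build_canopies(positions, r1, r2):
    """Toy two-threshold canopy pass: one r2 search per chosen center.

    Points within r2 of the center join its canopy; only points within r1
    of the center stop being candidates for later canopies.
    """
    if not 0 <= r1 <= r2:
        raise ValueError("expected 0 <= r1 <= r2")
    candidates = list(range(len(positions)))
    canopies = []
    while candidates:
        center = candidates[0]
        # One "search": everything still available within the loose radius r2.
        members = [i for i in candidates
                   if abs(positions[i] - positions[center]) <= r2]
        canopies.append(members)
        # Drop only the points within the tight radius r1 (the center included).
        candidates = [i for i in candidates
                      if abs(positions[i] - positions[center]) > r1]
    return canopies


# Sixteen hypothetical points spaced 0.125 apart.
positions = [i * 0.125 for i in range(16)]

searches_r1_zero = len(build_canopies(positions, r1=0.0, r2=0.25))    # 16
searches_r1_eq_r2 = len(build_canopies(positions, r1=0.25, r2=0.25))  # 6
```

With r1 = 0 every point becomes a center, so there are N = 16 searches and the canopies overlap. With r1 = r2 each search removes about three points, which gives 6 searches and a proper partition.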
-
I am curious about the current implementation of the `CanopyIndex` class in `predicates.py`.
Given a `doc` (a string) and the `doc_id` used to identify the current doc inside the inverted index, the main steps inside `__call__(self, record, **kwargs)` (line 214 and subsequent ones) seem to be:
1. If `doc_id` is already inside `self.canopy`, set `block_key` to its associated value (even if `self.canopy[doc_id] is None`).
2. Otherwise, find the docs (`member_ids`) that are close enough to `doc` and for each of them set `self.canopy[member_id] = doc_id` if `self.canopy[member_id]` is not already set (or leave it unchanged otherwise).
3. If any were found (`len(member_ids) > 0`), point the current doc to itself inside the canopy (`self.canopy[doc_id] = doc_id`).
4. Otherwise, set `self.canopy[doc_id] = None` (meaning that the current `doc` is "isolated" with respect to all the other docs inside the index, I suppose).
5. If `block_key is None`, there is nothing close to the current `doc` (except the doc itself); otherwise the id of the "local representative" of the region where `doc_id` lies is returned.
I have two questions/doubts:
Does the behaviour of `CanopyIndex` significantly differ from the behaviour of `SearchIndex`? A `SearchIndex` returns a list of all the `doc_id`s near the current `doc`, whereas the current `CanopyIndex` just returns the id of the "local representative" of the same set of docs, which carries (more or less) the same amount of information. Am I wrong?
Thank you!
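
The steps listed above can be sketched roughly as follows. This is a toy restatement, not the code from `predicates.py`: the class name `CanopySketch`, the method name `block_key`, and the `search` callable (which here returns the ids of nearby docs, excluding the doc itself) are all assumptions made for illustration.

```python
class CanopySketch:
    """Toy restatement of the steps described above; not the library source."""

    def __init__(self, search):
        # `search` is a hypothetical callable: doc -> list of nearby doc ids.
        self.search = search
        self.canopy = {}

    def block_key(self, doc_id, doc):
        # Step 1: reuse the stored value if this doc was already placed.
        if doc_id in self.canopy:
            return self.canopy[doc_id]
        # Step 2: claim every nearby doc that has not been claimed yet.
        member_ids = self.search(doc)
        for member_id in member_ids:
            if member_id not in self.canopy:
                self.canopy[member_id] = doc_id
        # Steps 3 and 4: the doc represents itself, or is marked as isolated.
        self.canopy[doc_id] = doc_id if member_ids else None
        return self.canopy[doc_id]


# Hypothetical 1-D "documents": the doc is just a position on a line.
positions = {"a": 0.0, "b": 0.25, "c": 0.5, "z": 5.0}


def search(doc):
    # Ids within 0.3 of `doc`, excluding anything at distance 0 (the doc itself).
    return [i for i, p in positions.items() if 0 < abs(p - doc) <= 0.3]


index = CanopySketch(search)
keys = {doc_id: index.block_key(doc_id, pos) for doc_id, pos in positions.items()}
# {'a': 'a', 'b': 'a', 'c': 'c', 'z': None}
```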