About CanopyIndex implementation #1139

lmores · 2023-01-13T16:52:12Z

lmores
Jan 13, 2023

I am curious about the current implementation of the CanopyIndex class in predicates.py.
Given a doc (a string) and the doc_id used to identify the current doc inside the inverted index, the main steps inside __call__(self, record, **kwargs) (line 214 and subsequent ones) seem to be:

1. If the key doc_id is already inside self.canopy, set block_key to its associated value (even if self.canopy[doc_id] is None).
2. Otherwise, retrieve from the index the list of the other docs (member_ids) that are close enough to doc and, for each of them, set self.canopy[member_id] = doc_id unless self.canopy[member_id] is already set (in which case leave it unchanged). Then:
   - if we found at least one other doc (i.e., len(member_ids) > 0), point the current doc to itself inside the canopy (self.canopy[doc_id] = doc_id);
   - otherwise set self.canopy[doc_id] = None (meaning that the current doc is "isolated" with respect to all the other docs inside the index, I suppose).
3. Finally, if block_key is None there is nothing close to the current doc (except the doc itself); otherwise the id of the "local representative" of the region where doc_id lies is returned.
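The steps listed above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not the actual code in predicates.py: the class name CanopySketch, the method block_key, and the search callable (assumed to return the ids of the other docs within the threshold) are all hypothetical, and so is the neighbor map used to exercise it.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional


class CanopySketch:
    """Hypothetical stand-in for the logic described above (not dedupe's code)."""

    def __init__(self, search: Callable[[Hashable], Iterable[Hashable]]) -> None:
        # Assumption: search(doc_id) returns the ids of the *other* docs
        # that are close enough to the doc identified by doc_id.
        self.search = search
        self.canopy: Dict[Hashable, Optional[Hashable]] = {}

    def block_key(self, doc_id: Hashable) -> Optional[Hashable]:
        # Step 1: reuse a previous assignment, even if it is None.
        if doc_id in self.canopy:
            return self.canopy[doc_id]

        # Step 2: claim every nearby doc that has no assignment yet.
        member_ids = list(self.search(doc_id))
        for member_id in member_ids:
            if member_id not in self.canopy:
                self.canopy[member_id] = doc_id

        # The current doc points to itself if it has neighbors, else to None.
        self.canopy[doc_id] = doc_id if member_ids else None

        # Step 3: return the "local representative" (or None if isolated).
        return self.canopy[doc_id]


# Hypothetical neighbor map: five docs in a chain, plus one isolated doc.
neighbors = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
    "F": [],
}
sketch = CanopySketch(lambda doc_id: neighbors[doc_id])
keys = [sketch.block_key(doc_id) for doc_id in "ABCDEF"]
```

With this map and this processing order, A and B share the key A, C and D share the key C, E gets its own key, and F gets None.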

I have two questions/doubts:

1. If my understanding of the current implementation is correct, I do not see the relation to the "traditional" concept of "canopy" as it appears in the literature, see Wikipedia.
2. Does the current implementation of the CanopyIndex significantly differ from the behaviour of SearchIndex? A SearchIndex returns a list of all the doc_id near the current doc, whereas the current CanopyIndex just returns the id of the "local representative" of the same set of docs, which carries (more or less) the same amount of information. Am I wrong?

Thank you!

fgregg · 2023-01-17T00:22:02Z

fgregg
Jan 17, 2023
Maintainer

it should be exactly equivalent to McCallum's canopy algorithm. how do you believe it differs in output?

fgregg · 2023-01-20T20:50:59Z

fgregg
Jan 20, 2023
Maintainer

For the difference between the Canopy and Search predicates, imagine that we have a corpus of five documents with the following distances:

	A	B	C	D	E
A	0	0.3	0.7	0.7	0.7
B	0.3	0	0.3	0.7	0.7
C	0.7	0.3	0	0.3	0.7
D	0.7	0.7	0.3	0	0.3
E	0.7	0.7	0.7	0.3	0

Assume that the radius of the canopy is 0.3. Then the following sets of canopies are possible, depending on which documents are picked as canopy centers:

Partition 1

(A, B)
(C, D)
(E)

Partition 2

(A, B, C)
(D, E)

Notice that for each canopy partition, a document can be in one and only one canopy.
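Both partitions can be reproduced with a small greedy sketch over the distance table above. It assumes the single-radius variant discussed here (every document within the radius of the chosen center joins the canopy and leaves the pool); the function name canopy_partition and the two processing orders are illustrative choices, not part of dedupe.

```python
from typing import List, Sequence

DOCS = ["A", "B", "C", "D", "E"]
MATRIX = [
    [0.0, 0.3, 0.7, 0.7, 0.7],
    [0.3, 0.0, 0.3, 0.7, 0.7],
    [0.7, 0.3, 0.0, 0.3, 0.7],
    [0.7, 0.7, 0.3, 0.0, 0.3],
    [0.7, 0.7, 0.7, 0.3, 0.0],
]
# Look-up table: DIST["A", "B"] == 0.3, and so on.
DIST = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(DOCS)
    for j, b in enumerate(DOCS)
}


def canopy_partition(order: Sequence[str], radius: float = 0.3) -> List[List[str]]:
    """Greedy single-radius canopies; the first doc left in the pool is the center."""
    pool = list(order)
    canopies = []
    while pool:
        center = pool[0]
        members = [d for d in pool if DIST[center, d] <= radius]
        canopies.append(sorted(members))
        # Members leave the pool, so each doc ends up in exactly one canopy.
        pool = [d for d in pool if d not in members]
    return canopies


partition_1 = canopy_partition(["A", "B", "C", "D", "E"])  # centers A, C, E
partition_2 = canopy_partition(["B", "D", "A", "C", "E"])  # centers B, D
```

Starting from A gives Partition 1, while starting from B gives Partition 2, so the result depends on the order in which centers are picked.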

Now for the search predicates, assume that our indexed set contains the same documents, and that we have a search set of documents A', B', C', D', and E' that are identical to their counterparts in the indexed set.

Then the search predicates will return the following records to compare

A': (A, B)
B': (A, B, C)
C': (B, C, D)
D': (C, D, E)
E': (D, E)

See how the documents in the search index can appear more than once.
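The search results listed above can be checked with a short sketch. It assumes a plain radius search at 0.3 over the same five indexed documents; the distance rule below (0 for the same document, 0.3 for adjacent letters, 0.7 otherwise) simply reproduces the table, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

DOCS = ["A", "B", "C", "D", "E"]


def dist(a: str, b: str) -> float:
    """Reproduces the distance table: adjacent letters are 0.3 apart."""
    gap = abs(DOCS.index(a) - DOCS.index(b))
    if gap == 0:
        return 0.0
    return 0.3 if gap == 1 else 0.7


def search(query: str, radius: float = 0.3) -> List[str]:
    """Return every indexed doc within the radius of the query doc."""
    return [d for d in DOCS if dist(query, d) <= radius]


# Each primed query doc is identical to its indexed counterpart.
results: Dict[str, List[str]] = {q + "'": search(q) for q in DOCS}

# How many result lists each indexed doc appears in.
appearances = Counter(d for hits in results.values() for d in hits)
```

Here B, C, and D each appear in three result lists and A and E in two, whereas in a canopy partition every document appears exactly once.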

1 reply

lmores Jan 21, 2023
Author

Seems like we are talking about two different types of canopies!

The construction of my type of canopies requires the specification of two different radii r1 < r2. When building a canopy centered on a given item, you add to that canopy all items that are at most r2 away, but you remove from the pool of available items only those that are r1 or less away (hence, in this version, canopies overlap and do not make up a partition).

You can find out more in this article by McCallum and others or on Wikipedia.

The above definition of canopies lies somewhere "in between" your notion of canopy and the search index: items close to the center of my canopies only belong to that canopy, but items at distance greater than r1 and at most r2 may be inserted in more than one canopy.

Not sure if this implementation of canopies could replace/improve the current pair of CanopyIndex + SearchIndex, though.
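The two-radius construction described in this reply can be sketched as follows. This is an illustration under stated assumptions, not code from dedupe or from the article: candidates for a canopy are taken from the items still in the pool (so that r1 = r2 reduces to a partition), the items are hypothetical integer positions on a line, and the function name two_radius_canopies is made up.

```python
from typing import List, Sequence


def two_radius_canopies(items: Sequence[int], r1: int, r2: int) -> List[List[int]]:
    """Canopies with a tight radius r1 and a loose radius r2 (r1 <= r2)."""
    if r1 > r2:
        raise ValueError("r1 must not exceed r2")
    pool = list(items)
    canopies = []
    while pool:
        center = pool[0]
        # Everything within r2 of the center joins this canopy ...
        canopies.append([x for x in pool if abs(x - center) <= r2])
        # ... but only items within r1 (including the center) leave the pool.
        pool = [x for x in pool if abs(x - center) > r1]
    return canopies


# Hypothetical 1-D data; distance is the absolute difference.
positions = [0, 1, 3, 4, 6, 9]

overlapping = two_radius_canopies(positions, r1=1, r2=3)
partition = two_radius_canopies(positions, r1=3, r2=3)
```

With r1 = 1 and r2 = 3 the items 3, 6, and 9 each land in two canopies, while with r1 = r2 = 3 the canopies are disjoint.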

fgregg · 2023-01-21T19:50:46Z

fgregg
Jan 21, 2023
Maintainer

got it. in my implementation r1=r2

fgregg · 2023-01-21T19:58:04Z

fgregg
Jan 21, 2023
Maintainer

the non-partitioning version increases complexity.

let r1=0, then the number of searches will be N.

for r1=r2, the number of searches is about N/(average number of elements returned in an r2 search)

you can think of r1 as a parameter to control that trade-off.

would welcome making the change if there was evidence that it was worthwhile. in my experience r2 was much more consequential than r1
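The trade-off can be made concrete with a small counting sketch. It assumes one index search per canopy center, with r2 held fixed at 2, and uses hypothetical data (twelve evenly spaced integer positions); count_searches is an illustrative helper, not a dedupe function.

```python
from typing import Dict, Iterable


def count_searches(points: Iterable[int], r1: int) -> int:
    """Count canopy centers, assuming each center costs exactly one search."""
    pool = sorted(points)
    searches = 0
    while pool:
        center = pool[0]
        searches += 1
        # Only items within r1 of the center are removed from the pool.
        pool = [p for p in pool if abs(p - center) > r1]
    return searches


points = range(12)  # N = 12 hypothetical docs at positions 0..11
counts: Dict[int, int] = {r1: count_searches(points, r1) for r1 in (0, 1, 2)}
```

On this data r1 = 0 needs all 12 searches, r1 = 1 needs 6, and r1 = r2 = 2 needs 4, which matches N divided by the 3 elements each r2 search returns.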
