-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Extend KNN neighbor search beyond coincident sites #287
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #287 +/- ##
==========================================
+ Coverage 80.99% 81.19% +0.19%
==========================================
Files 115 115
Lines 11580 11713 +133
==========================================
+ Hits 9379 9510 +131
- Misses 2201 2203 +2
Continue to review full report at Codecov.
|
OK, to be clear, this does more than simply ensure that co-incident points are assigned valid (co-incident) neighbors, it forces the neighbors of co-incident points to be outside of the coincident set, correct? |
duplicates = duplicated(self.data) | ||
coincident = duplicates[:,1].any() | ||
|
||
self.duplicates = duplicates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we'd want all distance-style weights objects gain a duplicates attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thinking was to let the user know, somehow, that they have coincident points - duplicates is a bad choice in this regard as that might be taken to mean the records are identical in all attributes, whereas I think by coincident implies spatial duplicates.
This method either ignores or fails on ids. import libpysal, numpy, uuid
coordinates = numpy.random.random(size=(5,2))
coordinates = numpy.row_stack([coordinates]*5)
ids = [uuid.uuid4().hex for _ in range(100)]
#fails, since we pass ids as id_order to resolve sorting issues from the dataframe
w = libpysal.weights.KNN(coordinates, id_order=ids)
# ignores the ids because of libpysal#284
w = libpysal.weights.KNN(coordinates, ids=ids) tbf I think that's not really this issue's fault, but we'd need to address it. |
for row in duplicate_ids[0]: | ||
neighbors[row] = neighbors[duplicates[row, 2]] | ||
n = self.data.shape[0] | ||
ids = list(range(n)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces ids to be 0,n-1
. The other part of this if
statement, where we keep coincident points, does not do this.... regardless, it means that this fails when we use KNN.from_dataframe
on something with an index...
I wanted to add a |
Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings? |
My bad was I didn't see #285 until after I did this. |
Yes, that was the intention. |
This would have them assigned neighbors with non-zero distances. There are two (at least) cases where this happens - the one you point out where the issue is the data lacks sufficient spatial information to spatially disambiguate units in the same apartment building, and the other is with repeat sales data. In the latter, the data may have a temporal attribute that can be used to differentiate the records. |
Ah, brilliant! never even thought that you could just send That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11. |
I think I'd want to see
I also don't see how this needs to be ported to kernel/distance band weights, save for estimating the bandwidth of a kernel? |
Just to be clear, I wasn't thinking that they would be passing in (n,3) as the data for the knn search, with one of the three being the temporal coordinate. That would raise all kinds of scaling issues that you point out. The motivation for the approach in the PR was that coincident spatial points violate the law (I think it was Keith Clarke's) that no two events can occupy the same point at the same time. In data sets where at temporal attribute is missing, the coincident points imply this happens. Maybe the suggested approach should be an option, rather than the default. Another option would be to jigger the coordinates of the coincident points prior to the search - but that also comes with some possible side-effects. |
Yes, I think that's appropriate. I was trying to add a
Sure, that'd be reasonable. I think the solution implemented in #285 though is just fine, and better than reducing the precision of the data through jittering. |
I'd like to merge #285 to fix the problem, and then return to this as a "ignore_coincident" option for the constructor... |
i think this is resovled by graph and can be closed |
This is aimed at pysal/pysal#1060.
If the approach makes sense, it can be extended to Kernel weights and DistanceBand weights. The latter will need slightly different handling for symmetry.