Difficulties running dedupe_geojson out of the box and a possibly significant typo? #25

Open
theloglizard opened this issue Dec 8, 2020 · 0 comments

theloglizard commented Dec 8, 2020

Hi. This is an interesting package that does almost exactly what I need, but I've had some difficulty getting it to run in Python 3 (3.8, specifically). I spent a day hacking on various bits and pieces and seem to have certain slices running, and I'm happily running libpostal/pypostal in other contexts (great stuff, thanks!). I imagine I have some sort of installation or package dependency issue, but I also wonder whether a commit or update may have failed somewhere. For example:

class GeoJSONLineParser(GeoJSONParser):
    def __init__(self, filename):
        if filename.endswith(".bz2"):
            self.f = bz2.BZ2File(filename)
        else:
            self.f = open(filename)

    def next_feature(self):
        return json.loads(self.f.next().rstrip())

seems to be bombing with an error report:

dedupe_geojson --use-postal-code --use-zip5 --no-phone-numbers
-o foo --output-filename z1 --name-dupe-threshold 0.0 name.json
Word index file: foo/info_gain.index
Near-dupe tempfile: foo/near_dupes
Features DB: foo/features_db
Output filename: foo/z1
-----------------------------
* Assigning IDs, creating near-dupe hashes + word index (using info_gain)
Traceback (most recent call last):
  File "/.local/bin/dedupe_geojson", line 299, in <module>
    for feature_id, feature in id_features(args.files):
  File "/.local/bin/dedupe_geojson", line 52, in id_features
    for feature in f:
TypeError: iter() returned non-iterator of type 'GeoJSONLineParser'


which is easily enough patched/remedied with:

return json.loads(next(self.f).rstrip())

This seems to be some sort of Python 2/Python 3 thing?!
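A note on the traceback: "iter() returned non-iterator" is the error Python 3 raises when `__iter__` returns an object that has no `__next__` method, so replacing `self.f.next()` with `next(self.f)` may not be the whole fix. Below is a minimal sketch of a Python 3 line parser that supports `for feature in parser`. The real `GeoJSONParser` base class is not shown in this issue, so the class here is a hypothetical stand-in (`LineParserSketch` is not a name from lieu), and text-mode decoding as UTF-8 is an assumption.

```python
import bz2
import json


class LineParserSketch:
    """Hypothetical stand-in for GeoJSONLineParser; not lieu's actual code."""

    def __init__(self, filename):
        # Open in text mode in both cases so each line is a str (Python 3).
        if filename.endswith(".bz2"):
            self.f = bz2.open(filename, "rt", encoding="utf-8")
        else:
            self.f = open(filename, "r", encoding="utf-8")

    def next_feature(self):
        # The next() builtin replaces the Python 2 only file.next() method.
        return json.loads(next(self.f).rstrip())

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol. StopIteration raised by next(self.f)
        # at end of file propagates and ends the for loop.
        return self.next_feature()
```

With `__next__` defined, a loop such as `for feature in LineParserSketch("name.json"):` yields one parsed feature per line of the file.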


Also, I think there may be a typo ("canoncal" instead of "canonical" ) at line 99 in https://github.com/openvenues/lieu/blob/master/scripts/dedupe_geojson

def is_name_address_dupe(canoncal, other, dupe_pairs, dupes, word_index=None,
                         name_dupe_threshold=DedupeResponse.default_name_dupe_threshold,
                         needs_review_threshold=DedupeResponse.default_name_review_threshold,
                         with_address=True,
                         with_unit=False,
                         use_phone_number=False,
                         fuzzy_street_names=False):
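Whether this misspelling matters depends on how the function body and its callers refer to the parameter, which is not shown here. As a generic Python illustration (the function names below are made up and are not from lieu): a misspelled positional parameter is harmless while the body uses the same spelling, and only breaks if the body uses the correct spelling or a caller passes the argument by keyword.

```python
def consistent(canoncal, other):
    # Misspelled, but the body uses the same misspelling, so this works.
    return canoncal == other


def inconsistent(canoncal, other):
    # The body uses the correct spelling, which is not defined anywhere.
    return canonical == other  # noqa: F821 (deliberate undefined name)


def error_type(func, *args, **kwargs):
    """Return the exception type raised by func, or None if it succeeds."""
    try:
        func(*args, **kwargs)
    except Exception as exc:
        return type(exc)
    return None


# Positional call, consistent spelling: no error.
print(error_type(consistent, "a", "a"))
# Body references an undefined name: NameError at call time.
print(error_type(inconsistent, "a", "a"))
# Caller uses the correct spelling as a keyword: TypeError.
print(error_type(consistent, canonical="a", other="a"))
```

So if the script runs past this function without a NameError or TypeError, the typo is probably cosmetic, though still worth fixing for readability.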

Before I commit to further hacking to get other slices running (I haven't done anything with the geo features yet, for example), I thought I'd check on a few things:

  1. Should dedupe_geojson be up and running with Python 3?
  2. Might some commit or installation step have failed or gone astray?
  3. Is lieu still something I can expect to work?

Also, I looked around in the installation and didn't see a simple sample input file, which would have saved me a certain amount of effort. As noted, I haven't sorted out all the formats and features, but in the spirit of sharing back, I attach the following JSON file, which seems to more or less work for me in the above call to dedupe_geojson.
name.json.gz

Thanks for your attention. Good stuff, both this and libpostal. I appreciate your sharing.
