__cmp__ errors triggered in address_dupe_pairs (dedupe.py) #15

Open
thisisaaronland opened this issue Jul 25, 2018 · 5 comments

@thisisaaronland

This appears to be a variation on issue #9, meaning it's a type mismatch being triggered in postal/utils/enum.py, but I am hoping you can provide some input on how to track down the root cause so it can be fixed.

Failing that, some suggestions for how to trap this sort of thing and drop the non-result would help, because in the example below the EMR process ran for ~2 hours before finally failing.

The input data should be clean so I am not sure how to debug these errors...

        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
    .filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
  File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
    return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
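
One possible way to trap this kind of failure and drop the offending record, rather than losing a long-running job, is to wrap the filter predicate. This is only a sketch under assumptions: `safe_predicate` and the logger name are hypothetical and not part of lieu or pypostal, and the stand-in data below merely imitates a tuple turning up where a scalar was expected. It is written to run on Python 3, whereas the job above runs on Python 2.7.

```python
import logging

logger = logging.getLogger("dedupe_guard")  # hypothetical logger name


def safe_predicate(predicate, default=False):
    """Wrap a filter predicate so that a TypeError drops the record.

    Hypothetical helper: the record is logged and `default` is returned
    instead of letting the exception propagate and fail the whole task.
    """
    def wrapped(record):
        try:
            return predicate(record)
        except TypeError as err:
            logger.warning("dropping record %r: %s", record, err)
            return default
    return wrapped


# Stand-in data: the tuple plays the role of the unexpected value.
records = [9, (9, 1.0), 2]

# tuple + int raises TypeError, so the middle record is dropped, not fatal.
kept = list(filter(safe_predicate(lambda r: r + 1 > 5), records))
```

The same wrapper could in principle be passed to an RDD's `filter`. Note that this hides the mismatch rather than fixing it, so the logged records would still need to be reviewed.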
@thisisaaronland
Author

After adding more verbose logging to utils/enum.py this is what I see:

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
    .filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
  File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 20, in __cmp__
    raise Exception, err
Exception: failed to __cmp__ 'EXACT_DUPLICATE' this: '9' that: '(EXACT_DUPLICATE, 1.0)' error: 'long.__cmp__(x,y) requires y to be a 'long', not a 'tuple''

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
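
The log line above reports `that: '(EXACT_DUPLICATE, 1.0)'`, which suggests the value reaching the `in` test is a pair rather than a bare status. A membership test compares with `==`, so it ends up in the enum's comparison method. The sketch below reproduces that shape under assumptions: `StrictEnumValue` is a hypothetical stand-in (not the real postal enum class), the second tuple element is assumed to be a similarity score, the value for `LIKELY_DUPLICATE` is a placeholder, and `__eq__` is used in place of `__cmp__` so the example runs on Python 3.

```python
class StrictEnumValue(object):
    """Hypothetical stand-in for an enum value that only compares to ints."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        if isinstance(other, StrictEnumValue):
            return self.value == other.value
        if not isinstance(other, int):
            raise TypeError("cannot compare %s with a %s"
                            % (self.name, type(other).__name__))
        return self.value == other

    def __ne__(self, other):
        return not self.__eq__(other)


EXACT = StrictEnumValue("EXACT_DUPLICATE", 9)    # 9 is the value in the log
LIKELY = StrictEnumValue("LIKELY_DUPLICATE", 6)  # placeholder value


def unwrap_status(status):
    """Return the bare status from a status or a (status, similarity) pair."""
    if isinstance(status, tuple):
        return status[0]
    return status


pair = (EXACT, 1.0)

# `pair in (EXACT, LIKELY)` would raise TypeError, as in the traceback.
is_dupe = unwrap_status(pair) in (EXACT, LIKELY)
```

Unwrapping before the membership test avoids the comparison against a tuple, but whether the pair should be reaching that filter at all is a separate question.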

@thisisaaronland
Author

More example errors and details:

----
dupe class None == NEEDS_REVIEW
a1 '{'house_number': u'400', 'house': u'Daubendiek Karen', 'lon': -121.502926, 'phone': u'+1 916 321 4500', 'postcode': u'95814', 'country': u'US', 'lat': 38.579005, 'road': u'Capitol Mall'}' a2 '{'house_number': u'400', 'house': u'(Simpson Timber Company)', 'lon': -121.502922, 'phone': u'+1 916 492 9616', 'postcode': u'95814', 'country': u'US', 'lat': 38.579029, 'road': u'Capitol Mall'}'
p1 'Country Code: 1 National Number: 9163214500' p2 'Country Code: 1 National Number: 9164929616'
have True
same False
different True
-----
Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 420, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 352, in revised_dupe_class
    raise Exception, e
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'

@thisisaaronland
Author

More:

Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
    if dupe_class == None:
  File "/usr/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
    return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'

This was triggered by the following:

        if dupe_class == None:
            print "DUPE CLASS IS NONE"
            return duplicate_status.NEEDS_REVIEW

Now trying if dupe_class.value == None instead, but this does suggest that something, somewhere, is creating a postal.EnumValue(foo) instance where foo is None...

@thisisaaronland
Author

Once more, with type(dupe_class) == types.NoneType, because...

Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
    if dupe_class.value == None:
AttributeError: 'NoneType' object has no attribute 'value'

@thisisaaronland
Author

type(dupe_class) == types.NoneType appears to have fixed (or at least trapped) the problem.
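
A type check like this works because it never calls the object's comparison method. The identity test `is None` has the same property and is the more common idiom. The sketch below illustrates the difference under assumptions: `StrictValue` is a hypothetical stand-in for an enum value whose comparison rejects non-integers, its value is a placeholder, and `__eq__` is used instead of `__cmp__` so it runs on Python 3.

```python
class StrictValue(object):
    """Hypothetical stand-in whose equality check rejects non-integers."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if not isinstance(other, int):
            raise TypeError("requires an int, not a %s"
                            % type(other).__name__)
        return self.value == other


def is_missing(dupe_class):
    """Identity test: safe for None and for any object, never calls __eq__."""
    return dupe_class is None


needs_review = StrictValue(3)  # placeholder value

# `needs_review == None` would raise TypeError via __eq__;
# the identity test below does not.
missing_for_enum = is_missing(needs_review)
missing_for_none = is_missing(None)
```

Unlike `dupe_class.value == None`, the identity test also handles the case where `dupe_class` is itself `None`, which is what the `AttributeError` above points to.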

thisisaaronland added a commit to sfomuseum/lieu that referenced this issue Aug 1, 2018