__cmp__ errors triggered in address_dupe_pairs (dedupe.py) #15

thisisaaronland · 2018-07-25T01:21:48Z

This appears to be a variation on issue #9, meaning a type mismatch is being triggered in postal/utils/enum.py, but I am hoping you can provide some input on how to track down the root cause in order to remedy things.

That, or some suggestions for how to trap this sort of thing and drop the non-result, because in the example below the EMR process ran for ~2 hours before finally failing.
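One way to trap this sort of failure and drop the offending record is to wrap the filter predicate so that a TypeError is logged and treated as "no match". The following is a minimal sketch, not lieu's code: `trap_errors` and `strict_predicate` are hypothetical names, the record shape is assumed from the lambda in the traceback, and it is written in Python 3 syntax (where comparing a tuple with an int raises TypeError) rather than the Python 2.7 used on the cluster.

```python
import logging

logger = logging.getLogger("dedupe")


def trap_errors(predicate, default=False):
    """Wrap a predicate so a TypeError drops the record instead of raising."""
    def wrapped(record):
        try:
            return predicate(record)
        except TypeError as err:
            # Log the offending record so the root cause can still be traced.
            logger.warning("dropping record %r: %s", record, err)
            return default
    return wrapped


def strict_predicate(record):
    # Hypothetical stand-in for the lambda in dedupe.py.
    # Assumed record shape: ((uid1, uid2), (status, is_sub_building_dupe)).
    _, (status, is_sub_building_dupe) = record
    return status >= 1 and is_sub_building_dupe


records = [
    (("a", "b"), (1, True)),
    (("c", "d"), ((1, 1.0), True)),  # malformed: status is a tuple
]

# The malformed record is logged and dropped; the well-formed one is kept.
kept = list(filter(trap_errors(strict_predicate), records))
```

The same wrapper could be passed to an RDD's `filter` in place of the bare lambda, with the caveat that the warnings would then land in the executor logs rather than on the driver.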

The input data should be clean so I am not sure how to debug these errors...

        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532454729366_0001/container_1532454729366_0001_02_000003/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
    .filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
  File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
    return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'tuple'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
        at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)

thisisaaronland · 2018-07-25T22:19:31Z

After adding more verbose logging to utils/enum.py this is what I see:

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1532544793717_0001/container_1532544793717_0001_02_000004/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/lib/python2.7/site-packages/lieu/spark/dedupe.py", line 28, in <lambda>
    .filter(lambda ((uid1, uid2), (address_dupe_status, is_sub_building_dupe)): address_dupe_status in (duplicate_status.EXACT_DUPLICATE, duplicate_status.LIKELY_DUPLICATE) and is_sub_building_dupe) \
  File "/usr/local/lib64/python2.7/site-packages/postal/utils/enum.py", line 20, in __cmp__
    raise Exception, err
Exception: failed to __cmp__ 'EXACT_DUPLICATE' this: '9' that: '(EXACT_DUPLICATE, 1.0)' error: 'long.__cmp__(x,y) requires y to be a 'long', not a 'tuple''

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
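The "that" value in the log line above is `(EXACT_DUPLICATE, 1.0)`, which looks like a (status, similarity) pair reaching the filter where a bare status was expected. If that reading is right, the membership test could unpack the pair first. A minimal sketch under that assumption follows; `status_of` and `is_confirmed_dupe` are hypothetical helpers, plain integers stand in for the enum values, and only the value 9 for EXACT_DUPLICATE comes from the log.

```python
EXACT_DUPLICATE = 9    # value reported in the log line ("this: '9'")
LIKELY_DUPLICATE = 6   # hypothetical placeholder value, for illustration only


def status_of(value):
    """Return the bare status from a status or a (status, similarity) pair."""
    if isinstance(value, tuple):
        return value[0]
    return value


def is_confirmed_dupe(value):
    """Membership test that tolerates both shapes of input."""
    return status_of(value) in (EXACT_DUPLICATE, LIKELY_DUPLICATE)


pair_result = is_confirmed_dupe((EXACT_DUPLICATE, 1.0))
bare_result = is_confirmed_dupe(EXACT_DUPLICATE)
other_result = is_confirmed_dupe((3, 0.4))
```

This only papers over the mismatch at the point of comparison; it does not explain why some records carry the pair and others the bare status.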

thisisaaronland · 2018-07-28T02:03:04Z

More example errors and details:

----
dupe class None == NEEDS_REVIEW
a1 '{'house_number': u'400', 'house': u'Daubendiek Karen', 'lon': -121.502926, 'phone': u'+1 916 321 4500', 'postcode': u'95814', 'country': u'US', 'lat': 38.579005, 'road': u'Capitol Mall'}' a2 '{'house_number': u'400', 'house': u'(Simpson Timber Company)', 'lon': -121.502922, 'phone': u'+1 916 492 9616', 'postcode': u'95814', 'country': u'US', 'lat': 38.579029, 'road': u'Capitol Mall'}'
p1 'Country Code: 1 National Number: 9163214500' p2 'Country Code: 1 National Number: 9164929616'
have True
same False
different True
-----
Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 420, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 352, in revised_dupe_class
    raise Exception, e
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'

thisisaaronland · 2018-07-31T16:28:35Z

More:

Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
    if dupe_class == None:
  File "/usr/lib64/python2.7/site-packages/postal/utils/enum.py", line 16, in __cmp__
    return self.value.__cmp__(other)
TypeError: long.__cmp__(x,y) requires y to be a 'long', not a 'NoneType'

This was triggered by the following:

        if dupe_class == None:
            print "DUPE CLASS IS NONE"
            return duplicate_status.NEEDS_REVIEW

Now trying if dupe_class.value == None, but it does suggest that something, somewhere is creating a postal.EnumValue(foo) instance where foo is None...

thisisaaronland · 2018-07-31T18:59:17Z

Once more, with type(dupe_class) == types.NoneType, because...

Traceback (most recent call last):
  File "/usr/bin/dedupe_geojson", line 420, in <module>
    is_dupe = dupe_func(canonical, other, dupe_pairs, dupes, **dupe_func_kw)
  File "/usr/bin/dedupe_geojson", line 113, in is_name_address_dupe
    fuzzy_street_name=fuzzy_street_names)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 424, in dupe_class_and_sim
    name_fuzzy_dupe_class = PhoneNumberDeduper.revised_dupe_class(name_fuzzy_dupe_class, a1, a2)
  File "/usr/lib/python2.7/site-packages/lieu/dedupe.py", line 328, in revised_dupe_class
    if dupe_class.value == None:
AttributeError: 'NoneType' object has no attribute 'value'

thisisaaronland · 2018-08-01T16:37:27Z

type(dupe_class) == types.NoneType appears to have fixed (or at least trapped) the problem.
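A plausible explanation for why the type check works where the earlier attempts did not: `dupe_class == None` dispatches to the enum's own comparison method whenever dupe_class is an enum value, and `dupe_class.value` fails whenever dupe_class really is None. An identity check touches neither. The sketch below illustrates this with a hypothetical stand-in class (not postal's actual code), in Python 3 syntax, where `__eq__` plays the role that `__cmp__` plays in the Python 2.7 traceback; the numeric value given to NEEDS_REVIEW is made up.

```python
class EnumValueStandIn(object):
    """Hypothetical stand-in for an enum value whose comparison assumes a number."""

    def __init__(self, value, name):
        self.value = value
        self.name = name

    def __eq__(self, other):
        # Mimics the failure mode in the traceback: non-numeric `other` raises.
        if not isinstance(other, int):
            raise TypeError("cannot compare %s with %r" % (self.name, other))
        return self.value == other


NEEDS_REVIEW = EnumValueStandIn(5, "NEEDS_REVIEW")  # value is made up
EXACT = EnumValueStandIn(9, "EXACT_DUPLICATE")


def guard(dupe_class):
    """Return NEEDS_REVIEW when no class was assigned, else pass it through."""
    # `is None` compares identity, so the enum's comparison is never invoked.
    if dupe_class is None:
        return NEEDS_REVIEW
    return dupe_class


# Equality against None goes through __eq__ and raises, as in the traceback.
try:
    EXACT == None  # noqa: E711 (deliberate, to show the failure)
    equality_raised = False
except TypeError:
    equality_raised = True
```

In this sketch `dupe_class is None` behaves the same as `type(dupe_class) == types.NoneType` and needs no import.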

thisisaaronland added a commit to sfomuseum/lieu that referenced this issue Aug 1, 2018

trap NoneType in revised_dupe_class per issue openvenues#15; bump to …

9bc1dd7

…1.0.1
