
disk has reached capacity issue with moderate record size with >500 gb of free disk space #965

zwarshavsky opened this issue Feb 18, 2022 · 15 comments
zwarshavsky commented Feb 18, 2022

Related to #581, which I am unsure why it was closed, since a resolution does not seem to be available.

I have run into the same disk and memory issue reported there: once on a large SageMaker instance with 1 TB of disk space on 1M records, and once locally on 10M records with 500 GB of disk space.

Is there a way to set a limit on this temp file, and where is the temp file located?

fgregg (Contributor) commented Feb 18, 2022

can you post a traceback and your data model?

zwarshavsky (Author) commented Feb 18, 2022

Data models: see next comment.

The "Step 2" model caused issues on SageMaker with a 1M-row input file.

Both times the failure hit at the OS or kernel level. The last local run with 10M rows forced a restart of my machine when I reached dangerously low remaining disk space (starting with 330 GB free).

On SageMaker the error was something like "disk or database has reached capacity", and it killed the IPython kernel.

Edit: removed attached settings files

fgregg (Contributor) commented Feb 18, 2022

do you have any sense of where in the program you were?

fgregg (Contributor) commented Feb 18, 2022

could you just post the dictionary definition of the data model?

zwarshavsky (Author) commented Feb 18, 2022

model 1:

fields = [
    {'field': 'email', 'type': 'String'},
    {'field': 'phone_number', 'type': 'String'}
    ]

model 2:

fields = [
    {'field': 'full_name', 'type': 'Name'},
    {'field': 'full_address', 'type': 'Address' },
    {'field': 'phone_number', 'type': 'String'}
    ]
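
For reference, these definitions get handed to the library roughly like this before partition() is ever reached (a minimal sketch against the dedupe 2.x dict-style variable definitions above; training and data loading are elided):

import dedupe

fields = [
    {'field': 'full_name', 'type': 'Name'},
    {'field': 'full_address', 'type': 'Address'},
    {'field': 'phone_number', 'type': 'String'},
    ]

deduper = dedupe.Dedupe(fields)
# ... prepare_training(), labeling, and train() elided ...
# clustered_dupes = deduper.partition(data_d, 0.5)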

zwarshavsky (Author) commented:

On deduper.partition()

fgregg (Contributor) commented Feb 19, 2022

from the information we have here, i think this is probably not a bug.

within the partition method, there are a lot of places where potentially very large objects will be written to disk. historically, we have not really done anything to reduce disk usage.

here are the places where we write a lot to disk, and some possible mitigations:

  1. the blocking map. this is written to a sqlite database. virtual compound predicates might help a little bit, but beyond that, there's not a lot we can do.
  2. the join that produces the record pairs. if this query leads sqlite to produce a temporary materialization, this could be very big. there's potentially a lot that could be done here.
  3. the scored pairs are written to a memmapped numpy array. if we did some pre-filtering of the scores, as we have previously discussed, that would likely help significantly (see the sketch after this list).
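
To make (3) concrete, here is a minimal sketch of the pre-filtering idea, assuming scored pairs arrive in chunks as numpy structured arrays with 'pairs' and 'score' fields; it is an illustration of the approach, not dedupe's actual internals:

import os
import tempfile

import numpy as np

# Hypothetical dtype for scored pairs: two record ids plus a score.
SCORE_DTYPE = [('pairs', '<u8', 2), ('score', 'f4')]


def write_filtered_scores(score_chunks, min_score=0.1):
    """Drop low-scoring pairs before they ever reach the on-disk memmap."""
    kept = [chunk[chunk['score'] > min_score] for chunk in score_chunks]
    n_kept = sum(len(chunk) for chunk in kept)
    if n_kept == 0:
        return np.empty(0, dtype=SCORE_DTYPE)

    fd, path = tempfile.mkstemp(suffix='.scores')
    os.close(fd)
    scores = np.memmap(path, dtype=SCORE_DTYPE, mode='w+', shape=(n_kept,))

    start = 0
    for chunk in kept:
        scores[start:start + len(chunk)] = chunk
        start += len(chunk)
    return scores

In practice you would stream the chunks in two passes rather than holding the filtered chunks in memory, but the disk-side effect is the same: only pairs above the cutoff ever reach the memmap.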

I'm open to all these types of changes, but I would want to start with actually knowing where the bottleneck is. @zwarshavsky could you put in some monitoring to see where in the pairs method you run out of disk space?
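
A lightweight way to get that monitoring (a sketch; it assumes the large temp files land in Python's default temp directory, so pass a different path if they show up somewhere else):

import shutil
import tempfile
import threading
import time


def log_free_space(path=None, interval=30):
    """Print free space for `path` (default: the temp directory) every `interval` seconds."""
    path = path or tempfile.gettempdir()

    def _poll():
        while True:
            free_gb = shutil.disk_usage(path).free / 1e9
            print(f"{time.strftime('%H:%M:%S')} free on {path}: {free_gb:.1f} GB")
            time.sleep(interval)

    threading.Thread(target=_poll, daemon=True).start()


# log_free_space()  # start this just before calling deduper.partition(...)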

NickCrews (Contributor) commented Feb 20, 2022

@fgregg Do you think it would be useful if dedupe actually had some profiling code built in? It seems like this sort of debugging/guesswork is fairly common. I'm no expert in this, but perhaps, following this example, it would just require adding a @profile decorator to all the functions we care about. Then for debugging you just ask people to run mprof run myscript.py and post the output of mprof plot. We could add profile as an extra so that it isn't a required dependency of dedupe. Something similar could be done for disk space usage, though it doesn't look like there's quite as turnkey a solution.
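
For reference, the pattern being described looks roughly like this (a sketch with memory_profiler; score_records is a stand-in name, not an actual dedupe function):

# pip install memory_profiler
from memory_profiler import profile


@profile
def score_records(pairs):
    # hypothetical function we want to watch
    ...

# Then, from a shell:
#   mprof run myscript.py
#   mprof plot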

fgregg (Contributor) commented Feb 20, 2022

interesting, is @profile really a no-op?

NickCrews (Contributor) commented:

Good thought. I don't know for sure, but looking at the source code, my impression is that it will always have overhead. To get around this, we could write our own decorator like:

import os

# assuming memory_profiler's `profile` is the wrapper in question
from memory_profiler import profile


def dd_profile(func):
    # Maybe a better way to configure this? Would have to be at import time.
    if os.environ.get("DEDUPE_PROFILE"):
        # Actually add the profiler wrapper
        return profile(func)
    else:
        # no-op: return the function unchanged
        return func
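
Usage would then be opt-in, something like this (again a sketch; the decorated function is hypothetical):

@dd_profile
def score_records(pairs):
    ...

# Profiling only kicks in when the environment variable is set:
#   DEDUPE_PROFILE=1 mprof run myscript.py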

fgregg (Contributor) commented Feb 21, 2022

interesting idea, can you open another issue for that, @NickCrews ?

hlra commented Sep 20, 2022

I am having the same issue described by @zwarshavsky above. It happened when I tried to run the small-dataset dedupe example code on larger data with 1.7M rows, without the SQL implementation. A large temp file was written during

clustered_dupes = deduper.partition(data_d, 0.5)

The error was thrown when the temp file was at 180 GB in the Windows AppData folder, although there was about 120 GB of free disk space left. I was running the code on a Windows Server machine. Let me know if and how I can be of any help to track this down further.

fgregg (Contributor) commented Sep 20, 2022

what version of dedupe are you running?

hlra commented Sep 21, 2022

I am mostly using 2.0.13 currently because of #1077. But I ran it again with 2.0.18 now, and this is the error message that I get:

Traceback (most recent call last):
  File "E:\.conda\envs\Dissertation\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<string>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Users\...\2nd Step - Identify duplicate shareholders.py", line 256, in <module>
    clustered_dupes = deduper.partition(data_1, 0.5)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 177, in partition
    clusters = list(clusters)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 185, in _add_singletons
    for record_ids, score in clusters:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 334, in cluster
    yield from clustering.cluster(scores, threshold)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 38, in connected_components
    edgelist = numpy.memmap(
  File "E:\.conda\envs\..\lib\site-packages\numpy\core\memmap.py", line 284, in __new__
    self.filename = None
OSError: [Errno 28] No space left on device

There is another 105GB of free space on the device though.
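
A possible workaround in the meantime (a sketch; it assumes the memmap's backing file goes through Python's tempfile module, which honors the TMPDIR/TEMP/TMP environment variables): point the temp directory at the drive that still has free space before anything uses tempfile.

import os

# Redirect Python's temp directory; the target directory must already exist.
# Set these at the very top of the script, before tempfile picks its default.
big_tmp = r"E:\dedupe_tmp"  # hypothetical path on the drive with free space
os.environ["TMPDIR"] = big_tmp  # Unix
os.environ["TEMP"] = big_tmp    # Windows
os.environ["TMP"] = big_tmp     # Windows fallback

import tempfile
print(tempfile.gettempdir())  # should now report the redirected directory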

hlra commented Sep 21, 2022

In both runs the error seems to have been thrown when the temp file was at about 173GB.
