
disk has reached capacity issue with moderate record size with >500 gb of free disk space #965

zwarshavsky opened this issue Feb 18, 2022 · 15 comments
zwarshavsky commented Feb 18, 2022

Related to #581, which I am unsure why it was closed, since a resolution does not seem to be available.

I have run into the same disk and memory issue reported there: once on a large SageMaker instance with 1 TB of disk space on 1M records, and once locally on 10M records with 500 GB of disk space.

Is there a way to set a limit on this temp file, and where is the temp file located?

fgregg (Contributor) commented Feb 18, 2022

can you post a traceback and your data model?

zwarshavsky (Author) commented Feb 18, 2022

Data models: see next comment.

The "Step 2" model caused issues on SageMaker with a 1M-row input file.

Both times the failure hit at the OS or kernel level. The last local run with 10M rows forced a restart of my machine when I reached dangerously low remaining disk space (starting with 330 GB free).

On SageMaker the error was something like "disk or database has reached capacity", and it killed the IPython kernel.

Edit: removed attached settings files

fgregg (Contributor) commented Feb 18, 2022

do you have any sense of where in the program you were?

fgregg (Contributor) commented Feb 18, 2022

could you just post the dictionary definition of the data model?

zwarshavsky (Author) commented Feb 18, 2022

model 1:

fields = [
    {'field': 'email', 'type': 'String'},
    {'field': 'phone_number', 'type': 'String'}
    ]

model 2:

fields = [
    {'field': 'full_name', 'type': 'Name'},
    {'field': 'full_address', 'type': 'Address' },
    {'field': 'phone_number', 'type': 'String'}
    ]
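
For reference, these definitions get handed to the library roughly like this before partition() is ever reached (a minimal sketch against the dedupe 2.x dict-style variable definitions above; training and data loading are elided):

import dedupe

fields = [
    {'field': 'full_name', 'type': 'Name'},
    {'field': 'full_address', 'type': 'Address'},
    {'field': 'phone_number', 'type': 'String'},
    ]

deduper = dedupe.Dedupe(fields)
# ... prepare_training(), labeling, and train() elided ...
# clustered_dupes = deduper.partition(data_d, 0.5)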

zwarshavsky (Author) commented:

On deduper.partition()

fgregg (Contributor) commented Feb 19, 2022

from the information we have here, i think this is probably not a bug.

within the partition method, there are a lot of places where potentially very large objects will be written to disk. historically, we have not really done anything to reduce disk usage.

here are the places where we write a lot to disk, and some possible mitigations:

  1. the blocking map. this is written to a sqlite database. virtual compound predicates might help a little bit, but beyond that, there's not a lot we can do.
  2. the join that produces the record pairs. if this query leads sqlite to produce a temporary materialization, this could be very big. there's potentially a lot that could be done here.
  3. the scored pairs are written to a memmapped numpy array. if we did some pre-filtering of the scores, as we have previously discussed, that would likely help significantly (see the sketch after this list).
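
To make (3) concrete, here is a minimal sketch of the pre-filtering idea, assuming scored pairs arrive in chunks as numpy structured arrays with 'pairs' and 'score' fields; it is an illustration of the approach, not dedupe's actual internals:

import os
import tempfile

import numpy as np

# Hypothetical dtype for scored pairs: two record ids plus a score.
SCORE_DTYPE = [('pairs', '<u8', 2), ('score', 'f4')]


def write_filtered_scores(score_chunks, min_score=0.1):
    """Drop low-scoring pairs before they ever reach the on-disk memmap."""
    kept = [chunk[chunk['score'] > min_score] for chunk in score_chunks]
    n_kept = sum(len(chunk) for chunk in kept)
    if n_kept == 0:
        return np.empty(0, dtype=SCORE_DTYPE)

    fd, path = tempfile.mkstemp(suffix='.scores')
    os.close(fd)
    scores = np.memmap(path, dtype=SCORE_DTYPE, mode='w+', shape=(n_kept,))

    start = 0
    for chunk in kept:
        scores[start:start + len(chunk)] = chunk
        start += len(chunk)
    return scores

In practice you would stream the chunks in two passes rather than holding the filtered chunks in memory, but the disk-side effect is the same: only pairs above the cutoff ever reach the memmap.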

I'm open to all these types of changes, but I would want to start with actually knowing where the bottleneck is. @zwarshavsky could you put in some monitoring to see where in the pairs method you run out of disk space?
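
A lightweight way to get that monitoring (a sketch; it assumes the large temp files land in Python's default temp directory, so pass a different path if they show up somewhere else):

import shutil
import tempfile
import threading
import time


def log_free_space(path=None, interval=30):
    """Print free space for `path` (default: the temp directory) every `interval` seconds."""
    path = path or tempfile.gettempdir()

    def _poll():
        while True:
            free_gb = shutil.disk_usage(path).free / 1e9
            print(f"{time.strftime('%H:%M:%S')} free on {path}: {free_gb:.1f} GB")
            time.sleep(interval)

    threading.Thread(target=_poll, daemon=True).start()


# log_free_space()  # start this just before calling deduper.partition(...)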

NickCrews (Contributor) commented Feb 20, 2022

@fgregg Do you think it would be useful if dedupe actually had some profiling code built in? It seems like this sort of debugging/guesswork is fairly common. I'm no expert in this, but perhaps, following this example, it would just require adding a @profile decorator to all the functions we care about. Then for debugging you just ask people to run mprof run myscript.py and post the output of mprof plot. We could add profile as an extra so that it isn't a required dependency of dedupe. Something similar could be done for disk space usage, though it doesn't look like there's quite as turnkey a solution.
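
For reference, the pattern being described looks roughly like this (a sketch with memory_profiler; score_records is a stand-in name, not an actual dedupe function):

# pip install memory_profiler
from memory_profiler import profile


@profile
def score_records(pairs):
    # hypothetical function we want to watch
    ...

# Then, from a shell:
#   mprof run myscript.py
#   mprof plot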

fgregg (Contributor) commented Feb 20, 2022

interesting, is @profile really a no-op?

NickCrews (Contributor) commented:

Good thought. I don't know for sure, but looking at the source code, my impression is that it will always have overhead. To get around this, we could write our own decorator like:

import os

# assuming memory_profiler's `profile` is the wrapper in question
from memory_profiler import profile


def dd_profile(func):
    # Maybe a better way to configure this? Would have to be at import time.
    if os.environ.get("DEDUPE_PROFILE"):
        # Actually add the profiler wrapper
        return profile(func)
    else:
        # no-op: return the function unchanged
        return func
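
Usage would then be opt-in, something like this (again a sketch; the decorated function is hypothetical):

@dd_profile
def score_records(pairs):
    ...

# Profiling only kicks in when the environment variable is set:
#   DEDUPE_PROFILE=1 mprof run myscript.py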

fgregg (Contributor) commented Feb 21, 2022

interesting idea, can you open another issue for that, @NickCrews ?

hlra commented Sep 20, 2022

I am having the same issue described by @zwarshavsky above. It happened when I tried to run the small-dataset dedupe example code on larger data with 1.7M rows, without the SQL implementation. A large temp file was written during

clustered_dupes = deduper.partition(data_d, 0.5)

The error was thrown when the temp file was at 180 GB in the Windows AppData folder, although there was about 120 GB of free disk space left. I was running the code on a Windows Server machine. Let me know if and how I can be of any help to track this down further.

fgregg (Contributor) commented Sep 20, 2022

what version of dedupe are you running?

hlra commented Sep 21, 2022

I am mostly using 2.0.13 currently because of #1077. But I ran it again with 2.0.18 now, and this is the error message that I get:

Traceback (most recent call last):
  File "E:\.conda\envs\Dissertation\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<string>", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Users\...\2nd Step - Identify duplicate shareholders.py", line 256, in <module>
    clustered_dupes = deduper.partition(data_1, 0.5)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 177, in partition
    clusters = list(clusters)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 185, in _add_singletons
    for record_ids, score in clusters:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 334, in cluster
    yield from clustering.cluster(scores, threshold)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 38, in connected_components
    edgelist = numpy.memmap(
  File "E:\.conda\envs\..\lib\site-packages\numpy\core\memmap.py", line 284, in __new__
    self.filename = None
OSError: [Errno 28] No space left on device

There is another 105GB of free space on the device though.
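
A possible workaround in the meantime (a sketch; it assumes the memmap's backing file goes through Python's tempfile module, which honors the TMPDIR/TEMP/TMP environment variables): point the temp directory at the drive that still has free space before anything uses tempfile.

import os

# Redirect Python's temp directory; the target directory must already exist.
# Set these at the very top of the script, before tempfile picks its default.
big_tmp = r"E:\dedupe_tmp"  # hypothetical path on the drive with free space
os.environ["TMPDIR"] = big_tmp  # Unix
os.environ["TEMP"] = big_tmp    # Windows
os.environ["TMP"] = big_tmp     # Windows fallback

import tempfile
print(tempfile.gettempdir())  # should now report the redirected directory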

hlra commented Sep 21, 2022

In both runs the error seems to have been thrown when the temp file was at about 173GB.
