"Disk has reached capacity" issue with moderate record size and >500 GB of free disk space #965
Comments
Can you post a traceback, and your data model?
Data models: see next comment. The "Step 2" model caused issues on SageMaker with a 1m-row input file. Both times the issue was encountered at the OS/kernel level. The last local run with 10m rows forced a restart of my machine when I reached dangerously low remaining disk space (starting with 330 GB free). On SageMaker it was something like a "disk or database has reached capacity" error, which killed the IPython kernel. ED: removed attached settings files.
Do you have any sense of where in the program you were?
Could you just post the dictionary definition of the data model?
Model 1:

Model 2:
On `deduper.partition()`.
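For context, here is roughly where `partition()` sits in the dedupe 2.x workflow, sketched from the library's documented small-dataset example; the records and field definitions below are placeholders, not the reporter's actual data model.

```python
import dedupe

# Placeholder records and data model (the reporter's real ones were not captured here).
data = {
    1: {"name": "Jane Doe", "address": "123 Main St"},
    2: {"name": "Jane Doe", "address": "123 Main Street"},
}
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactive labelling of candidate pairs
deduper.train()

# The step this thread is about: clustering, which is where the large temp files appear.
clustered = deduper.partition(data, 0.5)
```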
From the information we have here, I think this is probably not a bug. Within the partition method there are a lot of places where potentially very large objects will be written to disk, and historically we have not really done anything to reduce disk usage. Here are the places where we write a lot to disk, and some possible mitigations.
I'm open to all these types of changes, but I would want to start with actually knowing where the bottleneck is. @zwarshavsky, could you put in some monitoring to see where in the `pairs` method you run out of disk space?
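A minimal sketch of the kind of monitoring being asked for here, using only the standard library; `deduper` and `data` are assumed to already exist as in the reporter's script, and the 30-second interval is an arbitrary choice.

```python
import logging
import pathlib
import shutil
import tempfile
import threading

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def monitor_disk(stop: threading.Event, interval: float = 30.0) -> None:
    """Periodically log free disk space and the total size of the temp directory."""
    tmp = pathlib.Path(tempfile.gettempdir())
    while not stop.wait(interval):
        free_gb = shutil.disk_usage(tmp).free / 1e9
        tmp_gb = 0.0
        for f in tmp.rglob("*"):
            try:
                if f.is_file():
                    tmp_gb += f.stat().st_size / 1e9
            except OSError:  # temp files can vanish or be locked mid-scan
                pass
        logging.info("free: %.1f GB, temp files: %.1f GB", free_gb, tmp_gb)


stop = threading.Event()
threading.Thread(target=monitor_disk, args=(stop,), daemon=True).start()
try:
    # `deduper` and `data` as in the reporter's script (assumed).
    clustered = deduper.partition(data, 0.5)
finally:
    stop.set()
```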
@fgregg Think it would be useful if dedupe actually had some profiling code built in? It seems like this sort of debugging/guesswork is fairly common. I'm no expert in this, but perhaps, following this example, it would just require adding a `@profile` decorator.
Interesting, is `@profile` really a no-op?
Good thought. I don't know for sure, but looking at the source code my impression is that it will always have overhead. To get around this we could write our own decorator, something like:

```python
import os

# `profile` here stands in for whatever profiler gets picked,
# e.g. memory_profiler's decorator (an assumption, not settled).
from memory_profiler import profile


def dd_profile(func):
    # Maybe a better way to configure this? Would have to be at import time.
    if os.environ.get("DEDUPE_PROFILE"):
        # Actually add the profiler wrapper
        return profile(func)
    else:
        # no-op
        return func
```
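If something along those lines were adopted, usage might look like the following; the wrapped function and the `DEDUPE_PROFILE` variable are hypothetical illustrations, not an existing dedupe API.

```python
# Hypothetical: wrap an internal function suspected of heavy disk or memory use.
@dd_profile
def make_pairs(records):
    ...

# Profiling stays a no-op unless explicitly opted in before import, e.g.:
#   DEDUPE_PROFILE=1 python my_dedupe_script.py
```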
Interesting idea, can you open another issue for that, @NickCrews?
I am having the same issue described by @zwarshavsky above. This happened when I tried to run the small-dataset dedupe example code on large data with 1.7m rows, without the SQL implementation. A large temp file was written during the run. The error was thrown when the temp file was at 180GB in the Windows AppData folder, although there was about 120GB of free disk space left. I was running the code on a Windows Server machine. Let me know if and how I can be of any help to track this down further.
What version of dedupe are you running?
I am mostly using 2.0.13 currently due to this issue: #1077. But I ran it again with 2.0.18 now, and this is the error message that I get:

Traceback (most recent call last):

There is another 105GB of free space on the device, though.
In both runs the error seems to have been thrown when the temp file was at about 173GB.
Related to this issue, which I am unsure why it was closed, as a resolution does not seem to be available: #581
I have run into the same disk and memory issue as reported previously: on a large SageMaker instance with 1 TB of disk space on 1m records, and locally on 10m records with 500 GB of disk space.
Is there a way to set a limit on this temp file, and where is this temp file located?
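Assuming the temp file comes from Python's standard `tempfile` machinery (an assumption, not confirmed in this thread), its location can at least be redirected to a larger volume; this moves the file rather than capping its size.

```python
import tempfile

# Python picks the temp directory from the first of the TMPDIR, TEMP and TMP
# environment variables that is set, otherwise a platform default such as
# C:\Users\<user>\AppData\Local\Temp on Windows.
print(tempfile.gettempdir())  # check where temp files will currently land

# Redirect before the heavy calls run, e.g. from the shell:
#   Linux/macOS:  TMPDIR=/mnt/bigdisk python dedupe_script.py
#   Windows:      set TMP=D:\dedupe_tmp && python dedupe_script.py
#
# or from inside the script, before dedupe starts writing
# (the directory must already exist; the path here is hypothetical):
tempfile.tempdir = r"D:\dedupe_tmp"
```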