
Update custom function for downloading files from s3 #458

Merged · 7 commits · Aug 16, 2022

Conversation

DyfanJones (Member)

This method bypasses R when collecting data from S3. Instead, it calls httr::write_disk from paws.common:::issue and relies on libcurl to do the heavy lifting through curl::curl_fetch_disk.

Addresses #401
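For illustration, here is a minimal sketch of the general technique (not this PR's actual code path): streaming a response body straight to disk via libcurl so it is never buffered in R. The URL and file name below are hypothetical.

```r
library(curl)
library(httr)

# Hypothetical pre-signed (or public) S3 object URL and local destination.
url  <- "https://my-bucket.s3.amazonaws.com/path/to/large-object.csv"
dest <- "large-object.csv"

# curl_fetch_disk() has libcurl write the body directly to `dest`,
# so the object is never held in an R memory buffer.
res <- curl::curl_fetch_disk(url, path = dest)
stopifnot(res$status_code == 200)

# The same idea via httr: write_disk() is a config that routes the
# response body to a file instead of into memory.
res2 <- httr::GET(url, httr::write_disk(dest, overwrite = TRUE))
httr::stop_for_status(res2)
```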

codecov bot commented Sep 10, 2021

Codecov Report

Merging #458 (761529a) into main (c0b0866) will increase coverage by 0.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #458      +/-   ##
==========================================
+ Coverage   82.68%   82.71%   +0.03%     
==========================================
  Files         112      112              
  Lines        6400     6412      +12     
==========================================
+ Hits         5292     5304      +12     
  Misses       1108     1108              
Impacted Files                   Coverage Δ
paws/paws.common/R/request.R     79.16% <0.00%> (+0.29%) ⬆️
paws/paws.common/R/net.R         86.66% <0.00%> (+0.95%) ⬆️



philiporlando commented Oct 14, 2024

Hi, thank you for this great update! I’m trying to confirm my understanding of how the s3$download_file() function behaves in terms of memory management in the newer version.

I’m using s3$download_file() within a {targets} pipeline with large-scale branching targets. This is essentially making thousands of individual calls to s3$download_file() across multiple CPU threads. I’ve noticed that my system's memory consumption rises significantly during the download process, eventually leading to memory limits being hit and the system crashing. My understanding is that, with this merged PR, s3$download_file() should not accumulate memory since files are now written directly to disk instead of being held in memory first. I'm trying to pinpoint where the memory accumulation is happening.

Can you confirm whether there are any known memory issues in the current implementation, or whether there might still be room for memory to accumulate under certain circumstances? I am troubleshooting my pipeline and would appreciate any insight into whether explicit gc() calls might help, or whether there's something I'm missing in how the function behaves.

Thanks in advance!
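For reference, a minimal sketch of the usage pattern described above, with an explicit gc() after each download as a troubleshooting step. The bucket, keys, and paths are hypothetical, it assumes the boto3-style Bucket/Key/Filename arguments, and whether gc() actually helps here is exactly the open question.

```r
library(paws)

s3 <- paws::s3()

# Hypothetical bucket and object keys; in the real pipeline these come
# from {targets} branching, one key per branch.
bucket <- "my-data-bucket"
keys   <- sprintf("raw/part-%04d.parquet", 1:5000)
dir.create("downloads", showWarnings = FALSE)

for (key in keys) {
  dest <- file.path("downloads", basename(key))

  # With the change in this PR, the object body should be streamed to
  # `dest` by libcurl rather than accumulated in R's heap.
  s3$download_file(Bucket = bucket, Key = key, Filename = dest)

  # Troubleshooting step only: force a garbage collection after each
  # download to see whether growth is due to uncollected R objects.
  invisible(gc())
}
```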
