Memory improvements for PUT operations #538
Conversation
In addition to avoiding a re-allocation in the loop, we also factor both the stream and file functions for getting digest and file size in terms of `io.Reader`s. Allocations and memory usage are now no longer a function of input size.
Note: I'm having trouble running the unit tests in this project (some of the shell scripts seem to not be properly escaping special characters in my snowflake password). Would appreciate someone on the snowflake team giving this PR the green light to run in CI once they've had a chance to review that the code is safe.
```diff
@@ -83,7 +83,6 @@ func encryptStream(
 	}
 	mode.CryptBlocks(cipherText, chunk)
 	out.Write(cipherText[:len(chunk)])
```
Just noticed: this should probably be `out.Write(cipherText[:n])` to more proactively prevent writing buffered data from previous loop cycles.
Thank you for submitting this PR. Unfortunately, this fails our bulk array binding tests, where the `meta.srcStream` passed into `getDigestAndSizeForStream` does not read its data into the bytes properly (once it's read, the read/seeker is not reset, thereby emptying the buffer). If you find a remedy that addresses this, I will be happy to review and merge this.
Thanks @sfc-gh-jbahk. If I understand what you're saying correctly, there's a test somewhere that calls the method directly. I'm probably missing something obvious, but I don't see such a test. Do you mind pointing me to the failing test?
It's the BulkArrayBinding tests.
Oh I see - the bind uploader calls the "put" command; I was looking for direct usages of the method. I'm still having trouble getting tests running on my local machine; let me try again tomorrow. In the meantime, I'm pushing what I think should be the fix.
The callers here assume they can re-use the buffer, so we should reset it after invoking `getDigestAndSizeForStream`.
Whoops.
No worries; I appreciate you putting in the time for this. As for the tests, it might not be possible unless you have internal credentials that are able to run the whole suite.
@asonawalla do you have any updates on this?
@sfc-gh-jbahk unfortunately I had to shift attention to some other work this week, but I do still hope to wrap this up soon. My plan is to get the tests working before I make more changes, then likely revert the reader changes to the stream API. That way this PR will be in good shape so that at least the file-based API keeps memory usage under control, and we can tackle the stream's memory usage some time in the future.
FYI, I think I have most of the tests running successfully on the master branch, but still seeing this failure (which looks related, so I'm trying to avoid skipping it):
@asonawalla thank you. Where did you get that excerpt from? That test works locally for me.
It's from running the tests.
I actually just noticed that there's another failure in there, but just based on the names of these tests, that one seems less important to get right for this change.
Ah, I see. Thanks. I might try to merge some of these changes faster to help some perf issues on our end, actually.
I updated #527 to incorporate some of your changes (opening vs. reading).
Hi @asonawalla! Do you still want to merge this PR? Can you rebase and resolve conflicts? I'd like to merge it when it's ready.
Hey @sfc-gh-pfus, the most important part here (the buffer re-allocation) was captured in #527, so I'm going to close this out. The problems with the streaming PUT described in this issue still appear to be real, but this code doesn't address that.
Thank you @asonawalla for your input anyway!
Description
See Issue #536 (SNOW-535791) for more background.
I've addressed some of the low-hanging fruit memory improvements in this PR. Specifically, using the non-streaming API (and having the caller be responsible for staging the full data on disk instead) now follows a code path that doesn't read the whole file into memory. The streaming API still reads the whole file into memory and will need more effort to fix beyond the changes suggested here.
In addition to the changes, I've also added a few benchmarks for the functions I've been working with, which can be run with a command similar to:

```
SKIP_SETUP=1 go test -bench '^Benchmark.*$' -run '^$'
```

Running these benchmarks at the baseline benchmarks commit and again at HEAD shows that (at least for the file-based functions) the number of allocations and bytes allocated no longer scales with input size after the changes are applied. Note that these benchmarks don't need to connect to a test Snowflake instance (hence `SKIP_SETUP=1`), but they do take a bit of time to run since they generate a lot of fake data.

Checklist
- [ ] Run `make fmt` to fix inconsistent formats
- [ ] Run `make lint` to get lint errors and fix all of them
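As a sketch of what one of these benchmarks might look like (the function name and sizes are hypothetical, not the PR's actual benchmarks), `b.ReportAllocs` is what surfaces allocs/op so you can confirm they stay flat as input size grows:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"testing"
)

// BenchmarkDigestStream hashes a fixed-size reader through io.Copy;
// b.ReportAllocs reports allocations per operation in the output.
func BenchmarkDigestStream(b *testing.B) {
	data := bytes.Repeat([]byte("x"), 1<<20) // 1 MiB of fake data
	b.ReportAllocs()
	b.SetBytes(int64(len(data)))
	for i := 0; i < b.N; i++ {
		h := sha256.New()
		if _, err := io.Copy(h, bytes.NewReader(data)); err != nil {
			b.Fatal(err)
		}
		_ = h.Sum(nil)
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`
	// for demonstration purposes.
	fmt.Println(testing.Benchmark(BenchmarkDigestStream))
}
```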