Memory improvements for PUT operations #538

asonawalla · 2022-01-27T21:21:36Z

Description

See Issue #536 (SNOW-535791) for more background.

I've addressed some of the low-hanging memory improvements in this PR. Specifically, the non-streaming API (where the caller is responsible for staging the full data on disk) now follows a code path that doesn't read the whole file into memory. The streaming API still reads the whole file into memory and will need more effort to fix beyond the changes suggested here.

In addition to the changes, I've also added a few benchmarks for the functions I've been working with, which can be run with a command similar to SKIP_SETUP=1 go test -bench '^Benchmark.*$' -run '^$'. Running these benchmarks at the baseline benchmarks commit and again at HEAD shows that (at least for file-based functions) the number of allocations and bytes allocated don't scale with input size after the changes are applied. Note that these benchmarks don't need to connect to a test Snowflake instance (hence SKIP_SETUP), but they do take a bit of time to run since they generate a lot of fake data.

Checklist

Code compiles correctly
Run make fmt to fix inconsistent formats
Run make lint to get lint errors and fix all of them
Created tests which fail without the change (if possible)
All tests passing
Extended the README / documentation, if necessary

In addition to avoiding a re-allocation in the loop, we also factor both stream and file functions for getting digest and file size in terms of io.Readers. Allocations and memory usage are now no longer a function of input size.

asonawalla · 2022-01-27T21:25:04Z

Note: I'm having trouble running the unit tests in this project (some of the shell scripts don't seem to properly escape special characters in my Snowflake password). I'd appreciate someone on the Snowflake team giving this PR the green light to run in CI once they've had a chance to confirm that the code is safe.

asonawalla · 2022-01-27T21:34:38Z

encrypt_util.go

@@ -83,7 +83,6 @@ func encryptStream(
 		}
 		mode.CryptBlocks(cipherText, chunk)
 		out.Write(cipherText[:len(chunk)])


Just noticed: this should probably be out.Write(cipherText[:n]) to more proactively prevent writing buffered data from previous loop cycles.

sfc-gh-jbahk · 2022-01-30T04:05:49Z

Thank you for submitting this PR. Unfortunately, this fails our bulk array binding tests where the meta.srcStream passed into getDigestAndSizeForStream does not read its data into the bytes properly (once it's read, the read/seeker is not reset, thereby emptying the buffer). If you find a remedy that addresses this, I will be happy to review and merge this.

asonawalla · 2022-01-30T04:54:08Z

Thanks @sfc-gh-jbahk. If I understand what you're saying correctly, there's a test somewhere that calls getDigestAndSizeForStream with a buffer that relies on that buffer being reset after the call. If so, the easiest fix might be to call Reset() on the buffer at the level of the caller so we can re-use the stream version of the method for files as well.

I'm probably missing something obvious, but I don't see such a test. Do you mind pointing me to the failing test?

sfc-gh-jbahk · 2022-01-30T04:55:32Z

It's the BulkArrayBinding tests.

asonawalla · 2022-01-30T05:18:15Z

Oh I see - the bind uploader calls the "put" command; I was looking for direct usages of the method.

I'm still having trouble getting tests running on my local machine; let me try again tomorrow. In the meantime, I'm pushing what I think should be the fix.

The callers here assume they can re-use the buffer, so we should reset it after invoking getDigestAndSizeForStream.

asonawalla · 2022-01-30T05:44:03Z

Whoops, Reset() on a buffer clears it rather than seeking back to 0, so the patch is wrong. I'll make some time to try again in the next day or two.

sfc-gh-jbahk · 2022-01-30T06:11:39Z

No worries; I appreciate you putting in the time for this. As for the tests, it might not be possible unless you have internal credentials that are able to run the whole suite.

sfc-gh-jbahk · 2022-02-03T20:22:55Z

@asonawalla do you have any updates on this?

asonawalla · 2022-02-03T20:37:36Z

@sfc-gh-jbahk unfortunately I had to shift attention to some other work this week, but I do still hope to wrap this up soon. My plan is to get the tests working before I make more changes, then likely revert the reader changes to the stream API. That way this PR will be in good shape so that at least the file-based API keeps memory usage under control, and we can tackle the stream's memory usage some time in the future.

asonawalla · 2022-02-03T20:49:56Z

FYI, I think I have most of the tests running successfully on the master branch, but still seeing this failure (which looks related, so I'm trying to avoid skipping it):

=== RUN   TestPutOverwrite
    put_get_test.go:326: expected SKIPPED, got UPLOADED
--- FAIL: TestPutOverwrite (2.61s)

sfc-gh-jbahk · 2022-02-03T21:01:16Z

@asonawalla thank you. Where did you get that excerpt from? That test passes locally for me.

asonawalla · 2022-02-03T21:07:49Z

It's from running make test at the repo root; here it is in the context of the rest of the output.

asonawalla · 2022-02-03T21:09:19Z

I actually just noticed that there's another failure in there, but just based on the names of these tests, that one seems less important to get right for this change.

sfc-gh-jbahk · 2022-02-03T21:09:26Z

Ah, I see. Thanks. I might try and merge some of these changes faster to help some perf issues on our end actually.

sfc-gh-jbahk · 2022-02-03T21:58:40Z

I updated #527 to incorporate some of your changes (opening vs. reading).

sfc-gh-pfus · 2023-10-30T07:05:47Z

Hi @asonawalla ! Do you still want to merge this PR? Can you rebase and solve conflicts? I'd like to merge it when it's ready.

asonawalla · 2023-10-30T15:50:27Z

Hey @sfc-gh-pfus, the most important part here (the buffer re-allocation) was captured in #527, so I'm going to close this out.

The problems with the streaming PUT described in this issue still appear to be real, but this code doesn't address that.

sfc-gh-pfus · 2023-10-31T06:23:50Z

Thank you @asonawalla for your input anyway!

Azim Sonawalla added 3 commits January 25, 2022 14:58

Add baseline benchmarks

65fdf4b

digest and size: avoid unnecessary buffers

d717489

In addition to avoiding a re-allocation in the loop, we also factor both stream and file functions for getting digest and file size in terms of io.Readers. Allocations and memory usage are now no longer a function of input size.

reuse chunk buffer in encryption

a265b34

reset buffers after getDigestAndSizeForStream

7aceec3

The callers here assume they can re-use the buffer, so we should reset it after invoking getDigestAndSizeForStream.

asonawalla closed this Oct 30, 2023

github-actions bot locked and limited conversation to collaborators Oct 30, 2023
