SNOW-535791: PUT operations read full input data into memory #536
Labels: enhancement (The issue is a request for improvement or a new feature), status-triage_done (Initial triage done, will be further handled by the driver team)
We use the streaming PUT feature of gosnowflake to upload data to Snowflake internal stages. Recently, we started throwing GB-sized files at it and saw our memory usage explode. Through some experimentation with an isolated production workload, we saw that for a 1GB input file, the driver was using approximately 8GB of memory when the PUT operation was configured as a stream, and 2-3GB when it was configured as a file on disk.
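For context, this is roughly how we invoke the streaming PUT. A minimal sketch only: the DSN environment variable, file name, and stage name are placeholders, and it assumes the `WithFileStream` context helper described in the driver docs for uploading from a stream:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"os"

	sf "github.com/snowflakedb/gosnowflake"
)

func main() {
	// Hypothetical DSN via environment variable.
	db, err := sql.Open("snowflake", os.Getenv("SNOWFLAKE_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Open the (potentially multi-GB) file we want to upload as a stream.
	f, err := os.Open("large-input.dat") // hypothetical file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Attach the reader to the context so the driver sources the PUT
	// payload from the stream rather than from a file on disk.
	ctx := sf.WithFileStream(context.Background(), f)

	// The file path in the statement is a placeholder when a stream is supplied.
	if _, err := db.ExecContext(ctx, "PUT file:///tmp/placeholder @my_stage"); err != nil {
		log.Fatal(err)
	}
}
```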
Looking at the code, the problem seems to be that the alleged "streams" are often passed around as `bytes.Buffer`s that read the entire input data into memory, in some cases multiple times over, with the exact amount depending on the command and options (e.g. here, here, here, everywhere this is invoked, etc.). Some crude experimentation with a fork I've created suggests there are likely more places.

Beyond the obvious issue that the documentation on streaming PUTs is misleading, I would go so far as to say the driver should never need to read the entire input contents into memory. The major operations it's responsible for (compression, encryption, calculating digests, etc.) are all possible with modest buffers that don't need to scale with the input size.
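To illustrate that last point, compression and digest calculation compose naturally over `io.Reader`/`io.Writer`, so memory stays bounded by fixed-size buffers regardless of input size. A standard-library-only sketch (not driver code; the function and file names are hypothetical):

```go
package main

import (
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

// compressAndDigest streams src through gzip while computing a SHA-256
// digest of the compressed output. Only fixed-size buffers are involved,
// so memory usage does not scale with the size of the input.
func compressAndDigest(src io.Reader, dst io.Writer) (string, error) {
	digest := sha256.New()
	// Bytes written by the gzip writer flow to both the destination and
	// the hash without ever being held in memory in full.
	gz := gzip.NewWriter(io.MultiWriter(dst, digest))
	if _, err := io.Copy(gz, src); err != nil {
		return "", err
	}
	if err := gz.Close(); err != nil {
		return "", err
	}
	return hex.EncodeToString(digest.Sum(nil)), nil
}

func main() {
	in, err := os.Open("large-input.dat") // hypothetical multi-GB input
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("large-input.dat.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	sum, err := compressAndDigest(in, out)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("sha256 of compressed output:", sum)
}
```

Encryption fits the same pattern via a streaming cipher wrapped around the writer; nothing about these steps requires buffering the whole payload.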
I'd be happy to contribute here, but wanted to start a discussion since the required changes seem to be nontrivial.