OOM on Write_Partitioned for S3 #6380

Open
stanbrub opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working core Core development tasks parquet Related to the Parquet integration s3
@stanbrub (Contributor)

Using a very small amount of data, `s3.write_partitioned` crashes DHC with an OOM. The script below should run comfortably in a 24G heap, yet hovering over the heap status in the DHC Code Studio shows memory usage climbing rapidly. Raising `row_count` to 2000 crashes DHC with an OOM.

Decreasing the number of unique values for the partition key mitigates the problem, so the issue appears to depend on the number of unique partition key values (or combinations of multiple partition keys) rather than on the number of rows.

import jpy
from deephaven import empty_table, garbage_collect
from deephaven.parquet import write_partitioned
from deephaven.experimental import s3

def print_heap():
    # garbage_collect()
    runtime = jpy.get_type('java.lang.Runtime').getRuntime()
    print('Heap Used MB:', (runtime.totalMemory() - runtime.freeMemory()) / 1024 / 1024)

row_count = 1_000

print_heap()
print('Generate Table')
source = empty_table(row_count).update([
    'int10K=(ii % 10 == 0) ? null : ((int)(ii % 10000))',
    'short10K=(ii % 10 == 0) ? null : ((short)(ii % 10000))'
])
print_heap()
print('Partition By 1 Int Column')
source = source.partition_by(['int10K'])
print_heap()
print('S3 Write Partitioned')
write_partitioned(
    source, 's3://data/source.ptr.parquet', special_instructions=s3.S3Instructions(
        region_name='aws-global', endpoint_override='http://minio:9000',
        credentials=s3.Credentials.basic('minioadmin', 'minioadmin'),
        connection_timeout='PT20S'
    )
)
print_heap()
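For context on why key cardinality dominates, here is a quick back-of-the-envelope check (plain Python, no Deephaven required): with the `int10K` formula above, every non-null value in the first 10,000 rows is unique, so `partition_by` produces close to one constituent table per row.

```python
# Count the distinct non-null partition key values produced by
# 'int10K=(ii % 10 == 0) ? null : ((int)(ii % 10000))'
def distinct_keys(row_count: int) -> int:
    # Every 10th row is null; all remaining ii < 10_000 are distinct mod 10_000.
    return len({ii % 10_000 for ii in range(row_count) if ii % 10 != 0})

for row_count in (1_000, 2_000):
    print(row_count, 'rows ->', distinct_keys(row_count), 'partitions')
# 1000 rows -> 900 partitions
# 2000 rows -> 1800 partitions
```

So doubling `row_count` from 1,000 to 2,000 doubles the partition count (and the number of Parquet files / S3 uploads), which lines up with the crash appearing at 2,000 rows rather than at some larger data volume.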
@stanbrub stanbrub added bug Something isn't working triage labels Nov 15, 2024
@rcaudy rcaudy added core Core development tasks parquet Related to the Parquet integration s3 and removed triage labels Nov 15, 2024
@rcaudy rcaudy added this to the 0.38.0 milestone Nov 15, 2024