
OPTIMIZE on Delta table does not encode correct URLs for DayHour partitions #3892

Open · 1 of 5 tasks
gprashmi opened this issue Nov 19, 2024 · 2 comments
Labels: bug (Something isn't working)
gprashmi commented Nov 19, 2024

Bug

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Describe the problem

We write data to a Delta table using delta-rs with the PyArrow engine, partitioned by a DayHour column.

import deltalake

deltalake.write_deltalake(
    table_or_uri=delta_table_path,
    data=df,
    partition_by=[dayhour_partition_column],
    schema_mode='overwrite',
    mode="append",
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)

I then ran the OPTIMIZE command on the table using the Spark SQL query below:

optimize_query = f"""
OPTIMIZE delta.`{s3_table_path}`
ZORDER BY (col1, col2)
"""
spark.sql(optimize_query)

After OPTIMIZE, the newly written .zstd.parquet files land under partition directories whose names contain literal spaces; that is, the partition URLs are not properly encoded, as shown in the image below.

[image: S3 listing of new partition paths containing unencoded spaces]
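To make the mismatch concrete, here is a minimal sketch of the two directory-name styles using only standard-library URL quoting (the exact escape set each writer uses is an assumption on my part):

from urllib.parse import quote

value = "2024-10-09 19:00:00"

# Percent-encoded directory name (what the delta-rs writer appears to produce):
print(f"dayhour={quote(value)}")   # dayhour=2024-10-09%2019%3A00%3A00
# Raw directory name (what the post-OPTIMIZE files appear to use):
print(f"dayhour={value}")          # dayhour=2024-10-09 19:00:00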

Steps to reproduce

Sample code to reproduce

import deltalake
import pandas as pd
import pyarrow as pa

delta_table_path = "s3://<bucket>/<table>"  # placeholder

# Dummy data
initial_data = {
    'dayhour': ['2024-10-09 19:00:00', '2024-10-10 20:00:00'],
    'value1': [10, 20],
    'value2': [1.5, 2.5]
}
initial_df = pd.DataFrame(initial_data)

initial_df['dayhour'] = pd.to_datetime(initial_df['dayhour'])

# Define the schema for the Delta Lake table
schema = pa.schema([
    pa.field('dayhour', pa.timestamp('us')),
    pa.field('value1', pa.int32()),
    pa.field('value2', pa.float32())
])

# Initialize the Delta table with the schema
deltalake.write_deltalake(
    table_or_uri=delta_table_path,
    data=initial_df,
    schema=schema,
    partition_by=['dayhour'],
    schema_mode='overwrite',
    mode="overwrite",
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)

optimize_query = f"""
OPTIMIZE delta.`{delta_table_path}`
"""
spark.sql(optimize_query)

vacuum_query = f"""
VACUUM delta.`{delta_table_path}`
RETAIN 168 HOURS
"""
spark.sql(vacuum_query)
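One way to observe the resulting paths without the S3 console is to list the file paths recorded in the Delta log with delta-rs; a minimal sketch (DeltaTable.files() returns the paths from the current snapshot, and delta_table_path is the placeholder from above):

import deltalake

# Running this before and after OPTIMIZE shows which path
# encoding each writer produced for the dayhour partition.
dt = deltalake.DeltaTable(delta_table_path)
for path in dt.files():
    print(path)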

After OPTIMIZE, the partition paths contained unencoded spaces, as shown below:
Before OPTIMIZE:
[image: S3 partition listing with percent-encoded partition directory names]

After OPTIMIZE:
[image: S3 partition listing with literal spaces in the partition directory names]
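For comparison (not a fix for the Spark-side behavior), the same z-order compaction can be run through delta-rs's own optimizer, which writes files with delta-rs's path encoding; a minimal sketch using the placeholders from above:

import deltalake

dt = deltalake.DeltaTable(
    delta_table_path,
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)
# delta-rs's optimizer; z_order takes the list of columns to cluster by
dt.optimize.z_order(["col1", "col2"])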

Environment information

Delta-rs version: 0.21.0

gprashmi added the bug label Nov 19, 2024

gprashmi (Author) commented:
A similar issue was raised on the delta-rs repo (delta-io/delta-rs#2978); however, none of the solutions suggested there resolved this issue, hence opening it here.

gprashmi (Author) commented:

@thomasfrederikhoeck
