[BUG][Spark] delta-spark allows reading column mapping when missing from table features #3890

Open · zachschuermann opened this issue Nov 19, 2024 · 0 comments
Labels: bug (Something isn't working)
Bug

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Describe the problem

TL;DR: you can relatively easily create a table that, according to the protocol, shouldn't allow column mapping, but delta-spark still reads it with column mapping.

I think there are two pieces to this issue:

  1. [bug] delta-spark uses column mapping to read a table whose reader features do not include column mapping.
  2. [API sharp edge?] Delta's upgradeTableProtocol will upgrade from reader version 2 to reader version 3 without adding any table features. This is a problem because it effectively and silently turns off column mapping: column mapping is enabled/supported at reader version 2, but at reader version 3 it requires the columnMapping table feature to be present. See the sketch after this list.
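
For reference, a minimal sketch for inspecting what the upgrade actually writes to the log (latest_protocol is a hypothetical helper; delta_path is the table root from the repro below):

import json
from pathlib import Path

# Hypothetical helper: scan the commit files in _delta_log and return the
# most recent protocol action, i.e. what upgradeTableProtocol last wrote.
def latest_protocol(table_root: str) -> dict:
    protocol = None
    for commit in sorted(Path(table_root, "_delta_log").glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "protocol" in action:
                protocol = action["protocol"]
    return protocol

print(latest_protocol(delta_path))
# Per this report, after upgradeTableProtocol(3, 7) the protocol has
# minReaderVersion 3 and no "readerFeatures" key at all, e.g.:
#   {"minReaderVersion": 3, "minWriterVersion": 7,
#    "writerFeatures": ["columnMapping", "icebergCompatV1"]}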

Steps to reproduce

See the example below for code implementing these steps:

  1. The table is created with reader version 2 and writer version 7, with "writerFeatures":["columnMapping","icebergCompatV1"] and delta.columnMapping.mode = name.
  2. Then upgradeTableProtocol(3, 7) yields reader version 3 with no reader features, which effectively turns off column mapping.
  3. When reading the table, it is nevertheless read with columnMapping = name.
# Using PySpark. get_sample_data and case come from the reporter's test
# harness; any DataFrame and an empty table directory will do.
from pathlib import Path
from delta.tables import DeltaTable

df = get_sample_data(spark)
delta_path = str(Path(case.delta_root).absolute())
# Create the table at version 0. delta.enableIcebergCompatV1 implies column
# mapping, so the new table has reader version 2, writer version 7, and
# delta.columnMapping.mode = name.
delta_table: DeltaTable = (
    DeltaTable.create(spark)
    .location(delta_path)
    .addColumns(df.schema)
    .property("delta.enableIcebergCompatV1", "true")
    .execute()
)
# Upgrade to reader version 3 / writer version 7; this writes a protocol
# action with no readerFeatures.
delta_table.upgradeTableProtocol(3, 7)
df.repartition(1).write.format("delta").mode("append").save(case.delta_root)
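
To observe the behavior, read the table back in the same session; a minimal check, assuming spark and delta_path from above:

# Per the protocol, a reader-version-3 table whose readerFeatures do not
# include columnMapping should not be read with column mapping, yet
# delta-spark resolves the logical (mapped) column names without complaint.
df_read = spark.read.format("delta").load(delta_path)
df_read.printSchema()
df_read.show()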

Observed results

The table is read with column mapping (columnMapping = name).

Expected results

The table should not be read with column mapping, since its reader-version-3 protocol does not list the columnMapping reader feature.

Further details

Environment information

  • Delta Lake version: 3.2.1
  • Spark version: 3.5?
  • Scala version:

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.