Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][Spark] INSERT INTO struct evolution in map/arrays breaks when a column is renamed #3227

Open
3 tasks
johanl-db opened this issue Jun 6, 2024 · 12 comments · May be fixed by #3886
Open
3 tasks

[BUG][Spark] INSERT INTO struct evolution in map/arrays breaks when a column is renamed #3227

johanl-db opened this issue Jun 6, 2024 · 12 comments · May be fixed by #3886
Assignees
Labels
bug Something isn't working good medium issue Good for those with Delta Lake experience

Comments

@johanl-db
Copy link
Collaborator

Bug

Describe the problem

Schema evolution in INSERT doesn't always work properly when the new column is added to a struct nested within an array or map. If another column is renamed, the operation fails when it should succeed.

Steps to reproduce

For example with a map, in python, renaming column key to renamed_key and added a field comment in the a struct inside the map:

(
  spark.createDataFrame([[1, {"event": (1, 1)}]], "key int, metrics map<string, struct<id: int, value: int>>")
    .write
    .option("overwriteSchema", "true")
    .format("delta")
    .saveAsTable("insert_map_schema_evolution")
)
(
  spark.createDataFrame([[1, {"event": (1, 1, "deprecated")}]], "renamed_key int, metrics map<string, struct<id: int, value: int, comment: string>>")
    .write
    .mode("append")
    .option("mergeSchema", "true")
    .format("delta")
    .insertInto("insert_map_schema_evolution")
)
[[DATATYPE_MISMATCH.CAST_WITHOUT_SUGGESTION](https://docs.databricks.com/error-messages/error-classes.html#datatype_mismatch.cast_without_suggestion)] Cannot resolve "metrics" due to data type mismatch: cannot cast "MAP<STRING, STRUCT<id: INT, value: INT, comment: STRING>>" to "MAP<STRING, STRUCT<id: INT, value: INT>>". 

Note that the struct inside the map isn't evolved to add the new field. Without the unrelated column being renamed, this works well:

(
  spark.createDataFrame([[1, {"event": (1, 1)}]], "key int, metrics map<string, struct<id: int, value: int>>")
    .write
    .option("overwriteSchema", "true")
    .format("delta")
    .saveAsTable("insert_map_schema_evolution")
)
(
  spark.createDataFrame([[1, {"event": (1, 1, "deprecated")}]], "key int, metrics map<string, struct<id: int, value: int, comment: string>>")
    .write
    .mode("append")
    .option("mergeSchema", "true")
    .format("delta")
    .insertInto("johan_lasperas.playground.insert_renamed_array_map")
)

Observed results

The operation fails.

Expected results

The operation succeeds, the table schema is changed to key int, metrics map<string, struct<id: int, value: int, comment: string>> and the data is inserted.

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
@johanl-db johanl-db added bug Something isn't working good first issue Good for newcomers good medium issue Good for those with Delta Lake experience and removed good first issue Good for newcomers labels Jun 6, 2024
@Richard-code-gig
Copy link

Hello @johanl-db,
Could you please assign this issue to me? I am confident I can help resolve it and would appreciate any guidance as needed.

@johanl-db
Copy link
Collaborator Author

@Richard-code-gig thanks for volunteering for this.
I would start by adding a test in e.g. deltaInsertIntoSuite to cover the problematic case and work from there.

I believe this only impacts INSERT by position:

  • SQL: INSERT INTO <table> VALUES (...) or `INSERT INTO SELECT ...
  • Scala/Python: df.write.insertInto()
  • INSERT by name (INSERT INTO <table> (columns) VALUES (...) / df.write.save() / df.write.saveAsTable()) is probably not affected - or better said, it doesn't really properly support schema enforcement / evolution already so supporting this specific case is irrelevant for now.

    Once you have a test, head over to DeltaAnalysis.resolveQueryColumnsByOrdinal where the logic aligning schema for INSERT by position is. I believe the issue is that addCastToColumn doesn't properly support schema evolution in maps one way or another

@Richard-code-gig
Copy link

Okay thanks. I am picking it up now.

@Pshak-20000
Copy link

I hope this message finds you well. I have reviewed the issue regarding schema evolution in Delta Lake, particularly the problem that arises when renaming a column while trying to insert new data into a table with nested structures.

I believe I can contribute a fix for this bug independently. Before I get started, I wanted to check if there are any specific guidelines or preferred practices for contributing to the Delta Lake codebase that I should be aware of.

I am excited about the opportunity to contribute to the project and help improve the functionality.

@johanl-db
Copy link
Collaborator Author

@Pshak-20000 You can take a look at https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md for general information about contributions to this project.

You may want to start by checking with @Richard-code-gig to see if he's still working on this issue or not.
The typical process is then to open a pull request against this repository with a proposed fix + tests, I'm happy to provide some guidance and support with code reviews

@Richard-code-gig
Copy link

Hello @johanl-db, @Pshak-20000 I am working on it and almost done. Johanl I need your help on something minor to complete the assignment, please let me know how I can reach you. Thanks

@Pshak-20000
Copy link

You can reach me here or through my email ([email protected]) if that’s more convenient. Let me know what you need help with!!!. Looking forward to hearing from you.

@johanl-db
Copy link
Collaborator Author

Hello @johanl-db, @Pshak-20000 I am working on it and almost done. Johanl I need your help on something minor to complete the assignment, please let me know how I can reach you. Thanks

You can join the Delta slack workspace and reach me there: https://go.delta.io/slack
Or you can open a draft PR and we can have a discussion on the PR directly.

@Richard-code-gig
Copy link

Hi,
Just wanted to give an update!
I’ve successfully managed to get it working and am now in the process of writing the test cases and ensure backward compatibilities in line with the project guidelines. I expect to submit a PR sometime this week.
I initially thought it would be a simpler task, but it's taken a bit more effort than I anticipated.

@johanl-db
Copy link
Collaborator Author

@Richard-code-gig If you want some support/feedback from my side, you can open a PR even if it's not entirely finalized yet and we can discuss there

Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 16, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

Signed-off-by: Sola Richard Olorunfemi <[email protected]>
Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 17, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

Signed-off-by: Sola Richard Olorunfemi <[email protected]>
Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 17, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

Signed-off-by: Sola Richard Olorunfemi <[email protected]>

fix!:renamed the added DeltaWriteExample to EvolutionWithMap
@Richard-code-gig
Copy link

@Richard-code-gig If you want some support/feedback from my side, you can open a PR even if it's not entirely finalized yet and we can discuss there

Hi Johanl-db,
I have submitted a PR

@Richard-code-gig
Copy link

Richard-code-gig commented Nov 18, 2024

Hi @johanl-db,
The PR is now successful. The 2 failing checks and the 1 dependent canceled check are caused by an unrelated issue, as observed in this commit and waiting to be addressed in this PR.

Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 18, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

fix!:renamed the added DeltaWriteExample to EvolutionW
ithMap

fix!: Modified TypeWideningInsertSchemaEvolutionSuite to accommodate that schema evolution is now allowed for maps

Signed-off-by: Sola Richard Olorunfemi <[email protected]>
Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 22, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

fix!:renamed the added DeltaWriteExample to EvolutionW
ithMap

fix!: Modified TypeWideningInsertSchemaEvolutionSuite to accommodate that schema evolution is now allowed for maps

Signed-off-by: Sola Richard Olorunfemi <[email protected]>

fix!: addCastToMap to handle complex types. Added tests to cover new abilities
Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 22, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

fix!:renamed the added DeltaWriteExample to EvolutionW
ithMap

fix!: Modified TypeWideningInsertSchemaEvolutionSuite to accommodate that schema evolution is now allowed for maps

Signed-off-by: Sola Richard Olorunfemi <[email protected]>

fix!: addCastToMap to handle complex types. Added tests to cover new abilities

fix: resolved scalaStyle error
Richard-code-gig pushed a commit to Richard-code-gig/delta that referenced this issue Nov 23, 2024
…mn renaming

Resolved the issue described in [Bug delta-io#3227](delta-io#3227) where adding a field inside a struct (nested within a map) while renaming a top column caused the operation to fail.

The fix focuses on handling schema changes without affecting the integrity of existing data structures, specifically avoiding issues with nested fields within a map and renamed columns.

fix!:renamed the added DeltaWriteExample to EvolutionW
ithMap

fix!: Modified TypeWideningInsertSchemaEvolutionSuite to accommodate that schema evolution is now allowed for maps

Signed-off-by: Sola Richard Olorunfemi <[email protected]>

fix!: addCastToMap to handle complex types. Added tests to cover new abilities

fix: resolved scalaStyle error

fix: yet another scalaStyle issue

fix!:made some schema evolution for maps tests in DeltaInsertIntoTableSuite more flexible

fix: DeltaAnalysis
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good medium issue Good for those with Delta Lake experience
Projects
None yet
3 participants