Demonstrate a bug for schema "validation" when writing JSON to a table #131
Conversation
I stumbled into this while pilfering code from kafka-delta-ingest for another project and discovered that the check in `write_values` which does `record_batch.schema() != arrow_schema` doesn't do what we think it does. Basically, if `Decoder` "works", the schema it returns is just the schema passed into it. It has no bearing on whether the JSON actually has the same schema. Don't ask me why.

Using the reader's `infer_json_schema_*` functions can provide a `Schema` that is useful for comparison:

```rust
let mut value_iter = json_buffer.iter().map(|j| Ok(j.to_owned()));
let json_schema =
    infer_json_schema_from_iterator(value_iter.clone()).expect("Failed to infer!");
let decoder = Decoder::new(Arc::new(json_schema), options);

if let Some(batch) = decoder
    .next_batch(&mut value_iter)
    .expect("Failed to create RecordBatch")
{
    assert_eq!(batch.schema(), arrow_schema_ref, "Schemas don't match!");
}
```

What's even more interesting is that after a certain number of fields are removed, the `Decoder` no longer pretends it can decode the JSON. I am baffled as to why.
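For reference, here is a minimal, self-contained sketch of the pitfall (assuming the arrow-rs `Decoder`/`DecoderOptions` API in use at the time; the two-field schema is made up for illustration):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::reader::{Decoder, DecoderOptions};
use serde_json::json;

fn main() {
    // The table schema expects two nullable fields.
    let arrow_schema_ref = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("name", DataType::Utf8, true),
    ]));

    // JSON that is missing the `name` field entirely.
    let rows = vec![json!({"id": 1})];
    let mut value_iter = rows.iter().map(|v| Ok(v.clone()));

    let decoder = Decoder::new(arrow_schema_ref.clone(), DecoderOptions::default());
    let batch = decoder
        .next_batch(&mut value_iter)
        .expect("decode failed")
        .expect("no batch");

    // This always passes: the batch reports back the schema it was handed,
    // regardless of what the JSON actually contained.
    assert_eq!(batch.schema(), arrow_schema_ref);
}
```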
I discussed this at length today with @xianwill privately. I will attempt to summarize our verbose conversation 😄

Basically this does look like a bug and a half, and maybe some unclear behavior. The schemas being used have nullable fields for everything. What is interesting is that a nullable "primitive type" can be omitted from the JSON entirely and the `Decoder` will still produce a batch, silently null-filling the missing column. To that end, there is in fact no real check occurring that the schema of the JSON lines up with the schema of the Delta table at the moment 🐛

From @xianwill:
> The disconnect between the schema in the parquet files and the schema of the Delta table doesn't make it easy to decide what the appropriate behavior for kafka-delta-ingest should be here. 🙀
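Put differently, the only way to notice the drift is to infer a schema from the JSON itself and compare it to the table's schema. A self-contained sketch of that check (the helper name and the two-field schema are hypothetical, assuming the same arrow-rs API as above):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::json::reader::infer_json_schema_from_iterator;
use serde_json::{json, Value};

/// Hypothetical helper: infer a schema from the buffered JSON values and
/// compare it to the table's Arrow schema, since the Decoder won't do this.
fn json_matches_table_schema(values: &[Value], table_schema: &SchemaRef) -> bool {
    match infer_json_schema_from_iterator(values.iter().map(|v| Ok(v.clone()))) {
        Ok(inferred) => &inferred == table_schema.as_ref(),
        Err(_) => false,
    }
}

fn main() {
    let table_schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("name", DataType::Utf8, true),
    ]));

    // A nullable primitive (`name`) is omitted from the JSON: the Decoder
    // would null-fill it and report success, but inference reveals the drift.
    assert!(!json_matches_table_schema(&[json!({"id": 1})], &table_schema));
}
```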
This commit introduces some interplay between the IngestProcessor and DataWriter, the latter of which needs to keep track of whether or not it has a changed schema. What should be done with that changed schema must necessarily live in IngestProcessor, since that is what performs the Delta transaction commits at the tail end of batch processing. There are some potential mismatches between the schema in storage and what the DataWriter has, so this change re-runs the run loop if the current schema and the evolved schema are incompatible (a rough sketch of this control flow follows below).

Closes #131

Sponsored-by: Raft LLC
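A rough sketch of that interplay (all of the names here, `BatchOutcome`, `process_batch`, `rebuild_writer`, are hypothetical stand-ins, not the crate's real API):

```rust
/// Hypothetical outcome of processing one batch of messages.
enum BatchOutcome {
    Committed,
    /// The writer's evolved schema no longer matches the table in storage.
    SchemaChanged,
}

struct DataWriter {
    schema_changed: bool,
}

struct IngestProcessor {
    writer: DataWriter,
}

impl IngestProcessor {
    /// Consume one batch of messages; the Delta commit happens at the tail
    /// end, so this is the only place a schema change can be acted upon.
    fn process_batch(&mut self) -> BatchOutcome {
        // ... write buffered JSON via self.writer ...
        if self.writer.schema_changed {
            return BatchOutcome::SchemaChanged;
        }
        // ... perform the Delta transaction commit here ...
        BatchOutcome::Committed
    }

    /// Rebuild the writer against the latest table state.
    fn rebuild_writer(&mut self) {
        self.writer = DataWriter { schema_changed: false };
    }
}

fn run_loop(processor: &mut IngestProcessor) {
    loop {
        match processor.process_batch() {
            // Incompatible evolved schema: rebuild and run the loop again.
            BatchOutcome::SchemaChanged => processor.rebuild_writer(),
            BatchOutcome::Committed => break,
        }
    }
}
```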
The current failure from this test is: