Add IpcSchemaEncoder, deprecate ipc schema functions, Fix IPC not respecting not preserving dict ID #6444
Sorry for all the force pushing, all linters and formatting should be happy now.
very nice. thanks!
Thank you @brancz and @thinkharderdev
I think as written this PR can't be merged until we have the next major release (e.g. in December 2024): https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes
However, I have a proposal for getting it in sooner.
Specifically, I suggest we:
- Create a new struct for serializing schemas (perhaps `IpcSchemaConverter`?) that does the right thing with respect to schema dictionary ids
- Deprecate (but don't change the signatures of) `schema_to_fb` and `schema_to_fb_offset`
- Update code in this repo to use the new struct
This would have the following benefits:
- It would avoid a breaking API change and get the fix in sooner
- It would be a nicer API in general than having to call several independent functions
- It would be easier to change / extend in the future
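The proposal above can be sketched in miniature. This is only a toy illustration of the pattern (a new encoder struct plus a deprecated free function that keeps its old signature), not the arrow-ipc code: `ToySchema`, `ToySchemaEncoder`, `with_preserve_dict_id`, and `toy_schema_to_string` are all hypothetical names, and the "encoding" is just a string so the example stays self-contained.

```rust
/// Hypothetical stand-in for a schema; this is NOT `arrow_schema::Schema`.
pub struct ToySchema {
    pub field_names: Vec<String>,
}

/// Hypothetical encoder struct. Options are set through builder methods,
/// so new options can be added later without changing existing signatures.
#[derive(Default)]
pub struct ToySchemaEncoder {
    preserve_dict_id: bool,
}

impl ToySchemaEncoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Assumed option name, chosen for illustration only.
    pub fn with_preserve_dict_id(mut self, preserve: bool) -> Self {
        self.preserve_dict_id = preserve;
        self
    }

    /// Toy "encoding": a string instead of IPC bytes.
    pub fn encode(&self, schema: &ToySchema) -> String {
        format!(
            "fields={};preserve_dict_id={}",
            schema.field_names.join(","),
            self.preserve_dict_id
        )
    }
}

/// The old free function keeps its signature and delegates to the new
/// struct, so existing callers still compile (with a deprecation warning).
#[deprecated(note = "use ToySchemaEncoder instead")]
pub fn toy_schema_to_string(schema: &ToySchema) -> String {
    ToySchemaEncoder::new().encode(schema)
}
```

Because the deprecated function only delegates, its signature never changes, which is what would keep a change like this out of breaking-change territory.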
@@ -62,12 +66,13 @@ pub fn metadata_to_fb<'a>(

pub fn schema_to_fb_offset<'a>(
This is also a public API: https://docs.rs/arrow-ipc/latest/arrow_ipc/convert/fn.schema_to_fb_offset.html
arrow-ipc/src/convert.rs
use crate::{size_prefixed_root_as_message, KeyValue, Message, CONTINUATION_MARKER};
use DataType::*;

/// Serialize a schema in IPC format
pub fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
pub fn schema_to_fb<'a>(
Unfortunately, this appears to be a public API: https://docs.rs/arrow-ipc/latest/arrow_ipc/convert/fn.schema_to_fb.html
Thus, if we make this change, it is a breaking API change.
Makes sense! I wasn't paying attention to that and just blindly assumed that, because it was such low-level functionality, it was not a public API. I'll rework it the way you suggested!
Done! I don't love the
This further decouples the dictionary IDs that end up in IPC from the schema: the dictionary tracker now always gathers the dict ID for each field first, whether it is pre-defined and preserved or not. When the IPC bytes are actually written, the dictionary ID is always taken from the dictionary tracker, as opposed to falling back to the `Field` of the `Schema`.
When dictionary IDs are not preserved, they are assigned depth first. However, when writing the IPC bytes, they were previously read from the dictionary tracker in the order in which the schema is traversed (first come, first served), which caused the dictionaries to be serialized in IPC in an incorrect order.
How can I run the integration tests locally? I'm a bit confused why they are now failing when they were previously passing.
CI failure is unrelated: #6448
Thank you (again) @brancz -- this looks really nice
I spent some time playing with the docs and API and have a suggestion for refinement: polarsignals#1
Let me know what you think
Refine IpcSchemaEncoder API and docs
Those changes are perfect, thank you, merged them!
There were a few more lints to fix. I think this is now ready!
Thanks again @brancz and @thinkharderdev
Thanks so much for all the help and reviews @alamb!
We updated this PR so that it is no longer an API change, so I am removing the label.
Which issue does this PR close?
Closes #6443
Rationale for this change
It fixes a bug: IPC did not respect the option of not preserving dictionary IDs.
What changes are included in this PR?
This further decouples the dictionary IDs that end up in IPC from the schema:
the dictionary tracker now always gathers the dict ID for each field first,
whether it is pre-defined and preserved or not.
When the IPC bytes are actually written, the dictionary ID is always
taken from the dictionary tracker, as opposed to falling back to the
`Field` of the `Schema`.
On top of that, when dictionary IDs are not preserved, they are assigned depth
first. However, when writing the IPC bytes, they were previously read from
the dictionary tracker in the order in which the schema is traversed, which
caused the dictionaries to be serialized in IPC in an incorrect order.
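To make the ordering problem concrete, here is a minimal toy model of it. It is not the arrow-rs implementation: `ToyField` and every function below are hypothetical, and the sketch assumes that "depth first" means children receive their IDs before their parent, while "the order that the schema is traversed" means parent before children.

```rust
/// Toy stand-in for a schema field; this is NOT the arrow-rs `Field` type.
struct ToyField {
    name: &'static str,
    is_dictionary: bool,
    children: Vec<ToyField>,
}

/// A dictionary field that contains another dictionary field.
fn toy_schema() -> ToyField {
    ToyField {
        name: "outer",
        is_dictionary: true,
        children: vec![ToyField {
            name: "inner",
            is_dictionary: true,
            children: vec![],
        }],
    }
}

/// Assign sequential ids depth first: children are visited before their parent.
fn assign_depth_first(field: &ToyField, ids: &mut Vec<(&'static str, i64)>) {
    for child in &field.children {
        assign_depth_first(child, ids);
    }
    if field.is_dictionary {
        let next = ids.len() as i64;
        ids.push((field.name, next));
    }
}

/// Hand the stored ids back out in schema order (parent before children),
/// first come, first served. This mirrors the buggy read order described above.
fn read_in_schema_order(
    field: &ToyField,
    stored: &[i64],
    cursor: &mut usize,
    out: &mut Vec<(&'static str, i64)>,
) {
    if field.is_dictionary {
        out.push((field.name, stored[*cursor]));
        *cursor += 1;
    }
    for child in &field.children {
        read_in_schema_order(child, stored, cursor, out);
    }
}

/// The ids each field was actually assigned.
fn assigned_ids(root: &ToyField) -> Vec<(&'static str, i64)> {
    let mut ids = Vec::new();
    assign_depth_first(root, &mut ids);
    ids
}

/// The ids each field would be written with under the mismatched read order.
fn ids_read_in_schema_order(root: &ToyField) -> Vec<(&'static str, i64)> {
    let stored: Vec<i64> = assigned_ids(root).iter().map(|(_, id)| *id).collect();
    let mut cursor = 0;
    let mut out = Vec::new();
    read_in_schema_order(root, &stored, &mut cursor, &mut out);
    out
}
```

In this toy model `inner` is assigned ID 0 and `outer` ID 1, but reading in schema order hands ID 0 to `outer` and ID 1 to `inner`, so each field ends up paired with the other's dictionary ID.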
Are there any user-facing changes?
No API changes, just bug fixes.
@alamb @tustvold