`RecordBatch` normalization (flattening) #6758
… iterative function for `RecordBatch`. Not sure which one is better currently.
I had some questions regarding the implementation of this, since the one example from PyArrow doesn't seem to clarify the edge cases here. Normalizing the Schema seems fairly straightforward to me, I'm just not sure on:
- Whether the iterative or recursive approach is better (or something I missed)
- If `DataType::Struct` is the only `DataType` that requires flattening. To me, it looks like that's the only one that can contain nested `Field`s. (I'm also not sure if I'm missing something with unwrapping something like a `List<Struct>`)
Any feedback/help would be appreciated!
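To make the two questions above concrete, here is a minimal sketch on toy types. `ToyType`, `ToyField`, `field` and `sample_schema` are hypothetical stand-ins, not the arrow-rs `DataType`/`Field`, and the depth semantics (a struct at depth `d` is expanded only while `d < max_level`, with `0` meaning no limit) are an assumption based on the doc comment in the diff. It only shows that a recursive and an iterative traversal can produce the same flattened names, and that a struct nested inside a list is left alone if only `Struct` is expanded.

```rust
/// Toy stand-ins for Arrow's `DataType` and `Field` (hypothetical, std only).
#[derive(Debug, Clone)]
enum ToyType {
    Int,
    List(Box<ToyField>),
    Struct(Vec<ToyField>),
}

#[derive(Debug, Clone)]
struct ToyField {
    name: String,
    data_type: ToyType,
}

fn field(name: &str, data_type: ToyType) -> ToyField {
    ToyField { name: name.to_string(), data_type }
}

/// Recursive variant: descend into structs while `depth < limit`.
/// `max_level == 0` is treated as "no limit" (assumed from the doc comment).
fn flatten_recursive(fields: &[ToyField], separator: &str, max_level: usize) -> Vec<String> {
    fn go(f: &ToyField, prefix: &str, sep: &str, depth: usize, limit: usize, out: &mut Vec<String>) {
        let name = if prefix.is_empty() {
            f.name.clone()
        } else {
            format!("{}{}{}", prefix, sep, f.name)
        };
        match &f.data_type {
            ToyType::Struct(children) if depth < limit => {
                for c in children {
                    go(c, &name, sep, depth + 1, limit, out);
                }
            }
            // Leaves, lists (even lists of structs) and too-deep structs are kept as-is.
            _ => out.push(name),
        }
    }
    let limit = if max_level == 0 { usize::MAX } else { max_level };
    let mut out = Vec::new();
    for f in fields {
        go(f, "", separator, 0, limit, &mut out);
    }
    out
}

/// Iterative variant: an explicit stack, filled in reverse so `pop()` keeps the original order.
fn flatten_iterative(fields: &[ToyField], separator: &str, max_level: usize) -> Vec<String> {
    let limit = if max_level == 0 { usize::MAX } else { max_level };
    let mut stack: Vec<(usize, &ToyField, String)> =
        fields.iter().rev().map(|f| (0, f, f.name.clone())).collect();
    let mut out = Vec::new();
    while let Some((depth, f, name)) = stack.pop() {
        match &f.data_type {
            ToyType::Struct(children) if depth < limit => {
                for c in children.iter().rev() {
                    stack.push((depth + 1, c, format!("{}{}{}", name, separator, c.name)));
                }
            }
            _ => out.push(name),
        }
    }
    out
}

/// a: Struct{b, c: Struct{d}}, e: List<Struct{f}>, g
fn sample_schema() -> Vec<ToyField> {
    vec![
        field("a", ToyType::Struct(vec![
            field("b", ToyType::Int),
            field("c", ToyType::Struct(vec![field("d", ToyType::Int)])),
        ])),
        field("e", ToyType::List(Box::new(field(
            "item",
            ToyType::Struct(vec![field("f", ToyType::Int)]),
        )))),
        field("g", ToyType::Int),
    ]
}
```

In this toy model the two variants agree; the iterative one avoids deep call stacks on very nested schemas, at the cost of managing the stack by hand.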
@kszlim can you please help review this PR? You requested the feature and we are currently quite short on review capacity in arrow-rs
arrow-array/src/record_batch.rs
Outdated
@@ -394,6 +396,56 @@ impl RecordBatch {
        )
    }

    /// Normalize a semi-structured [`RecordBatch`] into a flat table.
    ///
    /// If max_level is 0, normalizes all levels.
Can you please improve this documentation (maybe copy from the pyarrow version)?
- Document what `max_level` means (in addition to that 0)
- Document what `separator` does
- Provide an example of flattening a record batch as a doc example?
For example like https://docs.rs/arrow/latest/arrow/index.html#columnar-format
Ah, missed doing this, will do!
arrow-schema/src/schema.rs
Outdated
@@ -413,6 +413,81 @@ impl Schema {
        &self.metadata
    }

    /// Returns a new schema, normalized based on the max_level
    /// This carries metadata from the parent schema over as well
Likewise, please document the parameters to this function and add a documentation example
Sounds good, thanks!
I'll take a look, though please feel free to disregard anything I say and especially defer to the maintainers.
arrow-array/src/record_batch.rs
Outdated
DataType::Struct(ff) => {
    // Need to zip these in reverse to maintain original order
    for (cff, fff) in c.as_struct().columns().iter().zip(ff.into_iter()).rev() {
        let new_key = format!("{}{}{}", f.name(), separator, fff.name());
Not sure if there's a better way to structure it, but is there a way to keep the field name parts in a `Vec` and create the flattened fields at the end? That allows you to avoid the repeated `format!` in a deeply nested schema.
Might not be worth the trouble though.
I think this is a good point, this is definitely not my favorite way to do this. I'll have to do some testing and think about it some more, but it may be better to construct the queue with the components of the `Field`, then go through and construct all of the `Field`s at the very end.
Added a (hopefully) better approach for this that concatenates the `Vec<&str>` when the field is done being processed.
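As a rough illustration of the idea discussed in this thread (not the code that was merged), the sketch below carries the name parts as borrowed `&str` slices and only allocates the final `String` once, when a leaf is reached. `child_path` and `flattened_name` are hypothetical helper names introduced here for illustration.

```rust
/// Extend a parent path with one more (borrowed) name part.
/// Hypothetical helper: only the Vec of pointers is copied, not the strings.
fn child_path<'a>(parent: &[&'a str], child: &'a str) -> Vec<&'a str> {
    let mut path = Vec::with_capacity(parent.len() + 1);
    path.extend_from_slice(parent);
    path.push(child);
    path
}

/// Build the flattened column name once, when the field is done being processed.
fn flattened_name(parts: &[&str], separator: &str) -> String {
    parts.join(separator)
}
```

Compared with calling `format!` at every nesting level, this builds one `String` per output column instead of one per intermediate level, though each level still allocates a small `Vec` for the path.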
No problem at all, it's the holiday season! Hope everyone's taking a good break. Appreciate the feedback though! I'll get to work on it :)
Sorry for the delays on this one, made changes based on the feedback, would appreciate another look! Hopefully the new documentation is more clear.
Some potential simplifications
pub fn normalize(&self, separator: &str, mut max_level: usize) -> Result<Self, ArrowError> {
    if max_level == 0 {
        max_level = usize::MAX;
    }
- pub fn normalize(&self, separator: &str, mut max_level: usize) -> Result<Self, ArrowError> {
-     if max_level == 0 {
-         max_level = usize::MAX;
-     }
+ pub fn normalize(&self, separator: &str, max_level: Option<usize>) -> Result<Self, ArrowError> {
+     let max_level = max_level.unwrap_or(usize::MAX);
imo this seems the more Rusty way, making use of Option instead of a sentinel value (though I'm not sure if `Some(0)` is a valid input?)
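The difference between the two signatures can be sketched in isolation. The function names below are hypothetical and only compute the effective depth limit; how `Some(0)` should behave is left open in the thread, so this sketch simply passes it through unchanged.

```rust
/// Sentinel style from the diff: 0 is remapped to "no limit".
fn effective_limit_sentinel(max_level: usize) -> usize {
    if max_level == 0 { usize::MAX } else { max_level }
}

/// Option style from the suggestion: `None` means "no limit".
/// Assumption: `Some(0)` is passed through as a limit of zero (no flattening).
fn effective_limit_option(max_level: Option<usize>) -> usize {
    max_level.unwrap_or(usize::MAX)
}
```

With the `Option` form, "flatten everything" and "flatten nothing" become distinct inputs (`None` vs `Some(0)`), which the sentinel form cannot express.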
Okay I've been working on this a bit, I found a few possible solutions that might fit. I think Option might not be the best choice, since personally, the case of Some(0) feels weird to me, and would mean you're doing an annoying copy for no reason (because of that, I would want to add in an if statement to catch it, but then we end up in the same place).
For `RecordBatch`, this seems to fit the Rusty syntax better, but unfortunately the same solution can't be echoed over to `Schema` without an additional dependency, not sure how I feel about that.
max_level.is_zero().then(|| max_level = usize::MAX);
Another option is to use something like `NonZeroUsize`, which I just learned about. My issue with this one is that we'd then be making the normalize call more annoying, since you have to instantiate it with something like
NonZeroUsize::new(1)
This makes the normalize call potentially longer and more annoying, but it means there wouldn't be another import.
Any thoughts on these/if you disagree?
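For reference, a small sketch of what the `NonZeroUsize` option could look like from the caller's side. `effective_limit_nonzero` is a hypothetical name and this is not the signature the PR settled on. One property worth noting is that `std::num::NonZeroUsize::new` already returns an `Option`, so `new(0)` yields `None`.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: `None` means "no limit", otherwise the wrapped value is the limit.
/// A limit of zero cannot be represented at all with this type.
fn effective_limit_nonzero(max_level: Option<NonZeroUsize>) -> usize {
    max_level.map_or(usize::MAX, NonZeroUsize::get)
}
```

A call site would read `effective_limit_nonzero(NonZeroUsize::new(1))`, which is the extra verbosity raised in the comment above.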
> the case of Some(0) feels weird to me, and would mean you're doing an annoying copy for no reason (because of that, I would want to add in an if statement to catch it, but then we end up in the same place).
Personally I find this okay. I'm less concerned with requiring an if check inside the code (it's pretty simple anyway) compared to presenting a more Rust-like interface to users.
> but unfortunately the same solution can't be echoed over to `Schema` without an additional dependency, not sure how I feel about that.
I don't follow this, the `Schema` code looks almost identical to the `RecordBatch` version.
I agree with `NonZeroUsize` potentially being a bit clunky for users to use (personally wasn't aware this was part of the stdlib either).
But yeah I'm curious to see what others might think for this too.
arrow-array/src/record_batch.rs
Outdated
if max_level == 0 {
    max_level = usize::MAX;
}
let mut queue: VecDeque<(usize, &ArrayRef, Vec<&str>, &DataType, bool)> = VecDeque::new();
This queue seems to behave like a stack instead: it does push_back only when initializing the queue, but otherwise does pop_front/push_front. Would it be more intuitive to just use a Vec to more accurately indicate this is a stack?
Another note: you could remove the need for storing DataType and nullability as separate tuple fields by just storing the original `&FieldRef` and retrieving DataType and nullability from it on demand. That reduces the number of tuple fields by one, which might be worth considering since there are quite a few fields already.
This makes some sense to me, I think with that approach we would need to reverse the Vec after the initial instantiation, that way we can just use pop. I'll do a bit of testing with this one.
Ah, that's a good point on the `&FieldRef`, changed it over, thanks!
You should be able to `rev()` the iter `self.columns.iter().zip(self.schema.fields())` and then collect straight into a Vec I believe
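A std-only sketch of that suggestion, using generic slices in place of the real columns and fields (`initial_stack` is a hypothetical name, and the leading `0` is an assumed starting depth): zipping two slice iterators, reversing, and collecting gives a `Vec` whose `pop()` returns items in the original left-to-right order.

```rust
/// Build a LIFO work stack from two parallel slices.
/// The pairs are pushed in reverse so that `pop()` yields them in original order.
fn initial_stack<'a, C, F>(columns: &'a [C], fields: &'a [F]) -> Vec<(usize, &'a C, &'a F)> {
    columns
        .iter()
        .zip(fields.iter())
        .rev() // works because both slice iterators are double-ended and exact-size
        .map(|(c, f)| (0, c, f)) // 0 = starting depth (assumption)
        .collect()
}
```

This removes the need for a separate reverse step after initialization, and a plain `Vec` with `push`/`pop` makes the stack behavior explicit.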
    }

    #[test]
    fn normalize_nested() {
Perhaps have some test cases with some more complex types thrown in as well? e.g. have a ListArray with a StructArray within
(Even if to prove that the Struct within the List shouldn't be affected)
Good point, I'll work on these. I was a little hesitant since I'm not sure how many cases I need to cover (also since these tests are really annoying to instantiate), but it is a current blind spot.
…d if statements, simplified the VecDeque fields.
Appreciate the feedback, as always. Changed some bits of the code, added some responses (and some stuff to work on).
Which issue does this PR close?
Closes #6369.
Rationale for this change
Adds normalization (flattening) for `RecordBatch`, with normalization via `Schema`. Based on pandas/pola-rs.
What changes are included in this PR?
Are there any user-facing changes?