`RecordBatch` normalization (flattening) #6758

ngli-me · 2024-11-20T03:10:29Z

Which issue does this PR close?

Closes #6369.

Rationale for this change

Adds normalization (flattening) for RecordBatch, with normalization via Schema. Based on pandas/pola-rs.

What changes are included in this PR?

Are there any user-facing changes?

ngli-me

I had some questions about the implementation, since the one example from PyArrow doesn't clarify the edge cases. Normalizing the Schema seems fairly straightforward to me; I'm just not sure about:

Whether the iterative or recursive approach is better (or whether there is something I missed).
Whether DataType::Struct is the only DataType that requires flattening. To me, it looks like the only one that can contain nested Fields.

(I'm also not sure if I'm missing something with unwrapping something like a List<Struct>.)

Any feedback/help would be appreciated!
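To make the question above concrete, here is a minimal recursive sketch. It uses toy stand-in types (`ToyType`, `ToyField`, `field`, and `flatten` are hypothetical names, not the arrow-rs API) and assumes that only struct fields are expanded, while a list is kept whole even when its item is a struct.

```rust
// Toy stand-ins for Arrow's DataType and Field; these are NOT the arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int,
    Utf8,
    List(Box<ToyField>),
    Struct(Vec<ToyField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    ty: ToyType,
}

// Small constructor to keep the example readable.
fn field(name: &str, ty: ToyType) -> ToyField {
    ToyField { name: name.to_string(), ty }
}

/// Recursively flatten struct fields, joining names with `sep`.
/// Only `Struct` is expanded; a `List` is copied as-is, even if it wraps a struct.
fn flatten(fields: &[ToyField], sep: &str) -> Vec<ToyField> {
    let mut out = Vec::new();
    for f in fields {
        match &f.ty {
            ToyType::Struct(children) => {
                for child in flatten(children, sep) {
                    let name = format!("{}{}{}", f.name, sep, child.name);
                    out.push(field(&name, child.ty));
                }
            }
            _ => out.push(f.clone()),
        }
    }
    out
}

fn main() {
    let schema = vec![
        field("id", ToyType::Int),
        field(
            "a",
            ToyType::Struct(vec![
                field("x", ToyType::Utf8),
                field("b", ToyType::Struct(vec![field("y", ToyType::Int)])),
            ]),
        ),
        field(
            "tags",
            ToyType::List(Box::new(field(
                "item",
                ToyType::Struct(vec![field("k", ToyType::Utf8)]),
            ))),
        ),
    ];
    let names: Vec<String> = flatten(&schema, ".").into_iter().map(|f| f.name).collect();
    println!("{:?}", names); // ["id", "a.x", "a.b.y", "tags"]
}
```

The `tags` column comes out untouched because flattening a struct inside a list would change the number of rows per column, which is a different operation from renaming nested struct children.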

alamb · 2024-12-18T13:04:18Z

@kszlim can you please help review this PR? You requested the feature, and we are currently quite short on review capacity in arrow-rs.

alamb

Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.

Hopefully @kszlim can give this a look and help us review / get it moving too.

alamb · 2024-12-18T13:05:38Z

arrow-array/src/record_batch.rs

@@ -394,6 +396,56 @@ impl RecordBatch {
        )
    }

+    /// Normalize a semi-structured [`RecordBatch`] into a flat table.
+    ///
+    /// If max_level is 0, normalizes all levels.


Can you please improve this documentation (maybe copy from the pyarrow version)?

Document what max_level means (in addition to the 0 case)

Document what separator does

Provide an example of flattening a record batch as a doc example?

For example, like https://docs.rs/arrow/latest/arrow/index.html#columnar-format

Ah, missed doing this, will do!
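As an illustration of what such documentation would need to pin down, the sketch below shows one plausible reading of the two parameters on a toy schema tree. `Node`, `leaf`, `nested`, `normalized_names`, and `walk` are hypothetical names. The semantics are an assumption modelled on pandas-style `max_level` (expand at most that many levels of nesting), with 0 meaning "all levels" as stated in the doc comment under review; this is not taken from the final arrow-rs implementation.

```rust
// Toy schema tree; a hypothetical stand-in, not the arrow-rs `Schema`.
enum Node {
    Leaf,
    Struct(Vec<(String, Node)>),
}

fn leaf(name: &str) -> (String, Node) {
    (name.to_string(), Node::Leaf)
}

fn nested(name: &str, children: Vec<(String, Node)>) -> (String, Node) {
    (name.to_string(), Node::Struct(children))
}

/// Returns the flattened column names.
/// `separator` is placed between parent and child names.
/// `max_level` is the number of nesting levels to expand; 0 is read as "no limit".
fn normalized_names(fields: &[(String, Node)], separator: &str, max_level: usize) -> Vec<String> {
    let limit = if max_level == 0 { usize::MAX } else { max_level };
    let mut out = Vec::new();
    walk(fields, separator, limit, 0, None, &mut out);
    out
}

fn walk(
    fields: &[(String, Node)],
    separator: &str,
    limit: usize,
    depth: usize,
    prefix: Option<&str>,
    out: &mut Vec<String>,
) {
    for (name, node) in fields {
        let full = match prefix {
            Some(p) => format!("{}{}{}", p, separator, name),
            None => name.clone(),
        };
        match node {
            // Expand a struct only while the depth budget allows it.
            Node::Struct(children) if depth < limit => {
                walk(children, separator, limit, depth + 1, Some(full.as_str()), out)
            }
            // Leaves, and structs beyond the limit, are emitted as a single column.
            _ => out.push(full),
        }
    }
}

fn main() {
    let fields = vec![
        nested("a", vec![nested("b", vec![leaf("c")]), leaf("d")]),
        leaf("e"),
    ];
    println!("{:?}", normalized_names(&fields, ".", 0)); // ["a.b.c", "a.d", "e"]
    println!("{:?}", normalized_names(&fields, ".", 1)); // ["a.b", "a.d", "e"]
    println!("{:?}", normalized_names(&fields, "_", 2)); // ["a_b_c", "a_d", "e"]
}
```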

alamb · 2024-12-18T13:09:16Z

arrow-schema/src/schema.rs

@@ -413,6 +413,81 @@ impl Schema {
        &self.metadata
    }

+    /// Returns a new schema, normalized based on the max_level
+    /// This carries metadata from the parent schema over as well


Likewise, please document the parameters to this function and add a documentation example

Sounds good, thanks!

kszlim · 2024-12-19T06:10:58Z

I'll take a look, though please feel free to disregard anything I say and especially defer to the maintainers.

kszlim · 2024-12-19T06:17:31Z

arrow-array/src/record_batch.rs

+                    DataType::Struct(ff) => {
+                        // Need to zip these in reverse to maintain original order
+                        for (cff, fff) in c.as_struct().columns().iter().zip(ff.into_iter()).rev() {
+                            let new_key = format!("{}{}{}", f.name(), separator, fff.name());


Not sure if there's a better way to structure it, but is there a way to keep the field name parts in a Vec and create the flattened fields at the end? That allows you to avoid the repeated format! in a deeply nested schema.

Might not be worth the trouble though.

I think this is a good point, this is definitely not my favorite way to do this. I'll have to do some testing and think about it some more, but it may be better to construct the queue with the components of the Field, then go through and construct all of the Fields at the very end.

Added a (hopefully) better approach for this that concats the Vec<&str> when the field is done being processed.
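The difference between the two approaches discussed in this thread can be sketched in isolation. Both helpers below (`name_by_format` and `name_by_join`) are hypothetical and only model how the flattened name is built; they are not code from the PR.

```rust
/// Repeated formatting: builds a new String at every nesting level.
fn name_by_format(path: &[&str], separator: &str) -> String {
    let mut name = String::new();
    for (i, part) in path.iter().enumerate() {
        name = if i == 0 {
            (*part).to_string()
        } else {
            format!("{}{}{}", name, separator, part)
        };
    }
    name
}

/// Deferred join: keep the borrowed name parts and build the String once,
/// when the field is done being processed.
fn name_by_join(path: &[&str], separator: &str) -> String {
    path.join(separator)
}

fn main() {
    // Name parts collected while descending a nested schema.
    let path = vec!["a", "b", "c"];
    assert_eq!(name_by_format(&path, "."), "a.b.c");
    assert_eq!(name_by_join(&path, "."), "a.b.c");
    println!("{}", name_by_join(&path, "."));
}
```

Both produce the same name; the second avoids the intermediate strings, which matters mostly for deeply nested schemas.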

ngli-me · 2024-12-19T13:33:59Z

> Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.
>
> Hopefully @kszlim can give this a look and help us review / get it moving too.

No problem at all, it's the holiday season! Hope everyone's taking a good break.

Appreciate the feedback though! I'll get to work on it :)

ngli-me · 2024-12-31T06:41:17Z

Sorry for the delays on this one, made changes based on the feedback, would appreciate another look! Hopefully the new documentation is more clear.

Jefffrey

Some potential simplifications

Jefffrey · 2025-01-05T00:14:13Z

arrow-array/src/record_batch.rs

+    pub fn normalize(&self, separator: &str, mut max_level: usize) -> Result<Self, ArrowError> {
+        if max_level == 0 {
+            max_level = usize::MAX;
+        }


Suggested change

-    pub fn normalize(&self, separator: &str, mut max_level: usize) -> Result<Self, ArrowError> {
-        if max_level == 0 {
-            max_level = usize::MAX;
-        }
+    pub fn normalize(&self, separator: &str, max_level: Option<usize>) -> Result<Self, ArrowError> {
+        let max_level = max_level.unwrap_or(usize::MAX);

imo this seems the more Rusty way, making use of Option instead of a sentinel value (though I'm not sure if Some(0) is a valid input?)

Okay, I've been working on this a bit and found a few possible solutions that might fit. I think Option might not be the best choice, since personally, the case of Some(0) feels weird to me, and would mean you're doing an annoying copy for no reason (because of that, I would want to add in an if statement to catch it, but then we end up in the same place).

For RecordBatch, the following seems to fit the Rusty syntax better, but unfortunately the same solution can't be echoed over to Schema without an additional dependency, not sure how I feel about that.

max_level.is_zero().then(|| max_level = usize::MAX);

Another option is to use something like NonZeroUsize, which I just learned about. My issue with this one is that we'd then be making the normalize call more annoying, since you have to instantiate it with something like

NonZeroUsize::new(1)

This makes the normalize call potentially longer and more annoying, but it means there wouldn't be another import.

Any thoughts on these/if you disagree?

> the case of Some(0) feels weird to me, and would mean you're doing an annoying copy for no reason (because of that, I would want to add in an if statement to catch it, but then we end up in the same place).

Personally I find this okay. I'm less concerned with requiring an if check inside the code (it's pretty simple anyway) than with presenting a more Rust-like interface to users.

> but unfortunately the same solution can't be echoed over to Schema without an additional dependency, not sure how I feel about that.

I don't follow this, the Schema code looks almost identical to the RecordBatch version.

I agree with NonZeroUsize potentially being a bit clunky for users to use (personally wasn't aware this was part of the stdlib either).

But yeah I'm curious to see what others might think for this too.
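For comparison, the three interface shapes discussed in this thread can be sketched side by side using only the standard library. The function names are hypothetical, and treating `Some(0)` as "no limit" is an assumption made for this sketch; the thread leaves that case open.

```rust
use std::num::NonZeroUsize;

/// Sentinel style from the original diff: 0 means "no limit".
fn limit_from_sentinel(max_level: usize) -> usize {
    if max_level == 0 { usize::MAX } else { max_level }
}

/// Option style from the review suggestion: `None` means "no limit".
/// Mapping `Some(0)` to "no limit" as well is an assumption of this sketch.
fn limit_from_option(max_level: Option<usize>) -> usize {
    match max_level {
        None | Some(0) => usize::MAX,
        Some(n) => n,
    }
}

/// NonZeroUsize style: zero cannot be represented, so no extra check is needed,
/// at the cost of a more verbose call site.
fn limit_from_non_zero(max_level: Option<NonZeroUsize>) -> usize {
    max_level.map_or(usize::MAX, NonZeroUsize::get)
}

fn main() {
    println!("{}", limit_from_sentinel(2)); // 2
    println!("{}", limit_from_option(Some(2))); // 2
    println!("{}", limit_from_non_zero(NonZeroUsize::new(2))); // 2
    // All three "no limit" spellings resolve to usize::MAX.
    assert_eq!(limit_from_sentinel(0), usize::MAX);
    assert_eq!(limit_from_option(None), usize::MAX);
    assert_eq!(limit_from_non_zero(None), usize::MAX);
}
```

Note that `NonZeroUsize::new` already returns an `Option<NonZeroUsize>`, so `NonZeroUsize::new(0)` falls through to the "no limit" branch without a separate check.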

Jefffrey · 2025-01-05T00:26:41Z

arrow-array/src/record_batch.rs

+        if max_level == 0 {
+            max_level = usize::MAX;
+        }
+        let mut queue: VecDeque<(usize, &ArrayRef, Vec<&str>, &DataType, bool)> = VecDeque::new();


This queue seems to instead behave like a stack; it does push_back only when initializing the queue, but otherwise does pop_front/push_front; would it be more intuitive to just use a Vec to more accurately indicate this is a stack?

Another note: you could remove the need to store DataType and nullability as separate tuple fields by storing the original &FieldRef and retrieving DataType and nullability from it on demand. That reduces the number of tuple fields by one, which might be worth it considering there are quite a few fields already.

This makes some sense to me, I think with that approach we would need to reverse the Vec after the initial instantiation, that way we can just use pop. I'll do a bit of testing with this one.

Ah, that's a good point on the &FieldRef, changed it over, thanks!

You should be able to rev() the iter self.columns.iter().zip(self.schema.fields()) and then collect straight into a Vec I believe
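The stack-based traversal suggested here can be sketched on a toy tree. `Node` and `flatten_iterative` are hypothetical names, and the tuple is reduced to a name path plus a node reference; the real code also carries the column and field. The point is only that seeding and pushing in reverse lets a plain `Vec` with `pop` preserve the original column order.

```rust
// Toy schema tree; a hypothetical stand-in for nested Arrow fields.
enum Node<'a> {
    Leaf,
    Struct(Vec<(&'a str, Node<'a>)>),
}

/// Iteratively flatten nested names using a `Vec` as a stack.
fn flatten_iterative<'a>(fields: &'a [(&'a str, Node<'a>)], separator: &str) -> Vec<String> {
    // Seed the stack in reverse so that `pop` yields fields in their original order.
    let mut stack: Vec<(Vec<&'a str>, &'a Node<'a>)> = fields
        .iter()
        .rev()
        .map(|(name, node)| (vec![*name], node))
        .collect();
    let mut out = Vec::new();
    while let Some((path, node)) = stack.pop() {
        match node {
            Node::Struct(children) => {
                // Children are pushed in reverse for the same reason.
                for (name, child) in children.iter().rev() {
                    let mut child_path = path.clone();
                    child_path.push(*name);
                    stack.push((child_path, child));
                }
            }
            // The name is built once, when the leaf is reached.
            Node::Leaf => out.push(path.join(separator)),
        }
    }
    out
}

fn main() {
    let fields = vec![
        ("id", Node::Leaf),
        (
            "a",
            Node::Struct(vec![
                ("x", Node::Leaf),
                ("b", Node::Struct(vec![("y", Node::Leaf)])),
            ]),
        ),
        ("z", Node::Leaf),
    ];
    println!("{:?}", flatten_iterative(&fields, ".")); // ["id", "a.x", "a.b.y", "z"]
}
```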

Jefffrey · 2025-01-05T00:51:29Z

arrow-array/src/record_batch.rs

+    }
+
+    #[test]
+    fn normalize_nested() {


Perhaps have some test cases with some more complex types thrown in as well? e.g. have a ListArray with a StructArray within

(Even if to prove that the Struct within the List shouldn't be affected)

Good point, I'll work on these. I was a little hesitant since I'm not sure how many cases I need to cover (also since these tests are really annoying to instantiate), but it is a current blind spot.

ngli-me · 2025-01-05T06:13:02Z

Appreciate the feedback, as always. Changed some bits of the code, added some responses (and some stuff to work on).

nglime added 2 commits November 18, 2024 14:11

Added set up for the example of flattening from pyarrow.

bbd7c8b

Logic for recursive normalizer with a base normalize function, based …

8abcd25

…on pola-rs.

ngli-me changed the title ~~Feature/record batch flatten~~ RecordBatch normalization (flattening) Nov 20, 2024

ngli-me changed the title ~~RecordBatch normalization (flattening)~~ RecordBatch normalization (flattening) Nov 20, 2024

Added recursive normalize function for Schema, and started building…

6bba7d3

… iterative function for `RecordBatch`. Not sure which one is better currently.

github-actions bot added the arrow Changes to the arrow crate label Nov 23, 2024

Built out a bit more of the iterative normalize.

55eb953

ngli-me commented Nov 23, 2024

ngli-me marked this pull request as ready for review November 23, 2024 19:03

ngli-me marked this pull request as draft November 23, 2024 23:30

nglime added 2 commits November 23, 2024 21:03

Fixed normalize function for RecordBatch. Adjusted test case to mat…

30d6294

…ch the example from PyArrow.

Added tests for Schema normalization. Partial tests for RecordBatch.

0ed979d

ngli-me commented Nov 25, 2024

nglime added 2 commits November 24, 2024 21:54

Removed stray comments.

d9d08cd

Commenting out exclamation field.

d1b3260

ngli-me marked this pull request as ready for review November 25, 2024 04:02

nglime added 3 commits December 4, 2024 22:04

Merge remote-tracking branch 'upstream/main' into feature/record-batc…

a12082c

…h-flatten

Fixed test for RecordBatch.

7adda58

Formatting.

9c9c699

alamb reviewed Dec 18, 2024

kszlim reviewed Dec 19, 2024

ngli-me marked this pull request as draft December 20, 2024 12:27

nglime added 5 commits December 30, 2024 22:38

Additional documentation for normalize functions. Switched Schema…

4422add

… normalization to iterative approach.

Forgot to push to the columns in the else case.

d0dc5a7

Adjusted the documentation to include the parameters.

1e40c98

Formatting.

3c424d1

Edited examples to not be ran as tests.

6d6b026

ngli-me marked this pull request as ready for review December 31, 2024 06:41

Jefffrey reviewed Jan 5, 2025

Adjusted based on some of the suggestions. Simplified the matching an…

71380b6

…d if statements, simplified the VecDeque fields.
