Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(arrow-array): add BinaryArrayType similar to StringArrayType #6924

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rluvaton
Copy link
Contributor

@rluvaton rluvaton commented Dec 31, 2024

Rationale for this change

This trait helps to abstract over the different types of binary arrays
so that we don't need to duplicate the implementation for each type.

Same as StringArrayType

What changes are included in this PR?

Added BinaryArrayType and implemented for BinaryArray, LargeBinaryArray, BinaryViewArray and FixedSizeBinaryArray

Are there any user-facing changes?

Yes, there is a new public trait

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 31, 2024
Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question. What does this provide which is missing with the current IntoIterator implementations for &'a GenericByteArray, &'a FixedSizeBinaryArray, and &'a BinaryViewArray?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rluvaton and @wiedld -- this looks like a good idea to me.

My concern is that this code is untested.

Would it be possible to use of this trait somewhere(e.g rewrite one of the existing kernels using it)? That would be the best kind of test in my opinion

@wiedld
Copy link
Contributor

wiedld commented Jan 4, 2025

Question. What does this provide which is missing with the current IntoIterator implementations for &'a GenericByteArray, &'a FixedSizeBinaryArray, and &'a BinaryViewArray?

My question above is because I think this may not be needed? But I wasn't sure if I was missing something.

This introduced trait converts a referenced array to an ArrayIter, and it does so by abstracting over these three generic types (GenericBinaryArray, BinaryViewArray, FixedSizeBinaryArray) calling the iter().

When I looked at the existing code, we have a &'a GenericByteArray which also takes a reference and returns ArrayIter. So I wasn't sure why this was needed -- and what it added by making the iter() trait which does the same thing as the existing into_iter() trait. Am I missing something?

@alamb
Copy link
Contributor

alamb commented Jan 4, 2025

When I looked at the existing code, we have a &'a GenericByteArray which also takes a reference and returns ArrayIter. So I wasn't sure why this was needed -- and what it added by making the iter() trait which does the same thing as the existing into_iter() trait. Am I missing something?

I am not sure -- that is why I was asking for an example of this trait being used -- then we can have a concrete discussion if this API is duplicative or not

@alamb
Copy link
Contributor

alamb commented Jan 4, 2025

When I looked at the existing code, we have a &'a GenericByteArray which also takes a reference and returns ArrayIter. So I wasn't sure why this was needed

I think the idea is that BinaryViewArray is not a GenericBinaryArray (it is a similarly named GenericBinaryViewArray, right?)

@kylebarron
Copy link
Contributor

kylebarron commented Jan 4, 2025

IMO this looks great.

I was just trying to figure out what this provided that ArrayAccessor didn't provide directly, and I didn't realize that ArrayAccessor only provided value and value_unchecked. So having this BinaryArrayType would be nicer to be able to use iterator methods over any generic binary array input.

    fn use_binary_array<'a>(array: impl arrow_array::ArrayAccessor<Item = &'a [u8]>) {
        // Have to manually iterate over the array
        for i in 0..array.len() {
            dbg!(String::from_utf8(array.value(i).to_vec()).unwrap());
        }
    }

    #[test]
    fn binary_array() {
        let mut builder = BinaryViewBuilder::new();
        builder.append_value(b"foo");
        builder.append_value(b"bar");
        builder.append_value(b"baz");
        use_binary_array(&builder.finish());

        let mut builder = FixedSizeBinaryBuilder::new(3);
        builder.append_value(b"foo").unwrap();
        builder.append_value(b"bar").unwrap();
        builder.append_value(b"baz").unwrap();
        use_binary_array(&builder.finish());
    }

@wiedld
Copy link
Contributor

wiedld commented Jan 5, 2025

The above example use case (kindly provided by @kylebarron) -- is already passing on main. 🤔
See demonstration: influxdata#6.
I added comments for where the generic iterators are already implemented.

@alamb
Copy link
Contributor

alamb commented Jan 5, 2025

The above example use case (kindly provided by @kylebarron) -- is already passing on main. 🤔 See demonstration: influxdata#6. I added comments for where the generic iterators are already implemented.

So perhaps the conclusion is that ArrayAccessor is already sufficient, but it was not obvious / clear how to use it / find it.

What can we do to make it clearer to find? Maybe we can add this example on ArrayAccessor and then add a link to ArrayAccessor from the documentation for BinaryViewArray and BinaryArray and FixedSizdBinaryArray 🤔

@kylebarron
Copy link
Contributor

Well, on the contrary I was saying that BinaryArrayType could be nice so that consumers could access iterator methods, instead of needing to iterate by hand. I'm not sure if that warrants a new public API, but given that StringArrayType already exists, I think having a binary corollary would make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants