Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[arrow-string] Implement string view support for regexp_match #6849

Merged
merged 4 commits into from
Dec 30, 2024

Conversation

tlm365
Copy link
Contributor

@tlm365 tlm365 commented Dec 6, 2024

Which issue does this PR close?

Closes #6717.

Rationale for this change

What changes are included in this PR?

Add support string view (Utf8View) for regexp_match, unit tests.

Are there any user-facing changes?

More type support for regexp_match.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2024
@tlm365 tlm365 marked this pull request as ready for review December 6, 2024 18:17
@alamb
Copy link
Contributor

alamb commented Dec 18, 2024

Thank you for this PR @tlm365 -- I apologize for the delay in reviewing. I believe @wiedld plans to help review this PR later today

Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really liked the use of macros to make the test cases very clean. Although I'm unclear of the reason to use macros elsewhere.

let builder = StringViewBuilder::with_capacity(0);
let mut list_builder = ListBuilder::new(builder);

process_regexp_array_match!(array, regex_array, flags_array, list_builder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this expands to the original logic, for both StringView (here) and StringArrays (above), to match on an array of regex. Why the use of macros?

Comment on lines +416 to +423
fn regexp_scalar_match_utf8view(
array: &StringViewArray,
regex: &Regex,
) -> Result<ArrayRef, ArrowError> {
let builder = StringViewBuilder::with_capacity(0);
let mut list_builder = ListBuilder::new(builder);

process_regexp_match!(array, regex, list_builder);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pattern, and same question as before.
It's the same logic (for a single regex) into a macro to use for both StringView and StringArray. Why the macro?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wiedld Thanks so much for reviewing ❤️

Confirmed this expands to the original logic, for both StringView (here) and StringArrays (above), to match on an array of regex. Why the use of macros?

It's the same logic (for a single regex) into a macro to use for both StringView and StringArray. Why the macro?

I just want to reduce duplicate code here. I remember that I have struggled when I implement generic functions instead of macro. But if it makes code hard to read/debug/maintain I can try to remove the macro and using another solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around a bit, and It looks like some of the common array builder methods are not traits on ArrayBuilder. Specifically, I got to here and was blocked by the builder.append_value():

fn process_regexp_match<'a, T: ?Sized, A: ArrayAccessor<Item = &'a str>, O: OffsetSizeTrait, B: ArrayBuilder>(
    array: ArrayIter<A>,
    regex: &Regex,
    list_builder: &mut GenericListBuilder<O, B>,
) -> Result<(), ArrowError> {
    ...
}

It looks like this method is only implemented on 9 of the 15 current ArrayBuilder implementations. Not sure on whether the method should be pulled into the interface or whether we want additional builder traits/interface definitions (maybe in a separate PR). @alamb ?

Otherwise, yes the macro is the alternative for DRY. But it feels like we want abstraction (a.k.a. a trait).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A trait sounds like a good idea to me, but unless we have a specific idea/proposal of one, I think macros are the best way to proceed (we can always clean the code up later)

Or perhaps @wiedld if you have time you can try and work out what a trait based solution would look like so we can compare

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that in a follow up PR. Thank you for the input @alamb .

@tlm365 -- if the additional test gets added, then I can approve.

Copy link
Contributor

@wiedld wiedld Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @tlm365 -- Here is a proposed trait ValuesBuilder: ArrayBuilder for common interfaces in building a list of values.

Note that there were several other common sets of builder functionality (such as building lists of lists), but I didn't address that due to (a) keep the PR small, and (b) I knew only of this immediate use case (altho there are likely others).

Comment on lines -518 to -526
expected_builder.values().append_value("005");
expected_builder.append(true);
expected_builder.values().append_value("7");
expected_builder.append(true);
expected_builder.append(false);
expected_builder.append(false);
expected_builder.append(false);
expected_builder.values().append_value("");
expected_builder.append(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer not having this anymore.

Comment on lines 737 to 745
test_match_scalar_pattern!(
match_scalar_pattern_string_no_flags,
vec![Some("abc-005-def"), Some("x-7-5"), Some("X545"), None],
r"x.*-(\d*)-.*",
None::<&str>,
StringArray,
GenericStringBuilder<i32>,
[None, Some("7"), None, None]
);
Copy link
Contributor

@wiedld wiedld Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no difference in the match results with, or without, flags. Might be nice to have test cases demonstrating this difference. 🙏🏼

@tlm365 tlm365 marked this pull request as draft December 28, 2024 02:18
Signed-off-by: Tai Le Manh <[email protected]>
@tlm365 tlm365 marked this pull request as ready for review December 28, 2024 02:40
Copy link
Contributor

@wiedld wiedld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the added test cases.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tlm365 for this PR and @wiedld for the review

I am running benchmarks on this PR to double check that there are no performance implications

cargo bench --bench regexp_kernels

Will report back when complete

@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

++ critcmp master regex-match-string-view
group            master                                 regex-match-string-view
-----            ------                                 -----------------------
regexp           1.00     11.9±0.14ms        ? ?/sec    1.00     11.9±0.10ms        ? ?/sec
regexp scalar    1.00      9.0±0.15ms        ? ?/sec    1.00      9.0±0.19ms        ? ?/sec

👍 🚀

@alamb alamb merged commit 5249f99 into apache:main Dec 30, 2024
18 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Thanks again for the patience @tlm365 and @wiedld for the revie

@tlm365
Copy link
Contributor Author

tlm365 commented Dec 30, 2024

Thanks for the review @wiedld @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

implement regexp_match, regexp_scalar_match and regexp_array_match for StringViewArray
3 participants