Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(arrow-ipc) Add: Support FileReader and StreamReader skip array data validation #6938

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

totoroyyb
Copy link

@totoroyyb totoroyyb commented Jan 4, 2025

Which issue does this PR close?

Rationale for this change

Beforehand, array data validation is performed for each array creation when reading an IPC file (or stream), which comes with a significant overhead. In some cases, this overhead is unwanted or the file content is trusted.

There are existing functions defined in the codebase to avoid data validation but this is not exposed to the upper level APIs. This PR brings options for both FileReader and StreamReader to disable it.

What changes are included in this PR?

  • Add skip_validations, as an argument, to multiple functions signatures.
  • Provide a new constructor try_new_unvalidated to FileReader
  • Provide a new function with_skip_validations to StreamReader
  • Various related minor changes.

Are there any user-facing changes?

No, I don't think so. There are no API-breaking changes, and essentially, two new APIs are introduced. Other changes are on the internal codes I believe.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 4, 2025
@totoroyyb
Copy link
Author

totoroyyb commented Jan 4, 2025

FYI, clippy does complain about too many arguments on one or two function signatures atm. I just want to give this PR a go for now.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately as implemented this is unsound, in that it permits UB via safe functions, this probably needs a bit more thought

@@ -79,6 +79,7 @@ fn create_array(
field: &Field,
variadic_counts: &mut VecDeque<i64>,
require_alignment: bool,
skip_validations: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically all of these functions are now unsound and must be marked unsafe. This is not ideal, and probably needs a bit more thought into handling this better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possibility might be to create a new function like:

fn unsafe create_array_unchecked(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
    skip_validations: bool,
) -> Result<ArrayRef, ArrowError> {

And change the existing function to call it:

fn create_array(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
) -> Result<ArrayRef, ArrowError> {
  
    let skip_validations = false;
    // safety: enable validatiions when checking
    unsafe { create_array_unchecked(reader, field, variadic_counts, require_aligment, skip_validations) };
}

This is kind of messy though -- finding some way to encapsulate the settings into a struct would be better

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls check the latest commit, which separating unsafe codes.

Each function like create_array, now has a unsafe version that skips validation. Codebase is not idea, but currently I don't have much better idea to encapusulate it.

arrow-ipc/src/reader.rs Show resolved Hide resolved
@totoroyyb
Copy link
Author

totoroyyb commented Jan 5, 2025

Refactor all unsafe codes into separate functions. It is a bit messy as mentioned @alamb mentioned. From bottom-to-up, there are many code movings and renames.

Now, FileReader has a new wrapper struct:

/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
    reader: FileReader<R>,
}

impl<R: Read + Seek> Iterator for UnvalidatedFileReader<R> {
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.reader.current_block < self.reader.total_blocks {
            // Use the unsafe `maybe_next_unvalidated` function
            unsafe {
                match self.reader.maybe_next_unvalidated() {
                    Ok(Some(batch)) => Some(Ok(batch)),
                    Ok(None) => None, // End of the file
                    Err(e) => Some(Err(e)),
                }
            }
        } else {
            None
        }
    }
}

This still unsound iteractor implementation, but I am not sure how to express it so that we can use iterator with validation disabled.

Any thoughts or suggestions?

@totoroyyb totoroyyb requested review from tustvold and alamb January 5, 2025 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optionally disable data validation Unsafe Unchecked IPC Read RecordBatch
3 participants