feat(arrow-ipc) Add: Support FileReader and StreamReader skip array data validation #6938
base: main
FYI, clippy currently complains about too many arguments on one or two function signatures. I just want to give this PR a go for now.
Unfortunately, as implemented this is unsound: it permits UB via safe functions. This probably needs a bit more thought.
@@ -79,6 +79,7 @@ fn create_array(
     field: &Field,
     variadic_counts: &mut VecDeque<i64>,
     require_alignment: bool,
+    skip_validations: bool,
Technically, all of these functions are now unsound and must be marked unsafe. This is not ideal, and how to handle it better probably needs a bit more thought.
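To make the soundness concern concrete, here is a toy sketch that does not use arrow-ipc at all. The function `decode_utf8` and its `skip_validation` parameter are hypothetical stand-ins for `create_array` and `skip_validations`; the point is only that a safe signature with such a flag lets any caller reach an unchecked conversion.

```rust
use std::str::{from_utf8, from_utf8_unchecked, Utf8Error};

/// Hypothetical decoder (not an arrow-ipc API).
///
/// Unsound: this is a safe `fn`, yet calling it with `skip_validation = true`
/// and invalid UTF-8 is undefined behavior.
pub fn decode_utf8(bytes: &[u8], skip_validation: bool) -> Result<&str, Utf8Error> {
    if skip_validation {
        // No caller ever promised that `bytes` is valid UTF-8, so this
        // `unsafe` block has nothing to base its safety argument on.
        Ok(unsafe { from_utf8_unchecked(bytes) })
    } else {
        from_utf8(bytes)
    }
}
```

Marking the function `unsafe fn` would move the obligation to the caller, which is what the comment above asks for.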
One possibility might be to create a new function like:
unsafe fn create_array_unchecked(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
    skip_validations: bool,
) -> Result<ArrayRef, ArrowError> {
And change the existing function to call it:
fn create_array(
    reader: &mut ArrayReader,
    field: &Field,
    variadic_counts: &mut VecDeque<i64>,
    require_alignment: bool,
) -> Result<ArrayRef, ArrowError> {
    let skip_validations = false;
    // SAFETY: validation stays enabled, so no unchecked data is trusted
    unsafe { create_array_unchecked(reader, field, variadic_counts, require_alignment, skip_validations) }
}
This is kind of messy though -- finding some way to encapsulate the settings into a struct would be better
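One way the "settings in a struct" idea could look is sketched below. Everything here is an assumption for illustration: `ReadOptions`, `with_skip_validation`, and `read_text` are hypothetical names, and UTF-8 decoding stands in for array validation. The flag is a private field that can only be set through an `unsafe` method, so a single safe function can branch on it.

```rust
use std::str::Utf8Error;

/// Hypothetical settings struct (illustrative names, not arrow-ipc APIs).
#[derive(Debug, Clone, Copy)]
pub struct ReadOptions {
    require_alignment: bool,
    // Private: safe code outside this module cannot set it directly.
    skip_validation: bool,
}

impl ReadOptions {
    /// Validation is enabled by default.
    pub fn new(require_alignment: bool) -> Self {
        Self { require_alignment, skip_validation: false }
    }

    pub fn require_alignment(&self) -> bool {
        self.require_alignment
    }

    /// # Safety
    /// The caller must guarantee that all data read with these options is valid.
    pub unsafe fn with_skip_validation(mut self, skip: bool) -> Self {
        self.skip_validation = skip;
        self
    }
}

/// Stand-in for `create_array`: one safe function, no `_unchecked` twin.
pub fn read_text(bytes: &[u8], options: &ReadOptions) -> Result<String, Utf8Error> {
    let text = if options.skip_validation {
        // SAFETY: the flag can only be set via the `unsafe`
        // `with_skip_validation`, whose caller vouched for the data.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    } else {
        std::str::from_utf8(bytes)?
    };
    Ok(text.to_owned())
}
```

With this shape the long argument lists shrink to a single `&ReadOptions`, and the only `unsafe` a user writes is at the point where they opt out of validation.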
Please check the latest commit, which separates out the unsafe code. Each function like create_array now has an unsafe version that skips validation. The code is not ideal, but I don't currently have a better idea for encapsulating it.
I refactored all the unsafe code into separate functions. It is a bit messy, as @alamb mentioned. Working from the bottom up, there are many code moves and renames. The result now looks like this:
/// An iterator over the record batches (without validation) in an Arrow file
pub struct UnvalidatedFileReader<R: Read + Seek> {
    reader: FileReader<R>,
}
impl<R: Read + Seek> Iterator for UnvalidatedFileReader<R> {
    type Item = Result<RecordBatch, ArrowError>;
    fn next(&mut self) -> Option<Self::Item> {
        if self.reader.current_block < self.reader.total_blocks {
            // Use the unsafe `maybe_next_unvalidated` function
            unsafe {
                match self.reader.maybe_next_unvalidated() {
                    Ok(Some(batch)) => Some(Ok(batch)),
                    Ok(None) => None, // End of the file
                    Err(e) => Some(Err(e)),
                }
            }
        } else {
            None
        }
    }
}
This is still an unsound iterator implementation, but I am not sure how to express it so that we can use an iterator with validation disabled. Any thoughts or suggestions?
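One common Rust pattern for this situation is to make the constructor of the unvalidated iterator `unsafe`, so the caller takes on the obligation once and the safe `next` can rely on it. The sketch below is a toy under stated assumptions: `ChunkReader` and `UnvalidatedChunkReader` are hypothetical stand-ins for `FileReader` and `UnvalidatedFileReader`, and UTF-8 decoding stands in for array validation.

```rust
/// Hypothetical reader over raw chunks; stands in for `FileReader`.
pub struct ChunkReader {
    chunks: std::vec::IntoIter<Vec<u8>>,
}

impl ChunkReader {
    pub fn new(chunks: Vec<Vec<u8>>) -> Self {
        Self { chunks: chunks.into_iter() }
    }
}

/// Iterator that skips validation; stands in for `UnvalidatedFileReader`.
pub struct UnvalidatedChunkReader {
    reader: ChunkReader,
}

impl UnvalidatedChunkReader {
    /// # Safety
    /// Every chunk yielded by `reader` must be valid UTF-8.
    pub unsafe fn new(reader: ChunkReader) -> Self {
        Self { reader }
    }
}

impl Iterator for UnvalidatedChunkReader {
    type Item = String;

    fn next(&mut self) -> Option<Self::Item> {
        let chunk = self.reader.chunks.next()?;
        // SAFETY: whoever called `UnvalidatedChunkReader::new` guaranteed
        // that every chunk is valid UTF-8.
        let text = unsafe { std::str::from_utf8_unchecked(&chunk) };
        Some(text.to_owned())
    }
}
```

Because the struct cannot be built without `unsafe`, safe code alone cannot reach the unchecked path, which is the property the review comments ask for. Whether this fits the actual `FileReader` internals is not something this sketch can confirm.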
Which issue does this PR close?
Rationale for this change
Currently, array data is validated every time an array is created while reading an IPC file (or stream), which adds significant overhead. In some cases this overhead is unwanted, or the file content is trusted.
There are existing functions in the codebase that avoid data validation, but they are not exposed through the higher-level APIs. This PR adds options to both FileReader and StreamReader to disable validation.
What changes are included in this PR?
Add skip_validations, as an argument, to multiple function signatures.
Add try_new_unvalidated to FileReader.
Add with_skip_validations to StreamReader.
Are there any user-facing changes?
No, I don't think so. There are no API-breaking changes; essentially, two new APIs are introduced. I believe the other changes are internal.