
Speed up searches by removing repeated memsets coming from vec.resize() #2327

Draft
wants to merge 1 commit into base: main

Conversation

@RaphaelMarinier commented Mar 12, 2024

Additionally, this reserves exactly the size needed, which is surprisingly necessary to get the full ~5% speedup on a good fraction of the queries.

The memset (accounting for 4% of the CPU cycles) in the range_numeric query is indeed gone with this PR; see the flame graphs:
baseline range_numeric
reserve_exact and set_len range_numeric

Overall, it makes things a few percent faster on many queries:
tantivy reserve_exact and set_len vs baseline

Note: I made sure to use the same version of tantivy for the baseline and for this proposed change. The only difference between the two columns is this PR.

It would be possible to get this speedup without unsafe by never truncating the vector (as is done in filter_vec_in_place()) and instead passing the number of valid elements back to VecCursor, as sketched below.
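For illustration, a minimal sketch of that safe alternative, using hypothetical names (ReusableBuffer, refill); the real change would involve VecCursor and the range query code:

```rust
/// Hypothetical buffer illustrating the safe alternative: the backing Vec is
/// never truncated, and the number of valid elements is tracked separately,
/// so neither resize()'s memset nor an unsafe set_len is needed.
struct ReusableBuffer {
    data: Vec<u32>,
    valid_len: usize,
}

impl ReusableBuffer {
    /// Overwrites the buffer with new values, reusing existing slots and
    /// growing only when more capacity is needed.
    fn refill(&mut self, values: impl Iterator<Item = u32>) {
        self.valid_len = 0;
        for v in values {
            if self.valid_len < self.data.len() {
                self.data[self.valid_len] = v;
            } else {
                self.data.push(v);
            }
            self.valid_len += 1;
        }
    }

    /// The slice a cursor (such as VecCursor) would iterate over.
    fn valid(&self) -> &[u32] {
        &self.data[..self.valid_len]
    }
}
```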

@RaphaelMarinier marked this pull request as ready for review on March 12, 2024, 17:04
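// Diff excerpt: reserve exactly the needed capacity and bump the length
// without the zero-initialization that Vec::resize would perform; the
// elements are written immediately afterwards by the decoding code.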
positions.reserve_exact(id_range.len());
#[allow(clippy::uninit_vec)]
unsafe {
    positions.set_len(id_range.len());
}
Collaborator


I am pretty sure this is undefined behaviour due to the validity requirements on integers which newly allocated memory does not fulfil.

If you want to apply this optimization, get_batch_u32s needs to work with the return value of Vec::spare_capacity_mut to write into the MaybeUninit<u32> values and only call set_len after all of these values have been initialized.
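For concreteness, a minimal sketch of that pattern with hypothetical names (fill_without_memset and decode_into stand in for the real call site and for get_batch_u32s); this illustrates the suggestion, it is not the actual PR code:

```rust
use std::mem::MaybeUninit;

/// Hypothetical call site: reserve, let the decoder write into the
/// uninitialized spare capacity, then extend the length.
fn fill_without_memset(positions: &mut Vec<u32>, n: usize) {
    positions.reserve_exact(n);
    // Spare capacity exposed as MaybeUninit<u32> slots: no u32 value is
    // ever read or created while uninitialized.
    let spare = &mut positions.spare_capacity_mut()[..n];
    decode_into(spare);
    // SAFETY: decode_into initialized all `n` slots above.
    unsafe {
        let new_len = positions.len() + n;
        positions.set_len(new_len);
    }
}

/// Hypothetical stand-in for get_batch_u32s writing through MaybeUninit.
fn decode_into(out: &mut [MaybeUninit<u32>]) {
    for (i, slot) in out.iter_mut().enumerate() {
        slot.write(i as u32); // placeholder for the real bit-unpacking
    }
}
```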

Author


Indeed, thanks for pointing this out. I see that in rust-lang/unsafe-code-guidelines#439 it was decided that the mere existence of an uninitialized integer is undefined behavior. There is also an interesting discussion of that here, where it was controversially concluded to be UB.

The trick is the same as in quickwit-oss/quickwit#4712, so we should probably decide whether we do neither or both. @fulmicoton

For that quickwit PR, I see that read_buf (see the RFC and the function doc) could be used, but unfortunately it is not stable yet.

For this PR, note that this vector ends up being filled by AVX2 instructions, so what you propose does not seem possible: we would either have to take a performance hit by storing the output of the AVX2 computation in a temporary buffer and write()-ing it into the uninitialized vector, or use MaybeUninit::as_mut_ptr, which would lead us back to UB. Or do you have an efficient suggestion to make it work with the AVX instructions?

Collaborator


> so we should probably decide whether we do neither or both. @fulmicoton

I'd say we should do it for one and not for the other.
For the Read side, yes; the unsafe here is simply not needed. We also own the bitpacking crate, so we can just add a method that takes a MaybeUninit buffer.

For the Read trait, this is a well-known problem that has a workaround, but it is not stabilized yet.
The problem is that if a Read implementation were to read the buffer, we would be in big trouble.

We put that in a place where the Read impl is well defined, so I think we are ok.
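A hedged sketch of what such a method could look like; the name decompress_block_into_uninit and its signature are invented for illustration and are not the bitpacking crate's actual API:

```rust
use std::mem::MaybeUninit;

/// Invented-for-illustration decoder entry point: it guarantees that every
/// slot of `out` is written before returning, so a caller can safely pair it
/// with Vec::spare_capacity_mut and set_len.
pub fn decompress_block_into_uninit(compressed: &[u8], out: &mut [MaybeUninit<u32>]) {
    // Placeholder body: a real implementation would run the SIMD
    // bit-unpacking kernels and write through out.as_mut_ptr().
    for (i, slot) in out.iter_mut().enumerate() {
        let value = compressed.get(i).copied().unwrap_or(0) as u32;
        slot.write(value);
    }
}
```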

@fulmicoton
Collaborator

I think there is a problem in the benchmark. That code is not used in most of the queries showing an improvement.

@RaphaelMarinier marked this pull request as draft on April 22, 2024, 08:04