Inconsistency between the vectorized-processing boundary calculation in the benchmark and the readBatchUsing512Vector calculation in the ParquetReadRouter class? #3073

Open
1111nit opened this issue Nov 22, 2024 · 0 comments

Comments


1111nit commented Nov 22, 2024

Hello, I have recently become interested in the Vector API Parquet bit-packing decode. While researching the code, I found that ByteBitPackingVectorBenchmarks.java (the official benchmark in the parquet-plugins-benchmarks folder) computes the vectorized boundary as totalByteCountVector = totalBytesCount - inputByteCountPerVector; once this range is exceeded, unpack8Values() is used to decode the remaining data, which ensures there is enough room for a full vector operation at the end. In readBatchUsing512Vector(), however, the boundary is totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512;
I am wondering whether this affects throughput and the choice of decoding logic for different bit widths.
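To make sure I am comparing the right things, here is a simplified sketch of the two decode loops as I understand them. This is not the actual parquet-mr code; the class and method names are mine, and the identifiers mirror the issue text.

```java
// Simplified sketch of the two decode loops being compared -- not the actual
// parquet-mr code; identifiers mirror the issue text above.
public class BoundarySketch {
  static final int BYTES_PER_VECTOR_512 = 64;  // one full 512-bit register

  static void decode(byte[] input, int totalValues, int bitWidth,
                     int inputByteCountPerVector, boolean reserveFullRegister) {
    int totalBytesCount = totalValues * bitWidth / 8;

    // Benchmark bound: vectorize as long as one more vector-sized input chunk fits.
    // Router bound:    reserve a whole 64-byte register, because the masked
    //                  512-bit load touches input[byteIndex .. byteIndex + 63].
    int totalByteCountVector = reserveFullRegister
        ? totalBytesCount - BYTES_PER_VECTOR_512
        : totalBytesCount - inputByteCountPerVector;

    int byteIndex = 0;
    while (byteIndex < totalByteCountVector) {
      // unpackValuesUsingVector(...) would run here: one 512-bit decode step.
      byteIndex += inputByteCountPerVector;
    }
    // The remaining (totalBytesCount - byteIndex) bytes fall back to unpack8Values(...).
  }
}
```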
My queries are as follows:

  1. For the same amount of data, the number of vectorized decode calls is reduced in this case; won't this weaken the optimization effect when combined with Spark?
    For example, with bitWidth = 3 and outputValues = 2048: using the benchmark bound totalByteCountVector = totalBytesCount - inputByteCountPerVector, the data is decoded with 63 calls to unpackValuesUsingVector and 4 calls to unpack8Values; using totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512, it is decoded with 59 calls to unpackValuesUsingVector and 20 calls to unpack8Values (see the call-count sketch after this list).
    [screenshot: totalVectorCount]

  2. Is the 64-byte reserve here meant to guard against out-of-bounds access and other data-safety concerns?
    Would it be possible to use totalByteCountVector = totalBytesCount - inputByteCountPerVector; in the readBatchUsing512Vector boundary calculation instead?

  3. Also, if readBatchUsing512Vector keeps totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512;, the vectorization bound already covers the bounds-safety case, so is it still necessary to load the vector from the byte[] array at an offset using a mask?
    [screenshot: load vector using mask or not]
    As I understand it, out-of-bounds access during vectorized processing is already prevented when BYTES_PER_VECTOR_512 is used, and the performance cost of masked loads for boundary safety is higher than that of static ByteVector fromArray(VectorSpecies species, byte[] a, int offset), which loads without a mask and simply leaves the excess loaded bytes unused. Would it be possible to load the vector from the byte[] array at the offset without a mask, even though each decode step would then load some extra data? Or is there some other reason why the mask cannot be removed (see the load sketch after this list)? I am eagerly waiting for an answer; sorry for my poor English.
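To make the numbers in question 1 concrete, here is a small self-contained calculation that reproduces the call counts quoted above. The assumption that one 512-bit step consumes bitWidth * 4 = 12 input bytes (32 output values) at bitWidth = 3 is inferred from those counts, and the class name is mine, not from the parquet code.

```java
public class CallCountSketch {
  public static void main(String[] args) {
    int bitWidth = 3;
    int totalValues = 2048;
    int valuesPerVector = 32;                                      // assumed: one 512-bit step unpacks 32 values at bitWidth = 3
    int inputByteCountPerVector = valuesPerVector * bitWidth / 8;  // 12 packed bytes per vector step
    int totalBytesCount = totalValues * bitWidth / 8;              // 768 packed bytes in total

    for (int reservedBytes : new int[] {inputByteCountPerVector, 64}) {
      int totalByteCountVector = totalBytesCount - reservedBytes;
      int vectorCalls = 0;
      int byteIndex = 0;
      while (byteIndex < totalByteCountVector) {
        vectorCalls++;
        byteIndex += inputByteCountPerVector;
      }
      int valuesDecodedByVector = vectorCalls * valuesPerVector;
      int unpack8Calls = (totalValues - valuesDecodedByVector) / 8;  // unpack8Values emits 8 values per call
      System.out.printf("reserve %2d bytes -> %d vector calls, %d unpack8Values calls%n",
          reservedBytes, vectorCalls, unpack8Calls);
    }
    // Prints: reserve 12 bytes -> 63 vector calls, 4 unpack8Values calls
    //         reserve 64 bytes -> 59 vector calls, 20 unpack8Values calls
  }
}
```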
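Regarding questions 2 and 3: as I understand it, the 64-byte reserve and the mask both serve the same purpose, namely keeping the 512-bit load from reading past the end of the input array. Below is a minimal sketch of the two load variants; the class name is mine, it needs --add-modules jdk.incubator.vector to compile, and it is not the parquet code itself.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class LoadSketch {
  static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;  // 64 byte lanes

  public static void main(String[] args) {
    byte[] input = new byte[768];

    // Masked load: legal even when fewer than 64 bytes remain after the offset,
    // because lanes past input.length are switched off by the mask.
    int tailIndex = 720;                       // only 48 bytes left after this offset
    VectorMask<Byte> mask = SPECIES.indexInRange(tailIndex, input.length);
    ByteVector maskedLoad = ByteVector.fromArray(SPECIES, input, tailIndex, mask);

    // Unmasked load: only legal when offset + 64 <= input.length, which is exactly
    // the condition that the totalBytesCount - BYTES_PER_VECTOR_512 bound enforces.
    int safeIndex = 640;
    ByteVector plainLoad = ByteVector.fromArray(SPECIES, input, safeIndex);

    System.out.println(maskedLoad.lane(0) + " " + plainLoad.lane(0));
  }
}
```

If the loop bound already guarantees byteIndex + 64 <= input.length, the unmasked overload seems sufficient for the vectorized part; the trade-off is that the stricter bound gives up a few vector iterations, which is exactly the 63-versus-59 difference above. Please correct me if I have misunderstood.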

Component(s)

Core, Benchmark
