[data] optimize parquet datasource split tasks algorithm #47954

Jay-ju · 2024-10-09T01:38:56Z

Why are these changes needed?

Splitting a list with np.array_split is slower than using _split_list, and the performance difference can be significant.

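import time

import numpy as np

# Assumed import path for the Ray helper being compared; adjust if it differs.
from ray.data._internal.util import _split_list

int_array = range(1000003)
start = time.perf_counter()

# Time np.array_split on about 1M elements split into 100 parts.
for split in np.array_split(int_array, 100):
    len(split)

end1 = time.perf_counter()
print(f"np.array_split, 1M elements into 100 splits: {(end1 - start)*1000} ms")

# Time _split_list on the same input.
for split in _split_list(int_array, 100):
    len(split)
end2 = time.perf_counter()
print(f"_split_list, 1M elements into 100 splits: {(end2 - end1)*1000} ms")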

The result is:

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

raulchen · 2024-10-10T23:33:52Z

Thanks for your contribution.
Could you elaborate on why np.array_split would produce uneven and non-deterministic results? I tried the following script, and the splits are even.

import numpy as np

arr = list(range(10000))
for split in np.array_split(arr, 100):
    print(len(split))
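
For comparison, the same even sizing can be reproduced in pure Python. The sketch below uses a hypothetical helper, `split_evenly`, introduced here only for illustration; it is not claimed to be how Ray's `_split_list` is written. It gives the first `len(items) % n` chunks one extra element, which is the sizing rule the NumPy documentation describes for `np.array_split`, so the result is deterministic and chunk sizes differ by at most one.

```python
def split_evenly(items, n):
    """Split `items` into `n` contiguous chunks whose sizes differ by at most 1.

    Hypothetical helper for illustration only.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # q is the base chunk size; the first r chunks get one extra element.
    q, r = divmod(len(items), n)
    chunks = []
    start = 0
    for i in range(n):
        size = q + (1 if i < r else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


arr = list(range(10003))
sizes = [len(chunk) for chunk in split_evenly(arr, 100)]
print(sizes[:5])   # the first three chunks absorb the remainder of 3
print(sum(sizes))  # every element lands in exactly one chunk
```

Running it twice on the same input yields identical chunks, since the boundaries depend only on the input length and `n`.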

Jay-ju · 2024-10-11T02:29:48Z

Sorry, the description was confusing. I have provided a benchmark here.

Signed-off-by: jukejian <[email protected]>

raulchen · 2024-10-16T00:00:55Z

Do you know why np.array_split is slow?
Also, I'm curious about your use case. This operation is done only once upon job start-up, and a latency increase of tens of milliseconds doesn't sound like a big deal. Why would it matter in your case?
Asking because _split_list is no longer used elsewhere. We should consider removing it to reduce maintenance overhead, unless there is a strong reason to keep it.
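
One plausible contributor to the gap, which is an assumption and not something confirmed in this thread, is that np.array_split must first turn a non-array input (a list or range) into an ndarray, touching every element, while slicing a Python list only copies references. The sketch below times the conversion, the split of an already-built array, and plain list slicing separately. The variable names and the slicing bounds are illustrative, and timings will vary by machine.

```python
import time

import numpy as np

items = list(range(1_000_003))
n = 100

# Conversion that np.array_split performs implicitly for non-array input.
t0 = time.perf_counter()
as_array = np.asarray(items)
t1 = time.perf_counter()

# Splitting an existing ndarray only slices it.
np_chunks = np.array_split(as_array, n)
t2 = time.perf_counter()

# Plain list slicing with the same boundaries (first r chunks get one extra).
q, r = divmod(len(items), n)
bounds = [i * q + min(i, r) for i in range(n + 1)]
list_chunks = [items[bounds[i]:bounds[i + 1]] for i in range(n)]
t3 = time.perf_counter()

print(f"list -> ndarray conversion: {(t1 - t0) * 1000:.2f} ms")
print(f"np.array_split on ndarray:  {(t2 - t1) * 1000:.2f} ms")
print(f"list slicing:               {(t3 - t2) * 1000:.2f} ms")
```

If the conversion step dominates on a given machine, that would account for most of the difference reported in the benchmark; if it does not, the cause lies elsewhere.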

Jay-ju · 2024-10-16T05:07:09Z

It was only discovered during single-machine testing.

raulchen · 2024-10-18T01:30:46Z

Then maybe let's not introduce this change and just remove _split_list.

Jay-ju requested review from scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners October 9, 2024 01:38

Jay-ju force-pushed the split_optimizer branch 2 times, most recently from 9db3600 to 2d6d32e Compare October 9, 2024 01:43

Jay-ju changed the title ~~[Enhancement] modify parquet datasource split tasks algorithm~~ [data] modify parquet datasource split tasks algorithm Oct 9, 2024

Jay-ju changed the title ~~[data] modify parquet datasource split tasks algorithm~~ [data] optimize parquet datasource split tasks algorithm Oct 9, 2024

Jay-ju force-pushed the split_optimizer branch 2 times, most recently from 013acee to a855450 Compare October 11, 2024 02:14

Jay-ju requested review from sven1977, maxpumperla, simonsays1980 and a team as code owners October 11, 2024 02:14

Jay-ju force-pushed the split_optimizer branch from a855450 to aff3f71 Compare October 11, 2024 02:19

[data] modify parquet datasource split tasks algorithm

52b43ec

Signed-off-by: jukejian <[email protected]>

Jay-ju force-pushed the split_optimizer branch from aff3f71 to 52b43ec Compare October 12, 2024 02:57

anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and data (Ray Data-related issues) labels Oct 16, 2024
