[data] optimize parquet datasource split tasks algorithm #47954
base: master
9db3600 to 2d6d32e
Thanks for your contribution.
013acee to a855450
a855450 to aff3f71
Sorry, the description was confusing. I have provided a benchmark here.
Signed-off-by: jukejian <[email protected]>
aff3f71 to 52b43ec
Do you know why np.array_split is slow?
It was only discovered during single-machine testing.
Then maybe let's not introduce this change and just remove _split_list.
Why are these changes needed?
Splitting a list with NumPy's np.split is slower than splitting it without NumPy, and the performance difference can be significant.
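The PR's own benchmark is not reproduced here. As a rough illustration of the kind of comparison being described, here is a minimal sketch that times np.array_split against a plain-Python splitter. The helper names (split_list_pure, split_with_numpy), the list size, and the chunk count are all assumptions made for this example. They are not Ray's actual _split_list or the PR's benchmark code.

```python
import timeit

import numpy as np


def split_list_pure(items, n):
    """Hypothetical helper: split `items` into `n` contiguous chunks.

    The first len(items) % n chunks get one extra element, which matches
    the chunk sizes np.array_split produces.
    """
    k, m = divmod(len(items), n)
    return [
        items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
        for i in range(n)
    ]


def split_with_numpy(items, n):
    """Split via np.array_split, converting each chunk back to a list."""
    return [chunk.tolist() for chunk in np.array_split(items, n)]


def compare(size=10_000, n=200, repeat=20):
    """Return (pure_seconds, numpy_seconds) for `repeat` runs of each."""
    items = list(range(size))
    pure = timeit.timeit(lambda: split_list_pure(items, n), number=repeat)
    with_np = timeit.timeit(lambda: split_with_numpy(items, n), number=repeat)
    return pure, with_np


if __name__ == "__main__":
    pure, with_np = compare()
    print(f"pure Python: {pure:.4f}s, np.array_split: {with_np:.4f}s")
```

Both functions return the same chunks for a list of integers, so any timing difference comes from the splitting approach itself (including the list-to-array conversion on the NumPy path). Actual numbers depend on the machine, list size, and element type, so none are quoted here.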
result is:
Related issue number
Checks
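I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.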