Reading binary data type from lancedb using ray read_lance
#3317
Comments
offset overflow generally means the batch size is too large for the data. A single Arrow binary/string array can only hold 2 GB of data, so if you have large values in your binary column you either need to switch the column's data type to large_binary / large_string (which has its own disadvantages) or use a smaller batch size. I see you are limiting the read to 5 rows, but I don't think Ray is pushing that limit down into Lance. Ray is probably taking the "open a read stream and abort the read once we have 5+ rows" approach. You should be able to pass batch_size to the read instead.
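For concreteness, a minimal sketch of that workaround is below. The path, column name, and sizes are hypothetical, and the Ray keyword shown in the comment is an assumption to check against your Ray version's read_lance signature, not a confirmed API:

```python
import lance
import pyarrow as pa

# Reading directly with pylance: batch_size caps the number of rows per
# Arrow batch, which keeps a batch of large binary values under the 2 GiB
# 32-bit offset limit. (Path and sizes here are hypothetical.)
ds = lance.dataset("/path/to/my_table.lance")
for batch in ds.scanner(batch_size=16).to_batches():
    print(batch.num_rows, batch.nbytes)  # replace with real per-batch work

# With Ray Data, if your version of ray.data.read_lance forwards scanner
# options (check its signature first), the same knob would look roughly like:
#   ray.data.read_lance(uri, scanner_options={"batch_size": 16})

# Alternatively, declare the column with 64-bit offsets up front so a single
# array can exceed 2 GiB, at the cost of larger offset buffers:
schema = pa.schema([pa.field("payload", pa.large_binary())])
```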
Thanks! Yup, adding batch_size solves the problem. Let me know if you want to close the issue, or whether you think we can do something like passing a batch size by default.
Let's probably close it. I don't think there is any low-hanging fruit here, and someone can open an issue with a specific improvement in mind. I don't know if we'll ever be able to set the batch size automatically, because the ideal batch size depends on the data, and the number of rows will change from batch to batch. I hope someday we will be able to set the batch size based on a number of bytes (e.g. 32 KiB or 1 MiB batches). However, I'm not sure how much Ray will like it if some batches have fewer rows than others (some tools are OK with this and some are not).

Another thing we will need to do is end batches prematurely, regardless of how the batch size is set, when the next row would lead to an overflow. A final thing we could do is push the limit down into the read, but I believe that would be a Ray change; a sketch of the byte-based idea follows below.
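To make the byte-based idea concrete, here is a hypothetical helper (not part of Lance or Ray, just a sketch of the trade-off described above) that re-slices an Arrow batch stream to an approximate byte budget. Note that the emitted row counts vary from batch to batch, which is exactly the compatibility concern with some downstream tools:

```python
from typing import Iterable, Iterator

import pyarrow as pa


def rebatch_by_bytes(
    batches: Iterable[pa.RecordBatch], byte_budget: int = 1 << 20
) -> Iterator[pa.RecordBatch]:
    """Yield slices of each incoming batch so no slice is (roughly) larger
    than byte_budget. The bytes-per-row estimate comes from each batch's
    own buffers, so heavily skewed rows can still overshoot the budget."""
    for batch in batches:
        if batch.num_rows == 0:
            continue
        per_row = max(1, batch.nbytes // batch.num_rows)  # average row size
        rows_per_slice = max(1, byte_budget // per_row)
        for start in range(0, batch.num_rows, rows_per_slice):
            yield batch.slice(start, min(rows_per_slice, batch.num_rows - start))
```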
One more piece of low-hanging fruit here (but we already have an issue for it): #2775
Thanks for the help. Btw, I think the same thing is happening when I do …
I have a LanceDB table with ~10k rows and the following schema:
My script looks like this:
Here is the stack trace:
I can provide the DB if needed.