
Reading binary data type from lancedb using ray read_lance #3317

Closed
andrijazz opened this issue Dec 30, 2024 · 5 comments

@andrijazz

andrijazz commented Dec 30, 2024

I have a LanceDB table with ~10k rows and the following schema:

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("image", pa.binary()),
    pa.field("text", pa.string()),
])

My script looks like this:

import ray

ds = ray.data.read_lance(uri=f"{uri}/{table}.lance", columns=["id", "image"])
print(ds.take(5))

Here is the stack trace:

ray.exceptions.RayTaskError(ArrowInvalid): ray::ReadLance->SplitBlocks(200)() (pid=1474611, ip=192.168.100.30)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 95, in do_read
    yield from read_task()
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 168, in __call__
    yield from result
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/datasource/lance_datasource.py", line 128, in _read_fragments
    for batch in scanner.to_reader():
  File "pyarrow/ipc.pxi", line 671, in pyarrow.lib.RecordBatchReader.__next__
  File "pyarrow/ipc.pxi", line 705, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: External error: RuntimeError: Task was aborted
(ReadLance->SplitBlocks(200) pid=1474611) thread 'lance_background_thread' panicked at /home/runner/work/lance/lance/rust/lance-encoding/src/decoder.rs:1452:65:
(ReadLance->SplitBlocks(200) pid=1474611) called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(888), "offset overflow", ...)

I can provide the db if needed.

@westonpace
Contributor

"offset overflow" generally means the batch size is too large for the data. A single Arrow binary/string array can only hold 2GB of data, so if you have large values in your binary column you either need to switch the column's data type to large_binary / large_string (which has its own disadvantages) or use a smaller batch size.
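For context, a minimal sketch of the schema change the large_binary suggestion implies (large_binary uses 64-bit offsets, so a single array is not limited to 2GB); field names follow the original report:

import pyarrow as pa

# same schema as in the report, but with 64-bit offsets for the image column
schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("image", pa.large_binary()),
    pa.field("text", pa.string()),
])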

I see you are limiting the read to 5 rows but I don't think Ray is pushing that limit down into lance. Ray is probably doing the "open a read stream and abort the read once we have 5+ rows" approach.

You should be able to pass batch_size using scanner_options in read_lance. Do you still get this error if you change the batch size to 5?
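Roughly, and assuming scanner_options is forwarded to the Lance scanner as described, the change to the original script would look like this (uri and table as in the report; the batch size of 5 is just the value suggested above):

import ray

# a small batch_size keeps each Arrow binary array well under the 2GB offset limit
ds = ray.data.read_lance(
    uri=f"{uri}/{table}.lance",
    columns=["id", "image"],
    scanner_options={"batch_size": 5},
)
print(ds.take(5))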

@andrijazz
Author

andrijazz commented Dec 31, 2024

Thanks! Yup, adding batch_size solves the problem. Let me know if you want to close the issue, or whether you think we could do something like passing a batch size by default.

@westonpace
Contributor

westonpace commented Jan 2, 2025

Let's probably close it. I don't think there is any low-hanging fruit here, and someone can open a new issue with a specific improvement in mind.

I don't know if we'll ever be able to set the batch size automatically because the ideal batch size depends on the data and the number of rows will change from batch to batch.

I hope someday we will be able to set the batch size based on # bytes (e.g. 32KiB batches or 1MiB batches). However, I'm not sure how much Ray will like it if some batches have fewer rows than others (some tools are ok with this and some are not).

Another thing we will need to do is prematurely end batches, regardless of how the batch size is set, when the next row would lead to an overflow.

A final thing we could do is to push the limit down into the read but I believe that would be a Ray change.

@westonpace
Contributor

One more piece of low-hanging fruit here (but we already have an issue for it): #2775

@andrijazz
Author

andrijazz commented Jan 2, 2025

Thanks for the help. By the way, I think the same thing is happening when I run compact_files (ref #3330). I just found that we can pass batch_size to compact_files, so that unblocks me for that part.
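For anyone hitting the same thing during compaction, a rough sketch of what passing batch_size there might look like, assuming the lance Python API (dataset.optimize.compact_files) and the batch_size keyword mentioned above; the value is illustrative:

import lance

ds = lance.dataset(f"{uri}/{table}.lance")
# batch_size limits how many rows are decoded per batch during compaction
ds.optimize.compact_files(batch_size=1024)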
