
Reading binary data type from lancedb using ray read_lance #3317

Closed
andrijazz opened this issue Dec 30, 2024 · 5 comments

@andrijazz

andrijazz commented Dec 30, 2024

I have a LanceDB table with ~10k rows and the following schema:

import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("image", pa.binary()),
    pa.field("text", pa.string()),
])

My script looks like this:

import ray

ds = ray.data.read_lance(uri=f"{uri}/{table}.lance", columns=["id", "image"])
print(ds.take(5))

Here is the stack trace:

ray.exceptions.RayTaskError(ArrowInvalid): ray::ReadLance->SplitBlocks(200)() (pid=1474611, ip=192.168.100.30)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 95, in do_read
    yield from read_task()
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 168, in __call__
    yield from result
  File "/home/andrijazz/tools/miniconda3/envs/pixels-datapipelines/lib/python3.10/site-packages/ray/data/_internal/datasource/lance_datasource.py", line 128, in _read_fragments
    for batch in scanner.to_reader():
  File "pyarrow/ipc.pxi", line 671, in pyarrow.lib.RecordBatchReader.__next__
  File "pyarrow/ipc.pxi", line 705, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: External error: RuntimeError: Task was aborted
(ReadLance->SplitBlocks(200) pid=1474611) thread 'lance_background_thread' panicked at /home/runner/work/lance/lance/rust/lance-encoding/src/decoder.rs:1452:65:
(ReadLance->SplitBlocks(200) pid=1474611) called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(888), "offset overflow", ...)

I can provide the db if needed.

@westonpace
Contributor

"offset overflow" generally means the batch size is too large for the data. A single Arrow binary/string array can only hold 2GB of data, so if you have large values in your binary column you either need to switch the column's data type to large_binary / large_string (which has its own disadvantages) or use a smaller batch size.
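For context, a minimal sketch of the schema change the large_binary suggestion implies (large_binary uses 64-bit offsets, so a single array is not limited to 2GB); field names follow the original report:

import pyarrow as pa

# same schema as in the report, but with 64-bit offsets for the image column
schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("image", pa.large_binary()),
    pa.field("text", pa.string()),
])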

I see you are limiting the read to 5 rows but I don't think Ray is pushing that limit down into lance. Ray is probably doing the "open a read stream and abort the read once we have 5+ rows" approach.

You should be able to pass batch_size using scanner_options in read_lance. Do you still get this error if you change the batch size to 5?
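Roughly, and assuming scanner_options is forwarded to the Lance scanner as described, the change to the original script would look like this (uri and table as in the report; the batch size of 5 is just the value suggested above):

import ray

# a small batch_size keeps each Arrow binary array well under the 2GB offset limit
ds = ray.data.read_lance(
    uri=f"{uri}/{table}.lance",
    columns=["id", "image"],
    scanner_options={"batch_size": 5},
)
print(ds.take(5))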

@andrijazz
Author

andrijazz commented Dec 31, 2024

Thanks! Yup, adding batch_size solves the problem. Let me know if you want to close the issue, or whether you think we could do something like passing a batch size by default.

@westonpace
Contributor

westonpace commented Jan 2, 2025

Let's probably close it. I don't think there is any low-hanging fruit here, and someone can open a new issue with a specific improvement in mind.

I don't know if we'll ever be able to set the batch size automatically because the ideal batch size depends on the data and the number of rows will change from batch to batch.

I hope someday we will be able to set the batch size based on # bytes (e.g. 32KiB batches or 1MiB batches). However, I'm not sure how much Ray will like it if some batches have fewer rows than others (some tools are ok with this and some are not).

Another thing we will need to do is prematurely end batches, regardless of how the batch size is set, when the next row would lead to an overflow.

A final thing we could do is to push the limit down into the read but I believe that would be a Ray change.

@westonpace
Contributor

One more piece of low-hanging fruit here (but we already have an issue for it): #2775

@andrijazz
Author

andrijazz commented Jan 2, 2025

Thanks for the help. By the way, I think the same thing is happening when I run compact_files (ref #3330). I just found that we can pass batch_size to compact_files, so that unblocks me for that part.
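For anyone hitting the same thing during compaction, a rough sketch of what passing batch_size there might look like, assuming the lance Python API (dataset.optimize.compact_files) and the batch_size keyword mentioned above; the value is illustrative:

import lance

ds = lance.dataset(f"{uri}/{table}.lance")
# batch_size limits how many rows are decoded per batch during compaction
ds.optimize.compact_files(batch_size=1024)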
