Arrow schema and metadata tags are lost too easily #8547

jleibs · 2024-12-19T15:05:54Z

Context

Our plan for Sorbet (and tagged components) hinges on being able to reliably "tag" data with semantic information.

The working theory is that every column receives a set of tags:

path
archetype
field
component

We store these tags in the arrow metadata of the RecordBatch that we send to Rerun, and Rerun uses these tags for everything from driving batching logic to how we visualize the data in the viewer.

Problem

The problem is that it is way, way too easy to accidentally lose these tags in most of the arrow libraries.

The reason is fundamental to arrow:

Arrow Arrays have Datatypes
Arrow Tables have Fields
A field is a datatype + a name + a set of metadata (our tags)

As soon as you access a table column:

timestamps = dataset.column("log_time")
positions = dataset.column("/points:Position3D")

the metadata is gone.

You now have two bare arrays:

One of type List<TimestampNs>
One of type List<List<F32,3>>

Neither column has tags. No knowledge of the entity path. No knowledge of the components. Not only are the tags lost, there isn't even a place to store that metadata on an arrow array.

If you thought you could simply wrap these arrays in a new table and send it back:

rr.send_dataframe(pa.Table.from_arrays([timestamps, positions], names=["log_time", "/points:Position3D"]))

it will not do what you want.

You need to manually re-apply the tags at the time that you create the table or else nothing works.


jleibs · 2024-12-20T13:37:50Z

One thought: the proposal to expose archetypes may help a little: #7436

Struct datatypes contain internal Field specifications, and it appears this can propagate data all the way into datafusion and out again:

import pyarrow as pa
from datafusion import SessionContext
ctx = SessionContext()

struct_array = pa.StructArray.from_arrays([pa.array([1,2,3])], fields=[pa.field('key', pa.int64(), metadata={'rerun': 'data'})])
table = pa.Table.from_arrays([struct_array], names=["my_col"])

df = ctx.from_arrow(table)
arrow = df.collect()

print(arrow[0].columns[0].type.field(0).metadata)

# {b'rerun': b'data'}

However, this still suffers from the same fundamental problem: if you ever pull the child array out of the struct, that array loses its metadata in the same way.

jleibs added 🏹 arrow concerning arrow 💬 discussion 🔩 data model labels Dec 19, 2024

jleibs changed the title ~~Arrow schema is lost too easily~~ Arrow schema and metadata tags are lost too easily Dec 19, 2024
