Arrow schema and metadata tags are lost too easily #8547

Open
jleibs opened this issue Dec 19, 2024 · 1 comment

jleibs commented Dec 19, 2024

Context

Our plan for Sorbet (and tagged components) hinges on being able to reliably "tag" data with semantic information.

The working theory is that every column receives a set of tags:

  • path
  • archetype
  • field
  • component

We store these tags in the arrow metadata of the RecordBatch that we send to Rerun, and Rerun uses these tags for everything from driving batching logic to how we visualize the data in the viewer.

Problem

The problem is that it is far too easy to accidentally lose these tags in most of the arrow libraries.

The reason is fundamental to arrow:

  • Arrow Arrays have Datatypes
  • Arrow Tables have Fields
  • A field is a datatype + a name + a set of metadata (our tags)

As soon as you access a table column:

timestamps = dataset.column("log_time")
positions = dataset.column("/points:Position3D")

the metadata is gone.

You now have two bare arrays:

  • One of type List<TimestampNs>
  • One of type List<List<F32,3>>

Neither column has tags. No knowledge of the entity path. No knowledge of the components. Not only are the tags lost, but there isn't even a place to store that metadata on an arrow array.

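If you thought you could just send this back to arrow, something like

rr.send_dataframe(pa.Table.from_arrays([timestamps, positions], names=["log_time", "/points:Position3D"]))

will not do what you want: the column names survive, but the tags do not.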

You need to manually re-apply the tags at the time that you create the table or else nothing works.

@jleibs jleibs changed the title Arrow schema is lost too easily Arrow schema and metadata tags are lost too easily Dec 19, 2024

jleibs commented Dec 20, 2024

One thought: the proposal to expose archetypes may help a little: #7436

Struct datatypes contain internal Field specifications, and it appears this can propagate data all the way into datafusion and out again:

import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Attach the metadata to the child field inside the struct's datatype.
key_field = pa.field('key', pa.int64(), metadata={'rerun': 'data'})
struct_array = pa.StructArray.from_arrays([pa.array([1, 2, 3])], fields=[key_field])
table = pa.Table.from_arrays([struct_array], names=["my_col"])

# Round-trip through datafusion; collect() returns a list of record batches.
df = ctx.from_arrow(table)
arrow = df.collect()

print(arrow[0].columns[0].type.field(0).metadata)

# {b'rerun': b'data'}

However, this still suffers from the same fundamental problem: if you ever pull the child array out of the struct, that array loses its metadata in the same way.
