Upgrade to duckdb 1.1 & remove required pyarrow dependency? #54

Open
kylebarron opened this issue Sep 15, 2024 · 7 comments
Comments

@kylebarron
Collaborator

Now that DuckDB supports the PyCapsule Interface, we have the ability to remove pyarrow as a required dependency and pass data directly to upstream duckdb. But we still run into the common issue that we don't know if the input is a re-readable view or a one-shot stream, so we don't know whether to make a copy of the input as a duckdb table.
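
To make the ambiguity concrete, here is a minimal sketch of a duck-typing check for the Arrow PyCapsule stream method. The helper `supports_arrow_stream` and both source classes are hypothetical names made up for illustration, and the classes return placeholder objects rather than real PyCapsules. The point is only that the presence of `__arrow_c_stream__` does not say whether a source can be read more than once.

```python
from typing import Any


def supports_arrow_stream(obj: Any) -> bool:
    """Return True if obj exposes a callable __arrow_c_stream__ method."""
    return callable(getattr(obj, "__arrow_c_stream__", None))


class ReReadableSource:
    """Hypothetical table-like source: can export a fresh stream on every call."""

    def __arrow_c_stream__(self, requested_schema=None):
        # Placeholder only: a real implementation returns a PyCapsule.
        return object()


class OneShotSource:
    """Hypothetical stream-like source: its data can be exported only once."""

    def __init__(self):
        self._consumed = False

    def __arrow_c_stream__(self, requested_schema=None):
        if self._consumed:
            raise RuntimeError("stream already consumed")
        self._consumed = True
        # Placeholder only: a real implementation returns a PyCapsule.
        return object()


# Both sources pass the same check, so the check cannot tell them apart.
print(supports_arrow_stream(ReReadableSource()))
print(supports_arrow_stream(OneShotSource()))
print(supports_arrow_stream([1, 2, 3]))
```

Both source classes pass the check, which is why a consumer cannot decide between copying and viewing from the protocol alone.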

@manzt
Owner

manzt commented Sep 16, 2024

Let's do it! We can make a v0.2 release after.

> we don't know whether to make a copy of the input as a duckdb table

I'm thinking for now we make a copy like before if you give us a source. Then, we can think about how we can expose an API for users to explicitly opt out of copying. What do you think?
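
One way the copy-by-default idea with an explicit opt-out could look, as a rough sketch only. The function name `register_source`, the `copy` keyword, and the staging-view naming are all assumptions for illustration, not an API that exists in this project. `con` is assumed to be a DuckDB connection; only its `register`, `execute`, and `unregister` methods are used.

```python
def register_source(con, name, source, *, copy=True):
    """Expose `source` to DuckDB under `name` (hypothetical helper).

    copy=True  -> materialize the data once into a DuckDB table, so it can be
                  queried repeatedly even if `source` is a one-shot stream.
    copy=False -> register the object directly as a view; the caller takes
                  responsibility for the source being safe to re-read.
    """
    if not copy:
        con.register(name, source)
        return

    # Assumption: `name` is a trusted identifier, since it is interpolated
    # into SQL below.
    staging = f"_{name}_staging"
    con.register(staging, source)
    try:
        con.execute(
            f'CREATE OR REPLACE TABLE "{name}" AS SELECT * FROM "{staging}"'
        )
    finally:
        # Drop the temporary view so only the materialized table remains.
        con.unregister(staging)
```

With something like this, `register_source(con, "penguins", df)` would copy by default, and `register_source(con, "penguins", df, copy=False)` would be the explicit opt-out.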

@domoritz
Collaborator

Is this something we want to do in the Jupyter widget in mosaic as well?

@manzt
Owner

manzt commented Sep 16, 2024

> Is this something we want to do in the Jupyter widget in mosaic as well?

I think that would make sense, I wonder if register uses arrow pycapsule now in v1.1. Need to look at the implementation.

Either way, it might be worth updating mosaic's docstrings to note that pandas and Arrow sources work (just need to look into pycapsule).

https://github.com/uwdata/mosaic/blob/main/packages/widget/mosaic_widget/__init__.py#L31-L46

@kylebarron
Collaborator Author

> I think that would make sense, I wonder if register uses arrow pycapsule now in v1.1. Need to look at the implementation.

Let me know if you find out anything on this

@manzt
Owner

manzt commented Sep 16, 2024

It seems like register does indeed use the pycapsule interface (awesome!), but it does not copy the source (so there is a runtime error if you query the same stream twice)

uvx --with duckdb==1.1.0 --with polars ipython
import duckdb
import polars as pl

df = pl.read_csv("https://raw.githubusercontent.com/uwdata/mosaic/main/data/penguins.csv")

class Foo:
    def __arrow_c_stream__(self):
        return df.__arrow_c_stream__() # just to make sure no special handling of polars
        
duckdb.register("penguins", Foo())
In [2]: duckdb.sql("select * from penguins")
Out[2]:
┌─────────┬───────────┬─────────────┬────────────┬────────────────┬───────────┬─────────┐
│ species │  island   │ bill_length │ bill_depth │ flipper_length │ body_mass │   sex   │
│ varchar │  varchar  │   double    │   double   │     int64      │   int64   │ varchar │
├─────────┼───────────┼─────────────┼────────────┼────────────────┼───────────┼─────────┤
│ Adelie  │ Torgersen │        39.1 │       18.7 │            181 │      3750 │ MALE    │
│ Adelie  │ Torgersen │        39.5 │       17.4 │            186 │      3800 │ FEMALE  │
│ Adelie  │ Torgersen │        40.3 │       18.0 │            195 │      3250 │ FEMALE  │
│ Adelie  │ Torgersen │        36.7 │       19.3 │            193 │      3450 │ FEMALE  │
│ Adelie  │ Torgersen │        39.3 │       20.6 │            190 │      3650 │ MALE    │
│ Adelie  │ Torgersen │        38.9 │       17.8 │            181 │      3625 │ FEMALE  │
│ Adelie  │ Torgersen │        39.2 │       19.6 │            195 │      4675 │ MALE    │
│ Adelie  │ Torgersen │        34.1 │       18.1 │            193 │      3475 │ NULL    │
│ Adelie  │ Torgersen │        42.0 │       20.2 │            190 │      4250 │ NULL    │
│ Adelie  │ Torgersen │        37.8 │       17.1 │            186 │      3300 │ NULL    │
│   ·     │   ·       │          ·  │         ·  │             ·  │        ·  │  ·      │
│   ·     │   ·       │          ·  │         ·  │             ·  │        ·  │  ·      │
│   ·     │   ·       │          ·  │         ·  │             ·  │        ·  │  ·      │
│ Gentoo  │ Biscoe    │        51.5 │       16.3 │            230 │      5500 │ MALE    │
│ Gentoo  │ Biscoe    │        46.2 │       14.1 │            217 │      4375 │ FEMALE  │
│ Gentoo  │ Biscoe    │        55.1 │       16.0 │            230 │      5850 │ MALE    │
│ Gentoo  │ Biscoe    │        44.5 │       15.7 │            217 │      4875 │ NULL    │
│ Gentoo  │ Biscoe    │        48.8 │       16.2 │            222 │      6000 │ MALE    │
│ Gentoo  │ Biscoe    │        47.2 │       13.7 │            214 │      4925 │ FEMALE  │
│ Gentoo  │ Biscoe    │        46.8 │       14.3 │            215 │      4850 │ FEMALE  │
│ Gentoo  │ Biscoe    │        50.4 │       15.7 │            222 │      5750 │ MALE    │
│ Gentoo  │ Biscoe    │        45.2 │       14.8 │            212 │      5200 │ FEMALE  │
│ Gentoo  │ Biscoe    │        49.9 │       16.1 │            213 │      5400 │ MALE    │
├─────────┴───────────┴─────────────┴────────────┴────────────────┴───────────┴─────────┤
│ 342 rows (20 shown)                                                         7 columns │
└───────────────────────────────────────────────────────────────────────────────────────┘
In [3]: duckdb.sql("select * from penguins")
---------------------------------------------------------------------------
InternalException                         Traceback (most recent call last)
Cell In[3], line 1
----> 1 duckdb.sql("select * from penguins")

InternalException: INTERNAL Error: ArrowArrayStream was released by another thread/library
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors
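
As a loose analogy for the behavior above (plain Python iterators, not DuckDB's actual mechanism), a one-shot stream is used up by its first reader, while a materialized copy can be read any number of times. The sample rows are made up for illustration. Note that an exhausted Python iterator silently yields nothing on the second pass, whereas DuckDB raised an error in the session above.

```python
# Made-up sample rows, for illustration only.
rows = [("Adelie", 39.1), ("Gentoo", 51.5)]

# A "stream": iterating consumes it, so a second pass sees no data.
stream = iter(rows)
first_pass = list(stream)
second_pass = list(stream)
print(len(first_pass), len(second_pass))

# A "copy": materialized once, then safe to read repeatedly.
table = list(rows)
read_one = list(table)
read_two = list(table)
print(len(read_one), len(read_two))
```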

@kylebarron
Collaborator Author

> but it does not copy the source (so there is a runtime error if you query the same stream twice)

Yeah; it's a hard question when to copy vs view the source. duckdb/duckdb#13827

@domoritz
Collaborator

> Either way, it might be worth updating mosaic's docstrings to note that pandas and Arrow sources work (just need to look into pycapsule).

Sweet. Could you send a pull request since you know the right wording?
