Upgrade to duckdb 1.1 & remove required pyarrow dependency? #54
Comments
Let's do it! We can make a v0.2 release after.
I'm thinking for now we make a copy like before if you give us a source. Then we can think about how to expose an API for users to explicitly opt out of copying. What do you think?
Is this something we want to do in the Jupyter widget in mosaic as well?
I think that would make sense, I wonder if Either way, it might be worth updating mosaic's docstrings to note that pandas and Arrow sources work (just need to look into pycapsule). https://github.com/uwdata/mosaic/blob/main/packages/widget/mosaic_widget/__init__.py#L31-L46
Let me know if you find out anything on this
It seems like
import duckdb
import polars as pl

df = pl.read_csv("https://raw.githubusercontent.com/uwdata/mosaic/main/data/penguins.csv")

class Foo:
    def __arrow_c_stream__(self):
        return df.__arrow_c_stream__()  # just to make sure no special handling of polars

duckdb.register("penguins", Foo())

In [2]: duckdb.sql("select * from penguins")
Out[2]:
┌─────────┬───────────┬─────────────┬────────────┬────────────────┬───────────┬─────────┐
│ species │  island   │ bill_length │ bill_depth │ flipper_length │ body_mass │   sex   │
│ varchar │  varchar  │   double    │   double   │     int64      │   int64   │ varchar │
├─────────┼───────────┼─────────────┼────────────┼────────────────┼───────────┼─────────┤
│ Adelie  │ Torgersen │        39.1 │       18.7 │            181 │      3750 │ MALE    │
│ Adelie  │ Torgersen │        39.5 │       17.4 │            186 │      3800 │ FEMALE  │
│ Adelie  │ Torgersen │        40.3 │       18.0 │            195 │      3250 │ FEMALE  │
│ Adelie  │ Torgersen │        36.7 │       19.3 │            193 │      3450 │ FEMALE  │
│ Adelie  │ Torgersen │        39.3 │       20.6 │            190 │      3650 │ MALE    │
│ Adelie  │ Torgersen │        38.9 │       17.8 │            181 │      3625 │ FEMALE  │
│ Adelie  │ Torgersen │        39.2 │       19.6 │            195 │      4675 │ MALE    │
│ Adelie  │ Torgersen │        34.1 │       18.1 │            193 │      3475 │ NULL    │
│ Adelie  │ Torgersen │        42.0 │       20.2 │            190 │      4250 │ NULL    │
│ Adelie  │ Torgersen │        37.8 │       17.1 │            186 │      3300 │ NULL    │
│    ·    │     ·     │      ·      │     ·      │       ·        │     ·     │    ·    │
│    ·    │     ·     │      ·      │     ·      │       ·        │     ·     │    ·    │
│    ·    │     ·     │      ·      │     ·      │       ·        │     ·     │    ·    │
│ Gentoo  │ Biscoe    │        51.5 │       16.3 │            230 │      5500 │ MALE    │
│ Gentoo  │ Biscoe    │        46.2 │       14.1 │            217 │      4375 │ FEMALE  │
│ Gentoo  │ Biscoe    │        55.1 │       16.0 │            230 │      5850 │ MALE    │
│ Gentoo  │ Biscoe    │        44.5 │       15.7 │            217 │      4875 │ NULL    │
│ Gentoo  │ Biscoe    │        48.8 │       16.2 │            222 │      6000 │ MALE    │
│ Gentoo  │ Biscoe    │        47.2 │       13.7 │            214 │      4925 │ FEMALE  │
│ Gentoo  │ Biscoe    │        46.8 │       14.3 │            215 │      4850 │ FEMALE  │
│ Gentoo  │ Biscoe    │        50.4 │       15.7 │            222 │      5750 │ MALE    │
│ Gentoo  │ Biscoe    │        45.2 │       14.8 │            212 │      5200 │ FEMALE  │
│ Gentoo  │ Biscoe    │        49.9 │       16.1 │            213 │      5400 │ MALE    │
├─────────┴───────────┴─────────────┴────────────┴────────────────┴───────────┴─────────┤
│ 342 rows (20 shown)                                                         7 columns │
└───────────────────────────────────────────────────────────────────────────────────────┘

In [3]: duckdb.sql("select * from penguins")
---------------------------------------------------------------------------
InternalException Traceback (most recent call last)
Cell In[3], line 1
----> 1 duckdb.sql("select * from penguins")
InternalException: INTERNAL Error: ArrowArrayStream was released by another thread/library
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors
Yeah; it's a hard question when to copy vs view the source. duckdb/duckdb#13827
Sweet. Could you send a pull request since you know the right wording?
Now that DuckDB supports the PyCapsule Interface, we can remove pyarrow as a required dependency and pass data directly to upstream DuckDB. But we still run into the common issue that we don't know whether the input is a re-readable view or a one-shot stream, so we don't know whether to copy the input into a DuckDB table.
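For illustration, a small sketch of how a library could detect PyCapsule support by duck typing, without importing pyarrow. The three dunder names come from the Arrow PyCapsule Interface; the function name `arrow_capabilities` and the dummy classes are hypothetical. Note that the check says nothing about whether the data can be read more than once, which is exactly the ambiguity described above.

```python
ARROW_DUNDERS = ("__arrow_c_stream__", "__arrow_c_array__", "__arrow_c_schema__")


def arrow_capabilities(obj):
    """Return the set of Arrow PyCapsule dunder methods that `obj` exposes."""
    return {name for name in ARROW_DUNDERS if callable(getattr(obj, name, None))}


class FakeStream:
    """Dummy producer used only to exercise the check."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class NotArrow:
    pass


stream_caps = arrow_capabilities(FakeStream())
plain_caps = arrow_capabilities(NotArrow())
```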