Added an example notebook on dask-sql. #171
Conversation
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
I am a bit confused. The installation works with
Woo, thanks for adding this @nils-braun! Re: installing
I'm glad to see this. Some comments/questions:
Very reasonable comments and questions @mrocklin and @jrbourbeau - thanks for that.
So according to the tests, the package solving was successful.
Right, I agree that those errors are probably unrelated. Hrm, 400MB for this seems large. If possible I'd prefer to keep the installation in the notebook itself and outside of environment.yml (large images slow down container start times).
Unfortunately, I am currently blocked by #173 - but apart from that, I think it should work now.
I have updated the PR for the newest dask-sql version and also moved back to mamba for the installation.
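For reference, an in-notebook install cell could look roughly like this (just a sketch; it assumes mamba is available in the Binder image, otherwise pip works as well):

# hypothetical install cell at the top of the notebook
!mamba install -y -c conda-forge dask-sql
# alternatively, with pip:
# %pip install dask-sql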
Sorry this has sat for so long. Given the age of the PR, could I ask you (or perhaps @charlesbluca, @galipremsagar or @rajagurunath) to give it a quick review to ensure things are still current?
Overall looks pretty good - not too many changes to core dask-sql stuff other than behavior surrounding persisting to memory.
We might also want to mention the active development to get dask-sql running on GPU somewhere near the end, though I imagine this binder doesn't have GPU support to showcase that, right?
EDIT:
Should also note that the change in dask-sql's persist behavior is pending the next release (dask-contrib/dask-sql#257). Given the amount of time that has passed since the last release, I think we should probably hold off on merging this notebook in until after dask-sql's next release.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"c.create_table(\"timeseries\", df.persist())" |
"c.create_table(\"timeseries\", df.persist())" | |
"c.create_table(\"timeseries\", persist=True)" |
This can be done using a kwarg of create_table.
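For context, a minimal sketch of what the full cell could look like with that keyword argument (assuming the frame comes from dask.datasets.timeseries(), which isn't shown in this diff):

import dask.datasets
from dask_sql import Context

c = Context()
df = dask.datasets.timeseries()  # lazy Dask DataFrame of random timeseries data
# let dask-sql handle persisting via the keyword argument instead of calling df.persist()
c.create_table("timeseries", df, persist=True)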
"\n", | ||
"Please note that we have persisted the data before passing it to dask-sql.\n", | ||
"This will tell dask that we want to prefetch the data into memory.\n", | ||
"Doing so will speed up the queries a lot, so you probably always want to do this.\n", |
Given some of the discussion on dask-contrib/dask-sql#218 and the fact that persisting is no longer dask-sql's default behavior (dask-contrib/dask-sql#245), it might be worth discussing here the trade-offs of persisting to memory beforehand (speed-up vs. potential OOM errors). cc @VibhuJawa in case you have any thoughts here
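Something like this could make the trade-off concrete in the notebook text (just a sketch):

# data fits comfortably into cluster memory: persist for much faster repeated queries
c.create_table("timeseries", df, persist=True)

# data is larger than memory: keep it lazy to avoid out-of-memory errors,
# at the cost of re-reading the data for every query
c.create_table("timeseries", df, persist=False)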
"source": [ | ||
"import pandas as pd\n", | ||
"df = pd.DataFrame({\"column\": [1, 2, 3]})\n", | ||
"c.create_table(\"pandas\", df)" |
"c.create_table(\"pandas\", df)" | |
"c.create_table(\"pandas\", df, persist=True)" |
As of the next release of dask-sql, persisting will no longer be the default behavior, so we would need to include this to make sure the data still gets persisted. Given the discussion above, we may opt not to persist here.
"metadata": {}, | ||
"source": [ | ||
"In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n", | ||
"You can leverage dask's large set of understood input formats and sources to load the data.\n", |
Is there any publicly available database we could register to illustrate this concept?
If you are looking for S3 datasets, can we make use of the New York taxi dataset used in the Coiled docs?
import dask.dataframe as dd

# read the public NYC yellow taxi CSVs anonymously from S3
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "string",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

# register the (already persisted) frame with dask-sql
c.create_table("trip_data", df, persist=True)
And try to add some groupby/aggregation examples? What do you think?
# something like
c.sql("select passenger_count, avg(tip_amount) from trip_data group by passenger_count")
Yeah that seems like a good idea! Do you know if it's possible to read this in directly with a query, with something like:
c.sql(
    f"""
    CREATE TABLE
        trip_data
    WITH (
        location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
        format = 'csv',
        ...
    )
    """
)
Passing the kwargs into the WITH (...)?
Haven't tried this yet, but can we try something like this? I expect it should parse and pass the kwargs on to the input plugins (inspired by this example here).
What do you think?
CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['pep_pickup_datetime', 'tpep_dropoff_datetime'],
    type = MAP ['payment_type', 'UInt8',
                'VendorID', 'UInt8',
                'passenger_count', 'UInt8',
                'RatecodeID', 'UInt8',
                'store_and_fwd_flag', 'string',
                'PULocationID', 'UInt16',
                'DOLocationID', 'UInt16']
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB',
)
I have tried this SQL after fixing the bugs and it seems to be working for me; let me know if this query works for you as well.
Refactored query:
CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    dtype = MAP ['payment_type', 'UInt8',
                 'VendorID', 'UInt8',
                 'passenger_count', 'UInt8',
                 'RatecodeID', 'UInt8',
                 'store_and_fwd_flag', 'string',
                 'PULocationID', 'UInt16',
                 'DOLocationID', 'UInt16'],
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB'
)
Yeah that works! The table is loaded in, though it looks like groupby operations may be a little complex for a simple demo:
In [7]: c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
Out[7]:
Dask DataFrame Structure:
               passenger_count  SUM("trip_data"."tip_amount")
npartitions=1
                          Int8                        float64
                           ...                            ...
Dask Name: getitem, 11857 tasks
Maybe just showing the futures is sufficient to show that it works. Thanks for the help @rajagurunath 😄
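If we want to show that the computation actually runs without pulling the full result back, something like this might do (a sketch; it assumes a distributed Client is already running in the notebook):

from dask.distributed import progress

result = c.sql(
    "select passenger_count, sum(tip_amount) from trip_data group by passenger_count"
)
result = result.persist()  # start the computation on the cluster in the background
progress(result)           # display a progress bar instead of the full result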
Closing in favour of #209
Fixes #170
This is a first draft for an example notebook showcasing dask-sql.
It is not very long and only shows the main features (e.g. no custom functions, no reusing results as new tables, etc.) - but for a short overview I think it is fine.
If you would love to see more content or anything different, I am happy to improve!