
Added an example notebook on dask-sql. #171

Closed
wants to merge 7 commits

Conversation

nils-braun
Contributor

Fixes #170
This is a first draft for an example notebook showcasing dask-sql.
It is not very long and only shows the main features (e.g. no custom functions, no reusing results as new tables, etc.) - but for a short overview I think it is fine.
If you would like to see more content or anything different, I am happy to improve it!

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@nils-braun
Contributor Author

I am a bit confused. The installation works with mamba, but not with conda - which makes no sense... With conda it says that "nodejs" and "numpy" are missing. Do you think it is fine to use mamba here (it is also much faster) - even though it might be less well known?

@jrbourbeau
Member

Woo, thanks for adding this @nils-braun!

Re: installing dask-sql, could you try adding it to dask-examples/binder/environment.yml to see if we can successfully solve the environment with that added dependency? That might be the path of least resistance.
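
For reference, that would just be one extra line in the dependencies list - a sketch (the surrounding entries are assumed; the real file has more):

# binder/environment.yml (excerpt; entries other than dask-sql are hypothetical)
dependencies:
  - dask
  - distributed
  - dask-sql  # the new dependency - does the solve still succeed with this added?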

@mrocklin
Member

I'm glad to see this. Some comments/questions:

  1. Should we rename this to sql.ipynb?
  2. Should we put it in the root folder rather than in applications?
  3. We should add it to the index.rst file.
  4. How much does conda installing this increase the size of the Docker image? How large are the dependencies? (I imagine Calcite might be large.)

@nils-braun
Contributor Author

Very reasonable comments and questions @mrocklin and @jrbourbeau - thanks for that.

  1. I have done so.
  2. I am undecided. dask-sql is not part of the main dask repo, but it could also be a main entry point for many users. I have done so in the newest changes.
  3. Good point, I have done so.
  4. Before installing dask-sql, the conda env (/srv/conda/envs/notebook) has a size of 2.8 GB. Afterwards, it is 3.2 GB - so around 400 MB. I guess the largest part of that is the JVM.
    I have tested adding it to the environment file; let's see what the tests say.

@nils-braun
Contributor Author

So according to the tests, the package solving was successful.
However, now there is an error popping up in the "bag.ipynb" notebook - which (at least to me) seems to be unrelated and more a problem with pandas (although I could not reproduce the same error on my laptop).

@mrocklin
Member

Right, I agree that those errors are probably unrelated.

Hrm, 400 MB for this seems large. If possible I'd prefer to keep the installation in the notebook itself and outside of environment.yml (large images slow down container start times).
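
Concretely, that could just be a cell at the top of the notebook - a sketch, assuming mamba is available in the Binder image and conda-forge is the right channel:

# hypothetical first notebook cell: install dask-sql on the fly
# instead of baking it into the Docker image
!mamba install -y -c conda-forge dask-sql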

@nils-braun
Contributor Author

Unfortunately, I am currently blocked by #173 - but apart from that, I think it should work now.

@nils-braun
Contributor Author

I have updated the PR for the newest dask-sql version and also moved back to mamba for the installation.
I tested the notebook on Binder; let's see if the unit tests pass this time.

Base automatically changed from master to main January 27, 2021 16:07
@jacobtomlinson
Member

Sorry this has sat for so long. Given the age of the PR, could I ask you (or perhaps @charlesbluca, @galipremsagar or @rajagurunath) to give it a quick review to ensure things are still current?

@charlesbluca
Member

Overall looks pretty good - not too many changes to core dask-sql stuff other than behavior surrounding persisting to memory.

We might also want to mention the active development to get dask-sql running on GPU somewhere near the end, though I imagine this binder doesn't have GPU support to showcase that, right?

EDIT:

Should also note that the change in dask-sql's persist behavior is pending the next release (dask-contrib/dask-sql#257). Given the amount of time that has passed since the last release, I think we should probably hold off on merging this notebook in until after dask-sql's next release.

"metadata": {},
"outputs": [],
"source": [
"c.create_table(\"timeseries\", df.persist())"
Member

Suggested change
"c.create_table(\"timeseries\", df.persist())"
"c.create_table(\"timeseries\", persist=True)"

This can be done using the persist kwarg of create_table.

"\n",
"Please note that we have persisted the data before passing it to dask-sql.\n",
"This will tell dask that we want to prefetch the data into memory.\n",
"Doing so will speed up the queries a lot, so you probably always want to do this.\n",
Member

Given some of the discussion on dask-contrib/dask-sql#218 and the fact that persisting is no longer dask-sql's default behavior (dask-contrib/dask-sql#245), it might be worth discussing here the trade-offs of persisting to memory beforehand (speed-up vs. potential OOM errors). cc @VibhuJawa in case you have any thoughts here
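
For illustration, the two options would look roughly like this - a sketch, assuming the post-#245 API where persist defaults to False:

# assumes the upcoming dask-sql release, where persist=False is the default
import dask.datasets
from dask_sql import Context

c = Context()
df = dask.datasets.timeseries()

# Option 1 - lazy (new default): low memory footprint,
# but every query re-reads/recomputes the input
c.create_table("timeseries", df)

# Option 2 - persisted: repeated queries are much faster,
# but this can OOM if the data does not fit into memory
c.create_table("timeseries", df, persist=True)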

"source": [
"import pandas as pd\n",
"df = pd.DataFrame({\"column\": [1, 2, 3]})\n",
"c.create_table(\"pandas\", df)"
Member

Suggested change
"c.create_table(\"pandas\", df)"
"c.create_table(\"pandas\", df, persist=True)"

As of the next release of dask-sql, persisting will no longer be the default behavior, so we would need to include this kwarg to make sure the table is persisted. Given the discussion above, we may opt not to persist here.

"metadata": {},
"source": [
"In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n",
"You can leverage dask's large set of understood input formats and sources to load the data.\n",
Member

Is there any publicly available database we could register to illustrate this concept?


Hi @charlesbluca

If you are looking for S3 datasets, can we make use of the New York taxi dataset used in the Coiled docs?

import dask.dataframe as dd

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "string",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

c.create_table("trip_data", persist=True)
 

And try to add some groupby/aggregation examples? What do you think?

# something like
c.sql("select passenger_count, avg(tip_amount) from trip_data group by passenger_count")

Member

Yeah that seems like a good idea! Do you know if it's possible to read this in directly with a query, with something like:

    c.sql(
        f"""
        CREATE TABLE
            trip_data
        WITH (
            location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
            format = 'csv',
            ...
        )
    """
    )

Passing the kwargs into the WITH (...)?


Haven't tried this yet, but can we try something like this? I expect it should parse and pass the kwargs to the input plugins (inspired by this example here).

What do you think?

CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['pep_pickup_datetime', 'tpep_dropoff_datetime'],
    type = MAP ['payment_type', 'UInt8',
                'VendorID', 'UInt8',
                'passenger_count', 'UInt8',
                'RatecodeID', 'UInt8',
                'store_and_fwd_flag', 'string',
                'PULocationID', 'UInt16',
                'DOLocationID', 'UInt16']
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB',
)



I have tried this SQL after fixing the bugs, and it seems to be working for me; let me know if this query works for you as well.

Refactored query:

CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    dtype = MAP ['payment_type', 'UInt8',
                 'VendorID', 'UInt8',
                 'passenger_count', 'UInt8',
                 'RatecodeID', 'UInt8',
                 'store_and_fwd_flag', 'string',
                 'PULocationID', 'UInt16',
                 'DOLocationID', 'UInt16'],
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB'
)



@charlesbluca
Member
Oct 14, 2021

Yeah that works! The table is loaded in, though it looks like groupby operations may be a little complex for a simple demo:

In [7]: c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
Out[7]: 
Dask DataFrame Structure:
              passenger_count SUM("trip_data"."tip_amount")
npartitions=1                                              
                         Int8                       float64
                          ...                           ...
Dask Name: getitem, 11857 tasks

Maybe just showing the futures is sufficient to show that it works. Thanks for the help @rajagurunath 😄
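
(For completeness, materializing the result is just one more step - a quick sketch, assuming the trip_data table registered above:)

# trigger the actual computation and print the aggregated result
result = c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
print(result.compute())  # or result.head() to materialize just a sample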

@jacobtomlinson
Member

Closing in favour of #209
