
Added an example notebook on dask-sql. #171

Closed
wants to merge 7 commits

Conversation

nils-braun
Contributor

Fixes #170
This is a first draft for an example notebook showcasing dask-sql.
It is not very long and only shows the main features (e.g. no custom functions, no reusing results as new tables, etc.) - but for a short overview I think it is fine.
If you would like to see more content or anything different, I am happy to improve it!

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@nils-braun
Contributor Author

I am a bit confused. The installation works with mamba, but not with conda - which makes no sense... With conda it says that "nodejs" and "numpy" are missing. Do you think it is fine to use mamba here (it is also much faster) - even though it might be less well known?

@jrbourbeau
Member

Woo, thanks for adding this @nils-braun!

Re: installing dask-sql, could you try adding it to dask-examples/binder/environment.yml to see if we can successfully solve the environment with that added dependency? That might be the path of least resistance.
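
For reference, that would just be one extra line in the dependencies list - a sketch (the surrounding entries are assumed; the real file has more):

# binder/environment.yml (excerpt; entries other than dask-sql are hypothetical)
dependencies:
  - dask
  - distributed
  - dask-sql  # the new dependency - does the solve still succeed with this added?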

@mrocklin
Member

I'm glad to see this. Some comments/questions:

  1. Should we rename this to sql.ipynb?
  2. Should we put it in the root folder rather than in applications?
  3. We should add it to the index.rst file.
  4. How much does conda installing this increase the size of the Docker image? How large are the dependencies? (I imagine Calcite might be large.)

@nils-braun
Contributor Author

Very reasonable comments and questions @mrocklin and @jrbourbeau - thanks for that.

  1. I have done so.
  2. I am undecided. dask-sql is not part of the main dask repo, but it could also be a main entry point for many users. I have done so in the newest changes.
  3. Good point, I have done so.
  4. Before installing dask-sql, the conda env (/srv/conda/envs/notebook) has a size of 2.8 GB. Afterwards, it is 3.2 GB - so around 400 MB. I guess the largest part of that is the JVM.
    I have tested adding it to the environment file; let's see what the tests say.

@nils-braun
Contributor Author

So according to the tests, the package solving was successful.
However, now there is an error popping up in the "bag.ipynb" notebook - which (at least to me) seems to be unrelated and more a problem with pandas (although I could not reproduce the same error on my laptop).

@mrocklin
Member

Right, I agree that those errors are probably unrelated.

Hrm, 400 MB for this seems large. If possible I'd prefer to keep the installation in the notebook itself and outside of environment.yml (large images slow down container start times).
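
Concretely, that could just be a cell at the top of the notebook - a sketch, assuming mamba is available in the Binder image and conda-forge is the right channel:

# hypothetical first notebook cell: install dask-sql on the fly
# instead of baking it into the Docker image
!mamba install -y -c conda-forge dask-sql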

@nils-braun
Contributor Author

Unfortunately, I am currently blocked by #173 - but apart from that, I think it should work now.

@nils-braun
Contributor Author

I have updated the PR for the newest dask-sql version and also moved back to mamba for the installation.
I tested the notebook on Binder; let's see if the unit tests pass this time.

Base automatically changed from master to main January 27, 2021 16:07
@jacobtomlinson
Member

Sorry this has sat for so long. Given the age of the PR, could I ask you (or perhaps @charlesbluca, @galipremsagar or @rajagurunath) to give it a quick review to ensure things are still current?

@charlesbluca
Member

Overall looks pretty good - not too many changes to core dask-sql stuff other than behavior surrounding persisting to memory.

We might also want to mention the active development to get dask-sql running on GPU somewhere near the end, though I imagine this binder doesn't have GPU support to showcase that, right?

EDIT:

Should also note that the change in dask-sql's persist behavior is pending the next release (dask-contrib/dask-sql#257). Given the amount of time that has passed since the last release, I think we should probably hold off on merging this notebook in until after dask-sql's next release.

"metadata": {},
"outputs": [],
"source": [
"c.create_table(\"timeseries\", df.persist())"
Member

Suggested change
"c.create_table(\"timeseries\", df.persist())"
"c.create_table(\"timeseries\", persist=True)"

This can be done using the persist kwarg of create_table.

"\n",
"Please note that we have persisted the data before passing it to dask-sql.\n",
"This will tell dask that we want to prefetch the data into memory.\n",
"Doing so will speed up the queries a lot, so you probably always want to do this.\n",
Member

Given some of the discussion on dask-contrib/dask-sql#218 and the fact that persisting is no longer dask-sql's default behavior (dask-contrib/dask-sql#245), it might be worth discussing here the trade-offs of persisting to memory beforehand (speed-up vs. potential OOM errors). cc @VibhuJawa in case you have any thoughts here
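
For illustration, the two options would look roughly like this - a sketch, assuming the post-#245 API where persist defaults to False:

# assumes the upcoming dask-sql release, where persist=False is the default
import dask.datasets
from dask_sql import Context

c = Context()
df = dask.datasets.timeseries()

# Option 1 - lazy (new default): low memory footprint,
# but every query re-reads/recomputes the input
c.create_table("timeseries", df)

# Option 2 - persisted: repeated queries are much faster,
# but this can OOM if the data does not fit into memory
c.create_table("timeseries", df, persist=True)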

"source": [
"import pandas as pd\n",
"df = pd.DataFrame({\"column\": [1, 2, 3]})\n",
"c.create_table(\"pandas\", df)"
Member

Suggested change
"c.create_table(\"pandas\", df)"
"c.create_table(\"pandas\", df, persist=True)"

As of the next release of dask-sql, persisting will no longer be the default behavior, so we would need to include this kwarg to make sure the table is persisted. Given the discussion above, we may opt not to persist here.

"metadata": {},
"source": [
"In most of the cases however, your data will live on some external storage device, such as a local disk, S3 or hdfs.\n",
"You can leverage dask's large set of understood input formats and sources to load the data.\n",
Member

Is there any publicly available database we could register to illustrate this concept?


Hi @charlesbluca

If you are looking for S3 datasets, can we make use of the New York taxi dataset used in the Coiled docs?

import dask.dataframe as dd

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "string",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

c.create_table("trip_data", persist=True)
 

And try to add some groupby/aggregation examples? What do you think?

# something like
c.sql("select passenger_count, avg(tip_amount) from trip_data group by passenger_count")

Member

Yeah that seems like a good idea! Do you know if it's possible to read this in directly with a query, with something like:

    c.sql(
        f"""
        CREATE TABLE
            trip_data
        WITH (
            location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
            format = 'csv',
            ...
        )
    """
    )

Passing the kwargs into the WITH (...)?


Haven't tried this yet, but can we try something like this? I expect it should parse and pass the kwargs to the input plugins (inspired by this example here).

What do you think?

CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['pep_pickup_datetime', 'tpep_dropoff_datetime'],
    type = MAP ['payment_type', 'UInt8',
                'VendorID', 'UInt8',
                'passenger_count', 'UInt8',
                'RatecodeID', 'UInt8',
                'store_and_fwd_flag', 'string',
                'PULocationID', 'UInt16',
                'DOLocationID', 'UInt16']
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB',
)



I have tried this SQL after fixing the bugs, and it seems to be working for me; let me know if this query works for you as well.

Refactored query:

CREATE TABLE trip_data
WITH (
    location = 's3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv',
    format = 'csv',
    parse_dates = ARRAY ['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    dtype = MAP ['payment_type', 'UInt8',
                 'VendorID', 'UInt8',
                 'passenger_count', 'UInt8',
                 'RatecodeID', 'UInt8',
                 'store_and_fwd_flag', 'string',
                 'PULocationID', 'UInt16',
                 'DOLocationID', 'UInt16'],
    storage_options = MAP ['anon', 'true'],
    blocksize = '16 MiB'
)



@charlesbluca
Member
Oct 14, 2021

Yeah that works! The table is loaded in, though it looks like groupby operations may be a little complex for a simple demo:

In [7]: c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
Out[7]: 
Dask DataFrame Structure:
              passenger_count SUM("trip_data"."tip_amount")
npartitions=1                                              
                         Int8                       float64
                          ...                           ...
Dask Name: getitem, 11857 tasks

Maybe just showing the futures is sufficient to show that it works. Thanks for the help @rajagurunath 😄
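
(For completeness, materializing the result is just one more step - a quick sketch, assuming the trip_data table registered above:)

# trigger the actual computation and print the aggregated result
result = c.sql("select passenger_count, sum(tip_amount) from trip_data group by passenger_count")
print(result.compute())  # or result.head() to materialize just a sample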

@jacobtomlinson
Member

Closing in favour of #209
