Dask provides multi-core execution on larger-than-memory datasets.

We can think of Dask at a high and a low level:
- High-level collections: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don't fit into main memory. Dask's high-level collections are alternatives to NumPy and Pandas for large datasets.
- Low-level schedulers: Dask provides dynamic task schedulers that execute task graphs in parallel. These execution engines power the high-level collections mentioned above but can also power custom, user-defined workloads. These schedulers are low-latency (around 1ms) and work hard to run computations in a small memory footprint. Dask's schedulers are an alternative to direct use of the threading or multiprocessing libraries in complex cases, and to other task scheduling systems like Luigi or IPython parallel.
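To make the high-level collection style concrete, here is a minimal sketch using dask.array. The array size and chunk shape are arbitrary choices for illustration; the pattern is that operations build a task graph lazily, and `.compute()` executes it in parallel.

```python
import dask.array as da

# A 10000x10000 array of ones, split into 1000x1000 chunks.
# Nothing is computed yet; this only builds a task graph.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# .compute() executes the graph, reducing chunk-wise sums in parallel
total = x.sum().compute()
# total == 10000 * 10000 == 100000000
```

Each chunk is an ordinary NumPy array, so familiar NumPy-style operations (slicing, reductions, arithmetic) work largely unchanged.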
Different users operate at different levels, but it is useful to understand both. This tutorial alternates between high-level use of dask.array and dask.dataframe (even sections) and low-level use of dask graphs and schedulers (odd sections).
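As a preview of the low-level side, a Dask graph is just a plain Python dict mapping keys to values or to tasks (tuples of a callable and its arguments), which a scheduler can then execute. The functions `inc` and `add` below are illustrative placeholders, not part of Dask itself:

```python
from dask import get  # the simple synchronous scheduler

def inc(i):
    return i + 1

def add(a, b):
    return a + b

# A graph: keys map to literal values or task tuples.
# A task tuple's arguments may name other keys in the graph.
dsk = {'x': 1,
       'y': (inc, 'x'),       # y = inc(x)
       'z': (add, 'y', 10)}   # z = add(y, 10)

result = get(dsk, 'z')  # executes the graph: inc(1) -> 2, add(2, 10) -> 12
```

The high-level collections generate graphs exactly like this one behind the scenes, just much larger.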
You will need the following core libraries:

```
conda install numpy pandas h5py Pillow matplotlib scipy toolz pytables snakeviz
```
And a recently updated version of Dask and Distributed, via either conda or pip:

```
conda install dask distributed
# or
pip install dask distributed
```
You may find the following libraries helpful for some exercises:

```
pip install graphviz cachey
```
Windows users can install graphviz as follows:

- Install Graphviz from http://www.graphviz.org/Download_windows.php
- Add `C:\Program Files (x86)\Graphviz2.38\bin` to the PATH
- Run `pip install graphviz` on the command line
You should clone this repository:

```
git clone http://github.com/dask/dask-tutorial
```

and then run this script to prepare artificial data:

```
cd dask-tutorial
python prep.py
```
- Reference
- Ask for help:
  - the `dask` tag on Stack Overflow
  - GitHub issues for bug reports and feature requests
  - the blaze-dev mailing list for community discussion
  - please ask questions during a live tutorial
- Overview - Dask's place in the universe.
- Foundations - low-level Dask and how it does what it does.
- Bag - the first high-level collection: a generalized iterator for use with a functional programming style and to clean messy data.
- Distributed - Dask's scheduler for clusters, with details of how to view the UI.
- Array - blocked NumPy-like functionality with a collection of NumPy arrays spread across your cluster.
- Advanced Distributed - further details on distributed computing, including how to debug.
- Dataframe - parallelized operations on many pandas dataframes spread across your cluster.
- Dataframe Storage - efficient ways to read and write dataframes to disk.