From de86c8e809644192fc180e8c99b1334ece3482fe Mon Sep 17 00:00:00 2001
From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com>
Date: Thu, 14 Dec 2023 14:22:06 +0000
Subject: [PATCH 1/3] Create dask-deltalake.ipynb
---
notebooks/delta-rs/dask-deltalake.ipynb | 800 ++++++++++++++++++++++++
1 file changed, 800 insertions(+)
create mode 100644 notebooks/delta-rs/dask-deltalake.ipynb
diff --git a/notebooks/delta-rs/dask-deltalake.ipynb b/notebooks/delta-rs/dask-deltalake.ipynb
new file mode 100644
index 0000000..cbd14e4
--- /dev/null
+++ b/notebooks/delta-rs/dask-deltalake.ipynb
@@ -0,0 +1,800 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "2324f66b-f367-4b07-9892-0a3d8b9153d2",
+ "metadata": {},
+ "source": [
+ "# Dask-Deltalake Integration\n",
+ "\n",
+ "Delta Lake is a great storage format for Dask analyses. This page will explain why and how to use Delta Lake with Dask.\n",
+ "\n",
+ "You will learn how to read Delta Lakes into Dask DataFrames, how to query Delta tables with Dask, and the unique advantages Delta Lake offers the Dask community.\n",
+ "\n",
+ "Here are some of the benefits that Delta Lake provides Dask users:\n",
+ "- better performance with file skipping\n",
+ "- enhanced file skipping via Z Ordering\n",
+ "- ACID transactions for reliable writes\n",
+ "- easy time-travel functionality"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "546a47b5-5614-4078-80be-61e2b365cedc",
+ "metadata": {},
+ "source": [
+ "> ❗️ `dask-deltatable` doesn't currently work with deltalake=0.14, use deltalake=13.0 or lower. See https://github.com/dask-contrib/dask-deltatable/issues/65"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "001e4111-23d7-4db2-9eda-68cb57ba46d2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import dask_deltatable as ddt\n",
+ "import dask.dataframe as dd\n",
+ "import pandas as pd\n",
+ "import numpy as np"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c61cc62d-ed7f-4d1c-b613-e7555708c0ac",
+ "metadata": {},
+ "source": [
+ "## Read Delta Lake into a Dask DataFrame"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "64b6710b-aa00-4f97-a2d5-b6dc9810a9b8",
+ "metadata": {},
+ "source": [
+ "Let's start with some data stored in a Delta Lake on disk. Read it into a Dask DataFrame using `dask-deltatable.read_deltalake`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "6b4a1fca-0b21-4a34-998c-7fd07d86f1ff",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# read delta table into Dask DataFrame\n",
+ "delta_path = \"../../data/people_countries_delta_dask/\"\n",
+ "ddf = ddt.read_deltalake(delta_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f4b07844-cba9-4095-8429-38b62f69c9e6",
+ "metadata": {},
+ "source": [
+ "Dask is a library for efficient distributed computing and works with [lazy evaluation](https://docs.dask.org/en/stable/user-interfaces.html#laziness-and-computing). Function calls to `dask.dataframe` build a task graph in the background. To trigger computation, call `.compute()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "928a93d0-0472-4d2b-9582-282f6c227618",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " first_name | \n",
+ " last_name | \n",
+ " country | \n",
+ " continent | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Ernesto | \n",
+ " Guevara | \n",
+ " Argentina | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ " 0 | \n",
+ " Wolfgang | \n",
+ " Manche | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Soraya | \n",
+ " Jala | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ " 0 | \n",
+ " Bruce | \n",
+ " Lee | \n",
+ " China | \n",
+ " Asia | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Jack | \n",
+ " Ma | \n",
+ " China | \n",
+ " Asia | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " first_name last_name country continent\n",
+ "0 Ernesto Guevara Argentina \n",
+ "0 Wolfgang Manche Germany \n",
+ "1 Soraya Jala Germany \n",
+ "0 Bruce Lee China Asia\n",
+ "1 Jack Ma China Asia"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ddf.compute()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "868bc406-14d1-4905-8dfc-1af2279c82ed",
+ "metadata": {},
+ "source": [
+ "You can read in specific versions of Delta tables by specifying a `version` number or a timestamp:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "44b79297-0d22-4411-a84c-9b385c204624",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # with specific version\n",
+ "# ddf = ddt.read_deltalake(delta_path, version=3)\n",
+ "\n",
+ "# # with specific datetime\n",
+ "# ddt.read_deltalake(delta_path, datetime=\"2018-12-19T16:39:57-08:00\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8651b12-aeaf-4664-b26f-a3daaabff0b0",
+ "metadata": {},
+ "source": [
+ "`dask-deltatable` also supports reading from remote sources like S3 with:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "cbf00cc0-298c-4655-ae9a-b29d12e83d2a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ddt.read_deltalake(\"s3://bucket_name/delta_path\", version=3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2d50dc6-0d2f-4fe6-9a8a-2cd8e3afb5eb",
+ "metadata": {},
+ "source": [
+ "> To read data from remote sources you'll need to make sure the credentials are properly configured in environment variables or config files. Refer to your cloud provider documentation to configure these."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc085856-4a88-4087-a1da-bff626094236",
+ "metadata": {},
+ "source": [
+ "## What can I do with a Dask Deltatable?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d550eec3-7d19-4591-b752-74d98eccb997",
+ "metadata": {},
+ "source": [
+ "Reading a Delta Lake in with `dask-deltatable` returns a regular Dask DataFrame. You can perform [all the regular Dask operations](https://docs.dask.org/en/stable/dataframe.html) on this DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "b3b749a1-6e5e-41e9-b362-9faafcf9d616",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "dask.dataframe.core.DataFrame"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "type(ddf)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d8bb7107-fdce-4d22-a759-f5802155a46d",
+ "metadata": {},
+ "source": [
+ "Let's take a look at the first few rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "eed2de99-738c-4e9b-b2e5-fcd5ab3b89f7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/rpelgrim/miniforge3/envs/dask-delta-0140/lib/python3.11/site-packages/dask/dataframe/core.py:8272: UserWarning: Insufficient elements for `head`. 3 elements requested, only 1 elements available. Try passing larger `npartitions` to `head`.\n",
+ " warnings.warn(\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " first_name | \n",
+ " last_name | \n",
+ " country | \n",
+ " continent | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Ernesto | \n",
+ " Guevara | \n",
+ " Argentina | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " first_name last_name country continent\n",
+ "0 Ernesto Guevara Argentina "
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ddf.head(n=3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "717d4956-e095-465b-b506-ba4e376a7503",
+ "metadata": {},
+ "source": [
+ "`dask.dataframe.head()` shows you the first rows of the first partition in the dataframe. In this case, the first partition only has 1 row.\n",
+ "\n",
+ "This is because the Delta Lake has been partitioned by country:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "7e09a4a2-b2e0-44ea-b0c3-58c7fbbadeec",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\u001b[34m_delta_log\u001b[m\u001b[m \u001b[34mcountry=Argentina\u001b[m\u001b[m \u001b[34mcountry=China\u001b[m\u001b[m \u001b[34mcountry=Germany\u001b[m\u001b[m\n"
+ ]
+ }
+ ],
+ "source": [
+ "!ls ../../data/people_countries_delta"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bdec4f14-921f-420f-bc19-671c4f210f6b",
+ "metadata": {},
+ "source": [
+ "`dask-deltatable` neatly reads in the partitioned Delta Lake into corresponding Dask DataFrame partitions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "f1ba689d-e198-4710-9152-0aafe761880e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# see number of partitions\n",
+ "ddf.npartitions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f1ffc4f2-45ea-4884-b861-d8293110e97c",
+ "metadata": {},
+ "source": [
+ "You can inspect a single partition using `dask.dataframe.get_partition()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "cf5724fa-2203-4917-965d-e566cd82b16e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " first_name | \n",
+ " last_name | \n",
+ " country | \n",
+ " continent | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Wolfgang | \n",
+ " Manche | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Soraya | \n",
+ " Jala | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " first_name last_name country continent\n",
+ "0 Wolfgang Manche Germany \n",
+ "1 Soraya Jala Germany "
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ddf.get_partition(n=1).compute()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "081a17cb-5ff9-4797-9329-38178f3342f9",
+ "metadata": {},
+ "source": [
+ "## Perform Dask Operations"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38d76a77-e09b-41c9-b516-4ecda564738a",
+ "metadata": {},
+ "source": [
+ "Let's perform some basic computations over the Delta Lake data that's now stored in our Dask DataFrame. \n",
+ "\n",
+ "Suppose you want to group the dataset by the `country` column:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "298a51f4-06b9-46f0-a654-6adbabf7eee8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " first_name | \n",
+ " last_name | \n",
+ " continent | \n",
+ "
\n",
+ " \n",
+ " country | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " Argentina | \n",
+ " 1 | \n",
+ " 1 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " Germany | \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 0 | \n",
+ "
\n",
+ " \n",
+ " China | \n",
+ " 2 | \n",
+ " 2 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " first_name last_name continent\n",
+ "country \n",
+ "Argentina 1 1 0\n",
+ "Germany 2 2 0\n",
+ "China 2 2 2"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "ddf.groupby(['country']).count().compute()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11ed91c2-0fcf-4114-b057-902bc546c0d1",
+ "metadata": {},
+ "source": [
+ "Dask executes this `groupby` operation in parallel across all available cores. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b40fc539-c9d2-4050-b699-2ed00b76f4d0",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95219b2f-434b-491c-a02d-230f711ffcfe",
+ "metadata": {},
+ "source": [
+ "## Map Functions over Partitions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "230855dd-e3c2-461b-a969-7a0e8abed69d",
+ "metadata": {},
+ "source": [
+ "You can also use Dask's `map_partitions` method to map a custom Python function over all the partitions. \n",
+ "\n",
+ "Let's write a function that will replace the missing `continent` values with the right continent names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "8e3fdaf8-d969-448a-af30-91886b166bac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# define custom python function\n",
+ "\n",
+ "# get na_string\n",
+ "df = ddf.get_partition(0).compute()\n",
+ "na_string = df.iloc[0].continent\n",
+ "na_string\n",
+ "\n",
+ "# define function\n",
+ "def replace_proper(partition, na_string):\n",
+ " if [partition.country == \"Argentina\"]:\n",
+ " partition.loc[partition.country==\"Argentina\"] = partition.loc[partition.country==\"Argentina\"].replace(na_string, \"South America\")\n",
+ " if [partition.country == \"Germany\"]:\n",
+ " partition.loc[partition.country==\"Germany\"] = partition.loc[partition.country==\"Germany\"].replace(na_string, \"Europe\")\n",
+ " else:\n",
+ " pass\n",
+ " return partition "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "a1f7d880-0152-47bb-a67c-3d880c1d3e8b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " first_name | \n",
+ " last_name | \n",
+ " country | \n",
+ " continent | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Ernesto | \n",
+ " Guevara | \n",
+ " Argentina | \n",
+ " South America | \n",
+ "
\n",
+ " \n",
+ " 0 | \n",
+ " Wolfgang | \n",
+ " Manche | \n",
+ " Germany | \n",
+ " Europe | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Soraya | \n",
+ " Jala | \n",
+ " Germany | \n",
+ " Europe | \n",
+ "
\n",
+ " \n",
+ " 0 | \n",
+ " Bruce | \n",
+ " Lee | \n",
+ " China | \n",
+ " Asia | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Jack | \n",
+ " Ma | \n",
+ " China | \n",
+ " Asia | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " first_name last_name country continent\n",
+ "0 Ernesto Guevara Argentina South America\n",
+ "0 Wolfgang Manche Germany Europe\n",
+ "1 Soraya Jala Germany Europe\n",
+ "0 Bruce Lee China Asia\n",
+ "1 Jack Ma China Asia"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# define metadata and map function over partitions\n",
+ "meta = dict(ddf.dtypes)\n",
+ "ddf3 = ddf.map_partitions(replace_proper, na_string, meta=meta)\n",
+ "ddf3.compute()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4a639399-56d5-40cf-9e14-762f52fd78c8",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a475f84-cb5f-44ba-bd42-511ce879318c",
+ "metadata": {},
+ "source": [
+ "## Write to Delta Lake\n",
+ "After doing your data processing in Dask, you can write the data back out to Delta Lake using `to_deltalake`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "084ba149-a179-4945-8c37-68732cf2c137",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# ddt.to_deltalake(ddf, \"tmp/test_write\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "81d675e5-3363-45d5-ab86-4cd831bdcada",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aec71505-2de9-408d-8a2f-ce48989eacd2",
+ "metadata": {},
+ "source": [
+ "## Contribute to `dask-deltalake`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0370ae5-08a9-4873-9479-e6cfc169995a",
+ "metadata": {},
+ "source": [
+ "To contribute, go to the [`dask-deltalake` Github repository](https://github.com/rrpelgrim/dask-deltatable)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a3fb5cbc-b5ab-42a5-bf6c-937e194568e7",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ad4eb23b-3dfb-469a-a7e8-6c23aa8bfb90",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python [conda env:dask-delta-0140] *",
+ "language": "python",
+ "name": "conda-env-dask-delta-0140-py"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
From e00f672648471176d9c12c132fb2eae015c307a3 Mon Sep 17 00:00:00 2001
From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com>
Date: Fri, 26 Jan 2024 11:37:59 +0000
Subject: [PATCH 2/3] move dask notebook to python-deltalake
---
notebooks/{delta-rs => python-deltalake}/dask-deltalake.ipynb | 0
1 file changed, 0 insertions(+), 0 deletions(-)
rename notebooks/{delta-rs => python-deltalake}/dask-deltalake.ipynb (100%)
diff --git a/notebooks/delta-rs/dask-deltalake.ipynb b/notebooks/python-deltalake/dask-deltalake.ipynb
similarity index 100%
rename from notebooks/delta-rs/dask-deltalake.ipynb
rename to notebooks/python-deltalake/dask-deltalake.ipynb
From d101433c2869995f7ed2d90e801f07e0358e83d2 Mon Sep 17 00:00:00 2001
From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com>
Date: Fri, 26 Jan 2024 14:11:33 +0000
Subject: [PATCH 3/3] add delta tables for dask notebook
---
.../_delta_log/.00000000000000000000.json.crc | Bin 0 -> 28 bytes
.../_delta_log/00000000000000000000.json | 6 ++
...-b9c2-da1c941680a3.c000.snappy.parquet.crc | Bin 0 -> 16 bytes
...4265-b9c2-da1c941680a3.c000.snappy.parquet | Bin 0 -> 1018 bytes
...-9c85-9a97be631d40.c000.snappy.parquet.crc | Bin 0 -> 16 bytes
...4303-9c85-9a97be631d40.c000.snappy.parquet | Bin 0 -> 1002 bytes
...-830a-1569f823b6ee.c000.snappy.parquet.crc | Bin 0 -> 20 bytes
...47c2-830a-1569f823b6ee.c000.snappy.parquet | Bin 0 -> 1025 bytes
.../python-deltalake/dask-deltalake.ipynb | 84 ++++++++++--------
9 files changed, 52 insertions(+), 38 deletions(-)
create mode 100644 data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc
create mode 100644 data/people_countries_delta_dask/_delta_log/00000000000000000000.json
create mode 100644 data/people_countries_delta_dask/country=Argentina/.part-00000-8d0390a3-f797-4265-b9c2-da1c941680a3.c000.snappy.parquet.crc
create mode 100644 data/people_countries_delta_dask/country=Argentina/part-00000-8d0390a3-f797-4265-b9c2-da1c941680a3.c000.snappy.parquet
create mode 100644 data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc
create mode 100644 data/people_countries_delta_dask/country=China/part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet
create mode 100644 data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc
create mode 100644 data/people_countries_delta_dask/country=Germany/part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet
diff --git a/data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc b/data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc
new file mode 100644
index 0000000000000000000000000000000000000000..2a72af845e3be8faf94e4ae4cad7f252601ce4ec
GIT binary patch
literal 28
kcmYc;N@ieSU}88K%`D#IcH1qb2R(?3l%>ettSpmcx*JT>Zhovt
z>3<;p0p7)vvL|nP(7Sj40r4p4MeyKDnkK;?2Zb=adG9+P=6zpgc*O%Fq&q{=TE-AS^JCfPRuV@z(Bc0u925-pMF>`DX%t{
zOc9tz>`2+bid@a$MMSVd5n2{A9u!yPq9`T^Ia#w4DwW7F97T#sM|pH9kU^3J(crKa
z&RFp9B(%@?gpmp&wQVe48M{Fko%n3Rx6wdNv`Kp1V}HupQ$KQj?njxHVntQ@x}=ht
z(yky$Df0-Xm=5W&F7@+H^MInARrh%g?hfX`Wf7v!M6yYL2&9MfQ9&=RmBU8>&-}@Q
z1paHT41bltU*iORDU|4Wdnx^p4xrpmT}~IkCXT0Z8TU>klzzeFI-;Kiigk&86EH|XYZ?1S6Un>3DZ+ElfI*$dJTijmBV)oN=CFv#pM<#<
zJ)Uu`r-e}v+mY7L#xD1qQ0v{FYsm%KA&)&DqvE^{PW7}5C
f?6%vj1FOxNChu^kW9=$^_@|d2s=(VD!mIxYqs#sN
literal 0
HcmV?d00001
diff --git a/data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc b/data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc
new file mode 100644
index 0000000000000000000000000000000000000000..f5463cc908d0d1e5947907b7e8d178da344cceae
GIT binary patch
literal 16
XcmYc;N@ieSU}9*R+;aBjtmm@l0o2T<^^_z8*!-z2py=tUt6GjIO$dvE@4a=3fT#t2hf#V_X1
zo~Q;^V41@zLf5(yLP(agFz$sDhbu5SV9qHBEPNgRJb(W!x2C_kpXz0hbvPG`n+$N5
zaRI-2{Q32THkL3jR6~GvBaZ<{Y`&SldwxlPlr*twq#z{gS0r7_CdqY8Nped%K`4pK
zkVMxc$gzqk0t?cFo`wyg;$t|2bb}0VcQKo1@Sd2$+0K4G8nf{9C~^jT$Z!tf{HCN`
z?0aDp9|dg4H&MqBS`psx+25khD2Tm)2k}Bn6y4C<7WkRh>p7(A>pUzGh1?~_mfCKh
zQgW}wEuly-1*$@be3Eg2e2{PjS%Vd3l|P}}CS7YFDBy!899buP5riKSE+K24Nf8QE
zvau{SLB!+{Y+D~QM(SHS5-3c#n68jsPWmH>;*_Mka~TuzRK^sMmvA3TC2yqyxLnq>
zQFubxn1P)%ahBfaVdRYhEnA~Iw1Tv)KNQiKPoW8C6CTrW^nea|%v=^T?R#cw#t+Bb
zY?)CEuP`>tX5Zt!8=0*;Q&Y4DwpP7@?UmdMPGB8Oe4q7v&;fv#&EIUZHJ#1MOUAOX
z|BqN2`>%+5Go&3X?!P?@iwnbwHnpp?RxI@P1oVB-K;u0j@|FM+-W#YuVy#v
d_1cYIomFh!;BKS0rMKa|rXTtWZ|M;J@?ST3@IU|n
literal 0
HcmV?d00001
diff --git a/data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc b/data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc
new file mode 100644
index 0000000000000000000000000000000000000000..7c8b129fb104fd31e78b06c30b4e6e7d579a6475
GIT binary patch
literal 20
bcmYc;N@ieSU}CtG;M85Lp6oyZeIHocbm7dEDWHAI9S{Xyr`EVF!qbn)36kWLRqueBS#;KW1n`D$)
z3W^{uy7ynWtmr~;(}jX7e}afVLAvrJna&L8Mq$XD-1FY&zV|&dTi0*92-w07o}7Mp
zp_-`CYYBA#GgJUTRjGu#!?4>EQLjYxonb8Y1*ulCj_+Q+`nWK&iQg|aWx`t`@*hc!
zs&x?x0x#fY{Pg_&=YNiV(?ONvcdH;
zJE#~8Aa82f8Kfx=ooT4VH+c3OFPcop9ohzl$zt@z)sccXvXVp_H_OSch#zk!WPc#A
z1Xx~D_=RpiPSWj24CE4Qn3*LK&&I75x&;CVW;m|q#@wTf$C
zA1MV#BTw@a+c_s@&!}yGn=|tO^9O_FEX=$Hqe^4d$QZfHzJMKR$k+!JuK+uyTiH50
zQ5GrdD(Azv=ZIZFp*nfK&J!6Y{b9t<)ttJs3Vh`cMY0zW~EOr3c6)EUTB1R@pu
z$J~LH?(a&gZ6#?u@>8p7b^9_5603dhz{-Y1t?9(5J(c&P9=%7SFce-$Is(Y5b;h>Z
z2Zx8%31c$e|3^$h{a3_;L*VP<`(GWzm2qGd*KyaJT4mWA^~0djZ3Vsz+}4_3_uZ}+
mNWba(Ud?Sa8ntU)L#(>8DTAiBVsz+NTYTUr4Qq@3{J#N6{P=YM
literal 0
HcmV?d00001
diff --git a/notebooks/python-deltalake/dask-deltalake.ipynb b/notebooks/python-deltalake/dask-deltalake.ipynb
index cbd14e4..423a79d 100644
--- a/notebooks/python-deltalake/dask-deltalake.ipynb
+++ b/notebooks/python-deltalake/dask-deltalake.ipynb
@@ -5,7 +5,7 @@
"id": "2324f66b-f367-4b07-9892-0a3d8b9153d2",
"metadata": {},
"source": [
- "# Dask-Deltalake Integration\n",
+ "# Using Delta Lake with Dask\n",
"\n",
"Delta Lake is a great storage format for Dask analyses. This page will explain why and how to use Delta Lake with Dask.\n",
"\n",
@@ -63,7 +63,7 @@
"outputs": [],
"source": [
"# read delta table into Dask DataFrame\n",
- "delta_path = \"../../data/people_countries_delta_dask/\"\n",
+ "delta_path = \"../../data/people_countries_delta_dask\"\n",
"ddf = ddt.read_deltalake(delta_path)"
]
},
@@ -78,7 +78,7 @@
{
"cell_type": "code",
"execution_count": 3,
- "id": "928a93d0-0472-4d2b-9582-282f6c227618",
+ "id": "cf5d9296-5914-497b-be00-fe7371ed6d57",
"metadata": {},
"outputs": [
{
@@ -114,21 +114,7 @@
" Ernesto | \n",
" Guevara | \n",
" Argentina | \n",
- " <NA> | \n",
- " \n",
- " \n",
- " 0 | \n",
- " Wolfgang | \n",
- " Manche | \n",
- " Germany | \n",
- " <NA> | \n",
- "
\n",
- " \n",
- " 1 | \n",
- " Soraya | \n",
- " Jala | \n",
- " Germany | \n",
- " <NA> | \n",
+ " NaN | \n",
"
\n",
" \n",
" 0 | \n",
@@ -144,17 +130,31 @@
" China | \n",
" Asia | \n",
"
\n",
+ " \n",
+ " 0 | \n",
+ " Wolfgang | \n",
+ " Manche | \n",
+ " Germany | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Soraya | \n",
+ " Jala | \n",
+ " Germany | \n",
+ " NaN | \n",
+ "
\n",
" \n",
"\n",
""
],
"text/plain": [
" first_name last_name country continent\n",
- "0 Ernesto Guevara Argentina \n",
- "0 Wolfgang Manche Germany \n",
- "1 Soraya Jala Germany \n",
+ "0 Ernesto Guevara Argentina NaN\n",
"0 Bruce Lee China Asia\n",
- "1 Jack Ma China Asia"
+ "1 Jack Ma China Asia\n",
+ "0 Wolfgang Manche Germany NaN\n",
+ "1 Soraya Jala Germany NaN"
]
},
"execution_count": 3,
@@ -232,7 +232,7 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 10,
"id": "b3b749a1-6e5e-41e9-b362-9faafcf9d616",
"metadata": {},
"outputs": [
@@ -242,7 +242,7 @@
"dask.dataframe.core.DataFrame"
]
},
- "execution_count": 6,
+ "execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -261,18 +261,10 @@
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 11,
"id": "eed2de99-738c-4e9b-b2e5-fcd5ab3b89f7",
"metadata": {},
"outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "/Users/rpelgrim/miniforge3/envs/dask-delta-0140/lib/python3.11/site-packages/dask/dataframe/core.py:8272: UserWarning: Insufficient elements for `head`. 3 elements requested, only 1 elements available. Try passing larger `npartitions` to `head`.\n",
- " warnings.warn(\n"
- ]
- },
{
"data": {
"text/html": [
@@ -308,16 +300,32 @@
" Argentina | \n",
" <NA> | \n",
" \n",
+ " \n",
+ " 1 | \n",
+ " Wolfgang | \n",
+ " Manche | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " Soraya | \n",
+ " Jala | \n",
+ " Germany | \n",
+ " <NA> | \n",
+ "
\n",
" \n",
"\n",
""
],
"text/plain": [
" first_name last_name country continent\n",
- "0 Ernesto Guevara Argentina "
+ "0 Ernesto Guevara Argentina \n",
+ "1 Wolfgang Manche Germany \n",
+ "2 Soraya Jala Germany "
]
},
- "execution_count": 7,
+ "execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
@@ -338,7 +346,7 @@
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 7,
"id": "7e09a4a2-b2e0-44ea-b0c3-58c7fbbadeec",
"metadata": {},
"outputs": [
@@ -351,7 +359,7 @@
}
],
"source": [
- "!ls ../../data/people_countries_delta"
+ "!ls ../../data/people_countries_delta_dask"
]
},
{
@@ -778,7 +786,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python [conda env:dask-delta-0140] *",
+ "display_name": "Python [conda env:dask-delta-0140]",
"language": "python",
"name": "conda-env-dask-delta-0140-py"
},