From de86c8e809644192fc180e8c99b1334ece3482fe Mon Sep 17 00:00:00 2001 From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com> Date: Thu, 14 Dec 2023 14:22:06 +0000 Subject: [PATCH 1/3] Create dask-deltalake.ipynb --- notebooks/delta-rs/dask-deltalake.ipynb | 800 ++++++++++++++++++++++++ 1 file changed, 800 insertions(+) create mode 100644 notebooks/delta-rs/dask-deltalake.ipynb diff --git a/notebooks/delta-rs/dask-deltalake.ipynb b/notebooks/delta-rs/dask-deltalake.ipynb new file mode 100644 index 0000000..cbd14e4 --- /dev/null +++ b/notebooks/delta-rs/dask-deltalake.ipynb @@ -0,0 +1,800 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2324f66b-f367-4b07-9892-0a3d8b9153d2", + "metadata": {}, + "source": [ + "# Dask-Deltalake Integration\n", + "\n", + "Delta Lake is a great storage format for Dask analyses. This page will explain why and how to use Delta Lake with Dask.\n", + "\n", + "You will learn how to read Delta Lakes into Dask DataFrames, how to query Delta tables with Dask, and the unique advantages Delta Lake offers the Dask community.\n", + "\n", + "Here are some of the benefits that Delta Lake provides Dask users:\n", + "- better performance with file skipping\n", + "- enhanced file skipping via Z Ordering\n", + "- ACID transactions for reliable writes\n", + "- easy time-travel functionality" + ] + }, + { + "cell_type": "markdown", + "id": "546a47b5-5614-4078-80be-61e2b365cedc", + "metadata": {}, + "source": [ + "> ❗️ `dask-deltatable` doesn't currently work with deltalake=0.14, use deltalake=13.0 or lower. See https://github.com/dask-contrib/dask-deltatable/issues/65" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "001e4111-23d7-4db2-9eda-68cb57ba46d2", + "metadata": {}, + "outputs": [], + "source": [ + "import dask_deltatable as ddt\n", + "import dask.dataframe as dd\n", + "import pandas as pd\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "id": "c61cc62d-ed7f-4d1c-b613-e7555708c0ac", + "metadata": {}, + "source": [ + "## Read Delta Lake into a Dask DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "64b6710b-aa00-4f97-a2d5-b6dc9810a9b8", + "metadata": {}, + "source": [ + "Let's start with some data stored in a Delta Lake on disk. Read it into a Dask DataFrame using `dask-deltatable.read_deltalake`:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "6b4a1fca-0b21-4a34-998c-7fd07d86f1ff", + "metadata": {}, + "outputs": [], + "source": [ + "# read delta table into Dask DataFrame\n", + "delta_path = \"../../data/people_countries_delta_dask/\"\n", + "ddf = ddt.read_deltalake(delta_path)" + ] + }, + { + "cell_type": "markdown", + "id": "f4b07844-cba9-4095-8429-38b62f69c9e6", + "metadata": {}, + "source": [ + "Dask is a library for efficient distributed computing and works with [lazy evaluation](https://docs.dask.org/en/stable/user-interfaces.html#laziness-and-computing). Function calls to `dask.dataframe` build a task graph in the background. To trigger computation, call `.compute()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "928a93d0-0472-4d2b-9582-282f6c227618", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
first_namelast_namecountrycontinent
0ErnestoGuevaraArgentina<NA>
0WolfgangMancheGermany<NA>
1SorayaJalaGermany<NA>
0BruceLeeChinaAsia
1JackMaChinaAsia
\n", + "
" + ], + "text/plain": [ + " first_name last_name country continent\n", + "0 Ernesto Guevara Argentina \n", + "0 Wolfgang Manche Germany \n", + "1 Soraya Jala Germany \n", + "0 Bruce Lee China Asia\n", + "1 Jack Ma China Asia" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ddf.compute()" + ] + }, + { + "cell_type": "markdown", + "id": "868bc406-14d1-4905-8dfc-1af2279c82ed", + "metadata": {}, + "source": [ + "You can read in specific versions of Delta tables by specifying a `version` number or a timestamp:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "44b79297-0d22-4411-a84c-9b385c204624", + "metadata": {}, + "outputs": [], + "source": [ + "# # with specific version\n", + "# ddf = ddt.read_deltalake(delta_path, version=3)\n", + "\n", + "# # with specific datetime\n", + "# ddt.read_deltalake(delta_path, datetime=\"2018-12-19T16:39:57-08:00\")" + ] + }, + { + "cell_type": "markdown", + "id": "b8651b12-aeaf-4664-b26f-a3daaabff0b0", + "metadata": {}, + "source": [ + "`dask-deltatable` also supports reading from remote sources like S3 with:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "cbf00cc0-298c-4655-ae9a-b29d12e83d2a", + "metadata": {}, + "outputs": [], + "source": [ + "# ddt.read_deltalake(\"s3://bucket_name/delta_path\", version=3)" + ] + }, + { + "cell_type": "markdown", + "id": "c2d50dc6-0d2f-4fe6-9a8a-2cd8e3afb5eb", + "metadata": {}, + "source": [ + "> To read data from remote sources you'll need to make sure the credentials are properly configured in environment variables or config files. Refer to your cloud provider documentation to configure these." + ] + }, + { + "cell_type": "markdown", + "id": "fc085856-4a88-4087-a1da-bff626094236", + "metadata": {}, + "source": [ + "## What can I do with a Dask Deltatable?" + ] + }, + { + "cell_type": "markdown", + "id": "d550eec3-7d19-4591-b752-74d98eccb997", + "metadata": {}, + "source": [ + "Reading a Delta Lake in with `dask-deltatable` returns a regular Dask DataFrame. You can perform [all the regular Dask operations](https://docs.dask.org/en/stable/dataframe.html) on this DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b3b749a1-6e5e-41e9-b362-9faafcf9d616", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dask.dataframe.core.DataFrame" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(ddf)" + ] + }, + { + "cell_type": "markdown", + "id": "d8bb7107-fdce-4d22-a759-f5802155a46d", + "metadata": {}, + "source": [ + "Let's take a look at the first few rows:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "eed2de99-738c-4e9b-b2e5-fcd5ab3b89f7", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/rpelgrim/miniforge3/envs/dask-delta-0140/lib/python3.11/site-packages/dask/dataframe/core.py:8272: UserWarning: Insufficient elements for `head`. 3 elements requested, only 1 elements available. Try passing larger `npartitions` to `head`.\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
first_namelast_namecountrycontinent
0ErnestoGuevaraArgentina<NA>
\n", + "
" + ], + "text/plain": [ + " first_name last_name country continent\n", + "0 Ernesto Guevara Argentina " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ddf.head(n=3)" + ] + }, + { + "cell_type": "markdown", + "id": "717d4956-e095-465b-b506-ba4e376a7503", + "metadata": {}, + "source": [ + "`dask.dataframe.head()` shows you the first rows of the first partition in the dataframe. In this case, the first partition only has 1 row.\n", + "\n", + "This is because the Delta Lake has been partitioned by country:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "7e09a4a2-b2e0-44ea-b0c3-58c7fbbadeec", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[34m_delta_log\u001b[m\u001b[m \u001b[34mcountry=Argentina\u001b[m\u001b[m \u001b[34mcountry=China\u001b[m\u001b[m \u001b[34mcountry=Germany\u001b[m\u001b[m\n" + ] + } + ], + "source": [ + "!ls ../../data/people_countries_delta" + ] + }, + { + "cell_type": "markdown", + "id": "bdec4f14-921f-420f-bc19-671c4f210f6b", + "metadata": {}, + "source": [ + "`dask-deltatable` neatly reads in the partitioned Delta Lake into corresponding Dask DataFrame partitions:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "f1ba689d-e198-4710-9152-0aafe761880e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# see number of partitions\n", + "ddf.npartitions" + ] + }, + { + "cell_type": "markdown", + "id": "f1ffc4f2-45ea-4884-b861-d8293110e97c", + "metadata": {}, + "source": [ + "You can inspect a single partition using `dask.dataframe.get_partition()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "cf5724fa-2203-4917-965d-e566cd82b16e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
first_namelast_namecountrycontinent
0WolfgangMancheGermany<NA>
1SorayaJalaGermany<NA>
\n", + "
" + ], + "text/plain": [ + " first_name last_name country continent\n", + "0 Wolfgang Manche Germany \n", + "1 Soraya Jala Germany " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ddf.get_partition(n=1).compute()" + ] + }, + { + "cell_type": "markdown", + "id": "081a17cb-5ff9-4797-9329-38178f3342f9", + "metadata": {}, + "source": [ + "## Perform Dask Operations" + ] + }, + { + "cell_type": "markdown", + "id": "38d76a77-e09b-41c9-b516-4ecda564738a", + "metadata": {}, + "source": [ + "Let's perform some basic computations over the Delta Lake data that's now stored in our Dask DataFrame. \n", + "\n", + "Suppose you want to group the dataset by the `country` column:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "298a51f4-06b9-46f0-a654-6adbabf7eee8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
first_namelast_namecontinent
country
Argentina110
Germany220
China222
\n", + "
" + ], + "text/plain": [ + " first_name last_name continent\n", + "country \n", + "Argentina 1 1 0\n", + "Germany 2 2 0\n", + "China 2 2 2" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ddf.groupby(['country']).count().compute()" + ] + }, + { + "cell_type": "markdown", + "id": "11ed91c2-0fcf-4114-b057-902bc546c0d1", + "metadata": {}, + "source": [ + "Dask executes this `groupby` operation in parallel across all available cores. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b40fc539-c9d2-4050-b699-2ed00b76f4d0", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "95219b2f-434b-491c-a02d-230f711ffcfe", + "metadata": {}, + "source": [ + "## Map Functions over Partitions" + ] + }, + { + "cell_type": "markdown", + "id": "230855dd-e3c2-461b-a969-7a0e8abed69d", + "metadata": {}, + "source": [ + "You can also use Dask's `map_partitions` method to map a custom Python function over all the partitions. \n", + "\n", + "Let's write a function that will replace the missing `continent` values with the right continent names." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "8e3fdaf8-d969-448a-af30-91886b166bac", + "metadata": {}, + "outputs": [], + "source": [ + "# define custom python function\n", + "\n", + "# get na_string\n", + "df = ddf.get_partition(0).compute()\n", + "na_string = df.iloc[0].continent\n", + "na_string\n", + "\n", + "# define function\n", + "def replace_proper(partition, na_string):\n", + " if [partition.country == \"Argentina\"]:\n", + " partition.loc[partition.country==\"Argentina\"] = partition.loc[partition.country==\"Argentina\"].replace(na_string, \"South America\")\n", + " if [partition.country == \"Germany\"]:\n", + " partition.loc[partition.country==\"Germany\"] = partition.loc[partition.country==\"Germany\"].replace(na_string, \"Europe\")\n", + " else:\n", + " pass\n", + " return partition " + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "a1f7d880-0152-47bb-a67c-3d880c1d3e8b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
first_namelast_namecountrycontinent
0ErnestoGuevaraArgentinaSouth America
0WolfgangMancheGermanyEurope
1SorayaJalaGermanyEurope
0BruceLeeChinaAsia
1JackMaChinaAsia
\n", + "
" + ], + "text/plain": [ + " first_name last_name country continent\n", + "0 Ernesto Guevara Argentina South America\n", + "0 Wolfgang Manche Germany Europe\n", + "1 Soraya Jala Germany Europe\n", + "0 Bruce Lee China Asia\n", + "1 Jack Ma China Asia" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# define metadata and map function over partitions\n", + "meta = dict(ddf.dtypes)\n", + "ddf3 = ddf.map_partitions(replace_proper, na_string, meta=meta)\n", + "ddf3.compute()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a639399-56d5-40cf-9e14-762f52fd78c8", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "3a475f84-cb5f-44ba-bd42-511ce879318c", + "metadata": {}, + "source": [ + "## Write to Delta Lake\n", + "After doing your data processing in Dask, you can write the data back out to Delta Lake using `to_deltalake`:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "084ba149-a179-4945-8c37-68732cf2c137", + "metadata": {}, + "outputs": [], + "source": [ + "# ddt.to_deltalake(ddf, \"tmp/test_write\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81d675e5-3363-45d5-ab86-4cd831bdcada", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "aec71505-2de9-408d-8a2f-ce48989eacd2", + "metadata": {}, + "source": [ + "## Contribute to `dask-deltalake`" + ] + }, + { + "cell_type": "markdown", + "id": "b0370ae5-08a9-4873-9479-e6cfc169995a", + "metadata": {}, + "source": [ + "To contribute, go to the [`dask-deltalake` Github repository](https://github.com/rrpelgrim/dask-deltatable)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3fb5cbc-b5ab-42a5-bf6c-937e194568e7", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad4eb23b-3dfb-469a-a7e8-6c23aa8bfb90", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:dask-delta-0140] *", + "language": "python", + "name": "conda-env-dask-delta-0140-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From e00f672648471176d9c12c132fb2eae015c307a3 Mon Sep 17 00:00:00 2001 From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com> Date: Fri, 26 Jan 2024 11:37:59 +0000 Subject: [PATCH 2/3] move dask notebook to python-deltalake --- notebooks/{delta-rs => python-deltalake}/dask-deltalake.ipynb | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notebooks/{delta-rs => python-deltalake}/dask-deltalake.ipynb (100%) diff --git a/notebooks/delta-rs/dask-deltalake.ipynb b/notebooks/python-deltalake/dask-deltalake.ipynb similarity index 100% rename from notebooks/delta-rs/dask-deltalake.ipynb rename to notebooks/python-deltalake/dask-deltalake.ipynb From d101433c2869995f7ed2d90e801f07e0358e83d2 Mon Sep 17 00:00:00 2001 From: Richard Pelgrim <68642378+rrpelgrim@users.noreply.github.com> Date: Fri, 26 Jan 2024 14:11:33 +0000 Subject: [PATCH 3/3] add delta tables for dask notebook --- .../_delta_log/.00000000000000000000.json.crc | Bin 0 -> 28 bytes .../_delta_log/00000000000000000000.json | 6 ++ ...-b9c2-da1c941680a3.c000.snappy.parquet.crc | Bin 0 -> 16 bytes ...4265-b9c2-da1c941680a3.c000.snappy.parquet | Bin 0 -> 1018 bytes ...-9c85-9a97be631d40.c000.snappy.parquet.crc | Bin 0 -> 16 bytes ...4303-9c85-9a97be631d40.c000.snappy.parquet | Bin 0 -> 1002 bytes ...-830a-1569f823b6ee.c000.snappy.parquet.crc | Bin 0 -> 20 bytes ...47c2-830a-1569f823b6ee.c000.snappy.parquet | Bin 0 -> 1025 bytes .../python-deltalake/dask-deltalake.ipynb | 84 ++++++++++-------- 9 files changed, 52 insertions(+), 38 deletions(-) create mode 100644 data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc create mode 100644 data/people_countries_delta_dask/_delta_log/00000000000000000000.json create mode 100644 data/people_countries_delta_dask/country=Argentina/.part-00000-8d0390a3-f797-4265-b9c2-da1c941680a3.c000.snappy.parquet.crc create mode 100644 data/people_countries_delta_dask/country=Argentina/part-00000-8d0390a3-f797-4265-b9c2-da1c941680a3.c000.snappy.parquet create mode 100644 data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc create mode 100644 data/people_countries_delta_dask/country=China/part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet create mode 100644 data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc create mode 100644 data/people_countries_delta_dask/country=Germany/part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet diff --git a/data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc b/data/people_countries_delta_dask/_delta_log/.00000000000000000000.json.crc new file mode 100644 index 0000000000000000000000000000000000000000..2a72af845e3be8faf94e4ae4cad7f252601ce4ec GIT binary patch literal 28 kcmYc;N@ieSU}88K%`D#IcH1qb2R(?3l%>ettSpmcx*JT>Zhovt z>3<;p0p7)vvL|nP(7Sj40r4p4MeyKDnkK;?2Zb=adG9+P=6zpgc*O%Fq&q{=TE-AS^JCfPRuV@z(Bc0u925-pMF>`DX%t{ zOc9tz>`2+bid@a$MMSVd5n2{A9u!yPq9`T^Ia#w4DwW7F97T#sM|pH9kU^3J(crKa z&RFp9B(%@?gpmp&wQVe48M{Fko%n3Rx6wdNv`Kp1V}HupQ$KQj?njxHVntQ@x}=ht z(yky$Df0-Xm=5W&F7@+H^MInARrh%g?hfX`Wf7v!M6yYL2&9MfQ9&=RmBU8>&-}@Q z1paHT41bltU*iORDU|4Wdnx^p4xrpmT}~IkCXT0Z8TU>klzzeFI-;Kiigk&86EH|XYZ?1S6Un>3DZ+ElfI*$dJTijmBV)oN=CFv#pM<#< zJ)Uu`r-e}v+mY7L#xD1qQ0v{FYsm%KA&)&DqvE^{PW7}5C f?6%vj1FOxNChu^kW9=$^_@|d2s=(VD!mIxYqs#sN literal 0 HcmV?d00001 diff --git a/data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc b/data/people_countries_delta_dask/country=China/.part-00000-88fba1af-b28d-4303-9c85-9a97be631d40.c000.snappy.parquet.crc new file mode 100644 index 0000000000000000000000000000000000000000..f5463cc908d0d1e5947907b7e8d178da344cceae GIT binary patch literal 16 XcmYc;N@ieSU}9*R+;aBjtmm@l0o2T<^^_z8*!-z2py=tUt6GjIO$dvE@4a=3fT#t2hf#V_X1 zo~Q;^V41@zLf5(yLP(agFz$sDhbu5SV9qHBEPNgRJb(W!x2C_kpXz0hbvPG`n+$N5 zaRI-2{Q32THkL3jR6~GvBaZ<{Y`&SldwxlPlr*twq#z{gS0r7_CdqY8Nped%K`4pK zkVMxc$gzqk0t?cFo`wyg;$t|2bb}0VcQKo1@Sd2$+0K4G8nf{9C~^jT$Z!tf{HCN` z?0aDp9|dg4H&MqBS`psx+25khD2Tm)2k}Bn6y4C<7WkRh>p7(A>pUzGh1?~_mfCKh zQgW}wEuly-1*$@be3Eg2e2{PjS%Vd3l|P}}CS7YFDBy!899buP5riKSE+K24Nf8QE zvau{SLB!+{Y+D~QM(SHS5-3c#n68jsPWmH>;*_Mka~TuzRK^sMmvA3TC2yqyxLnq> zQFubxn1P)%ahBfaVdRYhEnA~Iw1Tv)KNQiKPoW8C6CTrW^nea|%v=^T?R#cw#t+Bb zY?)CEuP`>tX5Zt!8=0*;Q&Y4DwpP7@?UmdMPGB8Oe4q7v&;fv#&EIUZHJ#1MOUAOX z|BqN2`>%+5Go&3X?!P?@iwnbwHnpp?RxI@P1oVB-K;u0j@|FM+-W#YuVy#v d_1cYIomFh!;BKS0rMKa|rXTtWZ|M;J@?ST3@IU|n literal 0 HcmV?d00001 diff --git a/data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc b/data/people_countries_delta_dask/country=Germany/.part-00000-030076e1-5ec9-47c2-830a-1569f823b6ee.c000.snappy.parquet.crc new file mode 100644 index 0000000000000000000000000000000000000000..7c8b129fb104fd31e78b06c30b4e6e7d579a6475 GIT binary patch literal 20 bcmYc;N@ieSU}CtG;M85Lp6oyZeIHocbm7dEDWHAI9S{Xyr`EVF!qbn)36kWLRqueBS#;KW1n`D$) z3W^{uy7ynWtmr~;(}jX7e}afVLAvrJna&L8Mq$XD-1FY&zV|&dTi0*92-w07o}7Mp zp_-`CYYBA#GgJUTRjGu#!?4>EQLjYxonb8Y1*ulCj_+Q+`nWK&iQg|aWx`t`@*hc! zs&x?x0x#fY{Pg_&=YNiV(?ONvcdH; zJE#~8Aa82f8Kfx=ooT4VH+c3OFPcop9ohzl$zt@z)sccXvXVp_H_OSch#zk!WPc#A z1Xx~D_=RpiPSWj24CE4Qn3*LK&&I75x&;CVW;m|q#@wTf$C zA1MV#BTw@a+c_s@&!}yGn=|tO^9O_FEX=$Hqe^4d$QZfHzJMKR$k+!JuK+uyTiH50 zQ5GrdD(Azv=ZIZFp*nfK&J!6Y{b9t<)ttJs3Vh`cMY0zW~EOr3c6)EUTB1R@pu z$J~LH?(a&gZ6#?u@>8p7b^9_5603dhz{-Y1t?9(5J(c&P9=%7SFce-$Is(Y5b;h>Z z2Zx8%31c$e|3^$h{a3_;L*VP<`(GWzm2qGd*KyaJT4mWA^~0djZ3Vsz+}4_3_uZ}+ mNWba(Ud?Sa8ntU)L#(>8DTAiBVsz+NTYTUr4Qq@3{J#N6{P=YM literal 0 HcmV?d00001 diff --git a/notebooks/python-deltalake/dask-deltalake.ipynb b/notebooks/python-deltalake/dask-deltalake.ipynb index cbd14e4..423a79d 100644 --- a/notebooks/python-deltalake/dask-deltalake.ipynb +++ b/notebooks/python-deltalake/dask-deltalake.ipynb @@ -5,7 +5,7 @@ "id": "2324f66b-f367-4b07-9892-0a3d8b9153d2", "metadata": {}, "source": [ - "# Dask-Deltalake Integration\n", + "# Using Delta Lake with Dask\n", "\n", "Delta Lake is a great storage format for Dask analyses. This page will explain why and how to use Delta Lake with Dask.\n", "\n", @@ -63,7 +63,7 @@ "outputs": [], "source": [ "# read delta table into Dask DataFrame\n", - "delta_path = \"../../data/people_countries_delta_dask/\"\n", + "delta_path = \"../../data/people_countries_delta_dask\"\n", "ddf = ddt.read_deltalake(delta_path)" ] }, @@ -78,7 +78,7 @@ { "cell_type": "code", "execution_count": 3, - "id": "928a93d0-0472-4d2b-9582-282f6c227618", + "id": "cf5d9296-5914-497b-be00-fe7371ed6d57", "metadata": {}, "outputs": [ { @@ -114,21 +114,7 @@ " Ernesto\n", " Guevara\n", " Argentina\n", - " <NA>\n", - " \n", - " \n", - " 0\n", - " Wolfgang\n", - " Manche\n", - " Germany\n", - " <NA>\n", - " \n", - " \n", - " 1\n", - " Soraya\n", - " Jala\n", - " Germany\n", - " <NA>\n", + " NaN\n", " \n", " \n", " 0\n", @@ -144,17 +130,31 @@ " China\n", " Asia\n", " \n", + " \n", + " 0\n", + " Wolfgang\n", + " Manche\n", + " Germany\n", + " NaN\n", + " \n", + " \n", + " 1\n", + " Soraya\n", + " Jala\n", + " Germany\n", + " NaN\n", + " \n", " \n", "\n", "" ], "text/plain": [ " first_name last_name country continent\n", - "0 Ernesto Guevara Argentina \n", - "0 Wolfgang Manche Germany \n", - "1 Soraya Jala Germany \n", + "0 Ernesto Guevara Argentina NaN\n", "0 Bruce Lee China Asia\n", - "1 Jack Ma China Asia" + "1 Jack Ma China Asia\n", + "0 Wolfgang Manche Germany NaN\n", + "1 Soraya Jala Germany NaN" ] }, "execution_count": 3, @@ -232,7 +232,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 10, "id": "b3b749a1-6e5e-41e9-b362-9faafcf9d616", "metadata": {}, "outputs": [ @@ -242,7 +242,7 @@ "dask.dataframe.core.DataFrame" ] }, - "execution_count": 6, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -261,18 +261,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 11, "id": "eed2de99-738c-4e9b-b2e5-fcd5ab3b89f7", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/rpelgrim/miniforge3/envs/dask-delta-0140/lib/python3.11/site-packages/dask/dataframe/core.py:8272: UserWarning: Insufficient elements for `head`. 3 elements requested, only 1 elements available. Try passing larger `npartitions` to `head`.\n", - " warnings.warn(\n" - ] - }, { "data": { "text/html": [ @@ -308,16 +300,32 @@ " Argentina\n", " <NA>\n", " \n", + " \n", + " 1\n", + " Wolfgang\n", + " Manche\n", + " Germany\n", + " <NA>\n", + " \n", + " \n", + " 2\n", + " Soraya\n", + " Jala\n", + " Germany\n", + " <NA>\n", + " \n", " \n", "\n", "" ], "text/plain": [ " first_name last_name country continent\n", - "0 Ernesto Guevara Argentina " + "0 Ernesto Guevara Argentina \n", + "1 Wolfgang Manche Germany \n", + "2 Soraya Jala Germany " ] }, - "execution_count": 7, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -338,7 +346,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "id": "7e09a4a2-b2e0-44ea-b0c3-58c7fbbadeec", "metadata": {}, "outputs": [ @@ -351,7 +359,7 @@ } ], "source": [ - "!ls ../../data/people_countries_delta" + "!ls ../../data/people_countries_delta_dask" ] }, { @@ -778,7 +786,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python [conda env:dask-delta-0140] *", + "display_name": "Python [conda env:dask-delta-0140]", "language": "python", "name": "conda-env-dask-delta-0140-py" },