Run TPC-H on Databricks #315
Initially I was running into this bug, which I resolved by downgrading from
@jacobtomlinson just to clarify, is the goal here to:
You also mentioned something about running benchmarks for cudf.pandas; is that part of this?
@rjzamora with your recent work merged into
The primary goal is dask-expr + cudf on Databricks. The secondary goal is cudf.pandas on Databricks.
Thanks for working on this @skirui-source !
I haven't updated my tpch-rapids branch of coiled/benchmarks to align with those changes. To run locally, I started with a 24.04 or 24.06 RAPIDS environment, then did the following:
Then, from within
The timing result will be appended to a local `benchmark.db` file, which can be inspected with:

```python
import pandas as pd
import sqlite3

con = sqlite3.connect("./benchmark.db")
df = pd.read_sql_query("SELECT * from test_run", con)
print(df[["name", "duration"]])
con.close()
```

Important notes on the data: I believe I needed to jump through a few annoying hoops to both generate the data and modify the code to handle differences between my data and the S3 data used by Coiled. For example, my data currently uses a single Parquet file for each table, while Coiled uses a directory of files (which definitely makes more sense beyond sf100).
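To smooth over the single-file vs. directory-of-files difference, a small path-normalization helper can let the same read code handle both layouts. This is a hypothetical sketch (the `parquet_sources` name and the `*.parquet` glob pattern are assumptions, not code from the branch):

```python
from pathlib import Path

def parquet_sources(table_path):
    """Return the list of Parquet files backing a table, whether the
    table is stored as a single file (my local data) or as a directory
    of part files (the Coiled S3 layout)."""
    p = Path(table_path)
    if p.is_dir():
        # Directory layout: collect all part files in a stable order.
        return sorted(str(f) for f in p.glob("*.parquet"))
    # Single-file layout: wrap in a list so callers see one shape.
    return [str(p)]
```

Callers can then pass the result straight to a reader that accepts a list of paths, so the benchmark code itself no longer cares which layout it was handed.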
I would definitely focus on dask-expr + cudf for now. I believe pandas->arrow->pandas conversion is still a problem in cudf.pandas, and dask/dask-expr will do this a lot.
initial batch of results (tested on dgx14):
Reproduce, on Databricks, the TPC-H work that @rjzamora has been looking into with Dask+cuDF and the TPC-H work Coiled has been doing with Dask.
https://tpch.coiled.io/