example of read/write parquet using dbfs:// #34

Open
raybellwaves opened this issue Mar 7, 2024 · 1 comment

Comments


raybellwaves commented Mar 7, 2024

Fun project!

I remember nerd sniping @martindurant into working on https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py (https://github.com/fsspec/filesystem_spec/blob/master/fsspec/registry.py#L152) when I was using Databricks a few years ago.

This might be a good test of parallel reads and writes of Parquet files on the Databricks File System. I'm curious whether it is any faster than S3, for example.

Given that this repo is slim, an example could be added to the README once tested.

@benrutter
Contributor

I know it's been a fair while since you asked your question. In terms of including this in the README, are you thinking of an example of dbfs:// reads, or metrics for a speed comparison?

I'm curious about both, but I'm pretty sure this project only extends as far as providing dask cluster management, so I wouldn't (at least currently) expect it to perform differently from regular dask.

Definitely a good idea to include an example. I might find some time to double-check it all works and put in a PR.
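For reference, here is a rough sketch of what such an example might look like. It has not been tested against a real workspace. It assumes the `dbfs` protocol is handled by fsspec's `DatabricksFileSystem`, which takes `instance` and `token` arguments, and that Dask forwards `storage_options` to it. The environment variable names (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`), the helper names, and the target path are all made up for illustration.

```python
import os


def dbfs_storage_options(env=None):
    """Build fsspec storage options for the dbfs:// protocol.

    DATABRICKS_HOST and DATABRICKS_TOKEN are assumed variable names,
    not something this project prescribes.
    """
    env = os.environ if env is None else env
    host = env["DATABRICKS_HOST"]
    # DatabricksFileSystem expects a bare instance host name,
    # so drop any URL scheme and trailing slash.
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    return {"instance": host.rstrip("/"), "token": env["DATABRICKS_TOKEN"]}


def roundtrip_parquet(path, storage_options):
    """Write a small Dask DataFrame to `path` as Parquet and read it back."""
    # Imported lazily so the helper above works without dask installed.
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"id": range(6), "value": list("abcdef")})
    ddf = dd.from_pandas(pdf, npartitions=2)
    ddf.to_parquet(path, storage_options=storage_options)
    return dd.read_parquet(path, storage_options=storage_options).compute()


if __name__ == "__main__":
    # Hypothetical target directory on DBFS.
    result = roundtrip_parquet(
        "dbfs:///tmp/dask_parquet_example", dbfs_storage_options()
    )
    print(result)
```

Because the filesystem is picked purely from the URL protocol, swapping the path for an `s3://` one (with matching `storage_options`) should be enough to try the speed comparison mentioned above.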

Side note: I use fsspec basically every day, and dbfs:// a fair bit, and somehow never realised until now that dbfs:// was an fsspec protocol 😅
