Fix dask_cudf.read_csv
#17612
Conversation
NOTE: This code was translated from the "legacy" code in `dask_cudf/_legacy/io/csv.py`. After this PR is merged, that file can be removed in #17558.
# Test chunksize deprecation
with pytest.warns(FutureWarning, match="deprecated"):
    df3 = dask_cudf.read_csv(path, chunksize=None, dtype=typ)
dd.assert_eq(df, df3)
This was deprecated a long time ago; the new code path no longer tries to catch it.
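For illustration, the updated test can simply drop the deprecation check (a minimal sketch; the exact replacement test body is assumed, not taken from this PR):

# Sketch: chunksize is no longer intercepted, so the test just reads normally.
# `path`, `typ`, and `df` are assumed to come from the surrounding test.
df3 = dask_cudf.read_csv(path, dtype=typ)
dd.assert_eq(df, df3)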
with pytest.warns(match="dask_cudf.io.csv.read_csv is now deprecated"):
    df2 = dask_cudf.io.csv.read_csv(csv_path)
dd.assert_eq(df, df2, check_divisions=False)
The new csv code now lives at this "expected" path.
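Put differently, callers should use the top-level entry point; the module-level path still works but warns. A minimal usage sketch (file name hypothetical):

import dask_cudf

# Recommended: the top-level API
df = dask_cudf.read_csv("data.csv")  # hypothetical file

# Deprecated: the module-level path now emits the warning tested above
df2 = dask_cudf.io.csv.read_csv("data.csv")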
I think this looks fine; I left some suggestions.
import dask_expr as dx
from fsspec.utils import stringify_path

try:
    # TODO: Remove when cudf is pinned to dask>2024.12.0
Does it make sense for us to do this? `rapids-dask-dependency` should always pick the latest Dask, no?
By "when cudf is pinned", I mean "when rapids-dask-dependency is pinned". Is that your question?
rapids-dask-dependency
currently pins to >=2024.11.2
. This means another package with a dask<2024.12.0
requirement can still give us dask-2024.11.2
in practice, no?
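For context, the TODO guards an import fallback between dask versions. A generic sketch of that pattern, with assumed module locations (the PR's actual imports may differ):

# Hypothetical sketch of the version-compatibility fallback the TODO refers to.
try:
    # Newer layout: dask-expr vendored into dask (dask>2024.12.0) -- assumed location
    from dask.dataframe import dask_expr as dx
except ImportError:
    # Older layout: the standalone dask-expr package
    import dask_expr as dx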
>>> import dask_cudf
>>> df = dask_cudf.read_csv("myfiles.*.csv")

In some cases it can break up large files:
What cases are those? Or is it always dependent upon the file size and the value of `blocksize`? If the latter, maybe we should just rephrase to "It can also break up large files by specifying the size of each block via `blocksize`".
Sorry, I didn't spend any time reviewing these docstrings, because they were directly copied from `dask_cudf/_legacy/io/csv.py`. Does it make sense to address these suggestions/questions in a follow-up (just to make sure CI is unblocked)?
>>> df = dask_cudf.read_csv("largefile.csv", blocksize="256 MiB")

It can read CSV files from external resources (e.g. S3, HTTP, FTP)
Suggested change:
- It can read CSV files from external resources (e.g. S3, HTTP, FTP)
+ It can read CSV files from external resources (e.g. S3, HTTP, FTP):
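For reference, remote reads use the same call with a URL-style path (bucket and file names hypothetical):

>>> df = dask_cudf.read_csv("s3://my-bucket/myfiles.*.csv")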
----------
path : str, path object, or file-like object
    Either a path to a file (a str, :py:class:`pathlib.Path`, or
    py._path.local.LocalPath), URL (including http, ftp, and S3
Suggested change:
- py._path.local.LocalPath), URL (including http, ftp, and S3
+ ``py._path.local.LocalPath``), URL (including HTTP, FTP, and S3
Maybe it could also just be :py:class:`py._path.local.LocalPath`?
path : str, path object, or file-like object
    Either a path to a file (a str, :py:class:`pathlib.Path`, or
    py._path.local.LocalPath), URL (including http, ftp, and S3
    locations), or any object with a read() method (such as
Suggested change:
- locations), or any object with a read() method (such as
+ locations), or any object with a ``read()`` method (such as
    The target task partition size. If ``None``, a single block
    is used for each file.
**kwargs : dict
    Passthrough key-word arguments that are sent to
Suggested change:
- Passthrough key-word arguments that are sent to
+ Passthrough keyword arguments that are sent to
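For illustration, any keyword accepted by the underlying cudf reader can be forwarded through this passthrough (a sketch; the file and column names are hypothetical):

>>> df = dask_cudf.read_csv(
...     "data.csv",
...     blocksize="128 MiB",
...     usecols=["a", "b"],               # forwarded to the cudf backend
...     dtype={"a": "int64", "b": "str"},
... )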
Thanks Rick
/merge
Description

Recent changes in dask and dask-expr have broken `dask_cudf.read_csv` (dask/dask-expr#1178, dask/dask#11603). Fortunately, the breaking changes help us avoid legacy CSV code in the long run.

Checklist