Add documentation for low memory readers (#17314)
Closes #16443

Authors:
  - Brian Tepera (https://github.com/btepera)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17314
btepera authored Nov 14, 2024
1 parent 353d2de commit 9da8eb2
Showing 2 changed files with 23 additions and 3 deletions.
10 changes: 10 additions & 0 deletions docs/cudf/source/user_guide/io/io.md
@@ -194,3 +194,13 @@ If no value is set, behavior will be the same as the "STABLE" option.
+-----------------------+--------+--------+--------------+--------------+---------+--------+--------------+--------------+--------+
```

## Low Memory Considerations

By default, cuDF's Parquet and JSON readers try to read the entire file in one pass. This can exhaust available memory when datasets are large or when workloads run on GPUs with limited memory.

To better support low-memory systems, cuDF provides a "low-memory" reader for Parquet and JSON files. This reader processes data in chunks, which lowers peak memory usage because intermediate allocations are smaller.

To read a Parquet or JSON file in low-memory mode, set the corresponding [cuDF option](https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options) globally before calling the reader:
- `cudf.set_option("io.parquet.low_memory", True)` for Parquet files, or
- `cudf.set_option("io.json.low_memory", True)` for JSON files.
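
The option calls above can be wrapped in a small helper. The following is a minimal sketch, not part of cuDF: the names `LOW_MEMORY_OPTIONS`, `low_memory_option`, and `read_low_memory` are hypothetical, and the JSON branch assumes JSON Lines input read with the cuDF engine. Only `cudf.set_option`, `cudf.read_parquet`, and `cudf.read_json` are actual cuDF calls.

```python
# Sketch only: the helper names below are hypothetical, not part of cuDF.
LOW_MEMORY_OPTIONS = {
    "parquet": "io.parquet.low_memory",
    "json": "io.json.low_memory",
}


def low_memory_option(file_format):
    """Return the cuDF option name that enables the low-memory reader."""
    return LOW_MEMORY_OPTIONS[file_format.lower()]


def read_low_memory(path, file_format="parquet"):
    """Enable the low-memory reader globally, then read the file."""
    # Imported lazily because cuDF requires a CUDA-capable GPU environment.
    import cudf

    # The option is global: it stays set for later reads as well.
    cudf.set_option(low_memory_option(file_format), True)
    if file_format.lower() == "parquet":
        return cudf.read_parquet(path)
    # Assumption: the input is JSON Lines and is read with the cuDF engine.
    return cudf.read_json(path, lines=True, engine="cudf")
```

Because the option is set globally, it remains in effect after the helper returns. Setting it back to `False` with `cudf.set_option` should restore the default single-pass behavior.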
16 changes: 13 additions & 3 deletions python/cudf/cudf/utils/ioutils.py
@@ -210,6 +210,11 @@
-----
{remote_data_sources}
- Setting the cudf option `io.parquet.low_memory=True` will result in the chunked
low memory parquet reader being used. This can make it easier to read large
parquet datasets on systems with limited GPU memory. See all `available options
<https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options>`_.
Examples
--------
>>> import cudf
@@ -758,9 +763,14 @@
Notes
-----
When `engine='auto'`, and `line=False`, the `pandas` json
reader will be used. To override the selection, please
use `engine='cudf'`.
- When `engine='auto'`, and `line=False`, the `pandas` json
reader will be used. To override the selection, please
use `engine='cudf'`.
- Setting the cudf option `io.json.low_memory=True` will result in the chunked
low memory json reader being used. This can make it easier to read large
json datasets on systems with limited GPU memory. See all `available options
<https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options>`_.
See Also
--------