diff --git a/docs/cudf/source/user_guide/io/io.md b/docs/cudf/source/user_guide/io/io.md
index 62db062cc45..7d863d890e2 100644
--- a/docs/cudf/source/user_guide/io/io.md
+++ b/docs/cudf/source/user_guide/io/io.md
@@ -194,3 +194,13 @@ If no value is set, behavior will be the same as the "STABLE" option.
 +-----------------------+--------+--------+--------------+--------------+---------+--------+--------------+--------------+--------+
 ```
+
+## Low Memory Considerations
+
+By default, cuDF's parquet and json readers will try to read the entire file in one pass. This can cause problems when dealing with large datasets or when running workloads on GPUs with limited memory.
+
+To better support low memory systems, cuDF provides a "low-memory" reader for parquet and json files. This low memory reader processes data in chunks, leading to lower peak memory usage due to the smaller size of intermediate allocations.
+
+To read a parquet or json file in low memory mode, there are [cuDF options](https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options) that must be set globally prior to calling the reader. To set those options, call:
+- `cudf.set_option("io.parquet.low_memory", True)` for parquet files, or
+- `cudf.set_option("io.json.low_memory", True)` for json files.
diff --git a/python/cudf/cudf/utils/ioutils.py b/python/cudf/cudf/utils/ioutils.py
index aecb7ae7c5c..86ed749772f 100644
--- a/python/cudf/cudf/utils/ioutils.py
+++ b/python/cudf/cudf/utils/ioutils.py
@@ -210,6 +210,11 @@
 -----
 {remote_data_sources}
 
+- Setting the cudf option `io.parquet.low_memory=True` will result in the chunked
+  low memory parquet reader being used. This can make it easier to read large
+  parquet datasets on systems with limited GPU memory. See all `available options
+  `_.
+
 Examples
 --------
 >>> import cudf
@@ -758,9 +763,14 @@
 Notes
 -----
-When `engine='auto'`, and `line=False`, the `pandas` json
-reader will be used. To override the selection, please
-use `engine='cudf'`.
+- When `engine='auto'`, and `line=False`, the `pandas` json
+  reader will be used. To override the selection, please
+  use `engine='cudf'`.
+
+- Setting the cudf option `io.json.low_memory=True` will result in the chunked
+  low memory json reader being used. This can make it easier to read large
+  json datasets on systems with limited GPU memory. See all `available options
+  `_.
 
 See Also
 --------
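
Because the `io.parquet.low_memory` and `io.json.low_memory` options documented in this change are set globally, a caller may want to enable one only for the duration of a single read. The following is a minimal sketch of one way to do that, not part of this PR: `low_memory_io`, `read_parquet_low_memory`, and the file path are hypothetical names introduced for illustration. It assumes only that the object passed in exposes `get_option` and `set_option`, as the `cudf` module does.

```python
from contextlib import contextmanager


@contextmanager
def low_memory_io(options, fmt="parquet"):
    """Temporarily enable the low-memory reader option for one format.

    `options` is any object exposing get_option/set_option (for example
    the cudf module). `fmt` must be "parquet" or "json", matching the
    option names io.parquet.low_memory and io.json.low_memory.
    """
    if fmt not in ("parquet", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    key = f"io.{fmt}.low_memory"
    previous = options.get_option(key)  # remember the caller's setting
    options.set_option(key, True)
    try:
        yield
    finally:
        # Restore the previous value even if the read raised.
        options.set_option(key, previous)


def read_parquet_low_memory(path):
    """Hypothetical helper: read one parquet file with the chunked reader."""
    import cudf  # imported lazily; requires a GPU-enabled environment

    with low_memory_io(cudf, "parquet"):
        return cudf.read_parquet(path)
```

A call such as `read_parquet_low_memory("large_dataset.parquet")` (hypothetical path) would leave the global option at its previous value afterwards, so other reads in the same process are unaffected.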