Add documentation for low memory readers (#17314)

Closes #16443

Authors:
  - Brian Tepera (https://github.com/btepera)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #17314
rapidsai · Nov 14, 2024 · 9da8eb2
1 parent 353d2de
Showing 2 changed files with 23 additions and 3 deletions.
diff --git a/docs/cudf/source/user_guide/io/io.md b/docs/cudf/source/user_guide/io/io.md
@@ -194,3 +194,13 @@ If no value is set, behavior will be the same as the "STABLE" option.
     +-----------------------+--------+--------+--------------+--------------+---------+--------+--------------+--------------+--------+
 
 ```
+
+## Low Memory Considerations
+
+By default, cuDF's Parquet and JSON readers try to read the entire file in one pass. This can lead to out-of-memory errors when dealing with large datasets or when running workloads on GPUs with limited memory.
+
+To better support low-memory systems, cuDF provides a low-memory reader for Parquet and JSON files. This reader processes data in chunks, which lowers peak memory usage because intermediate allocations are smaller.
+
+To read a Parquet or JSON file in low-memory mode, set the corresponding [cuDF option](https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options) globally before calling the reader:
+- `cudf.set_option("io.parquet.low_memory", True)` for Parquet files, or
+- `cudf.set_option("io.json.low_memory", True)` for JSON files.
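
The workflow described in the added section can be sketched roughly as follows. This is an illustrative sketch, not part of the commit: the option names come from the documentation above, while `LOW_MEMORY_OPTIONS`, `low_memory_option`, and `read_low_memory` are hypothetical helper names. It assumes a GPU-enabled cuDF installation is available when `read_low_memory` is called.

```python
# Option names are taken from the documentation above; the helper names
# below are hypothetical and exist only for this illustration.
LOW_MEMORY_OPTIONS = {
    "parquet": "io.parquet.low_memory",
    "json": "io.json.low_memory",
}


def low_memory_option(file_format):
    """Return the cuDF option name that enables low-memory reads."""
    try:
        return LOW_MEMORY_OPTIONS[file_format.lower()]
    except KeyError:
        raise ValueError(f"no low-memory reader for {file_format!r}") from None


def read_low_memory(path, file_format="parquet", **reader_kwargs):
    """Enable the low-memory option globally, then read the file."""
    import cudf  # imported lazily: requires a GPU-enabled cuDF install

    # The option must be set before the reader is called.
    cudf.set_option(low_memory_option(file_format), True)
    if file_format.lower() == "parquet":
        return cudf.read_parquet(path, **reader_kwargs)
    return cudf.read_json(path, **reader_kwargs)
```

A call such as `read_low_memory("data.parquet")` (the path is a placeholder) would then go through the chunked reader. Because the option is set globally, it stays enabled for later reads in the same process unless it is reset with `cudf.set_option`.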
diff --git a/python/cudf/cudf/utils/ioutils.py b/python/cudf/cudf/utils/ioutils.py
@@ -210,6 +210,11 @@
 -----
 {remote_data_sources}
 
+- Setting the cuDF option `io.parquet.low_memory=True` selects the chunked,
+  low-memory Parquet reader. This can make it easier to read large
+  Parquet datasets on systems with limited GPU memory. See all `available options
+  <https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options>`_.
+
 Examples
 --------
 >>> import cudf
@@ -758,9 +763,14 @@
 
 Notes
 -----
-When `engine='auto'`, and `line=False`, the `pandas` json
-reader will be used. To override the selection, please
-use `engine='cudf'`.
+- When `engine='auto'` and `lines=False`, the `pandas` JSON
+  reader will be used. To override the selection,
+  use `engine='cudf'`.
+
+- Setting the cuDF option `io.json.low_memory=True` selects the chunked,
+  low-memory JSON reader. This can make it easier to read large
+  JSON datasets on systems with limited GPU memory. See all `available options
+  <https://docs.rapids.ai/api/cudf/nightly/user_guide/api_docs/options/#api-options>`_.
 
 See Also
 --------