diff --git a/cpp/test/matrix/select_k.cuh b/cpp/test/matrix/select_k.cuh index f22f4f5fa7..7796b4b2fe 100644 --- a/cpp/test/matrix/select_k.cuh +++ b/cpp/test/matrix/select_k.cuh @@ -13,6 +13,7 @@ * See the License for the specific language governing permissions and * limitations under the License. */ +#pragma once #include "../test_utils.cuh" diff --git a/cpp/test/matrix/select_large_k.cu b/cpp/test/matrix/select_large_k.cu index baa07f5e87..ec993ee979 100644 --- a/cpp/test/matrix/select_large_k.cu +++ b/cpp/test/matrix/select_large_k.cu @@ -24,11 +24,11 @@ auto inputs_random_largek = testing::Values(select::params{100, 100000, 1000, tr select::params{100, 100000, 2048, false}, select::params{100, 100000, 1237, true}); -using ReferencedRandomFloatSizeT = +using ReferencedRandomFloatLargeSizeT = SelectK::params_random>; -TEST_P(ReferencedRandomFloatSizeT, LargeK) { run(); } // NOLINT -INSTANTIATE_TEST_CASE_P(SelectK, // NOLINT - ReferencedRandomFloatSizeT, +TEST_P(ReferencedRandomFloatLargeSizeT, LargeK) { run(); } // NOLINT +INSTANTIATE_TEST_CASE_P(SelectK, // NOLINT + ReferencedRandomFloatLargeSizeT, testing::Combine(inputs_random_largek, testing::Values(SelectAlgo::kRadix11bits, SelectAlgo::kRadix11bitsExtraPass))); diff --git a/docs/source/cpp_api.rst b/docs/source/cpp_api.rst index e60ef4e697..74f706bf46 100644 --- a/docs/source/cpp_api.rst +++ b/docs/source/cpp_api.rst @@ -8,13 +8,10 @@ C++ API :maxdepth: 4 cpp_api/core.rst - cpp_api/cluster.rst - cpp_api/distance.rst cpp_api/linalg.rst cpp_api/matrix.rst cpp_api/mdspan.rst cpp_api/mnmg.rst - cpp_api/neighbors.rst cpp_api/random.rst cpp_api/solver.rst cpp_api/sparse.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index bee0e948ff..fb2a421652 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -36,16 +36,12 @@ While not exhaustive, the following general categories help summarize the accele * - Category - Examples - * - Nearest Neighbors - - pairwise distances, vector search, 
epsilon neighborhoods, neighborhood graph construction * - Data Formats - sparse & dense, conversions, data generation * - Dense Operations - linear algebra, matrix and vector operations, slicing, norms, factorization, least squares, svd & eigenvalue problems * - Sparse Operations - linear algebra, eigenvalue problems, slicing, norms, reductions, factorization, symmetrization, components & labeling - * - Basic Clustering - - spectral clustering, hierarchical clustering, k-means * - Solvers - combinatorial optimization, iterative solvers * - Statistics @@ -61,9 +57,6 @@ While not exhaustive, the following general categories help summarize the accele build.md cpp_api.rst pylibraft_api.rst - using_libraft.md - vector_search_tutorial.md - raft_ann_benchmarks.md raft_dask_api.rst using_raft_comms.rst developer_guide.md diff --git a/docs/source/raft_ann_benchmarks.md b/docs/source/raft_ann_benchmarks.md deleted file mode 100644 index 12a94e45ce..0000000000 --- a/docs/source/raft_ann_benchmarks.md +++ /dev/null @@ -1,597 +0,0 @@ -# RAFT ANN Benchmarks - -This project provides a benchmark program for various ANN search implementations. It's especially suitable for comparing GPU implementations as well as comparing GPU against CPU. - -> [!IMPORTANT] -> The vector search and clustering algorithms in RAFT are being migrated to a new library dedicated to vector search called [cuVS](https://github.com/rapidsai/cuvs). As a result, `raft-ann-bench` is being migrated to `cuvs-bench` and will be removed from RAFT altogether in the 24.12 (December) release. 
-
-
-## Table of Contents
-
-- [Installing the benchmarks](#installing-the-benchmarks)
-  - [Conda](#conda)
-  - [Docker](#docker)
-- [How to run the benchmarks](#how-to-run-the-benchmarks)
-  - [Step 1: prepare dataset](#step-1-prepare-dataset)
-  - [Step 2: build and search index](#step-2-build-and-search-index)
-  - [Step 3: data export](#step-3-data-export)
-  - [Step 4: plot results](#step-4-plot-results)
-- [Running the benchmarks](#running-the-benchmarks)
-  - [End to end: small-scale (<1M to 10M)](#end-to-end-small-scale-benchmarks-1m-to-10m)
-  - [End to end: large-scale (>10M)](#end-to-end-large-scale-benchmarks-10m-vectors)
-  - [Running with Docker containers](#running-with-docker-containers)
-  - [Evaluating the results](#evaluating-the-results)
-- [Creating and customizing dataset configurations](#creating-and-customizing-dataset-configurations)
-- [Adding a new ANN algorithm](#adding-a-new-ann-algorithm)
-- [Parameter tuning guide](https://docs.rapids.ai/api/raft/nightly/ann_benchmarks_param_tuning/)
-- [Wiki-all RAG/LLM Dataset](https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/)
-
-## Installing the benchmarks
-
-There are two main ways pre-compiled benchmarks are distributed:
-
-- [Conda](#Conda): For users not using containers but want an easy to install and use Python package. Pip wheels are planned to be added as an alternative for users that cannot use conda and prefer to not use containers.
-- [Docker](#Docker): Only needs docker and [NVIDIA docker](https://github.com/NVIDIA/nvidia-docker) to use. Provides a single docker run command for basic dataset benchmarking, as well as all the functionality of the conda solution inside the containers.
-
-## Conda
-
-If containers are not an option or not preferred, the easiest way to install the ANN benchmarks is through conda. We provide packages for GPU enabled systems, as well for systems without a GPU.
We suggest using mamba as it generally leads to a faster install time: - -```bash - -mamba create --name raft_ann_benchmarks -conda activate raft_ann_benchmarks - -# to install GPU package: -mamba install -c rapidsai -c conda-forge -c nvidia raft-ann-bench= cuda-version=11.8* - -# to install CPU package for usage in CPU-only systems: -mamba install -c rapidsai -c conda-forge raft-ann-bench-cpu -``` - -The channel `rapidsai` can easily be substituted `rapidsai-nightly` if nightly benchmarks are desired. The CPU package currently allows to run the HNSW benchmarks. - -Please see the [build instructions](ann_benchmarks_build.md) to build the benchmarks from source. - -## Docker - -We provide images for GPU enabled systems, as well as systems without a GPU. The following images are available: - -- `raft-ann-bench`: Contains GPU and CPU benchmarks, can run all algorithms supported. Will download million-scale datasets as required. Best suited for users that prefer a smaller container size for GPU based systems. Requires the NVIDIA Container Toolkit to run GPU algorithms, can run CPU algorithms without it. -- `raft-ann-bench-datasets`: Contains the GPU and CPU benchmarks with million-scale datasets already included in the container. Best suited for users that want to run multiple million scale datasets already included in the image. -- `raft-ann-bench-cpu`: Contains only CPU benchmarks with minimal size. Best suited for users that want the smallest containers to reproduce benchmarks on systems without a GPU. - -Nightly images are located in [dockerhub](https://hub.docker.com/r/rapidsai/raft-ann-bench/tags), meanwhile release (stable) versions are located in [NGC](https://hub.docker.com/r/rapidsai/raft-ann-bench), starting with release 23.12. 
- -- The following command pulls the nightly container for python version 10, cuda version 12, and RAFT version 23.10: - -```bash -docker pull rapidsai/raft-ann-bench:24.12a-cuda12.0-py3.10 #substitute raft-ann-bench for the exact desired container. -``` - -The CUDA and python versions can be changed for the supported values: - -Supported CUDA versions: 11.2 and 12.0 -Supported Python versions: 3.9 and 3.10. - -You can see the exact versions as well in the dockerhub site: - -- [RAFT ANN Benchmark images](https://hub.docker.com/r/rapidsai/raft-ann-bench/tags) -- [RAFT ANN Benchmark with datasets preloaded images](https://hub.docker.com/r/rapidsai/raft-ann-bench-cpu/tags) -- [RAFT ANN Benchmark CPU only images](https://hub.docker.com/r/rapidsai/raft-ann-bench-datasets/tags) - -**Note:** GPU containers use the CUDA toolkit from inside the container, the only requirement is a driver installed on the host machine that supports that version. So, for example, CUDA 11.8 containers can run in systems with a CUDA 12.x capable driver. Please also note that the Nvidia-Docker runtime from the [Nvidia Container Toolkit](https://github.com/NVIDIA/nvidia-docker) is required to use GPUs inside docker containers. - -[//]: # (- The following command (only available after RAPIDS 23.10 release) pulls the container:) - -[//]: # () -[//]: # (```bash) - -[//]: # (docker pull nvcr.io/nvidia/rapidsai/raft-ann-bench:24.12-cuda11.8-py3.10 #substitute raft-ann-bench for the exact desired container.) - -[//]: # (```) - -## How to run the benchmarks - -We provide a collection of lightweight Python scripts to run the benchmarks. There are 4 general steps to running the benchmarks and visualizing the results. -1. Prepare Dataset -2. Build Index and Search Index -3. Data Export -4. Plot Results - -### Step 1: Prepare Dataset -The script `raft_ann_bench.get_dataset` will download and unpack the dataset in directory -that the user provides. 
As of now, only million-scale datasets are supported by this -script. For more information on [datasets and formats](ann_benchmarks_dataset.md). - -The usage of this script is: -```bash -usage: get_dataset.py [-h] [--name NAME] [--dataset-path DATASET_PATH] [--normalize] - -options: - -h, --help show this help message and exit - --dataset DATASET dataset to download (default: glove-100-angular) - --dataset-path DATASET_PATH - path to download dataset (default: ${RAPIDS_DATASET_ROOT_DIR}) - --normalize normalize cosine distance to inner product (default: False) -``` - -When option `normalize` is provided to the script, any dataset that has cosine distances -will be normalized to inner product. So, for example, the dataset `glove-100-angular` -will be written at location `datasets/glove-100-inner/`. - -### Step 2: Build and Search Index -The script `raft_ann_bench.run` will build and search indices for a given dataset and its -specified configuration. - -The usage of the script `raft_ann_bench.run` is: -```bash -usage: __main__.py [-h] [--subset-size SUBSET_SIZE] [-k COUNT] [-bs BATCH_SIZE] [--dataset-configuration DATASET_CONFIGURATION] [--configuration CONFIGURATION] [--dataset DATASET] - [--dataset-path DATASET_PATH] [--build] [--search] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] [-f] [-m SEARCH_MODE] - -options: - -h, --help show this help message and exit - --subset-size SUBSET_SIZE - the number of subset rows of the dataset to build the index (default: None) - -k COUNT, --count COUNT - the number of nearest neighbors to search for (default: 10) - -bs BATCH_SIZE, --batch-size BATCH_SIZE - number of query vectors to use in each query trial (default: 10000) - --dataset-configuration DATASET_CONFIGURATION - path to YAML configuration file for datasets (default: None) - --configuration CONFIGURATION - path to YAML configuration file or directory for algorithms Any run groups found in the specified file/directory will automatically 
override groups of the same name - present in the default configurations, including `base` (default: None) - --dataset DATASET name of dataset (default: glove-100-inner) - --dataset-path DATASET_PATH - path to dataset folder, by default will look in RAPIDS_DATASET_ROOT_DIR if defined, otherwise a datasets subdirectory from the calling directory (default: - os.getcwd()/datasets/) - --build - --search - --algorithms ALGORITHMS - run only comma separated list of named algorithms. If parameters `groups` and `algo-groups are both undefined, then group `base` is run by default (default: None) - --groups GROUPS run only comma separated groups of parameters (default: base) - --algo-groups ALGO_GROUPS - add comma separated . to run. Example usage: "--algo-groups=raft_cagra.large,hnswlib.large" (default: None) - -f, --force re-run algorithms even if their results already exist (default: False) - -m SEARCH_MODE, --search-mode SEARCH_MODE - run search in 'latency' (measure individual batches) or 'throughput' (pipeline batches and measure end-to-end) mode (default: throughput) - -t SEARCH_THREADS, --search-threads SEARCH_THREADS - specify the number threads to use for throughput benchmark. Single value or a pair of min and max separated by ':'. Example --search-threads=1:4. Power of 2 values between 'min' and 'max' will be used. If only 'min' is - specified, then a single test is run with 'min' threads. By default min=1, max=. (default: None) - -r, --dry-run dry-run mode will convert the yaml config for the specified algorithms and datasets to the json format that's consumed by the lower-level c++ binaries and then print the command to run execute the benchmarks but - will not actually execute the command. 
(default: False) -``` - -`dataset`: name of the dataset to be searched in [datasets.yaml](#yaml-dataset-config) - -`dataset-configuration`: optional filepath to custom dataset YAML config which has an entry for arg `dataset` - -`configuration`: optional filepath to YAML configuration for an algorithm or to directory that contains YAML configurations for several algorithms. [Here's how to configure an algorithm.](#yaml-algo-config) - -`algorithms`: runs all algorithms that it can find in YAML configs found by `configuration`. By default, only `base` group will be run. - -`groups`: run only specific groups of parameters configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default run `base` group - -`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to run the benchmark for in addition to all the arguments from `algorithms` and `groups`. It is of the format `.`, or for example, `raft_cagra.large` - -For every algorithm run by this script, it outputs an index build statistics JSON file in `/result/build/<{algo},{group}.json>` -and an index search statistics JSON file in `/result/search/<{algo},{group},k{k},bs{batch_size}.json>`. NOTE: The filenames will not have ",{group}" if `group = "base"`. - -`dataset-path` : -1. data is read from `/` -2. indices are built in `//index` -3. build/search results are stored in `//result` - -`build` and `search` : if both parameters are not supplied to the script then -it is assumed both are `True`. - -`indices` and `algorithms` : these parameters ensure that the algorithm specified for an index -is available in `algos.yaml` and not disabled, as well as having an associated executable. - -### Step 3: Data Export -The script `raft_ann_bench.data_export` will convert the intermediate JSON outputs produced by `raft_ann_bench.run` to more -easily readable CSV files, which are needed to build charts made by `raft_ann_bench.plot`. 
- -```bash -usage: data_export.py [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] - -options: - -h, --help show this help message and exit - --dataset DATASET dataset to download (default: glove-100-inner) - --dataset-path DATASET_PATH - path to dataset folder (default: ${RAPIDS_DATASET_ROOT_DIR}) -``` -Build statistics CSV file is stored in `/result/build/<{algo},{group}.csv>` -and index search statistics CSV file in `/result/search/<{algo},{group},k{k},bs{batch_size},{suffix}.csv>`, where suffix has three values: -1. `raw`: All search results are exported -2. `throughput`: Pareto frontier of throughput results is exported -3. `latency`: Pareto frontier of latency results is exported - - -### Step 4: Plot Results -The script `raft_ann_bench.plot` will plot results for all algorithms found in index search statistics -CSV files `/result/search/*.csv`. - -The usage of this script is: -```bash -usage: [-h] [--dataset DATASET] [--dataset-path DATASET_PATH] [--output-filepath OUTPUT_FILEPATH] [--algorithms ALGORITHMS] [--groups GROUPS] [--algo-groups ALGO_GROUPS] - [-k COUNT] [-bs BATCH_SIZE] [--build] [--search] [--x-scale X_SCALE] [--y-scale {linear,log,symlog,logit}] [--x-start X_START] [--mode {throughput,latency}] - [--time-unit {s,ms,us}] [--raw] - -options: - -h, --help show this help message and exit - --dataset DATASET dataset to plot (default: glove-100-inner) - --dataset-path DATASET_PATH - path to dataset folder (default: /home/coder/raft/datasets/) - --output-filepath OUTPUT_FILEPATH - directory for PNG to be saved (default: /home/coder/raft) - --algorithms ALGORITHMS - plot only comma separated list of named algorithms. If parameters `groups` and `algo-groups are both undefined, then group `base` is plot by default - (default: None) - --groups GROUPS plot only comma separated groups of parameters (default: base) - --algo-groups ALGO_GROUPS, --algo-groups ALGO_GROUPS - add comma separated . to plot. 
Example usage: "--algo-groups=raft_cagra.large,hnswlib.large" (default: None) - -k COUNT, --count COUNT - the number of nearest neighbors to search for (default: 10) - -bs BATCH_SIZE, --batch-size BATCH_SIZE - number of query vectors to use in each query trial (default: 10000) - --build - --search - --x-scale X_SCALE Scale to use when drawing the X-axis. Typically linear, logit or a2 (default: linear) - --y-scale {linear,log,symlog,logit} - Scale to use when drawing the Y-axis (default: linear) - --x-start X_START Recall values to start the x-axis from (default: 0.8) - --mode {throughput,latency} - search mode whose Pareto frontier is used on the y-axis (default: throughput) - --time-unit {s,ms,us} - time unit to plot when mode is latency (default: ms) - --raw Show raw results (not just Pareto frontier) of mode arg (default: False) -``` -`mode`: plots pareto frontier of `throughput` or `latency` results exported in the previous step - -`algorithms`: plots all algorithms that it can find results for the specified `dataset`. By default, only `base` group will be plotted. - -`groups`: plot only specific groups of parameters configurations for an algorithm. Groups are defined in YAML configs (see `configuration`), and by default run `base` group - -`algo-groups`: this parameter is helpful to append any specific algorithm+group combination to plot results for in addition to all the arguments from `algorithms` and `groups`. It is of the format `.`, or for example, `raft_cagra.large` - -The figure below is the resulting plot of running our benchmarks as of August 2023 for a batch size of 10, on an NVIDIA H100 GPU and an Intel Xeon Platinum 8480CL CPU. It presents the throughput (in Queries-Per-Second) performance for every level of recall. 
- -![Throughput vs recall plot comparing popular ANN algorithms with RAFT's at batch size 10](../../img/raft-vector-search-batch-10.png) - -## Running the benchmarks - -### End to end: small-scale benchmarks (<1M to 10M) - -The steps below demonstrate how to download, install, and run benchmarks on a subset of 10M vectors from the Yandex Deep-1B dataset By default the datasets will be stored and used from the folder indicated by the `RAPIDS_DATASET_ROOT_DIR` environment variable if defined, otherwise a datasets sub-folder from where the script is being called: - -```bash - -# (1) prepare dataset. -python -m raft_ann_bench.get_dataset --dataset deep-image-96-angular --normalize - -# (2) build and search index -python -m raft_ann_bench.run --dataset deep-image-96-inner --algorithms raft_cagra --batch-size 10 -k 10 - -# (3) export data -python -m raft_ann_bench.data_export --dataset deep-image-96-inner - -# (4) plot results -python -m raft_ann_bench.plot --dataset deep-image-96-inner -``` - -Configuration files already exist for the following list of the million-scale datasets. Please refer to [ann-benchmarks datasets](https://github.com/erikbern/ann-benchmarks/#data-sets) for more information, including actual train and sizes. These all work out-of-the-box with the `--dataset` argument. Other million-scale datasets from `ann-benchmarks.com` will work, but will require a json configuration file to be created in `$CONDA_PREFIX/lib/python3.xx/site-packages/raft_ann_bench/run/conf`, or you can specify the `--configuration` option to use a specific file. 
- -| Dataset Name | Train Rows | Columns | Test Rows | Distance | -|-----|------------|----|----------------|------------| -| `deep-image-96-angular` | 10M | 96 | 10K | Angular | -| `fashion-mnist-784-euclidean` | 60K | 784 | 10K | Euclidean | -| `glove-50-angular` | 1.1M | 50 | 10K | Angular | -| `glove-100-angular` | 1.1M | 100 | 10K | Angular | -| `mnist-784-euclidean` | 60K | 784 | 10K | Euclidean | -| `nytimes-256-angular` | 290K | 256 | 10K | Angular | -| `sift-128-euclidean` | 1M | 128 | 10K | Euclidean| - -All of the datasets above contain ground test datasets with 100 neighbors. Thus `k` for these datasets must be less than or equal to 100. - -### End to end: large-scale benchmarks (>10M vectors) - -`raft_ann_bench.get_dataset` cannot be used to download the [billion-scale datasets](ann_benchmarks_dataset.md#billion-scale) -due to their size. You should instead use our billion-scale datasets guide to download and prepare them. -All other python commands mentioned below work as intended once the -billion-scale dataset has been downloaded. -To download billion-scale datasets, visit [big-ann-benchmarks](http://big-ann-benchmarks.com/neurips21.html) - -We also provide a new dataset called `wiki-all` containing 88 million 768-dimensional vectors. This dataset is meant for benchmarking a realistic retrieval-augmented generation (RAG)/LLM embedding size at scale. It also contains 1M and 10M vector subsets for smaller-scale experiments. See our [Wiki-all Dataset Guide](https://docs.rapids.ai/api/raft/nightly/wiki_all_dataset/) for more information and to download the dataset. - -The steps below demonstrate how to download, install, and run benchmarks on a subset of 100M vectors from the Yandex Deep-1B dataset. Please note that datasets of this scale are recommended for GPUs with larger amounts of memory, such as the A100 or H100. 
-```bash - -mkdir -p datasets/deep-1B -# (1) prepare dataset -# download manually "Ground Truth" file of "Yandex DEEP" -# suppose the file name is deep_new_groundtruth.public.10K.bin -python -m raft_ann_bench.split_groundtruth --groundtruth datasets/deep-1B/deep_new_groundtruth.public.10K.bin -# two files 'groundtruth.neighbors.ibin' and 'groundtruth.distances.fbin' should be produced - -# (2) build and search index -python -m raft_ann_bench.run --dataset deep-1B --algorithms raft_cagra --batch-size 10 -k 10 - -# (3) export data -python -m raft_ann_bench.data_export --dataset deep-1B - -# (4) plot results -python -m raft_ann_bench.plot --dataset deep-1B -``` - -The usage of `python -m raft_ann_bench.split_groundtruth` is: -```bash -usage: split_groundtruth.py [-h] --groundtruth GROUNDTRUTH - -options: - -h, --help show this help message and exit - --groundtruth GROUNDTRUTH - Path to billion-scale dataset groundtruth file (default: None) -``` - -### Running with Docker containers - -Two methods are provided for running the benchmarks with the Docker containers. - -#### End-to-end run on GPU - -When no other entrypoint is provided, an end-to-end script will run through all the steps in [Running the benchmarks](#running-the-benchmarks) above. 
- -For GPU-enabled systems, the `DATA_FOLDER` variable should be a local folder where you want datasets stored in `$DATA_FOLDER/datasets` and results in `$DATA_FOLDER/result` (we highly recommend `$DATA_FOLDER` to be a dedicated folder for the datasets and results of the containers): -```bash -export DATA_FOLDER=path/to/store/datasets/and/results -docker run --gpus all --rm -it -u $(id -u) \ - -v $DATA_FOLDER:/data/benchmarks \ - rapidsai/raft-ann-bench:24.12a-cuda11.8-py3.10 \ - "--dataset deep-image-96-angular" \ - "--normalize" \ - "--algorithms raft_cagra,raft_ivf_pq --batch-size 10 -k 10" \ - "" -``` - -Usage of the above command is as follows: - -| Argument | Description | -|-----------------------------------------------------------|----------------------------------------------------------------------------------------------------| -| `rapidsai/raft-ann-bench:24.12a-cuda11.8-py3.10` | Image to use. Can be either `raft-ann-bench` or `raft-ann-bench-datasets` | -| `"--dataset deep-image-96-angular"` | Dataset name | -| `"--normalize"` | Whether to normalize the dataset | -| `"--algorithms raft_cagra,hnswlib --batch-size 10 -k 10"` | Arguments passed to the `run` script, such as the algorithms to benchmark, the batch size, and `k` | -| `""` | Additional (optional) arguments that will be passed to the `plot` script. | - -***Note about user and file permissions:*** The flag `-u $(id -u)` allows the user inside the container to match the `uid` of the user outside the container, allowing the container to read and write to the mounted volume indicated by the `$DATA_FOLDER` variable. - -#### End-to-end run on CPU - -The container arguments in the above section also be used for the CPU-only container, which can be used on systems that don't have a GPU installed. 
- -***Note:*** the image changes to `raft-ann-bench-cpu` container and the `--gpus all` argument is no longer used: -```bash -export DATA_FOLDER=path/to/store/datasets/and/results -docker run --rm -it -u $(id -u) \ - -v $DATA_FOLDER:/data/benchmarks \ - rapidsai/raft-ann-bench-cpu:24.12a-py3.10 \ - "--dataset deep-image-96-angular" \ - "--normalize" \ - "--algorithms hnswlib --batch-size 10 -k 10" \ - "" -``` - -#### Manually run the scripts inside the container - -All of the `raft-ann-bench` images contain the Conda packages, so they can be used directly by logging directly into the container itself: - -```bash -export DATA_FOLDER=path/to/store/datasets/and/results -docker run --gpus all --rm -it -u $(id -u) \ - --entrypoint /bin/bash \ - --workdir /data/benchmarks \ - -v $DATA_FOLDER:/data/benchmarks \ - rapidsai/raft-ann-bench:24.12a-cuda11.8-py3.10 -``` - -This will drop you into a command line in the container, with the `raft-ann-bench` python package ready to use, as described in the [Running the benchmarks](#running-the-benchmarks) section above: - -``` -(base) root@00b068fbb862:/data/benchmarks# python -m raft_ann_bench.get_dataset --dataset deep-image-96-angular --normalize -``` - -Additionally, the containers can be run in detached mode without any issue. - - -### Evaluating the results - -The benchmarks capture several different measurements. The table below describes each of the measurements for index build benchmarks: - -| Name | Description | -|------------|--------------------------------------------------------| -| Benchmark | A name that uniquely identifies the benchmark instance | -| Time | Wall-time spent training the index | -| CPU | CPU time spent training the index | -| Iterations | Number of iterations (this is usually 1) | -| GPU | GPU time spent building | -| index_size | Number of vectors used to train index | - - -The table below describes each of the measurements for the index search benchmarks. 
The most important measurements `Latency`, `items_per_second`, `end_to_end`. - -| Name | Description | -|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------| -| Benchmark | A name that uniquely identifies the benchmark instance | -| Time | The wall-clock time of a single iteration (batch) divided by the number of threads. | -| CPU | The average CPU time (user + sys time). This does not include idle time (which can also happen while waiting for GPU sync). | -| Iterations | Total number of batches. This is going to be `total_queries` / `n_queries`. | -| GPU | GPU latency of a single batch (seconds). In throughput mode this is averaged over multiple threads. | -| Latency | Latency of a single batch (seconds), calculated from wall-clock time. In throughput mode this is averaged over multiple threads. | -| Recall | Proportion of correct neighbors to ground truth neighbors. Note this column is only present if groundtruth file is specified in dataset configuration.| -| items_per_second | Total throughput, a.k.a Queries per second (QPS). This is approximately `total_queries` / `end_to_end`. | -| k | Number of neighbors being queried in each iteration | -| end_to_end | Total time taken to run all batches for all iterations | -| n_queries | Total number of query vectors in each batch | -| total_queries | Total number of vectors queries across all iterations ( = `iterations` * `n_queries`) | - -Note the following: -- A slightly different method is used to measure `Time` and `end_to_end`. That is why `end_to_end` = `Time` * `Iterations` holds only approximately. -- The actual table displayed on the screen may differ slightly as the hyper-parameters will also be displayed for each different combination being benchmarked. -- Recall calculation: the number of queries processed per test depends on the number of iterations. 
Because of this, recall can show slight fluctuations if less neighbors are processed then it is available for the benchmark. - -## Creating and customizing dataset configurations - -A single configuration will often define a set of algorithms, with associated index and search parameters, that can be generalize across datasets. We use YAML to define dataset specific and algorithm specific configurations. - -A default `datasets.yaml` is provided by RAFT in `${RAFT_HOME}/python/raft-ann-bench/src/raft_ann_bench/run/conf` with configurations available for several datasets. Here's a simple example entry for the `sift-128-euclidean` dataset: - -```yaml -- name: sift-128-euclidean - base_file: sift-128-euclidean/base.fbin - query_file: sift-128-euclidean/query.fbin - groundtruth_neighbors_file: sift-128-euclidean/groundtruth.neighbors.ibin - dims: 128 - distance: euclidean -``` - -Configuration files for ANN algorithms supported by `raft-ann-bench` are provided in `${RAFT_HOME}/python/raft-ann-bench/src/raft_ann_bench/run/conf`. `raft_cagra` algorithm configuration looks like: -```yaml -name: raft_cagra -groups: - base: - build: - graph_degree: [32, 64] - intermediate_graph_degree: [64, 96] - graph_build_algo: ["NN_DESCENT"] - search: - itopk: [32, 64, 128] - - large: - build: - graph_degree: [32, 64] - search: - itopk: [32, 64, 128] -``` -The default parameters for which the benchmarks are run can be overridden by creating a custom YAML file for algorithms with a `base` group. - -There config above has 2 fields: -1. `name` - define the name of the algorithm for which the parameters are being specified. -2. `groups` - define a run group which has a particular set of parameters. Each group helps create a cross-product of all hyper-parameter fields for `build` and `search`. - -The table below contains all algorithms supported by RAFT. Each unique algorithm will have its own set of `build` and `search` settings. 
The [ANN Algorithm Parameter Tuning Guide](ann_benchmarks_param_tuning.md) contains detailed instructions on choosing build and search parameters for each supported algorithm. - -| Library | Algorithms | -|-----------|---------------------------------------------------------------------------------------| -| FAISS GPU | `faiss_gpu_flat`, `faiss_gpu_ivf_flat`, `faiss_gpu_ivf_pq` | -| FAISS CPU | `faiss_cpu_flat`, `faiss_cpu_ivf_flat`, `faiss_cpu_ivf_pq` | -| GGNN | `ggnn` | -| HNSWlib | `hnswlib` | -| RAFT | `raft_brute_force`, `raft_cagra`, `raft_ivf_flat`, `raft_ivf_pq`, `raft_cagra_hnswlib`| - -## Adding a new ANN algorithm - -### Implementation and Configuration -Implementation of a new algorithm should be a C++ class that inherits `class ANN` (defined in `cpp/bench/ann/src/ann.h`) and implements all the pure virtual functions. - -In addition, it should define two `struct`s for building and searching parameters. The searching parameter class should inherit `struct ANN::AnnSearchParam`. Take `class HnswLib` as an example, its definition is: -```c++ -template -class HnswLib : public ANN { -public: - struct BuildParam { - int M; - int ef_construction; - int num_threads; - }; - - using typename ANN::AnnSearchParam; - struct SearchParam : public AnnSearchParam { - int ef; - int num_threads; - }; - - // ... -}; -``` - -The benchmark program uses JSON format in a configuration file to specify indexes to build, along with the build and search parameters. To add the new algorithm to the benchmark, need be able to specify `build_param`, whose value is a JSON object, and `search_params`, whose value is an array of JSON objects, for this algorithm in configuration file. The `build_param` and `search_param` arguments will vary depending on the algorithm. 
Take the configuration for `HnswLib` as an example: -```json -{ - "name" : "hnswlib.M12.ef500.th32", - "algo" : "hnswlib", - "build_param": {"M":12, "efConstruction":500, "numThreads":32}, - "file" : "/path/to/file", - "search_params" : [ - {"ef":10, "numThreads":1}, - {"ef":20, "numThreads":1}, - {"ef":40, "numThreads":1} - ], - "search_result_file" : "/path/to/file" -}, -``` -How to interpret these JSON objects is left entirely to the implementation and should be specified in `cpp/bench/ann/src/factory.cuh`: -1. First, add two functions for parsing the JSON objects into `struct BuildParam` and `struct SearchParam`, respectively: - ```c++ - template - void parse_build_param(const nlohmann::json& conf, - typename cuann::HnswLib::BuildParam& param) { - param.ef_construction = conf.at("efConstruction"); - param.M = conf.at("M"); - if (conf.contains("numThreads")) { - param.num_threads = conf.at("numThreads"); - } - } - - template - void parse_search_param(const nlohmann::json& conf, - typename cuann::HnswLib::SearchParam& param) { - param.ef = conf.at("ef"); - if (conf.contains("numThreads")) { - param.num_threads = conf.at("numThreads"); - } - } - ``` - -2. Next, add a corresponding `if` case to the functions `create_algo()` (in `cpp/bench/ann/) and `create_search_param()` that calls the parsing functions. The string literal in the `if` condition must match the value of `algo` in the configuration file. For example, - ```c++ - // JSON configuration file contains a line like: "algo" : "hnswlib" - if (algo == "hnswlib") { - // ...
- } - ``` - - -### Adding a CMake Target -In `raft/cpp/bench/ann/CMakeLists.txt`, we provide a `CMake` function to configure a new Benchmark target with the following signature: -``` -ConfigureAnnBench( - NAME - PATH - INCLUDES - CXXFLAGS - LINKS -) -``` - -To add a target for `HNSWLIB`, we would call the function as: -``` -ConfigureAnnBench( - NAME HNSWLIB PATH bench/ann/src/hnswlib/hnswlib_benchmark.cpp INCLUDES - ${CMAKE_CURRENT_BINARY_DIR}/_deps/hnswlib-src/hnswlib CXXFLAGS "${HNSW_CXX_FLAGS}" -) -``` - -This will create an executable called `HNSWLIB_ANN_BENCH`, which can then be used to run `HNSWLIB` benchmarks. - -Add a new entry to `algos.yaml` to map the name of the algorithm to its binary executable and specify whether the algorithm requires GPU support. -```yaml -raft_ivf_pq: - executable: RAFT_IVF_PQ_ANN_BENCH - requires_gpu: true -``` - -`executable` : specifies the name of the binary that will build/search the index. It is assumed to be -available in `raft/cpp/build/`. -`requires_gpu` : denotes whether an algorithm requires GPU to run. diff --git a/docs/source/using_libraft.md b/docs/source/using_libraft.md deleted file mode 100644 index 70a17e289b..0000000000 --- a/docs/source/using_libraft.md +++ /dev/null @@ -1,64 +0,0 @@ -# Using The Pre-Compiled Binary - -At its core, RAFT is a header-only template library, which makes it very powerful in that APIs can be called with various different combinations of data types and only the templates which are actually used will be compiled into your binaries. This increased flexibility comes with a drawback that all the APIs need to be declared inline and thus calls which are made frequently in your code could be compiled again in each source file for which they are invoked. - -For most functions, compile-time overhead is minimal but some of RAFT's APIs take a substantial time to compile. 
As a rule of thumb, most functionality in `raft::distance`, `raft::neighbors`, and `raft::cluster` is expensive to compile and most functionality in other namespaces has little compile-time overhead. - -There are three ways to speed up compile times: - -1. Continue to use RAFT as a header-only library and create a CUDA source file - in your project to explicitly instantiate the templates which are slow to - compile. This can be tedious and will still require compiling the slow code - at least once, but it's the most flexible option if you are using types that - aren't already compiled into `libraft` - -2. If you are able to use one of the template types that are already being - compiled into `libraft`, you can use the pre-compiled template - instantiations, which are described in more detail in the following section. - -3. If you would like to use RAFT but either cannot or would prefer not to - compile any CUDA code yourself, you can simply add `libraft` to your link - libraries and use the growing set of `raft::runtime` APIs. - -### How do I verify template instantiations didn't compile into my binary? - -To verify that you are not accidentally instantiating templates that have not been pre-compiled in RAFT, set the `RAFT_EXPLICIT_INSTANTIATE_ONLY` macro. This only works if you are linking with the pre-compiled libraft (i.e., when `RAFT_COMPILED` has been defined). To check if, for instance, `raft::distance::distance` has been precompiled with specific template arguments, you can set `RAFT_EXPLICIT_INSTANTIATE_ONLY` at the top of the file you are compiling, as in the following example: - -```c++ - -#ifdef RAFT_COMPILED -#define RAFT_EXPLICIT_INSTANTIATE_ONLY -#endif - -#include -#include -#include - -int main() -{ - raft::resources handle{}; - - // Change IdxT to uint64_t and you will get an error because you are - // instantiating a template that has not been pre-compiled. 
- using IdxT = int; - - const float* x = nullptr; - const float* y = nullptr; - float* out = nullptr; - int m = 1024; - int n = 1024; - int k = 1024; - bool row_major = true; - raft::distance::distance( - handle, x, y, out, m, n, k, row_major, 2.0f); -} -``` - -## Runtime APIs - -RAFT contains a growing list of runtime APIs that, unlike the pre-compiled -template instantiations, allow you to link against `libraft` and invoke RAFT -directly from `cpp` files. The benefit to RAFT's runtime APIs is that they can -be used from code that is compiled with a `c++` compiler (rather than the CUDA -compiler `nvcc`). This enables the `runtime` APIs to power `pylibraft`. - diff --git a/docs/source/vector_search_tutorial.md b/docs/source/vector_search_tutorial.md deleted file mode 100644 index d1d5c57700..0000000000 --- a/docs/source/vector_search_tutorial.md +++ /dev/null @@ -1,409 +0,0 @@ -# Vector Search in C++ Tutorial - -## Table of Contents - -- [Step 1: Starting off with RAFT](#step-1-starting-off-with-raft) -- [Step 2: Generate some data](#step-2-generate-some-data) -- [Step 3: Using brute-force indexes](#step-3-using-brute-force-indexes) -- [Step 4: Using the ANN indexes](#step-4-using-the-ann-indexes) -- [Step 5: Evaluate neighborhood quality](#step-5-evaluate-neighborhood-quality) -- [Advanced Features](#advanced-features) - - [Serialization](#serialization) - - [Filtering](#filtering) - - [Stream Pools](#stream-pools) - - [Device Resources Manager](#device-resources-manager) - - [Device Memory Resources](#device-memory-resources) - - [Workspace Memory Resource](#workspace-memory-resource) - -RAFT has several important algorithms for performing vector search on the GPU and this tutorial walks through the primary vector search APIs from start to finish to provide a reference for quick setup and C++ API usage. - -This tutorial assumes RAFT has been installed and/or added to your build so that you are able to compile and run RAFT code. 
If not done already, please follow the [build and install instructions](build.md) and consider taking a look at the [example c++ template project](https://github.com/rapidsai/raft/tree/HEAD/cpp/template) for ready-to-go examples that you can immediately build and start playing with. Also take a look at RAFT's library of [reproducible vector search benchmarks](raft_ann_benchmarks.md) to run benchmarks that compare RAFT against other state-of-the-art nearest neighbors algorithms at scale. - -For more information about the various APIs demonstrated in this tutorial, along with comprehensive usage examples of all the APIs offered by RAFT, please refer to [RAFT's C++ API Documentation](https://docs.rapids.ai/api/raft/nightly/cpp_api/). - -## Step 1: Starting off with RAFT - -### CUDA Development? - -If you are reading this tutorial then you probably know about CUDA and its relationship to general-purpose GPU computing (GPGPU). You probably also know about NVIDIA GPUs but might not necessarily be familiar with the programming model or with GPU computing. The good news is that extensive knowledge of CUDA and GPUs is not needed in order to get started with or build applications with RAFT. RAFT hides away most of the complexities behind simple single-threaded stateless functions that are inherently asynchronous, meaning the result of a computation isn't necessarily ready to be used when the function returns and control is given back to the user. The functions are, however, allowed to be chained together in a sequence of calls that don't need to wait for earlier computations to complete in order to continue execution. In fact, the only time you need to wait for the computation to complete is when you are ready to use the result. - -A common structure you will encounter when using RAFT is a `raft::device_resources` object. This object is a container for important resources for a single GPU that might be needed during computation.
If communicating with multiple GPUs, multiple `device_resources` might be needed, one for each GPU. `device_resources` contains several methods for managing its state but most commonly, you'll call `sync_stream()` to guarantee all recently submitted computation has completed (as mentioned above). - -A simple example of using `raft::device_resources` in RAFT: - -```c++ -#include - -raft::device_resources res; -// Call a bunch of RAFT functions in sequence... -res.sync_stream(); -``` - -### Host vs Device Memory - -We differentiate between two different types of memory. `host` memory is your traditional RAM that is primarily accessible by applications on the CPU. `device` memory, on the other hand, is what we call the special memory on the GPU, which is not accessible from the CPU. In order to access host memory from the GPU, it needs to be explicitly copied to the GPU and in order to access device memory by the CPU, it needs to be explicitly copied there. We have several mechanisms available for allocating and managing the lifetime of device memory on the stack so that we don't need to explicitly allocate and free pointers on the heap. For example, instead of a `std::vector` for host memory, we can use `rmm::device_uvector` on the device. The following snippet will copy an array from host memory to device memory: - -```c++ -#include -#include -#include - -raft::device_resources res; - -std::vector<int> my_host_vector = {0, 1, 2, 3, 4}; -rmm::device_uvector<int> my_device_vector(my_host_vector.size(), res.get_stream()); - -raft::copy(my_device_vector.data(), my_host_vector.data(), my_host_vector.size(), res.get_stream()); -``` - -Since a stream is involved in the copy operation above, RAFT functions can be invoked immediately so long as the same `device_resources` instance is used (or, more specifically, the same main stream from the `device_resources`).
As you might notice in the example above, `res.get_stream()` can be used to extract the main stream from a `device_resources` instance. - -### Multi-dimensional data representation - -`rmm::device_uvector` is a great mechanism for allocating and managing a chunk of device memory. While it's possible to use a single array to represent objects in higher dimensions like matrices, it lacks the means to pass that information along. For example, in addition to knowing that we have a 2d structure, we would need to know the number of rows, the number of columns, and even whether we read the columns or rows first (referred to as column- or row-major respectively). - -For this reason, RAFT relies on the `mdspan` standard, which was composed specifically for this purpose. More precisely, `mdspan` itself doesn't actually allocate or own any data on host or device because it's just a view over existing memory on host or device. The `mdspan` simply gives us a way to represent multi-dimensional data so we can pass along the needed metadata to our APIs. Even more powerful is that we can design functions that only accept a matrix of `float` in device memory that is laid out in row-major format. - -The memory-owning counterpart to the `mdspan` is the `mdarray`, which can allocate memory on device or host and carry along with it the metadata about its shape and layout. An `mdspan` can be produced from an `mdarray` for invoking RAFT APIs with `mdarray.view()`. They also follow similar paradigms to the STL, where we represent an immutable `mdspan` of `int` using `mdspan<const int>` instead of `const mdspan<int>` to ensure it's the element type carried along by the `mdspan`, not the view itself, that's not allowed to change. - -Many RAFT functions require `mdspan<const T>` to represent immutable input data, and there's no implicit conversion between `mdspan<T>` and `mdspan<const T>`, so we use `raft::make_const_mdspan()` to alleviate the pain of constructing a new `mdspan` to invoke these functions.
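The difference between a const view and a view of const elements can be sketched in plain C++ without RAFT. The `toy_view` type and the `make_const_view()` helper below are hypothetical stand-ins invented for this illustration; they are not RAFT's `mdspan` or `raft::make_const_mdspan()`, and only mimic the idea of a non-owning view whose element type carries the constness.

```c++
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-owning 1-D view; NOT raft's mdspan.
template <typename T>
struct toy_view {
  T* data;
  std::size_t size;
  // Element access is allowed on a const view and still returns T&.
  T& operator[](std::size_t i) const { return data[i]; }
};

// Hypothetical helper mirroring the idea behind raft::make_const_mdspan():
// same pointer and size, but with a const element type.
template <typename T>
toy_view<const T> make_const_view(toy_view<T> v) {
  return toy_view<const T>{v.data, v.size};
}

// A function that promises not to modify its input asks for const elements.
inline int sum(toy_view<const int> v) {
  int total = 0;
  for (std::size_t i = 0; i < v.size; ++i) { total += v[i]; }
  return total;
}

// Writes through a const view of mutable elements, then sums via a const-element view.
inline int const_view_demo() {
  std::vector<int> storage = {1, 2, 3};
  const toy_view<int> v{storage.data(), storage.size()};
  v[0] = 10;                 // compiles: the view is const, the elements are not
  assert(storage[0] == 10);  // the write reached the underlying storage
  return sum(make_const_view(v));
}

// A view of const elements yields const int&, which cannot be assigned to.
static_assert(
  !std::is_assignable<decltype(std::declval<toy_view<const int>>()[0]), int>::value,
  "elements of a const-element view are read-only");
```

Calling `const_view_demo()` returns 15 (10 + 2 + 3). Making the view object `const` did not protect the data; only the const element type did, which is why an explicit conversion helper is useful when an API asks for read-only input.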
- -The following example demonstrates how to create `mdarray` matrices in both device and host memory, copy one to the other, and create mdspans out of them: - -```c++ -#include -#include -#include - -raft::device_resources res; - -int n_rows = 10; -int n_cols = 10; - -auto device_matrix = raft::make_device_matrix(res, n_rows, n_cols); -auto host_matrix = raft::make_host_matrix(res, n_rows, n_cols); - -// Set the diagonal to 1 -for(int i = 0; i < n_rows; i++) { - host_matrix(i, i) = 1; -} - -raft::copy(res, device_matrix.view(), host_matrix.view()); -``` - -## Step 2: Generate some data - -Let's build upon the fundamentals from the prior section and actually invoke some of RAFT's computational APIs on the device. A good starting point is data generation. - -```c++ -#include -#include - -raft::device_resources res; - -int n_rows = 10000; -int n_cols = 10000; - -auto dataset = raft::make_device_matrix(res, n_rows, n_cols); -auto labels = raft::make_device_vector(res, n_rows); - -raft::random::make_blobs(res, dataset.view(), labels.view()); -``` - -That's it. We've now generated a random 10kx10k matrix with points that cleanly separate into Gaussian clusters, along with a vector of cluster labels for each of the data points. Notice the `cuh` extension in the header file include for `make_blobs`. This signifies to us that this file contains CUDA device functions like kernel code so the CUDA compiler, `nvcc` is needed in order to compile any code that uses it. Generally, any source files that include headers with a `cuh` extension use the `.cu` extension instead of `.cpp`. The rule here is that `cpp` source files contain code which can be compiled with a C++ compiler like `g++` while `cu` files require the CUDA compiler. - -Since the `make_blobs` code generates the random dataset on the GPU device, we didn't need to do any host to device copies in this one. 
`make_blobs` is also asynchronous, so if we don't need to copy and use the data in host memory right away, we can continue calling RAFT functions with the `device_resources` instance and the data transformations will all be scheduled on the same stream. - -## Step 3: Using brute-force indexes - -### Build brute-force index - -Consider the `(10k, 10k)` shaped random matrix we generated in the previous step. We want to be able to find the k-nearest neighbors for all points of the matrix, or what we refer to as the all-neighbors graph, which means finding the neighbors of all data points within the same matrix. -```c++ -#include - -raft::device_resources res; - -// set number of neighbors to search for -int const k = 64; - -auto bfknn_index = raft::neighbors::brute_force::build(res, - raft::make_const_mdspan(dataset.view())); -``` - -### Query brute-force index - -```c++ - -// using matrix `dataset` from previous example -auto search = raft::make_const_mdspan(dataset.view()); - -// Indices and Distances are of dimensions (n, k) -// where n is number of rows in the search matrix -auto reference_indices = raft::make_device_matrix(res, search.extent(0), k); // stores index of neighbors -auto reference_distances = raft::make_device_matrix(res, search.extent(0), k); // stores distance to neighbors - -raft::neighbors::brute_force::search(res, - bfknn_index, - search, - reference_indices.view(), - reference_distances.view()); -``` - -We have established several things here by building a flat index. Now we know the exact 64 neighbors of all points in the matrix, and this algorithm can be generally useful in several ways: -1. Creating a baseline to compare against when building an approximate nearest neighbors index. -2. Directly using the brute-force algorithm when accuracy is more important than speed of computation. 
Don't worry, our implementation is still best-in-class: it provides significant speedups over other brute-force methods and is also relatively quick when the matrices are small! - - -## Step 4: Using the ANN indexes - -### Build a CAGRA index - -Next we'll train an ANN index. We'll use our graph-based CAGRA algorithm for this example but the other index types use a very similar pattern. - -```c++ -#include - -raft::device_resources res; - -// use default index parameters -raft::neighbors::cagra::index_params index_params; - -auto index = raft::neighbors::cagra::build(res, index_params, raft::make_const_mdspan(dataset.view())); -``` - -### Query the CAGRA index - -Now that we've trained a CAGRA index, we can query it by first allocating our output `mdarray` objects and passing the trained index model into the search function. - -```c++ -// create output arrays -auto indices = raft::make_device_matrix(res, n_rows, k); -auto distances = raft::make_device_matrix(res, n_rows, k); - -// use default search parameters -raft::neighbors::cagra::search_params search_params; - -// search K nearest neighbors -raft::neighbors::cagra::search( -res, search_params, index, search, indices.view(), distances.view()); -``` - -## Step 5: Evaluate neighborhood quality - -In step 3 we built a flat index and queried for exact neighbors, while in step 4 we built an ANN index and queried for approximate neighbors. How do you quickly figure out the quality of our approximate neighbors and whether it's in an acceptable range based on your needs? Just compute the `neighborhood_recall`, which gives a single value in the range [0, 1]. The closer the value is to 1, the higher the quality of the approximation.
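To make the metric concrete, here is a minimal host-side sketch of how a recall value in [0, 1] can be computed from two index matrices. It is plain C++ written for illustration, not RAFT's implementation: the function name `recall_at_k` is hypothetical, it assumes indices are unique within each row, and it ignores the distance-based tie handling that a library routine may apply.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of RAFT: the fraction of exact neighbors that
// also appear in the approximate result for the same query, over all queries.
// Both inputs are row-major (n_queries x k) matrices of neighbor indices.
inline double recall_at_k(const std::vector<int>& approx,
                          const std::vector<int>& exact,
                          std::size_t n_queries,
                          std::size_t k) {
  assert(approx.size() == n_queries * k && exact.size() == n_queries * k);
  std::size_t hits = 0;
  for (std::size_t q = 0; q < n_queries; ++q) {
    for (std::size_t i = 0; i < k; ++i) {
      const int wanted = exact[q * k + i];
      // Linear scan of the approximate row; order within a row does not matter.
      for (std::size_t j = 0; j < k; ++j) {
        if (approx[q * k + j] == wanted) {
          ++hits;
          break;
        }
      }
    }
  }
  return static_cast<double>(hits) / static_cast<double>(n_queries * k);
}
```

For two queries with `k = 2`, exact neighbors `{0, 1, 2, 3}` and approximate neighbors `{1, 0, 2, 5}`, three of the four exact neighbors are recovered, so the function returns 0.75. With RAFT, the corresponding quantity is computed on the device: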
- -```c++ -#include - -raft::device_resources res; - -// Assuming matrices as type raft::device_matrix_view and variables as -// indices : approximate neighbor indices -// reference_indices : exact neighbor indices -// distances : approximate neighbor distances -// reference_distances : exact neighbor distances - -// We want our `neighborhood_recall` value in host memory -float const recall_scalar = 0.0; -auto recall_value = raft::make_host_scalar(recall_scalar); - -raft::stats::neighborhood_recall(res, - raft::make_const_mdspan(indices.view()), - raft::make_const_mdspan(reference_indices.view()), - recall_value.view(), - raft::make_const_mdspan(distances.view()), - raft::make_const_mdspan(reference_distances.view())); - -res.sync_stream(); -``` - -Notice we can invoke the functions for index build and search for both algorithms, one right after the other, because we don't need to access any outputs from the algorithms in host memory. We will need to synchronize the stream on the `raft::device_resources` instance before we can read the result of the `neighborhood_recall` computation, though. - -Similar to a NumPy array, when we use a `host_scalar`, we are really using a multi-dimensional structure that contains only a single dimension, and further a single element. We can use element indexing to access the resulting element directly. -```c++ -std::cout << recall_value(0) << std::endl; -``` - -While it may seem like unnecessary additional work to wrap the result in a `host_scalar` mdspan, this API choice is made intentionally to support the possibility of also receiving the result as a `device_scalar` so that it can be used directly on the device for follow-on computations without having to incur the synchronization or transfer cost of bringing the result to host.
This pattern becomes even more important when the result is being computed in a loop, such as an iterative solver, where repeated synchronization and device-to-host (d2h) transfers become very expensive. - -## Advanced features - -The following sections present some advanced features that we have found can be useful for squeezing more utilization out of GPU hardware. As you've seen in this tutorial, RAFT provides several very useful tools and building blocks for developing accelerated applications beyond vector search capabilities. - -### Serialization - -Most of the indexes in `raft::neighbors` can be serialized to/from streams and files on disk. The index types that support this feature have include files with the naming convention `_serialize.cuh`. The serialization functions are similar across the different index types, with the primary difference being that some index types require a pointer to all the training data for search. Since the original training dataset can be quite large, the `serialize()` function for these index types includes an argument `include_dataset`, which allows the user to specify whether the dataset should be included in the serialized form. The index types that allow for this also include a method `update_dataset()` to allow for the dataset to be re-attached to the index after it is deserialized. - -The following example demonstrates serializing and deserializing a CAGRA index to and from a file. For index types that don't require the training data, you can remove the `include_dataset` and `update_dataset()` parts.
We will assume the CAGRA index has been built using the code from [Step 4](#build-a-cagra-index) above: - -```c++ -#include -#include - -using namespace raft::neighbors; - -raft::neighbors::cagra::serialize(res, "cagra_serialized.dat", index, false); - -auto index_deser = raft::neighbors::cagra::deserialize(res, "cagra_serialized.dat"); -index_deser.update_dataset(dataset); -``` - -### Filtering - -As of RAFT 23.10, support for pre-filtering of neighbors has been added to the ANN indexes. This search feature can enable multiple use-cases, such as filtering vectors based on their attributes (hybrid searches), the removal of vectors already added to the index, or the control of access in searches for security purposes. -The filtering is available through the `search_with_filtering()` function of the ANN index, and is done by applying a predicate function on the GPU, which usually has the signature `(uint32_t query_ix, uint32_t sample_ix) -> bool`. - -One of the most commonly used mechanisms for filtering is the bitset: a data structure that allows testing the presence of a value in a set through a fast lookup, implemented as a bit array so that every element contains a `0` or a `1` (respectively `false` and `true` in boolean logic). RAFT provides a `raft::core::bitset` class that can be used to create and manipulate bitsets on the GPU, and a `raft::core::bitset_view` class that can be used to pass bitsets to filtering functions. - -The following example demonstrates how to use the filtering API (assume the CAGRA index is built using the code from [Step 4](#build-a-cagra-index) above): - -```c++ -#include -#include - -using namespace raft::neighbors; - -cagra::search_params search_params; - -// create a bitset to filter the search -auto removed_indices = raft::make_device_vector(res, n_removed_indices); -raft::core::bitset removed_indices_bitset( - res, removed_indices.view(), dataset.extent(0)); - -// ... Populate the bitset ...
- -// search K nearest neighbours according to a bitset filter -auto neighbors = raft::make_device_matrix(res, n_queries, k); -auto distances = raft::make_device_matrix(res, n_queries, k); -cagra::search_with_filtering(res, search_params, index, queries, neighbors, distances, - filtering::bitset_filter(removed_indices_bitset.view())); -``` - -### Stream pools - -Within each CPU thread, CUDA uses `streams` to submit asynchronous work. You can think of a stream as a queue. Each stream can submit work to the GPU independently of other streams but work submitted within each stream is queued and executed in the order in which it was submitted. Similar to how we can use thread pools to bound the parallelism of CPU threads, we can use CUDA stream pools to bound the amount of concurrent asynchronous work that can be scheduled on a GPU. Each instance of `device_resources` has a main stream, but can also create a stream pool. For a single CPU thread, multiple different instances of `device_resources` can be created with different main streams and used to invoke a series of RAFT functions concurrently on the same or different GPU devices, so long as the target devices have available resources to perform the work. Once a device is saturated, queued work on streams will be scheduled and wait for a chance to do more work. While the streams are waiting, the CPU thread will still continue its own execution asynchronously unless `sync_stream_pool()` is called, causing the thread to block and wait for the stream pool to complete. - -Also, beware that before splitting GPU work onto multiple different concurrent streams, it can often be important to wait for the main stream in the `device_resources`. This can be done with `wait_stream_pool_on_stream()`.
- -To summarize, to run work on multiple streams in parallel, we would often use a stream pool like this: -```c++ -#include - -#include -#include - -int n_streams = 5; - -rmm::cuda_stream stream; -auto stream_pool = std::make_shared<rmm::cuda_stream_pool>(n_streams); -raft::device_resources res(stream.view(), stream_pool); - -// Submit some work on the main stream... - -res.wait_stream_pool_on_stream(); -for(int i = 0; i < n_streams; ++i) { - rmm::cuda_stream_view stream_from_pool = res.get_next_usable_stream(); - raft::device_resources pool_res(stream_from_pool); - // Submit some work with pool_res... -} - -res.sync_stream_pool(); -``` - -### Device resources manager - -In multi-threaded applications, it is often useful to create a set of -`raft::device_resources` objects on startup to avoid the overhead of -re-initializing underlying resources every time a `raft::device_resources` object -is needed. To help simplify this common initialization logic, RAFT -provides a `raft::device_resources_manager` to handle this for downstream -applications. On startup, the application can specify certain limits on the -total resource consumption of the `raft::device_resources` objects that will be -generated: -```c++ -#include - -void initialize_application() { - // Set the total number of CUDA streams to use on each GPU across all CPU - // threads. If this method is not called, the default stream per thread - // will be used. - raft::device_resources_manager::set_streams_per_device(16); - - // Create a memory pool with given max size in bytes. Passing std::nullopt will allow - // the pool to grow to the available memory of the device. - raft::device_resources_manager::set_max_mem_pool_size(std::nullopt); - - // Set the initial size of the memory pool in bytes.
- raft::device_resources_manager::set_init_mem_pool_size(16000000); - - // If neither of the above methods are called, no memory pool will be used -} -``` -While this example shows some commonly used settings, -`raft::device_resources_manager` provides support for several other -resource options and constraints, including options to initialize entire -stream pools that can be used by an individual `raft::device_resources` object. After -this initialization method is called, the following function could be called -from any CPU thread: -```c++ -void foo() { - raft::device_resources const& res = raft::device_resources_manager::get_device_resources(); - // Submit some work with res - res.sync_stream(); -} -``` - -If any `raft::device_resources_manager` setters are called _after_ the first -call to `raft::device_resources_manager::get_device_resources()`, these new -settings are ignored, and a warning will be logged. If a thread calls -`raft::device_resources_manager::get_device_resources()` multiple times, it is -guaranteed to access the same underlying `raft::device_resources` object every -time. This can be useful for chaining work in different calls on the same -thread without keeping a persistent reference to the resources object. - -### Device memory resources - -The RAPIDS software ecosystem makes heavy use of the [RAPIDS Memory Manager](https://github.com/rapidsai/rmm) (RMM) to enable zero-copy sharing of device memory across various GPU-enabled libraries such as PyTorch, Jax, Tensorflow, and FAISS. A really powerful feature of RMM is the ability to set a memory resource, such as a pooled memory resource that allocates a block of memory up front to speed up subsequent smaller allocations, and have all the libraries in the GPU ecosystem recognize and use that same memory resource for all of their memory allocations. 
- -As an example, the following code snippet creates a `pool_memory_resource` and sets it as the default memory resource, which means all other libraries that use RMM will now allocate their device memory from this same pool: -```c++ -#include - -rmm::mr::cuda_memory_resource cuda_mr; -// Construct a resource that uses a coalescing best-fit pool allocator -// set the initial size to half of the free device memory -auto init_size = rmm::percent_of_free_device_memory(50); -rmm::mr::pool_memory_resource pool_mr{&cuda_mr, init_size}; -rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device resource pointer to `pool_mr` -``` - -The `raft::device_resources` object will now also use the `rmm::current_device_resource`. This isn't limited to C++, however. Often a user will be interacting with PyTorch, RAPIDS, or Tensorflow through Python and so they can set and use RMM's `current_device_resource` [right in Python](https://github.com/rapidsai/rmm#using-rmm-in-python-code). - -### Workspace memory resource - -As mentioned above, `raft::device_resources` will use `rmm::current_device_resource` by default for all memory allocations. However, there are times when a particular algorithm might benefit from using a different memory resource such as a `managed_memory_resource`, which creates a unified memory space between device and host memory, paging memory in and out of device as needed. Most of RAFT's algorithms allocate temporary memory as needed to perform their computations and we can control the memory resource used for these temporary allocations through the `workspace_resource` in the `raft::device_resources` instance. - -For some applications, the `managed_memory_resource`, can enable a memory space that is larger than the GPU, thus allowing a natural spilling to host memory when needed. This isn't always the best way to use managed memory, though, as it can quickly lead to thrashing and severely impact performance. 
Still, when it can be used, it provides a very powerful tool that can also avoid out-of-memory errors when enough host memory is available. - -The following creates a managed memory allocator and sets it as the `workspace_resource` of the `raft::device_resources` instance: -```c++ -#include -#include - -auto managed_resource = std::make_shared<rmm::mr::managed_memory_resource>(); -raft::device_resources res(managed_resource); -``` - -The `workspace_resource` uses an `rmm::mr::limiting_resource_adaptor`, which limits the total amount of allocation possible. This allows RAFT algorithms to work within the confines of the memory constraints imposed by the user so that things like batch sizes can be automatically set to reasonable values without exceeding the allotted memory. By default, this limit restricts the memory allocation space for temporary workspace buffers to the memory available on the device. - -The example below limits the total number of bytes that RAFT can use for temporary workspace allocations to 3GB: -```c++ -#include -#include - -#include - -auto managed_resource = std::make_shared<rmm::mr::managed_memory_resource>(); -// Note: `^` is XOR in C++, so the byte count is written as an explicit product. -raft::device_resources res(managed_resource, std::make_optional<std::size_t>(3ull * 1024 * 1024 * 1024)); -``` \ No newline at end of file
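Byte budgets such as the 3GB workspace limit above are usually written as powers of 1024, and in C++ the `^` operator is bitwise XOR rather than exponentiation. The short sketch below shows the arithmetic. The helper `gib_to_bytes` is a hypothetical name introduced for this illustration and is not part of RAFT or RMM.

```c++
#include <cstdint>

// `*` binds tighter than `^`, so `3 * 1024 ^ 3` groups as shown here:
// (3 * 1024) XOR 3, which is a few kilobytes rather than 3 GiB.
constexpr std::uint64_t xor_result = (3 * 1024) ^ 3;

// Hypothetical helper: convert GiB to bytes with a left shift by 30 bits,
// done in 64-bit arithmetic so the result does not overflow a 32-bit int.
constexpr std::uint64_t gib_to_bytes(std::uint64_t gib) { return gib << 30; }

static_assert(xor_result == 3075, "3072 XOR 3 is 3075 bytes");
static_assert(gib_to_bytes(3) == 3221225472ull, "3 GiB is 3 * 2^30 bytes");
```

Writing the limit as `3ull * 1024 * 1024 * 1024`, or through a small helper like the one above, keeps the intended value explicit and avoids both the XOR surprise and 32-bit overflow.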