Enable kernel & memcpy overlapping in IVF index building (#230)
Currently, when building an IVF index (both IVF-Flat and IVF-PQ), a large dataset usually resides in pageable host memory or in an mmap-ed file. In both cases, after the cluster centers are trained, the entire dataset has to be copied to the GPU twice: once to assign vectors to clusters, and once to copy the vectors into their corresponding clusters. Both copies are done chunk by chunk using `batch_load_iterator`. Because the source buffer is in pageable memory, the current `batch_load_iterator` implementation doesn't support overlapping kernels with memory copies. This PR adds support for prefetching with `cudaMemcpyAsync` on pageable memory. Kernel/copy overlap is achieved by launching the kernel first and then prefetching the next chunk.
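
The launch-then-prefetch ordering described above can be sketched on the host alone. This is a minimal analogy under stated assumptions, not the code from this PR: `std::async` stands in for work running concurrently on the GPU, an ordinary copy stands in for `cudaMemcpyAsync`, and the names `load_chunk`, `process_chunk`, and `process_with_prefetch` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for copying one chunk of the dataset to the device.
std::vector<float> load_chunk(const std::vector<float>& dataset,
                              std::size_t offset,
                              std::size_t chunk)
{
  const std::size_t end = std::min(offset + chunk, dataset.size());
  return std::vector<float>(dataset.begin() + offset, dataset.begin() + end);
}

// Hypothetical stand-in for the kernel that consumes one chunk.
double process_chunk(const std::vector<float>& buf)
{
  return std::accumulate(buf.begin(), buf.end(), 0.0);
}

// Double-buffered loop: start work on the resident chunk first,
// then fetch the next chunk while that work is still running.
double process_with_prefetch(const std::vector<float>& dataset, std::size_t chunk)
{
  double total = 0.0;
  if (dataset.empty() || chunk == 0) { return total; }

  // The first chunk has nothing to overlap with.
  std::vector<float> current = load_chunk(dataset, 0, chunk);
  for (std::size_t offset = 0; offset < dataset.size(); offset += chunk) {
    // 1. "Launch the kernel" on the chunk that is already loaded.
    auto work = std::async(std::launch::async,
                           [&current] { return process_chunk(current); });

    // 2. While it runs, prefetch the next chunk into a second buffer.
    std::vector<float> next;
    const std::size_t next_offset = offset + chunk;
    if (next_offset < dataset.size()) { next = load_chunk(dataset, next_offset, chunk); }

    // 3. Wait for the work before touching its buffer, then swap buffers.
    total += work.get();
    current = std::move(next);
  }
  return total;
}
```

With a ten-element dataset and a chunk size of 3, the loop processes four chunks, and three of the loads overlap with work on the preceding chunk. The key design point is the order inside the loop: the work is started before the next load is issued, so the load never delays the work it would otherwise run alongside.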

We benchmarked the change on an L40S GPU. The results show a 3%-21% speedup in index building without impacting search recall (recall differs by about 1-2%, similar to run-to-run variance).
algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup
-- | -- | -- | -- | -- | --
IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03
IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21
IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03
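
The commit does not state how the speedup column is computed. It appears to be the build time without prefetching divided by the build time with prefetching, rounded to two decimals. A small sketch under that assumption (the helper names `speedup` and `round2` are hypothetical):

```cpp
#include <cmath>

// Assumed definition: speedup = time without prefetching / time with prefetching.
double speedup(double with_prefetch_s, double without_prefetch_s)
{
  return without_prefetch_s / with_prefetch_s;
}

// Round to two decimals, matching the precision used in the table.
double round2(double x) { return std::round(x * 100.0) / 100.0; }
```

Under this assumption, the three rows reproduce the reported values: 100.36 / 97.3547, 18.1602 / 14.9763, and 81.4461 / 78.8188 round to 1.03, 1.21, and 1.03 respectively.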

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #230
abc99lr authored Jul 31, 2024
1 parent e67caa5 commit eb4d38e
Showing 12 changed files with 382 additions and 31 deletions.
4 changes: 4 additions & 0 deletions cpp/bench/ann/src/cuvs/cuvs_ivf_flat_wrapper.h
@@ -26,6 +26,7 @@
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/linalg/unary_op.cuh>
#include <raft/util/cudart_utils.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <cassert>
#include <fstream>
@@ -96,6 +97,9 @@ class cuvs_ivf_flat : public algo<T>, public algo_gpu {
template <typename T, typename IdxT>
void cuvs_ivf_flat<T, IdxT>::build(const T* dataset, size_t nrow)
{
  // Create a CUDA stream pool with 1 stream (besides main stream) for kernel/copy overlapping.
  size_t n_streams = 1;
  raft::resource::set_cuda_stream_pool(handle_, std::make_shared<rmm::cuda_stream_pool>(n_streams));
  index_ = std::make_shared<cuvs::neighbors::ivf_flat::index<T, IdxT>>(
    std::move(cuvs::neighbors::ivf_flat::build(
      handle_,
4 changes: 4 additions & 0 deletions cpp/bench/ann/src/cuvs/cuvs_ivf_pq_wrapper.h
@@ -30,6 +30,7 @@
#include <raft/linalg/unary_op.cuh>
#include <raft/neighbors/refine.cuh>
#include <raft/util/cudart_utils.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <type_traits>

@@ -115,6 +116,9 @@ void cuvs_ivf_pq<T, IdxT>::load(const std::string& file)
template <typename T, typename IdxT>
void cuvs_ivf_pq<T, IdxT>::build(const T* dataset, size_t nrow)
{
  // Create a CUDA stream pool with 1 stream (besides main stream) for kernel/copy overlapping.
  size_t n_streams = 1;
  raft::resource::set_cuda_stream_pool(handle_, std::make_shared<rmm::cuda_stream_pool>(n_streams));
  auto dataset_v = raft::make_device_matrix_view<const T, IdxT>(dataset, IdxT(nrow), dim_);
  std::make_shared<cuvs::neighbors::ivf_pq::index<IdxT>>(
    std::move(cuvs::neighbors::ivf_pq::build(handle_, index_params_, dataset_v)))
79 changes: 79 additions & 0 deletions cpp/include/cuvs/neighbors/ivf_flat.hpp
@@ -445,11 +445,18 @@ void build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* auto index = ivf_flat::build(handle, dataset, index_params);
* @endcode
@@ -468,11 +475,18 @@ auto build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* ivf_flat::index<decltype(dataset::value_type), decltype(dataset::index_type)> index;
* ivf_flat::build(handle, dataset, index_params, index);
@@ -492,11 +506,18 @@ void build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* auto index = ivf_flat::build(handle, dataset, index_params);
* @endcode
@@ -515,11 +536,18 @@ auto build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* ivf_flat::index<decltype(dataset::value_type), decltype(dataset::index_type)> index;
* ivf_flat::build(handle, dataset, index_params, index);
@@ -539,11 +567,18 @@ void build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* auto index = ivf_flat::build(handle, dataset, index_params);
* @endcode
@@ -562,11 +597,18 @@ auto build(raft::resources const& handle,
/**
* @brief Build the index from the dataset for efficient search.
*
* Note, if index_params.add_data_on_build is set to true, the user can set a
* stream pool in the input raft::resource with at least one stream to enable kernel and copy
* overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* // use default index parameters
* ivf_flat::index_params index_params;
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping. This is only applicable if index_params.add_data_on_build is set to true
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // create and fill the index from a [N, D] dataset
* ivf_flat::index<decltype(dataset::value_type), decltype(dataset::index_type)> index;
* ivf_flat::build(handle, dataset, index_params, index);
@@ -710,6 +752,7 @@ auto extend(raft::resources const& handle,
* @param[in] handle
* @param[in] new_vectors raft::device_matrix_view to a row-major matrix [n_rows, index.dim()]
* @param[in] new_indices optional raft::device_vector_view to a vector of indices [n_rows].
*
* If the original index is empty (`orig_index.size() == 0`), you can pass `std::nullopt`
* here to imply a continuous range `[0...n_rows)`.
* @param[inout] idx pointer to index, to be overwritten in-place
@@ -786,6 +829,9 @@ void extend(raft::resources const& handle,
/**
* @brief Build a new index containing the data of the original plus new extra vectors.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Implementation note:
* The new data is clustered according to existing kmeans clusters, then the cluster
* centers are adjusted to match the newly labeled data.
@@ -798,6 +844,9 @@ void extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* auto index = ivf_flat::extend(handle, new_vectors, no_op, index_empty);
@@ -821,6 +870,9 @@ auto extend(raft::resources const& handle,
/**
* @brief Extend the index in-place with the new data.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
@@ -829,6 +881,9 @@ auto extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* ivf_flat::extend(handle, dataset, no_op, &index_empty);
@@ -850,6 +905,9 @@ void extend(raft::resources const& handle,
/**
* @brief Build a new index containing the data of the original plus new extra vectors.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Implementation note:
* The new data is clustered according to existing kmeans clusters, then the cluster
* centers are adjusted to match the newly labeled data.
@@ -862,6 +920,9 @@ void extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* auto index = ivf_flat::extend(handle, new_vectors, no_op, index_empty);
@@ -885,6 +946,9 @@ auto extend(raft::resources const& handle,
/**
* @brief Extend the index in-place with the new data.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
@@ -893,6 +957,9 @@ auto extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* ivf_flat::extend(handle, dataset, no_op, &index_empty);
@@ -914,6 +981,9 @@ void extend(raft::resources const& handle,
/**
* @brief Build a new index containing the data of the original plus new extra vectors.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Implementation note:
* The new data is clustered according to existing kmeans clusters, then the cluster
* centers are adjusted to match the newly labeled data.
@@ -926,6 +996,9 @@ void extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* auto index = ivf_flat::extend(handle, new_vectors, no_op, index_empty);
@@ -949,6 +1022,9 @@ auto extend(raft::resources const& handle,
/**
* @brief Extend the index in-place with the new data.
*
* Note, the user can set a stream pool in the input raft::resource with
* at least one stream to enable kernel and copy overlapping.
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
@@ -957,6 +1033,9 @@ auto extend(raft::resources const& handle,
* index_params.kmeans_trainset_fraction = 1.0; // use whole dataset for kmeans training
* // train the index from a [N, D] dataset
* auto index_empty = ivf_flat::build(handle, index_params, dataset);
* // optional: create a stream pool with at least one stream to enable kernel and copy
* // overlapping
* raft::resource::set_cuda_stream_pool(handle, std::make_shared<rmm::cuda_stream_pool>(1));
* // fill the index with the data
* std::optional<raft::host_vector_view<const IdxT, IdxT>> no_op = std::nullopt;
* ivf_flat::extend(handle, dataset, no_op, &index_empty);