CAGRA new vector addition (#151)
This PR introduces the new vector addition feature to CAGRA.

Rel: rapidsai/raft#1775
Original PR: rapidsai/raft#2157

CAGRA-Q is not supported

## Usage
```cpp
auto additional_dataset = raft::make_host_matrix<float, int64_t>(handle, updated_dataset_size, dim);
cuvs::neighbors::cagra::extend_params params;
cuvs::neighbors::cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), cagra_index);
```

## Algorithm

Graph degree: d

The algorithm consists of two stages: rank-based reordering and reverse edge addition.
1. Rank-based reordering
1-1. Obtain d' (=2d) nearest neighbor vectors (V) of a given new vector using the CAGRA search
1-2. Count the number of detourable edges using the result of Step 1-1 and the neighbor lists of the input index. Then prune 3d/2 edges in the same way as the CAGRA graph optimization. This leaves d/2 neighbors.
2. Reverse edge addition
2-1. Count the number of incoming edges for all nodes.
2-2. Add d/2 reverse edges from the nodes selected in Step 1 to the new node, each by replacing an existing entry in that node's neighbor list with the new node. To avoid losing the connection to the replaced node, add the replaced node to the neighbor list of the new node. This allows us to make a detour connection. The replaced node is the one with the largest number of incoming edges among the last d/2 entries of the neighbor list, excluding nodes that are already in the neighbor list of the new node.
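
Stage 1 can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `graph_t` and `select_forward_neighbors` are hypothetical names, the graph is a plain adjacency list, and ties between equal detour counts are broken by distance rank via a stable sort, which is a choice made here for determinism.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side graph: graph[u] is the neighbor list of node u.
using graph_t = std::vector<std::vector<std::uint32_t>>;

// candidates: ids of the d' nearest existing nodes of the new vector,
// sorted by increasing distance (the result of Step 1-1).
// Returns up to degree / 2 candidates with the fewest detourable routes.
inline std::vector<std::uint32_t> select_forward_neighbors(
  const graph_t& graph, const std::vector<std::uint32_t>& candidates, std::size_t degree)
{
  // (detour count, node id) for every candidate
  std::vector<std::pair<std::uint32_t, std::uint32_t>> counted;
  counted.reserve(candidates.size());
  for (std::size_t i = 0; i < candidates.size(); i++) {
    std::uint32_t detours = 0;
    // The edge new -> candidates[i] is detourable through a closer candidate j
    // if candidates[i] already appears in the neighbor list of candidates[j].
    for (std::size_t j = 0; j < i; j++) {
      const auto& nbrs = graph[candidates[j]];
      if (std::find(nbrs.begin(), nbrs.end(), candidates[i]) != nbrs.end()) { detours++; }
    }
    counted.emplace_back(detours, candidates[i]);
  }
  // Fewest detours first; the stable sort keeps the distance order on ties.
  std::stable_sort(counted.begin(), counted.end(), [](const auto& a, const auto& b) {
    return a.first < b.first;
  });
  const std::size_t keep = std::min(degree / 2, counted.size());
  std::vector<std::uint32_t> selected;
  selected.reserve(keep);
  for (std::size_t i = 0; i < keep; i++) { selected.push_back(counted[i].second); }
  return selected;
}
```

With `graph = {{1,3},{0,3},{0,1},{1,2}}`, `candidates = {0,1,2,3}` and `degree = 4`, the detour counts are 0, 1, 0 and 2, so the two kept neighbors are nodes 0 and 2. Stage 2 (reverse edge addition) is not covered by this sketch.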

## Performance
In this experiment, we first split the dataset into two parts: the initial part and the additional part. Then, we extend the CAGRA index built from the initial part to include the additional part.
![search-eval](https://github.com/rapidsai/raft/assets/12711693/0fbae9e5-defc-4263-9d34-176667fb3359)


Recall drops further below the baseline as the number of added vectors increases.
Therefore, rebuilding the CAGRA index is recommended when adding a large number of vectors.
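
To check the recall of an extended index against a rebuilt one on your own data, a helper along these lines may be enough. It is a sketch that assumes recall is measured as the fraction of ground-truth neighbor ids that appear in the search result for the same query; the PR does not state how the plot above was produced, and `recall_at_k` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// found[q] and truth[q] hold the neighbor ids returned for query q by the
// index under test and by an exact search, respectively.
inline double recall_at_k(const std::vector<std::vector<std::uint32_t>>& found,
                          const std::vector<std::vector<std::uint32_t>>& truth)
{
  std::size_t hits  = 0;
  std::size_t total = 0;
  const std::size_t num_queries = std::min(found.size(), truth.size());
  for (std::size_t q = 0; q < num_queries; q++) {
    for (const auto id : truth[q]) {
      // Count a hit when a ground-truth id is present in the result row.
      if (std::find(found[q].begin(), found[q].end(), id) != found[q].end()) { hits++; }
    }
    total += truth[q].size();
  }
  return total == 0 ? 0.0 : static_cast<double>(hits) / static_cast<double>(total);
}
```

Comparing this value for the extended index and for an index rebuilt from the full dataset gives a rough indication of when a rebuild is worth the cost.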

Authors:
  - tsuki (https://github.com/enp1s0)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #151
enp1s0 authored Jul 9, 2024
1 parent c5e3ec8 commit bf53940
Showing 12 changed files with 1,225 additions and 81 deletions.
3 changes: 3 additions & 0 deletions cpp/CMakeLists.txt
@@ -247,6 +247,9 @@ add_library(
src/neighbors/cagra_build_float.cu
src/neighbors/cagra_build_int8.cu
src/neighbors/cagra_build_uint8.cu
src/neighbors/cagra_extend_float.cu
src/neighbors/cagra_extend_int8.cu
src/neighbors/cagra_extend_uint8.cu
src/neighbors/cagra_optimize.cu
src/neighbors/cagra_search_float.cu
src/neighbors/cagra_search_int8.cu
258 changes: 256 additions & 2 deletions cpp/include/cuvs/neighbors/cagra.hpp
@@ -25,6 +25,7 @@
#include <raft/core/host_device_accessor.hpp>
#include <raft/core/host_mdspan.hpp>
#include <raft/core/mdspan.hpp>
#include <raft/core/mdspan_types.hpp>
#include <raft/core/resource/stream_view.hpp>
#include <raft/core/resources.hpp>
#include <raft/util/integer_utils.hpp>
@@ -173,6 +174,25 @@ struct search_params : cuvs::neighbors::search_params {
/**
* @}
*/

/**
* @defgroup cagra_cpp_extend_params CAGRA index extend parameters
* @{
*/

struct extend_params {
/** The additional dataset is divided into chunks and added to the graph. This is the knob to
* adjust the tradeoff between the recall and operation throughput. Large chunk sizes can result
* in high throughput, but use more working memory (O(max_chunk_size*degree^2)). This can also
* degrade recall because no edges are added between the nodes in the same chunk. Auto select when
* 0. */
uint32_t max_chunk_size = 0;
};

/**
* @}
*/
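
The chunking described for `max_chunk_size` can be pictured with the following host-side sketch. It is an assumption-laden illustration, not the library's implementation: `split_into_chunks` is a hypothetical helper, and treating `0` as "one chunk for everything" is only a stand-in for the auto-selection, whose actual rule is not shown in this diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns half-open row ranges [begin, end) that cover num_rows rows,
// each holding at most max_chunk_size rows.
inline std::vector<std::pair<std::size_t, std::size_t>> split_into_chunks(
  std::size_t num_rows, std::size_t max_chunk_size)
{
  std::vector<std::pair<std::size_t, std::size_t>> chunks;
  // Stand-in for "auto select when 0" (assumption): use a single chunk.
  if (max_chunk_size == 0) { max_chunk_size = num_rows; }
  for (std::size_t begin = 0; begin < num_rows; begin += max_chunk_size) {
    chunks.emplace_back(begin, std::min(begin + max_chunk_size, num_rows));
  }
  return chunks;
}
```

For example, 10 additional rows with `max_chunk_size = 4` give the ranges [0, 4), [4, 8) and [8, 10). Since no edges are added between nodes of the same chunk, fewer and larger chunks trade recall for throughput.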

static_assert(std::is_aggregate_v<index_params>);
static_assert(std::is_aggregate_v<search_params>);

@@ -644,11 +664,244 @@ auto build(raft::resources const& res,
*/

/**
* @defgroup cagra_cpp_index_search CAGRA search functions
* @defgroup cagra_cpp_index_extend CAGRA extend functions
* @{
*/

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_device_matrix<float, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on device memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::device_matrix_view<const float, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<float, uint32_t>& idx,
std::optional<raft::device_matrix_view<float, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_host_matrix<float, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on host memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::host_matrix_view<const float, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<float, uint32_t>& idx,
std::optional<raft::device_matrix_view<float, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_device_matrix<int8_t, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on device memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::device_matrix_view<const int8_t, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<int8_t, uint32_t>& idx,
std::optional<raft::device_matrix_view<int8_t, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_host_matrix<int8_t, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on host memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::host_matrix_view<const int8_t, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<int8_t, uint32_t>& idx,
std::optional<raft::device_matrix_view<int8_t, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_device_matrix<uint8_t, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on device memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::device_matrix_view<const uint8_t, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<uint8_t, uint32_t>& idx,
std::optional<raft::device_matrix_view<uint8_t, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);

/** @brief Add new vectors to a CAGRA index
*
* Usage example:
* @code{.cpp}
* using namespace cuvs::neighbors;
* auto additional_dataset = raft::make_host_matrix<uint8_t, int64_t>(handle, add_size, dim);
* // set_additional_dataset(additional_dataset.view());
*
* cagra::extend_params params;
* cagra::extend(handle, params, raft::make_const_mdspan(additional_dataset.view()), index);
* @endcode
*
* @param[in] handle raft resources
* @param[in] params extend params
* @param[in] additional_dataset additional dataset on host memory
* @param[in,out] idx CAGRA index
* @param[out] new_dataset_buffer_view memory buffer view for the dataset including the additional
* part. The data will be copied from the current index in this function. The num rows must be the
* sum of the original and additional datasets, cols must be the dimension of the dataset, and the
* stride must be the same as the original index dataset. This view will be stored in the output
* index. It is the caller's responsibility to ensure that dataset stays alive as long as the index.
* This option is useful when users want to manage the memory space for the dataset themselves.
* @param[out] new_graph_buffer_view memory buffer view for the graph including the additional part.
* The data will be copied from the current index in this function. The num rows must be the sum of
* the original and additional datasets and cols must be the graph degree. This view will be stored
* in the output index. It is the caller's responsibility to ensure that the graph stays alive as long
* as the index. This option is useful when users want to manage the memory space for the graph
* themselves.
*/
void extend(
raft::resources const& handle,
const cagra::extend_params& params,
raft::host_matrix_view<const uint8_t, int64_t, raft::row_major> additional_dataset,
cuvs::neighbors::cagra::index<uint8_t, uint32_t>& idx,
std::optional<raft::device_matrix_view<uint8_t, int64_t, raft::layout_stride>>
new_dataset_buffer_view = std::nullopt,
std::optional<raft::device_matrix_view<uint32_t, int64_t>> new_graph_buffer_view = std::nullopt);
/**
* @}
*/

/**
* @defgroup cagra_cpp_index_search CAGRA search functions
* @{
* @brief Search ANN using the constructed index.
*
* See the [cagra::build](#cagra::build) documentation for a usage example.
@@ -658,13 +911,14 @@ auto build(raft::resources const& res,
*
* @param[in] res raft resources
* @param[in] params configure the search
* @param[in] index cagra index
* @param[in] idx cagra index
* @param[in] queries a device matrix view to a row-major matrix [n_queries, index->dim()]
* @param[out] neighbors a device matrix view to the indices of the neighbors in the source dataset
* [n_queries, k]
* @param[out] distances a device matrix view to the distances to the selected neighbors [n_queries,
* k]
*/

void search(raft::resources const& res,
cuvs::neighbors::cagra::search_params const& params,
const cuvs::neighbors::cagra::index<float, uint32_t>& index,