Merge pull request #6 from jlmelville/v0.2.0
V0.2.0
jlmelville authored Sep 21, 2019
2 parents 7067cea + 30b4bda commit ede2764
Showing 12 changed files with 482 additions and 170 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: RcppHNSW
Title: 'Rcpp' Bindings for 'hnswlib', a Library for Approximate Nearest Neighbors
Version: 0.1.0.9000
Version: 0.2.0
Authors@R: c(person("James", "Melville", email = "[email protected]",
role = c("aut", "cre")), person("Aaron", "Lun", role = "ctb"))
Description: 'Hnswlib' is a C++ library for Approximate Nearest Neighbors. This
26 changes: 26 additions & 0 deletions NEWS.md
@@ -1,3 +1,29 @@
# RcppHNSW 0.2.0

## New features

* Updated hnswlib to <https://github.com/nmslib/hnswlib/commit/c5c38f0>
(20 September 2019).
* A new method, `markDeleted`, that marks an item so that it is no longer
returned from searches of the index.
* A new method, `resizeIndex`, that allows the capacity of the index to be
increased without having to save and reload the index.
* A new method, `size`, is available for the index objects and reports the
number of items added to the index.


## Bug fixes and minor improvements

* `hnsw_search` would `stop` if the number of rows in the input matrix was
smaller than `k`. This check has been removed. Note that the correct behavior is
to ensure that `k` is smaller than or equal to `index$size()` where `index` is
the index you are searching. Because the `size()` method is new to this version,
to preserve compatibility with old indexes, this check *hasn't* been added to
`hnsw_search`. If this matters to you, manually compare `index$size()` with `k`
before running `hnsw_search`. An error will be thrown if `k` neighbors can't be
found in the index. Thank you to [Yuxing Liao](https://github.com/yxngl) for
spotting this and for the pull request to remove the check.

# RcppHNSW 0.1.0

Initial release.
4 changes: 2 additions & 2 deletions R/hnsw.R
@@ -92,7 +92,7 @@ hnsw_knn <- function(X, k = 10, distance = "euclidean",
hnsw_search(X = X, ann = ann, k = k, ef = ef, verbose = verbose)
}

#' Build a nearest neighbor index
#' Build an hnswlib nearest neighbor index
#'
#' @param X a numeric matrix of data to add. Each of the n rows is an item in
#' the index.
@@ -159,7 +159,7 @@ hnsw_build <- function(X, distance = "euclidean", M = 16, ef = 200,
ann
}

#' Search an HNSW nearest neighbor index
#' Search an hnswlib nearest neighbor index
#'
#' @param X A numeric matrix of data to search for neighbors.
#' @param ann an instance of a \code{HnswL2}, \code{HnswCosine} or \code{HnswIp}
108 changes: 72 additions & 36 deletions README.md
@@ -10,7 +10,15 @@ Rcpp bindings for [hnswlib](https://github.com/nmslib/hnswlib).

### Status

*October 20 2018*. By inserting some preprocessor symbols into HNSW, these
*September 20 2019*. RcppHNSW 0.2.0 is now available on CRAN, up to date with
hnswlib at <https://github.com/nmslib/hnswlib/commit/c5c38f0>, with new methods:
`size`, `resizeIndex` and `markDeleted`. Also, a bug that prevented searching
with datasets smaller than `k` has been fixed. Thanks to
[Yuxing Liao](https://github.com/yxngl) for spotting that.

*January 21 2019*. RcppHNSW is now available on CRAN.

*October 20 2018*. By inserting some preprocessor symbols into hnswlib, these
bindings no longer require a non-portable compiler flag and hence will pass `R
CMD CHECK` without any warnings: previously you would be warned about
`-march=native`. The price paid is not using specialized functions for the
@@ -102,12 +110,12 @@ res <- ann$getNNsList(data[1, ], k = 4, include_distances = TRUE)
# Inner Product: HnswIp
```

And here's a rough equivalent of the serialization/deserialization example from
Here's a rough equivalent of the serialization/deserialization example from
the
[hnswlib README](https://github.com/nmslib/hnswlib#python-bindings-examples).
Although the index must have its initial size specified when its created, you
can increase its size by saving it to disk, then specifying a new larger size
when you read it back, as the following demonstrates:
[hnswlib README](https://github.com/nmslib/hnswlib#python-bindings-examples),
but using the recently added `resizeIndex` method to increase the capacity of
the index after it has been created, avoiding having to write to and read from
disk:

```R
library("RcppHNSW")
@@ -136,15 +144,8 @@ p$addItems(data1)
idx <- p$getAllNNs(data1, k = 1)
message("Recall for the first batch: ", formatC(mean(idx == 1:nrow(data1))))

filename <- "first_half.bin"
# Serialize index
p$save(filename)

# Reinitialize and load the index
rm(p)
message("Loading index from ", filename)
# Increase the total capacity, so that it will handle the new data
p <- new(HnswL2, dim, filename, num_elements)
p$resizeIndex(num_elements)

message("Adding the second batch of ", nrow(data2), " elements")
p$addItems(data2)
@@ -156,6 +157,21 @@ idx <- p$getAllNNs(data, k = 1)
# res$dist contains the distance matrix, res$item stores the indexes

message("Recall for two batches: ", formatC(mean(idx == 1:num_elements)))
```

Although there's no longer any need for this, for completeness, here's how you
would use `save` and `new` to achieve the same effect without `resizeIndex`:

```R
filename <- "first_half.bin"
# Serialize index
p$save(filename)

# Reinitialize and load the index
rm(p)
message("Loading index from ", filename)
# Increase the total capacity, so that it will handle the new data
p <- new(HnswL2, dim, filename, num_elements)
unlink(filename)
```

@@ -195,40 +211,60 @@ with `dim` dimensions from the specified `filename`.
maximum capacity of `max_elements`. This is a way to increase the capacity of
the index without a complete rebuild.
* `setEf(ef)` set search parameter `ef`.
* `addItem(v)` add vector `v` to the index.
* `addItems(m)` add the row vectors of the matrix `m` to the index.
* `addItem(v)` add vector `v` to the index. Internally, each vector gets an
increasing integer label, with the first vector added getting the label `1`, the
second `2` and so on. These labels are returned in `getNNs` and related methods
to identify which vectors in the index are neighbors.
* `addItems(m)` add the row vectors of the matrix `m` to the index. Internally,
each row vector gets an increasing integer label, with the first row added
getting the label `1`, the second `2` and so on. These labels are returned in
`getNNs` and related methods to identify which vectors in the index are
neighbors.
* `save(filename)` saves an index to the specified `filename`. To load an index,
use the `new(HnswL2, dim, filename)` constructor (see above).
* `getNNs(v, k)` return a vector of the indices of the `k`-nearest neighbors of
the vector `v`. Indices are numbered from one. If `k` neighbors can't be found,
an error will be thrown. This normally means that `ef` or `M` have been set
too small, but also bear in mind that you can't return more items than were
put into the index.
* `getNNs(v, k)` return a vector of the labels of the `k`-nearest neighbors of
the vector `v`. Labels are integers numbered from one, representing the
insertion order into the index, e.g. the label `1` represents the first item
added to the index. If `k` neighbors can't be found, an error will be thrown.
This normally means that `ef` or `M` have been set too small, but also bear in
mind that you can't return more items than were put into the index.
* `getNNsList(v, k, include_distances = FALSE)` return a list containing a
vector named `item` with the indices of the `k`-nearest neighbors of
the vector `v`. Indices are numbered from one. If `include_distances = TRUE`
then also return a vector `distance` containing the distances. If `k` neighbors
can't be found, an error is thrown.
* `getAllNNs(m, k)` return a matrix of the indices of the `k`-nearest neighbors
of each row vector in `m`. Indices are numbered from one. If `k` neighbors
can't be found, an error is thrown.
vector named `item` with the labels of the `k`-nearest neighbors of the vector
`v`. Labels are integers numbered from one, representing the insertion order
into the index, e.g. the label `1` represents the first item added to the index.
If `include_distances = TRUE` then also return a vector `distance` containing
the distances. If `k` neighbors can't be found, an error is thrown.
* `getAllNNs(m, k)` return a matrix of the labels of the `k`-nearest neighbors
of each row vector in `m`. Labels are integers numbered from one, representing
the insertion order into the index, e.g. the label `1` represents the first item
added to the index. If `k` neighbors can't be found, an error is thrown.
* `getAllNNsList(m, k, include_distances = FALSE)` return a list containing a
matrix named `item` with the indices of the `k`-nearest neighbors of each row
vector in `m`. Indices are numbered from one. If `include_distances = TRUE`
then also return a matrix `distance` containing the distances. If `k` neighbors
can't be found, an error is thrown.
matrix named `item` with the labels of the `k`-nearest neighbors of each row
vector in `m`. Labels are integers numbered from one, representing the insertion
order into the index, e.g. the label `1` represents the first item added to the
index. If `include_distances = TRUE` then also return a matrix `distance`
containing the distances. If `k` neighbors can't be found, an error is thrown.
* `size()` returns the number of items in the index. This is an upper limit on
the number of neighbors you can expect to return from `getNNs` and the other
search methods.
* `markDeleted(i)` marks the item with label `i` (the `i`th item added to the
index) as deleted. This means that the item will not be returned in any further
searches of the index. It does not reduce the memory used by the index, and
`size()` continues to count items that have been marked as deleted.
* `resizeIndex(max_elements)` changes the maximum capacity of the index to
`max_elements`.
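
To make the label, capacity and deletion rules in the method list concrete, here is a minimal sketch in C++ of a toy index that follows them. It is an illustration only: `ToyIndex` and `toy_index_example` are hypothetical names, the search is exact brute force rather than HNSW, the label returned by `addItem` is a convenience of the toy, and refusing to shrink below the current size is an assumption. None of it is RcppHNSW's or hnswlib's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical toy index: exact (brute-force) search, NOT hnswlib or RcppHNSW.
// It only mimics the label, capacity and deletion rules described above.
class ToyIndex {
public:
  ToyIndex(std::size_t dim, std::size_t max_elements)
      : dim_(dim), max_elements_(max_elements) {}

  // Adds a vector and returns its label: 1 for the first item, 2 for the next.
  std::size_t addItem(const std::vector<double>& v) {
    check_dim(v);
    if (items_.size() >= max_elements_) {
      throw std::runtime_error("index is full; call resizeIndex first");
    }
    items_.push_back(v);
    deleted_.push_back(false);
    return items_.size();
  }

  // Hides the item from later searches; memory use and size() are unchanged.
  void markDeleted(std::size_t label) {
    if (label < 1 || label > items_.size()) {
      throw std::out_of_range("no item with that label");
    }
    deleted_[label - 1] = true;
  }

  // Changes the capacity without a rebuild (toy assumption: no shrinking
  // below the number of stored items).
  void resizeIndex(std::size_t max_elements) {
    if (max_elements < items_.size()) {
      throw std::invalid_argument("new capacity is smaller than size()");
    }
    max_elements_ = max_elements;
  }

  std::size_t size() const { return items_.size(); }

  // Labels of the k nearest non-deleted items, nearest first.
  std::vector<std::size_t> getNNs(const std::vector<double>& q,
                                  std::size_t k) const {
    check_dim(q);
    std::vector<std::pair<double, std::size_t>> found;
    for (std::size_t i = 0; i < items_.size(); ++i) {
      if (deleted_[i]) continue;
      double d2 = 0.0;
      for (std::size_t j = 0; j < dim_; ++j) {
        const double diff = items_[i][j] - q[j];
        d2 += diff * diff;
      }
      found.emplace_back(d2, i + 1);  // store the 1-based label
    }
    if (found.size() < k) {
      throw std::runtime_error("fewer than k neighbors can be found");
    }
    std::sort(found.begin(), found.end());
    std::vector<std::size_t> labels;
    for (std::size_t n = 0; n < k; ++n) labels.push_back(found[n].second);
    return labels;
  }

private:
  void check_dim(const std::vector<double>& v) const {
    if (v.size() != dim_) throw std::invalid_argument("wrong dimension");
  }

  std::size_t dim_;
  std::size_t max_elements_;
  std::vector<std::vector<double>> items_;
  std::vector<bool> deleted_;
};

// Usage: labels start at 1, and a deleted item still counts towards size().
inline void toy_index_example() {
  ToyIndex index(2, 2);
  index.addItem({0.0, 0.0});  // label 1
  index.addItem({1.0, 1.0});  // label 2
  index.resizeIndex(3);       // make room for one more item
  index.addItem({5.0, 5.0});  // label 3
  assert(index.getNNs({0.1, 0.1}, 1).front() == 1u);
  index.markDeleted(1);
  assert(index.size() == 3u);
  assert(index.getNNs({0.1, 0.1}, 1).front() == 2u);
}
```

In the toy, asking for more neighbors than there are undeleted items throws, which mirrors the documented error when `k` neighbors can't be found.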

### Differences from Python Bindings

* Multi-threading is not supported.
* Arbitrary integer labeling is not supported. Items are labeled
`0, 1, 2 ... N`.
* Arbitrary integer labeling is not supported. Where labels are used, e.g. in
the return value of `getNNsList` or as input in `markDeleted`, the labels
represent the order in which the items were added to the index, using 1-indexing
to be consistent with R. So in the Python bindings, the first item in the index
has a default label of `0`, but here it will have label `1`.
* The interface roughly follows the Python one but deviates with naming and also
rolls the declaration and initialization of the index into one call. And as noted
above, you must pass arguments by position, not keyword.
rolls the declaration and initialization of the index into one call. And as
noted above, you must pass arguments by position, not keyword.
* I have made a change to the C++ `hnswalg.h` code to use the
[`showUpdate` macro from RcppAnnoy](https://github.com/eddelbuettel/rcppannoy/blob/498a2c241df0fcac140d80f9ee0a6985d0f08687/inst/include/annoylib.h#L57),
rather than `std::cerr` directly.
25 changes: 18 additions & 7 deletions cran-comments.md
@@ -1,17 +1,19 @@
## Release Summary

This is the first submission of the package.
This is a minor release to upgrade the HNSW library to a new version, add some
new methods and fix a user-reported bug with the search function. Also, URL
and BugReports entries have been added to the DESCRIPTION.

## Test environments

* ubuntu 14.04 (on travis-ci), R 3.4.4, R 3.5.1, R-devel
* ubuntu 16.04 (on rhub), R 3.4.4
* fedora (on rhub), R-devel
* ubuntu 14.04 (on travis-ci), R 3.4.4, R 3.6.0, R-devel
* ubuntu 16.04 (on rhub), R 3.6.1
* fedora 30 (on rhub), R-devel
* debian (on rhub), R-devel
* mac OS X High Sierra (on travis-ci), R 3.4.4, R 3.5.2
* local Windows 10 build, R 3.5.2
* mac OS X High Sierra (on travis-ci), R 3.5.3, R 3.6.1
* local Windows 10 build, R 3.6.1
* Windows Server 2008 (on rhub) R-devel
* Windows Server 2012 (on appveyor) R 3.5.2
* Windows Server 2012 (on appveyor) R 3.6.1
* win-builder (devel)

## R CMD check results
@@ -24,6 +26,15 @@ There was a message about possibly mis-spelled words in DESCRIPTION:

This is spelled correctly.

With r-hub checking, on Windows only, there was a message:

"N checking for non-standard things in the check directory
Found the following files/directories:
'examples_x64' 'tests_i386' 'tests_x64'
'RcppHNSW-Ex_i386.Rout' 'RcppHNSW-Ex_x64.Rout' 'examples_i386'"

This would seem to be something to do with r-hub rather than a real problem.

## Downstream dependencies

None.
35 changes: 25 additions & 10 deletions inst/include/bruteforce.h
@@ -1,6 +1,7 @@
#pragma once
#include <unordered_map>
#include <fstream>
#include <mutex>

namespace hnswlib {
template<typename dist_t>
@@ -35,22 +36,37 @@ class BruteforceSearch : public AlgorithmInterface<dist_t> {
size_t data_size_;
DISTFUNC <dist_t> fstdistfunc_;
void *dist_func_param_;
std::mutex index_lock;

std::unordered_map<labeltype,size_t > dict_external_to_internal;

void addPoint(void *datapoint, labeltype label) {
if(dict_external_to_internal.count(label))
throw std::runtime_error("Ids have to be unique");

int idx;
{
std::unique_lock<std::mutex> lock(index_lock);



auto search=dict_external_to_internal.find(label);
if (search != dict_external_to_internal.end()) {
idx=search->second;
}
else{
if (cur_element_count >= maxelements_) {
throw std::runtime_error("The number of elements exceeds the specified limit\n");
}
idx=cur_element_count;
dict_external_to_internal[label] = idx;
cur_element_count++;
}
}
memcpy(data_ + size_per_element_ * idx + data_size_, &label, sizeof(labeltype));
memcpy(data_ + size_per_element_ * idx, datapoint, data_size_);



if (cur_element_count >= maxelements_) {
throw std::runtime_error("The number of elements exceeds the specified limit\n");
};
memcpy(data_ + size_per_element_ * cur_element_count + data_size_, &label, sizeof(labeltype));
memcpy(data_ + size_per_element_ * cur_element_count, datapoint, data_size_);
dict_external_to_internal[label]=cur_element_count;

cur_element_count++;
};

void removePoint(labeltype cur_external) {
@@ -123,7 +139,6 @@ class BruteforceSearch : public AlgorithmInterface<dist_t> {

input.close();

return;
}

};
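
As far as the `addPoint` hunk above shows, the old code rejected a repeated label with "Ids have to be unique", while the new code looks the label up under a mutex, reuses the existing slot if the label is already known, and only allocates a new slot (after the capacity check) otherwise. The following is a minimal, self-contained sketch of just that bookkeeping. `LabelSlots`, `slot_for` and `label_slots_example` are hypothetical names, labels are plain `std::size_t` here rather than `labeltype`, and the `memcpy` of the data is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Hypothetical stand-in for the label bookkeeping in addPoint; the copy of
// the point data performed by the real class is omitted.
class LabelSlots {
public:
  explicit LabelSlots(std::size_t max_elements)
      : max_elements_(max_elements) {}

  // Returns the internal slot for `label`: the existing slot if the label is
  // already known, otherwise the next free slot.
  std::size_t slot_for(std::size_t label) {
    std::lock_guard<std::mutex> lock(index_lock_);
    auto search = external_to_internal_.find(label);
    if (search != external_to_internal_.end()) {
      return search->second;  // known label: its data would be overwritten
    }
    if (cur_element_count_ >= max_elements_) {
      throw std::runtime_error(
          "The number of elements exceeds the specified limit");
    }
    const std::size_t idx = cur_element_count_++;
    external_to_internal_[label] = idx;
    return idx;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(index_lock_);
    return cur_element_count_;
  }

private:
  std::size_t max_elements_;
  std::size_t cur_element_count_ = 0;
  mutable std::mutex index_lock_;
  std::unordered_map<std::size_t, std::size_t> external_to_internal_;
};

// Usage: re-adding a label reuses its slot instead of throwing.
inline void label_slots_example() {
  LabelSlots slots(2);
  assert(slots.slot_for(42) == 0u);
  assert(slots.slot_for(7) == 1u);
  assert(slots.slot_for(42) == 0u);
  assert(slots.size() == 2u);
}
```

Keeping the lookup, the capacity check and the counter increment inside one critical section is what stops two concurrent callers from being handed the same slot; in the hunk, the data copy happens after the lock is released.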