Merge pull request #6 from jlmelville/v0.2.0

V0.2.0
jlmelville · Sep 21, 2019 · ede2764
2 parents 7067cea + 30b4bda
commit ede2764

Showing 12 changed files with 482 additions and 170 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: RcppHNSW
 Title: 'Rcpp' Bindings for 'hnswlib', a Library for Approximate Nearest Neighbors
-Version: 0.1.0.9000
+Version: 0.2.0
 Authors@R: c(person("James", "Melville", email = "[email protected]", 
   role = c("aut", "cre")), person("Aaron", "Lun", role = "ctb"))
 Description: 'Hnswlib' is a C++ library for Approximate Nearest Neighbors. This 

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,29 @@
+# RcppHNSW 0.2.0
+
+## New features
+
+* Updated hnswlib to <https://github.com/nmslib/hnswlib/commit/c5c38f0> 
+(20 September 2019).
+* A new method, `markDeleted`, that prevents an item from being returned by
+searches of the index.
+* A new method, `resizeIndex`, that allows the capacity of the index to be
+increased without having to save and reload the index.
+* A new method, `size`, is available for index objects and reports the
+number of items added to the index.
+
+
+## Bug fixes and minor improvements
+
+* `hnsw_search` would `stop` if the number of rows in the input matrix was 
+smaller than `k`. This check has been removed. Note that the correct behavior is
+to ensure that `k` is smaller than or equal to `index$size()` where `index` is
+the index you are searching. Because the `size()` method is new to this version,
+to preserve compatibility with old indexes, this check *hasn't* been added to
+`hnsw_search`. If this matters to you, manually compare `index$size()` with `k`
+before running `hnsw_search`. An error will be thrown if `k` neighbors can't be
+found in the index. Thank you to [Yuxing Liao](https://github.com/yxngl) for
+spotting this and for the pull request to remove the check.
+
 # RcppHNSW 0.1.0
 
 Initial release.
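Review note on the NEWS entry above: since `hnsw_search` no longer validates `k`, callers who want an early, readable failure can compare `k` against the index size themselves. The following is a minimal C++ sketch of that guard. `check_k` is a hypothetical helper, not part of RcppHNSW or hnswlib; in R the equivalent is comparing `k` with `index$size()` before calling `hnsw_search`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of RcppHNSW or hnswlib): fail early if more
// neighbors are requested than there are items stored in the index.
inline void check_k(std::size_t index_size, std::size_t k) {
  if (k > index_size) {
    throw std::invalid_argument(
        "k (" + std::to_string(k) +
        ") exceeds the number of items in the index (" +
        std::to_string(index_size) + ")");
  }
}
```

Calling `check_k(index_size, k)` before a search turns the late "can't find `k` neighbors" error into an argument error that names both values.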
diff --git a/R/hnsw.R b/R/hnsw.R
@@ -92,7 +92,7 @@ hnsw_knn <- function(X, k = 10, distance = "euclidean",
   hnsw_search(X = X, ann = ann, k = k, ef = ef, verbose = verbose)
 }
 
-#' Build a nearest neighbor index
+#' Build an hnswlib nearest neighbor index
 #'
 #' @param X a numeric matrix of data to add. Each of the n rows is an item in
 #'   the index.
@@ -159,7 +159,7 @@ hnsw_build <- function(X, distance = "euclidean", M = 16, ef = 200,
   ann
 }
 
-#' Search an HNSW nearest neighbor index
+#' Search an hnswlib nearest neighbor index
 #'
 #' @param X A numeric matrix of data to search for neighbors.
 #' @param ann an instance of a \code{HnswL2}, \code{HnswCosine} or \code{HnswIp}

diff --git a/README.md b/README.md
@@ -10,7 +10,15 @@ Rcpp bindings for [hnswlib](https://github.com/nmslib/hnswlib).
 
 ### Status
 
-*October 20 2018*. By inserting some preprocessor symbols into HNSW, these 
+*September 20 2019*. RcppHNSW 0.2.0 is now available on CRAN, up to date with
+hnswlib at <https://github.com/nmslib/hnswlib/commit/c5c38f0>, with new methods:
+`size`, `resizeIndex` and `markDeleted`. Also, a bug that prevented searching
+with datasets smaller than `k` has been fixed. Thanks to 
+[Yuxing Liao](https://github.com/yxngl) for spotting that.
+
+*January 21 2019*. RcppHNSW is now available on CRAN.
+
+*October 20 2018*. By inserting some preprocessor symbols into hnswlib, these 
 bindings no longer require a non-portable compiler flag and hence will pass `R
 CMD CHECK` without any warnings: previously you would be warned about
 `-march=native`. The price paid is not using specialized functions for the
@@ -102,12 +110,12 @@ res <- ann$getNNsList(data[1, ], k = 4, include_distances = TRUE)
 # Inner Product: HnswIP
 ```
 
-And here's a rough equivalent of the serialization/deserialization example from
+Here's a rough equivalent of the serialization/deserialization example from
 the 
-[hnswlib README](https://github.com/nmslib/hnswlib#python-bindings-examples).
-Although the index must have its initial size specified when its created, you
-can increase its size by saving it to disk, then specifying a new larger size
-when you read it back, as the following demonstrates:
+[hnswlib README](https://github.com/nmslib/hnswlib#python-bindings-examples), 
+but using the recently-added `resizeIndex` method to increase the size of the
+index after its initial specification, avoiding having to read from or write
+to disk:
 
 ```R
 library("RcppHNSW")
@@ -136,15 +144,8 @@ p$addItems(data1)
 idx <- p$getAllNNs(data1, k = 1)
 message("Recall for the first batch: ", formatC(mean(idx == 1:nrow(data1))))
 
-filename <- "first_half.bin"
-# Serialize index
-p$save(filename)
-
-# Reinitialize and load the index
-rm(p)
-message("Loading index from ", filename)
 # Increase the total capacity, so that it will handle the new data
-p <- new(HnswL2, dim, filename, num_elements)
+p$resizeIndex(num_elements)
 
 message("Adding the second batch of ", nrow(data2), " elements")
 p$addItems(data2)
@@ -156,6 +157,21 @@ idx <- p$getAllNNs(data, k = 1)
 # res$dist contains the distance matrix, res$item stores the indexes
 
 message("Recall for two batches: ", formatC(mean(idx == 1:num_elements)))
+```
+
+Although there's no longer any need for this, for completeness, here's how you
+would use `save` and `new` to achieve the same effect without `resizeIndex`:
+
+```R
+filename <- "first_half.bin"
+# Serialize index
+p$save(filename)
+
+# Reinitialize and load the index
+rm(p)
+message("Loading index from ", filename)
+# Increase the total capacity, so that it will handle the new data
+p <- new(HnswL2, dim, filename, num_elements)
 unlink(filename)
 ```
 
@@ -195,40 +211,60 @@ with `dim` dimensions from the specified `filename`.
 maximum capacity of `max_elements`. This is a way to increase the capacity of
 the index without a complete rebuild.
 * `setEf(ef)` set search parameter `ef`.
-* `addItem(v)` add vector `v` to the index.
-* `addItems(m)` add the row vectors of the matrix `m` to the index.
+* `addItem(v)` add vector `v` to the index. Internally, each vector gets an
+increasing integer label, with the first vector added getting the label `1`, the
+second `2` and so on. These labels are returned in `getNNs` and related methods
+to identify which vectors in the index are neighbors.
+* `addItems(m)` add the row vectors of the matrix `m` to the index. Internally,
+each row vector gets an increasing integer label, with the first row added
+getting the label `1`, the second `2` and so on. These labels are returned in
+`getNNs` and related methods to identify which vectors in the index are
+neighbors.
 * `save(filename)` saves an index to the specified `filename`. To load an index,
 use the `new(HnswL2, dim, filename)` constructor (see above).
-* `getNNs(v, k)` return a vector of the indices of the `k`-nearest neighbors of
-the vector `v`. Indices are numbered from one. If `k` neighbors can't be found,
-an error will be thrown. This normally means that `ef` or `M` have been set 
-too small, but also bear in mind that you can't return more items than were
-put into the index.
+* `getNNs(v, k)` return a vector of the labels of the `k`-nearest neighbors of
+the vector `v`. Labels are integers numbered from one, representing the
+insertion order into the index, e.g. the label `1` represents the first item
+added to the index. If `k` neighbors can't be found, an error will be thrown.
+This normally means that `ef` or `M` have been set too small, but also bear in
+mind that you can't return more items than were put into the index.
 * `getNNsList(v, k, include_distances = FALSE)` return a list containing a
-vector named `item` with the indices of the `k`-nearest neighbors of
-the vector `v`. Indices are numbered from one. If `include_distances = TRUE`
-then also return a vector `distance` containing the distances. If `k` neighbors
-can't be found, an error is thrown.
-* `getAllNNs(m, k)` return a matrix of the indices of the `k`-nearest neighbors
-of each row vector in `m`. Indices are numbered from one. If `k` neighbors
-can't be found, an error is thrown.
+vector named `item` with the labels of the `k`-nearest neighbors of the vector
+`v`. Labels are integers numbered from one, representing the insertion order
+into the index, e.g. the label `1` represents the first item added to the index.
+If `include_distances = TRUE` then also return a vector `distance` containing
+the distances. If `k` neighbors can't be found, an error is thrown.
+* `getAllNNs(m, k)` return a matrix of the labels of the `k`-nearest neighbors
+of each row vector in `m`. Labels are integers numbered from one, representing
+the insertion order into the index, e.g. the label `1` represents the first item
+added to the index. If `k` neighbors can't be found, an error is thrown.
 * `getAllNNsList(m, k, include_distances = FALSE)` return a list containing a
-matrix named `item` with the indices of the `k`-nearest neighbors of each row
-vector in `m`. Indices are numbered from one. If `include_distances = TRUE`
-then also return a matrix `distance` containing the distances. If `k` neighbors
-can't be found, an error is thrown.
+matrix named `item` with the labels of the `k`-nearest neighbors of each row
+vector in `m`. Labels are integers numbered from one, representing the insertion
+order into the index, e.g. the label `1` represents the first item added to the
+index. If `include_distances = TRUE` then also return a matrix `distance`
+containing the distances. If `k` neighbors can't be found, an error is thrown.
 * `size()` returns the number of items in the index. This is an upper limit on
 the number of neighbors you can expect to return from `getNNs` and the other
 search methods.
+* `markDeleted(i)` marks the item with label `i` (the `i`th item added to the
+index) as deleted. This means that the item will not be returned in any further
+searches of the index. It does not reduce the memory used by the index. Calls to
+`size()` do *not* reflect the number of items marked as deleted.
+* `resizeIndex(max_elements)` changes the maximum capacity of the index to 
+`max_elements`.
 
 ### Differences from Python Bindings
 
 * Multi-threading is not supported.
-* Arbitrary integer labeling is not supported. Items are labeled 
-`0, 1, 2 ... N`.
+* Arbitrary integer labeling is not supported. Where labels are used, e.g. in
+the return value of `getNNsList` or as input in `markDeleted`, the labels 
+represent the order in which the items were added to the index, using 1-indexing
+to be consistent with R. So in the Python bindings, the first item in the index
+has a default of label `0`, but here it will have label `1`.
 * The interface roughly follows the Python one but deviates with naming and also
-rolls the declaration and initialization of the index into one call. And as noted
-above, you must pass arguments by position, not keyword.
+rolls the declaration and initialization of the index into one call. And as
+noted above, you must pass arguments by position, not keyword.
 * I have made a change to the C++ `hnswalg.h` code to use the 
 [`showUpdate` macro from RcppAnnoy](https://github.com/eddelbuettel/rcppannoy/blob/498a2c241df0fcac140d80f9ee0a6985d0f08687/inst/include/annoylib.h#L57),
 rather than `std::cerr` directly.
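Review note on the README method list: the label and deletion rules described above can be modelled with a toy index. This is an illustrative C++ sketch under stated assumptions: `ToyIndex` and its members are hypothetical, it stores one-dimensional values and does a linear scan, and it does not use hnswlib. It only mirrors the documented behavior: labels start at 1 in insertion order, items marked as deleted are skipped by searches, and `size()` still counts them.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Toy 1-D index, purely illustrative: it mimics the documented label and
// deletion semantics, not the hnswlib graph search.
class ToyIndex {
 public:
  // Returns the 1-based label assigned to the new item (insertion order).
  std::size_t addItem(double v) {
    values_.push_back(v);
    deleted_.push_back(false);
    return values_.size();
  }

  // Marks the item with 1-based label `label` as deleted; storage is kept.
  void markDeleted(std::size_t label) {
    if (label == 0 || label > values_.size()) {
      throw std::out_of_range("no item with this label");
    }
    deleted_[label - 1] = true;
  }

  // Number of items ever added; deliberately ignores deletion marks.
  std::size_t size() const { return values_.size(); }

  // 1-based label of the nearest non-deleted item; throws if none remain.
  std::size_t nearest(double query) const {
    std::size_t best = 0;  // 0 is never a valid label in this scheme
    double best_dist = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < values_.size(); ++i) {
      if (deleted_[i]) continue;  // deleted items are never returned
      const double d = (values_[i] - query) * (values_[i] - query);
      if (d < best_dist) {
        best_dist = d;
        best = i + 1;  // internal 0-based position -> 1-based label
      }
    }
    if (best == 0) throw std::runtime_error("no neighbors found");
    return best;
  }

 private:
  std::vector<double> values_;
  std::vector<bool> deleted_;
};
```

Because `size()` ignores deletion marks in this model, a caller that needs the number of searchable items would have to track deletions separately.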

diff --git a/cran-comments.md b/cran-comments.md
@@ -1,17 +1,19 @@
 ## Release Summary
 
-This is the first submission of the package.
+This is a minor release to upgrade the HNSW library to a new version, add some
+new methods and fix a user-reported bug in the search function. Also, URL and
+BugReports entries have been added to the DESCRIPTION.
 
 ## Test environments
 
-* ubuntu 14.04 (on travis-ci), R 3.4.4, R 3.5.1, R-devel
-* ubuntu 16.04 (on rhub), R 3.4.4
-* fedora (on rhub), R-devel
+* ubuntu 14.04 (on travis-ci), R 3.4.4, R 3.6.0, R-devel
+* ubuntu 16.04 (on rhub), R 3.6.1
+* fedora 30 (on rhub), R-devel
 * debian (on rhub), R-devel
-* mac OS X High Sierra (on travis-ci), R 3.4.4, R 3.5.2
-* local Windows 10 build, R 3.5.2
+* mac OS X High Sierra (on travis-ci), R 3.5.3, R 3.6.1
+* local Windows 10 build, R 3.6.1
 * Windows Server 2008 (on rhub) R-devel
-* Windows Server 2012 (on appveyor) R 3.5.2
+* Windows Server 2012 (on appveyor) R 3.6.1
 * win-builder (devel)
 
 ## R CMD check results
@@ -24,6 +26,15 @@ There was a message about possibly mis-spelled words in DESCRIPTION:
 
 This is spelled correctly.
 
+When checking with r-hub, on Windows only, there was a message:
+
+"N  checking for non-standard things in the check directory
+   Found the following files/directories:
+     'examples_x64' 'tests_i386' 'tests_x64'
+     'RcppHNSW-Ex_i386.Rout' 'RcppHNSW-Ex_x64.Rout' 'examples_i386'"
+
+This would seem to be something to do with r-hub rather than a real problem.
+
 ## Downstream dependencies
 
 None.
diff --git a/inst/include/bruteforce.h b/inst/include/bruteforce.h
@@ -1,6 +1,7 @@
 #pragma once
 #include <unordered_map>
 #include <fstream>
+#include <mutex>
 
 namespace hnswlib {
 template<typename dist_t>
@@ -35,22 +36,37 @@ class BruteforceSearch : public AlgorithmInterface<dist_t> {
   size_t data_size_;
   DISTFUNC <dist_t> fstdistfunc_;
   void *dist_func_param_;
+  std::mutex index_lock;
 
   std::unordered_map<labeltype,size_t > dict_external_to_internal;
 
   void addPoint(void *datapoint, labeltype label) {
-    if(dict_external_to_internal.count(label))
-      throw std::runtime_error("Ids have to be unique");
+
+    int idx;
+    {
+      std::unique_lock<std::mutex> lock(index_lock);
+
+
+
+      auto search=dict_external_to_internal.find(label);
+      if (search != dict_external_to_internal.end()) {
+        idx=search->second;
+      }
+      else{
+        if (cur_element_count >= maxelements_) {
+          throw std::runtime_error("The number of elements exceeds the specified limit\n");
+        }
+        idx=cur_element_count;
+        dict_external_to_internal[label] = idx;
+        cur_element_count++;
+      }
+    }
+    memcpy(data_ + size_per_element_ * idx + data_size_, &label, sizeof(labeltype));
+    memcpy(data_ + size_per_element_ * idx, datapoint, data_size_);
+
 
 
-    if (cur_element_count >= maxelements_) {
-      throw std::runtime_error("The number of elements exceeds the specified limit\n");
-    };
-    memcpy(data_ + size_per_element_ * cur_element_count + data_size_, &label, sizeof(labeltype));
-    memcpy(data_ + size_per_element_ * cur_element_count, datapoint, data_size_);
-    dict_external_to_internal[label]=cur_element_count;
 
-    cur_element_count++;
   };
 
   void removePoint(labeltype cur_external) {
@@ -123,7 +139,6 @@ class BruteforceSearch : public AlgorithmInterface<dist_t> {
 
     input.close();
 
-    return;
   }
 
 };
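Review note on the `addPoint` hunk: as far as the diff shows, adding a point with a label that already exists no longer throws `Ids have to be unique`. Instead, the existing slot is looked up under `index_lock` and its data is overwritten, while a new label takes the next free slot. A simplified sketch of that slot-assignment logic follows. `SlotTable` and `slot_for` are hypothetical names, and the `memcpy` of the data (which the patched code performs after releasing the lock) is omitted.

```cpp
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Simplified stand-in for the slot bookkeeping in the patched addPoint.
// Names are hypothetical; only the lookup-or-allocate logic is modelled.
class SlotTable {
 public:
  explicit SlotTable(std::size_t max_elements) : max_elements_(max_elements) {}

  // Returns the storage slot for `label`. A label seen before gets its
  // existing slot back (so its data can be overwritten); a new label gets
  // the next free slot, or an exception if the table is full.
  std::size_t slot_for(std::size_t label) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = label_to_slot_.find(label);
    if (it != label_to_slot_.end()) {
      return it->second;
    }
    if (count_ >= max_elements_) {
      throw std::runtime_error(
          "The number of elements exceeds the specified limit");
    }
    const std::size_t slot = count_++;
    label_to_slot_[label] = slot;
    return slot;
  }

  // Number of distinct labels stored so far.
  std::size_t count() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return count_;
  }

 private:
  std::size_t max_elements_;
  std::size_t count_ = 0;
  mutable std::mutex mutex_;
  std::unordered_map<std::size_t, std::size_t> label_to_slot_;
};
```

One consequence worth noting in review: re-adding an existing label does not consume capacity, so the capacity check only applies to labels that have not been seen before.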