diff --git a/codelabs/get-started-with-vector-db-0/index.md b/codelabs/get-started-with-vector-db-00/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-0/index.md rename to codelabs/get-started-with-vector-db-00/index.md diff --git a/codelabs/get-started-with-vector-db-0/pic/embedding_arithmetic.jpg b/codelabs/get-started-with-vector-db-00/pic/embedding_arithmetic.jpg similarity index 100% rename from codelabs/get-started-with-vector-db-0/pic/embedding_arithmetic.jpg rename to codelabs/get-started-with-vector-db-00/pic/embedding_arithmetic.jpg diff --git a/codelabs/get-started-with-vector-db-0/pic/nearest_neighbors_example.jpg b/codelabs/get-started-with-vector-db-00/pic/nearest_neighbors_example.jpg similarity index 100% rename from codelabs/get-started-with-vector-db-0/pic/nearest_neighbors_example.jpg rename to codelabs/get-started-with-vector-db-00/pic/nearest_neighbors_example.jpg diff --git a/codelabs/get-started-with-vector-db-0/pic/photo-1617739680032-8ab04113d6dc.png b/codelabs/get-started-with-vector-db-00/pic/photo-1617739680032-8ab04113d6dc.png similarity index 100% rename from codelabs/get-started-with-vector-db-0/pic/photo-1617739680032-8ab04113d6dc.png rename to codelabs/get-started-with-vector-db-00/pic/photo-1617739680032-8ab04113d6dc.png diff --git a/codelabs/get-started-with-vector-db-0/slides.key b/codelabs/get-started-with-vector-db-00/slides.key similarity index 100% rename from codelabs/get-started-with-vector-db-0/slides.key rename to codelabs/get-started-with-vector-db-00/slides.key diff --git a/codelabs/get-started-with-vector-db-1/index.md b/codelabs/get-started-with-vector-db-01/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-1/index.md rename to codelabs/get-started-with-vector-db-01/index.md diff --git a/codelabs/get-started-with-vector-db-1/pic/architecture_diagram.png b/codelabs/get-started-with-vector-db-01/pic/architecture_diagram.png similarity index 100% rename from codelabs/get-started-with-vector-db-1/pic/architecture_diagram.png rename to codelabs/get-started-with-vector-db-01/pic/architecture_diagram.png diff --git a/codelabs/get-started-with-vector-db-2/index.md b/codelabs/get-started-with-vector-db-02/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-2/index.md rename to codelabs/get-started-with-vector-db-02/index.md diff --git a/codelabs/get-started-with-vector-db-2/pic/architecture_diagram.png b/codelabs/get-started-with-vector-db-02/pic/architecture_diagram.png similarity index 100% rename from codelabs/get-started-with-vector-db-2/pic/architecture_diagram.png rename to codelabs/get-started-with-vector-db-02/pic/architecture_diagram.png diff --git a/codelabs/get-started-with-vector-db-3/index.md b/codelabs/get-started-with-vector-db-03/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-3/index.md rename to codelabs/get-started-with-vector-db-03/index.md diff --git a/codelabs/get-started-with-vector-db-4/index.md b/codelabs/get-started-with-vector-db-04/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-4/index.md rename to codelabs/get-started-with-vector-db-04/index.md diff --git a/codelabs/get-started-with-vector-db-4/pic/hnsw_visualized.jpg b/codelabs/get-started-with-vector-db-04/pic/hnsw_visualized.jpg similarity index 100% rename from codelabs/get-started-with-vector-db-4/pic/hnsw_visualized.jpg rename to codelabs/get-started-with-vector-db-04/pic/hnsw_visualized.jpg diff --git 
a/codelabs/get-started-with-vector-db-5/index.md b/codelabs/get-started-with-vector-db-05/index.md similarity index 100% rename from codelabs/get-started-with-vector-db-5/index.md rename to codelabs/get-started-with-vector-db-05/index.md diff --git a/codelabs/get-started-with-vector-db-5/pic/voronoi_diagram.png b/codelabs/get-started-with-vector-db-05/pic/voronoi_diagram.png similarity index 100% rename from codelabs/get-started-with-vector-db-5/pic/voronoi_diagram.png rename to codelabs/get-started-with-vector-db-05/pic/voronoi_diagram.png diff --git a/codelabs/get-started-with-vector-db-6/index.md b/codelabs/get-started-with-vector-db-06/index.md similarity index 99% rename from codelabs/get-started-with-vector-db-6/index.md rename to codelabs/get-started-with-vector-db-06/index.md index 1997c4c..0f8a14d 100644 --- a/codelabs/get-started-with-vector-db-6/index.md +++ b/codelabs/get-started-with-vector-db-06/index.md @@ -225,7 +225,7 @@ import numpy as np from scipy.cluster.vq import kmeans2 -class ProductQuantizer: +class ProductQuantizer(object): def __init__(self, M=16, K=256): self.M = 16 diff --git a/codelabs/get-started-with-vector-db-6/pic/product_quantization.png b/codelabs/get-started-with-vector-db-06/pic/product_quantization.png similarity index 100% rename from codelabs/get-started-with-vector-db-6/pic/product_quantization.png rename to codelabs/get-started-with-vector-db-06/pic/product_quantization.png diff --git a/codelabs/get-started-with-vector-db-7/index.md b/codelabs/get-started-with-vector-db-07/index.md similarity index 90% rename from codelabs/get-started-with-vector-db-7/index.md rename to codelabs/get-started-with-vector-db-07/index.md index 1f43c4f..c10750b 100644 --- a/codelabs/get-started-with-vector-db-7/index.md +++ b/codelabs/get-started-with-vector-db-07/index.md @@ -1,5 +1,5 @@ summary: A deep dive into Hierarchical Navigable Small Worlds (HNSW) -id: vector-database-101-scalar-quantization-and-product-quantization +id: vector-database-101-hierarchical-navigable-small-worlds categories: Getting Started tags: getting-started status: Hidden @@ -13,7 +13,7 @@ Feedback Link: https://github.com/milvus-io/milvus ## Introduction Duration: 1 -Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we +Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we took a look at scalar quantization and product quantization - two indexing strategies which are used to reduce the overall _size_ of the database without reducing the scope of our search. To better illustrate how scalar quantization and product quantization works, we also implemented our own versions in Python. In this tutorial, we'll build on top of that knowledge by looking at what is perhaps the most commonly used primary algorithm today: Hierarchical Navigable Small Worlds (HNSW). HNSW performs very well when it comes to both speed and accuracy, making it an incredibly robust vector search algorithm. Despite it being popular, understanding HNSW can be a bit tricky, but don't fret - in the next couple of sections, we'll break down HNSW into its individual steps, developing our own simple implementation along the way. @@ -74,7 +74,7 @@ As with the skip list, the query vector will appear in upper layers with exponen ## Implementing HNSW Duration: 8 -HNSW is not trivial to implement, so we'll implement only a very basic version here. 
As usual, let's start with a dataset and a query vector: +HNSW is not trivial to implement, so we'll implement only a very basic version here. As usual, let's start with creating a dataset of (128 dimensional) vectors: ```python >>> import numpy as np @@ -111,7 +111,7 @@ def _search_layer(graph, entry, query, ef=1): # loop through all nearest neighbors to the candidate vector for e in graph[cv[1]][1]: - d = np.linalg.norm(graph[cv][0] - query) + d = np.linalg.norm(graph[e][0] - query) if (d, e) not in visit: visit.add((d, e)) @@ -140,7 +140,7 @@ def search(index, query, ef=1): best_v = 0 # set the initial best vertex to the entry point for graph in index: - (best_v, best_d) = _search_layer(graph, best_v, query, ef=1)[0] + best_d, best_v = _search_layer(graph, best_v, query, ef=1)[0] if graph[best_v][2]: best_v = graph[best_v][2] else: @@ -161,7 +161,7 @@ def _get_insert_layer(L, mL): With everything in place, we can now implement the insertion function. ```python -def insert(index, vec, L=5, efc=10): +def insert(self, vec, efc=10): # if the index is empty, insert the vector into all layers and return if not index[0]: @@ -172,19 +172,19 @@ def insert(index, vec, L=5, efc=10): return l = _get_insert_layer(1/np.log(L)) - start_v = 0 + start_v = 0 for n, graph in enumerate(index): # perform insertion for layers [l, L) only if n < l: - start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0] + _, start_v = _search_layer(graph, start_v, vec, ef=1)[0] else: - node = (vec, [], len(index[n+1]) if n < L-1 else None) + node = (vec, [], len(_index[n+1]) if n < L-1 else None) nns = _search_layer(graph, start_v, vec, ef=efc) for nn in nns: - node.append(nn[1]) # outbound connections to NNs - graph[nn[1]].append(len(graph)) # inbound connections to node + node[1].append(nn[1]) # outbound connections to NNs + graph[nn[1]][1].append(len(graph)) # inbound connections to node graph.append(node) # set the starting vertex to the nearest neighbor in the next layer @@ -193,17 +193,20 @@ def insert(index, vec, L=5, efc=10): If the index is empty, we'll insert `vec` into all layers and return immediately. This serves to initialize the index and allow for successful insertions later. If the index has already been populated, we begin insertion by first computing the insertion layer via the `get_insert_layer` function we implemented in the previous step. From there, we find the nearest neighbor to the vector in the uppermost graph. This process continues for the layers below it until we reach layer `l`, the insertion layer. -For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader (). +For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader :sunny:. We now have both search (query) and insert functionality complete. 
Let's combine everything together in a class: ```python from bisect import insort from heapq import heapify, heappop, heappush + import numpy as np +from ._base import _BaseIndex -class HNSW: + +class HNSW(_BaseIndex): def __init__(self, L=5, mL=0.62, efc=10): self._L = L @@ -230,7 +233,7 @@ class HNSW: # loop through all nearest neighbors to the candidate vector for e in graph[cv[1]][1]: - d = np.linalg.norm(graph[cv][0] - query) + d = np.linalg.norm(graph[e][0] - query) if (d, e) not in visit: visit.add((d, e)) @@ -243,6 +246,10 @@ class HNSW: return nns + def create(self, dataset): + for v in dataset: + self.insert(v) + def search(self, query, ef=1): # if the index is empty, return an empty list @@ -251,23 +258,23 @@ class HNSW: best_v = 0 # set the initial best vertex to the entry point for graph in self._index: - best_v, best_d = HNSW._search_layer(graph, best_v, query, ef=1)[0] + best_d, best_v = HNSW._search_layer(graph, best_v, query, ef=1)[0] if graph[best_v][2]: best_v = graph[best_v][2] else: return HNSW._search_layer(graph, best_v, query, ef=ef) def _get_insert_layer(self): - # ml is a multiplicative factor used to normalized the distribution + # ml is a multiplicative factor used to normalize the distribution l = -int(np.log(np.random.random()) * self._mL) return min(l, self._L-1) def insert(self, vec, efc=10): # if the index is empty, insert the vector into all layers and return - if not index[0]: + if not self._index[0]: i = None - for graph in index[::-1]: + for graph in self._index[::-1]: graph.append((vec, [], i)) i = 0 return @@ -275,17 +282,17 @@ class HNSW: l = self._get_insert_layer() start_v = 0 - for n, graph in enumerate(index): + for n, graph in enumerate(self._index): # perform insertion for layers [l, L) only if n < l: - start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0] + _, start_v = self._search_layer(graph, start_v, vec, ef=1)[0] else: - node = (vec, [], len(index[n+1]) if n < self._L-1 else None) - nns = _search_layer(graph, start_v, vec, ef=efc) + node = (vec, [], len(self._index[n+1]) if n < self._L-1 else None) + nns = self._search_layer(graph, start_v, vec, ef=efc) for nn in nns: - node.append(nn[1]) # outbound connections to NNs - graph[nn[1]].append(len(graph)) # inbound connections to node + node[1].append(nn[1]) # outbound connections to NNs + graph[nn[1]][1].append(len(graph)) # inbound connections to node graph.append(node) # set the starting vertex to the nearest neighbor in the next layer diff --git a/codelabs/get-started-with-vector-db-07/pic/better_graph.png b/codelabs/get-started-with-vector-db-07/pic/better_graph.png new file mode 100644 index 0000000..a3761cd Binary files /dev/null and b/codelabs/get-started-with-vector-db-07/pic/better_graph.png differ diff --git a/codelabs/get-started-with-vector-db-07/pic/hnsw_visualized.jpg b/codelabs/get-started-with-vector-db-07/pic/hnsw_visualized.jpg new file mode 100644 index 0000000..f6f759d Binary files /dev/null and b/codelabs/get-started-with-vector-db-07/pic/hnsw_visualized.jpg differ diff --git a/codelabs/get-started-with-vector-db-07/pic/skip_list.png b/codelabs/get-started-with-vector-db-07/pic/skip_list.png new file mode 100644 index 0000000..0f4aa74 Binary files /dev/null and b/codelabs/get-started-with-vector-db-07/pic/skip_list.png differ diff --git a/codelabs/get-started-with-vector-db-08/index.md b/codelabs/get-started-with-vector-db-08/index.md new file mode 100644 index 0000000..60e7844 --- /dev/null +++ b/codelabs/get-started-with-vector-db-08/index.md @@ -0,0 +1,283 @@ 
+summary: A deep dive into Approximate Nearest Neighbor Oh Yeah (Annoy) +id: vector-database-101-approximate-nearest-neighbor-oh-yeah +categories: Getting Started +tags: getting-started +status: Hidden +authors: Frank Liu +Feedback Link: https://github.com/milvus-io/milvus + +--- + +# Vector Database 101 - Approximate Nearest Neighbor Oh Yeah + +## Introduction +Duration: 1 + +Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we did a deep dive into Hierarchical Navigable Small Worlds, or HNSW for short. HNSW is a graph-based indexing algorithm that's one of the most popular indexing strategies used in vector databases today. + +In this tutorial, we'll switch gears and talk about _tree-based vector indexes_. Specifically, we'll talk about Approximate Nearest Neighbor Oh Yeah (Annoy) - an algorithm that uses a forest of trees to conduct nearest neighbor search. For those who are familiar with random forests or gradient-boosted decision trees, Annoy can seem like a very natural extension of these algorithms, only for nearest neighbor search rather than machine learning. As with our HNSW tutorial, we'll first walk through how Annoy works from a high level before developing our own simple Python implementation. + +

*Annoy, visualized (from https://github.com/spotify/annoy).*

## Annoy basics
Duration: 3

Whereas HNSW is built upon connected graphs and skip lists, Annoy uses binary search trees as its core data structure. The key idea behind Annoy (and other tree-based indexes) is to repeatedly partition our vector space and to search only a subset of the partitions for nearest neighbors. If this sounds a bit like IVF, you're absolutely right; the idea is the same, but the execution is a bit different.

The best way to understand Annoy is to visualize how a single tree is built. Keep in mind that high-dimensional hyperspaces are very different from 2D/3D Euclidean spaces from an intuitive perspective, so the images below should ideally only be used as a reference.

Let's start with indexing. For Annoy, this is a recursive process where the maximum size of the call stack is the depth of the tree. In the first iteration, two random dataset vectors __a__ and __b__ are selected, and the full hyperspace is split along a hyperplane equidistant from both __a__ and __b__. Vectors which lie in the "left" half of the hyperspace get assigned to the left half of the tree, while vectors which lie in the "right" half of the hyperspace are assigned to the right half of the tree. Note that this can be done without actually computing the hyperplane itself - for every dataset vector, we simply need to determine whether __a__ (left) or __b__ (right) is closer.
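
To make that assignment rule concrete, here's a toy 2D example (the specific numbers are mine, purely for illustration): given split vectors __a__ and __b__, a dataset vector goes to whichever side's split vector is closer.

```python
>>> import numpy as np
>>> a, b = np.array([0.0, 0.0]), np.array([4.0, 0.0])  # two randomly selected split vectors
>>> v = np.array([1.0, 3.0])                           # a dataset vector we want to assign
>>> np.linalg.norm(v - a) < np.linalg.norm(v - b)      # closer to `a`, so `v` goes to the left subtree
True
```
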

*After the first, second, and Nth iteration, respectively. Source.*

The second iteration repeats this process for both left and right subtrees output by the first iteration, resulting in a tree with a depth of two and four leaf nodes. This continues for the third iteration, fourth iteration, and so on - all the way until a leaf node has fewer than some pre-defined number of elements `K`. In the [original Annoy implementation](https://github.com/spotify/annoy/blob/master/src/annoylib.h#L892), `K` is a variable value that can be set by the user.

With an index fully built, we can now move on to querying. Given some query vector __q__, we can perform a search simply by traversing the tree. Each intermediate node is split by a hyperplane, and we can figure out which side of the hyperplane the query vector falls on by computing the distance to the left and right vectors. We'll continue to do this until we hit a leaf node. The leaf node will contain an array of at most `K` vectors, which we can then rank and return to the user.

## Implementing Annoy

Now that we know how Annoy works, let's get started with an implementation. As usual, we'll first create a dataset of (128 dimensional) vectors:

```python
>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))
```

Let's first define a `Node` class containing left and right subtrees:

```python
class Node(object):
    """A single node in the tree, holding the vectors assigned to it along
    with optional left and right subtrees."""

    def __init__(self, vecs=None):
        self._vecs = vecs if vecs is not None else []
        self._left = None
        self._right = None

    @property
    def vecs(self):
        return self._vecs

    @property
    def left(self):
        return self._left

    @left.setter
    def left(self, node):
        self._left = node

    @property
    def right(self):
        return self._right

    @right.setter
    def right(self, node):
        self._right = node
```

The `vecs` variable contains a list of all vectors that are contained within the node. If the length of this list is less than some value `K`, then they will remain as-is; otherwise, these vectors will get propagated to `left` and `right`, with `vecs[0]` and `vecs[1]` remaining as the two randomly selected vectors used to define the splitting hyperplane.

Let's now move to indexing. Recall first that every node in the tree is split by a hyperplane orthogonal to the line which connects two randomly selected dataset vectors. Conveniently for us, we can figure out which side of the hyperplane a query vector lies on simply by computing distance. As usual, we'll use numpy's vectorized math for this:

```python
def _is_query_in_left_half(q, node):
    # returns `True` if the query vector resides in the left half
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    return dist_l < dist_r
```

Now let's move to building the actual tree.

```python
import random


def _split_node(node, K=64, imb=0.95):

    # stopping condition: maximum # of vectors for a leaf node
    if len(node.vecs) <= K:
        return

    for n in range(5):

        # each attempt begins with a fresh pair of child nodes
        node.left = Node()
        node.right = Node()

        # take two random indexes and swap them to [0] and [1]
        idxs = random.sample(range(len(node.vecs)), 2)
        (node.vecs[0], node.vecs[idxs[0]]) = (node.vecs[idxs[0]], node.vecs[0])
        (node.vecs[1], node.vecs[idxs[1]]) = (node.vecs[idxs[1]], node.vecs[1])

        # split vectors into halves
        for vec in node.vecs:
            if _is_query_in_left_half(vec, node):
                node.left.vecs.append(vec)
            else:
                node.right.vecs.append(vec)

        # redo the split if the imbalance is too high
        rat = len(node.left.vecs) / len(node.vecs)
        if rat > imb or rat < (1 - imb):
            continue

        # we're done; remove vectors from the input-level node
        # the first two vectors correspond to `left` and `right`, respectively
        del node.vecs[2:]
        return


def _build_tree(node, K, imb):

    _split_node(node, K=K, imb=imb)
    if node.left and node.right:
        _build_tree(node.left, K=K, imb=imb)
        _build_tree(node.right, K=K, imb=imb)


def build_tree(vecs, K=64, imb=0.95):

    # copy the dataset into a list so that we can append and delete elements
    root = Node(list(vecs))
    _build_tree(root, K=K, imb=imb)
    return root
```

This is a denser block of code, so let's walk through it step-by-step. Given an already-initialized `Node`, we first randomly select two vectors and split the dataset into left and right halves. We then use the function we defined earlier to determine which of the two halves each vector belongs to. Note that we've added an `imb` parameter to maintain tree balance - if one side of the tree contains more than 95% of all the vectors, we redo the split process.

With node splitting in place, the `build_tree` function will simply call itself recursively on all nodes. Leaf nodes are defined as those which contain no more than `K` vectors.

Great, so we've built a binary tree that lets us significantly reduce the scope of our search. Now let's implement querying as well. Querying is fairly straightforward; we simply traverse the tree, continuously moving along the left or right branches until we've arrived at the leaf node we're interested in:

```python
def query_tree(q, root):

    node = root

    # traverse down to a leaf node
    while node.left and node.right:
        # iteratively determine whether the right or left half is closer
        if _is_query_in_left_half(q, node):
            node = node.left
        else:
            node = node.right

    # find nearest neighbor in leaf node
    (nn, m_dist) = (None, float("inf"))
    for v in node.vecs:
        dist = np.linalg.norm(v - q)
        if dist < m_dist:
            (nn, m_dist) = (v, dist)

    return nn
```

This chunk of code will greedily traverse the tree, returning a single nearest neighbor (`nq = 1`). Recall, however, that we're oftentimes interested in finding multiple nearest neighbors. Additionally, it's entirely possible for some of the true nearest neighbors to live in other leaf nodes as well. How can we solve these issues?

## Run, forest, run
Duration: 1

(Yes, I do realize that the main character's name is spelled "Forrest" in the [American classic](https://en.wikipedia.org/wiki/Forrest_Gump).)

In a previous tutorial on IVF, recall that we often expanded our search beyond the Voronoi cell closest to the query vector. The reason is due to _cell edges_ - if a query vector is close to a cell edge, it's very likely that some of its nearest neighbors may be in a neighboring cell.
In high-dimensional spaces, these "edges" are much more common, so a large-ish value of `nprobe` is often used when high recall is needed.

For tree-based indexes, we face the same problem - some of our nearest neighbors may be outside of the nearest leaf node/polygon. Annoy solves this by 1) allowing for searches on both sides of a split, and 2) creating a _forest_ of trees.

Let's first expand on our implementation in the previous section to search both sides of a split. For reference, here's (roughly) how the original C++ implementation in Annoy's `annoylib.h` handles it, using a priority queue of tree nodes and pushing both children whenever the query is close to the splitting hyperplane:

```cpp
std::vector<S> nns;
while (nns.size() < (size_t)search_k && !q.empty()) {
  const pair<T, S>& top = q.top();
  T d = top.first;
  S i = top.second;
  Node* nd = _get(i);
  q.pop();
  if (nd->n_descendants == 1 && i < _n_items) {
    nns.push_back(i);
  } else if (nd->n_descendants <= _K) {
    const S* dst = nd->children;
    nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
  } else {
    T margin = D::margin(nd, v, _f);
    q.push(make_pair(D::pq_distance(d, margin, 1), static_cast<S>(nd->children[1])));
    q.push(make_pair(D::pq_distance(d, margin, 0), static_cast<S>(nd->children[0])));
  }
}

// Get distances for all items
// To avoid calculating distance multiple times for any items, sort by id
std::sort(nns.begin(), nns.end());
vector<pair<T, S> > nns_dist;
S last = -1;
for (size_t i = 0; i < nns.size(); i++) {
  S j = nns[i];
  if (j == last)
    continue;
  last = j;
  if (_get(j)->n_descendants == 1)  // This is only to guard a really obscure case, #284
    nns_dist.push_back(make_pair(D::distance(v_node, _get(j), _f), j));
}

size_t m = nns_dist.size();
size_t p = n < m ? n : m;  // Return this many items
std::partial_sort(nns_dist.begin(), nns_dist.begin() + p, nns_dist.end());
for (size_t i = 0; i < p; i++) {
  if (distances)
    distances->push_back(D::normalized_distance(nns_dist[i].first));
  result->push_back(nns_dist[i].second);
}
```

Our simplified Python version only needs two small changes - a helper which can return both halves of a split, and a query function which keeps a stack of nodes left to visit:

```python
def _select_nearby(q, node, thresh=0):
    # functions identically to _is_query_in_left_half, but can return both halves
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    if abs(dist_l - dist_r) < thresh:
        return (node.left, node.right)
    if dist_l < dist_r:
        return (node.left,)
    return (node.right,)


def query_tree(q, root, thresh=0.5):

    # traverse the tree, descending into both halves whenever the query
    # vector is roughly equidistant from the two split vectors
    nodes = [root]
    leaf_vecs = []
    while nodes:
        node = nodes.pop()
        if node.left and node.right:
            nodes.extend(_select_nearby(q, node, thresh=thresh))
        else:
            leaf_vecs.extend(node.vecs)

    # find the nearest neighbor among all candidate leaf vectors
    (nn, m_dist) = (None, float("inf"))
    for v in leaf_vecs:
        dist = np.linalg.norm(v - q)
        if dist < m_dist:
            (nn, m_dist) = (v, dist)

    return nn
```

The `thresh` parameter controls how close to a splitting hyperplane the query needs to be before we descend into both sides.

Next, we'll add a function to allow us to build the full index as a forest of trees:

```python
def build_index(vecs, nt=8, K=64, imb=0.95):
    return [build_tree(vecs, K=K, imb=imb) for _ in range(nt)]
```

With everything implemented, all that's left is to put it all together into a single index class, as we've done for IVF, SQ, PQ, and HNSW - a rough sketch of one way to do this is included in the wrap-up below. And that's it for Annoy!

## Wrapping up
Duration: 1

In this tutorial, we did a deep dive into Annoy, a tree-based indexing strategy with a playful name. As mentioned in our previous tutorial, Python is not the most ideal language for implementing vector search data structures due to interpreter overhead, but we nonetheless try to make use of as much numpy-based array math as possible. There are also many optimizations that we can do to prevent copying memory back and forth, but I'll leave those (once again) as an exercise for the reader :sunny:.
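
As promised, here's a minimal sketch of how the pieces above might be tied together. Note that the class name `AnnoyForest` and its `create`/`search` interface are my own choices for illustration (loosely mirroring the class from the HNSW tutorial) - they are not the implementation from the companion repository:

```python
class AnnoyForest(object):
    """Toy forest-of-trees index built from the helper functions above."""

    def __init__(self, nt=8, K=64, imb=0.95):
        self._nt = nt       # number of trees in the forest
        self._K = K         # maximum number of vectors per leaf node
        self._imb = imb     # maximum allowed imbalance per split
        self._forest = []

    def create(self, dataset):
        # build `nt` independent trees over the same dataset
        self._forest = build_index(dataset, nt=self._nt, K=self._K, imb=self._imb)

    def search(self, q, thresh=0.5):
        # query every tree and keep the closest candidate overall
        (best, best_dist) = (None, float("inf"))
        for root in self._forest:
            nn = query_tree(q, root, thresh=thresh)
            if nn is None:
                continue
            dist = np.linalg.norm(nn - q)
            if dist < best_dist:
                (best, best_dist) = (nn, dist)
        return best
```

Calling `create(dataset)` builds the forest, and `search(query)` returns the best candidate found across all trees; extending this to return the top `k` candidates per tree is a straightforward modification.
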
In the next tutorial, we'll continue our deep dive into indexing strategies with a rundown of the Vamana algorithm - more commonly known as _DiskANN_ - a unique graph-based indexing algorithm that is tailored specifically towards querying directly from solid state drives.

All code for this tutorial is freely available on Github: https://github.com/fzliu/vector-search.

diff --git a/codelabs/get-started-with-vector-db-09/index.md b/codelabs/get-started-with-vector-db-09/index.md
new file mode 100644
index 0000000..01a1380
--- /dev/null
+++ b/codelabs/get-started-with-vector-db-09/index.md
@@ -0,0 +1,45 @@
summary: A deep dive into DiskANN and the Vamana algorithm
id: vector-database-101-diskann-and-the-vamana-algorithm
categories: Getting Started
tags: getting-started
status: Hidden
authors: Frank Liu
Feedback Link: https://github.com/milvus-io/milvus

---

# Vector Database 101 - DiskANN and the Vamana Algorithm

## Introduction
Duration: 1

Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we did a deep dive into Approximate Nearest Neighbors Oh Yeah, or Annoy for short. Annoy is a tree-based indexing algorithm that uses random projections to recursively split the vector space into smaller and smaller partitions. Although Annoy isn't as commonly used in production vector databases today, it remains a great illustration of how tree-based indexes work.

In this tutorial, we'll talk about _DiskANN_ - a disk-based index that is meant to enable storing and searching across billions of vectors on a single machine. Unlike previous tutorials, there won't be a Python implementation, but we'll still discuss the algorithm along with how it works.

*Annoy, visualized (from https://github.com/spotify/annoy).*

## DiskANN overview
Duration: 3

*Description*

## The Vamana algorithm
Duration: 2

## Running on-disk
Duration: 2

## Wrapping up
Duration: 1

In this tutorial, we did a deep dive into DiskANN, a graph-based indexing strategy tailored towards running directly from solid state drives.

This concludes our deep dive into DiskANN and the Vamana algorithm.

diff --git a/codelabs/get-started-with-vector-db-10/get-started-with-vector-db-10.md b/codelabs/get-started-with-vector-db-10/get-started-with-vector-db-10.md
new file mode 100644
index 0000000..dc5ea5b
--- /dev/null
+++ b/codelabs/get-started-with-vector-db-10/get-started-with-vector-db-10.md
@@ -0,0 +1,84 @@
summary: A high-level guide on how to choose the right vector index for your application.
id: vector-database-101-choosing-the-right-vector-index
categories: Getting Started
tags: getting-started
status: Hidden
authors: Frank Liu
Feedback Link: https://github.com/milvus-io/milvus

---

# Vector Database 101 - Choosing the Right Vector Index

## A quick recap
Duration: 3

In our Vector Database 101 series, we've learned that vector databases are purpose-built pieces of infrastructure meant to conduct _approximate nearest neighbor search_ across large datasets of high-dimensional vectors (typically over 96 dimensions and sometimes over 10k). These vectors are meant to represent the semantics of _unstructured data_, i.e. data that cannot fit into traditional databases such as relational databases, wide-column stores, or document databases.

Conducting efficient approximate nearest neighbor search requires a data structure known as a _vector index_. These indexes enable efficient traversal of the entire database; rather than having to compare the query against every single vector, we can restrict the search to a small fraction of the candidates.

In the past several posts, we went over a variety of in-memory vector search algorithms and indexing strategies available to you on your vector search journey. For those who missed out, here's a list and quick summary of each:

- Brute-force search (`FLAT`)

  Brute-force search, also known as "flat" indexing, is an approach that compares the query vector with every other vector in the database. While it may seem naive and inefficient, flat indexing can yield surprisingly good results for small datasets, especially when parallelized with accelerators like GPUs or FPGAs.

- Inverted file index (`IVF`)

  IVF is a partition-based indexing strategy that assigns all database vectors to the partition with the closest centroid. Cluster centroids are determined using unsupervised clustering (typically k-means). With the centroids and assignments in place, an inverted index is created, correlating each centroid with a list of vectors in its cluster. IVF is generally a solid choice for small- to medium-size datasets.

- Scalar quantization (`SQ`)

  Scalar quantization converts floating point vectors (typically `float32` or `float64`) into integer vectors by dividing each dimension into bins. The process involves determining maximum and minimum values of each dimension, calculating start values and step sizes, and performing quantization by subtracting start values and dividing by step sizes.
The quantized dataset typically uses 8-bit unsigned integers, but lower bit widths (5-bit, 4-bit, and even 2-bit) are common as well.

- Product quantization (`PQ`)

  Scalar quantization disregards the distribution along each vector dimension, which can potentially lead to underutilized bins. Product quantization (PQ) is a more powerful alternative which performs both compression and dimensionality reduction: high-dimensional vectors are mapped to low-dimensional quantized vectors by assigning fixed-length chunks of the original vector to a single quantized value. `PQ` typically involves splitting each vector into subvectors, applying k-means clustering across all subvectors at each position, and replacing each subvector with the index of its nearest centroid.

- Hierarchical Navigable Small Worlds (`HNSW`)

  HNSW is perhaps the most commonly used vector indexing strategy today. It combines two concepts: skip lists and Navigable Small Worlds (NSWs). Skip lists are effectively layered linked lists for faster random access (`O(log n)` for skip lists vs `O(n)` for linked lists). In HNSW, we create a hierarchical graph of NSWs. Searching in HNSW involves starting at the top layer and moving towards the nearest neighbor in each layer until we find the closest match. Inserts work by finding the insertion layer, locating the nearest neighbors in each layer below it, and adding connections to them.

- Approximate Nearest Neighbors Oh Yeah (`Annoy`)

  `Annoy` is a tree-based index that uses binary search trees as its core data structure. It partitions the vector space recursively to create a binary tree, where each node is split by a hyperplane equidistant from two randomly selected dataset vectors. The splitting process continues until leaf nodes have fewer than a predefined number of elements. Querying simply involves iteratively traversing the tree, determining at each node which side of the hyperplane the query vector falls on.

Don't worry if some of these summaries feel a bit obtuse. Vector search algorithms can be fairly complex, but are often easier to explain with visualizations and a bit of code. If you're interested in any of these, feel free to click through to the original article explaining each algorithm/index in detail.

## Picking a vector index
Duration: 2

So how exactly do we choose the right vector index? This is a fairly open-ended question, but one of the key principles to keep in mind is that the right index will depend on your application requirements. For example: are you primarily interested in query speed (with a static database), or will your application require a lot of inserts and deletes? Do you have any constraints on the machine type you're using, such as limited memory or limited CPU? Or perhaps the domain of data that you'll be inserting will change over time? All of these factors contribute to the most optimal index type to use.

Let's first go over a simple index selection flowchart.

PLEASE INSERT A DIAGRAM HERE

__100% recall__: This one is fairly simple - use `FLAT` search if you need 100% accuracy. All efficient data structures for vector search perform _approximate_ nearest neighbor search, meaning that there's going to be a loss of recall once the index size hits a certain threshold.

__`index_size` < 10MB__: If your total index size is tiny (fewer than 5k 512-dimensional `float32` vectors), just use `FLAT` search. The overhead associated with index building, maintenance, and querying is simply not worth it for a tiny dataset.
__10MB < `index_size` < 2GB__: If your total index size is small (fewer than 1M 512-dimensional `float32` vectors), my personal recommendation is to go with a standard inverted-file index (e.g. `IVF`). An inverted-file index can reduce the search scope by around an order of magnitude while still maintaining fairly high recall.

__2GB < `index_size` < 20GB__: Once you reach a mid-size index (fewer than 10M 512-dimensional `float32` vectors), you'll want to start considering other index types such as `PQ` and `HNSW`. Both will give you reasonable query speed and throughput, but `PQ` allows you to use significantly less memory at the cost of lower recall, while `HNSW` often gives you 95%+ recall at the expense of high memory usage - around 1.5x the total size of your index. For dataset sizes in this range, composite `IVF` indexes (`IVF_SQ`, `IVF_PQ`) can also work well, but I would use them only if you have limited compute resources.

__20GB < `index_size` < 200GB__: For large datasets (fewer than 100M 512-dimensional `float32` vectors), I recommend the use of _composite indexes_: `IVF_PQ` for memory-constrained applications and `HNSW_SQ` for applications that require high recall. We mentioned this very briefly in a prior post, but as a quick recap, a composite index refers to an indexing technique that combines multiple vector search strategies into a single index. This effectively combines the best of both indexes; `HNSW_SQ`, for example, retains most of `HNSW`'s base query speed and throughput but with a significantly reduced index size. We won't dive too deep into composite indexes here, but for those interested, [FAISS's documentation](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes-(composite)) provides a great overview.

One last note on Annoy - we don't recommend using it, simply because it fits into a similar category as HNSW but is, generally speaking, not as performant. Annoy is certainly the most uniquely named index though, so it gets bonus points there.

## A word on disk indexes
Duration: 1

Another option we haven't delved into explicitly in this blog post is disk-based indexes. In a nutshell, disk-based indexes leverage the architecture of NVMe disks by colocating individual search subspaces into their own NVMe pages. In conjunction with near-zero seek latency, this enables efficient storage of both graph- and tree-based vector indexes.

These index types are becoming increasingly popular since they enable the storage and search of billions of vectors on a single machine while still maintaining a reasonable level of performance. The downside to disk-based indexes should be obvious as well: because disk reads are significantly slower than RAM reads, disk-based indexes often experience increased query latencies, sometimes by over 10x! If you are willing to sacrifice some latency and throughput for the ability to store billions of vectors at minimal cost, disk-based indexes are the way to go. Conversely, if your application requires high performance (often at the expense of increased compute costs), you'll want to stick with `IVF_PQ` or `HNSW_SQ`.

## Wrapping up
Duration: 1

In this tutorial, we did a quick recap of some of the vector indexing strategies available to you, in addition to providing a simple flowchart to help determine the optimal strategy given your data size and compute limitations. Please note that this flowchart is a very general guideline and not a hard-and-fast rule.
Ultimately, you'll need to understand the strengths and weaknesses of each indexing option, as well as whether a composite index can help you squeeze out the last bit of performance your application needs. All of these index types are freely available to you in Milvus, so go out there and experiment as you see fit!

Although this concludes our mini-series on vector indexes, our Vector Database 101 series will continue. In the next couple of articles, we'll go over some common applications and usage patterns.
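
P.S. - if you'd like to try the flowchart above on your own data, switching between index types in Milvus mostly comes down to changing the index parameters. Here's a rough sketch using `pymilvus` (the collection name, field name, and parameter values are hypothetical starting points, not recommendations):

```python
from pymilvus import Collection, connections

# connect to a locally running Milvus instance
connections.connect(alias="default", host="localhost", port="19530")

# "example_collection" is assumed to already exist, with a float vector
# field named "embedding"
collection = Collection("example_collection")

# a few candidate index configurations from the flowchart above
index_options = {
    "tiny":  {"index_type": "FLAT", "metric_type": "L2"},
    "mid":   {"index_type": "HNSW", "metric_type": "L2",
              "params": {"M": 16, "efConstruction": 128}},
    "large": {"index_type": "IVF_PQ", "metric_type": "L2",
              "params": {"nlist": 1024, "m": 32, "nbits": 8}},
}

# pick one and build the index on the vector field
collection.create_index(field_name="embedding", index_params=index_options["mid"])
```
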