Merge pull request #33 from zilliztech/refactor_2

Refactor and add latest blogs
zilliztech · Oct 31, 2023 · 09c4038 · 09c4038
2 parents 9c51578 + 27e414c
commit 09c4038
Show file tree

Hide file tree

Showing 23 changed files with 444 additions and 25 deletions.
diff --git a/...abs/get-started-with-vector-db-0/index.md → ...bs/get-started-with-vector-db-00/index.md b/...abs/get-started-with-vector-db-0/index.md → ...bs/get-started-with-vector-db-00/index.md
diff --git a/...-vector-db-0/pic/embedding_arithmetic.jpg → ...vector-db-00/pic/embedding_arithmetic.jpg b/...-vector-db-0/pic/embedding_arithmetic.jpg → ...vector-db-00/pic/embedding_arithmetic.jpg
diff --git a/...or-db-0/pic/nearest_neighbors_example.jpg → ...r-db-00/pic/nearest_neighbors_example.jpg b/...or-db-0/pic/nearest_neighbors_example.jpg → ...r-db-00/pic/nearest_neighbors_example.jpg
diff --git a/.../pic/photo-1617739680032-8ab04113d6dc.png → .../pic/photo-1617739680032-8ab04113d6dc.png b/.../pic/photo-1617739680032-8ab04113d6dc.png → .../pic/photo-1617739680032-8ab04113d6dc.png
diff --git a/...s/get-started-with-vector-db-0/slides.key → .../get-started-with-vector-db-00/slides.key b/...s/get-started-with-vector-db-0/slides.key → .../get-started-with-vector-db-00/slides.key
diff --git a/...abs/get-started-with-vector-db-1/index.md → ...bs/get-started-with-vector-db-01/index.md b/...abs/get-started-with-vector-db-1/index.md → ...bs/get-started-with-vector-db-01/index.md
diff --git a/...-vector-db-1/pic/architecture_diagram.png → ...vector-db-01/pic/architecture_diagram.png b/...-vector-db-1/pic/architecture_diagram.png → ...vector-db-01/pic/architecture_diagram.png
diff --git a/...abs/get-started-with-vector-db-2/index.md → ...bs/get-started-with-vector-db-02/index.md b/...abs/get-started-with-vector-db-2/index.md → ...bs/get-started-with-vector-db-02/index.md
diff --git a/...-vector-db-2/pic/architecture_diagram.png → ...vector-db-02/pic/architecture_diagram.png b/...-vector-db-2/pic/architecture_diagram.png → ...vector-db-02/pic/architecture_diagram.png
diff --git a/...abs/get-started-with-vector-db-3/index.md → ...bs/get-started-with-vector-db-03/index.md b/...abs/get-started-with-vector-db-3/index.md → ...bs/get-started-with-vector-db-03/index.md
diff --git a/...abs/get-started-with-vector-db-4/index.md → ...bs/get-started-with-vector-db-04/index.md b/...abs/get-started-with-vector-db-4/index.md → ...bs/get-started-with-vector-db-04/index.md
diff --git a/...-with-vector-db-4/pic/hnsw_visualized.jpg → ...with-vector-db-04/pic/hnsw_visualized.jpg b/...-with-vector-db-4/pic/hnsw_visualized.jpg → ...with-vector-db-04/pic/hnsw_visualized.jpg
diff --git a/...abs/get-started-with-vector-db-5/index.md → ...bs/get-started-with-vector-db-05/index.md b/...abs/get-started-with-vector-db-5/index.md → ...bs/get-started-with-vector-db-05/index.md
diff --git a/...-with-vector-db-5/pic/voronoi_diagram.png → ...with-vector-db-05/pic/voronoi_diagram.png b/...-with-vector-db-5/pic/voronoi_diagram.png → ...with-vector-db-05/pic/voronoi_diagram.png
diff --git a/...abs/get-started-with-vector-db-6/index.md → ...bs/get-started-with-vector-db-06/index.md b/...abs/get-started-with-vector-db-6/index.md → ...bs/get-started-with-vector-db-06/index.md
@@ -225,7 +225,7 @@ import numpy as np
 from scipy.cluster.vq import kmeans2
 
 
-class ProductQuantizer:
+class ProductQuantizer(object):
 
     def __init__(self, M=16, K=256):
         self.M = 16

diff --git a/...-vector-db-6/pic/product_quantization.png → ...vector-db-06/pic/product_quantization.png b/...-vector-db-6/pic/product_quantization.png → ...vector-db-06/pic/product_quantization.png
diff --git a/...abs/get-started-with-vector-db-7/index.md → ...bs/get-started-with-vector-db-07/index.md b/...abs/get-started-with-vector-db-7/index.md → ...bs/get-started-with-vector-db-07/index.md
@@ -1,5 +1,5 @@
 summary: A deep dive into Hierarchical Navigable Small Worlds (HNSW)
-id: vector-database-101-scalar-quantization-and-product-quantization
+id: vector-database-101-hierarchical-navigable-small-worlds
 categories: Getting Started
 tags: getting-started
 status: Hidden
@@ -13,7 +13,7 @@ Feedback Link: https://github.com/milvus-io/milvus
 ## Introduction
 Duration: 1
 
-Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we 
+Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we took a look at scalar quantization and product quantization - two indexing strategies which are used to reduce the overall _size_ of the database without reducing the scope of our search. To better illustrate how scalar quantization and product quantization works, we also implemented our own versions in Python.
 
 In this tutorial, we'll build on top of that knowledge by looking at what is perhaps the most commonly used primary algorithm today: Hierarchical Navigable Small Worlds (HNSW). HNSW performs very well when it comes to both speed and accuracy, making it an incredibly robust vector search algorithm. Despite it being popular, understanding HNSW can be a bit tricky, but don't fret - in the next couple of sections, we'll break down HNSW into its individual steps, developing our own simple implementation along the way.
 
@@ -74,7 +74,7 @@ As with the skip list, the query vector will appear in upper layers with exponen
 ## Implementing HNSW
 Duration: 8
 
-HNSW is not trivial to implement, so we'll implement only a very basic version here. As usual, let's start with a dataset and a query vector:
+HNSW is not trivial to implement, so we'll implement only a very basic version here. As usual, let's start with creating a dataset of (128 dimensional) vectors:
 
 ```python
 >>> import numpy as np
@@ -111,7 +111,7 @@ def _search_layer(graph, entry, query, ef=1):
 
         # loop through all nearest neighbors to the candidate vector
         for e in graph[cv[1]][1]:
-            d = np.linalg.norm(graph[cv][0] - query)
+            d = np.linalg.norm(graph[e][0] - query)
             if (d, e) not in visit:
                 visit.add((d, e))
 
@@ -140,7 +140,7 @@ def search(index, query, ef=1):
 
     best_v = 0  # set the initial best vertex to the entry point
     for graph in index:
-        (best_v, best_d) = _search_layer(graph, best_v, query, ef=1)[0]
+        best_d, best_v = _search_layer(graph, best_v, query, ef=1)[0]
         if graph[best_v][2]:
             best_v = graph[best_v][2]
         else:
@@ -161,7 +161,7 @@ def _get_insert_layer(L, mL):
 With everything in place, we can now implement the insertion function.
 
 ```python
-def insert(index, vec, L=5, efc=10):
+def insert(self, vec, efc=10):
 
     # if the index is empty, insert the vector into all layers and return
     if not index[0]:
@@ -172,19 +172,19 @@ def insert(index, vec, L=5, efc=10):
         return
 
     l = _get_insert_layer(1/np.log(L))
-    start_v = 0
 
+    start_v = 0
     for n, graph in enumerate(index):
 
         # perform insertion for layers [l, L) only
         if n < l:
-            start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0]
+            _, start_v = _search_layer(graph, start_v, vec, ef=1)[0]
         else:
-            node = (vec, [], len(index[n+1]) if n < L-1 else None)
+            node = (vec, [], len(_index[n+1]) if n < L-1 else None)
             nns = _search_layer(graph, start_v, vec, ef=efc)
             for nn in nns:
-                node.append(nn[1])  # outbound connections to NNs
-                graph[nn[1]].append(len(graph))  # inbound connections to node
+                node[1].append(nn[1])  # outbound connections to NNs
+                graph[nn[1]][1].append(len(graph))  # inbound connections to node
             graph.append(node)
 
         # set the starting vertex to the nearest neighbor in the next layer
@@ -193,17 +193,20 @@ def insert(index, vec, L=5, efc=10):
 
 If the index is empty, we'll insert `vec` into all layers and return immediately. This serves to initialize the index and allow for successful insertions later. If the index has already been populated, we begin insertion by first computing the insertion layer via the `get_insert_layer` function we implemented in the previous step. From there, we find the nearest neighbor to the vector in the uppermost graph. This process continues for the layers below it until we reach layer `l`, the insertion layer.
 
-For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader ().
+For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader :sunny:.
 
 We now have both search (query) and insert functionality complete. Let's combine everything together in a class:
 
 ```python
 from bisect import insort
 from heapq import heapify, heappop, heappush
+
 import numpy as np
 
+from ._base import _BaseIndex
 
-class HNSW:
+
+class HNSW(_BaseIndex):
 
     def __init__(self, L=5, mL=0.62, efc=10):
         self._L = L
@@ -230,7 +233,7 @@ class HNSW:
 
             # loop through all nearest neighbors to the candidate vector
             for e in graph[cv[1]][1]:
-                d = np.linalg.norm(graph[cv][0] - query)
+                d = np.linalg.norm(graph[e][0] - query)
                 if (d, e) not in visit:
                     visit.add((d, e))
 
@@ -243,6 +246,10 @@ class HNSW:
 
         return nns
 
+    def create(self, dataset):
+        for v in dataset:
+            self.insert(v)
+
     def search(self, query, ef=1):
 
         # if the index is empty, return an empty list
@@ -251,41 +258,41 @@ class HNSW:
 
         best_v = 0  # set the initial best vertex to the entry point
         for graph in self._index:
-            best_v, best_d = HNSW._search_layer(graph, best_v, query, ef=1)[0]
+            best_d, best_v = HNSW._search_layer(graph, best_v, query, ef=1)[0]
             if graph[best_v][2]:
                 best_v = graph[best_v][2]
             else:
                 return HNSW._search_layer(graph, best_v, query, ef=ef)
 
     def _get_insert_layer(self):
-        # ml is a multiplicative factor used to normalized the distribution
+        # ml is a multiplicative factor used to normalize the distribution
         l = -int(np.log(np.random.random()) * self._mL)
         return min(l, self._L-1)
 
     def insert(self, vec, efc=10):
 
         # if the index is empty, insert the vector into all layers and return
-        if not index[0]:
+        if not self._index[0]:
             i = None
-            for graph in index[::-1]:
+            for graph in self._index[::-1]:
                 graph.append((vec, [], i))
                 i = 0
             return
 
         l = self._get_insert_layer()
 
         start_v = 0
-        for n, graph in enumerate(index):
+        for n, graph in enumerate(self._index):
 
             # perform insertion for layers [l, L) only
             if n < l:
-                start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0]
+                _, start_v = self._search_layer(graph, start_v, vec, ef=1)[0]
             else:
-                node = (vec, [], len(index[n+1]) if n < self._L-1 else None)
-                nns = _search_layer(graph, start_v, vec, ef=efc)
+                node = (vec, [], len(self._index[n+1]) if n < self._L-1 else None)
+                nns = self._search_layer(graph, start_v, vec, ef=efc)
                 for nn in nns:
-                    node.append(nn[1])  # outbound connections to NNs
-                    graph[nn[1]].append(len(graph))  # inbound connections to node
+                    node[1].append(nn[1])  # outbound connections to NNs
+                    graph[nn[1]][1].append(len(graph))  # inbound connections to node
                 graph.append(node)
 
             # set the starting vertex to the nearest neighbor in the next layer

diff --git a/codelabs/get-started-with-vector-db-07/pic/better_graph.png b/codelabs/get-started-with-vector-db-07/pic/better_graph.png
diff --git a/codelabs/get-started-with-vector-db-07/pic/hnsw_visualized.jpg b/codelabs/get-started-with-vector-db-07/pic/hnsw_visualized.jpg
diff --git a/codelabs/get-started-with-vector-db-07/pic/skip_list.png b/codelabs/get-started-with-vector-db-07/pic/skip_list.png