Skip to content

Commit

Permalink
Merge pull request #33 from zilliztech/refactor_2
Browse files Browse the repository at this point in the history
Refactor and add latest blogs
  • Loading branch information
shanghaikid authored Oct 31, 2023
2 parents 9c51578 + 27e414c commit 09c4038
Show file tree
Hide file tree
Showing 23 changed files with 444 additions and 25 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -225,7 +225,7 @@ import numpy as np
from scipy.cluster.vq import kmeans2


class ProductQuantizer:
class ProductQuantizer(object):

def __init__(self, M=16, K=256):
self.M = 16
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
summary: A deep dive into Hierarchical Navigable Small Worlds (HNSW)
id: vector-database-101-scalar-quantization-and-product-quantization
id: vector-database-101-hierarchical-navigable-small-worlds
categories: Getting Started
tags: getting-started
status: Hidden
Expand All @@ -13,7 +13,7 @@ Feedback Link: https://github.com/milvus-io/milvus
## Introduction
Duration: 1

Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we
Hey there - welcome back to [Milvus tutorials](https://codelabs.milvus.io/). In the previous tutorial, we took a look at scalar quantization and product quantization - two indexing strategies which are used to reduce the overall _size_ of the database without reducing the scope of our search. To better illustrate how scalar quantization and product quantization works, we also implemented our own versions in Python.

In this tutorial, we'll build on top of that knowledge by looking at what is perhaps the most commonly used primary algorithm today: Hierarchical Navigable Small Worlds (HNSW). HNSW performs very well when it comes to both speed and accuracy, making it an incredibly robust vector search algorithm. Despite it being popular, understanding HNSW can be a bit tricky, but don't fret - in the next couple of sections, we'll break down HNSW into its individual steps, developing our own simple implementation along the way.

Expand Down Expand Up @@ -74,7 +74,7 @@ As with the skip list, the query vector will appear in upper layers with exponen
## Implementing HNSW
Duration: 8

HNSW is not trivial to implement, so we'll implement only a very basic version here. As usual, let's start with a dataset and a query vector:
HNSW is not trivial to implement, so we'll implement only a very basic version here. As usual, let's start with creating a dataset of (128 dimensional) vectors:

```python
>>> import numpy as np
Expand Down Expand Up @@ -111,7 +111,7 @@ def _search_layer(graph, entry, query, ef=1):

# loop through all nearest neighbors to the candidate vector
for e in graph[cv[1]][1]:
d = np.linalg.norm(graph[cv][0] - query)
d = np.linalg.norm(graph[e][0] - query)
if (d, e) not in visit:
visit.add((d, e))

Expand Down Expand Up @@ -140,7 +140,7 @@ def search(index, query, ef=1):

best_v = 0 # set the initial best vertex to the entry point
for graph in index:
(best_v, best_d) = _search_layer(graph, best_v, query, ef=1)[0]
best_d, best_v = _search_layer(graph, best_v, query, ef=1)[0]
if graph[best_v][2]:
best_v = graph[best_v][2]
else:
Expand All @@ -161,7 +161,7 @@ def _get_insert_layer(L, mL):
With everything in place, we can now implement the insertion function.

```python
def insert(index, vec, L=5, efc=10):
def insert(self, vec, efc=10):

# if the index is empty, insert the vector into all layers and return
if not index[0]:
Expand All @@ -172,19 +172,19 @@ def insert(index, vec, L=5, efc=10):
return

l = _get_insert_layer(1/np.log(L))
start_v = 0

start_v = 0
for n, graph in enumerate(index):

# perform insertion for layers [l, L) only
if n < l:
start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0]
_, start_v = _search_layer(graph, start_v, vec, ef=1)[0]
else:
node = (vec, [], len(index[n+1]) if n < L-1 else None)
node = (vec, [], len(_index[n+1]) if n < L-1 else None)
nns = _search_layer(graph, start_v, vec, ef=efc)
for nn in nns:
node.append(nn[1]) # outbound connections to NNs
graph[nn[1]].append(len(graph)) # inbound connections to node
node[1].append(nn[1]) # outbound connections to NNs
graph[nn[1]][1].append(len(graph)) # inbound connections to node
graph.append(node)

# set the starting vertex to the nearest neighbor in the next layer
Expand All @@ -193,17 +193,20 @@ def insert(index, vec, L=5, efc=10):

If the index is empty, we'll insert `vec` into all layers and return immediately. This serves to initialize the index and allow for successful insertions later. If the index has already been populated, we begin insertion by first computing the insertion layer via the `get_insert_layer` function we implemented in the previous step. From there, we find the nearest neighbor to the vector in the uppermost graph. This process continues for the layers below it until we reach layer `l`, the insertion layer.

For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader ().
For layer `l` and all those below it, we first find the nearest neighbors to `vec` up to a pre-determined number `ef`. We then create connections from the node to its nearest neighbors and vice versa. Note that a proper implementation should also have a pruning technique to prevent early vectors from being connected to too many others - I'll leave that as an exercise for the reader :sunny:.

We now have both search (query) and insert functionality complete. Let's combine everything together in a class:

```python
from bisect import insort
from heapq import heapify, heappop, heappush

import numpy as np

from ._base import _BaseIndex

class HNSW:

class HNSW(_BaseIndex):

def __init__(self, L=5, mL=0.62, efc=10):
self._L = L
Expand All @@ -230,7 +233,7 @@ class HNSW:

# loop through all nearest neighbors to the candidate vector
for e in graph[cv[1]][1]:
d = np.linalg.norm(graph[cv][0] - query)
d = np.linalg.norm(graph[e][0] - query)
if (d, e) not in visit:
visit.add((d, e))

Expand All @@ -243,6 +246,10 @@ class HNSW:

return nns

def create(self, dataset):
for v in dataset:
self.insert(v)

def search(self, query, ef=1):

# if the index is empty, return an empty list
Expand All @@ -251,41 +258,41 @@ class HNSW:

best_v = 0 # set the initial best vertex to the entry point
for graph in self._index:
best_v, best_d = HNSW._search_layer(graph, best_v, query, ef=1)[0]
best_d, best_v = HNSW._search_layer(graph, best_v, query, ef=1)[0]
if graph[best_v][2]:
best_v = graph[best_v][2]
else:
return HNSW._search_layer(graph, best_v, query, ef=ef)

def _get_insert_layer(self):
# ml is a multiplicative factor used to normalized the distribution
# ml is a multiplicative factor used to normalize the distribution
l = -int(np.log(np.random.random()) * self._mL)
return min(l, self._L-1)

def insert(self, vec, efc=10):

# if the index is empty, insert the vector into all layers and return
if not index[0]:
if not self._index[0]:
i = None
for graph in index[::-1]:
for graph in self._index[::-1]:
graph.append((vec, [], i))
i = 0
return

l = self._get_insert_layer()

start_v = 0
for n, graph in enumerate(index):
for n, graph in enumerate(self._index):

# perform insertion for layers [l, L) only
if n < l:
start_v, _ = _search_layer(graph, start_v, vec, ef=1)[0]
_, start_v = self._search_layer(graph, start_v, vec, ef=1)[0]
else:
node = (vec, [], len(index[n+1]) if n < self._L-1 else None)
nns = _search_layer(graph, start_v, vec, ef=efc)
node = (vec, [], len(self._index[n+1]) if n < self._L-1 else None)
nns = self._search_layer(graph, start_v, vec, ef=efc)
for nn in nns:
node.append(nn[1]) # outbound connections to NNs
graph[nn[1]].append(len(graph)) # inbound connections to node
node[1].append(nn[1]) # outbound connections to NNs
graph[nn[1]][1].append(len(graph)) # inbound connections to node
graph.append(node)

# set the starting vertex to the nearest neighbor in the next layer
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 09c4038

Please sign in to comment.