Hi,

This post is about approximate nearest neighbor (ANN) algorithms. The code for this post is here, where I provide an example of using a framework and a Python implementation. Most Python implementations were written with the help of an LLM. I'm amazed how helpful they are for learning new things. I see them as a drunken professor: with the right approach, a very helpful tool.

As a next step in understanding RAGs, I want to take a closer look at approximate nearest neighbor algorithms. Basically, the purpose is to find the closest vectors to a query vector in a database. Since I'm also interested in the implementation, I mostly follow this amazing blog post. Vector search is the basic component of vector databases and their main purpose. ANN algorithms look for a close match instead of an exact match. This loss of accuracy buys efficiency, which makes it possible to search much bigger datasets, handle high-dimensional data and serve real-time applications.

This improved efficiency is based on indexes, which are pre-processed data structures built from the embeddings. This can also include reducing the dimension of the initial embeddings. There are multiple techniques for indexing the embeddings for fast retrieval. For the retrieval itself, there are multiple ways to calculate the distance between two embeddings (e.g. cosine distance, Euclidean distance). For an extensive list of almost all metrics, please have a look at the NMSLIB manual here.
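
To make the two distance measures mentioned above concrete, here is a small numpy sketch (the vectors are just made-up examples):

```python
import numpy as np

def euclidean_distance(a, b):
    # straight-line distance between the two vectors
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means the vectors point in the same direction
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.5, 1.0, 2.0])
print(euclidean_distance(a, b), cosine_distance(a, b))
```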

For this post, there will be a short introduction to:

  • hash-based indexing
  • tree-based indexing
  • graph-based indexing
  • cluster-based indexing

The resources for this post are:

As always, look at the code and the resources. I'm sparse with graphics, but sometimes they make it much easier to understand the concept.

Graph-based Indexing (e.g. HNSW)

Hierarchical Navigable Small World (HNSW) uses a graph for approximate nearest neighbor search. It relies on a “proximity graph”, which links two vertices based on their closeness (e.g. Euclidean distance). Two fundamental techniques contribute to HNSW: the probability skip list and navigable small world graphs.

A probability skip list allows fast search while using linked lists for easy and fast insertion of new elements. This is done by adding multiple layers of linked lists, where the upper layers skip more and more intermediate nodes. A search starts in the sparse top layer, follows the skips as far as possible and drops down a layer until the target is found.
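
As a toy illustration of the search idea (not a real linked-list implementation), here is a sketch with the layers represented as plain sorted lists, the top layer being the sparsest:

```python
def skip_list_search(layers, target):
    """Walk each layer left to right, starting from the value reached in the
    layer above, until the target or a larger value is found."""
    start_value = float("-inf")
    for layer in layers:                       # from sparse top to dense bottom
        pos = 0
        while pos < len(layer) and layer[pos] < start_value:
            pos += 1                           # skip ahead to where we left off
        while pos < len(layer) and layer[pos] <= target:
            if layer[pos] == target:
                return True
            start_value = layer[pos]           # best value <= target so far
            pos += 1
    return False

layers = [
    [1, 40, 90],                               # top layer: long skips
    [1, 20, 40, 60, 90],                       # middle layer
    [1, 10, 20, 30, 40, 50, 60, 70, 90],       # bottom layer: all elements
]
print(skip_list_search(layers, 60), skip_list_search(layers, 35))  # True False
```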

Navigable small world (NSW) graphs are proximity graphs that include long-range and short-range links to reduce search times. Each graph has a pre-defined entry point, which connects to several vertices. Via greedy routing, the vertex closest to the search query is selected repeatedly until a local minimum is reached.
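
Greedy routing itself is simple. Here is a minimal sketch on a toy proximity graph (the adjacency dict and Euclidean distance are my own assumptions, not taken from the references):

```python
import numpy as np

def greedy_search(graph, vectors, entry_point, query):
    """Move to the neighbor closest to the query until no neighbor is closer."""
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:                  # local minimum reached
            return current, current_dist
        current, current_dist = best, best_dist

# toy example: 5 points on a line, each linked to its direct neighbors
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, entry_point=0, query=np.array([3.2])))
```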

HNSW combines the hierarchical multi-layer structure of the skip list with the long-range (sparse) and short-range (dense) links of the NSW graph. Please look at one of the references; with the visual explanations from those blogs, it is much easier to grasp the concept of HNSW.
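
If you just want to use HNSW instead of implementing it, here is a hedged sketch with the hnswlib library (assuming it is installed via pip install hnswlib; the data and parameter values are only illustrative):

```python
import hnswlib
import numpy as np

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)            # squared L2 distance
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))                   # insert vectors with ids
index.set_ef(50)                                      # search-time accuracy/speed trade-off

labels, distances = index.knn_query(data[:3], k=5)    # 5 approximate neighbors each
print(labels)
```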

Resources:

Cluster-based Indexing (e.g. product quantization)

Product quantization (PQ) is a way to cluster chunks of a vector. It reduces the total size of the database by reducing the overall precision of the vectors. This is different from PCA, which reduces the dimension of a vector. The idea is that the algorithm looks at each chunk separately and “bins” its values into discrete buckets. The search is then performed over the quantized vectors, which results in reduced memory consumption and a speed-up.

Basically, the vector is split into subvectors. Each subvector is quantized by k-means and is represented by the index of its closest centroid.
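
To make this concrete, here is a minimal sketch of the encoding step with scikit-learn's KMeans (my own toy version, assuming the dimension is divisible by the number of subvectors):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, m=4, k=16):
    """Split each vector into m subvectors and learn k centroids per subspace."""
    d = X.shape[1] // m
    return [KMeans(n_clusters=k, n_init=10).fit(X[:, i * d:(i + 1) * d])
            for i in range(m)]

def pq_encode(X, codebooks):
    """Replace each subvector by the index of its closest centroid."""
    d = X.shape[1] // len(codebooks)
    codes = [km.predict(X[:, i * d:(i + 1) * d]) for i, km in enumerate(codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)    # shape (n, m)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32)).astype(np.float32)
codebooks = pq_train(X, m=4, k=16)
codes = pq_encode(X, codebooks)
print(codes.shape)                                     # (1000, 4): 4 bytes per vector
```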

I look into three different variants of PQ:

  • simple PQ
  • Inverted File + Product Quantization (IVF+PQ), which introduces a better data structure to achieve a much denser subdivision of the search space (see the FAISS sketch after this list)
  • Optimized Product Quantization (OPQ), which minimizes quantization distortions
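
For the IVF+PQ variant, here is a hedged sketch with the FAISS library (assuming it is installed, e.g. pip install faiss-cpu; the data and parameter values are only illustrative):

```python
import faiss
import numpy as np

d, nlist, m = 64, 100, 8                  # dimension, number of IVF cells, subvectors
xb = np.random.rand(10_000, d).astype(np.float32)
xq = np.random.rand(5, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer for the IVF part
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-quantizer
index.train(xb)                                       # learn IVF cells and PQ codebooks
index.add(xb)
index.nprobe = 10                                     # number of IVF cells visited per query
D, I = index.search(xq, 5)                            # 5 approximate neighbors each
print(I)
```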

Resources:

Hash-based Indexing (e.g. LSH)

Locality sensitive hashing (LSH) is an approximate technique which puts similar data points into the same buckets or locations with high probability. For retrieval, you only need to explore a small subset instead of the whole dataset. LSH achieves this by increasing the likelihood of hash value collisions for similar inputs; through those collisions, similar embeddings are grouped together.

LSH works in three major steps:

  • Shingling (converting the documents into sets of character sequences of length k) - not needed here, since we already use embeddings
  • Minhashing (calculating the similarity between two data points by assigning MinHash signatures)
  • Locality sensitive hashing (clustering similar items together; see the sketch after this list)
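
Since we already work with embeddings, here is a minimal sketch of the bucketing step using random hyperplanes (my own toy variant, not taken from the resources): each hyperplane contributes one bit of the hash, so similar vectors tend to land in the same bucket.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 8
planes = rng.normal(size=(n_planes, dim))        # random hyperplanes

def lsh_hash(v):
    """Use the signs of the projections as an n_planes-bit bucket key."""
    return tuple((planes @ v > 0).astype(int))

# index a toy dataset into buckets
data = rng.normal(size=(1000, dim))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[lsh_hash(v)].append(i)

# at query time only the candidates in the query's bucket are inspected
query = data[42] + 0.01 * rng.normal(size=dim)
candidates = buckets[lsh_hash(query)]
print(len(candidates), 42 in candidates)         # likely True, but not guaranteed
```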

Resources:

Tree-based Indexing (e.g. Annoy)

Approximate Nearest Neighbors Oh Yeah (Annoy) is an algorithm which uses a forest of trees to search for the nearest neighbors. Annoy is an optimized approach and hard to implement by yourself. Keep in mind that Annoy was published by Spotify and has since been replaced by Voyager. Anyway, traditional tree-based approaches like KD-trees suffer from high dimensionality: the scikit-learn built-in KD-tree takes 3 seconds to get the approximate nearest neighbors for 10 samples. The documentation of Annoy is a bit sparse.
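
Using the library is straightforward, though. Here is a hedged sketch of the Annoy API (assuming it is installed via pip install annoy; the data and parameter values are only illustrative):

```python
import numpy as np
from annoy import AnnoyIndex

dim = 64
data = np.random.rand(1000, dim).astype(np.float32)

index = AnnoyIndex(dim, "angular")               # angular ~ cosine distance
for i, v in enumerate(data):
    index.add_item(i, v.tolist())
index.build(10)                                  # build a forest of 10 trees

# 5 approximate neighbors of the first vector, with distances
print(index.get_nns_by_vector(data[0].tolist(), 5, include_distances=True))
```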

The core idea is to recursively partition the data. During the build phase, the data is divided at each node based on different criteria (e.g., splitting along a dimension like in kd-trees, partitioning by distance to chosen centers like in ball trees, or splitting based on a random projection like in RP trees). This splitting continues until the nodes contain a certain number of points.

For the query phase, the search traverses the tree. Starting at the root, the algorithm uses the node’s splitting criteria to decide which child branch is most likely to contain the nearest neighbors for the given query point.
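
To illustrate both phases, here is a minimal sketch of a single random-projection tree (my own toy version, not the real Annoy implementation, which uses a whole forest and smarter splits):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tree(points, ids, leaf_size=10):
    """Recursively split the points by random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    normal = rng.normal(size=points.shape[1])    # random split direction
    proj = points[ids] @ normal
    threshold = np.median(proj)
    left, right = ids[proj <= threshold], ids[proj > threshold]
    if len(left) == 0 or len(right) == 0:        # degenerate split: stop here
        return {"leaf": ids}
    return {"normal": normal, "threshold": threshold,
            "left": build_tree(points, left, leaf_size),
            "right": build_tree(points, right, leaf_size)}

def query_tree(node, points, q, k=5):
    """Descend to the leaf on the query's side and rank its points exactly."""
    while "leaf" not in node:
        side = "left" if q @ node["normal"] <= node["threshold"] else "right"
        node = node[side]
    ids = node["leaf"]
    return ids[np.argsort(np.linalg.norm(points[ids] - q, axis=1))[:k]]

points = rng.normal(size=(2000, 32))
tree = build_tree(points, np.arange(2000))
print(query_tree(tree, points, points[7]))       # point 7 should rank first
```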

Resources:

This is a brief overview of approximate nearest neighbor algorithms. I hope you find it helpful. You can find the code for an example dataset here.

Thank you for your attention.