ANN Benchmarks with Etienne Dilocker — Weaviate Podcast #16
Approximate Nearest Neighbor (ANN) search is one of the core drivers of industrial scale Vector Search . This article and podcast describes techniques to understand exactly how well ANN algorithms solve this problem when benchmarked on different:
- Dataset Sizes (e.g. 1 million versus 1 billion vectors)
- Vector Dimensionality (e.g. 384-d versus 2048-d)
- Vector Distributions (e.g. clustering properties / distance histograms)
Many ANN algorithms have been developed such as ANNOY [2, 3] and HNSW . These algorithms enable us to find the most similar vectors to our query point. The key trade-off with these algorithms is: accuracy for speed. This podcast with Weaviate CTO and Co-Founder Etienne Dilocker describes recent work on benchmarking these accuracy / speed trade-offs. This enables Weaviate and Vector Search users to get a sense of details such as: how many queries per second (QPS) can we get based on the details bulleted above.
From these benchmarks, Weaviate users will also get a sense of how changing the HNSW hyperparameters alters performance.
This article breaks down the technical chapters of the podcast to hopefully make the content itself more digestable and beginner-friendly.
TLDR of 10 chapters in Weaviate podcast #16 with Etienne Dilocker:
ANN Benchmarks on Weaviate
The podcast opens with a broad introduction of what we are here to discuss. This describes opening questions such as: What are ANN Benchmarks? What lead Etienne to create this benchmark? Etienne describes the story of connecting concrete numbers to claims of speed in Vector Search throughput.
What makes each ANN Dataset Different?
As bulleted in the article overview, the key distinctions in vectorized ANN datasets are the dataset size, the dimensionality of the dataset, and the distribution of distances generally. This last bullet is probably the toughest to understand, and we will return to the topic later with “Problems with Random Vector Benchmarks”.
The purpose of these ANN benchmarks is to avoid linear scaling of runtime with dataset size, e.g. how a binary search tree creates O(log n) search rather than O(n). Etienne describes how nice it is to see this optimization come to life and have minimal performance decrease when adding more vectors (e.g. 1,000 to 100,000 vectors and so on).
How do these relate to my Dataset?
I think the key question for most Weaviate users looking at these benchmarks is to answer: How do these benchmarks relate to my dataset? As Etienne describes, the key detail to look for is probably the vector dimensionality of your dataset. Etienne describes how it is common to compress text into say 384-d and images to higher dimensions, e.g. 1024-d — although with CLIP we are seeing more of a unification between image-text vector spaces. Another key detail of this (with respect to how this impacts you and your use case with Weaviate) — is to see how varying the efConstruction, maxConnections, and ef hyperparameters can change performance. What kind of performance differences can you expect by turning these knobs?
Problems with Random Vector Benchmarks
One of the most interesting topics that have come up recently in the ANN benchmarking space is the problem with Random Vector Benchmarks. To further clarify, the “ground truth” in an ANN benchmark are the exact nearest neighbors. ANN algorithms use shortcuts such as space dividing hyperplanes or graph connectivity heuristics that trade off speed for the exact nearest neighbors. A key detail of the algorithms is that they exploit cluster structure in vector representations to achieve this. Random Vector Benchmarks are uniformly spaced out without any cluster structure, thus negating the utility of the ANN algorithms.
The Role of Clustering in ANN and Deep Learning
Transitioning from the problem with Random Vector Benchmarks — we discussed the general phenomenon of clustering and how clusters are evidence in Deep Learning-based representations. Datasets, such as MS MARCO to take an example, will have a cluster structure in their vector representations derived from the inference of a Sentence Transformer (or similar models). This prior of knowing data will be heavily clustered is key to the design of these ANN search algorithms.
Recommendation as Language Modeling
This brought us to the broad topic of symbolic algorithms on top of neural representations versus generative models in Deep Learning. There is a lot to parse with this, which is why I think the Recommendation task is a great place to start to think about this. Recommendation could be mapped to the same problem as Search — you liked “Ironman”? Here are the 5 movies most semantically similar to Ironman. Recommendation based on representation continues with algorithms like Collaborative Filtering that weight these ratings with user similarity distances as well. One of the groundbreaking works in Deep NLP has been T5 (Text-to-Text Transfer Transformer), named because it shows all NLP tasks can be unified into the language modeling framework, just prefix the input with a prompt describing the task. For example, “Recommendation: User A rated Batman a 3 out of 5 and Ironman a 5 out of 5. The next movie recommended should be <mask>”. There is still a massive opportunity to explore how these prompts should be designed to unify the tasks and wether this is worth doing to begin with compared to alternative ways to represent tasks.
Speaking of symbolic algorithms on top of neural representations — Re-Ranking is probably the term that encapsulates how important this is to Search. It might help to start with the thinking that we have a paid advertiser whose results we bubble up to the top when within the top K of a query. We apply a post-processing step to sort these results. We may also want to weight some combination of exploitation-exploration with organizing the list of top K results. It seems like there is a massive opportunity to explore the nuances of how people use Neural Search and how powerful the Re-Ranking interface is for adapting the Vector Search Engine to these applications.
Class-Property Schema in Weaviate
Returning to the details of Weaviate — I was very happy to learn how HNSW indices work with the Class-Property Schema. This data schema is illustrated in the diagram below:
So using the example above, say we had 1 million authors and 1 billion articles — we would have separate HNSW indices for the authors and articles. I think this is a beautiful abstraction that could be used for multimodal data (Article — (has image) — ArticleImages), as well as heterogeneous data in the same domain, e.g. text links such as (Kaggle Notebook — (has reference) — ArXivPaper). I have also seen questions about using this for multilingual representations in the Weaviate slack chat.
Weaviate Core and Python Inference Containers
This is one of the key things I am trying to figure out in my personal education of Weaviate. Etienne explained how to configure the Python Inference containers (including how to add a new HuggingFace tokenizer and model through the docker script). I did not have any experience with Docker prior to joining Weaviate, but I’m finding it somewhat accessible. I started with simple Docker scripts to run hello_world style python apps. I’m still a little confused about how environment variables are accessed and that kind of thing, but am making progress. We are also very lucky to have Dasith Edirisinghe taking on the topic of “How to add Custom Modules to Weaviate” in the 2022 Google Summer of Code (GSoC ‘22)!
The Weaviate Podcast
Thanks for checking out this recap. If you are interested in Vector Search and Deep Learning technology, you may be interested in our Weaviate podcast! Our recent episodes include:
- Etienne Dilocker — ANN Benchmarks (#16)
- Maximillian Werk — Jina AI’s Neural Search (#15)
- UNC NLP team advised by Prof. Mohit Bansal — VL-Adapter for Efficient CLIP tuning (#14)
- Rick Lamers — Data Science with Orchest (#13)
- Jonathan Frankle — Efficient Deep Learning (#12)
- Han Xiao — Jina AI (#11)
- Yury Malkov and Etienne Dilocker — HNSW (#10)
- Karen Beckers — Search in eCommerce (#9)
- Brady Neal — Causal Inference (#8)
- [Most Viewed All Time] Arvind Neelakantan — OpenAI Embeddings API (#7)
- Alex Cannan — How Zencastr uses Vector Search (#6)
- Michael Wechner — Katie Knowledge Management Bot (#5)
- Malte Pietsch — Haystack from Deepset (#4)
- Bob van Luijt — Weaviate Overview (#3)
- Charles Pierse — Weaviate and Keenious (#2)
- Etienne Dilocker — Weaviate Overview (#1)
Follow our previous podcast guests on Twitter!
Etienne Dilocker (@etiennedi), Charles Pierse (@cdpierse), Bob van Luijt (@bobvanluijt), Malte Pietsch (@malte_pietsch), Michael Wechner @MichiWechner, Arvind Neelakantan (@arvind_io), Brady Neal @CasualBrady, Yury Malkov (@MalkovYury), Han Xiao (@hxiao), Jonathan Frankle (@jefrankle), Rick Lamers (@RickLamers), Mohit Bansal (@mohitban47), Jaemin Cho (@jmin__cho), Yi Lin Sung (@yilin_sung), and Maximilian Werk (@maximilianwerk).
- ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Martin Aumuller, Erik Bernhardsson, and Alexander Faithfull; 2018. arXiv:1807.05614.
- https://github.com/spotify/annoy. Accessed May 2022.
- Nearest neighbors and vector models — part 2 — algorithms and data structures. https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html. Accessed May 2022.
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. Yury A. Malkov, Dmitry A. Yashunin; 2016. arXiv:1603.09320.