Dissimilarity search: implementing in-memory vector search algorithms to PostgreSQL
Thursday, October 24 at 10:30–11:20
Searching over vector spaces is a well-studied concept in mathematics and computer science, but the problem space has new meaning with the rise of AI/ML systems, particularly generative AI, that output very large vectors. Prior to generative AI, we predominately saw vector search in databases through geospatial, full-text, and bioinformatic/cheminformatic searches, all of which are supported in PostgreSQL either natively or through extensions. However, each of these domains have specific requirements that avoided some of the challenge we've seen searching over vectors from AI/ML systems, including using small (3/4-dim) vectors, using vectors as a filter instead of finding order (where "K-nearest neighbor" (K=NN) requires ordering), or allowing for high latency searches.
Modern approximate nearest neighbor search (ANN) has evolved over the past 25 years, and has produced a variety of algorithms that propose efficient search methods over vector spaces that can return related results with a high degree of relevancy (measured as "recall"). Some of these algorithms include IVF ("inverted file"), HNSW ("hierarchical navigable small worlds"), LSH ("locality-sensitive hashing"), and others. However, most ANN algorithmic design focuses on the efficiency of how the index behaves in memory -- some of these performance and search relevancy benefits may not carry over to databases due to considerations like on-disk layout, ongoing updates, available system memory, and resource constraints due to other workloads on the same system.
In this talk, we'll take a deep look at a variety of ANN algorithms that can be implemented in PostgreSQL, and what considerations we need to make to ensure they're implemented efficiently. We'll explore the pgvector implementations of the IVFFlat and HNSW algorithms and design decisions that allow for efficient indexing and search relevancy while achieving a higher level of recall, including index data structures, organization of data pages, effective update management, and query costing. We'll also explore PostgreSQL-specific details that impact these implementations, including lock management, background workers for index build parallelism, leveraging expressions to extend index implementations, and more. We'll also explain the tradeoffs between in-memory and on-disk representations of a variety of ANN algorithms.
At the end of this talk, you'll understand the considerations you need to make when implementing an ANN search algorithm for PostgreSQL, which can also be extended to other databases systems.