Product quantization (PQ) is a powerful vector quantization method that addresses the challenge of efficiently storing and searching high-dimensional data. Its core mechanism involves splitting a high-dimensional vector into several lower-dimensional subvectors. Each subvector space then has its own codebook, learned through clustering algorithms like k-means, allowing each subvector to be represented by a compact code (an index into its respective codebook). The original vector is thus represented by a concatenation of these subvector codes. This technique significantly reduces memory footprint and accelerates similarity search operations, as distances can be computed more efficiently using precomputed lookup tables. PQ is particularly vital in approximate nearest neighbor (ANN) search, powering large-scale image retrieval, recommendation systems, and semantic search in vector databases. More recently, it has found application in compressing the KV cache of large language models (LLMs) to enable their deployment on edge devices, transforming memory-bound attention calculations into compute-bound operations.
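The pipeline above — split the vector, learn one codebook per subspace, store only centroid indices, and compare via precomputed lookup tables — can be sketched as follows. This is a minimal NumPy illustration, not a production implementation; the sizes (8-dimensional vectors, 4 subvectors, 16 centroids) and the toy k-means routine are chosen for readability and are assumptions, not values from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # Toy k-means: random init, then alternate assignment and centroid update.
    cents = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                cents[j] = members.mean(axis=0)
    return cents

D, M, K = 8, 4, 16        # vector dim, number of subvectors, codebook size (illustrative)
sub = D // M              # each subvector is 2-dimensional
train = rng.normal(size=(1000, D))

# Learn one codebook per subspace, each clustered independently.
codebooks = [kmeans(train[:, m * sub:(m + 1) * sub], K) for m in range(M)]

def encode(x):
    # Replace each subvector by the index of its nearest centroid.
    # The full vector becomes M small integer codes (here 4 bits each would suffice).
    return np.array(
        [((codebooks[m] - x[m * sub:(m + 1) * sub]) ** 2).sum(axis=1).argmin()
         for m in range(M)],
        dtype=np.uint8,
    )

def adc_distance(query, codes):
    # Asymmetric distance computation: precompute query-to-centroid squared
    # distances per subspace, then sum M table lookups per encoded vector.
    tables = [((codebooks[m] - query[m * sub:(m + 1) * sub]) ** 2).sum(axis=1)
              for m in range(M)]
    return sum(tables[m][codes[m]] for m in range(M))

x = rng.normal(size=D)
codes = encode(x)         # 8 floats (32 bytes) compressed to 4 one-byte codes
q = rng.normal(size=D)
approx = adc_distance(q, codes)   # approximates the exact squared distance
exact = ((q - x) ** 2).sum()
```

The memory saving comes from storing `codes` instead of `x`, and the search speedup from `adc_distance`: the per-subspace tables are computed once per query, after which each database vector costs only M table lookups and additions rather than a full D-dimensional distance computation.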
Product quantization is a data compression technique that splits large data vectors into smaller parts, quantizing each part independently. This makes it highly efficient for storing and searching vast amounts of data, and it's now being used to compress the internal memory (KV cache) of large AI models, allowing them to run on smaller, less powerful devices.
PQ, Optimized Product Quantization (OPQ), Composite Quantization, Locality-Sensitive Hashing (LSH) (related concept), Inverted File System with Product Quantization (IVFPQ)