LOOKAT is a technique for compressing the Key-Value (KV) cache in transformer-based large language models (LLMs), addressing a critical bottleneck for deployment on resource-constrained edge devices. Unlike conventional quantization methods, which primarily reduce storage, LOOKAT targets the memory bandwidth consumed by attention computation. Drawing on vector database compression, it applies product quantization (PQ) to the key vectors and computes attention scores via asymmetric distance computation (ADC): key vectors are decomposed into subspaces, per-subspace codebooks are learned, and query-key dot products are then evaluated through small lookup tables rather than by reading the full keys from memory. This shifts the attention mechanism from memory-bound to compute-bound, enabling practical deployment of large LLMs on devices with limited memory bandwidth, such as smartphones or IoT hardware.
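The PQ/ADC idea described above can be sketched in a few lines of NumPy. The dimensions, codebook sizes, and simple k-means training loop below are illustrative assumptions, not LOOKAT's actual configuration: keys are split into subspaces, each subvector is replaced by a codebook index, and a query's attention scores are then computed by summing precomputed lookup-table entries instead of reading full key vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64-dim keys, 16 subspaces of 4 dims, 32 centroids each.
d, m, k = 64, 16, 32
sub = d // m

def train_codebooks(keys, iters=10):
    """Learn one k-means codebook per subspace (toy trainer, not LOOKAT's)."""
    codebooks = []
    for j in range(m):
        x = keys[:, j * sub:(j + 1) * sub]
        cents = x[rng.choice(len(x), k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((x[:, None] - cents[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                pts = x[assign == c]
                if len(pts):
                    cents[c] = pts.mean(0)
        codebooks.append(cents)
    return codebooks

def encode(keys, codebooks):
    """Compress each key to m one-byte codes (nearest centroid per subspace)."""
    codes = np.empty((len(keys), m), dtype=np.uint8)
    for j in range(m):
        x = keys[:, j * sub:(j + 1) * sub]
        codes[:, j] = np.argmin(((x[:, None] - codebooks[j][None]) ** 2).sum(-1), axis=1)
    return codes

def adc_scores(query, codes, codebooks):
    """Approximate q . k for every cached key without decompressing the keys.

    Per subspace, precompute <query_subvector, centroid> for all centroids,
    then sum the table entries selected by each key's codes (the ADC step).
    """
    luts = np.stack([codebooks[j] @ query[j * sub:(j + 1) * sub] for j in range(m)])  # (m, k)
    return luts[np.arange(m), codes].sum(axis=1)

keys = rng.normal(size=(512, d))      # stand-in for a cached key matrix
query = rng.normal(size=d)
cbs = train_codebooks(keys)
codes = encode(keys, cbs)             # 512 x 16 bytes vs. 512 x 64 floats
approx = adc_scores(query, codes, cbs)
exact = keys @ query                  # approx should correlate strongly with this
```

Note the bandwidth asymmetry this buys: scoring reads only the one-byte codes plus an `m × k` lookup table, while the exact path must stream every full-precision key vector.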
LOOKAT is a new technique that significantly compresses the memory used by large AI models, specifically the KV cache, when running on devices like phones. It does this with compression methods borrowed from vector databases, allowing these powerful models to operate efficiently without needing a lot of memory bandwidth. This makes it possible to deploy large language models on smaller, less powerful hardware.