Quantization in machine learning reduces the precision of model weights and activations from high-precision formats (e.g., float32) to lower-precision integers (e.g., int8, int4). This process significantly decreases memory footprint and computational cost, enabling efficient deployment on resource-constrained hardware.
In plain terms, quantization makes AI models smaller and faster by converting their high-precision numbers into simpler, lower-precision ones. This allows powerful models to run on devices with limited memory and processing power, such as smartphones or embedded systems, with little loss in accuracy.
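The mapping from float32 to int8 can be sketched with affine (asymmetric) quantization: each float value is scaled and shifted into the int8 range, and dequantization approximately recovers the original. This is a minimal illustration using NumPy; the function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization: map the float range [min, max] onto int8 [-128, 127]."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # scale: float units per integer step; zero_point: integer that represents 0.0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for a weight tensor
q, scale, zp = quantize_int8(weights)
recon = dequantize(q, scale, zp)
# Each int8 value uses 1 byte instead of 4; per-element error is bounded by ~scale
```

The storage saving is the dtype change (4 bytes per weight down to 1), and the per-element reconstruction error is at most about one quantization step (`scale`), which is why accuracy loss is typically small for well-behaved weight distributions.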
Related terms: Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), Dynamic Quantization, Static Quantization, Mixed-Precision Quantization, Binary Neural Networks (BNN), Ternary Neural Networks (TNN), INT8 quantization, INT4 quantization