Quantization in machine learning reduces the precision of model weights and activations from high-precision formats (e.g., float32) to lower-precision integers (e.g., int8, int4). This process significantly decreases memory footprint and computational cost, enabling efficient deployment on resource-constrained hardware.
In plain terms, quantization makes AI models smaller and faster by converting their high-precision numbers into simpler, lower-precision ones. This allows powerful models to run on devices with limited memory and processing power, such as smartphones or embedded systems, with little loss in accuracy.
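The mapping from float32 to int8 can be sketched with affine (asymmetric) quantization: each float value is scaled and shifted into the int8 range, and dequantization approximately recovers the original. This is a minimal illustration using NumPy; the function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization: map the float range [min, max] onto int8 [-128, 127]."""
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    # scale: float units per integer step; zero_point: integer that represents 0.0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for a weight tensor
q, scale, zp = quantize_int8(weights)
recon = dequantize(q, scale, zp)
# Each int8 value uses 1 byte instead of 4; per-element error is bounded by ~scale
```

The storage saving is the dtype change (4 bytes per weight down to 1), and the per-element reconstruction error is at most about one quantization step (`scale`), which is why accuracy loss is typically small for well-behaved weight distributions.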
Related terms: Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), Dynamic Quantization, Static Quantization, Mixed-Precision Quantization, Binary Neural Networks (BNN), Ternary Neural Networks (TNN), INT8 quantization, INT4 quantization