Recent advances in model compression focus on improving the efficiency of large language models (LLMs) while maintaining their performance. Techniques such as adaptive pruning and family-aware quantization are gaining traction, enabling significant reductions in computational cost without extensive retraining. Adaptive pruning uses agent-guided methods to select which layers to prune, improving factual knowledge retention and overall accuracy. Family-aware quantization addresses the limitations of traditional calibration data by generating high-fidelity samples from related models, reducing accuracy loss at deployment. New frameworks such as Hessian Robust Quantization reshape the loss landscape to improve robustness to quantization noise, while quantization-aware unlearning methods are being developed to reliably remove sensitive information. Together, these innovations make LLMs more practical to deploy on resource-constrained devices and address key challenges in knowledge retention and data privacy, positioning the field for broader commercial application.
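To make the quantization side of this overview concrete, here is a minimal, generic sketch of symmetric uniform post-training quantization of a weight tensor. It is illustrative only and does not reproduce any specific method mentioned above; the function names (`quantize_symmetric`, `dequantize`) are hypothetical.

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 4):
    """Symmetric uniform quantization: map weights to signed integer codes.

    Returns the integer codes and the scale needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax        # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, s = quantize_symmetric(w, bits=4)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()             # bounded by scale / 2 when nothing clips
```

Because the scale is chosen so the largest weight lands exactly on the grid, no value is clipped here and the round-trip error stays within half a quantization step.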
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, ...
As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as Sparse...
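The snippet above discusses post-training pruning methods. As a hedged baseline for comparison (not the method of any paper summarized here), the simplest such approach is magnitude pruning, which zeroes the smallest-magnitude weights; the helper name `magnitude_prune` is hypothetical.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights (global magnitude pruning).

    `sparsity` is the fraction of weights to remove, e.g. 0.5 for 50%.
    """
    k = int(w.size * sparsity)                 # number of weights to drop
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold               # keep strictly larger weights
    return w * mask

rng = np.random.default_rng(1)
w = rng.normal(size=(128, 128))
w_pruned = magnitude_prune(w, sparsity=0.5)
achieved = 1.0 - np.count_nonzero(w_pruned) / w.size
```

More recent post-training methods refine this idea by weighting each parameter's importance with activation or curvature information rather than raw magnitude.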
Data-free knowledge distillation enables model compression without original training data, critical for privacy-sensitive tabular domains. However, existing methods do not perform well on tabular da...
Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and univ...
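The snippet above raises the representativeness of calibration data in PTQ. The toy sketch below (assumed for illustration; `calibrate_scale` is a hypothetical helper) shows why this matters: the activation quantization scale is fit to whatever the calibration set happens to contain, so unrepresentative samples produce a poor scale and clipping at deployment.

```python
import numpy as np

def calibrate_scale(calib_batches, bits: int = 8) -> float:
    """Symmetric activation scale from max-magnitude calibration.

    The scale is set by the largest activation observed over the
    calibration set, so unrepresentative samples yield a poor scale.
    """
    qmax = 2 ** (bits - 1) - 1
    observed_max = max(np.abs(batch).max() for batch in calib_batches)
    return observed_max / qmax

rng = np.random.default_rng(2)
# Calibration batches assumed drawn from the deployment distribution.
good_calib = [rng.normal(scale=1.0, size=256) for _ in range(8)]
scale = calibrate_scale(good_calib)

# Deployment-time activations; values beyond the calibrated range get clipped.
x = rng.normal(scale=1.0, size=1024)
q = np.clip(np.round(x / scale), -128, 127)
x_hat = q * scale
```

Within the calibrated range the round-trip error is at most half a step; out-of-range activations are clipped, which is exactly the failure mode a more universal calibration scheme tries to avoid.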
Deploying Deep Neural Networks (DNNs) on resource-constrained embedded systems requires aggressive model compression techniques like quantization and pruning. However, ensuring that the compressed mod...
Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error....
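The 'low error, high loss' paradox can be demonstrated with a two-weight toy example (an assumed illustration, not the paper's construction): under a quadratic loss proxy with input second-moment matrix H, the rounding with the *smaller* weight error can incur the *larger* loss when inputs are correlated.

```python
import numpy as np

# Proxy loss for quantizing weights w -> w_q with correlated inputs:
#   loss = (w - w_q)^T H (w - w_q),  where H = E[x x^T]
H = np.array([[1.0, 0.9],
              [0.9, 1.0]])           # strongly correlated input features
w = np.array([0.4, 0.4])             # weights to round to the integer grid

def loss(w_q):
    d = w - w_q
    return d @ H @ d

rtn = np.round(w)                    # round-to-nearest gives [0, 0]
alt = np.array([1.0, 0.0])           # larger weight error, yet smaller loss

err_rtn, err_alt = np.sum((w - rtn) ** 2), np.sum((w - alt) ** 2)
loss_rtn, loss_alt = loss(rtn), loss(alt)
```

Here round-to-nearest minimizes the squared weight error, but the off-diagonal terms of H let the errors of correlated weights cancel in the alternative rounding, so its loss is lower despite the higher per-weight error.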
Layer-wise mixed-precision quantization (LMPQ) enables effective compression under extreme low-bit settings by allocating higher precision to sensitive layers. However, existing methods typically trea...
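The core mechanism of LMPQ, assigning higher precision to sensitive layers under a budget, can be sketched with a simple greedy allocator (an assumed toy heuristic, not the method the snippet describes; `allocate_bits` is hypothetical, and the per-layer sensitivities are taken as given, e.g. from a Hessian or loss probe).

```python
def allocate_bits(sensitivities, avg_bits: float, choices=(2, 4, 8)):
    """Greedy layer-wise bit allocation under an average-bit budget.

    Start every layer at the lowest precision, then repeatedly upgrade
    the most sensitive layers until the extra-bit budget is spent.
    """
    n = len(sensitivities)
    bits = [min(choices)] * n
    budget = avg_bits * n - sum(bits)           # total extra bits available
    order = sorted(range(n), key=lambda i: -sensitivities[i])
    changed = True
    while changed:
        changed = False
        for i in order:                          # most sensitive first
            higher = [c for c in choices if c > bits[i]]
            if higher and higher[0] - bits[i] <= budget:
                budget -= higher[0] - bits[i]
                bits[i] = higher[0]
                changed = True
    return bits

sens = [0.9, 0.1, 0.5, 0.05]                     # toy per-layer sensitivities
bits = allocate_bits(sens, avg_bits=5.0)         # -> most sensitive layer gets 8 bits
```

Treating layers independently like this is exactly the limitation such snippets point out: the allocator never considers how quantization errors interact across layers.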
The unmatched ability of Deep Neural Networks to capture complex patterns in large and noisy datasets is often associated with their large hypothesis space, and consequently with the vast amount of pa...
PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict l...
Machine unlearning aims to remove specific knowledge (e.g., copyrighted or private data) from a trained model without full retraining. In practice, models are often quantized (e.g., 4-bit) for deploym...