Current research in model optimization increasingly focuses on improving the efficiency and performance of large language and vision-language models, addressing the pressing challenges of computational cost and limited generalization. Recent work on masked diffusion language models reveals strategies to improve generalization without the performance plateaus typical of traditional models, while techniques like Prefill-Only Pruning optimize inference by selectively retaining critical layers during different processing stages. Adaptive pruning frameworks are also emerging that let models adjust their architecture dynamically to heterogeneous input data, improving accuracy while reducing parameter counts. Innovations such as FlashHead and Quant Experts tackle bottlenecks in classification and quantization, respectively, enabling faster and more efficient inference. Collectively, these advances suggest a shift towards more modular, context-aware systems that deliver high performance with significantly lower resource demands, making them more viable for commercial applications across domains.
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured ...
Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In thi...
Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to...
Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs); existing approaches primarily incorporate both weights and activations into the importance metric. However, ...
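An importance metric that combines weights and activations can be sketched in the style of Wanda-like scoring; the function names and the specific |W|·‖X‖ score below are illustrative assumptions, not the exact metric used in the work above:

```python
import numpy as np

def weight_activation_importance(weight, activations):
    """Score each weight by its magnitude times the L2 norm of its
    input activation channel (a common weight-and-activation metric;
    the paper's exact formulation is an assumption here)."""
    # activations: (n_samples, in_features); weight: (out_features, in_features)
    act_norm = np.linalg.norm(activations, axis=0)  # per-input-channel norm
    return np.abs(weight) * act_norm                # elementwise importance

def prune_by_importance(weight, scores, sparsity=0.5):
    """Zero out the lowest-scoring weights within each output row."""
    pruned = weight.copy()
    k = int(weight.shape[1] * sparsity)
    for row in range(weight.shape[0]):
        idx = np.argsort(scores[row])[:k]  # indices of least important weights
        pruned[row, idx] = 0.0
    return pruned
```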
Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving con...
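Merging by adding weight updates can be sketched as task-arithmetic over parameter deltas; the dictionary layout and the `alpha` scaling factor are illustrative assumptions, not this paper's specific method:

```python
import numpy as np

def merge_task_arithmetic(base, finetuned_models, alpha=1.0):
    """Merge fine-tuned models by summing their weight deltas onto the
    base model (task-arithmetic-style merging; alpha is illustrative)."""
    merged = {}
    for name, base_w in base.items():
        # Each fine-tuned model contributes its update relative to the base.
        delta = sum(ft[name] - base_w for ft in finetuned_models)
        merged[name] = base_w + alpha * delta
    return merged
```

When the fine-tuned updates touch disjoint parameters, the sum preserves each model's change; the conflict-resolution problem the abstract alludes to arises when updates overlap.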
In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterpar...
Standard mixed-precision training of neural networks requires many bytes of accelerator memory for each model parameter. These bytes reflect not just the parameter itself, but also its gradient and on...
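The per-parameter memory accounting can be made concrete. The breakdown below assumes fp16 parameters and gradients with fp32 Adam optimizer state, a common configuration (not necessarily the exact one analyzed in the work above):

```python
def adam_mixed_precision_bytes_per_param():
    """Approximate accelerator bytes per parameter for standard
    mixed-precision training with Adam (activations excluded)."""
    fp16_param = 2     # half-precision working copy of the parameter
    fp16_grad = 2      # half-precision gradient
    fp32_master = 4    # full-precision master copy of the parameter
    fp32_momentum = 4  # Adam first-moment estimate
    fp32_variance = 4  # Adam second-moment estimate
    return fp16_param + fp16_grad + fp32_master + fp32_momentum + fp32_variance
```

Under this accounting a 7B-parameter model needs roughly 7e9 × 16 ≈ 112 GB for parameters, gradients, and optimizer state alone, before any activation memory.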
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metr...
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights a...
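A minimal round-trip sketch of symmetric uniform quantization, the basic operation PTQ builds on; calibration, per-channel scales, and the VLM-specific handling in the work above are omitted:

```python
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Symmetric uniform quantization of a tensor followed by
    dequantization, returning the low-precision approximation."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for int8
    scale = np.max(np.abs(x)) / qmax         # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                         # dequantized approximation
```

The round-trip error per element is bounded by half the quantization step, which is why outlier values (which inflate `scale`) are a central difficulty in PTQ.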
Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computa...
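The computational cost an FFL introduces grows with the product of its input and output widths, which a one-line FLOP count makes explicit (using the usual convention of 2 FLOPs per multiply-accumulate):

```python
def ffl_flops(batch, d_in, d_out):
    """FLOPs for one fully-connected feedforward layer on a batch:
    each output element needs d_in multiply-accumulates."""
    return 2 * batch * d_in * d_out
```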