Recent advances in large language model (LLM) inference optimization address critical challenges in efficiency and resource management. Techniques such as TIDE and RPS-Serve raise throughput and reduce latency through early exits and modality-aware scheduling, respectively, capabilities that are particularly valuable for multimodal applications. Speculative decoding methods, including OnlineSpec and ConFu, use iterative feedback loops to improve draft-model accuracy and speed, while LycheeDecode and LycheeCluster tackle long-context processing bottlenecks through novel cache-management strategies. Frameworks such as EcoThink focus on sustainability, optimizing energy consumption without sacrificing performance. Together, these approaches improve the responsiveness of LLMs and open avenues for deployment in resource-constrained environments, addressing both commercial demands for efficiency and the growing need for environmentally responsible AI. As the field evolves, the emphasis on practical, scalable solutions is becoming increasingly pronounced.
Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at in...
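The truncated abstract names the mechanism — tiny learned routers at periodic checkpoint layers that let a token exit early — but not its details. A minimal sketch of that general early-exit pattern, assuming a single linear probe per checkpoint (a hypothetical simplification, not TIDE's actual router design):

```python
import math

class EarlyExitRouter:
    """Illustrative learned router: one linear probe on the hidden
    state that scores whether this token can exit now."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def confidence(self, h):
        z = sum(w * x for w, x in zip(self.weights, h)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid

def forward_with_exits(layers, routers, h, threshold=0.9):
    """Apply layers in order; at each checkpoint layer, consult its
    router and stop once confidence clears the threshold, skipping
    the remaining layers for this token."""
    for i, layer in enumerate(layers):
        h = layer(h)
        if i in routers and routers[i].confidence(h) > threshold:
            return h, i  # exited early at layer i
    return h, len(layers) - 1
```

Because the routers are attached post-training, the base model's weights stay frozen; only the probes are learned.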
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts wh...
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parall...
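The Multi-Token Prediction idea mentioned here can be sketched generically: several output heads read the same hidden state, and head j predicts the token at offset t+1+j, so one forward pass proposes multiple future tokens. This is an illustrative toy, not the paper's architecture:

```python
import numpy as np

def mtp_predict(hidden, head_weights):
    """Multi-token prediction sketch: k output heads share one hidden
    state of shape (d,); head j has weights of shape (d, vocab) and
    greedily predicts the token at offset t+1+j."""
    return [int(np.argmax(hidden @ W)) for W in head_weights]
```

In a real system, the drafted tokens are then verified (or trained against) rather than trusted outright, since later heads see progressively less context.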
As the Web transitions from static retrieval to generative interaction, the escalating environmental footprint of Large Language Models (LLMs) presents a critical sustainability challenge. Current par...
Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce addition...
Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to the memory overhead during dec...
This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parame...
We present ITQ3_S (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates TurboQuant (TQ), a rot...
KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that redu...
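As a concrete instance of the "per-pair footprint" direction the abstract mentions, here is a minimal sketch of per-token absmax quantization of a KV tensor — int8 codes plus one floating-point scale per token. This is the generic technique, not the paper's specific scheme:

```python
import numpy as np

def quantize_kv(x, bits=8):
    """Per-token absmax quantization of a KV tensor of shape
    (tokens, head_dim): int codes plus one fp scale per token."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def dequantize_kv(codes, scale):
    return codes.astype(np.float32) * scale
```

Because scales are per token, newly generated tokens can be quantized on the fly without touching earlier cache entries — which is what makes this family orthogonal to approaches that reduce the number of cached pairs.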
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verif...
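The draft-then-verify structure this abstract refers to can be shown in a greedy toy form. Real systems verify all drafted tokens in a single batched target forward pass and use probabilistic acceptance to preserve the sampling distribution; the sketch below keeps only the control flow, with `draft_next` and `target_next` as stand-in callables:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step: the draft model proposes
    k tokens; the target verifies them left to right, and the first
    disagreement is replaced by the target's own token."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expect = target_next(ctx)  # in practice: one batched target pass
        if expect != t:
            accepted.append(expect)  # correction token from the target
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

Every step emits at least one token (the correction), so output quality matches target-only decoding; the speedup comes from how many drafted tokens survive verification.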