Recent research on improving the efficiency of large language models (LLMs) increasingly focuses on cutting computational cost while preserving performance across tasks. Techniques such as confidence-guided self-refinement and adaptive model selection are gaining traction, letting models adjust their compute dynamically based on real-time performance signals. This shift matters most for latency-sensitive applications such as chatbots and automated reasoning systems, where lower energy consumption translates directly into cost savings. Innovations like the Collaborative Memory Transformer and hybrid attention mechanisms tackle the challenges of long-context processing, allowing models to handle larger inputs without a proportional increase in resource demands. Approaches that leverage reinforcement learning for generative selection further show that smaller models can achieve competitive results, improving the scalability of LLM applications. Collectively, these advances point to more efficient, adaptable, and cost-effective LLM deployment in commercial settings.
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises...
Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a...
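The parallel-decoding baseline this abstract starts from can be sketched as Best-of-N sampling with self-consistency (majority) voting; `generate` below is a hypothetical stand-in for one sampled LLM completion, not CoRefine itself:

```python
from collections import Counter

def generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one sampled LLM completion; here it
    # just varies deterministically with the seed for illustration.
    return "42" if seed % 3 else "41"

def best_of_n(prompt: str, n: int = 512) -> str:
    """Parallel-sampling baseline: draw n candidate answers, then
    return the most frequent one (self-consistency voting)."""
    answers = [generate(prompt, seed=i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(best_of_n("What is 6 * 7?", n=9))  # prints "42"
```

Note the cost structure: accuracy improves with n, but compute grows linearly with it, which is the overhead such methods aim to avoid.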
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory...
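As a back-of-the-envelope illustration of why an indefinitely growing KV cache is a barrier, the sketch below assumes Llama-style grouped-query dimensions (32 layers, 8 KV heads of dimension 128, fp16); the exact figures depend on the model:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Memory for a standard KV cache: keys and values (factor 2) are
    stored per layer, per token, per KV head, per head dimension."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

for ctx in (4_096, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# 4K tokens already take 0.5 GiB; 128K tokens take 16.0 GiB.
```

Because the cache grows linearly in context length (on top of attention's quadratic compute), long-context methods typically bound or compress this structure.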
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a sy...
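A minimal sketch of the SLM/LLM collaboration idea, framed as a confidence-gated cascade; the stub models and the threshold are illustrative assumptions, not COREA's actual mechanism:

```python
def slm(question: str):
    # Hypothetical small model: cheap, but sometimes unsure.
    return ("Paris", 0.95) if "capital" in question else ("unsure", 0.3)

def llm(question: str):
    # Hypothetical large model: expensive, assumed reliable.
    return ("42", 0.99)

def collaborative_answer(question: str, threshold: float = 0.8):
    """Route each query to the small model first; escalate to the
    large model only when the small model's confidence is low."""
    answer, confidence = slm(question)
    if confidence >= threshold:
        return answer, "slm"
    return llm(question)[0], "llm"

print(collaborative_answer("What is the capital of France?"))  # ('Paris', 'slm')
print(collaborative_answer("Open research question?"))         # ('42', 'llm')
```

The cost saving comes from the easy-query path never touching the large model; the threshold trades accuracy against escalation rate.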
Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we prop...
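The two cost terms named here scale differently with sequence length, which a rough FLOP count makes concrete; the dimensions below are Llama-2-7B-style assumptions for illustration:

```python
D_MODEL, D_FF = 4096, 14336  # assumed model / FFN widths (Llama-2-7B-style)

def attn_flops(seq_len: int) -> int:
    # Score matrix QK^T plus weighted sum AV: ~2 * (2 * L^2 * d) flops.
    return 4 * seq_len**2 * D_MODEL

def ffn_flops(seq_len: int) -> int:
    # Up- and down-projection per token: ~2 * (2 * d * d_ff) flops.
    return 4 * seq_len * D_MODEL * D_FF

# Quadratic attention overtakes the dense FFN once L^2 * d > L * d * d_ff,
# i.e. once the sequence is longer than d_ff tokens.
crossover = next(L for L in range(1, 1 << 20) if attn_flops(L) > ffn_flops(L))
print(crossover)  # 14337
```

For short sequences the dense FFN dominates, while for long ones attention does, which is why efficiency work targets both terms.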
Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address t...
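Generative selection differs from independent per-candidate scoring in that one model sees all candidates jointly and names the winner. The sketch below uses a trivial stand-in selector (longest candidate wins) purely to make the control flow runnable; a real GenSelect-style selector would be an LLM reasoning over the listing:

```python
def selector_llm(prompt: str) -> str:
    # Hypothetical stand-in for the selection model: picks the index of
    # the longest listed candidate. Illustrative only.
    listed = [line for line in prompt.splitlines() if line[:1].isdigit()]
    return max(listed, key=len).split(".", 1)[0]

def generative_select(question: str, candidates: list[str]) -> str:
    """Generative selection: present all candidates to a selector model
    in one prompt and ask for the index of the best one."""
    listing = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates))
    prompt = f"Question: {question}\nCandidates:\n{listing}\nBest index:"
    return candidates[int(selector_llm(prompt))]
```

Joint comparison lets the selector exploit agreements and contradictions across candidates, which independent Best-of-N scoring cannot.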
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art ...
The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While exi...
The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structu...
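The sparse-yet-short-paths property described here is the classic small-world effect, which can be reproduced on a synthetic random graph; the sketch below uses a small Erdős–Rényi graph with pure-Python BFS (parameters are illustrative, not the connectome's):

```python
import random
from collections import deque

def avg_shortest_path(n: int = 2000, p: float = 0.005,
                      sources: int = 20, seed: int = 0) -> float:
    """Average BFS distance from a sample of source nodes in a sparse
    random graph: even ~0.5% wiring density yields only a few hops."""
    rng = random.Random(seed)
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    total = count = 0
    for s in rng.sample(range(n), sources):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count
```

With ~10 edges per node the average distance lands near log(n)/log(degree), a few hops, mirroring the connectome's 4.4-hop figure at far larger scale.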
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large ...
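The design choice being revisited can be made concrete with a sketch that rebuilds an OpenAI-style message list without the assistant's earlier replies; the helper and its flag are illustrative, not the paper's method:

```python
def prune_history(messages: list[dict], keep_last_assistant: bool = True) -> list[dict]:
    """Return the conversation with the assistant's earlier replies
    removed, keeping all user turns (and optionally the latest reply)."""
    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    drop = set(assistant_idxs[:-1] if keep_last_assistant else assistant_idxs)
    return [m for i, m in enumerate(messages) if i not in drop]

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "summarize our chat"},
    {"role": "assistant", "content": "sure"},
    {"role": "user", "content": "now continue"},
]
print([m["content"] for m in prune_history(history)])
# ['hi', 'summarize our chat', 'sure', 'now continue']
```

Dropping old assistant turns shrinks the prompt (and KV cache) on every turn, at the possible cost of the model losing track of what it previously said.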