Recent work on large language model (LLM) efficiency aims to reduce computational cost while maintaining or improving reasoning capability. Techniques such as confidence-guided self-refinement, adaptive model selection, and hybrid attention mechanisms are being developed to cut token usage and improve accuracy. They matter most to builders deploying LLMs in resource-constrained environments, where performance has to be balanced against cost at scale. Methods like collaborative reasoning and selective halting let a system allocate compute where it is needed and stop when further processing adds little, which makes LLM applications cheaper to run and easier to deploy widely.
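As a rough illustration of what selective halting can look like in practice, the sketch below samples answers one at a time and stops once the leading answer holds a large enough share of the votes. It is a generic early-stopping majority vote, not the method of any paper listed on this page; the function name `sample_until_confident`, its parameters, and the default threshold are all hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple


def sample_until_confident(
    sample_answer: Callable[[], str],
    max_samples: int = 16,
    min_samples: int = 3,
    threshold: float = 0.7,
) -> Tuple[str, int]:
    """Draw answers one at a time and halt early when they agree.

    `sample_answer` is a hypothetical stand-in for one model call that
    returns a final answer string. Returns (answer, samples_used).
    """
    votes: Counter = Counter()
    for drawn in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        answer, count = votes.most_common(1)[0]
        # Halt once the leading answer's vote share reaches the threshold.
        if drawn >= min_samples and count / drawn >= threshold:
            return answer, drawn
    # Budget exhausted: fall back to the plurality answer.
    return votes.most_common(1)[0][0], max_samples


# Usage with a stubbed sampler in place of a real model call.
stub = iter(["a", "b", "a", "a", "b"])
print(sample_until_confident(lambda: next(stub)))
```

With the stubbed sequence above, the loop stops after the fourth sample instead of spending the full budget, which is the token saving the approach is after.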
Topic-specific paper and score movement from the daily diff ledger.
Large Language Models (LLMs) often rely on test-time scaling via parallel decoding (for example, 512 samples) to boost reasoning accuracy, but this incurs substantial compute. We introduce CoRefine, a...
Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises...
The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory...
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as exp...
Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed ...
Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address t...
Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we prop...
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, ...
Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices ...
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a sy...
Canonical route: /topics
Agent Handoff
Canonical ID llm-efficiency | Route /topic/llm-efficiency
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/topic/llm-efficiency
MCP example
{
  "tool": "search_papers",
  "arguments": {
    "query": "LLM Efficiency",
    "cluster": "LLM Efficiency"
  }
}
source_context
{
  "surface": "topic",
  "mode": "topic",
  "query": "LLM Efficiency",
  "normalized_query": "llm-efficiency",
  "route": "/topic/llm-efficiency",
  "paper_ref": null,
  "topic_slug": "llm-efficiency",
  "benchmark_ref": null,
  "dataset_ref": null
}
Use This Via API or MCP
Topic pages bundle paper counts, viability trends, author concentration, and top questions into one canonical surface your agents can reference before they open Signal Canvas or create a workspace.
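For agents that assemble these requests in code, a minimal Python sketch follows. It only constructs the REST URL, the MCP tool call, and the source_context object shown on this page, and it makes no network request, because the response format is not documented here. The helper names and the slug rule (lowercase, whitespace replaced by hyphens) are assumptions inferred from the "LLM Efficiency" to "llm-efficiency" example, not a documented API.

```python
import json
from urllib.parse import quote

# Base path taken from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff"


def slugify(query: str) -> str:
    """Assumed normalization: lowercase, whitespace runs become hyphens."""
    return "-".join(query.lower().split())


def handoff_url(topic_slug: str) -> str:
    """Build the topic handoff URL used in the REST example."""
    return f"{BASE_URL}/topic/{quote(topic_slug)}"


def search_papers_call(query: str) -> dict:
    """Build the MCP tool call; using the query as the cluster mirrors the example."""
    return {"tool": "search_papers", "arguments": {"query": query, "cluster": query}}


def source_context(query: str) -> dict:
    """Build a topic-mode source_context with the same fields as the example."""
    slug = slugify(query)
    return {
        "surface": "topic",
        "mode": "topic",
        "query": query,
        "normalized_query": slug,
        "route": f"/topic/{slug}",
        "paper_ref": None,
        "topic_slug": slug,
        "benchmark_ref": None,
        "dataset_ref": None,
    }


if __name__ == "__main__":
    print(handoff_url(slugify("LLM Efficiency")))
    print(json.dumps(search_papers_call("LLM Efficiency"), indent=2))
    print(json.dumps(source_context("LLM Efficiency"), indent=2))
```

Keeping the slug rule in one helper means that if the site normalizes queries differently, only `slugify` needs to change.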