LLM Inference

Trending · Proof pending

18 papers

5.4 viability

+125% (30d)

Proof pending. Core topic summary fields are still materializing.

State of the Field

Recent work on large language model (LLM) inference focuses on efficiency and accuracy during token generation. KV-Fold and Latent Phase-Shift Rollback address long-context processing and inference-time error correction, respectively, without requiring extensive retraining. On the systems side, ArcLight targets many-core CPUs and DUAL-BLADE targets edge devices, both by addressing memory management and I/O bottlenecks. For developers deploying LLMs in real-world applications, these approaches aim at faster, more reliable inference while maintaining fidelity across contexts, particularly under resource constraints.

Last updated May 29, 2026

Topic-linked question coverage is still building for this proof surface.

Topic trend

Topic-specific paper and score movement from the daily diff ledger.

Papers

1-10 of 18

Research Paper·May 12, 2026

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model proce...

8.0 viability
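The KV-Fold abstract above describes the KV cache as the accumulator in a left fold over sequence chunks. The sketch below illustrates only that fold pattern, not the paper's method: the abstract is truncated, so `process_chunk`, `split_chunks`, `fold_cache`, and the toy `(key, value)` cache are all hypothetical stand-ins, and a real step would run a model forward pass over each chunk.

```python
from functools import reduce
from typing import List, Tuple

# Hypothetical toy cache: one (key, value) pair per processed token.
KVCache = List[Tuple[int, int]]


def process_chunk(cache: KVCache, chunk: List[int]) -> KVCache:
    """Toy step function (assumption): extend the cache with one entry per token.

    A real system would run the model on `chunk` while attending to `cache`.
    """
    return cache + [(tok, tok * 2) for tok in chunk]


def split_chunks(tokens: List[int], size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def fold_cache(tokens: List[int], chunk_size: int) -> KVCache:
    """Left fold: thread the cache through the chunks in order, starting empty."""
    return reduce(process_chunk, split_chunks(tokens, chunk_size), [])
```

The point of the fold framing is that each step depends only on the previous cache and the current chunk, so the sequence can be consumed chunk by chunk.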

Research Paper·Apr 20, 2026

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce ...

7.0 viability

Research Paper·Mar 8, 2026

ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many-core CPU platforms. Many-core CPUs are widely dep...

7.0 viability

Research Paper·Apr 30, 2026

Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target d...

7.0 viability · Has code

Research Paper·Apr 20, 2026

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

While distributed device-edge speculative decoding enhances resource utilization across heterogeneous nodes, its performance is often bottlenecked by conventional token-level verification strategies. ...

7.0 viability

Research Paper·Feb 10, 2026

Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solu...

6.0 viability

Research Paper·May 14, 2026

PreFT: Prefill-only finetuning for efficient inference

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels...

6.0 viability

Research Paper·Apr 14, 2026

SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation,...

6.0 viability
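For readers unfamiliar with speculative decoding, mentioned in the abstract above, here is a minimal sketch of one generic greedy draft-and-verify step. It is not SpecBound's algorithm (the abstract is truncated and its bounding and calibration details are not shown here). The `draft` and `target` callables and the `speculative_step` name are hypothetical, and the verification loop stands in for what would normally be a single batched pass of the target model.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens produced by one greedy draft-and-verify step."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each proposal in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. All k proposals matched, so the target adds one more token.
    accepted.append(target(prefix + accepted))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model (assumption): count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model (assumption): agrees with target except after token 3."""
    return 0 if ctx[-1] == 3 else target(ctx)
```

Under greedy decoding, the accepted tokens are exactly what the target model would have generated on its own; the speedup comes from verifying several drafted tokens per target pass.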

Research Paper·May 20, 2026·B2B·Consumer

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterog...

5.0 viability

Research Paper·Apr 29, 2026

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which of...

5.0 viability
