Recent advances in inference optimization target both efficiency and accuracy across a range of machine learning models. Techniques such as state-space duality with autoregressive caching allow inference systems to run on multiple hardware platforms without custom kernels, reducing operational complexity. Meanwhile, methods such as CORAL address persistent miscalibration in large language models through inference-time steering, improving calibration without retraining. Neural amortization frameworks for probabilistic graphical models streamline MPE inference, enabling local search strategies that exploit fixed graph structures. Compact data formats like HiFloat4 further reduce memory and power requirements, making inference more sustainable. Together, these developments target settings where fast, accurate decision-making matters, such as healthcare diagnostics and automated systems.
State-space model releases are typically coupled to fused CUDA and Triton kernels, inheriting a hard dependency on NVIDIA hardware. We show that Mamba-2's state space duality algorithm -- diagonal sta...
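The portability claim rests on the fact that a diagonal state-space recurrence is just elementwise multiplies and reductions, which any array backend can execute. A minimal NumPy sketch of that recurrence follows; the function name `ssd_scan`, its signature, and the shapes are illustrative assumptions, not Mamba-2's actual API or its chunked SSD algorithm.

```python
import numpy as np

def ssd_scan(x, a, b, c):
    """Sequential diagonal state-space recurrence (illustrative sketch):
        h_t = a_t * h_{t-1} + b_t * x_t
        y_t = <c_t, h_t>
    x: (T,) inputs; a: (T,) per-step decays; b, c: (T, N) projections.
    Pure NumPy -- no fused CUDA/Triton kernel required.
    """
    T = len(x)
    N = b.shape[1]
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + b[t] * x[t]   # elementwise decay + input injection
        y[t] = c[t] @ h              # readout
    return y
```

With decay fixed to 1 and scalar state (N=1), the scan degenerates to a running sum, which makes the recurrence easy to sanity-check by hand.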
Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is ex...
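The simplest instance of inference-time calibration, adjusting output probabilities without touching model weights, is temperature scaling of the logits. The sketch below shows that generic idea only; it is not CORAL's steering mechanism, whose details differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability shift
    e = np.exp(z)
    return e / e.sum()

def calibrate(logits, temperature=2.0):
    """Generic inference-time temperature scaling (illustration only).
    Temperatures > 1 soften overconfident distributions; no retraining
    is involved -- the same property CORAL-style steering exploits."""
    return softmax(logits / temperature)
```

Raising the temperature flattens the distribution: the top-class probability drops while the probabilities still sum to one.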
Most Probable Explanation (MPE) inference in Probabilistic Graphical Models (PGMs) is a fundamental yet computationally challenging problem arising in domains such as diagnosis, planning, and structur...
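To ground what "local search over a fixed graph structure" means for MPE, here is a minimal greedy single-flip search over binary variables in a factor graph. It is a baseline sketch under assumed data structures (a list of `(scope, table)` factors), not the paper's neural amortization framework.

```python
import random

def mpe_local_search(factors, variables, steps=100, seed=0):
    """Greedy hill climbing for MPE on binary variables (sketch).
    factors: list of (scope, table) where scope is a tuple of variable
    names and table maps assignments (0/1 tuples) to potentials.
    Flips one variable at a time whenever the product of potentials
    improves; stops at a local optimum or after `steps` sweeps."""
    rng = random.Random(seed)
    assign = {v: rng.choice([0, 1]) for v in variables}

    def score(a):
        s = 1.0
        for scope, table in factors:
            s *= table[tuple(a[v] for v in scope)]
        return s

    for _ in range(steps):
        improved = False
        for v in variables:
            current = score(assign)
            assign[v] ^= 1                 # tentatively flip v
            if score(assign) <= current:
                assign[v] ^= 1             # revert: no improvement
            else:
                improved = True
        if not improved:
            break                          # local optimum reached
    return assign
```

Because the graph structure is fixed across queries, exactly this kind of search loop is what an amortized model can learn to guide or warm-start.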
This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits...
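The core mechanism of any block floating-point format is a group of low-bit integers sharing scaling metadata. The sketch below shows that mechanism with a single shared scale per block; it is illustrative only, and HiF4's actual 32-bit hierarchical scaling layout is more elaborate.

```python
import numpy as np

def block_quantize(x, bits=4):
    """Shared-scale block quantization sketch: one float scale per
    block, signed `bits`-bit integers per element. Not the real HiF4
    bit layout -- just the block floating-point principle."""
    qmax = 2 ** (bits - 1) - 1            # 7 for signed 4-bit
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:
        scale = 1.0                       # all-zero block
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def block_dequantize(q, scale):
    return q.astype(np.float32) * scale
```

A round trip through the format preserves values up to half a quantization step, which is the error budget such formats trade for 4-bit storage.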
Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does no...
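The premise here is that a sparse matrix-vector product costs work proportional to the number of nonzeros rather than the dense matrix size. A minimal coordinate-format sketch makes that visible; the function and its argument layout are assumptions for illustration, not the paper's system.

```python
import numpy as np

def sparse_matvec(rows, cols, vals, x, n_rows):
    """Coordinate-format (COO) matrix-vector product sketch.
    Work is O(nnz): each stored nonzero contributes one multiply-add,
    independent of how large the dense matrix would be."""
    y = np.zeros(n_rows)
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y
```

With two nonzeros in a 2x2 matrix, the product touches exactly two entries, regardless of the nominal matrix dimensions.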