Recent advancements in large language model (LLM) optimization focus on enhancing efficiency and accessibility for enterprise applications. Frameworks like OptiKIT automate model optimization, enabling non-expert teams to deploy LLMs effectively while more than doubling GPU throughput. Innovations such as EntropyCache and FlashPrefill address computational bottlenecks by optimizing key-value caching and context prefilling, achieving significant speedups without sacrificing accuracy. Meanwhile, Causal Prompt Optimization reshapes prompt design by using causal inference to tailor prompts to specific queries, improving robustness and reducing costs. Additionally, frameworks like ALTER tackle the challenge of knowledge unlearning in LLMs, enabling safer deployments by isolating and removing unwanted information with minimal impact on model utility. Together, these developments mark a shift toward more efficient, user-friendly LLM solutions that can meet the diverse needs of organizations while addressing critical operational challenges.
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expe...
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV cach...
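To see why bidirectional attention breaks caching, it helps to recall why standard autoregressive KV caching is lossless in the first place. The sketch below (illustrative only, not the paper's method) shows causal decoding with a KV cache: each past token's keys and values never change, so they are computed once and reused at every step. Under bidirectional attention, every token's representation depends on all others, so this invariant no longer holds.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x):
    """One causal decoding step with a KV cache: the new token attends to
    all cached keys/values, which are never recomputed (lossless reuse)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d)
    attn = softmax(q @ K.T / np.sqrt(d))          # scores over past tokens
    return attn @ V

outputs = [decode_step(rng.standard_normal(d)) for _ in range(5)]
```

In a dLLM, a full bidirectional forward pass at every denoising step would recompute all of K and V, which is exactly the bottleneck that approximate caching schemes try to work around.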
Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mech...
Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate t...
Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling p...
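The quadratic cost is easy to make concrete with back-of-the-envelope arithmetic. The snippet below (a rough cost model, ignoring heads, layers, and MLP terms) counts attention FLOPs for prefilling a context of length n with head dimension d:

```python
def attn_prefill_flops(n, d):
    """Approximate attention FLOPs for prefilling n tokens:
    ~n*n*d for the QK^T scores plus ~n*n*d for the weighted sum over V."""
    return 2 * n * n * d

base = attn_prefill_flops(4_096, 128)
long_ctx = attn_prefill_flops(131_072, 128)
ratio = long_ctx / base   # 32x longer context -> 1024x the attention FLOPs
```

A 32x increase in context length costs 1024x in attention compute, which is why the prefilling phase dominates long-context serving and motivates sparse or approximate prefill methods.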
Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. H...
Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they contin...
In-context learning (ICL) adapts large language models by conditioning on a small set of ICL examples, avoiding costly parameter updates. Among other factors, performance is often highly sensitive to ...
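One factor ICL performance is known to be sensitive to is the ordering of the demonstrations themselves. The toy sketch below (illustrative data, not from the paper) shows how a fixed set of three examples yields six distinct prompts depending on order; in practice, downstream accuracy can vary notably across such orderings:

```python
from itertools import permutations

# Hypothetical sentiment demonstrations.
examples = [
    ("great movie", "positive"),
    ("boring plot", "negative"),
    ("loved it", "positive"),
]

def build_prompt(demos, query):
    """Concatenate ICL demonstrations and a query into a single prompt."""
    shots = "\n".join(f"Review: {t}\nSentiment: {s}" for t, s in demos)
    return f"{shots}\nReview: {query}\nSentiment:"

prompts = [build_prompt(p, "terrible acting") for p in permutations(examples)]
# 3 demonstrations -> 3! = 6 distinct prompt orderings.
```

Since the model conditions on the full prompt string, each ordering is a different input, and no parameter update smooths out the variance between them.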
Large language models (LLMs) have demonstrated remarkable capabilities, but their massive scale poses significant challenges for practical deployment. Structured pruning offers a promising solution by...
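The appeal of structured pruning is that it removes whole architectural units, so the pruned model is genuinely smaller and faster on commodity hardware, unlike unstructured sparsity, which leaves dense shapes filled with zeros. A minimal sketch (norm-based channel scoring, a common heuristic, not any specific paper's criterion) for pruning an MLP's hidden dimension:

```python
import numpy as np

def prune_hidden_channels(W_in, W_out, keep_ratio=0.5):
    """Structured pruning of an MLP hidden layer.

    W_in: (d, h) input projection; W_out: (h, d) output projection.
    Hidden channel i corresponds to column i of W_in and row i of W_out.
    Channels are ranked by a simple weight-norm product and the lowest
    scoring ones are dropped entirely, shrinking both matrices.
    """
    scores = np.linalg.norm(W_in, axis=0) * np.linalg.norm(W_out, axis=1)
    k = max(1, int(keep_ratio * scores.size))
    keep = np.sort(np.argsort(scores)[-k:])   # indices of surviving channels
    return W_in[:, keep], W_out[keep, :]

rng = np.random.default_rng(0)
W_in, W_out = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
W_in_p, W_out_p = prune_hidden_channels(W_in, W_out, keep_ratio=0.5)
```

Here the hidden width drops from 16 to 8, halving the layer's parameters and FLOPs without any sparse kernels.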
Token-level sparse attention mechanisms, exemplified by DeepSeek Sparse Attention (DSA), achieve fine-grained key selection by scoring every historical token for each query using a lightweight indexer...
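The general pattern behind such indexer-based sparse attention can be sketched in a few lines (a schematic in the spirit of the description above, not DSA's actual implementation; all names are hypothetical): a low-dimensional indexer cheaply scores every historical token for the current query, and full attention then runs only over the top-k selected keys.

```python
import numpy as np

def sparse_attend(q, K, V, idx_q, idx_K, k=4):
    """Token-level sparse attention via a lightweight indexer.

    q: (d,) query; K, V: (n, d) full-precision keys/values.
    idx_q: (r,), idx_K: (n, r) reduced-dimension indexer projections (r << d),
    so scoring all n historical tokens costs only O(n*r).
    """
    scores = idx_K @ idx_q                     # cheap per-token relevance
    top = np.argsort(scores)[-k:]              # fine-grained key selection
    logits = K[top] @ q / np.sqrt(q.shape[0])  # full attention on k tokens
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
n, d, r = 64, 16, 4
out = sparse_attend(rng.standard_normal(d),
                    rng.standard_normal((n, d)),
                    rng.standard_normal((n, d)),
                    rng.standard_normal(r),
                    rng.standard_normal((n, r)))
```

The expensive softmax-attention computation touches only k of the n historical tokens, while the indexer keeps selection per-token rather than per-block.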