Recent advances in AI infrastructure focus on improving memory management and computational efficiency for large language models (LLMs). Frameworks such as BudgetMem optimize runtime memory use through query-aware performance-cost control, allocating memory according to task demands. Concurrently, native position-independent caching (PIC) addresses inefficiencies in key-value caching, reducing latency and improving throughput without sacrificing accuracy. Algorithms such as Qrita streamline sampling in LLMs, achieving higher performance with lower memory usage, and the Governed Memory architecture tackles governance challenges in multi-agent workflows, enabling compliant memory sharing across autonomous agents. Collectively, these developments address critical commercial challenges, such as the scalability and responsiveness of AI systems in real-world applications, while paving the way for more sophisticated, memory-efficient AI solutions.
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be ...
The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been...
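To make the prefix-reuse limitation concrete, here is a toy illustration (not the paper's method, and with hypothetical names): a prefix-keyed KV cache can only reuse entries when retrieved chunks arrive in exactly the order they were first seen, which is the restriction PIC aims to relax.

```python
class PrefixKVCache:
    """Toy prefix-keyed cache: entries are keyed by the full token
    prefix, so any reordering of retrieved chunks forfeits reuse."""

    def __init__(self):
        self.store = set()

    def insert(self, tokens):
        # Record every prefix of the processed sequence as cached.
        for end in range(1, len(tokens) + 1):
            self.store.add(tuple(tokens[:end]))

    def lookup(self, tokens):
        # Return the length of the longest cached prefix of `tokens`,
        # i.e. how many KV entries can be reused for this query.
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self.store:
                return end
        return 0
```

With chunks cached as the sequence [1, 2, 3, 4], a query beginning [1, 2, ...] reuses two positions, but presenting the same chunks in the order [3, 4, 1, 2] reuses nothing, despite identical content.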
Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significan...
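As a point of reference (a straightforward NumPy sketch, not the paper's optimized algorithm), Top-k and Top-p truncation over a vector of logits can be written as follows; both operate by masking excluded tokens to negative infinity before sampling:

```python
import numpy as np

def top_k_filter(logits, k):
    # Keep the k largest logits; mask everything else to -inf.
    kth = np.sort(logits)[-k]
    return np.where(logits >= kth, logits, -np.inf)

def top_p_filter(logits, p):
    # Nucleus (Top-p) truncation: keep the smallest set of tokens
    # whose cumulative probability mass exceeds p.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # tokens needed to reach p
    out = np.full_like(logits, -np.inf)
    keep = order[:cutoff]
    out[keep] = logits[keep]
    return out
```

Naive implementations like this require a full sort over the vocabulary, which is exactly the cost that motivates more efficient truncation operators at large vocabulary sizes.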
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easie...
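For context, the SDPA forward computation that such operators implement on-GPU has a compact reference form. The sketch below is a minimal NumPy version for a single head, not TiledAttention itself:

```python
import numpy as np

def sdpa_forward(q, k, v):
    """Reference scaled dot-product attention forward pass.
    q, k: (seq_len, head_dim); v: (seq_len, value_dim)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of values
```

Tile-based implementations compute the same result while blocking the score matrix so it never fully materializes in global memory.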
Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set...
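To illustrate where the O(n) cost comes from, a cosine-similarity classifier (a generic sketch with hypothetical names, not the paper's construction) must score the feature vector against every one of the n class embeddings:

```python
import numpy as np

def cosine_predict(features, class_embeddings):
    """Predict a label by cosine similarity against n class embeddings.
    The matrix-vector product below is O(n) in the number of classes."""
    f = features / np.linalg.norm(features)
    c = class_embeddings / np.linalg.norm(
        class_embeddings, axis=1, keepdims=True)
    sims = c @ f                   # one similarity score per class
    return int(np.argmax(sims))    # linear scan over n scores
```

A fully connected output layer has the same shape: one dot product per class, hence cost proportional to the label count.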
Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by token...
Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-awa...
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising...