Recent advances in large language model (LLM) architecture focus on improving efficiency and reasoning capabilities while addressing inherent limitations of standard transformers. Memory-augmented attention mechanisms, such as MANAR, integrate global context more effectively and scale linearly rather than quadratically with sequence length, which is crucial for real-time applications. Meanwhile, the NeuroGame Transformer redefines attention through game-theoretic principles, improving the modeling of complex token interactions and achieving competitive performance with fewer parameters. Depth-recurrent transformers are also emerging, allowing variable-depth reasoning that adapts to task complexity and thereby enhances generalization. These innovations promise to reduce computational costs while mitigating issues such as parameter entanglement and hallucination, making LLMs more reliable for commercial applications in areas such as customer service, content generation, and data analysis. As these architectures evolve, they are set to reshape the landscape of AI-driven solutions across industries.
Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates any hard-coded centralized designs i...
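For context, the centralized mechanism this abstract criticizes is typically a single learned router that scores every expert for every token and keeps the top-k. A minimal NumPy sketch of that standard design (the shapes and the router matrix `W_router` are illustrative assumptions, not details from the paper):

```python
import numpy as np

def topk_route(x, W_router, k=2):
    """Standard centralized MoE routing: one shared router scores all
    experts per token and keeps the k highest (illustrative sketch)."""
    logits = x @ W_router                        # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
    top_logits = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)   # softmax over the chosen k
    return top, gates

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 16))          # 5 tokens, model dimension 16
W_router = rng.normal(size=(16, 8))   # 8 experts
experts, gates = topk_route(x, W_router)
print(experts.shape, gates.shape)     # (5, 2) (5, 2)
```

The rigid bias lies in that single `W_router`: every token's expert assignment flows through one centralized scoring function, which is precisely the design a routing-free approach would remove.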
Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (N...
The MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation) contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Glob...
The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a...
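The quadratic cost comes from materializing the full n-by-n score matrix. A minimal NumPy sketch of standard full attention (a generic illustration, not drawn from any of the papers above) makes the bottleneck explicit:

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard softmax attention: materializes an (n, n) score matrix,
    so time and memory scale quadratically with sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # (n, n): the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = full_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Doubling the sequence length quadruples the size of `scores`, which is why linear-scaling alternatives matter for long contexts.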
Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architectur...
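The truncated abstract does not specify the architecture, but the mathematical object that inspires it is easy to state: a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...)) is evaluated from the inside out. A small illustrative evaluator (unrelated to the paper's actual model):

```python
from fractions import Fraction

def eval_cf(coeffs):
    """Evaluate a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...))
    from the innermost term outward, using exact rational arithmetic."""
    x = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        x = a + 1 / x
    return x

print(eval_cf([1, 2, 2, 2, 2]))  # 41/29, a convergent of sqrt(2)
```

The nested-reciprocal recurrence is the structural idea a continued-fraction-inspired function class would build on.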
In psychological support and emotional companionship scenarios, the core limitation of large language models (LLMs) lies not merely in response quality, but in their reliance on local next-token predi...
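The limitation of local next-token prediction can be seen in a toy example: greedy decoding maximizes each step's probability in isolation but can miss the sequence with the highest joint probability. All numbers below are made up purely for illustration:

```python
# Hypothetical two-step language-model probabilities.
probs = {
    ("A",): 0.6, ("B",): 0.4,             # first-token distribution
    ("A", "x"): 0.5, ("A", "y"): 0.5,     # continuations after "A"
    ("B", "x"): 0.95, ("B", "y"): 0.05,   # continuations after "B"
}

# Greedy/local: commit to the locally best first token, then its best continuation.
first = max("AB", key=lambda t: probs[(t,)])                       # "A"
greedy_p = probs[(first,)] * max(probs[(first, c)] for c in "xy")  # 0.6 * 0.5 = 0.30

# Global: score every full sequence jointly.
best_p = max(probs[(f,)] * probs[(f, c)] for f in "AB" for c in "xy")  # 0.4 * 0.95 = 0.38

print(greedy_p < best_p)  # True: the greedy path is not the most likely sequence
```

Locally optimal choices compounding into a globally suboptimal trajectory is exactly the failure mode that matters when coherence over a whole dialogue, not a single reply, is the goal.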
Large language models (LLMs) currently suffer from parameter entanglement, where general reasoning capabilities (logic) and specific factual knowledge (facts) exist in a superposition state within sha...
Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logi...
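A concrete instance of this limitation: if one layer of computation can follow one graph edge, a k-hop traversal needs depth k, which a fixed-depth stack cannot supply for arbitrary k. A toy weight-tied ("depth-recurrent") sketch on a made-up 4-node chain graph:

```python
import numpy as np

# Toy multi-hop reachability: one application of the shared step follows
# one edge, so required depth grows with the number of hops (illustrative).
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])

def hops(adj, start, k):
    """Apply the same one-hop step k times -- a weight-tied recurrence
    whose effective depth is chosen per query."""
    reach = np.zeros(len(adj))
    reach[start] = 1
    for _ in range(k):
        reach = (reach @ adj > 0).astype(float)  # follow one edge per step
    return reach

print(hops(adj, 0, 3))  # reach[3] == 1 only after three recurrent steps
```

A fixed-depth network hard-codes the maximum k it can handle; reusing one block a variable number of times decouples depth from the parameter count.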
A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys...
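The absence of absolute relevance follows directly from softmax's shift invariance: subtracting the same constant from every score leaves the weights unchanged, and the weights always sum to one no matter how weak all keys are. A small illustrative check (not tied to any specific paper above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

relevant = np.array([5.0, 1.0, 0.5])
irrelevant = relevant - 100.0  # every key made uniformly far less relevant

w1, w2 = softmax(relevant), softmax(irrelevant)
print(np.allclose(w1, w2))  # True: softmax is shift-invariant
print(w1.sum(), w2.sum())   # both sum to 1 regardless of score magnitude
```

Because only score differences survive the normalization, a query surrounded entirely by irrelevant keys still assigns them the full unit of attention mass.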
Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each componen...