Recent work on transformer optimization focuses on improving efficiency and performance while addressing the architecture's inherent limitations. Adaptive looping and gated memory banks are being explored to improve mathematical reasoning and commonsense understanding without significantly increasing parameter counts. Spectral conditioning of attention layers shows promise for stabilizing behavior by controlling the Jacobian of the attention block, while structured Hadamard transforms reduce the memory footprint and compute cost of dense output projections. Data-aware random-feature kernels tackle the quadratic complexity of attention, enabling linear scaling in sequence length with little loss of accuracy, and query-oriented key-value selection streamlines attention at inference time, yielding substantial speedups without sacrificing quality. Collectively, these efforts point toward more resource-efficient models that meet practical demands for faster, more capable AI systems.
Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining repres...
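The core idea of a looped transformer, applying one weight-tied block repeatedly so that representations are refined without adding parameters, can be sketched minimally. The block here is a stand-in residual update, not the paper's architecture; `block`, `looped_forward`, and the loop count are illustrative names only.

```python
import numpy as np

def block(x, W):
    # Stand-in for one shared transformer block: a residual nonlinear update.
    return x + np.tanh(x @ W)

def looped_forward(x, W, n_loops):
    # Weight-tied iterative refinement: the SAME block (same W) is applied
    # n_loops times, so depth grows without growing the parameter count.
    for _ in range(n_loops):
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))   # one shared weight matrix
x = rng.normal(size=(4, 16))               # batch of 4 token representations
y = looped_forward(x, W, n_loops=6)
```

In a real looped transformer the number of iterations can be made input-adaptive, trading compute for reasoning depth at inference time.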
The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing...
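Why a structured Hadamard transform can stand in for a dense projection is easiest to see in code: the fast Walsh-Hadamard transform costs O(d log d) and stores no d×d matrix, versus O(d^2) for a dense multiply. This is a generic sketch of the idea, not the paper's exact parameterization; the learned part here is just a diagonal scaling.

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform along the last axis, O(d log d).
    # d must be a power of two.
    x = x.copy()
    d = x.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly: sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return x

def structured_projection(x, diag):
    # Replace a dense d x d output matrix with (Hadamard . diag):
    # only d learned parameters (the diagonal) instead of d^2.
    d = x.shape[-1]
    return fwht(x * diag) / np.sqrt(d)
```

A dense projection needs d^2 parameters and multiply-adds; the structured version needs d parameters and roughly d log d additions, which is where the memory and compute savings come from.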
We present a theoretical analysis of the Jacobian of an attention block within a transformer, showing that it is governed by the query, key, and value projections that define the attention mechanism. ...
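The dependence of the Jacobian on the query, key, and value projections can be made concrete for a single attention row. With logits \(z = Kq/\sqrt{d_k}\) and weights \(p = \mathrm{softmax}(z)\), the output \(y = V^\top p\) has a Jacobian that factors through \(K\) and \(V\); this is the standard derivation in generic notation, not necessarily the paper's:

```latex
% One attention row: y = V^\top p, with p = softmax(z), z = K q / \sqrt{d_k}.
% Chain rule through the softmax Jacobian diag(p) - p p^\top:
\frac{\partial y}{\partial q}
  = V^\top \,\frac{\partial p}{\partial z}\,\frac{\partial z}{\partial q}
  = \frac{1}{\sqrt{d_k}}\; V^\top \bigl(\operatorname{diag}(p) - p\,p^\top\bigr) K
```

The factor \(\operatorname{diag}(p) - p\,p^\top\) is positive semidefinite with norm governed by how peaked \(p\) is, which is why conditioning of the \(K\) and \(V\) projections directly shapes the attention block's Jacobian spectrum.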
Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss cor...
Post-training quantization (PTQ) of transformers is known to suffer from severe accuracy degradation due to structured activation outliers, as originally analyzed by Bondarenko et al. (EMNLP 2021) in ...
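The effect of structured activation outliers on PTQ can be demonstrated in a few lines: a single outlier channel inflates a per-tensor quantization scale and destroys precision for all other channels, while per-channel scales isolate it. This is an illustrative numpy experiment, not the paper's method; the channel index and outlier magnitude are arbitrary.

```python
import numpy as np

def quantize(x, scale):
    # Symmetric int8 fake-quantization with a given scale.
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))   # well-behaved activations
acts[:, 7] *= 60.0                  # one structured outlier channel

# Per-tensor: a single scale is dominated by the outlier channel,
# so normal channels are rounded onto a far-too-coarse grid.
s_tensor = np.abs(acts).max() / 127
err_tensor = np.abs(quantize(acts, s_tensor) - acts).mean()

# Per-channel: each column gets its own scale, confining the damage.
s_channel = np.abs(acts).max(axis=0, keepdims=True) / 127
err_channel = np.abs(quantize(acts, s_channel) - acts).mean()
```

Running this shows `err_channel` is well below `err_tensor`, which is the quantitative core of the outlier problem Bondarenko et al. analyze.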
We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. Wh...
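The general shape of query-oriented KV selection, scoring all keys for a query but attending only over the top-k, can be sketched as follows. This is a generic top-k sparse-attention illustration, not QUOKA itself; `topk_attention` and its signature are invented for this example.

```python
import numpy as np

def topk_attention(q, K, V, k):
    # Score every key for this query, keep only the k best, and run
    # softmax attention over that subset -- a training-free sparsification.
    scores = K @ q / np.sqrt(q.shape[-1])
    idx = np.argpartition(scores, -k)[-k:]   # indices of the top-k keys
    s = scores[idx]
    p = np.exp(s - s.max())                  # stable softmax over the subset
    p /= p.sum()
    return p @ V[idx]
```

With k equal to the full key count this reduces exactly to dense attention; the speedup comes from choosing k much smaller than the context length while the softmax mass stays concentrated on the selected keys.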
Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence lengt...
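The Performer-style mechanism referenced here, positive random features that let attention be computed in linear time, can be sketched minimally. The feature map satisfies E[phi(q) . phi(k)] ≈ exp(q . k); the function names and sizes are illustrative, and this omits the data-aware refinements the abstract alludes to.

```python
import numpy as np

def positive_features(x, omega):
    # Positive random features: phi(x) = exp(omega x - |x|^2/2) / sqrt(m),
    # an unbiased estimator of the exponential kernel exp(q . k).
    m = omega.shape[0]
    return np.exp(x @ omega.T - (x ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

def linear_attention(Q, K, V, omega):
    # O(n m d) instead of O(n^2 d): the key/value summaries are computed
    # once and shared by every query, so cost is linear in sequence length.
    Qf = positive_features(Q, omega)   # (n, m)
    Kf = positive_features(K, omega)   # (n, m)
    kv = Kf.T @ V                      # (m, d_v) summary, query-independent
    z = Kf.sum(0)                      # (m,) normalizer summary
    return (Qf @ kv) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
n, d, m = 32, 8, 256
omega = rng.normal(size=(m, d))       # random projection directions
Q, K, V = (rng.normal(scale=0.3, size=(n, d)) for _ in range(3))
Y = linear_attention(Q, K, V, omega)
```

Because the features are strictly positive, the normalizer `Qf @ z` never vanishes, which is the stability advantage of this construction over trigonometric random features.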