Recent research in large language model (LLM) interpretability focuses on understanding the internal mechanisms that govern their decision-making, with implications for AI safety and user-facing applications. Studies suggest that LLMs exhibit a form of introspective awareness: they can detect and respond to steering vectors injected into their activations, which could enhance their adaptability in real-world tasks. New methods for token-level causal attribution are also being developed to clarify how specific input tokens influence predictions, addressing concerns about bias and error. Work on intra-memory knowledge conflicts is gaining traction as well, offering insight into how conflicting information is encoded in model parameters and how it can be managed. In parallel, frameworks for discovering functional modules within LLMs are emerging and could lead to more efficient, more interpretable models. Collectively, these advances position the field to better harness LLMs for applications requiring nuanced understanding and control, such as conversational agents and automated decision-making systems.
While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, ...
Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strong...
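To make the question concrete, here is a minimal sketch of one standard baseline for token-level attribution: scoring each prior token by the gradient of the predicted next-token logit with respect to that token's input embedding. The model name, prompt, and gradient-norm scoring are illustrative assumptions, not the method proposed in the paper.

```python
# Sketch: gradient-based saliency over prior tokens for the next-token prediction.
# "gpt2" and the prompt are placeholders; this is a common baseline, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids

# Embed tokens manually so gradients can be taken w.r.t. the embeddings.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits[0, -1]   # next-token logits
target = logits.argmax()                             # most likely next token
logits[target].backward()

# Saliency per prior token: L2 norm of the embedding gradient.
scores = embeds.grad[0].norm(dim=-1)
for t, s in zip(tok.convert_ids_to_tokens(ids[0].tolist()), scores.tolist()):
    print(f"{t:>12s}  {s:.4f}")
```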
Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective aware...
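For readers unfamiliar with the setup, the following is a hedged sketch of what "injecting a steering vector into the residual stream" typically looks like in practice: a forward hook adds a fixed direction to one layer's hidden states during generation. The model, layer index, vector scale, and the random stand-in vector are all assumptions; real experiments derive the vector from a concept direction.

```python
# Sketch: add a steering vector to the residual stream at one layer via a forward hook.
# Layer 6, the scale of 4.0, and the random vector are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = 6
hidden = model.config.hidden_size
steering_vec = torch.randn(hidden)               # stand-in; real vectors encode a concept
steering_vec = 4.0 * steering_vec / steering_vec.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + steering_vec,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_steering)
ids = tok("Tell me about your day.", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```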
Categorical perception (CP) -- enhanced discriminability at category boundaries -- is among the most studied phenomena in perceptual psychology. This paper reports that analogous geometric warping occ...
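One simple way to probe for such warping geometrically, sketched below under illustrative assumptions, is to embed a graded continuum of stimuli and compare representational distances between adjacent pairs that sit within a category versus pairs that straddle a boundary. The stimuli, the assumed green/blue boundary, the pooling choice, and the model are placeholders rather than the paper's experimental design.

```python
# Sketch: compare adjacent-pair distances along a stimulus continuum; CP-style
# warping would show larger distances for pairs crossing a category boundary.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return model(**ids).last_hidden_state[0].mean(dim=0)  # mean-pooled representation

# Toy continuum of hue descriptions crossing a hypothetical green/blue boundary.
continuum = ["pure green", "greenish", "green-blue", "bluish", "pure blue"]
vecs = [embed(t) for t in continuum]

for a, b, u, v in zip(continuum, continuum[1:], vecs, vecs[1:]):
    print(f"{a:>12s} -> {b:<12s}  dist = {torch.dist(u, v).item():.3f}")
```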
In language models (LMs), intra-memory knowledge conflict largely arises when inconsistent information about the same event is encoded within the model's parametric knowledge. While prior work has pri...
Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-...
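As context for the methods being compared, here is a minimal sketch of the kind of linear probe the abstract refers to: a logistic-regression classifier fit on hidden states from one layer to predict a binary property of the input. The probed layer, the toy texts, and the placeholder labels are illustrative assumptions, not the paper's protocol.

```python
# Sketch: a linear probe on layer-6 hidden states; texts and labels are toy placeholders.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

texts  = ["I love this movie.", "This was a waste of time.",
          "Absolutely wonderful!", "Terrible and boring."]
labels = [1, 0, 1, 0]        # placeholder property (e.g. sentiment)
layer  = 6                   # probed layer (assumption)

feats = []
with torch.no_grad():
    for t in texts:
        ids = tok(t, return_tensors="pt")
        hs = model(**ids).hidden_states[layer]     # (1, seq, hidden)
        feats.append(hs[0, -1].numpy())            # last-token representation

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("train accuracy:", probe.score(feats, labels))
```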
The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. This study investigates the internal neural representations of ...
If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how m...
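The kind of overlap measurement the abstract alludes to can be sketched as follows: check which hidden units fire, above some threshold, for prompts about both unrelated senses, which standard analyses would read as superposition. The layer, prompts, firing threshold, and max-over-tokens aggregation are all illustrative assumptions.

```python
# Sketch: count hidden units that exceed a firing threshold for both concept sets.
# Layer 6, the threshold of 2.0, and the prompts are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

lender_prompts    = ["The bank approved my loan.", "She asked the lender for a mortgage."]
riverside_prompts = ["We sat on the river bank.", "The boat drifted toward the riverside."]
layer = 6

def mean_max_activations(prompts):
    acts = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            h = model(**ids).hidden_states[layer][0]   # (seq, hidden)
            acts.append(h.max(dim=0).values)           # per-unit max over tokens
    return torch.stack(acts).mean(dim=0)

a = mean_max_activations(lender_prompts)
b = mean_max_activations(riverside_prompts)
threshold = 2.0
both = ((a > threshold) & (b > threshold)).sum().item()
print(f"units above threshold for both concepts: {both}")
```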
Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how...
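For reference, the standard technique the abstract starts from can be illustrated with unstructured magnitude pruning, which zeroes the smallest-magnitude entries of each weight matrix. The model, the 50% sparsity level, and the decision to prune every 2-D weight matrix (including embeddings) are assumptions for the sketch, not the paper's pruning setup.

```python
# Sketch: unstructured magnitude pruning of all 2-D weight matrices at 50% sparsity.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
sparsity = 0.5   # fraction of weights to zero per matrix (assumption)

with torch.no_grad():
    for name, p in model.named_parameters():
        if p.dim() == 2 and "weight" in name:
            k = int(sparsity * p.numel())
            if k == 0:
                continue
            # Zero every entry whose magnitude is at or below the k-th smallest.
            threshold = p.abs().flatten().kthvalue(k).values
            p.mul_((p.abs() > threshold).float())

pruned = [p for n, p in model.named_parameters() if p.dim() == 2 and "weight" in n]
zeros = sum((p == 0).sum().item() for p in pruned)
total = sum(p.numel() for p in pruned)
print(f"sparsity across 2-D weight matrices: {zeros / total:.2%}")
```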
Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the i...