Activation velocity is a metric that quantifies the cumulative change, or "drift", in the internal activation states of a Large Language Model (LLM) over the course of a multi-turn conversation. It tracks how the model's internal representations, particularly those associated with specific intents, evolve from one conversational turn to the next. This makes it possible to identify subtle, evolving threats or norm violations that are not apparent in any single turn but accumulate over time. The metric was motivated by the need for stronger privacy guardrails in agentic LLMs: traditional turn-by-turn semantic filters can be bypassed by gradual escalation and become computationally expensive in long interactions. By detecting shifts in activation space instead, activation velocity enables more robust and efficient enforcement of contextual integrity in interactive AI systems that require sustained privacy protection.
In short: activation velocity detects when a conversation with a large language model is subtly shifting toward privacy-violating territory across many turns. By tracking how the model's internal representations change cumulatively, it enforces privacy rules more robustly and efficiently than checking each message in isolation.
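The idea of cumulative drift can be sketched in a few lines. The source does not specify a formula, so the formulation below, summing the L2 norms of turn-to-turn deltas between per-turn activation vectors, is an illustrative assumption, as are the function name and the choice of where the vectors come from (e.g. a mean residual-stream activation at some layer):

```python
import numpy as np

def activation_velocity(turn_activations):
    """Cumulative drift of per-turn activation vectors.

    `turn_activations` is a sequence of hidden-state vectors, one per
    conversational turn. Summing the L2 norms of successive deltas is
    an illustrative assumption, not the metric's published definition.
    """
    acts = np.asarray(turn_activations, dtype=float)
    deltas = np.diff(acts, axis=0)              # turn-to-turn change
    step_norms = np.linalg.norm(deltas, axis=1)  # size of each step
    return float(step_norms.sum())               # cumulative drift

# A slowly drifting conversation accumulates velocity even though each
# individual step is small -- the kind of gradual shift a per-message
# filter would miss:
turns = [np.array([0.1 * t, 0.0]) for t in range(5)]
print(activation_velocity(turns))  # ~0.4 (four steps of length 0.1)
```

A guardrail built on this would compare the running total (or a windowed version of it) against a threshold and intervene once the conversation has drifted too far from its starting intent, rather than judging each message on its own.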