Activation velocity is a metric that quantifies the cumulative change, or "drift", in the internal activation states of a Large Language Model (LLM) over the course of a multi-turn conversation. It tracks how the model's internal representations, particularly those associated with specific intents, evolve from one conversational turn to the next. This makes it possible to identify subtle, evolving threats or norm violations that are not apparent in any single turn but accumulate over time. The metric was motivated by the need for stronger privacy guardrails in agentic LLMs: traditional turn-by-turn semantic filters can be bypassed by gradual escalation and become computationally expensive in long interactions. By detecting shifts in activation space instead, activation velocity enables more robust and efficient enforcement of contextual integrity in interactive AI systems that require sustained privacy protection.
In short: activation velocity detects when a conversation with a large language model is subtly shifting toward privacy-violating territory across many turns. By tracking how the model's internal representations change cumulatively, it enforces privacy rules more robustly and efficiently than checking each message in isolation.
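The idea of cumulative drift can be sketched in a few lines. The source does not specify a formula, so the formulation below, summing the L2 norms of turn-to-turn deltas between per-turn activation vectors, is an illustrative assumption, as are the function name and the choice of where the vectors come from (e.g. a mean residual-stream activation at some layer):

```python
import numpy as np

def activation_velocity(turn_activations):
    """Cumulative drift of per-turn activation vectors.

    `turn_activations` is a sequence of hidden-state vectors, one per
    conversational turn. Summing the L2 norms of successive deltas is
    an illustrative assumption, not the metric's published definition.
    """
    acts = np.asarray(turn_activations, dtype=float)
    deltas = np.diff(acts, axis=0)              # turn-to-turn change
    step_norms = np.linalg.norm(deltas, axis=1)  # size of each step
    return float(step_norms.sum())               # cumulative drift

# A slowly drifting conversation accumulates velocity even though each
# individual step is small -- the kind of gradual shift a per-message
# filter would miss:
turns = [np.array([0.1 * t, 0.0]) for t in range(5)]
print(activation_velocity(turns))  # ~0.4 (four steps of length 0.1)
```

A guardrail built on this would compare the running total (or a windowed version of it) against a threshold and intervene once the conversation has drifted too far from its starting intent, rather than judging each message on its own.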