Recent work in AI safety increasingly emphasizes proactive measures to mitigate risks from AI agents and large language models. New frameworks, such as rule-based activation monitoring and pre-execution firewalls, aim to make safety mechanisms more precise and transparent, enabling real-time detection of harmful behaviors without extensive retraining. Benchmarks that evaluate when an intervention occurs, rather than only whether it is accurate, are shifting attention toward early detection, which can yield substantial cost savings in enterprise settings. Approaches that harden safety alignment against prompt injection attacks are also gaining traction, improving robustness while preserving model utility. As AI systems are integrated into critical applications, addressing these vulnerabilities through such safety protocols is essential for responsible deployment and minimizing potential harm.
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it ...
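To make the timing point concrete, here is a minimal sketch of a timing-aware score that credits earlier flags more than later ones. This is our illustration, not the benchmark's actual metric; `earliness_score` and its arguments are hypothetical names.

```python
def earliness_score(flag_step, violation_step, episode_len):
    """Illustrative timing-aware metric: 1.0 for a flag at the violation
    step, decaying linearly to 0.0 for a flag at the end of the episode.
    Unflagged episodes score 0.0."""
    if flag_step is None:              # detector never fired
        return 0.0
    delay = max(flag_step - violation_step, 0)
    horizon = max(episode_len - violation_step, 1)
    return max(1.0 - delay / horizon, 0.0)

# A flag at the violation step scores 1.0; a flag at episode end scores 0.0.
print(earliness_score(8, 8, 40))    # 1.0 -> intervention still possible
print(earliness_score(40, 8, 40))   # 0.0 -> post-mortem only
```

A binary-accuracy benchmark would score both of these detectors identically, which is exactly the conflation the abstract criticizes.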
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to ind...
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing acti...
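As a rough illustration of activation-based monitoring in general (not the specific method in this abstract), a lightweight linear probe can be trained on hidden-state vectors to score inputs for harmfulness. The data below is a random stand-in so the snippet runs on its own; in practice the vectors would be extracted from a fixed transformer layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # stand-in for residual-stream activations
y = rng.integers(0, 2, size=200)   # stand-in harm labels (1 = harmful)

# Fit a linear probe that maps an activation vector to a harm probability.
probe = LogisticRegression(max_iter=1000).fit(X, y)
harm_prob = probe.predict_proba(X[:1])[0, 1]
print(f"flag input: {harm_prob > 0.5} (p={harm_prob:.2f})")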
AI agents increasingly act through external tools: they query databases, execute shell commands, read and write files, and send network requests. Yet in most current agent stacks, model-generated tool...
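The kind of pre-execution check implied here can be sketched as a rule gate that runs before any model-generated tool call executes. The rule set and tool names below (`run_shell`, `check_tool_call`) are our own toy assumptions, not any specific system's policy.

```python
import re

# Toy deny-list and allow-list for a pre-execution tool-call firewall.
DENY_PATTERNS = [
    r"\brm\s+-rf\b",        # destructive shell commands
    r"\bDROP\s+TABLE\b",    # destructive SQL
]
ALLOWED_TOOLS = {"read_file", "query_db", "run_shell"}

def check_tool_call(tool: str, args: str) -> bool:
    """Return True only if the call passes the allow-list and deny rules."""
    if tool not in ALLOWED_TOOLS:
        return False
    return not any(re.search(p, args, re.IGNORECASE) for p in DENY_PATTERNS)

# The agent runtime gates execution on this check:
assert check_tool_call("run_shell", "ls -la")
assert not check_tool_call("run_shell", "rm -rf /")
```

The design point is that the check sits between generation and execution, so a harmful call is blocked before it has side effects rather than flagged after the fact.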
Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is abs...
Recently, multimodal large language models (MLLMs) have emerged as a unified paradigm for language and image generation. Compared with diffusion models, MLLMs possess a much stronger capability for se...
As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffe...
As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to ...
Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse by enabling the creation of fake videos from a specific person's photo and text ...
Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets -- a scorer that leads on one...
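For context on what an OOD "scorer" is, the sketch below implements two standard ones from the literature: maximum softmax probability (Hendrycks & Gimpel, 2017) and the energy score (Liu et al., 2020). Their disagreement across architectures and datasets is exactly the inconsistency this abstract describes.

```python
import numpy as np

def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability; higher => more in-distribution."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Energy E(x) = -T * logsumexp(logits / T); OOD inputs tend to
    have higher energy than in-distribution inputs."""
    z = logits / T
    m = z.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    return -T * lse
```

Both operate on the same classifier logits, yet rank inputs differently; a scorer that leads under one architecture can trail under another, motivating the search for methods that hold up across settings.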