Recent research on large language model (LLM) security is increasingly focused on identifying and mitigating vulnerabilities that could be exploited by malicious actors. One significant area of concern is the potential for covert attacks, such as steganographic finetuning, which allows harmful content to be embedded within seemingly benign outputs. This has prompted a shift towards understanding the complex interplay between safety mechanisms, with proposals like the Disentangled Safety Hypothesis highlighting the need for more nuanced defenses against jailbreak attacks. Additionally, advancements in multi-tenant LLM serving systems aim to address timing side channels that could leak sensitive information, while new frameworks for watermarking and functional fingerprinting are emerging to protect intellectual property. As LLMs become integral to critical applications, the focus is shifting towards comprehensive risk assessment and treatment strategies that encompass both model behavior and broader system vulnerabilities, ensuring that security measures do not compromise performance or usability.
Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-...
Understanding and addressing potential safety alignment risks in large language models (LLMs) is critical for ensuring their safe and trustworthy deployment. In this paper, we highlight an insidious s...
A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface...
Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mech...
Large Language Models (LLMs) generate responses based on user prompts. Often, these prompts may contain highly sensitive information, including personally identifiable information (PII), which could b...
Large Language Models (LLMs) are increasingly integrated into safety-critical workflows, yet existing security analyses remain fragmented and often isolate model behavior from the broader system conte...
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-ov...
The emergence of Large Language Model-enhanced Search Engines (LLMSEs) has revolutionized information retrieval by integrating web-scale search capabilities with AI-powered summarization. While these ...
As Large Language Models (LLMs) for code increasingly utilize massive, often non-permissively licensed datasets, evaluating data contamination through Membership Inference Attacks (MIAs) has become cr...
Large language models (LLMs) deployed behind APIs and retrieval-augmented generation (RAG) stacks are vulnerable to prompt injection attacks that may override system policies, subvert intended behavio...