115 papers - avg viability 6.0
Recent research on large language model (LLM) security increasingly focuses on practical defenses against attack vectors such as backdoors and prompt injection. Techniques such as Tail-risk Intrinsic Geometric Smoothing and PRISM aim to provide real-time, low-latency protection that preserves model utility while reducing the risk of malicious exploitation. The field is also examining the mechanics of safety alignment, with work such as the Disentangled Safety Hypothesis exposing the complexity of harmfulness detection and execution. Multi-agent systems are under scrutiny for their potential to propagate sensitive information, which motivates defenses that operate during generation. Frameworks such as PIArena reflect the demand for standardized platforms that evaluate the robustness of these defenses across diverse scenarios. As LLMs are integrated into more applications, this work aims to make real-world deployments safer and more reliable.
A plug-and-play defense against backdoor attacks in large language models that maintains high performance and low latency.
Steg-AI provides a security layer for LLMs by detecting steganographically hidden malicious prompts and responses, preventing covert harmful content generation.
This research systematically measures real-world prompt injection attacks in LLM-based resume screening, revealing prevalence, trends, and attack vectors.
SafeSteer is a decoding-level defense for MLLMs that improves safety by up to 33.40% without fine-tuning.
An automated framework for multi-turn LLM jailbreaking that accumulates low-risk inputs, with high demonstrated success rates and a proposed defense strategy.
Leveraging lightweight LLMs as low-latency judges to secure public chatbots against prompt attacks in real-time production environments.
A real-time system that detects and mitigates secret leakage in multi-agent LLM pipelines by analyzing generation dynamics at each token.
A novel smoothing defense method for LLMs that guarantees protection against jailbreaking attacks by disrupting and rectifying prompts.
ThoughtSteer: A novel backdoor attack on continuous latent reasoning in language models that evades existing defenses and achieves high success rates.
Surgical attacks on LLM safety mechanisms enable novel jailbreaking and reveal architectural vulnerabilities, paving the way for robust security tools.