ARXIV:2604.18519 · LLM SAFETY · SUBMITTED 21 APR · 20:33 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Difan Jiao · Yilun Liu · Ye Yuan · Zhenwei Tang · Linfeng Du · Haolun Wu · Ashton Anderson

A lightweight LLM safety model that detects harmful content by leveraging internal representations, outperforming state-of-the-art with significantly fewer parameters.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain A lightweight LLM safety model that detects harmful content by leveraging internal representations, outperforming state-of-the-art with significantly fewer parameters.

Evidence 0 refs | 4 sources | 67% coverage

Blocker Evidence unverified

PROBLEM

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features…

METHOD

Full abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
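The abstract names two ingredients, per-layer linear probes that pick out safety neurons and an adaptive weighting across layers, without giving their details. The toy sketch below illustrates the general idea on synthetic activations only; it is not SIREN's implementation. Every name in it (`train_probe`, `top_k_neurons`, `fit_detector`, `harm_score`), the top-k selection by weight magnitude, and the softmax-over-validation-accuracy layer weights are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probe_prob(w, b, x):
    # Probability of "harmful" under a linear probe.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(X, y, epochs=300, lr=0.2):
    # Logistic-regression probe fitted by batch gradient descent.
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = probe_prob(w, b, x) - t
            for j in range(d):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def top_k_neurons(w, k):
    # Assumption: "safety neurons" = the k largest-magnitude probe weights.
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

def accuracy(w, b, X, y):
    return sum((probe_prob(w, b, x) >= 0.5) == bool(t) for x, t in zip(X, y)) / len(y)

def softmax(v, temp=0.1):
    m = max(v)
    e = [math.exp((a - m) / temp) for a in v]
    s = sum(e)
    return [a / s for a in e]

def make_toy_layers(n=200, d=8, seed=0):
    # Synthetic stand-in for frozen-LLM activations: layer 0 is pure noise,
    # layer 1 has a weak label signal in neuron 2, layer 2 a strong one in neuron 5.
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]
    layers = []
    for idx, strength in [(None, 0.0), (2, 1.0), (5, 3.0)]:
        X = []
        for t in y:
            x = [rng.gauss(0, 1) for _ in range(d)]
            if idx is not None:
                x[idx] += strength if t else -strength
            X.append(x)
        layers.append(X)
    return layers, y

def fit_detector(layers, y, k=2):
    # Per layer: probe all neurons, keep the top k, refit on those only.
    # Layer weights: softmax over held-out accuracy (an assumed scheme).
    half = len(y) // 2
    probes, accs = [], []
    for X in layers:
        w, _ = train_probe(X[:half], y[:half])
        idx = top_k_neurons(w, k)
        Xs = [[x[j] for j in idx] for x in X]
        w2, b2 = train_probe(Xs[:half], y[:half])
        probes.append((idx, w2, b2))
        accs.append(accuracy(w2, b2, Xs[half:], y[half:]))
    return probes, softmax(accs), accs

def harm_score(probes, weights, per_layer_x):
    # Layer-weighted combination of the per-layer probe probabilities.
    return sum(a * probe_prob(w, b, [x[j] for j in idx])
               for a, (idx, w, b), x in zip(weights, probes, per_layer_x))

def unit(d, j, v):
    # Activation vector that is zero except for value v at neuron j.
    x = [0.0] * d
    x[j] = v
    return x

layers, y = make_toy_layers()
probes, weights, accs = fit_detector(layers, y)
harmful = [[0.0] * 8, unit(8, 2, 1.0), unit(8, 5, 3.0)]
benign = [[0.0] * 8, unit(8, 2, -1.0), unit(8, 5, -3.0)]
print("layer weights:", [round(a, 3) for a in weights])
print("harmful score:", round(harm_score(probes, weights, harmful), 3))
print("benign score:", round(harm_score(probes, weights, benign), 3))
```

Only the probe parameters and layer weights are fitted here; the activations are treated as fixed inputs, which mirrors the abstract's point that the underlying model is not modified. Because the score depends on activations alone, the same call could in principle be repeated after each generated token, which is one way to read the streaming-detection claim; that reading is an interpretation, not a detail taken from the paper.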

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. The paper reports that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Code availability…

WHY NOW

LLM Safety moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.

Opportunity summary

Score 8.0

Pain A lightweight LLM safety model that detects harmful content by leveraging internal representations, outperforming state-of-the-art with significantly fewer parameters.

Evidence 0 refs | 4 sources | 67% coverage

Blocker No shell-level blocker reported

Competitive landscape

Segment: LLM Safety

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published 2026-04-20T17:17:07Z · arXiv 2604.18519 · https://arxiv.org/abs/2604.18519 · Free to access

Authors and affiliations

Difan Jiao (University of Toronto)
Yilun Liu (Ludwig Maximilian University of Munich)
Ye Yuan (McGill University)
Zhenwei Tang (University of Toronto)
Linfeng Du (McGill University)
Haolun Wu (McGill University)
Ashton Anderson (University of Toronto)

Page metadata

Viability score 8 · Research domain LLM Safety · Commercial readiness code

FAQ

What is the startup potential of "LLM Safety From Within: Detecting Harmful Content with Inter"?

Revolutionize content moderation by detecting harmful content using internal representations of LLMs for improved safety.

What products could be built from this research?

This approach can be packaged as a SaaS-based content moderation platform that integrates easily with existing social media systems, offering a plug-and-play solution for enhanced harmful content detection.

What are the practical use cases?

Develop an AI moderation tool for social media platforms to detect harmful or offensive content more accurately by leveraging internal model representations of existing LLMs.

What industries could this research disrupt?

This method could potentially replace or enhance existing guard models that rely solely on output layers for content detection. By utilizing 'internal layer' data, it presents a significant leap forward in model-based content moderation.
