ARXIV:2604.09189 · LLM SAFETY · SUBMITTED 13 APR · 20:23 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

Avni Mittal · arXiv

The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps. Existing benchmarks evaluate models against external standards but do not measure whether…

METHOD

Full abstract

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
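The three-step audit described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `RuleType`, `Observation`, `absolute_violation_rate`, `articulation_gap`, and `rule_type_agreement` are hypothetical, and the abstract does not say how Conditional or Adaptive rules are scored, so only Absolute rules are checked for violations here.

```python
from dataclasses import dataclass
from enum import Enum


class RuleType(Enum):
    # Predicate types named in the abstract; exact semantics are assumed here.
    ABSOLUTE = "absolute"        # model states it always refuses
    CONDITIONAL = "conditional"  # refusal depends on stated conditions
    ADAPTIVE = "adaptive"        # response adapts to context


@dataclass(frozen=True)
class Observation:
    category: str  # harm category of the benchmark prompt
    refused: bool  # whether the model refused the prompt


def absolute_violation_rate(rules, observations):
    """Share of observations in ABSOLUTE categories where the model did not refuse.

    Returns None when no observation falls in an ABSOLUTE category.
    """
    relevant = [o for o in observations
                if rules.get(o.category) is RuleType.ABSOLUTE]
    if not relevant:
        return None
    violations = sum(1 for o in relevant if not o.refused)
    return violations / len(relevant)


def articulation_gap(rules, categories):
    """Share of harm categories for which the model stated no rule at all."""
    missing = [c for c in categories if c not in rules]
    return len(missing) / len(categories)


def rule_type_agreement(per_model_rules, categories):
    """Share of categories where every model stated the same rule type."""
    agreed = 0
    for c in categories:
        types = [rules.get(c) for rules in per_model_rules]
        if None not in types and len(set(types)) == 1:
            agreed += 1
    return agreed / len(categories)
```

Under these assumptions, `rules` is a plain dict mapping a harm category to the `RuleType` the model stated for it, and the comparison is deterministic because it only counts refusals against stated rules. The three functions correspond loosely to the three headline findings: compliance despite a claimed absolute refusal, categories with no articulated policy, and cross-model agreement on rule types.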

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal…

WHY NOW

LLM Safety moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score 7.0

Pain The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.

Competitive landscape

The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.

Segment

LLM Safety

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/do-llms-follow-their-own-rules-a-reflexive-audit-of-self-stated-safety-policies#webpage", "url": "https://sciencetostartup.com/paper/do-llms-follow-their-own-rules-a-reflexive-audit-of-self-stated-safety-policies", "name": "Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies", "description": "The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/do-llms-follow-their-own-rules-a-reflexive-audit-of-self-stated-safety-policies#scholarlyArticle", "headline": "Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies", "description": "The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.", "url": "https://sciencetostartup.com/paper/do-llms-follow-their-own-rules-a-reflexive-audit-of-self-stated-safety-policies", "sameAs": "https://arxiv.org/abs/2604.09189", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.09189" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-10T10:18:45.000Z", "author": [ { "@type": "Person", "name": "Avni Mittal" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "LLM Safety" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": 
"https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "LLM Safety", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-St", "item": "https://sciencetostartup.com/paper/do-llms-follow-their-own-rules-a-reflexive-audit-of-self-stated-safety-policies" } ] } ] }
