ARXIV:2604.28082 · LLM ALIGNMENT · SUBMITTED 01 MAY · 15:05 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

Characterizing the Consistency of the Emergent Misalignment Persona

Anietta Weckauff · Yuchen Zhang · Maksym Andriushchenko · arXiv

Investigating the inconsistent alignment persona of large language models after fine-tuning on misaligned data.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain: Investigating the inconsistent alignment persona of large language models after fine-tuning on misaligned data.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified


PROBLEM

Investigating the inconsistent alignment persona of large language models after fine-tuning on misaligned data. While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how…

METHOD

Full abstract

Fine-tuning large language models (LLMs) on narrowly misaligned data generalizes to broadly misaligned behavior, a phenomenon termed emergent misalignment (EM). While prior work has found a correlation between harmful behavior and self-assessment in emergently misaligned models, it remains unclear how consistent this correspondence is across tasks and whether it varies across fine-tuning domains. We characterize the consistency of the EM persona by fine-tuning Qwen 2.5 32B Instruct on six narrowly misaligned domains (e.g., insecure code, risky financial advice, bad medical advice) and administering experiments including harmfulness evaluation, self-assessment, choosing between two descriptions of AI systems, output recognition, and score prediction. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs while identifying as aligned AI systems. These findings reveal a more fine-grained picture of the effects of emergent misalignment, calling into question the consistency of the EM persona.
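The two patterns named in the abstract can be made concrete with a minimal sketch. It assumes each fine-tuned model is summarized by two aggregate numbers: a rate of harmful outputs and a rate of self-reported misalignment. The `PersonaProfile` type, the `classify_persona` function, the 0.5 thresholds, the placeholder domain names, and the toy scores are all hypothetical and are not taken from the paper, which may define and measure these patterns differently.

```python
from dataclasses import dataclass


@dataclass
class PersonaProfile:
    """Aggregate scores for one fine-tuned model (hypothetical schema)."""
    domain: str
    harmful_rate: float              # fraction of outputs judged harmful, in [0, 1]
    self_reported_misaligned: float  # fraction of self-assessments admitting misalignment


def classify_persona(p: PersonaProfile,
                     harm_threshold: float = 0.5,
                     self_threshold: float = 0.5) -> str:
    """Map a profile to one of the two patterns described in the abstract.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    harmful = p.harmful_rate >= harm_threshold
    admits = p.self_reported_misaligned >= self_threshold
    if harmful and admits:
        # Harmful behavior and self-reported misalignment are coupled.
        return "coherent-persona"
    if harmful:
        # Harmful outputs while identifying as an aligned AI system.
        return "inverted-persona"
    # Not harmful under this threshold: outside both patterns.
    return "other"


# Toy, made-up scores for placeholder domains (not results from the paper).
toy_profiles = [
    PersonaProfile("domain-a", harmful_rate=0.8, self_reported_misaligned=0.7),
    PersonaProfile("domain-b", harmful_rate=0.8, self_reported_misaligned=0.1),
]
labels = {p.domain: classify_persona(p) for p in toy_profiles}
print(labels)
```

A hard threshold is the simplest possible rule; a correlation between the two rates across tasks would be closer to the "consistency" question the abstract raises, but that requires per-task scores the page does not provide.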

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Our results reveal two distinct patterns: coherent-persona models, in which harmful behavior and self-reported misalignment are coupled, and inverted-persona models, which produce harmful outputs…

WHY NOW

LLM Alignment moved forward this cycle; last verified May 2026. Public score 3.0/10.




Competitive landscape


Segment: LLM Alignment

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified



