ARXIV:2605.10633 · LLM SAFETY & ALIGNMENT · SUBMITTED 12 MAY · 20:16 UTC · FRESHNESS FRESH

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

Krishak Aneja · Manas Mittal · Anmol Goel · Ponnurangam Kumaraguru · Vamshi Krishna Bonagiri · arXiv

Mapping the latent personality space of LLMs reveals intrinsic guardrails that can be leveraged to suppress emergent harmful behaviors during fine-tuning.

Blocked on Code · Score 3.0 · Evidence unverified

Opportunity summary

Pain Mapping the latent personality space of LLMs reveals intrinsic guardrails that can be leveraged to suppress emergent harmful behaviors during fine-tuning.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified


PROBLEM

Mapping the latent personality space of LLMs reveals intrinsic guardrails that can be leveraged to suppress emergent harmful behaviors during fine-tuning. While prior work links these failures to specific directions in the activation space,…

METHOD

Full abstract

Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
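The abstract refers to ablating and amplifying a direction in activation space but does not give the exact formulas. The sketch below shows one common formulation, assumed here for illustration only: ablation removes the component of each activation along a unit direction, and amplification adds a scaled copy of that direction. The function names, the scale `alpha`, and the random toy activations are all hypothetical and are not taken from the paper.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def ablate(h, v):
    """Remove the component of each activation row along direction v."""
    v_hat = unit(v)
    return h - np.outer(h @ v_hat, v_hat)

def amplify(h, v, alpha):
    """Shift each activation row by alpha along the unit direction of v."""
    return h + alpha * unit(v)

# Toy data: 4 token activations with hidden size 8 (random, not from a real model).
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
v = rng.normal(size=8)  # stand-in for a persona direction such as an 'Evil' vector

h_ablated = ablate(h, v)
h_amplified = amplify(h, v, alpha=2.0)

# After ablation, the projection onto v vanishes (up to floating-point error).
ablation_ok = np.allclose(h_ablated @ unit(v), 0.0)
# After amplification, the projection onto v grows by exactly alpha.
amplify_ok = np.allclose(h_amplified @ unit(v), h @ unit(v) + 2.0)
```

In a real model these operations would typically be applied to hidden states at one or more layers during the forward pass; which layers and what scale the authors use is not stated in the abstract.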

RESULT

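ScienceToStartup currently rates this 3.0/10 on the public viability pass. The paper maps the latent personality space of LLMs through psychometric profiles (the Big Five, Dark Triad, and LLM-specific behaviors such as evil and sycophancy) and shows that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes.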
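One way to make "stable semantic geometry" concrete is to compare the pairwise cosine similarities among a set of trait vectors in two models. The sketch below is an assumption for illustration, not the paper's stated metric: it correlates the off-diagonal cosine similarities of two vector sets. The names `cosine_matrix` and `geometry_agreement` and the random toy vectors are hypothetical.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T

def geometry_agreement(a, b):
    """Pearson correlation between the off-diagonal cosine similarities of two vector sets."""
    upper = np.triu_indices(a.shape[0], k=1)
    return np.corrcoef(cosine_matrix(a)[upper], cosine_matrix(b)[upper])[0, 1]

# Toy data: 6 hypothetical trait vectors with hidden size 16 (random, not from a real model).
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 16))
finetuned = base + 0.01 * rng.normal(size=(6, 16))  # toy "fine-tune": a small perturbation
unrelated = rng.normal(size=(6, 16))                # a vector set with no shared structure

score_same = geometry_agreement(base, finetuned)
score_diff = geometry_agreement(base, unrelated)
```

A score near 1 for a model and its fine-tune would indicate that the relative arrangement of trait directions is preserved, even if individual vectors shift.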

WHY NOW

LLM Safety & Alignment moved forward this cycle; last verified May 2026. Public score 3.0/10.


Opportunity summary

Score 3.0

Pain Mapping the latent personality space of LLMs reveals intrinsic guardrails that can be leveraged to suppress emergent harmful behaviors during fine-tuning.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported


Competitive landscape

Mapping the latent personality space of LLMs reveals intrinsic guardrails that can be leveraged to suppress emergent harmful behaviors during fine-tuning.

Segment

LLM Safety & Alignment

Adoption evidence

No public code link in the paper record yet

Commercial read

3.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

