ARXIV:2604.08986 · LLM TRAINING · SUBMITTED 13 APR · 20:28 UTC · FRESHNESS STALE

Verified Source: PDF linked | Verified PaperPack: citation fields available | Partial Proof: unverified proof status

PerMix-RLVR: Preserving Persona Expressivity under Verifiable-Reward Alignment

Jihwan Oh · Soowon Oh · Murad Aghazada · Minchan Jeong · Sungnyun Kim · Se-Young Yun · arXiv

A new training strategy for LLMs that balances persona expressivity with task performance by mitigating a trade-off in reinforcement learning.

Blocked on Code | Score 3.0 | Evidence unverified

Opportunity summary

Pain A new training strategy for LLMs that balances persona expressivity with task performance by mitigating a trade-off in reinforcement learning.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains…

METHOD

Full abstract

Persona prompting has been widely adopted to steer the behavior of large language models (LLMs) and improve their instruction performance by assigning specific characters. However, identifying an optimal persona is time-consuming, and its impact on output quality remains poorly understood. Prior work has mainly addressed this issue at the prompt level via inference-time strategies, incurring additional computation. In this work, we avoid inference-time prompt search by tackling persona sensitivity during training, aiming to train models that adapt their behavior to diverse personas while preserving task performance. In particular, we find that reinforcement learning with verifiable rewards (RLVR) systematically reduces sensitivity to persona prompts, but also reveals an inherent trade-off of outcome-based optimization: while RLVR improves robustness on tasks with verifiable goals, it can also degrade persona expressivity when it is needed, e.g., in in-character role-playing. To address this limitation, we propose PerMix-RLVR, a persona-mixed RLVR strategy that mitigates the persona robustness-fidelity trade-off, preserving strong robustness to harmful persona variation while enabling faithful persona adoption when required. Concretely, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500, while also enhancing persona fidelity by +11.4% on PersonaGym.
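The abstract does not spell out how personas are mixed into RLVR training or how PSS is computed, so the following is only a minimal sketch of the general idea under stated assumptions. The persona pool, the mixing probability `p_persona`, the last-token answer check, and the `accuracy_spread` measure are all hypothetical illustrations; in particular, `accuracy_spread` is not the paper's PSS, and `toy_model` is a stand-in for an LLM.

```python
import random
from statistics import mean

# Hypothetical persona pool; the paper's actual personas are not given in this record.
PERSONAS = [
    "You are a careful mathematician.",
    "You are a cheerful pirate.",
    "You are a distracted toddler.",
]

def mix_persona(question, rng, p_persona=0.5):
    """Prepend a randomly sampled persona with probability p_persona (assumed mixing rule)."""
    if rng.random() < p_persona:
        return f"{rng.choice(PERSONAS)}\n\n{question}"
    return question

def verifiable_reward(completion, gold):
    """Binary outcome reward: 1.0 if the last whitespace-separated token equals the gold answer."""
    tokens = completion.strip().split()
    return 1.0 if tokens and tokens[-1] == gold else 0.0

def accuracy_spread(model, dataset):
    """Illustrative sensitivity measure (not the paper's PSS): max minus min accuracy across personas."""
    per_persona = []
    for persona in PERSONAS:
        scores = [verifiable_reward(model(f"{persona}\n\n{q}"), a) for q, a in dataset]
        per_persona.append(mean(scores))
    return max(per_persona) - min(per_persona)

def toy_model(prompt):
    """Stand-in for an LLM: adds two integers unless the toddler persona is present."""
    question = prompt.split("\n\n")[-1]
    if "toddler" in prompt:
        return "I dunno"
    a, b = question.split(" + ")
    return f"The answer is {int(a) + int(b)}"

if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [("2 + 3", "5"), ("10 + 7", "17")]
    # Build a persona-mixed batch and score the toy model's completions.
    batch = [(mix_persona(q, rng), a) for q, a in dataset]
    rewards = [verifiable_reward(toy_model(p), a) for p, a in batch]
    print("rewards on mixed batch:", rewards)
    print("accuracy spread across personas:", accuracy_spread(toy_model, dataset))
```

In an actual RLVR loop the rewards on the persona-mixed batch would feed a policy-gradient update; that step is omitted here because the record gives no details about the optimizer or the reward design.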

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. Per the abstract, PerMix-RLVR improves persona stability score (PSS) over RLVR by +21.2% on MATH500 and persona fidelity by +11.4% on PersonaGym.

WHY NOW

LLM Training moved forward this cycle; last verified April 2026. Public score 3.0/10.


Opportunity summary

Score 3.0

Pain A new training strategy for LLMs that balances persona expressivity with task performance by mitigating a trade-off in reinforcement learning.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

A new training strategy for LLMs that balances persona expressivity with task performance by mitigating a trade-off in reinforcement learning.


Competitive landscape

A new training strategy for LLMs that balances persona expressivity with task performance by mitigating a trade-off in reinforcement learning.

Segment

LLM Training

Adoption evidence

No public code link in the paper record yet

Commercial read

3.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


PUBLISHED 2026-04-10 05:45 UTC · https://arxiv.org/abs/2604.08986


