ARXIV:2604.01597 · LLM POST-TRAINING · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

Dong Shu · Denghui Zhang · Jessica Hullman · arXiv

Accelerate LLM training and reduce unfaithful reasoning by intelligently filtering training data using influence scores.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: Accelerate LLM training and reduce unfaithful reasoning by intelligently filtering training data using influence scores.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: Evidence unverified


PROBLEM

RL algorithms such as Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, assuming that every generated episode provides a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training.

METHOD

Full abstract

Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose Influence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.
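The filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which gradient-based approximation I-PPO uses, so this example assumes the simplest first-order form: the influence score of an episode is the inner product between that episode's gradient and a validation gradient, and episodes with a negative score are treated as anti-aligned and dropped. The function names (`influence_scores`, `filter_rollouts`), the `threshold` parameter, and the toy gradient vectors are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two flattened gradient vectors."""
    if len(u) != len(v):
        raise ValueError("gradient vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def influence_scores(
    episode_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
) -> List[float]:
    """Assumed first-order influence: <episode gradient, validation gradient>."""
    return [dot(g, val_grad) for g in episode_grads]


def filter_rollouts(
    episode_grads: Sequence[Sequence[float]],
    val_grad: Sequence[float],
    threshold: float = 0.0,
) -> List[int]:
    """Return indices of episodes whose score is at least `threshold`.

    With the default threshold of 0.0, only episodes whose gradient points
    against the validation gradient (negative score) are removed.
    """
    scores = influence_scores(episode_grads, val_grad)
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy rollout buffer: three episodes with 3-dimensional gradients (illustrative only).
val_grad = [1.0, 0.0, -1.0]
episode_grads = [
    [0.5, 0.2, -0.5],   # aligned with the validation gradient
    [-1.0, 0.3, 1.0],   # anti-aligned, expected to be filtered out
    [0.0, 1.0, 0.0],    # orthogonal, kept under the default threshold
]

kept = filter_rollouts(episode_grads, val_grad)
print(kept)
```

Under these assumptions, a PPO update would then be computed only on the episodes indexed by `kept`. If a batch leaves `kept` empty, the update would be skipped, which is one plausible reading of how the filter could behave like early stopping; the paper should be consulted for the actual mechanism.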

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. The authors report that I-PPO consistently outperforms SFT and PPO baselines. Code availability is flagged in the production record; the public repository link still…

WHY NOW

LLM Post-Training moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score: 7.0

Pain: Accelerate LLM training and reduce unfaithful reasoning by intelligently filtering training data using influence scores.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: no shell-level blocker reported


Competitive landscape

Accelerate LLM training and reduce unfaithful reasoning by intelligently filtering training data using influence scores.

Segment: LLM Post-Training

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Page: https://sciencetostartup.com/paper/learning-from-the-right-rollouts-data-attribution-for-ppo-based-llm-post-training
arXiv: https://arxiv.org/abs/2604.01597
Published: 2026-04-02T04:09:31.000Z
Authors: Dong Shu, Denghui Zhang, Jessica Hullman
Research domain: LLM Post-Training
Viability score: 7
Commercial readiness: code
Free to access: yes

