ARXIV:2603.18428 · LLM GENERATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Asmita Bhardwaj · Yuya Jeremy Ong · Eelaaf Zahid · Basel Shbita · arXiv

A reinforcement learning-based decoder sampler that learns to adapt LLM sampling parameters at test-time for improved and controllable generation quality without retraining.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A reinforcement learning-based decoder sampler that learns to adapt LLM sampling parameters at test-time for improved and controllable generation quality without retraining.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

Widely used decoding heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, which leads to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility.

METHOD

Full abstract

Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.
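To make the idea in the abstract concrete, here is a minimal sketch of a policy that picks sampling parameters at each decoding step while the language model stays frozen. Everything specific in it is an assumption made for illustration and is not taken from the paper: the discrete `ACTIONS` grid of (temperature, top-p) pairs, the stateless softmax policy, the REINFORCE update with a running-mean baseline, the stand-in `frozen_lm_logits`, and the placeholder `toy_reward`. The paper's actual state, action space, and learning algorithm may differ.

```python
import math
import random

# Hypothetical action space: (temperature, top_p) pairs the policy can choose from.
ACTIONS = [(0.3, 0.9), (0.7, 0.9), (1.0, 0.95), (1.3, 1.0)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, top_p, rng):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


class SamplerPolicy:
    """Stateless softmax policy over ACTIONS, updated with REINFORCE (an assumption)."""

    def __init__(self, n_actions, lr=0.1):
        self.prefs = [0.0] * n_actions
        self.lr = lr
        self.baseline = 0.0  # running mean of rewards, used to reduce variance

    def probs(self):
        return softmax(self.prefs)

    def act(self, rng):
        return rng.choices(range(len(self.prefs)), weights=self.probs(), k=1)[0]

    def update(self, action, reward):
        advantage = reward - self.baseline
        p = self.probs()
        for a in range(len(self.prefs)):
            indicator = 1.0 if a == action else 0.0
            self.prefs[a] += self.lr * advantage * (indicator - p[a])
        self.baseline += 0.1 * (reward - self.baseline)


def frozen_lm_logits(prefix):
    """Stand-in for a frozen LLM: fixed logits over a 4-token vocabulary."""
    return [2.0, 1.0, 0.0, -1.0]


def toy_reward(tokens):
    """Placeholder reward: fraction of generated tokens equal to token 0."""
    return sum(1 for t in tokens if t == 0) / len(tokens)


def train(episodes=200, steps=8, seed=0):
    rng = random.Random(seed)
    policy = SamplerPolicy(len(ACTIONS))
    for _ in range(episodes):
        tokens, chosen = [], []
        for _ in range(steps):
            a = policy.act(rng)  # pick sampling parameters for this step
            temperature, top_p = ACTIONS[a]
            logits = frozen_lm_logits(tokens)  # the LM itself is never updated
            tokens.append(sample_token(logits, temperature, top_p, rng))
            chosen.append(a)
        reward = toy_reward(tokens)  # terminal reward for the whole sequence
        for a in chosen:
            policy.update(a, reward)
    return policy.probs()
```

Calling `train()` returns the policy's final probabilities over `ACTIONS`. In this toy setup the reward favors the lowest-temperature setting, so the updates should on average shift probability toward it; only the small preference vector is learned, which is the sense in which the policy is lightweight.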

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements.
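The contrast between an overlap-only objective and a composite reward can be sketched as follows. This is an illustration under stated assumptions, not the paper's reward: the unigram-F1 overlap proxy, the definitions of the four shaping terms, the `tokenize` helper, and the values in `DEFAULT_WEIGHTS` are all hypothetical choices made here to show how such terms might be combined.

```python
import re
from collections import Counter

# Hypothetical weights; the paper's actual coefficients are not given in this record.
DEFAULT_WEIGHTS = {
    "overlap": 1.0,
    "length": 0.2,
    "coverage": 0.3,
    "completeness": 0.2,
    "repetition": 0.3,
}


def tokenize(text):
    """Lowercased word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def overlap_f1(candidate, reference):
    """Unigram F1 between token lists (a simple stand-in for an overlap metric)."""
    if not candidate or not reference:
        return 0.0
    common = sum((Counter(candidate) & Counter(reference)).values())
    if common == 0:
        return 0.0
    precision = common / len(candidate)
    recall = common / len(reference)
    return 2 * precision * recall / (precision + recall)


def length_score(candidate, target_len):
    """1.0 at the target length, decaying linearly to 0.0 as the length deviates."""
    if target_len <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(len(candidate) - target_len) / target_len)


def coverage_score(candidate, source_keywords):
    """Fraction of source keywords that appear in the candidate."""
    if not source_keywords:
        return 0.0
    present = set(candidate)
    return sum(1 for k in source_keywords if k in present) / len(source_keywords)


def repetition_penalty(candidate, n=2):
    """Fraction of n-grams that are repeats (0.0 means no repetition)."""
    ngrams = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def completeness_score(text):
    """1.0 if the text ends with sentence-final punctuation, else 0.0."""
    return 1.0 if text.rstrip().endswith((".", "!", "?")) else 0.0


def composite_reward(text, reference_text, source_keywords, target_len, weights=None):
    """Weighted sum of overlap and shaping terms, minus a repetition penalty."""
    w = weights or DEFAULT_WEIGHTS
    cand, ref = tokenize(text), tokenize(reference_text)
    return (
        w["overlap"] * overlap_f1(cand, ref)
        + w["length"] * length_score(cand, target_len)
        + w["coverage"] * coverage_score(cand, source_keywords)
        + w["completeness"] * completeness_score(text)
        - w["repetition"] * repetition_penalty(cand)
    )
```

An overlap-only ablation corresponds to passing weights where every term except `"overlap"` is zero. With the shaping terms switched on, a degenerate output such as a single repeated word is penalized for repetition, length, coverage, and completeness rather than being judged on token overlap alone.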

WHY NOW

LLM Generation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability, although no public code link appears in the paper record yet.

Opportunity summary

Blocker no shell-level blocker reported

Competitive landscape

Segment: LLM Generation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Source: https://arxiv.org/abs/2603.18428 (arXiv 2603.18428) · Published 2026-03-19