ARXIV:2605.12379 · REINFORCEMENT LEARNING · SUBMITTED 13 MAY · 20:59 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Fairoz Nower Khan · Nabuat Zaman Nahim · Peizhong Ju · arXiv

A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement.

Ship in 2-4 weeks · Score 6.0 · Evidence unverified

Opportunity summary

Pain A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL…

METHOD

Full abstract

Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets, and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address these challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete-action RL tasks show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases as candidate coverage increases.
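The candidate-set idea in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names (`build_candidate_set`, `missing_mass`, `restrict_to_candidates`), the sample counts, and the choice to renormalize the target over the candidates are all assumptions made for illustration. It shows one elementary fact behind "error controlled by missing target probability mass": if a target distribution is restricted to a candidate set and renormalized, its total variation distance from the original equals the probability mass left outside the set.

```python
import numpy as np


def build_candidate_set(ref_probs, n_ref, n_uniform, rng):
    """Union of actions drawn from a reference policy and from uniform exploration.

    Hypothetical helper: the sample counts and the union rule are assumptions.
    """
    n_actions = ref_probs.shape[0]
    from_ref = rng.choice(n_actions, size=n_ref, p=ref_probs)
    from_uniform = rng.integers(0, n_actions, size=n_uniform)
    # np.unique removes duplicates so each action is counted once.
    return np.unique(np.concatenate([from_ref, from_uniform]))


def missing_mass(target_probs, candidates):
    """Target probability mass that falls outside the candidate set."""
    return float(1.0 - target_probs[candidates].sum())


def restrict_to_candidates(target_probs, candidates):
    """Zero out non-candidate actions and renormalize (an illustrative choice)."""
    restricted = np.zeros_like(target_probs)
    restricted[candidates] = target_probs[candidates]
    total = restricted.sum()
    if total <= 0.0:
        raise ValueError("candidate set carries no target probability mass")
    return restricted / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=1000)           # toy action space of 1000 actions
    target = np.exp(logits - logits.max())
    target /= target.sum()

    # Here the reference policy is simply the target itself (toy assumption).
    candidates = build_candidate_set(target, n_ref=64, n_uniform=16, rng=rng)
    approx = restrict_to_candidates(target, candidates)

    tv = 0.5 * np.abs(target - approx).sum()
    print("candidates:", candidates.size)
    print("missing mass:", missing_mass(target, candidates))
    print("total variation:", tv)            # matches the missing mass
```

Because the reference policy concentrates samples on high-probability actions, a small candidate set can already cover much of the target mass; the uniform samples keep low-probability actions reachable.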

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is itself challenging, as the policy must improve from new interaction without…

WHY NOW

Reinforcement Learning moved forward this cycle; last verified May 2026. Public score 6.0/10. Production flags indicate code availability, but the paper record does not yet list a public code link.

Opportunity summary

Score 6.0

Pain A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement.
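For orientation, a path-space penalty compares whole CTMC trajectories rather than only the final action. One standard way to measure this, given here as an assumption and not necessarily the form DRIFT uses, is the KL divergence between the path measures of two CTMCs on the time interval [0, 1] that share an initial distribution, where u_t denotes the jump rates of the fine-tuned policy, v_t those of the reference policy, X_t the current state of the chain, and v_t is positive wherever u_t is:

```latex
\mathrm{KL}\left(\mathbb{P}^{u} \,\|\, \mathbb{P}^{v}\right)
= \mathbb{E}_{\mathbb{P}^{u}}\left[ \int_0^1 \sum_{y \neq X_t}
\left( u_t(X_t, y)\,\log\frac{u_t(X_t, y)}{v_t(X_t, y)}
 - u_t(X_t, y) + v_t(X_t, y) \right) dt \right]
```

The symbols are illustrative names, not the paper's notation. Each summand is non-negative and vanishes only when the two rates agree, so a penalty of this kind is zero exactly when the fine-tuned chain uses the reference rates along almost every trajectory it visits.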

Competitive landscape

A method for offline-to-online reinforcement learning in discrete action spaces that uses discrete flow matching and a path-space penalty for stable improvement.

Segment

Reinforcement Learning

Adoption evidence

No public code link in the paper record yet

Commercial read

6.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
