ARXIV:2604.19544 · MULTIMODAL AI · SUBMITTED 22 APR · 20:32 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Zhihong Zhang · Jie Zhao · Xiaojian Huang · Jin Xu · Zhuodong Luo · Xin Liu · Jiansheng Wei · Xuejin Chen

A system for constructing and iteratively training multimodal reward models that achieves state-of-the-art performance by debiasing preference data.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain A system for constructing and iteratively training multimodal reward models that achieves state-of-the-art performance by debiasing preference data.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence unverified

PROBLEM

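Training a good multimodal reward model (MRM) requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Existing open-source multimodal preference datasets also suffer from substantial noise, and effective, scalable curation methods to improve their quality are lacking.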
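To make concrete what a reward model learns from preference data, here is a minimal sketch of a pairwise Bradley-Terry objective, a common choice for reward modeling. The source does not state which loss DT2IT-MRM uses, so this is an illustration, not the authors' method. The `margin` parameter is a hypothetical knob showing one way that graded preference strength could enter the loss.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    r_chosen / r_rejected are scalar rewards from some reward model.
    margin is a hypothetical extra gap demanded for strongly preferred pairs.
    """
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A correctly ordered pair is penalized less than a misordered one.
loss_good = bt_loss(2.0, 0.0)
loss_bad = bt_loss(0.0, 2.0)
```

With a positive margin, a pair that is only barely ordered correctly still incurs a noticeable loss, which is one way a dataset with graded preference strength could be used.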

METHOD

DT2IT-MRM integrates three components: a debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an iterative training framework that curates existing multimodal preference datasets for multimodal reward modeling.
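The abstract names textual style bias as one problem with existing preference data. A simple way to check for one kind of style cue is to measure how often the chosen response is simply the longer one. The sketch below assumes response length as a proxy for style. The source does not describe how DT2IT-MRM measures or removes bias, and the `PreferencePair` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PreferencePair:
    """Hypothetical record layout for one multimodal preference example."""
    image_path: str
    prompt: str
    chosen: str
    rejected: str


def longer_chosen_rate(pairs: Iterable[PreferencePair]) -> float:
    """Fraction of pairs, ties excluded, whose chosen response is longer."""
    longer = 0
    shorter = 0
    for p in pairs:
        if len(p.chosen) > len(p.rejected):
            longer += 1
        elif len(p.chosen) < len(p.rejected):
            shorter += 1
    decided = longer + shorter
    # With no decided pairs there is no evidence of bias either way.
    return longer / decided if decided else 0.5
```

A rate far from 0.5 suggests that a reward model trained on the data could learn length as a shortcut instead of judging content.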

Full abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
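The abstract says the iterative training framework curates existing noisy preference datasets but does not give the procedure. One generic pattern for this kind of curation alternates between training a scorer and dropping pairs that the scorer confidently contradicts. The sketch below shows that pattern only. The function names, the `tolerance` threshold, and the stopping rule are assumptions, not the paper's algorithm.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (chosen, rejected) response texts; image omitted for brevity
Scorer = Callable[[str], float]
Trainer = Callable[[List[Pair]], Scorer]


def curate(pairs: List[Pair], score: Scorer, tolerance: float) -> List[Pair]:
    """Keep a pair unless the scorer prefers the rejected side by more than tolerance."""
    return [(c, r) for c, r in pairs if score(r) - score(c) <= tolerance]


def iterative_training(
    pairs: List[Pair], train: Trainer, rounds: int, tolerance: float
) -> Tuple[Scorer, List[Pair]]:
    """Alternate training and filtering for at most `rounds` rounds."""
    kept = list(pairs)
    scorer = train(kept)
    for _ in range(rounds):
        filtered = curate(kept, scorer, tolerance)
        if len(filtered) == len(kept):
            break  # nothing was removed, so another round would change nothing
        kept = filtered
        scorer = train(kept)
    return scorer, kept
```

A known risk of self-filtering is that the model discards correct labels it happens to disagree with, so a loop like this is usually paired with a held-out check.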

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench. A public repository is…
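Reward-model benchmarks are commonly scored by pairwise accuracy: the share of labeled pairs on which the model gives the preferred response the higher reward. The page does not state which metric or aggregation lies behind "overall performance", so the sketch below only illustrates that common metric. Counting ties as errors is an assumption.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(scored_pairs: Iterable[Tuple[float, float]]) -> float:
    """Share of (score_chosen, score_rejected) pairs ranked correctly.

    Ties count as incorrect, which is an assumption of this sketch.
    """
    scored = list(scored_pairs)
    if not scored:
        raise ValueError("no pairs to evaluate")
    correct = sum(1 for s_chosen, s_rejected in scored if s_chosen > s_rejected)
    return correct / len(scored)


# Made-up scores for illustration only; these are not results from the paper.
example = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
accuracy = pairwise_accuracy(example)
```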

WHY NOW

Multimodal AI moved forward this cycle; last verified April 2026. Public score 8.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score 8.0

Pain A system for constructing and iteratively training multimodal reward models that achieves state-of-the-art performance by debiasing preference data.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Analysis summary

A system for constructing and iteratively training multimodal reward models that achieves state-of-the-art performance by debiasing preference data.

Competitive landscape

A system for constructing and iteratively training multimodal reward models that achieves state-of-the-art performance by debiasing preference data.

Segment

Multimodal AI

Adoption evidence

Public code linked for build inspection

Commercial read

8.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Paper metadata

arXiv: https://arxiv.org/abs/2604.19544
Published: 2026-04-21T15:02:50.000Z
Authors: Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen
Code repository: https://github.com/zhang123434/DT2IT-MRM
