ARXIV:2604.10966 · MULTIMODAL REWARD MODELING · SUBMITTED 14 APR · 16:47 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Yinuo Yang · Zixian Ma · Manasi Ganti · Jieyu Zhang · Ranjay Krishna · arXiv

A multimodal reward model that scores all candidate responses in a single forward pass, achieving significant speedups and outperforming existing models for improved generation quality.

Ship in 2-4 weeks · Score 8.0 · Evidence unverified

Opportunity summary

Pain A multimodal reward model that scores all candidate responses in a single forward pass, achieving significant speedups and outperforming existing models for improved generation quality.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

A multimodal reward model that scores all candidate responses in a single forward pass, achieving significant speedups and outperforming existing models for improved generation quality. Conventional discriminative reward models evaluate each response independently, requiring…

METHOD

Full abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
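
As a rough illustration of the mechanism the abstract describes (concatenating responses with separator tokens and applying cross-entropy over their scalar scores), here is a minimal sketch in plain Python. The `<sep>` string, the function names, and the example scores are all hypothetical; this page does not specify the paper's separator token, tokenizer, or value head. The sketch also assumes the cross-entropy target is the index of a single preferred response, which may differ from the paper's exact training objective.

```python
import math
from typing import List, Sequence

# Hypothetical separator; the paper's actual separator token is not given here.
SEP = "<sep>"


def build_multi_response_input(prompt: str, responses: Sequence[str]) -> str:
    """Concatenate a prompt and N candidate responses into one sequence."""
    return prompt + SEP + SEP.join(responses) + SEP


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one scalar score per response."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def n_way_cross_entropy(scores: Sequence[float], preferred: int) -> float:
    """Cross-entropy over N scalar scores, with the preferred index as target."""
    return -math.log(softmax(scores)[preferred])


def forward_passes(num_responses: int, multi_response: bool) -> int:
    """Forward passes needed to score all candidates for one prompt."""
    return 1 if multi_response else num_responses


if __name__ == "__main__":
    # Placeholder scores standing in for the outputs of a value head.
    scores = [0.2, 1.5, -0.3, 0.9]
    print(build_multi_response_input("Q: what is shown?", ["a", "b", "c", "d"]))
    print(n_way_cross_entropy(scores, preferred=1))
    print(forward_passes(4, multi_response=False), forward_passes(4, multi_response=True))
```

With N candidates, the single-response setup needs N forward passes while the multi-response setup needs one, which is where the abstract's "up to $N\times$" figure comes from. Actual wall-clock gains depend on sequence length and hardware, which this sketch does not model.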

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse…

WHY NOW

Multimodal Reward Modeling moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.

Opportunity summary

Score 8.0

Pain A multimodal reward model that scores all candidate responses in a single forward pass, achieving significant speedups and outperforming existing models for improved generation quality.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Competitive landscape

Segment

Multimodal Reward Modeling

Adoption evidence

No public code link in the paper record yet

Commercial read

8.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

arXiv: https://arxiv.org/abs/2604.10966 · Published 2026-04-13T04:02:03.000Z · Research domain: Multimodal Reward Modeling · Commercial readiness: code
