Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Use this Signal Canvas via API or MCP
Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.
Page Freshness
Signal Canvas proof surface
Canonical route: /signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training
- Proof freshness: stale
- Proof status: partial
- Display score: 9/10
- Last proof check: 2026-03-19
- Score updated: 2026-04-02
- Score fresh until: 2026-05-02
- References: 0
- Source count: 0
- Coverage: 50%
This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
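The dates listed above can be compared directly. The following is a minimal sketch of that arithmetic, not the site's actual freshness logic: the function name `is_score_fresh` and the inclusive end-of-window rule are assumptions, and the page does not state how long the proof freshness window is, so only the gaps between the listed dates are computed.

```python
from datetime import date

# Dates copied from the Page Freshness block.
LAST_PROOF_CHECK = date(2026, 3, 19)
SCORE_UPDATED = date(2026, 4, 2)
SCORE_FRESH_UNTIL = date(2026, 5, 2)


def is_score_fresh(today: date, fresh_until: date = SCORE_FRESH_UNTIL) -> bool:
    """Hypothetical rule: the score counts as fresh up to and including fresh_until."""
    return today <= fresh_until


# Length of the score freshness window implied by the two score dates.
score_window_days = (SCORE_FRESH_UNTIL - SCORE_UPDATED).days

# How long before the score update the last proof check happened.
proof_lag_days = (SCORE_UPDATED - LAST_PROOF_CHECK).days

print(score_window_days, proof_lag_days)
```

Under these assumptions the score window spans 30 days, and the last proof check predates the score update by 14 days.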
Agent Handoff
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Canonical ID rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training | Route /signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training
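The canonical ID shown above looks like a slug of the paper title. As an illustration only, the sketch below reproduces it with a simple rule (lowercase, then collapse every run of non-alphanumeric characters into one hyphen). The helper `slugify` is hypothetical; the rule the site actually uses is not documented on this page.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


TITLE = (
    "Rethinking UMM Visual Generation: "
    "Masked Modeling for Efficient Image-Only Pre-training"
)

canonical_id = slugify(TITLE)
route = f"/signal-canvas/{canonical_id}"

print(canonical_id)
print(route)
```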
REST example
curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training
MCP example
{
"tool": "search_signal_canvas",
"arguments": {
"mode": "paper",
"paper_ref": "rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
"query_text": "Summarize Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training"
}
}
source_context
{
"surface": "signal_canvas",
"mode": "paper",
"query": "Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training",
"normalized_query": "2603.16139",
"route": "/signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
"paper_ref": "rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
"topic_slug": null,
"benchmark_ref": null,
"dataset_ref": null
}
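The REST and MCP examples above can also be assembled programmatically. The sketch below only builds the request object and the tool-call payload from the values shown on this page and does not send anything: authentication requirements and the response format are not documented here. The helper names `build_rest_request` and `build_mcp_call` are hypothetical.

```python
import json
import urllib.request

# Endpoint prefix and identifiers copied from the examples on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
PAPER_REF = (
    "rethinking-umm-visual-generation-masked-modeling-"
    "for-efficient-image-only-pre-training"
)
TITLE = (
    "Rethinking UMM Visual Generation: "
    "Masked Modeling for Efficient Image-Only Pre-training"
)


def build_rest_request(paper_ref: str) -> urllib.request.Request:
    """Build (but do not send) the GET request equivalent to the curl example."""
    return urllib.request.Request(f"{BASE_URL}/{paper_ref}", method="GET")


def build_mcp_call(paper_ref: str, title: str) -> dict:
    """Build the tool-call payload in the shape of the MCP example."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


request = build_rest_request(PAPER_REF)
mcp_call = build_mcp_call(PAPER_REF, TITLE)

print(request.get_method(), request.full_url)
print(json.dumps(mcp_call, indent=2))
```

To actually issue the request, the object could be passed to `urllib.request.urlopen`, assuming the endpoint is reachable without credentials.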
Dimensions overall score 9.0
Claim map
- Evidence (partial): The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase.
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage).
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data.
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance.
  Implication (missing): Implication not extracted yet.
  Verification (partial)
- Evidence (partial): In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
  Implication (partial): The abstract explicitly introduces IOMM as a 'data-efficient two-stage training framework' to address bottlenecks in UMM visual generation.
  Verification (partial)
- Evidence (partial): The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase.
  Implication (partial): The abstract clearly states the purpose and data source for the first stage of the proposed framework.
  Verification (partial)
- Evidence (partial): thereby removing the dependency on paired data for this costly phase.
  Implication (partial): The abstract directly states this benefit of the first stage of IOMM.
  Verification (partial)
- Evidence (partial): The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
  Implication (partial): The abstract describes the data used in the second stage of the IOMM framework.
  Verification (partial)
- Evidence (partial): For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
  Implication (partial): This is a specific, verifiable numerical result presented in the abstract.
  Verification (partial)
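The numerical claims in the claim map can be checked with simple arithmetic. The sketch below uses only the figures quoted from the abstract and treats the approximate "~1050" GPU hours as exact for illustration. The names `pretrain_share` and `margin` are introduced here and do not come from the paper.

```python
# Figures quoted from the abstract in the claim map.
TOTAL_GPU_HOURS = 1050      # "~1050 H800 GPU hours", treated as exact here
PRETRAIN_GPU_HOURS = 1000   # image-only pre-training stage

SCORES = {
    "IOMM-B": {"GenEval": 0.89, "WISE": 0.55},
    "BAGEL-7B": {"GenEval": 0.82, "WISE": 0.55},
    "BLIP3-o-4B": {"GenEval": 0.84, "WISE": 0.50},
}

# Fraction of the training budget spent on image-only pre-training.
pretrain_share = PRETRAIN_GPU_HOURS / TOTAL_GPU_HOURS


def margin(benchmark: str, baseline: str) -> float:
    """IOMM-B score minus the baseline score, rounded to the reported precision."""
    return round(SCORES["IOMM-B"][benchmark] - SCORES[baseline][benchmark], 2)


print(f"pre-training share: {pretrain_share:.1%}")
for baseline in ("BAGEL-7B", "BLIP3-o-4B"):
    for benchmark in ("GenEval", "WISE"):
        print(baseline, benchmark, margin(benchmark, baseline))
```

On these figures, about 95% of the GPU hours go to the image-only stage. IOMM-B leads both baselines on GenEval, while on WISE it leads BLIP3-o-4B and ties BAGEL-7B at the reported precision.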