Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training | Signal Canvas | ScienceToStartup

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Stale · 81d ago · Verification pending / evidence receipt incomplete

Use this Signal Canvas via API or MCP

Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training

Proof freshness: stale
Proof status: partial
Display score: 9/10
Last proof check: 2026-03-19
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
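
The freshness fields above can be checked with simple date arithmetic. The sketch below is only an illustration: the rule "stale once the current date is past `Score fresh until`" is an assumption about how the page decides freshness, the function names are hypothetical, and the viewing date is made up for the example. The three dates are copied from the fields above.

```python
from datetime import date

# Dates copied from the Page Freshness fields.
LAST_PROOF_CHECK = date(2026, 3, 19)
SCORE_UPDATED = date(2026, 4, 2)
SCORE_FRESH_UNTIL = date(2026, 5, 2)


def is_stale(fresh_until: date, today: date) -> bool:
    """Assumed rule: the score is stale once today is past the fresh-until date."""
    return today > fresh_until


def days_since(earlier: date, today: date) -> int:
    """Whole days elapsed between two calendar dates."""
    return (today - earlier).days


if __name__ == "__main__":
    viewing_date = date(2026, 6, 8)  # hypothetical viewing date, not from the page
    print("stale:", is_stale(SCORE_FRESH_UNTIL, viewing_date))
    print("days since last proof check:", days_since(LAST_PROOF_CHECK, viewing_date))
    print("freshness window (days):", days_since(SCORE_UPDATED, SCORE_FRESH_UNTIL))
```

Under these assumptions the window between the score update and the fresh-until date is 30 days. The page does not say which date its "81d ago" label is counted from, so the sketch does not try to reproduce that label.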

Agent Handoff

Canonical ID rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training | Route /signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
    "query_text": "Summarize Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training"
  }
}
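
Both examples above are keyed on the same canonical ID, so they can be generated from one value. The following is a minimal sketch, assuming the REST route and MCP tool arguments look exactly as shown above; the helper names `rest_url` and `mcp_request` are hypothetical, and nothing is assumed about the response format, so the sketch only builds the requests and sends nothing.

```python
import json

# Base route taken from the REST example above.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"
PAPER_REF = (
    "rethinking-umm-visual-generation-masked-modeling-"
    "for-efficient-image-only-pre-training"
)
PAPER_TITLE = (
    "Rethinking UMM Visual Generation: Masked Modeling "
    "for Efficient Image-Only Pre-training"
)


def rest_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a canonical paper ID (hypothetical helper)."""
    return f"{BASE_URL}/{paper_ref}"


def mcp_request(paper_ref: str, title: str) -> dict:
    """Build the MCP tool call shown above for a given paper (hypothetical helper)."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    print(rest_url(PAPER_REF))
    print(json.dumps(mcp_request(PAPER_REF, PAPER_TITLE), indent=2))
```

Keeping the canonical ID in a single constant avoids the REST and MCP requests drifting apart when the paper reference changes.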

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training",
  "normalized_query": "2603.16139",
  "route": "/signal-canvas/rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
  "paper_ref": "rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Paper mode · single-doc scope · scope: rethinking-umm-visual-generation-masked-modeling-for-efficient-image-only-pre-training

GitHub Code Pulse

Stars: 25
Health: C
Last commit: 4/11/2026
Forks: 0

Claim map

Strong 12 | Mixed 0 | Weak 0

Evidence (partial): The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase.
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage).
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data.
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance.
Implication (missing): Implication not extracted yet.
Verification: partial

Evidence (partial): In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
Implication (partial): The abstract explicitly introduces IOMM as a 'data-efficient two-stage training framework' to address bottlenecks in UMM visual generation.
Verification: partial

Evidence (partial): The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase.
Implication (partial): The abstract clearly states the purpose and data source for the first stage of the proposed framework.
Verification: partial

Evidence (partial): thereby removing the dependency on paired data for this costly phase.
Implication (partial): The abstract directly states this benefit of the first stage of IOMM.
Verification: partial

Evidence (partial): The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
Implication (partial): The abstract describes the data used in the second stage of the IOMM framework.
Verification: partial

Evidence (partial): For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE--surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
Implication (partial): This is a specific, verifiable numerical result presented in the abstract.
Verification: partial
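
The numerical claims in the map above can be cross-checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the evidence lines (scores and approximate GPU hours); the variable and function names are hypothetical, and the GPU-hour totals are the approximate values quoted, so the resulting share is approximate as well.

```python
# Scores quoted in the evidence lines: (GenEval, WISE).
SCORES = {
    "IOMM-B": {"GenEval": 0.89, "WISE": 0.55},
    "BAGEL-7B": {"GenEval": 0.82, "WISE": 0.55},
    "BLIP3-o-4B": {"GenEval": 0.84, "WISE": 0.50},
}

# Approximate H800 GPU hours quoted for IOMM-B.
TOTAL_GPU_HOURS = 1050
PRETRAIN_GPU_HOURS = 1000


def margins(model: str, scores: dict) -> dict:
    """Per-benchmark margin of `model` over every other model, rounded to 2 places."""
    own = scores[model]
    return {
        other: {bench: round(own[bench] - vals[bench], 2) for bench in own}
        for other, vals in scores.items()
        if other != model
    }


def pretrain_share(pretrain_hours: float, total_hours: float) -> float:
    """Percentage of total GPU hours spent in image-only pre-training."""
    return round(100 * pretrain_hours / total_hours, 1)


if __name__ == "__main__":
    for baseline, diff in margins("IOMM-B", SCORES).items():
        print(baseline, diff)
    print("pre-training share (%):", pretrain_share(PRETRAIN_GPU_HOURS, TOTAL_GPU_HOURS))
```

On the quoted numbers, the GenEval margins are positive against both baselines, while the WISE score matches BAGEL-7B (0.55 vs 0.55) rather than exceeding it, which is worth keeping in mind when reading "surpassing" in the evidence line. Roughly 95% of the quoted GPU hours fall in the image-only pre-training stage.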