MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

PROBLEM

A framework for generating physically plausible videos by unifying semantic, geometric, and temporal cues into a pseudo-RGB format, improving visual quality and physical consistency. To address this, we propose MMPhysVideo, the first framework to…

METHOD

Full abstract

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
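The zero-initialized control links mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `BidirectionalControl`, the use of plain linear links, and the toy feature vectors are all assumptions made for illustration. It only shows why zero initialization lets each branch start out unaffected by the other and pick up cross-branch influence gradually as the link weights move away from zero.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def zeros(rows: int, cols: int) -> Matrix:
    """Build a rows x cols matrix of zeros (each row is a distinct list)."""
    return [[0.0] * cols for _ in range(rows)]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two vectors."""
    return [p + q for p, q in zip(a, b)]


class BidirectionalControl:
    """Hypothetical sketch: two parallel branches exchange features
    through two linear control links whose weights start at zero."""

    def __init__(self, dim: int) -> None:
        self.perception_to_rgb = zeros(dim, dim)  # link 1, zero-initialized
        self.rgb_to_perception = zeros(dim, dim)  # link 2, zero-initialized

    def forward(self, rgb: Vector, perception: Vector) -> Tuple[Vector, Vector]:
        # Each branch keeps its own features and adds the other branch's
        # features passed through the corresponding control link.
        rgb_out = add(rgb, matvec(self.perception_to_rgb, perception))
        perception_out = add(perception, matvec(self.rgb_to_perception, rgb))
        return rgb_out, perception_out


control = BidirectionalControl(dim=3)
rgb_feat = [0.5, -1.0, 2.0]    # toy RGB-branch features
perc_feat = [1.0, 0.0, -0.5]   # toy perception-branch features
rgb_out, perc_out = control.forward(rgb_feat, perc_feat)
```

With the links at zero, `rgb_out` equals `rgb_feat` and `perc_out` equals `perc_feat`, so the two branches are fully decoupled at the start of training. Any non-zero weight learned later adds a controlled amount of cross-branch signal.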

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. Code availability is flagged…

WHY NOW

Generative Video moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Paper Pack

10.48550/arXiv.2604.02817

A framework for generating physically plausible videos by unifying semantic, geometric, and temporal cues into a pseudo-RGB format, improving visual quality and physical consistency.

Source availability: PDF linked. The paper record includes a public PDF URL.

Extraction status: Derived fallback. Read summaries are estimated from adjacent metadata, not verified extraction rows.

Proof status: unverified. 0 refs; 0 sources; 0% coverage.

What was readable: linked · on file · not materialized · derived fallback · not indexed · not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability: 7.0

Time to MVP: MVP estimate missing

Commercial: code

Claim map

Abstract-backed public claims while anchored extraction refreshes.

Strong 0 · Mixed 0 · Weak 4

Evidence (partial): A framework for generating physically plausible videos by unifying semantic, geometric, and temporal cues into a pseudo-RGB format, improving visual quality and physical consistency. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): ScienceToStartup currently rates this 7.0/10 on the public viability pass. Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. Code availability is flagged in the production record; the public repository link still needs proof alignment.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

Evidence (partial): Generative Video moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification (partial): partial

REFERENCES

Reference metadata is not materialized in the public index yet. The source PDF remains the authority; cache refresh is optional.

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

Prior Work: Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

7.0

Prior Work: Chain of Event-Centric Causal Thought for Physically Plausible Video Generation

7.0

Extension

Builds On This: Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

3.0

0/3 checks · 0%

References: 0 / 3 (3+ references)
Sources: 0 / 2 (2+ sources)
Coverage: 0% / 50% (50%+ coverage)
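One plausible reading of the check summary above is that each of the three checks passes when its observed value reaches its threshold, and the percentage is the share of checks passed. The page does not state the rule, so the sketch below is an assumption; the `Check` class and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    name: str
    value: float      # observed value from the page
    threshold: float  # minimum value needed to pass (assumed rule)

    def passed(self) -> bool:
        return self.value >= self.threshold


def summarize(checks: List[Check]) -> str:
    """Format the pass count and pass percentage for a list of checks."""
    passed = sum(check.passed() for check in checks)
    percent = round(100 * passed / len(checks)) if checks else 0
    return f"{passed}/{len(checks)} checks · {percent}%"


# Values and thresholds as displayed on the page.
checks = [
    Check("references", value=0, threshold=3),
    Check("sources", value=0, threshold=2),
    Check("coverage_pct", value=0, threshold=50),
]
summary = summarize(checks)
```

Under this reading, the page's numbers give `summary` equal to "0/3 checks · 0%", which matches the displayed summary line.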

Build Passport

UNVERIFIED

Build passport pending. Proof Lab budget: no verified cost estimate / $7.00 cap.

status: missing
reason: passport_row_missing
proof status: unverified
cost/budget: No verified cost estimate
confidence: low

next verification path:
/api/v1/paper/2604.02817v1/paper-pack
/api/v1/paper/2604.02817v1/build-passport
Generate required assets: Dockerfile.minimal, RUN.sh, EXPECTED_OUTPUT.json, COST_PASSPORT.json

Build artifacts

Brief: missing
Build brief missing until Build Passport data exists. Source missing: Build Passport payload.

Experiment plan: missing
Experiment plan missing until prototype path is available. No prototype path attached.

Validation checklist: missing
Validation checklist missing until required assets, cost, and regulatory flags are verified. No checklist artifact is attached to the Build Passport payload.

Signal Canvas

Derived signals show verified:false until source-backed receipts exist.

Signal: Evidence coverage
Source: OpportunityKernel evidence_receipt
Strength: 0 refs / 0 sources / 0% coverage
Freshness: unknown
Owner action: Verify missing sources before using this as buyer proof. verified:false

Signal: Build readiness
Source: BuildPassport EvidenceState
Strength: passport absent
Freshness: unknown
Owner action: Run Proof Lab or inspect typed missing state. verified:false

Signal: Artifact maturity
Source: GitHub and Hugging Face maturity payloads
Strength: No public artifact surface observed
Freshness: unknown
Owner action: Open source artifacts or mark the gap as missing. verified:false

Viability breakdown (decision rows)

Dimension: Technical feasibility (partial)
Current read: Runnable path is not fully verified.
Evidence: No Build Passport payload attached.
Gaps: No verified reproduction transcript attached.
Next test: Run minimal reproduction from the Build Passport prototype path.

Dimension: Market urgency (missing)
Current read: Buyer urgency is not verified from source.

Talent buckets (no invented people)

Scientific founder (needed now)
No named scientific founder assigned. Paper authors are not treated as operators without consent.
People: No named person assigned.
Gaps: Founder commitment not verified.
Next verification path: Confirm founder/operator owner.

Translational engineer (needed now)
Prototype owner missing. Build Passport does not name an implementer.
People: No named person assigned.

ARTIFACTS

No public artifacts yet.

DEFENSIBILITY

Defensibility and confidence evidence pending.

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "c4fa5736-62c2-42b0-b4e2-6196481cf2a8",
  "arxiv_id": "2604.02817",
  "canonical_route": "/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
  "endpoints": {
    "paper_pack": "/api/v1/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling/paper-pack",
    "build_passport": "/api/v1/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling

stale

Proof freshness: unknown
Proof status: unverified
Display score: 7/10
Last proof check: 2026-04-06
Score updated: 2026-04-06
Score fresh until: 2026-05-06
References: 0
Source count: 0
Coverage: 0%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.
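The freshness fields above suggest a simple date-window rule: a score is treated as fresh up to its "fresh until" date and stale afterwards. The page does not document the exact rule (it also reports proof freshness as unknown), so the following is only a sketch under that assumption; `is_stale` is a hypothetical helper.

```python
from datetime import date


def is_stale(fresh_until: date, today: date) -> bool:
    """Assumed rule: data is stale once today is past the fresh-until date."""
    return today > fresh_until


# Dates as listed on the page.
score_updated = date(2026, 4, 6)
fresh_until = date(2026, 5, 6)

# Length of the freshness window implied by the two dates.
window_days = (fresh_until - score_updated).days
```

For the listed dates, `window_days` is 30, and `is_stale(fresh_until, some_date)` becomes true for any date after 2026-05-06.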

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Canonical ID mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling | Route /paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling
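The same request can be sketched in Python with only the standard library. The URL pattern is taken from the curl example above; the helper names `handoff_url` and `fetch_handoff` are hypothetical, and the assumption that the endpoint returns a JSON body is not confirmed by the page.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/paper"


def handoff_url(slug: str) -> str:
    """Build the agent-handoff URL for a paper slug."""
    return f"{BASE_URL}/{quote(slug, safe='')}"


def fetch_handoff(slug: str, timeout: float = 10.0) -> dict:
    """Fetch the handoff payload, assuming the endpoint responds with JSON."""
    with urlopen(handoff_url(slug), timeout=timeout) as response:
        return json.load(response)


SLUG = (
    "mmphysvideo-scaling-physical-plausibility-in-video-generation"
    "-via-joint-multimodal-modeling"
)
url = handoff_url(SLUG)
```

Calling `fetch_handoff(SLUG)` performs the network request; building the URL separately keeps that step easy to log or test without touching the network.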

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2604.02817"
  }
}
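If the server follows the standard Model Context Protocol transport, the tool name and arguments shown above would travel inside a JSON-RPC 2.0 `tools/call` request. Whether this site's MCP server expects exactly that envelope is an assumption; the function name `tool_call_request` is hypothetical.

```python
import json
from typing import Any, Dict


def tool_call_request(tool: str, arguments: Dict[str, Any], request_id: int = 1) -> Dict[str, Any]:
    """Wrap a tool name and arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# The tool call as shown on the page.
page_example = {"tool": "get_paper", "arguments": {"arxiv_id": "2604.02817"}}

request = tool_call_request(page_example["tool"], page_example["arguments"])
wire = json.dumps(request)  # serialized form sent to the server
```

Most MCP client libraries build this envelope automatically, so in practice only the tool name and the `arxiv_id` argument need to be supplied.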

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling",
  "normalized_query": "2604.02817",
  "route": "/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
  "paper_ref": "mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Watch and verify: MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

/buildability/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling

Subject: MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

Verdict: Watch
Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

Time to first demo: Insufficient data
No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope (structured): Insufficient data

Source Proof anchors

Visual citations from the paper document graph.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling#webpage",
      "url": "https://sciencetostartup.com/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
      "name": "MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling",
      "description": "A framework for generating physically plausible videos by unifying semantic, geometric, and temporal cues into a pseudo-RGB format, improving visual quality and physical consistency.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling#scholarlyArticle",
      "headline": "MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling",
      "description": "A framework for generating physically plausible videos by unifying semantic, geometric, and temporal cues into a pseudo-RGB format, improving visual quality and physical consistency.",
      "url": "https://sciencetostartup.com/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling",
      "sameAs": "https://arxiv.org/abs/2604.02817",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2604.02817"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-04-03T07:32:24.000Z",
      "author": [
        {
          "@type": "Person",
          "name": "Shubo Lin"
        },
        {
          "@type": "Person",
          "name": "Xuanyang Zhang"
        },
        {
          "@type": "Person",
          "name": "Wei Cheng"
        },
        {
          "@type": "Person",
          "name": "Weiming Hu"
        },
        {
          "@type": "Person",
          "name": "Gang Yu"
        },
        {
          "@type": "Person",
          "name": "Jin Gao"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 7
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Generative Video"
        },
        {
          "@type": "PropertyValue",
          "propertyID": "commercialReadiness",
          "value": "code"
        }
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Generative Video",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "MMPhysVideo: Scaling Physical Plausibility in Video Generati",
          "item": "https://sciencetostartup.com/paper/mmphysvideo-scaling-physical-plausibility-in-video-generation-via-joint-multimodal-modeling"
        }
      ]
    }
  ]
}
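An agent consuming the JSON-LD payload above would typically locate the `ScholarlyArticle` node in `@graph` and read its fields. The sketch below does this on a trimmed copy of the payload (only two of the listed authors are included to keep it short); the helper name `find_nodes` is hypothetical.

```python
import json
from typing import Any, Dict, List

# Trimmed copy of the page's JSON-LD, kept small for illustration.
JSON_LD = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage", "name": "MMPhysVideo"},
    {
      "@type": "ScholarlyArticle",
      "sameAs": "https://arxiv.org/abs/2604.02817",
      "identifier": {"@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.02817"},
      "author": [
        {"@type": "Person", "name": "Shubo Lin"},
        {"@type": "Person", "name": "Xuanyang Zhang"}
      ],
      "additionalProperty": [
        {"@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7},
        {"@type": "PropertyValue", "propertyID": "researchDomain", "value": "Generative Video"},
        {"@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code"}
      ]
    }
  ]
}
"""


def find_nodes(doc: Dict[str, Any], node_type: str) -> List[Dict[str, Any]]:
    """Return every node in @graph whose @type matches node_type."""
    return [node for node in doc.get("@graph", []) if node.get("@type") == node_type]


doc = json.loads(JSON_LD)
article = find_nodes(doc, "ScholarlyArticle")[0]

arxiv_id = article["identifier"]["value"]
authors = [person["name"] for person in article["author"]]
# Flatten the additionalProperty list into a propertyID -> value mapping.
properties = {prop["propertyID"]: prop["value"] for prop in article["additionalProperty"]}
```

Flattening `additionalProperty` into a dictionary makes site-specific fields such as `viabilityScore` and `commercialReadiness` directly addressable by name.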
