ARXIV:2603.07222 · SELF-SUPERVISED LEARNING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (verified)
PaperPack: 3 of 4 citation fields filled (partial)
Missing fields: authors
Proof: unverified proof status (partial)

VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

arXiv

VINO is a self-supervised learning framework that learns robust image encoders from dense video by imposing a structural information bottleneck, effectively disentangling foreground from background.

Blocked on Code · Score 7.0 · Evidence unverified

Opportunity summary

Pain: VINO is a self-supervised learning framework that learns robust image encoders from dense video by imposing a structural information bottleneck, effectively disentangling foreground from background.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified


PROBLEM

VINO is a self-supervised learning framework that learns robust image encoders from dense video by imposing a structural information bottleneck, effectively disentangling foreground from background. While video provides rich temporal variation, dense in-the-wild streams…

METHOD

Full abstract

Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts: background textures and co-occurrence statistics. While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders. To address this, we propose VINO (Video-driven Invariance for Non-Contextual Objects), a teacher-student framework that learns robust image encoders from dense video by imposing a structural information bottleneck. Using a class-agnostic structural prior solely to generate views, not as semantic pseudo-labels, VINO forms an asymmetric distillation problem. The teacher predicts from a foreground-union view with the background suppressed, while the student observes object-conditioned scene views that retain surrounding context but remove competing instances. Matching these targets via masked distillation makes background cues unreliable, pushing the representation toward object-centric invariances. We further enforce temporal object permanence via teacher-anchored cross-time distillation over track-matched objects, and stabilize part-to-whole consistency with mask-guided local views. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background. Pretrained on the dense Walking Tours Venice video, VINO achieves 34.8 CorLoc, yielding highly focused, shape-biased representations that substantially outperform prior dense-video and motion-guided SSL baselines.
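As a rough illustration of the asymmetric view construction the abstract describes, the sketch below builds a teacher view (foreground union, background suppressed) and a student view (surrounding context kept, competing instances removed) from per-instance masks. The function name make_views, the boolean mask layout, and the constant fill used for suppressed pixels are assumptions made for illustration; the abstract does not say how the masks are obtained or how removed regions are filled.

```python
import numpy as np


def make_views(image, instance_masks, target_idx, fill=0.0):
    """Build one (teacher_view, student_view) pair for a target object.

    image:          (H, W, C) float array.
    instance_masks: (N, H, W) boolean array, one mask per object
                    (hypothetical output of a class-agnostic prior).
    target_idx:     index of the object the student view is conditioned on.
    fill:           value written into suppressed pixels (an assumption).
    """
    masks = np.asarray(instance_masks, dtype=bool)

    # Teacher view: keep every foreground pixel, suppress the background.
    fg_union = masks.any(axis=0)
    teacher_view = np.where(fg_union[..., None], image, fill)

    # Student view: keep the target object and the background context,
    # but blank out pixels that belong only to competing instances.
    others = np.delete(masks, target_idx, axis=0).any(axis=0)
    competing = others & ~masks[target_idx]
    student_view = np.where(competing[..., None], fill, image)

    return teacher_view, student_view


if __name__ == "__main__":
    img = np.ones((4, 4, 3))
    m = np.zeros((2, 4, 4), dtype=bool)
    m[0, 0:2, 0:2] = True   # target object, top-left
    m[1, 2:4, 2:4] = True   # competing object, bottom-right
    teacher, student = make_views(img, m, target_idx=0)
    print(teacher[..., 0])
    print(student[..., 0])
```

In this toy setup the two views agree only on the target object's pixels, so a student asked to match the teacher's output cannot rely on background pixels, which is the intuition the abstract gives for why background cues become unreliable.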

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Through attention visualization and unsupervised object discovery on PASCAL VOC, we demonstrate that VINO effectively disentangles foreground from background.
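For readers unfamiliar with the metric behind the 34.8 figure in the abstract, CorLoc (correct localization) is commonly computed as the percentage of images in which the predicted box overlaps at least one ground-truth box with intersection-over-union of 0.5 or more. The sketch below follows that common definition; the (x1, y1, x2, y2) box format, the function names, and the inclusive threshold are assumptions, and the paper's exact evaluation protocol is not given in this record.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box hits any ground-truth box.

    predictions:   one predicted box per image.
    ground_truths: one list of ground-truth boxes per image.
    """
    if not predictions:
        return 0.0
    hits = sum(
        any(iou(pred, gt) >= threshold for gt in gts)
        for pred, gts in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
    print(corloc(preds, gts))
```

Because only one box per image is scored and a single sufficiently overlapping ground-truth box counts as a hit, CorLoc measures whether the model finds some object, not whether it finds all of them.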

WHY NOW

Self-Supervised Learning moved forward this cycle; last verified April 2026. Public score 7.0/10.


Opportunity summary

Score: 7.0

Pain: VINO is a self-supervised learning framework that learns robust image encoders from dense video by imposing a structural information bottleneck, effectively disentangling foreground from background.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: missing authors


Competitive landscape

VINO is a self-supervised learning framework that learns robust image encoders from dense video by imposing a structural information bottleneck, effectively disentangling foreground from background.

Segment: Self-Supervised Learning

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Source: https://arxiv.org/abs/2603.07222 (arXiv 2603.07222) · Published 2026-03-07 · Free to access
