ARXIV:2604.02252 · COMPUTER VISION · SUBMITTED 03 APR · 20:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Naomi Kombol · Ivan Martinović · Siniša Šegvić · Giorgos Tolias · arXiv

A resolution-agnostic Vision Transformer that enables efficient, high-resolution open-vocabulary segmentation by distilling fine-grained spatial understanding into a single-pass model.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A resolution-agnostic Vision Transformer that enables efficient, high-resolution open-vocabulary segmentation by distilling fine-grained spatial understanding into a single-pass model.

Evidence: 0 refs | 0 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based…

METHOD

Full abstract

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
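The teacher/student setup described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`window_starts`, `sliding_window_teacher`, `feature_regression_loss`) and the `encode` callback are hypothetical, features are reduced to one scalar per position, overlapping windows are assumed to be averaged, and mean squared error is assumed as the feature regression loss. The sketch also counts encoder passes, which is where the sliding-window cost comes from.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a dense feature map: one scalar per spatial position.
Grid = List[List[float]]


def window_starts(size: int, window: int, stride: int) -> List[int]:
    """Start offsets that cover [0, size); the last window is clamped to the border."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)
    return starts


def sliding_window_teacher(
    image: Grid, window: int, stride: int, encode: Callable[[Grid], Grid]
) -> Tuple[Grid, int]:
    """Encode every window separately and average overlapping outputs (assumed aggregation).

    Returns the aggregated feature map and the number of encoder passes used.
    """
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    passes = 0
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            crop = [row[left:left + window] for row in image[top:top + window]]
            feats = encode(crop)  # hypothetical encoder: same spatial size as its input
            passes += 1
            for i, feat_row in enumerate(feats):
                for j, value in enumerate(feat_row):
                    total[top + i][left + j] += value
                    count[top + i][left + j] += 1
    averaged = [[total[i][j] / count[i][j] for j in range(w)] for i in range(h)]
    return averaged, passes


def feature_regression_loss(student: Grid, teacher: Grid) -> float:
    """Mean squared error between student and teacher feature maps (assumed loss form)."""
    n = len(student) * len(student[0])
    squared = sum(
        (s - t) ** 2
        for s_row, t_row in zip(student, teacher)
        for s, t in zip(s_row, t_row)
    )
    return squared / n
```

In this toy setting, an 8x8 input with a 4x4 window needs 4 encoder passes at stride 4 and 9 at stride 2, while a student that sees the whole input needs one. Training would minimize `feature_regression_loss` between the student's single-pass output and the teacher's aggregated map.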

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Sliding-window inference with finer strides improves accuracy, but it comes at a significant computational cost. A public repository is linked, so build verification can inspect…

WHY NOW

Computer Vision moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: A resolution-agnostic Vision Transformer that enables efficient, high-resolution open-vocabulary segmentation by distilling fine-grained spatial understanding into a single-pass model.

Evidence: 0 refs | 0 sources | 67% coverage

Blocker: no shell-level blocker reported

Analysis summary

A resolution-agnostic Vision Transformer that enables efficient, high-resolution open-vocabulary segmentation by distilling fine-grained spatial understanding into a single-pass model.

Competitive landscape

A resolution-agnostic Vision Transformer that enables efficient, high-resolution open-vocabulary segmentation by distilling fine-grained spatial understanding into a single-pass model.

Segment

Computer Vision

Adoption evidence

Public code linked for build inspection

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
