ARXIV:2604.02583 · MULTIMODAL RETRIEVAL · SUBMITTED 06 APR · 20:15 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder

Wei Li · Yufan Ren · Hanqing Jiang · Jianhui Ding · Zhen Peng · Leman Feng · Yichun Shentu · Guoqiang Xu · Baigui Sun

FusionBERT enables robust multi-view image-to-3D model retrieval by adaptively fusing visual cues and enhancing 3D geometry encoding.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: FusionBERT enables robust multi-view image-to-3D model retrieval by adaptively fusing visual cues and enhancing 3D geometry encoding.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified

PROBLEM

FusionBERT enables robust multi-view image-to-3D model retrieval by adaptively fusing visual cues and enhancing 3D geometry encoding. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its…

METHOD

Full abstract

We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on feature alignment of a single object image and its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for better cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which innovatively utilizes a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder fuses inter-view complementary relationships and selectively emphasizes informative visual cues across multiple views to get a more robustly fused visual feature for better 3D model matching. Furthermore, FusionBERT proposes a normal-aware 3D model encoder that can further enhance the 3D geometric feature of an object model by jointly encoding point normals and 3D positions, enabling a more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.
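The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the single query vector, the identity key/value projections, the single attention head, and the function names `fuse_views` and `normal_aware_points` are all assumptions made for illustration. The first function pools per-view features with attention weights, so that views better aligned with the query contribute more to the fused feature; the second builds a 6-D per-point input from a 3D position and its unit normal, which is one plausible reading of "jointly encoding point normals and 3D positions".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(query, view_features):
    """Single-head cross-attention pooling over per-view features.

    query:         list of d floats (hypothetical learned query token)
    view_features: one list of d floats per view
    Keys and values are the raw view features here; a trained model
    would normally apply learned projections (omitted in this sketch).
    Returns (fused_feature, attention_weights).
    """
    d = len(query)
    scale = 1.0 / math.sqrt(d)
    scores = [scale * sum(q * x for q, x in zip(query, view))
              for view in view_features]
    weights = softmax(scores)
    fused = [sum(w * view[i] for w, view in zip(weights, view_features))
             for i in range(d)]
    return fused, weights


def normal_aware_points(positions, normals):
    """Concatenate each xyz position with its unit-length normal (6-D input)."""
    points = []
    for (x, y, z), (nx, ny, nz) in zip(positions, normals):
        length = math.sqrt(nx * nx + ny * ny + nz * nz)
        if length > 0.0:  # leave degenerate zero normals unchanged
            nx, ny, nz = nx / length, ny / length, nz / length
        points.append([x, y, z, nx, ny, nz])
    return points


# Toy example: three 2-D view features, query aligned with the first view.
views = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
fused, weights = fuse_views([1.0, 0.0], views)
```

In this toy case the first view receives the largest weight because it has the highest scaled dot product with the query. With a single view, the fused feature reduces to that view's feature.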

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing…
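As a rough illustration of how image-to-3D retrieval accuracy is commonly measured, the sketch below ranks 3D model features by cosine similarity to each image feature and reports Recall@K. Both the similarity function and the metric are assumptions: this page does not state which metric or scoring function the paper uses, and the names `rank_models` and `recall_at_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_models(image_feature, model_features):
    """Return 3D model indices sorted from most to least similar."""
    sims = [cosine(image_feature, m) for m in model_features]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


def recall_at_k(image_features, model_features, ground_truth, k):
    """Fraction of queries whose true model index appears in the top k."""
    hits = 0
    for feature, truth in zip(image_features, ground_truth):
        if truth in rank_models(feature, model_features)[:k]:
            hits += 1
    return hits / len(image_features)


# Toy gallery of three 3D model embeddings and three image queries.
models = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
truth = [0, 1, 2]

r1 = recall_at_k(queries, models, truth, k=1)
r2 = recall_at_k(queries, models, truth, k=2)
```

In the toy data the third query sits closer to model 0 than to its true match, so it is missed at K=1 and recovered at K=2.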

WHY NOW

Multimodal Retrieval moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0

Pain: FusionBERT enables robust multi-view image-to-3D model retrieval by adaptively fusing visual cues and enhancing 3D geometry encoding.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported

Competitive landscape

FusionBERT enables robust multi-view image-to-3D model retrieval by adaptively fusing visual cues and enhancing 3D geometry encoding.

Segment: Multimodal Retrieval

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Paper record

arXiv: 2604.02583 (https://arxiv.org/abs/2604.02583)

Published: 2026-04-02T23:36:33.000Z

Authors: Wei Li, Yufan Ren, Hanqing Jiang, Jianhui Ding, Zhen Peng, Leman Feng, Yichun Shentu, Guoqiang Xu, Baigui Sun

Research domain: Multimodal Retrieval

Viability score: 7

Commercial readiness: code

Page: https://sciencetostartup.com/paper/fusionbert-multi-view-image-3d-retrieval-via-cross-attention-visual-fusion-and-normal-aware-3d-encoder
