ARXIV:2604.11095 · MULTIMODAL RETRIEVAL · SUBMITTED 14 APR · 16:47 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: citation fields available (Verified) | Proof: unverified proof status (Partial)

Bottleneck Tokens for Unified Multimodal Retrieval

Siyu Sun · Jing Ren · Zhaohe Liao · Dongxiao Mao · Xiangyuan Ren · Yiyi Zhang · +5 at arXiv

Bottleneck Tokens (BToks) and Generative Information Condensation enable decoder-only multimodal LLMs to achieve state-of-the-art unified multimodal retrieval with negligible inference overhead.

Ship in 2-4 weeks | Score 8.0 | Evidence unverified

Opportunity summary

Pain: Bottleneck Tokens (BToks) and Generative Information Condensation enable decoder-only multimodal LLMs to achieve state-of-the-art unified multimodal retrieval with negligible inference overhead.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified


PROBLEM

Bottleneck Tokens (BToks) and Generative Information Condensation enable decoder-only multimodal LLMs to achieve state-of-the-art unified multimodal retrieval with negligible inference overhead. First, existing methods rely on implicit pooling, which overloads the hidden state of…
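To make the pooling contrast concrete, here is a minimal sketch in plain Python. It compares implicit last-token pooling (reading the hidden state of the final real token, e.g. <EOS>) with explicit pooling over dedicated token positions. The function names, the right-padding convention, and the choice to average the BTok states are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[Vector], attention_mask: List[int]) -> Vector:
    """Implicit pooling: reuse the hidden state of the last non-padded token.

    Assumes right padding, so the last position with mask == 1 is the
    final real token (typically <EOS>).
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden[last]


def btok_pool(hidden: List[Vector], btok_positions: List[int]) -> Vector:
    """Explicit pooling: read the embedding from dedicated token positions.

    Averaging the selected states is an assumption made for this sketch;
    the source does not say how multiple BTok states are combined.
    """
    picked = [hidden[i] for i in btok_positions]
    dim = len(picked[0])
    return [sum(v[d] for v in picked) / len(picked) for d in range(dim)]


# Toy hidden states: 4 positions, 2 dimensions each.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
print(last_token_pool(hidden, [1, 1, 1, 0]))  # state of position 2
print(btok_pool(hidden, [2, 3]))              # mean of positions 2 and 3
```

In the first case one vocabulary token has to double as the summary of the whole sequence. In the second, the summary slots exist only for that purpose and their number fixes the capacity.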

METHOD

Full abstract

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).
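The Condensation Mask described in the abstract can be sketched as a modified causal attention mask. The example below is a minimal illustration under stated assumptions: the training sequence is laid out as [query | BToks | target], attention is otherwise causal, and the helper names `condensation_mask` and `render` are hypothetical. The paper's actual layout and implementation may differ.

```python
from typing import List


def condensation_mask(n_query: int, n_btok: int, n_target: int) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (not confirmed by the source): [query | BToks | target].
    Starts from a causal mask, then blocks target -> query attention so
    that target tokens can reach the query only through the BToks.
    """
    n = n_query + n_btok + n_target
    target_start = n_query + n_btok
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            allowed = j <= i  # standard causal constraint
            if i >= target_start and j < n_query:
                allowed = False  # sever the direct target -> query path
            row.append(allowed)
        mask.append(row)
    return mask


def render(mask: List[List[bool]]) -> str:
    """Render the mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if a else "." for a in row) for row in mask)


# 3 query tokens, 2 BToks, 2 target tokens.
print(render(condensation_mask(3, 2, 2)))
# x......
# xx.....
# xxx....
# xxxx...
# xxxxx..
# ...xxx.
# ...xxxx
```

The last two rows are the target tokens: they see the BToks and earlier target tokens but none of the query columns, so the next-token loss can only be reduced by packing query information into the BTok states. At inference the target segment is absent, which is consistent with the abstract's claim that only the input and BToks are processed.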

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of…

WHY NOW

Multimodal Retrieval moved forward this cycle; last verified April 2026. Public score 8.0/10. Production flags indicate code availability.


Opportunity summary

Score: 8.0

Pain: Bottleneck Tokens (BToks) and Generative Information Condensation enable decoder-only multimodal LLMs to achieve state-of-the-art unified multimodal retrieval with negligible inference overhead.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: no shell-level blocker reported


Competitive landscape

Segment: Multimodal Retrieval

Adoption evidence: No public code link in the paper record yet

Commercial read: 8.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record

arXiv: 2604.11095 (https://arxiv.org/abs/2604.11095)

Page: https://sciencetostartup.com/paper/bottleneck-tokens-for-unified-multimodal-retrieval

Published: 2026-04-13

Authors: Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng

Research domain: Multimodal Retrieval

Commercial readiness: code

