ARXIV:2603.17360 · IMAGE RETRIEVAL · SUBMITTED 19 MAR · 21:31 UTC · FRESHNESS STALE

Source: PDF linked (verified)
PaperPack: 3 of 4 citation fields filled (partial)
Missing fields: authors
Proof: partial proof status

MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

arXiv

MCoT-MVS enhances composed image retrieval by integrating multi-level vision features with multi-modal reasoning for improved semantic accuracy.

Blocked on Code · Score 9.0 · Evidence partial

Opportunity summary

Pain: MCoT-MVS enhances composed image retrieval by integrating multi-level vision features with multi-modal reasoning for improved semantic accuracy.

Evidence: 0 refs | 0 sources | 50% coverage

Blocker: Evidence partial

PROBLEM

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best…

METHOD

Full abstract

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.
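The flow described in the abstract (text cues steer selection over reference-image features, and the selected features are fused with text embeddings into one query) can be illustrated with a toy sketch. This is not the authors' implementation. Every function name, the softmax attention pooling, the way the "removed" cue is subtracted from the score, the temperature, the fusion weights, and the 3-dimensional embeddings are assumptions made for illustration. In the paper, the cues come from an MLLM and the features from trained encoders.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    # Subtract the max score for numerical stability.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_visual_features(
    retained_cue: Sequence[float],
    features: Sequence[Sequence[float]],
    removed_cue: Optional[Sequence[float]] = None,
    temperature: float = 0.1,
) -> Vector:
    """Attention-pool visual features (patches or instances) with a text cue.

    Assumption: features similar to the retained cue are up-weighted and
    features similar to the removed cue are down-weighted.
    """
    keep = l2_normalize(retained_cue)
    drop = l2_normalize(removed_cue) if removed_cue is not None else None
    feats = [l2_normalize(f) for f in features]
    scores = [dot(keep, f) - (dot(drop, f) if drop else 0.0) for f in feats]
    weights = softmax(scores, temperature)
    pooled = [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(len(keep))]
    return l2_normalize(pooled)


def combine(components: Sequence[Sequence[float]], weights: Sequence[float]) -> Vector:
    """Weighted sum of query components, normalized to unit length."""
    dim = len(components[0])
    fused = [sum(w * c[i] for w, c in zip(weights, components)) for i in range(dim)]
    return l2_normalize(fused)


def rank(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Gallery indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query)
    sims = [dot(q, l2_normalize(g)) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


# Toy inputs: hand-made stand-ins for encoder outputs.
retained_cue = [1.0, 0.0, 0.0]      # embedding of the "retained" text
removed_cue = [0.0, 1.0, 0.0]       # embedding of the "removed" text
modified_text = [0.0, 0.0, 1.0]     # embedding of the modification text
target_text = [0.7, 0.0, 0.7]       # embedding of the target-inferred text
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
instances = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]

patch_level = select_visual_features(retained_cue, patches, removed_cue)
instance_level = select_visual_features(retained_cue, instances, removed_cue)

query = combine(
    [patch_level, instance_level, l2_normalize(modified_text), l2_normalize(target_text)],
    weights=[0.2, 0.2, 0.3, 0.3],   # illustrative values, not from the paper
)

gallery = [[0.0, 1.0, 0.0], [0.7, 0.0, 0.7], [1.0, 0.0, 0.0]]
ranking = rank(query, gallery)
print(ranking)  # [1, 2, 0]
```

In this toy setup the gallery item that mixes the retained content with the requested modification ranks first, the item matching only the retained content ranks second, and the item matching the removed content ranks last.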

RESULT

ScienceToStartup currently rates this 9.0/10 on the public viability pass. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. A…

WHY NOW

Image Retrieval moved forward this cycle; last verified April 2026. Public score 9.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 9.0

Pain: MCoT-MVS enhances composed image retrieval by integrating multi-level vision features with multi-modal reasoning for improved semantic accuracy.

Evidence: 0 refs | 0 sources | 50% coverage

Blocker: missing authors

Competitive landscape

Segment: Image Retrieval
Adoption evidence: Public code linked for build inspection
Commercial read: 9.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

arXiv: https://arxiv.org/abs/2603.17360
Published: 2026-03-18T04:49:19Z
Code: https://github.com/JJJJerry/WWW2026-MCoT-MVS