ARXIV:2604.02071 · COMPUTER VISION · SUBMITTED 03 APR · 20:30 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

Soo Won Seo · KyungChae Lee · Hyungchan Cho · Taein Son · Nam Ik Cho · Jun Won Choi · arXiv

A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection.

Evidence: 0 refs | 0 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance.

METHOD

Full abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
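
To make the two-stage idea in the abstract concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the real ICR and ProCA modules are learned networks, whereas this sketch assumes simple mean pooling over a feature grid and a fixed blending weight. The function names (refine_contexts, aggregate), the box format, and the alpha parameter are all hypothetical and chosen only for illustration.

```python
from statistics import fmean

# Boxes are (x0, y0, x1, y1) in feature-map cells, half-open. This format is
# an assumption of the sketch, not something the abstract specifies.


def cells_in(box):
    """Return the set of (row, col) cells covered by a box."""
    x0, y0, x1, y1 = box
    return {(r, c) for r in range(y0, y1) for c in range(x0, x1)}


def pool(fmap, cells):
    """Mean-pool the feature vectors at the given cells (zeros if none)."""
    cells = list(cells)
    dim = len(fmap[0][0])
    if not cells:
        return [0.0] * dim
    return [fmean(fmap[r][c][d] for r, c in cells) for d in range(dim)]


def refine_contexts(fmap, boxes, i):
    """Toy stand-in for ICR: three separate context vectors for instance i.

    intra  = cells inside the instance's own box
    inter  = cells inside other instances' boxes (excluding its own cells)
    global = every cell of the feature map
    """
    all_cells = {(r, c) for r in range(len(fmap)) for c in range(len(fmap[0]))}
    own = cells_in(boxes[i])
    others = set().union(
        *(cells_in(b) for j, b in enumerate(boxes) if j != i)
    ) - own
    return {
        "intra": pool(fmap, own),
        "inter": pool(fmap, others),
        "global": pool(fmap, all_cells),
    }


def aggregate(det_feat, contexts, order=("intra", "inter", "global"), alpha=0.5):
    """Toy stand-in for ProCA: fold one context at a time into the
    detector feature, using a fixed blend weight (an assumption)."""
    fused = list(det_feat)
    for name in order:
        fused = [(1 - alpha) * f + alpha * c
                 for f, c in zip(fused, contexts[name])]
    return fused


if __name__ == "__main__":
    # 2x2 feature map with 1-dimensional features, two one-cell instances.
    fmap = [[[1.0], [2.0]], [[3.0], [4.0]]]
    boxes = [(0, 0, 1, 1), (1, 1, 2, 2)]
    ctx = refine_contexts(fmap, boxes, 0)
    print(ctx)
    print(aggregate([0.0], ctx))
```

The point of the sketch is only the data flow: context is split into three instance-centric views first, then merged step by step with the detector feature rather than in a single concatenation.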

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context.…

WHY NOW

Computer Vision moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection.

Evidence: 0 refs | 0 sources | 67% coverage

Blocker: no shell-level blocker reported

Analysis summary

A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection.

Competitive landscape

A novel network that integrates vision-language models with object detection to achieve state-of-the-art human-object interaction detection.

Segment

Computer Vision

Adoption evidence

Public code linked for build inspection

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified
