ARXIV:2602.15513 · AGENTS · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors (Missing) | Proof: unverified proof status (Partial)

Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling

arXiv

Enhance embodied agents with a novel memory framework improving exploration and reasoning efficiency.

Blocked on Code › Score 7.0 | Evidence unverified

Opportunity summary

Pain Enhance embodied agents with a novel memory framework improving exploration and reasoning efficiency.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified


PROBLEM

Enhance embodied agents with a novel memory framework improving exploration and reasoning efficiency. Existing memory assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary…

METHOD

Full abstract

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
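The retrieval-first, reasoning-assisted recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Episode` record, the `recall` function, the `top_k` parameter, and the `verify` callback are all hypothetical names introduced here, and `verify` is only a stand-in for whatever visual-reasoning check the MLLM performs. The sketch assumes episodes are stored as embedding vectors and ranked by cosine similarity, which is one common way to realize "semantic similarity".

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Episode:
    """Hypothetical episodic record: an embedding plus a handle to the raw frame."""
    embedding: Sequence[float]
    frame_id: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(
    query: Sequence[float],
    memory: List[Episode],
    verify: Callable[[Episode], bool],
    top_k: int = 3,
) -> List[Episode]:
    """Step 1 (retrieval): rank stored episodes by similarity to the query.
    Step 2 (verification): keep only candidates that pass the check.
    `verify` is an assumed placeholder for an MLLM visual-reasoning call."""
    ranked = sorted(memory, key=lambda e: cosine(query, e.embedding), reverse=True)
    return [e for e in ranked[:top_k] if verify(e)]


# Toy usage with hand-made 2-D embeddings (illustrative data only).
memory = [
    Episode([1.0, 0.0], "kitchen"),
    Episode([0.0, 1.0], "hall"),
    Episode([0.9, 0.1], "pantry"),
]
hits = recall([1.0, 0.0], memory, verify=lambda e: e.frame_id != "pantry", top_k=2)
print([e.frame_id for e in hits])  # prints ['kitchen']
```

Retrieval narrows the candidates cheaply, and the verification step filters out look-alike matches, so no geometric alignment between the stored and current views is required in this toy setup.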

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 7.0/10.


Opportunity summary

Score 7.0

Pain Enhance embodied agents with a novel memory framework improving exploration and reasoning efficiency.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Analysis summary

Enhance embodied agents with a novel memory framework improving exploration and reasoning efficiency.




Authors: Ji Li (University of Hong Kong), Jing Xia (University of Hong Kong), Mingyi Li (Beijing Institute of Technology), Shiyan Hu (University of Hong Kong)

Published: 2026-02-17T11:41:28.000Z | arXiv: https://arxiv.org/abs/2602.15513 | Research domain: Agents | Viability score: 7

What is the startup potential of "Improving MLLMs in Embodied Exploration and Question Answeri"?

Enhance embodied AI agents with human-inspired memory systems for superior exploration and question answering.

What products could be built from this research?

Turn the framework into an API that integrates with robotics or virtual agents in industries requiring field exploration and data-driven decision making.

What are the practical use cases?

Develop an AI-powered virtual assistant for real estate agents that enhances property inspections by retaining important visit details and answering client queries in real time.

What industries could this research disrupt?

It could replace existing rigid memory mechanisms in exploratory AI systems, leading to more dynamic and adaptive robotic explorations.

