ARXIV:2603.28088 · MULTIMODAL GENERATION AGENTS · SUBMITTED 31 MAR · 20:19 UTC · FRESHNESS STALE

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

GEMS: Agent-Native Multimodal Generation with Memory and Skills

Zefeng He · Siyuan Huang · Xiaoye Qu · Yafu Li · Tong Zhu · Yu Cheng · Yang Yang (arXiv)

GEMS is an agent-native multimodal generation framework that enhances foundational models with structured multi-agent loops, hierarchical memory, and domain-specific skills to achieve significant performance gains on complex and specialized tasks.

Ship in 2-4 weeks · Score: 7.0 · Evidence unverified

Opportunity summary

Pain: GEMS is an agent-native multimodal generation framework that enhances foundational models with structured multi-agent loops, hierarchical memory, and domain-specific skills to achieve significant performance gains on complex and specialized tasks.

Evidence: 74 refs | 3 sources | 50% coverage

Blocker: Evidence unverified



Full abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.
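The three components named in the abstract can be pictured with a small sketch. This is not the authors' implementation: the record here links no code, so every name below (`Step`, `TrajectoryMemory`, `SkillRegistry`, `agent_loop`, the `toy_*` functions, the `keep_recent` window, and the 0.9 target score) is a hypothetical stand-in chosen for illustration. The generator, verifier, and refiner are plain functions so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """Factual state of one round: what was tried and how it was judged."""
    prompt: str
    output: str
    score: float
    feedback: str


@dataclass
class TrajectoryMemory:
    """Two levels: recent raw steps, plus compressed summaries of older ones."""
    keep_recent: int = 2  # assumed window size, not taken from the paper
    steps: List[Step] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)
        # Once the raw window overflows, fold the oldest step into a summary.
        while len(self.steps) > self.keep_recent:
            old = self.steps.pop(0)
            self.summaries.append(f"score={old.score:.2f}: {old.feedback}")

    def context(self) -> str:
        recent = [f"{s.prompt!r} scored {s.score:.2f}" for s in self.steps]
        return " | ".join(self.summaries + recent)


class SkillRegistry:
    """Skills are registered as loaders and only read when first requested."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], str]] = {}
        self._loaded: Dict[str, str] = {}

    def register(self, name: str, loader: Callable[[], str]) -> None:
        self._loaders[name] = loader

    def load(self, name: str) -> str:
        if name not in self._loaded:  # on demand: load once, then cache
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self) -> List[str]:
        return sorted(self._loaded)


def agent_loop(task, generate, verify, refine, memory, max_rounds=4, target=0.9):
    """Generate, verify, refine; stop at the target score or the round limit."""
    prompt, best = task, None
    for _ in range(max_rounds):
        output = generate(prompt)
        score, feedback = verify(task, output)
        step = Step(prompt, output, score, feedback)
        memory.add(step)
        if best is None or score > best.score:
            best = step
        if score >= target:
            break
        prompt = refine(prompt, feedback, memory.context())
    return best


# Toy stand-ins so the sketch runs without any model behind it.
REQUIRED = ["RED", "CUBE", "LEFT"]


def toy_generate(prompt: str) -> str:
    return prompt.upper()


def toy_verify(task: str, output: str) -> Tuple[float, str]:
    missing = [w for w in REQUIRED if w not in output]
    score = 1 - len(missing) / len(REQUIRED)
    return score, ("ok" if not missing else "missing: " + ", ".join(missing))


def toy_refine(prompt: str, feedback: str, context: str) -> str:
    # Address only the first reported gap, so the loop needs several rounds.
    first_missing = feedback.split(": ", 1)[1].split(", ")[0]
    return f"{prompt} {first_missing.lower()}"


if __name__ == "__main__":
    memory = TrajectoryMemory()
    best = agent_loop("a cube", toy_generate, toy_verify, toy_refine, memory)
    print(best.output, best.score, memory.summaries)
```

With these toy functions the loop needs three rounds to satisfy the verifier, and by then the first round has been folded into a one-line summary while the two most recent steps stay in raw form. A skill would be registered as, say, `registry.register("layout", lambda: "...")` and read only when a task calls `registry.load("layout")`; how GEMS itself selects and injects skills is not described in this record.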

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Code availability is flagged in the production record; the…

WHY NOW

Multimodal Generation Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.




Competitive landscape

Segment: Multimodal Generation Agents

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/gems-agent-native-multimodal-generation-with-memory-and-skills#webpage", "url": "https://sciencetostartup.com/paper/gems-agent-native-multimodal-generation-with-memory-and-skills", "name": "GEMS: Agent-Native Multimodal Generation with Memory and Skills", "description": "GEMS is an agent-native multimodal generation framework that enhances foundational models with structured multi-agent loops, hierarchical memory, and domain-specific skills to achieve significant performance gains on complex and specialized tasks.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/gems-agent-native-multimodal-generation-with-memory-and-skills#scholarlyArticle", "headline": "GEMS: Agent-Native Multimodal Generation with Memory and Skills", "description": "GEMS is an agent-native multimodal generation framework that enhances foundational models with structured multi-agent loops, hierarchical memory, and domain-specific skills to achieve significant performance gains on complex and specialized tasks.", "url": "https://sciencetostartup.com/paper/gems-agent-native-multimodal-generation-with-memory-and-skills", "sameAs": "https://arxiv.org/abs/2603.28088", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.28088" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-30T06:42:55.000Z", "author": [ { "@type": "Person", "name": "Zefeng He" }, { "@type": "Person", "name": "Siyuan Huang" }, { "@type": "Person", "name": "Xiaoye Qu" }, { "@type": "Person", "name": "Yafu Li" }, { "@type": "Person", "name": "Tong Zhu" }, { "@type": "Person", "name": "Yu Cheng" }, { "@type": "Person", "name": "Yang Yang" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 
}, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Multimodal Generation Agents" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Multimodal Generation Agents", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "GEMS: Agent-Native Multimodal Generation with Memory and Ski", "item": "https://sciencetostartup.com/paper/gems-agent-native-multimodal-generation-with-memory-and-skills" } ] } ] }

