ARXIV:2605.15198 · VISUAL REASONING · SUBMITTED 15 MAY · 20:12 UTC · FRESHNESS FRESH

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo · Rain Liu · Xinyan Chen · Pheng-Ann Heng · arXiv

A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization.

Ship in 2-4 weeks · Score 6.0 · Evidence unverified

Opportunity summary

Pain: A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified

PROBLEM

A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization. A straightforward approach is to directly generate images via…

METHOD

Full abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
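The abstract describes LA-GRPO only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' formulation. It combines the standard GRPO group-normalised advantage with an auxiliary term applied only at functional-token positions and scaled by a constant weight. The token names (`<zoom>`, `<crop>`), the weight value, the choice of a negative log-likelihood anchor, and the restriction to positive-advantage samples are all hypothetical; clipping and KL regularisation are omitted for brevity.

```python
from statistics import mean, pstdev

# Hypothetical functional tokens: ordinary vocabulary entries (names invented here).
FUNCTIONAL_TOKENS = {"<zoom>", "<crop>"}
ANCHOR_WEIGHT = 0.5  # static (constant) weight; the value is an assumption


def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalisation within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_loss(tokens, logprobs, advantage, anchor_weight=ANCHOR_WEIGHT):
    """Simplified policy-gradient loss plus an assumed anchor on functional tokens."""
    # Policy term: advantage-weighted mean log-probability over all tokens.
    policy = -advantage * mean(logprobs)
    # Collect log-probabilities at functional-token positions only.
    func_lp = [lp for tok, lp in zip(tokens, logprobs) if tok in FUNCTIONAL_TOKENS]
    # Assumed anchor: extra negative log-likelihood on functional tokens,
    # applied only to samples that beat the group average.
    anchor = -mean(func_lp) if func_lp and advantage > 0 else 0.0
    return policy + anchor_weight * anchor


def group_loss(samples, rewards):
    """Average the per-sequence loss over one sampled group."""
    advantages = group_advantages(rewards)
    return mean(
        sequence_loss(tokens, logprobs, adv)
        for (tokens, logprobs), adv in zip(samples, advantages)
    )


if __name__ == "__main__":
    samples = [
        (["look", "<zoom>", "answer"], [-0.7, -2.3, -0.4]),
        (["look", "answer"], [-0.6, -1.1]),
    ]
    print(group_loss(samples, rewards=[1.0, 0.0]))
```

Because functional tokens are rare in a rollout, their contribution to a mean over all tokens is small; the separate anchor term is one way to give those positions a larger gradient share, which matches the motivation stated in the abstract but not necessarily its exact mechanism.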

RESULT

ScienceToStartup currently rates this 6.0/10 on the public viability pass. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. A public repository is linked, so build…

WHY NOW

Visual Reasoning moved forward this cycle; last verified May 2026. Public score 6.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 6.0

Pain: A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported

Analysis summary

A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization.

Competitive landscape

A unified framework for visual reasoning that uses a single 'functional token' to represent both agentic operations and latent reasoning units, improving efficiency and generalization.

Segment: Visual Reasoning

Adoption evidence: Public code linked for build inspection

Commercial read: 6.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Source page: https://sciencetostartup.com/paper/atlas-agentic-or-latent-visual-reasoning-one-word-is-enough-for-both

arXiv: https://arxiv.org/abs/2605.15198

Published: 2026-05-14T17:59:55.000Z

Code repository: https://github.com/ZiyuGuo99/ATLAS
