ARXIV:2604.27776 · AGENTS · SUBMITTED 01 MAY · 15:04 UTC · FRESHNESS STALE

Verified Source: PDF linked | Verified PaperPack: citation fields available | Partial Proof: unverified proof status

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Jinchao Li · Yunxin Li · Chenrui Zhao · Zhenran Xu · Baotian Hu · Min Zhang · arXiv

A new benchmark for evaluating autonomous GUI agents on complex, cross-application professional workflows, revealing significant performance gaps in current leading models.

Ship in 2-4 weeks | Score 7.0 | Evidence unverified

Opportunity summary

Pain: A new benchmark for evaluating autonomous GUI agents on complex, cross-application professional workflows, revealing significant performance gaps in current leading models.

Evidence: 0 refs | 4 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

A new benchmark for evaluating autonomous GUI agents on complex, cross-application professional workflows, revealing significant performance gaps in current leading models. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish…

METHOD

Full abstract

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
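The abstract's headline findings (a low success rate on multi-application tasks, and agents stalling at early sub-goals) imply two per-task measurements: whether the whole task succeeded, and how many ordered sub-goals were passed before the first failure. The following is a minimal sketch of how such process-centric scores might be aggregated. The `TaskResult` record, its field names, and the sample values are hypothetical illustrations, not the WindowsWorld evaluator or its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the benchmark's real schema."""
    task_id: str
    num_apps: int                # number of applications the task spans
    subgoals_passed: list[bool]  # ordered intermediate checks


def leading_progress(checks):
    """Fraction of sub-goals passed before the first failure."""
    passed = 0
    for ok in checks:
        if not ok:
            break
        passed += 1
    return passed / len(checks) if checks else 0.0


def summarize(results):
    """Average both metrics separately for single- and multi-app tasks."""
    buckets = defaultdict(list)
    for r in results:
        buckets["multi" if r.num_apps > 1 else "single"].append(r)
    summary = {}
    for name, group in buckets.items():
        # A task counts as solved only if it has checks and all of them passed.
        solved = sum(bool(r.subgoals_passed) and all(r.subgoals_passed) for r in group)
        progress = sum(leading_progress(r.subgoals_passed) for r in group)
        summary[name] = {
            "success_rate": solved / len(group),
            "avg_progress": progress / len(group),
        }
    return summary


# Made-up results for illustration only.
demo = [
    TaskResult("t1", 1, [True, True]),
    TaskResult("t2", 3, [True, False, False, False]),
    TaskResult("t3", 2, [True, True, True, True]),
    TaskResult("t4", 4, [False, False, False, False, False]),
]
summary = summarize(demo)
print(summary)
```

Under this sketch, low values of both `success_rate` and `avg_progress` in the `multi` bucket would correspond to the "stalling at early sub-goals" pattern the abstract reports, whereas a high `avg_progress` with a low `success_rate` would point to late failures instead.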

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: A new benchmark for evaluating autonomous GUI agents on complex, cross-application professional workflows, revealing significant performance gaps in current leading models.

Evidence: 0 refs | 4 sources | 67% coverage

Blocker: no shell-level blocker reported

Competitive landscape

A new benchmark for evaluating autonomous GUI agents on complex, cross-application professional workflows, revealing significant performance gaps in current leading models.

Segment: Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-04-30T12:13:27.000Z | arXiv: https://arxiv.org/abs/2604.27776 | Code: https://github.com/HITsz-TMG/WindowsWorld | Research domain: Agents | Viability score: 7
