ARXIV:2606.03103 · AGENTS · SUBMITTED 03 JUN · 20:33 UTC · FRESHNESS FRESH

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: partial proof status

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Wenkai Wang · Tao Xiong · Jingchen Ni · Yunpeng Bao · Xiyun Li · Tianqi Liu · +3 at arXiv

A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.

Ship in 2-4 weeks · Score 7.0 · Evidence partial

Opportunity summary

Pain A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.

Evidence 0 refs | 4 sources | 83% coverage

Blocker Evidence partial


PROBLEM

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront.

METHOD

Full abstract

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open-source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long-horizon workflow delivery and proactive clarification. We will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft.
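The interaction protocol described in the abstract can be pictured with a small data model. The following is a minimal sketch, not DeskCraft's actual code or API: every name here (`Phase`, `Initiator`, `Kind`, `Exchange`, `Episode`, `make_exchange`) is hypothetical, and the only facts taken from the abstract are the three exchange types, who initiates each, when each occurs, and the "over 50 execution steps" threshold for long-horizon tasks.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    MID_TURN = "mid_turn"    # exchange happens while the agent is still executing
    POST_TURN = "post_turn"  # exchange happens after the agent signals completion


class Initiator(Enum):
    AGENT = "agent"
    USER = "user"


class Kind(Enum):
    CLARIFICATION = "clarification"  # agent asks under uncertainty
    INTERRUPTION = "interruption"    # user interrupts during execution
    FEEDBACK = "feedback"            # user responds after completion


# Hypothetical lookup: the (phase, initiator) pair the abstract assigns to each kind.
ALLOWED = {
    Kind.CLARIFICATION: (Phase.MID_TURN, Initiator.AGENT),
    Kind.INTERRUPTION: (Phase.MID_TURN, Initiator.USER),
    Kind.FEEDBACK: (Phase.POST_TURN, Initiator.USER),
}


@dataclass(frozen=True)
class Exchange:
    kind: Kind
    phase: Phase
    initiator: Initiator
    message: str

    def __post_init__(self) -> None:
        # Reject combinations the abstract does not describe.
        if ALLOWED[self.kind] != (self.phase, self.initiator):
            raise ValueError(
                f"{self.kind.value} is not valid as "
                f"{self.phase.value}/{self.initiator.value}"
            )


def make_exchange(kind: Kind, message: str) -> Exchange:
    """Build an exchange with the phase and initiator implied by its kind."""
    phase, initiator = ALLOWED[kind]
    return Exchange(kind, phase, initiator, message)


@dataclass
class Episode:
    instruction: str
    steps: int = 0          # execution steps taken so far
    done: bool = False      # True once the agent has signalled completion
    log: List[Exchange] = field(default_factory=list)

    def record(self, exchange: Exchange) -> None:
        # Mid-turn exchanges only make sense before completion,
        # post-turn exchanges only after it.
        if exchange.phase is Phase.MID_TURN and self.done:
            raise ValueError("mid-turn exchange after completion")
        if exchange.phase is Phase.POST_TURN and not self.done:
            raise ValueError("post-turn exchange before completion")
        self.log.append(exchange)

    def is_long_horizon(self, threshold: int = 50) -> bool:
        # The abstract describes long-horizon tasks as requiring over 50 steps.
        return self.steps > threshold
```

In this sketch, an episode would log an agent clarification while running, flip `done` when the agent signals completion, and only then accept user feedback. How the real benchmark represents or scores these exchanges is not stated in the abstract.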

RESULT

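ScienceToStartup currently rates this 7.0/10 on the public viability pass. In the paper's evaluation of 18 proprietary and open-source agents on 538 tasks, GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. The authors state that they will open-source all evaluation code, tasks, and data at https://github.com/mrwwk/DeskCraft. A public repository is linked, so build verification can inspect implementation evidence instead…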

WHY NOW

Agents moved forward this cycle; last verified June 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.


Opportunity summary

Score 7.0

Pain A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.

Evidence 0 refs | 4 sources | 83% coverage

Blocker no shell-level blocker reported

Analysis summary

A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.


Competitive landscape

A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.

Segment: Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Authors: Wenkai Wang, Tao Xiong, Jingchen Ni, Yunpeng Bao, Xiyun Li, Tianqi Liu, Hongcan Guo, Zilong Huang, Shengyu Zhang

Published: 2026-06-02T03:42:34.000Z

arXiv: https://arxiv.org/abs/2606.03103

Code: https://github.com/mrwwk/DeskCraft

Research domain: Agents

