ARXIV:2605.15777 · AGENTS · SUBMITTED 18 MAY · 20:28 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Kean Shi · Zihang Li · Tianyi Ma · Zengji Tu · Jialong Wu · Xinbo Xu · +10 at arXiv

SaaS-Bench is a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems. It reveals significant limitations in current LLM-based agents.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: SaaS-Bench, a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems, revealing significant limitations in current LLM-based agents.

Evidence: 0 refs | 4 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

SaaS-Bench is a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems, and it reveals significant limitations in current LLM-based agents. However, existing web and GUI agent benchmarks often rely on simplified settings,…

METHOD

Full abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess the capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
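The abstract mentions weighted verification checkpoints that report both strict completion and partial progress, but it does not give the scoring formula. The sketch below shows one plausible reading, assuming partial progress is the weight-normalized share of passed checkpoints and strict completion requires every checkpoint to pass. The `Checkpoint` class, the function names, and the example task are hypothetical and are not taken from the SaaS-Bench repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One verification checkpoint for a task (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def partial_progress(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of checkpoints passed, in [0, 1] (assumed formula)."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    return sum(c.weight for c in checkpoints if c.passed) / total


def strict_completion(checkpoints: list[Checkpoint]) -> bool:
    """True only when the task has checkpoints and every one of them passed."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Illustrative task: two of three checkpoints pass, carrying 3 of 4 weight units.
example = [
    Checkpoint("invoice created", 2.0, True),
    Checkpoint("customer record updated", 1.0, True),
    Checkpoint("notification sent", 1.0, False),
]

if __name__ == "__main__":
    print(partial_progress(example))
    print(strict_completion(example))
```

Under this reading, an agent can earn substantial partial credit on a long-horizon task while still counting as a failure on the strict metric, which is consistent with the gap the abstract describes between partial progress and end-to-end completion.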

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 7.0

Pain: SaaS-Bench, a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems, revealing significant limitations in current LLM-based agents.

Evidence: 0 refs | 4 sources | 67% coverage

Blocker: no shell-level blocker reported

Analysis summary

SaaS-Bench, a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems, revealing significant limitations in current LLM-based agents.

Competitive landscape

SaaS-Bench, a benchmark for evaluating computer-using agents on realistic professional workflows across 23 SaaS systems, revealing significant limitations in current LLM-based agents.

Segment: Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Published: 2026-05-15 09:35 UTC · arXiv: https://arxiv.org/abs/2605.15777 · Free to access

Authors: Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Code repository: https://github.com/UniPat-AI/SaaS-Bench
