ARXIV:2603.10577 · VISION-LANGUAGE MODELS · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors | Proof: unverified proof status (Partial)

CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents

arXiv

CUAAudit leverages Vision-Language Models to autonomously evaluate the performance of Computer-Use Agents in real-world environments.

Blocked on Code | Score 5.0 | Evidence unverified

Opportunity summary

Pain CUAAudit leverages Vision-Language Models to autonomously evaluate the performance of Computer-Use Agents in real-world environments.

Evidence 0 refs | 0 sources | 17% coverage

Blocker Evidence unverified

PROBLEM

CUAAudit leverages Vision-Language Models to autonomously evaluate the performance of Computer-Use Agents in real-world environments. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable…

METHOD

Full abstract

Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
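The three dimensions named in the abstract (accuracy, calibration, inter-model agreement) can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the paper's implementation: the abstract does not say which calibration measure or agreement statistic the authors use, so expected calibration error (ECE) and Cohen's kappa are swapped in here as common choices. All function names, the auditor names, and the toy verdicts are hypothetical.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of auditor verdicts (1 = success, 0 = failure) matching ground truth."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def expected_calibration_error(confs, preds, labels, n_bins=10):
    """ECE: bin verdicts by stated confidence, then take the size-weighted
    average gap between mean confidence and empirical accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, p, y in zip(confs, preds, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 lands in the last bin
        bins[idx].append((c, p == y))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            bin_acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(mean_conf - bin_acc)
    return ece


def cohen_kappa(a, b):
    """Chance-corrected agreement between two auditors on binary verdicts."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_exp == 1.0:
        return 1.0  # both auditors always give the same constant verdict
    return (p_obs - p_exp) / (1 - p_exp)


def pairwise_kappa(verdicts):
    """Cohen's kappa for every pair of auditors, keyed by (name_1, name_2)."""
    return {
        (m1, m2): cohen_kappa(verdicts[m1], verdicts[m2])
        for m1, m2 in combinations(sorted(verdicts), 2)
    }


if __name__ == "__main__":
    # Toy data (hypothetical): ground-truth task outcomes and two auditors' verdicts.
    labels = [1, 0, 1, 1, 0, 0]
    verdicts = {
        "auditor_a": [1, 0, 1, 0, 0, 0],
        "auditor_b": [1, 1, 1, 1, 0, 1],
    }
    confs_a = [0.9, 0.8, 0.7, 0.6, 0.9, 0.8]  # auditor_a's stated confidence per verdict

    print("accuracy:", accuracy(verdicts["auditor_a"], labels))
    print("ECE:", expected_calibration_error(confs_a, verdicts["auditor_a"], labels))
    print("kappa:", pairwise_kappa(verdicts))
```

Reporting agreement separately from accuracy matters for the abstract's last finding: two auditors can each score well against ground truth and still disagree with each other on a meaningful share of tasks, which a chance-corrected statistic makes visible.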

RESULT

ScienceToStartup currently rates this 5.0/10 on the public viability pass. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and…

WHY NOW

Vision-Language Models moved forward this cycle; last verified April 2026. Public score 5.0/10.

Opportunity summary

Score 5.0

Pain CUAAudit leverages Vision-Language Models to autonomously evaluate the performance of Computer-Use Agents in real-world environments.

Evidence 0 refs | 0 sources | 17% coverage

Blocker missing authors

Competitive landscape

Segment: Vision-Language Models
Adoption evidence: No public code link in the paper record yet
Commercial read: 5.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

Published: 2026-03-11 · arXiv: https://arxiv.org/abs/2603.10577
