Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

PROBLEM

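Existing evaluations often cannot tell whether a phone-use agent that avoids harm is actually safe or simply unable to act. The paper introduces a benchmark and evaluation framework that separates true safety from mere inability to act, so that each kind of failure can be targeted with the right fix.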

METHOD

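PhoneSafety is a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and records whether the model takes the safe action, takes the unsafe action, or fails to do anything useful. Eight representative phone-use agents are evaluated under this framework.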

Full abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
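The three-way question posed in the abstract can be sketched as a small labeling function. This is a minimal illustration, not the paper's scoring code: the string action format, the per-instance sets of safe and unsafe actions, and the names `Outcome` and `label_decision` are all assumptions made for the example.

```python
from enum import Enum


class Outcome(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"
    NO_USEFUL_ACTION = "no_useful_action"


def label_decision(predicted_action, safe_actions, unsafe_actions):
    """Map one predicted next action to a three-way outcome.

    Hypothetical scheme: actions are plain strings and each instance carries
    annotated sets of safe and unsafe actions. Anything else, including None
    for output that could not be parsed, counts as no useful action.
    """
    if predicted_action in safe_actions:
        return Outcome.SAFE
    if predicted_action in unsafe_actions:
        return Outcome.UNSAFE
    return Outcome.NO_USEFUL_ACTION


# Hypothetical instance: a payment confirmation screen.
safe = {"tap:cancel", "ask_user"}
unsafe = {"tap:confirm_payment"}

print(label_decision("tap:cancel", safe, unsafe))
print(label_decision("tap:confirm_payment", safe, unsafe))
print(label_decision(None, safe, unsafe))
```

Keeping the third label separate from the first is the point of the framework: a model that returns nothing usable avoids the harmful tap, but it is not counted as safe.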

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. When a phone-use agent avoids harm, does that show safety, or simply inability to act? A public repository is linked, so build verification can…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.

Paper Pack

10.48550/arXiv.2605.07630

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Parse run linked

A document parse run is attached to this paper.

Proof status

unverified

0 refs; 4 sources; 83% coverage.

What was readable

linked | on file | 17 anchors | derived fallback | not indexed | not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

7.0

Time to MVP

MVP estimate missing

Commercial

code, repo url

Claim map

Abstract-backed public claims while anchored extraction refreshes.

Strong 0 | Mixed 0 | Weak 4

Claim 1
Evidence (partial): A new benchmark and evaluation framework to distinguish true safety in phone-use agents from mere inability to act, enabling targeted improvements. Existing evaluations often cannot tell.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Claim 2
Evidence (partial): When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Claim 3
Evidence (partial): ScienceToStartup currently rates this 7.0/10 on the public viability pass. When a phone-use agent avoids harm, does that show safety, or simply inability to act? A public repository is linked, so build verification can inspect implementation evidence instead of treating the paper as PDF-only.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

Claim 4
Evidence (partial): Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.
Implication (partial): Abstract-backed fallback claim; anchored extraction has not materialized a public claim row yet.
Verification: partial

REFERENCES

Reference metadata is not materialized in the public index yet. The source PDF remains the authority; cache refresh is optional.

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

Prior Work: AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

7.0

Prior Work: SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

7.0

Prior Work: Models That Know How Evaluations Are Designed Score Safer

7.0

Prior Work: ClawSafety: "Safe" LLMs, Unsafe Agents

Related Resources

AI research agents(glossary)
Agents(glossary)
TransportAgents(glossary)
What is the future of AI agents according to Nothing's CEO?(question)
How do LLM efficiency advancements impact the development of AI agents?(question)
How does AgentXRay contribute to the explainability of AI agents in complex decision-making processes?(question)
Agents – Use Cases(use_case)
AI Agents – Use Cases(use_case)

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "97dde4d0-a83c-4973-bbae-99c3fe42450b",
  "arxiv_id": "2605.07630",
  "canonical_route": "/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
  "endpoints": {
    "paper_pack": "/api/v1/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents/paper-pack",
    "build_passport": "/api/v1/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
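The two REST endpoints in this payload appear to be the canonical route prefixed with `/api/v1` and suffixed with the resource name. The sketch below encodes that observation; the pattern is inferred from this one payload rather than from any published API contract, and `derive_endpoints` is a hypothetical helper name.

```python
def derive_endpoints(canonical_route: str) -> dict:
    """Build the paper-pack and build-passport paths from a canonical route.

    Assumption: the endpoints follow "/api/v1" + route + "/<resource>",
    as they do in the payload preview for this paper.
    """
    if not canonical_route.startswith("/paper/"):
        raise ValueError(f"unexpected route: {canonical_route!r}")
    base = "/api/v1" + canonical_route
    return {
        "paper_pack": base + "/paper-pack",
        "build_passport": base + "/build-passport",
    }


route = "/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents"
endpoints = derive_endpoints(route)
print(endpoints["paper_pack"])
print(endpoints["build_passport"])
```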

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents

stale

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-05-11
Score updated: 2026-05-11
Score fresh until: 2026-06-10
References: 0
Source count: 4
Coverage: 83%
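The dates above imply a score freshness window that can be checked with simple date arithmetic. This is a sketch under assumptions: the window length is derived from the two listed dates rather than stated by the page, the inclusive end date is a guess, and `score_is_fresh` is a hypothetical helper. It covers only the score window, not the separate "Proof freshness: stale" flag.

```python
from datetime import date

score_updated = date(2026, 5, 11)
score_fresh_until = date(2026, 6, 10)

# Length of the score freshness window implied by the two dates.
window_days = (score_fresh_until - score_updated).days
print(window_days)  # 30


def score_is_fresh(today: date, fresh_until: date = score_fresh_until) -> bool:
    """Treat the score as fresh up to and including the listed end date.

    Assumption: the end date is inclusive; the page does not say.
    """
    return today <= fresh_until


print(score_is_fresh(date(2026, 5, 20)))  # True
print(score_is_fresh(date(2026, 6, 11)))  # False
```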

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Canonical ID safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents | Route /paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents
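The same request can be made from Python with the standard library. This is a minimal sketch: the URL pattern is taken from the curl example, while the assumption that the endpoint returns JSON is not confirmed by this page. `handoff_url` and `fetch_handoff` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "https://sciencetostartup.com"


def handoff_url(slug: str) -> str:
    """Build the agent-handoff URL for a paper slug (pattern from the curl example)."""
    return f"{BASE_URL}/api/v1/agent-handoff/paper/{slug}"


def fetch_handoff(slug: str, timeout: float = 10.0):
    """Fetch and decode the handoff payload.

    Assumption: the response body is JSON; adjust if the endpoint
    returns another content type.
    """
    with urllib.request.urlopen(handoff_url(slug), timeout=timeout) as resp:
        return json.load(resp)


slug = "safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents"
print(handoff_url(slug))
```

Calling `fetch_handoff(slug)` performs the network request; it is left out of the example so the block runs offline.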

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2605.07630"
  }
}
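A client might assemble this tool call programmatically and check the identifier before sending it. The sketch below only builds the payload shown above; the validation pattern follows the standard new-style arXiv identifier format (YYMM.NNNNN, optionally with a version suffix), and whether the `get_paper` tool accepts versioned identifiers is an assumption. `build_get_paper_call` is a hypothetical helper.

```python
import json
import re

# New-style arXiv identifiers: four digits, a dot, four or five digits,
# and an optional version suffix such as "v1".
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def build_get_paper_call(arxiv_id: str) -> dict:
    """Return the MCP tool-call payload for a given arXiv identifier."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    return {"tool": "get_paper", "arguments": {"arxiv_id": arxiv_id}}


call = build_get_paper_call("2605.07630")
print(json.dumps(call, indent=2))
```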

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents",
  "normalized_query": "2605.07630",
  "route": "/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
  "paper_ref": "safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Ready for execution: Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

/buildability/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents

Build Now | ready

Subject: Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Verdict

Build Now

Verdict is Build Now because viability and implementation proof cleared the Wave 1 scaffold thresholds.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

Source Proof anchors

Visual citations from the paper document graph.

17 anchors

proof block | Page 4 | 68%

This equation captures one of the core mathematical components of the system.

\[
\text{Safe-action rate} = \frac{N_{\text{safe}}}{N}, \qquad
\text{Unsafe-action rate} = \frac{N_{\text{unsafe}}}{N}, \qquad
\text{No-useful-action rate} = \frac{N_{\text{no-useful}}}{N}.
\]

Page and bbox are available; crop image is pending.

proof block | Page 4 | 68%

This equation captures one of the core mathematical components of the system.

\[
\text{Safe-action rate} + \text{Unsafe-action rate} + \mathrm{CFR} = 1, \qquad
1 - \mathrm{CFR} = \frac{N_{\text{safe}} + N_{\text{unsafe}}}{N}.
\]

Page and bbox are available; crop image is pending.
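The rate definitions and the identity above can be checked numerically. This sketch uses a made-up list of ten outcome labels, not benchmark data, and it treats CFR as the no-useful-action rate, which the identity implies but this page never states outright. `outcome_rates` is a hypothetical helper.

```python
import math
from collections import Counter


def outcome_rates(labels):
    """Compute the three rates from a list of per-instance outcome labels.

    Expected labels: "safe", "unsafe", "no_useful". CFR is taken to be the
    no-useful-action rate (an assumption inferred from the identity).
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one labeled instance")
    counts = Counter(labels)
    return {
        "safe_rate": counts["safe"] / n,
        "unsafe_rate": counts["unsafe"] / n,
        "cfr": counts["no_useful"] / n,
    }


# Toy labels for illustration only: 5 safe, 3 unsafe, 2 no useful action.
labels = ["safe"] * 5 + ["unsafe"] * 3 + ["no_useful"] * 2
rates = outcome_rates(labels)

# The three rates partition the instances, so they sum to 1.
assert math.isclose(rates["safe_rate"] + rates["unsafe_rate"] + rates["cfr"], 1.0)
# 1 - CFR is the share of instances where the model took any scorable action.
assert math.isclose(1 - rates["cfr"], (5 + 3) / len(labels))
print(rates)
```

Reporting `1 - cfr` next to the safe-action rate shows how much of a low unsafe rate comes from choosing safely and how much from not acting at all.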

proof block | Page 8 | 68%

Figure residue (plot not extracted). Axes: General Phone-Use SR (%) and Safe-Action Rate (%). Panel (c): General SR vs. 1−CFR (ρ = 0.922). Panel (d): Strict vs. minimal (ΔCFR = 0.0).

Page and bbox are available; crop image is pending.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents#webpage",
      "url": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
      "name": "Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents",
      "description": "A new benchmark and evaluation framework to distinguish true safety in phone-use agents from mere inability to act, enabling targeted improvements.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents#scholarlyArticle",
      "headline": "Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents",
      "description": "A new benchmark and evaluation framework to distinguish true safety in phone-use agents from mere inability to act, enabling targeted improvements.",
      "url": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents",
      "sameAs": "https://arxiv.org/abs/2605.07630",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2605.07630"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-05-08T11:58:57.000Z",
      "author": [
        {
          "@type": "Person",
          "name": "Zhengyang Tang"
        },
        {
          "@type": "Person",
          "name": "Yi Zhang"
        },
        {
          "@type": "Person",
          "name": "Chenxin Li"
        },
        {
          "@type": "Person",
          "name": "Xin Lai"
        },
        {
          "@type": "Person",
          "name": "Pengyuan Lyu"
        },
        {
          "@type": "Person",
          "name": "Yiduo Guo"
        },
        {
          "@type": "Person",
          "name": "Weinong Wang"
        },
        {
          "@type": "Person",
          "name": "Junyi Li"
        },
        {
          "@type": "Person",
          "name": "Yang Ding"
        },
        {
          "@type": "Person",
          "name": "Huawen Shen"
        },
        {
          "@type": "Person",
          "name": "Zhengyao Fang"
        },
        {
          "@type": "Person",
          "name": "Xingran Zhou"
        },
        {
          "@type": "Person",
          "name": "Liang Wu"
        },
        {
          "@type": "Person",
          "name": "Fei Tang"
        },
        {
          "@type": "Person",
          "name": "Sunqi Fan"
        },
        {
          "@type": "Person",
          "name": "Shangpin Peng"
        },
        {
          "@type": "Person",
          "name": "Zheng Ruan"
        },
        {
          "@type": "Person",
          "name": "Anran Zhang"
        },
        {
          "@type": "Person",
          "name": "Benyou Wang"
        },
        {
          "@type": "Person",
          "name": "Chengquan Zhang"
        },
        {
          "@type": "Person",
          "name": "Han Hu"
        }
      ],
      "codeRepository": "https://github.com/tangzhy/PhoneSafety",
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 7
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Agents"
        },
        {
          "@type": "PropertyValue",
          "propertyID": "commercialReadiness",
          "value": "code, repo url"
        }
      ]
    },
    {
      "@type": "SoftwareSourceCode",
      "@id": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents#software",
      "name": "Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents - Source Code",
      "description": "A new benchmark and evaluation framework to distinguish true safety in phone-use agents from mere inability to act, enabling targeted improvements.",
      "codeRepository": "https://github.com/tangzhy/PhoneSafety",
      "url": "https://github.com/tangzhy/PhoneSafety"
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Agents",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "Safe, or Simply Incapable? Rethinking Safety Evaluation for ",
          "item": "https://sciencetostartup.com/paper/safe-or-simply-incapable-rethinking-safety-evaluation-for-phone-use-agents"
        }
      ]
    }
  ]
}
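An agent consuming this JSON-LD would typically pick nodes out of `@graph` by `@type`. The sketch below does that on a trimmed excerpt of the payload above (first two authors only, most fields omitted); `nodes_of_type` is a hypothetical helper, and only the standard `json` module is used.

```python
import json

# Trimmed excerpt of the JSON-LD payload; values copied from the page.
JSON_LD_EXCERPT = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage",
     "name": "Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents"},
    {"@type": "ScholarlyArticle",
     "identifier": {"@type": "PropertyValue", "propertyID": "arXiv", "value": "2605.07630"},
     "author": [
       {"@type": "Person", "name": "Zhengyang Tang"},
       {"@type": "Person", "name": "Yi Zhang"}
     ],
     "codeRepository": "https://github.com/tangzhy/PhoneSafety"}
  ]
}
"""


def nodes_of_type(doc: dict, type_name: str) -> list:
    """Return every node in the @graph array whose @type matches."""
    return [node for node in doc.get("@graph", []) if node.get("@type") == type_name]


doc = json.loads(JSON_LD_EXCERPT)
article = nodes_of_type(doc, "ScholarlyArticle")[0]
authors = [person["name"] for person in article["author"]]

print(article["identifier"]["value"])
print(authors)
print(article["codeRepository"])
```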

2/3 checks · 67%

References: 0 / 3 (3+ references)
Sources: 4 / 2 (2+ sources)
Coverage: 83% / 50% (50%+ coverage)
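The "2/3 checks · 67%" summary is consistent with three minimum-threshold checks. The sketch below reproduces the arithmetic, reading the thresholds as 3+ references, 2+ sources, and 50%+ coverage; that reading of the fused labels, and the rule that a check passes when the observed value meets its minimum, are assumptions.

```python
# (observed, minimum) per check, read from the summary lines above.
checks = {
    "references": (0, 3),
    "sources": (4, 2),
    "coverage_pct": (83, 50),
}

# A check passes when the observed value meets or exceeds its minimum.
results = {name: observed >= minimum for name, (observed, minimum) in checks.items()}
passed = sum(results.values())
pass_pct = round(100 * passed / len(checks))

print(results)
print(f"{passed}/{len(checks)} checks, {pass_pct}%")
```

Under this reading the only failing check is the reference count, which matches the "References: 0" line in the freshness panel.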

Build Passport

UNVERIFIED

Build passport pending. Proof Lab budget: no verified cost estimate / $7.00 cap

status

missing

reason

passport_row_missing

proof status

unverified

cost/budget

No verified cost estimate

confidence low

next verification path

/api/v1/paper/2605.07630v1/paper-pack
/api/v1/paper/2605.07630v1/build-passport
Generate required assets: Dockerfile.minimal, RUN.sh, EXPECTED_OUTPUT.json, COST_PASSPORT.json

Build artifacts

Brief

missing

Build brief missing until Build Passport data exists.

Source missing: Build Passport payload.

Experiment plan

missing

Experiment plan missing until prototype path is available.

No prototype path attached.

Validation checklist

missing

Validation checklist missing until required assets, cost, and regulatory flags are verified.

No checklist artifact is attached to the Build Passport payload.

Signal Canvas

Derived signals show verified:false until source-backed receipts exist.

Open Signal Canvas

Columns: Signal | Source | Strength | Freshness | Owner action

Signal: Evidence coverage
Source: OpportunityKernel evidence_receipt
Strength: 0 refs / 4 sources / 83% coverage
Freshness: stale
Owner action: Verify missing sources before using this as buyer proof. verified:false

Signal: Build readiness
Source: BuildPassport EvidenceState
Strength: passport absent
Freshness: stale
Owner action: Run Proof Lab or inspect typed missing state. verified:false

Signal: Artifact maturity
Source: GitHub and Hugging Face maturity payloads
Strength: No public artifact surface observed
Freshness: stale
Owner action: Open source artifacts or mark the gap as missing. verified:false

Viability breakdown

decision rows

Dimension | Current read | Evidence | Gaps | Next test

Technical feasibility

partial

Current read

Runnable path is not fully verified.

Evidence

No Build Passport payload attached.

Gaps

No verified reproduction transcript attached.

Next test

Run minimal reproduction from the Build Passport prototype path.

Market urgency

missing

Current read

Buyer urgency is not verified from source.

Talent buckets

no invented people

Scientific founder

Needed now

No named scientific founder assigned.

Paper authors are not treated as operators without consent.

People

No named person assigned.

Gaps

Founder commitment not verified.

Next verification path

Confirm founder/operator owner.

Translational engineer

Needed now

Prototype owner missing.

Build Passport does not name an implementer.

People

No named person assigned.

ARTIFACTS

No public artifacts yet.

DEFENSIBILITY

Defensibility and confidence evidence pending.
