SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Route status: degraded

Proof freshness: stale
Proof status: failed
Display score: 8/10
Last proof check: 2026-03-19
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 33%

This page has proof data, but the latest verification did not complete cleanly.

Agent Handoff

Canonical ID: skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
Route: /signal-canvas/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
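The same request can be sketched in Python. This is a minimal sketch that reuses only the base URL and canonical ID shown in the curl command above. The helper names `handoff_url` and `fetch_handoff` are hypothetical, and because this page does not document the response format, the body is returned as raw text (UTF-8 is assumed) rather than parsed.

```python
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(canonical_id: str) -> str:
    """Build the agent-handoff URL for a canonical ID (hypothetical helper)."""
    return f"{BASE}/{quote(canonical_id, safe='')}"


def fetch_handoff(canonical_id: str, timeout: float = 10.0) -> str:
    """GET the handoff resource and return the raw response body.

    The response schema is not documented on this page, so nothing is parsed.
    UTF-8 decoding is an assumption.
    """
    with urllib.request.urlopen(handoff_url(canonical_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


PAPER_ID = "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"

# Building the URL needs no network access; call fetch_handoff(PAPER_ID) to send the request.
print(handoff_url(PAPER_ID))
```

Keeping URL construction separate from the network call lets the route be checked offline before any request is sent.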

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
    "query_text": "Summarize SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks"
  }
}

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
  "normalized_query": "2602.12670",
  "route": "/signal-canvas/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "paper_ref": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: degraded

Claims: 8

References: Pending verification

Proof: Verification pending

Freshness state: stale

Source paper: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

PDF: https://arxiv.org/pdf/2602.12670v1

Source count: Pending verification

Coverage: 33%

Last proof check: 2026-03-19T21:31:49.672Z
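The `normalized_query` value in source_context (`2602.12670`) appears to be the arXiv identifier of the PDF listed above with its version suffix dropped. A minimal sketch of that consistency check follows. It assumes, without confirmation from this page, that normalization only strips the `vN` suffix; `arxiv_id` is a hypothetical helper, not part of any Signal Canvas API.

```python
import re
from typing import Optional, Tuple

# Matches new-style arXiv identifiers (YYMM.NNNNN) in abs/ or pdf/ URLs,
# with an optional version suffix such as "v1".
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?")


def arxiv_id(url: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (identifier, version) from an arXiv URL, or None if absent."""
    match = ARXIV_RE.search(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Values copied from this page.
pdf_url = "https://arxiv.org/pdf/2602.12670v1"
normalized_query = "2602.12670"

base_id, version = arxiv_id(pdf_url)
print(base_id == normalized_query, version)  # True v1
```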

Signal Canvas receipt window

Watch and verify: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

/buildability/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Subject: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8 | Mixed: 0 | Weak: 0

Claim 1
Evidence (partial): We test 7 agent-model configurations over 7,308 trajectories
Implication (partial): Specific numbers provided in abstract indicating comprehensive evaluation
Verification: partial

Claim 2
Evidence (partial): Curated Skills raise average pass rate by 16.2 percentage points (pp)
Implication (partial): Explicitly stated in abstract with specific numeric result
Verification: partial

Claim 3
Evidence (partial): effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)
Implication (partial): Specific domain-level performance differences with exact numbers provided in abstract
Verification: partial

Claim 4
Evidence (partial): Self-generated Skills provide no benefit on average
Implication (partial): Directly stated in abstract with clear conclusion
Verification: partial

Claim 5
Evidence (partial): 16 of 84 tasks show negative deltas
Implication (partial): Specific count provided in abstract indicating limitations
Verification: partial

Claim 6
Evidence (partial): Focused Skills with 2-3 modules outperform comprehensive documentation
Implication (partial): Directly stated in abstract but without specific performance numbers
Verification: partial

Claim 7
Evidence (partial): smaller models with Skills can match larger models without them
Implication (partial): Directly stated in abstract but without specific model comparisons
Verification: partial

Claim 8
Evidence (partial): SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers
Implication (partial): Explicitly stated in abstract with specific counts
Verification: partial

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
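The gating rule above can be sketched as a simple predicate. This is an illustrative sketch, not the site's implementation: the `ProofReceipt` type and both function names are hypothetical, and reading "clears 50% coverage" as "at least 50%" is an assumption (the page does not say whether the threshold is inclusive). The sample values are the ones reported on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProofReceipt:
    """Hypothetical container for the fields the gating rule mentions."""
    verified: bool
    references: int
    sources: int
    coverage_pct: float


def unmet_conditions(receipt: ProofReceipt) -> List[str]:
    """List every gating condition the receipt fails."""
    unmet = []
    if not receipt.verified:
        unmet.append("proof receipt not verified")
    if receipt.references < 3:
        unmet.append("fewer than 3 references")
    if receipt.sources < 2:
        unmet.append("fewer than 2 sources")
    if receipt.coverage_pct < 50.0:  # assumption: threshold is inclusive
        unmet.append("coverage below 50%")
    return unmet


def panels_visible(receipt: ProofReceipt) -> bool:
    """Panels show only when all four conditions hold."""
    return not unmet_conditions(receipt)


# Values reported on this page: proof failed, 0 references, 0 sources, 33% coverage.
current = ProofReceipt(verified=False, references=0, sources=0, coverage_pct=33.0)
print(panels_visible(current))
for reason in unmet_conditions(current):
    print("-", reason)
```

With the values shown on this page, all four conditions are unmet, which is consistent with the panels being hidden.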
