ARXIV:2602.12670 · AGENTS · SUBMITTED 19 MAR · 21:31 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: 3 of 4 citation fields filled (Partial) | Missing fields: authors (Missing) | Proof: failed (Error)

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Q: What products could be built from this research?

To productize SkillsBench, one could develop a SaaS platform offering a customizable set of Skills tailored to enhance various AI applications in industry-specific workflows, leveraging the benchmark's results for validation and improvement.

Q: What are the practical use cases?

One practical use case is an enterprise AI toolkit that recommends and customizes procedural Skills to optimize AI agent performance in specific domains such as healthcare or software engineering.

Q: What industries could this research disrupt?

SkillsBench could disrupt the AI model evaluation space by setting a new standard for assessing augmentation strategies, shifting focus from raw model capabilities to the strategic enhancement of tasks via skills.

arXiv

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Blocked on Code | Score 8.0 | Evidence failed

Opportunity summary

Pain: SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: Evidence failed

PROBLEM

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance. Despite rapid adoption of Agent Skills, there is no standard way to measure whether they actually help.

METHOD

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help.

Full abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
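
The headline figures in the abstract are differences between pass rates under two conditions, expressed in percentage points, plus a count of tasks where the difference is negative. A minimal sketch of that arithmetic follows, assuming per-task pass rates are available as fractions. The task names and values are invented for illustration and are not results from the paper; `delta_pp` and `summarize` are hypothetical helper names.

```python
from statistics import mean

# Hypothetical per-task pass rates (fraction of trajectories that passed).
# These numbers are illustrative only; they are NOT taken from SkillsBench.
no_skills = {"task_a": 0.40, "task_b": 0.10, "task_c": 0.75}
curated = {"task_a": 0.65, "task_b": 0.50, "task_c": 0.70}


def delta_pp(baseline: float, treated: float) -> float:
    """Difference between two pass rates, in percentage points (pp)."""
    return (treated - baseline) * 100.0


def summarize(baseline: dict[str, float], treated: dict[str, float]) -> dict:
    """Per-task deltas, their mean, and the tasks where Skills hurt."""
    deltas = {task: delta_pp(baseline[task], treated[task]) for task in baseline}
    return {
        "deltas_pp": deltas,
        "mean_delta_pp": mean(deltas.values()),
        "negative_tasks": sorted(t for t, d in deltas.items() if d < 0),
    }


summary = summarize(no_skills, curated)
print(round(summary["mean_delta_pp"], 1))
print(summary["negative_tasks"])
```

A percentage-point delta is an absolute difference, so a move from 10% to 50% is +40pp even though the relative change is far larger. By the same arithmetic, the abstract's 16 of 84 tasks with negative deltas is roughly 19% of the tasks counted.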

RESULT

ScienceToStartup currently rates this 8.0/10 on the public viability pass. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)…

WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 8.0/10.

Opportunity summary

Score: 8.0

Pain: SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Evidence: 0 refs | 0 sources | 33% coverage

Blocker: missing authors

Analysis summary

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Paper Pack

10.48550/arXiv.2602.12670

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Derived fallback

Read summaries are estimated from adjacent metadata, not verified extraction rows.

Proof status

failed

0 refs; 0 sources; 33% coverage.

What was readable

linked | on file | not materialized | 8 extracted | 40 indexed | not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

8.0

Time to MVP

MVP estimate missing

Commercial

No commercial flags on file

Claim map

Strong 8 | Mixed 0 | Weak 0

Evidence (partial): We test 7 agent-model configurations over 7,308 trajectories
Implication (partial): Specific numbers provided in abstract indicating comprehensive evaluation
Verification (partial): partial

Evidence (partial): Curated Skills raise average pass rate by 16.2 percentage points (pp)
Implication (partial): Explicitly stated in abstract with specific numeric result
Verification (partial): partial

Evidence (partial): effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)
Implication (partial): Specific domain-level performance differences with exact numbers provided in abstract
Verification (partial): partial

Evidence (partial): Self-generated Skills provide no benefit on average
Implication (partial): Directly stated in abstract with clear conclusion
Verification (partial): partial

Evidence (partial): 16 of 84 tasks show negative deltas
Implication (partial): Specific count provided in abstract indicating limitations
Verification (partial): partial

Evidence (partial): Focused Skills with 2-3 modules outperform comprehensive documentation
Implication (partial): Directly stated in abstract but without specific performance numbers
Verification (partial): partial

Evidence (partial): smaller models with Skills can match larger models without them
Implication (partial): Directly stated in abstract but without specific model comparisons
Verification (partial): partial

Evidence (partial): SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers
Implication (partial): Explicitly stated in abstract with specific counts
Verification (partial): partial

Constellation map

Paper-native neighborhood for concepts, methods, materials, markets, and competitors. Missing lanes stay labeled instead of disappearing behind commercialization gates.

Concepts

not indexed

Methods

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help.

Materials

PDF linked

Markets

Agents

Competitors

not indexed

Competitive landscape

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

8.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Buzz

No indexed public discussion is attached to 2602.12670 yet. That is a visibility signal, not a blank module: the monitor is watching the public channels below.

Hacker News

Not indexed yet

Bluesky

Not indexed yet

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

none indexed

Extension

Builds On This: SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

7.0

Builds On This: How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

7.0

Builds On This: SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

7.0

Builds On This: Counterfactual Trace Auditing of LLM Agent Skills

6.0

Builds On This: EvoSkill: Automated Skill Discovery for Multi-Agent Systems

7.0

Builds On This: Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

7.0

Commercially relevant

none indexed

Conflicting

Competing Approach: Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

4.0

Competing Approach: Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

7.0

Competing Approach: LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

5.0

Competing Approach: From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

3.0

Agent drawer

5 surfaces preserved for agents. Humans can ignore.

Developer contracts, payload previews, evidence maps, and run controls stay here instead of the Read, Build, and Track workspace.

Run context

Paper: 2602.12670
Route: /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
Active tab: read
Artifact: skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Available agents

Read extractor
Build planner
Track monitor
Competitive mapper
Related-paper scout

API/MCP endpoints

REST paper pack API: /api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/paper-pack
REST build passport API: /api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/build-passport
REST OpenAPI: /api/openapi.json
MCP descriptor: /api/mcp
MCP resource: sciencetostartup://surfaces/paper-workspace

Tool contracts

paper_pack, build_passport, opportunity_kernel, foresight, source_proof, evidence_state

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "afd4d1e7-34c6-4916-98d8-9999bf86f131",
  "arxiv_id": "2602.12670",
  "canonical_route": "/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "endpoints": {
    "paper_pack": "/api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/paper-pack",
    "build_passport": "/api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
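
The payload lists its REST endpoints as site-relative paths alongside an MCP resource URI with its own scheme. A minimal sketch of turning that payload into callable URLs follows. It assumes the host `https://sciencetostartup.com` from the REST example on this page; `resolve_endpoints` is a hypothetical helper, and the payload is trimmed to the fields the sketch uses.

```python
import json
from urllib.parse import urljoin

# Host taken from the REST example on this page; treat it as an assumption
# if the payload is served from a different origin.
BASE_URL = "https://sciencetostartup.com"

SLUG = "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"

# Trimmed copy of the payload preview above.
payload_text = json.dumps({
    "contract_version": "paper-r2",
    "arxiv_id": "2602.12670",
    "endpoints": {
        "paper_pack": f"/api/v1/paper/{SLUG}/paper-pack",
        "build_passport": f"/api/v1/paper/{SLUG}/build-passport",
        "mcp_resource": "sciencetostartup://surfaces/paper-workspace",
    },
})


def resolve_endpoints(raw: str, base_url: str = BASE_URL) -> dict[str, str]:
    """Turn relative REST paths into absolute URLs; leave other schemes alone."""
    payload = json.loads(raw)
    if payload.get("contract_version") != "paper-r2":
        raise ValueError("unexpected contract_version")
    resolved = {}
    for name, target in payload["endpoints"].items():
        # Paths starting with "/" are relative to the site; anything else
        # (such as the sciencetostartup:// MCP resource) is kept verbatim.
        resolved[name] = urljoin(base_url, target) if target.startswith("/") else target
    return resolved


endpoints = resolve_endpoints(payload_text)
for name, url in endpoints.items():
    print(f"{name}: {url}")
```

Checking `contract_version` before reading `endpoints` is a design choice of this sketch, so that a changed contract fails loudly instead of yielding silently wrong URLs.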

Schema validation

paper-r2 contract: present
JSON-LD twin: SSR emitted
OpenAPI path parity: /api/openapi.json
MCP resource parity: paper-workspace

Job trace

queued: drawer opened by user action
running: inspect or copy payload
succeeded: payload available in SSR
failed: route errors appear in evidence cards

Evidence map

sources used: page freshness, source proof anchors, JSON-LD
missing sources: exposed by PaperPack and EvidenceState chips
derived fallbacks: marked unverified before handoff

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

degraded

Proof freshness: stale
Proof status: failed
Display score: 8/10
Last proof check: 2026-03-19
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 33%

This page has proof data, but the latest verification did not complete cleanly.
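
The three dates above imply a freshness window for the score and a gap between the last proof check and the score update. A minimal sketch of that date arithmetic follows, using only the values listed on this page. Treating the "Score fresh until" date as inclusive is an assumption, and the page does not state how the proof's "stale" label is derived, so the sketch does not try to reproduce it.

```python
from datetime import date

# Values copied from the freshness fields above.
fields = {
    "last_proof_check": "2026-03-19",
    "score_updated": "2026-04-02",
    "score_fresh_until": "2026-05-02",
}

dates = {key: date.fromisoformat(value) for key, value in fields.items()}


def score_is_fresh(today: date) -> bool:
    """True while `today` is on or before the 'Score fresh until' date.

    The inclusive boundary is an assumption of this sketch.
    """
    return today <= dates["score_fresh_until"]


# Length of the score's freshness window, and how far the last proof check
# trails the score update.
window_days = (dates["score_fresh_until"] - dates["score_updated"]).days
proof_lag_days = (dates["score_updated"] - dates["last_proof_check"]).days

print(window_days, proof_lag_days)
print(score_is_fresh(date(2026, 4, 15)), score_is_fresh(date(2026, 6, 1)))
```

With these inputs the score window is 30 days and the proof check predates the score update by 14 days.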

OpenAlex: pending — this preprint is not yet indexed by OpenAlex.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Canonical ID skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks | Route /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
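
For callers that prefer Python over curl, the same GET request can be sketched with the standard library. The snippet only builds the request and does not send it. The `Accept` header is an assumption (the curl example sends none), `build_handoff_request` is a hypothetical helper, and the response shape is not documented on this page.

```python
from urllib.request import Request

BASE_URL = "https://sciencetostartup.com"
SLUG = "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"


def build_handoff_request(slug: str = SLUG) -> Request:
    """Build (but do not send) the GET request from the curl example."""
    url = f"{BASE_URL}/api/v1/agent-handoff/paper/{slug}"
    # The Accept header is an assumption; the curl example sends none.
    return Request(url, headers={"Accept": "application/json"}, method="GET")


request = build_handoff_request()
print(request.get_method(), request.full_url)

# To actually send it (requires network access; the response shape is not
# documented on this page, so inspect it before relying on any field):
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as response:
#     body = response.read().decode("utf-8")
```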

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2602.12670"
  }
}
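
The MCP example above names a tool and its arguments but not the message that carries them. In the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with method `tools/call`, so one plausible wire form can be sketched as below. Whether this server accepts exactly this envelope is an assumption; `to_tools_call` and the request id are hypothetical.

```python
import json

# The tool call exactly as shown in the MCP example above.
example = {"tool": "get_paper", "arguments": {"arxiv_id": "2602.12670"}}


def to_tools_call(call: dict, request_id: int = 1) -> dict:
    """Wrap a {tool, arguments} pair in an MCP `tools/call` JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": call["tool"], "arguments": call["arguments"]},
    }


message = to_tools_call(example)
print(json.dumps(message, indent=2))
```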

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
  "normalized_query": "2602.12670",
  "route": "/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "paper_ref": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Watch and verify: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

/buildability/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Subject: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.

Evidence ids

Receipt path

/buildability/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Paper ref

skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

arXiv id

2602.12670

Freshness

Generated at

2026-03-19T21:31:49.672Z

Evidence freshness

stale

Last verification

2026-03-19T21:31:49.672Z

Sources

References

Coverage

33%

Hash state

Lineage hash

7f7ad937cd8770455da8a47228aa2c08b9a6d6f6f357ea0da127fa6b24e5e60a

Canonical opportunity-kernel lineage hash.

Signature state

External signature

unsigned_external

No founder, registry, pilot, or production-adoption signature is attached to this receipt.

Verification

not_verified

Verification is blocked until an external signature is provided.

Blockers

Missing: repo_url
Missing: references
Missing: distribution_readiness_scores
Missing: paper_extraction_scorecards
Unknown: distribution readiness has not been computed yet

Verification pending / evidence receipt incomplete

repo_url

references

Missing proof, requirement, signature, approval, adoption, or telemetry fields are blockers and must not be inferred.
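
The rule that missing fields are blockers and must not be inferred can be sketched as a check that reports absent or empty fields and never substitutes a value. The four field names come from the "Missing:" blockers listed above. Treating them as the complete required set is an assumption, and `find_blockers` and the receipt shape are hypothetical.

```python
# Hypothetical field list: the four names come from the "Missing:" blockers
# above; treating them as the complete required set is an assumption.
REQUIRED_FIELDS = (
    "repo_url",
    "references",
    "distribution_readiness_scores",
    "paper_extraction_scorecards",
)


def find_blockers(receipt: dict) -> list[str]:
    """List required fields that are absent or empty, without guessing values."""
    blockers = []
    for field in REQUIRED_FIELDS:
        value = receipt.get(field)
        # None, "", [] and {} all count as missing; nothing is filled in.
        if value is None or value == "" or value == [] or value == {}:
            blockers.append(f"Missing: {field}")
    return blockers


# A receipt shaped like the one on this page: identifiers present, proof
# fields absent or empty.
receipt = {
    "arxiv_id": "2602.12670",
    "paper_ref": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
    "repo_url": None,
    "references": [],
}

blockers = find_blockers(receipt)
verdict_ready = not blockers
print(blockers)
print(verdict_ready)
```

The function only reports. It never derives a `repo_url` from the slug or back-fills `references` from another source, which is the behavior the rule above asks for.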

Source Proof anchors

Visual citations from the paper document graph.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks#webpage",
      "url": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
      "name": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
      "description": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks#scholarlyArticle",
      "headline": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
      "description": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.",
      "url": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
      "sameAs": "https://arxiv.org/abs/2602.12670",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2602.12670"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-02-13T07:06:06.000Z",
      "author": [
        {
          "@type": "Person",
          "name": "Xiangyi Li",
          "affiliation": {
            "@type": "Organization",
            "name": "BenchFlow"
          }
        },
        {
          "@type": "Person",
          "name": "Wenbo Chen",
          "affiliation": {
            "@type": "Organization",
            "name": "Amazon"
          }
        },
        {
          "@type": "Person",
          "name": "Yimin Liu",
          "affiliation": {
            "@type": "Organization",
            "name": "Ohio State University"
          }
        },
        {
          "@type": "Person",
          "name": "Shenghan Zheng",
          "affiliation": {
            "@type": "Organization",
            "name": "Dartmouth College"
          }
        },
        {
          "@type": "Person",
          "name": "Xiaokun Chen",
          "affiliation": {
            "@type": "Organization",
            "name": "Stanford University"
          }
        },
        {
          "@type": "Person",
          "name": "Yifeng He",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Davis"
          }
        },
        {
          "@type": "Person",
          "name": "Yubo Li",
          "affiliation": {
            "@type": "Organization",
            "name": "Carnegie Mellon University"
          }
        },
        {
          "@type": "Person",
          "name": "Bingran You",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Berkeley"
          }
        },
        {
          "@type": "Person",
          "name": "Haotian Shen",
          "affiliation": {
            "@type": "Organization",
            "name": "Independent"
          }
        },
        {
          "@type": "Person",
          "name": "Jiankai Sun"
        },
        {
          "@type": "Person",
          "name": "Shuyi Wang"
        },
        {
          "@type": "Person",
          "name": "Qunhong Zeng",
          "affiliation": {
            "@type": "Organization",
            "name": "Beijing Institute of Technology"
          }
        },
        {
          "@type": "Person",
          "name": "Di Wang",
          "affiliation": {
            "@type": "Organization",
            "name": "Foxconn"
          }
        },
        {
          "@type": "Person",
          "name": "Xuandong Zhao",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Berkeley"
          }
        },
        {
          "@type": "Person",
          "name": "Yuanli Wang",
          "affiliation": {
            "@type": "Organization",
            "name": "Boston University"
          }
        },
        {
          "@type": "Person",
          "name": "Roey Ben Chaim",
          "affiliation": {
            "@type": "Organization",
            "name": "Zenity"
          }
        },
        {
          "@type": "Person",
          "name": "Zonglin Di",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Santa Cruz"
          }
        },
        {
          "@type": "Person",
          "name": "Yipeng Gao",
          "affiliation": {
            "@type": "Organization",
            "name": "USC"
          }
        },
        {
          "@type": "Person",
          "name": "Junwei He",
          "affiliation": {
            "@type": "Organization",
            "name": "ByteDance"
          }
        },
        {
          "@type": "Person",
          "name": "Yizhuo He",
          "affiliation": {
            "@type": "Organization",
            "name": "Carnegie Mellon University"
          }
        },
        {
          "@type": "Person",
          "name": "Liqiang Jing",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Dallas"
          }
        },
        {
          "@type": "Person",
          "name": "Luyang Kong"
        },
        {
          "@type": "Person",
          "name": "Xin Lan",
          "affiliation": {
            "@type": "Organization",
            "name": "Michigan State University"
          }
        },
        {
          "@type": "Person",
          "name": "Jiachen Li",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Austin"
          }
        },
        {
          "@type": "Person",
          "name": "Songlin Li"
        },
        {
          "@type": "Person",
          "name": "Yijiang Li",
          "affiliation": {
            "@type": "Organization",
            "name": "UC San Diego"
          }
        },
        {
          "@type": "Person",
          "name": "Yueqian Lin",
          "affiliation": {
            "@type": "Organization",
            "name": "Duke University"
          }
        },
        {
          "@type": "Person",
          "name": "Xinyi Liu"
        },
        {
          "@type": "Person",
          "name": "Xuanqing Liu"
        },
        {
          "@type": "Person",
          "name": "Haoran Lyu"
        },
        {
          "@type": "Person",
          "name": "Ze Ma",
          "affiliation": {
            "@type": "Organization",
            "name": "Columbia University"
          }
        },
        {
          "@type": "Person",
          "name": "Bowei Wang"
        },
        {
          "@type": "Person",
          "name": "Runhui Wang"
        },
        {
          "@type": "Person",
          "name": "Tianyu Wang"
        },
        {
          "@type": "Person",
          "name": "Wengao Ye",
          "affiliation": {
            "@type": "Organization",
            "name": "University of Oxford"
          }
        },
        {
          "@type": "Person",
          "name": "Yue Zhang",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Dallas"
          }
        },
        {
          "@type": "Person",
          "name": "Hanwen Xing"
        },
        {
          "@type": "Person",
          "name": "Yiqi Xue",
          "affiliation": {
            "@type": "Organization",
            "name": "USC"
          }
        },
        {
          "@type": "Person",
          "name": "Steven Dillmann",
          "affiliation": {
            "@type": "Organization",
            "name": "Stanford University"
          }
        },
        {
          "@type": "Person",
          "name": "Han-chung Lee"
        }
      ],
      "citation": [
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "eeff713967bfd94da50e4cc8fda888b01f137e90"
          },
          "url": "https://www.semanticscholar.org/paper/eeff713967bfd94da50e4cc8fda888b01f137e90"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "cbdfe7c75676b6ba85f66bdffe162f5991d6f536"
          },
          "url": "https://www.semanticscholar.org/paper/cbdfe7c75676b6ba85f66bdffe162f5991d6f536"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "7c44b7fdcec2e517799f6c54f6ba42bf1a89d2e6"
          },
          "url": "https://www.semanticscholar.org/paper/7c44b7fdcec2e517799f6c54f6ba42bf1a89d2e6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "cb4b0bba67466c22bbc99bbf973dce5e1d9a48b6"
          },
          "url": "https://www.semanticscholar.org/paper/cb4b0bba67466c22bbc99bbf973dce5e1d9a48b6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "19430ba54cc873dc5061bb53601fc576486d5a3c"
          },
          "url": "https://www.semanticscholar.org/paper/19430ba54cc873dc5061bb53601fc576486d5a3c"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f2e0b3d6a02dac33872f0a0b42affdcf454715cb"
          },
          "url": "https://www.semanticscholar.org/paper/f2e0b3d6a02dac33872f0a0b42affdcf454715cb"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "70aa016c1f68fd5c0261f26ad20017b8307650af"
          },
          "url": "https://www.semanticscholar.org/paper/70aa016c1f68fd5c0261f26ad20017b8307650af"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "1c3c531fc0fbe79f97f367ed3648de8467caeeaa"
          },
          "url": "https://www.semanticscholar.org/paper/1c3c531fc0fbe79f97f367ed3648de8467caeeaa"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "ff3e4f7c2481fb6df539f02be5945235101cbc19"
          },
          "url": "https://www.semanticscholar.org/paper/ff3e4f7c2481fb6df539f02be5945235101cbc19"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "53f4fb0e9972989194368faf288ff8e3cba5bd60"
          },
          "url": "https://www.semanticscholar.org/paper/53f4fb0e9972989194368faf288ff8e3cba5bd60"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "94a5f96308729e31c1ffbc0f0618db87795092fe"
          },
          "url": "https://www.semanticscholar.org/paper/94a5f96308729e31c1ffbc0f0618db87795092fe"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "700bd9681f1b9e9e2212e10415d27b11c7e6836b"
          },
          "url": "https://www.semanticscholar.org/paper/700bd9681f1b9e9e2212e10415d27b11c7e6836b"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2069aaaa281eb13bcd9330fc4d43f24f6b436a53"
          },
          "url": "https://www.semanticscholar.org/paper/2069aaaa281eb13bcd9330fc4d43f24f6b436a53"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e4bb1b1f97711a7634bf4bff72c56891be2222e6"
          },
          "url": "https://www.semanticscholar.org/paper/e4bb1b1f97711a7634bf4bff72c56891be2222e6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "5dbf93a68b7fda600521f046dea35ea8ba9e884f"
          },
          "url": "https://www.semanticscholar.org/paper/5dbf93a68b7fda600521f046dea35ea8ba9e884f"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0bfc804e31eecfd77f45e4ee7f4d629fffdcd628"
          },
          "url": "https://www.semanticscholar.org/paper/0bfc804e31eecfd77f45e4ee7f4d629fffdcd628"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e41482f4ee984f17382f6cdd900df094d928be06"
          },
          "url": "https://www.semanticscholar.org/paper/e41482f4ee984f17382f6cdd900df094d928be06"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f94c040b02bdd6cf1b85f374e3912630c66861c3"
          },
          "url": "https://www.semanticscholar.org/paper/f94c040b02bdd6cf1b85f374e3912630c66861c3"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f197bf0fc2f228483f6af3285000d54d8d97f9eb"
          },
          "url": "https://www.semanticscholar.org/paper/f197bf0fc2f228483f6af3285000d54d8d97f9eb"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2f3822eb380b5e753a6d579f31dfc3ec4c4a0820"
          },
          "url": "https://www.semanticscholar.org/paper/2f3822eb380b5e753a6d579f31dfc3ec4c4a0820"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 8
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Agents"
        }
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Agents",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "SkillsBench: Benchmarking How Well Agent Skills Work Across ",
          "item": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"
        }
      ]
    },
    {
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "What is the startup potential of \"SkillsBench: Benchmarking How Well Agent Skills Work Across \"?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance."
          }
        },
        {
          "@type": "Question",
          "name": "What products could be built from this research?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "To productize SkillsBench, one could develop a SaaS platform offering a customizable set of Skills tailored to enhance various AI applications in industry-specific workflows, leveraging the benchmark's results for validation and improvement."
          }
        },
        {
          "@type": "Question",
          "name": "What are the practical use cases?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "An enterprise AI toolkit that recommends and customizes procedural Skills for optimizing AI agent performance in specific domains like healthcare or software engineering."
          }
        },
        {
          "@type": "Question",
          "name": "What industries could this research disrupt?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "SkillsBench could disrupt the AI model evaluation space by setting a new standard for assessing augmentation strategies, shifting focus from raw model capabilities to the strategic enhancement of tasks via skills."
          }
        }
      ]
    }
  ]
}

Paper Pack

10.48550/arXiv.2602.12670

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Abstract

Source availability

PDF linked

The paper record includes a public PDF URL.

Extraction status

Derived fallback

Read summaries are estimated from adjacent metadata, not verified extraction rows.

Proof status

failed

0 refs; 0 sources; 33% coverage.

What was readable

linked | on file | not materialized | 8 extracted | 40 indexed | not indexed

Derived fallback: Estimated from adjacent evidence; not verified from source.

Viability

8.0

Time to MVP

MVP estimate missing

Commercial

No commercial flags on file

PROBLEM

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance. Despite rapid adoption, there is no standard way to measure whether they actually help.

METHOD

SkillsBench pairs 86 tasks across 11 domains with curated Skills and deterministic verifiers. Each task is evaluated under three conditions (no Skills, curated Skills, and self-generated Skills) across 7 agent-model configurations and 7,308 trajectories.

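RESULT

Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average.
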
WHY NOW

Agents moved forward this cycle; last verified April 2026. Public score 8.0/10.

Claim map

Strong 8 | Mixed 0 | Weak 0

Evidence (partial): We test 7 agent-model configurations over 7,308 trajectories
Implication (partial): Specific numbers provided in abstract indicating comprehensive evaluation
Verification: partial

Evidence (partial): Curated Skills raise average pass rate by 16.2 percentage points (pp)
Implication (partial): Explicitly stated in abstract with specific numeric result
Verification: partial

Evidence (partial): effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare)
Implication (partial): Specific domain-level performance differences with exact numbers provided in abstract
Verification: partial

Evidence (partial): Self-generated Skills provide no benefit on average
Implication (partial): Directly stated in abstract with clear conclusion
Verification: partial

Evidence (partial): 16 of 84 tasks show negative deltas
Implication (partial): Specific count provided in abstract indicating limitations
Verification: partial

Evidence (partial): Focused Skills with 2-3 modules outperform comprehensive documentation
Implication (partial): Directly stated in abstract but without specific performance numbers
Verification: partial

Evidence (partial): smaller models with Skills can match larger models without them
Implication (partial): Directly stated in abstract but without specific model comparisons
Verification: partial

Evidence (partial): SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers
Implication (partial): Explicitly stated in abstract with specific counts
Verification: partial
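
The claims above are stated in percentage points (pp) and task counts. As a minimal sketch of that arithmetic, the snippet below computes a pass-rate delta in pp and the share of tasks with a negative delta. The 12/40 and 19/40 counts are hypothetical values chosen for illustration only, not figures from the paper; the 16 and 84 come from the abstract.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of attempted trajectories."""
    return 100.0 * passed / total


def delta_pp(rate_with_skills: float, rate_without_skills: float) -> float:
    """Difference between two pass rates, in percentage points (not a relative change)."""
    return rate_with_skills - rate_without_skills


# Hypothetical counts for one agent-model configuration (illustration only).
baseline = pass_rate(12, 40)   # no Skills
curated = pass_rate(19, 40)    # curated Skills
example_delta = delta_pp(curated, baseline)

# Share of tasks with a negative delta, using the counts quoted in the abstract.
negative_share = round(100.0 * 16 / 84, 1)

print(f"hypothetical delta: {example_delta:+.1f}pp")
print(f"tasks with negative delta: {negative_share}%")
```

A percentage-point delta subtracts the two rates directly, so a move from 30.0% to 47.5% is +17.5pp regardless of the baseline.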

Constellation map

Paper-native neighborhood for concepts, methods, materials, markets, and competitors. Missing lanes stay labeled instead of disappearing behind commercialization gates.

Concepts

not indexed

Methods

Benchmark of 86 tasks across 11 domains with curated Skills and deterministic verifiers, evaluated under three conditions: no Skills, curated Skills, and self-generated Skills.

Materials

PDF linked

Markets

Agents

Competitors

not indexed

Competitive landscape

SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.

Segment

Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

8.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified

Buzz

No indexed public discussion is attached to 2602.12670 yet. That is a visibility signal, not a blank module: the monitor is watching the public channels below.

Hacker News

Not indexed yet

Bluesky

Not indexed yet

CITED BY

No citing papers are indexed in the public S2S graph yet. This is an explicit zero-signal state, not a hidden lookup.

Foundation

none indexed

Extension

Builds On This | SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?

7.0

Builds On This | How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

7.0

Builds On This | SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

7.0

Builds On This | Counterfactual Trace Auditing of LLM Agent Skills

6.0

Builds On This | EvoSkill: Automated Skill Discovery for Multi-Agent Systems

7.0

Builds On This | Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

7.0

Commercially relevant

none indexed

Conflicting

Competing Approach | Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

4.0

Competing Approach | Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

7.0

Competing Approach | LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges

5.0

Competing Approach | From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

3.0

Agent drawer

5 surfaces preserved for agents. Humans can ignore.

Developer contracts, payload previews, evidence maps, and run controls stay here instead of the Read, Build, and Track workspace.

Run context

Paper: 2602.12670
Route: /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
Active tab: read
Artifact: skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Available agents

Read extractor
Build planner
Track monitor
Competitive mapper
Related-paper scout

API/MCP endpoints

REST paper pack API: /api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/paper-pack
REST build passport API: /api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/build-passport
REST OpenAPI: /api/openapi.json
MCP descriptor: /api/mcp
MCP resource: sciencetostartup://surfaces/paper-workspace

Tool contracts

paper_pack | build_passport | opportunity_kernel | foresight | source_proof | evidence_state

Payload preview

{
  "contract_version": "paper-r2",
  "paper_id": "afd4d1e7-34c6-4916-98d8-9999bf86f131",
  "arxiv_id": "2602.12670",
  "canonical_route": "/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "active_tab": "synced from current hash by the drawer client",
  "selected_artifact": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "endpoints": {
    "paper_pack": "/api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/paper-pack",
    "build_passport": "/api/v1/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace"
  }
}
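
The payload lists its HTTP endpoints as site-relative paths next to a non-HTTP MCP resource URI. A minimal sketch of how a client might turn those into absolute URLs follows. The base URL is taken from the REST example on this page; the helper name `absolute_endpoints` and the rule "anything starting with `/` is an HTTP path" are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Site origin as it appears in the REST example on this page.
BASE_URL = "https://sciencetostartup.com"

SLUG = "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"

# The "endpoints" object copied from the payload preview.
endpoints = {
    "paper_pack": f"/api/v1/paper/{SLUG}/paper-pack",
    "build_passport": f"/api/v1/paper/{SLUG}/build-passport",
    "mcp_resource": "sciencetostartup://surfaces/paper-workspace",
}


def absolute_endpoints(endpoints: dict, base_url: str = BASE_URL) -> dict:
    """Resolve site-relative paths against base_url; leave other URIs untouched."""
    resolved = {}
    for name, target in endpoints.items():
        if target.startswith("/"):
            resolved[name] = urljoin(base_url, target)
        else:
            # Hypothetical rule: anything else (e.g. the MCP resource URI) passes through.
            resolved[name] = target
    return resolved


for name, url in absolute_endpoints(endpoints).items():
    print(f"{name}: {url}")
```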

Schema validation

paper-r2 contract: present
JSON-LD twin: SSR emitted
OpenAPI path parity: /api/openapi.json
MCP resource parity: paper-workspace

Job trace

queued: drawer opened by user action
running: inspect or copy payload
succeeded: payload available in SSR
failed: route errors appear in evidence cards

Evidence map

sources used: page freshness, source proof anchors, JSON-LD
missing sources: exposed by PaperPack and EvidenceState chips
derived fallbacks: marked unverified before handoff

Page Freshness

Canonical route, proof status, last verified, refs, sources, and coverage.

Paper proof surface

Canonical route: /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

degraded

Proof freshness: stale
Proof status: failed
Display score: 8/10
Last proof check: 2026-03-19
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 33%
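
The dates above imply a fixed validity window for the score. The sketch below computes that window and checks whether a given day falls inside it. Treating the window as inclusive on both ends is an assumption, since the page does not say how "fresh until" is evaluated, and the helper name `score_is_fresh` is hypothetical. Proof freshness is tracked separately and is already reported as stale.

```python
from datetime import date

# Values copied from the freshness fields above.
freshness = {
    "last_proof_check": date(2026, 3, 19),
    "score_updated": date(2026, 4, 2),
    "score_fresh_until": date(2026, 5, 2),
}


def score_is_fresh(today: date, record: dict = freshness) -> bool:
    """True if today lies within the score window (assumed inclusive on both ends)."""
    return record["score_updated"] <= today <= record["score_fresh_until"]


# Length of the score window in days.
score_window_days = (freshness["score_fresh_until"] - freshness["score_updated"]).days

print(f"score window: {score_window_days} days")
print(f"fresh on 2026-04-15: {score_is_fresh(date(2026, 4, 15))}")
print(f"fresh on 2026-05-03: {score_is_fresh(date(2026, 5, 3))}")
```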

This page has proof data, but the latest verification did not complete cleanly.

OpenAlex: pending — this preprint is not yet indexed by OpenAlex.

Agent Handoff

Endpoint list, payload shape, route context, and copyable handoff data.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Canonical ID skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks | Route /paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks
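
For callers that prefer Python over curl, the same request can be sketched with the standard library. The URL pattern is copied from the curl line above. The `Accept` header, the timeout, and the helper names `handoff_url` and `fetch_handoff` are assumptions, and the response is returned as raw text because this page does not document its shape.

```python
from urllib.request import Request, urlopen

BASE_URL = "https://sciencetostartup.com"
SLUG = "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"


def handoff_url(slug: str, base_url: str = BASE_URL) -> str:
    """Build the agent-handoff URL for a paper slug, matching the curl example."""
    return f"{base_url}/api/v1/agent-handoff/paper/{slug}"


def fetch_handoff(slug: str, timeout: float = 10.0) -> str:
    """GET the handoff document and return the response body as text.

    The Accept header is an assumption; the response format is not documented here.
    """
    request = Request(handoff_url(slug), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is printed here; call fetch_handoff(SLUG) to perform the request.
print(handoff_url(SLUG))
```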

MCP example

{
  "tool": "get_paper",
  "arguments": {
    "arxiv_id": "2602.12670"
  }
}
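
The example above shows only the tool name and its arguments. Assuming the server follows the standard Model Context Protocol framing, where tool invocations are JSON-RPC 2.0 requests with method `tools/call`, the full request body might be built as in this sketch. The helper name `mcp_tool_call` and the request id are hypothetical, and the transport used by `/api/mcp` is not specified on this page.

```python
import json


def mcp_tool_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Wrap a tool name and arguments in a JSON-RPC 2.0 `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Tool name and argument copied from the MCP example above.
request = mcp_tool_call("get_paper", {"arxiv_id": "2602.12670"})
print(json.dumps(request, indent=2))
```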

source_context

{
  "surface": "paper",
  "mode": "paper",
  "query": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
  "normalized_query": "2602.12670",
  "route": "/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "paper_ref": "skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Buildability Receipt

Verdict, compute envelope, blockers, signature state, and receipt links.

Paper proof page receipt window

Watch and verify: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

/buildability/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Subject: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

Time to first demo

Insufficient data

No first-demo timestamp, owner estimate, or elapsed demo receipt is attached to this surface.

Compute envelope

Structured compute envelope

Insufficient data

No data, compute, hardware, memory, latency, dependency, or serving requirement receipt is attached.

Evidence ids

Receipt path

/buildability/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

Paper ref

skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks

arXiv id

2602.12670

Freshness

Generated at

2026-03-19T21:31:49.672Z

Evidence freshness

stale

Last verification

2026-03-19T21:31:49.672Z

Sources

References

Coverage

33%

Hash state

Lineage hash

7f7ad937cd8770455da8a47228aa2c08b9a6d6f6f357ea0da127fa6b24e5e60a

Canonical opportunity-kernel lineage hash.

Signature state

External signature

unsigned_external

No founder, registry, pilot, or production-adoption signature is attached to this receipt.

Verification

not_verified

Verification is blocked until an external signature is provided.

Blockers

Missing: repo_url
Missing: references
Missing: distribution_readiness_scores
Missing: paper_extraction_scorecards
Unknown: distribution readiness has not been computed yet

Verification pending / evidence receipt incomplete

repo_url

references

Missing proof, requirement, signature, approval, adoption, or telemetry fields are blockers and must not be inferred.

Source Proof anchors

Visual citations from the paper document graph.

JSON-LD twin

The application/ld+json payload rendered for agents.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "WebPage",
      "@id": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks#webpage",
      "url": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
      "name": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
      "description": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.",
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      }
    },
    {
      "@type": "ScholarlyArticle",
      "@id": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks#scholarlyArticle",
      "headline": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
      "description": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance.",
      "url": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks",
      "sameAs": "https://arxiv.org/abs/2602.12670",
      "identifier": {
        "@type": "PropertyValue",
        "propertyID": "arXiv",
        "value": "2602.12670"
      },
      "isAccessibleForFree": true,
      "isPartOf": {
        "@id": "https://sciencetostartup.com/#website"
      },
      "datePublished": "2026-02-13T07:06:06.000Z",
      "author": [
        {
          "@type": "Person",
          "name": "Xiangyi Li",
          "affiliation": {
            "@type": "Organization",
            "name": "BenchFlow"
          }
        },
        {
          "@type": "Person",
          "name": "Wenbo Chen",
          "affiliation": {
            "@type": "Organization",
            "name": "Amazon"
          }
        },
        {
          "@type": "Person",
          "name": "Yimin Liu",
          "affiliation": {
            "@type": "Organization",
            "name": "Ohio State University"
          }
        },
        {
          "@type": "Person",
          "name": "Shenghan Zheng",
          "affiliation": {
            "@type": "Organization",
            "name": "Dartmouth College"
          }
        },
        {
          "@type": "Person",
          "name": "Xiaokun Chen",
          "affiliation": {
            "@type": "Organization",
            "name": "Stanford University"
          }
        },
        {
          "@type": "Person",
          "name": "Yifeng He",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Davis"
          }
        },
        {
          "@type": "Person",
          "name": "Yubo Li",
          "affiliation": {
            "@type": "Organization",
            "name": "Carnegie Mellon University"
          }
        },
        {
          "@type": "Person",
          "name": "Bingran You",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Berkeley"
          }
        },
        {
          "@type": "Person",
          "name": "Haotian Shen",
          "affiliation": {
            "@type": "Organization",
            "name": "Independent"
          }
        },
        {
          "@type": "Person",
          "name": "Jiankai Sun"
        },
        {
          "@type": "Person",
          "name": "Shuyi Wang"
        },
        {
          "@type": "Person",
          "name": "Qunhong Zeng",
          "affiliation": {
            "@type": "Organization",
            "name": "Beijing Institute of Technology"
          }
        },
        {
          "@type": "Person",
          "name": "Di Wang",
          "affiliation": {
            "@type": "Organization",
            "name": "Foxconn"
          }
        },
        {
          "@type": "Person",
          "name": "Xuandong Zhao",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Berkeley"
          }
        },
        {
          "@type": "Person",
          "name": "Yuanli Wang",
          "affiliation": {
            "@type": "Organization",
            "name": "Boston University"
          }
        },
        {
          "@type": "Person",
          "name": "Roey Ben Chaim",
          "affiliation": {
            "@type": "Organization",
            "name": "Zenity"
          }
        },
        {
          "@type": "Person",
          "name": "Zonglin Di",
          "affiliation": {
            "@type": "Organization",
            "name": "UC Santa Cruz"
          }
        },
        {
          "@type": "Person",
          "name": "Yipeng Gao",
          "affiliation": {
            "@type": "Organization",
            "name": "USC"
          }
        },
        {
          "@type": "Person",
          "name": "Junwei He",
          "affiliation": {
            "@type": "Organization",
            "name": "ByteDance"
          }
        },
        {
          "@type": "Person",
          "name": "Yizhuo He",
          "affiliation": {
            "@type": "Organization",
            "name": "Carnegie Mellon University"
          }
        },
        {
          "@type": "Person",
          "name": "Liqiang Jing",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Dallas"
          }
        },
        {
          "@type": "Person",
          "name": "Luyang Kong"
        },
        {
          "@type": "Person",
          "name": "Xin Lan",
          "affiliation": {
            "@type": "Organization",
            "name": "Michigan State University"
          }
        },
        {
          "@type": "Person",
          "name": "Jiachen Li",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Austin"
          }
        },
        {
          "@type": "Person",
          "name": "Songlin Li"
        },
        {
          "@type": "Person",
          "name": "Yijiang Li",
          "affiliation": {
            "@type": "Organization",
            "name": "UC San Diego"
          }
        },
        {
          "@type": "Person",
          "name": "Yueqian Lin",
          "affiliation": {
            "@type": "Organization",
            "name": "Duke University"
          }
        },
        {
          "@type": "Person",
          "name": "Xinyi Liu"
        },
        {
          "@type": "Person",
          "name": "Xuanqing Liu"
        },
        {
          "@type": "Person",
          "name": "Haoran Lyu"
        },
        {
          "@type": "Person",
          "name": "Ze Ma",
          "affiliation": {
            "@type": "Organization",
            "name": "Columbia University"
          }
        },
        {
          "@type": "Person",
          "name": "Bowei Wang"
        },
        {
          "@type": "Person",
          "name": "Runhui Wang"
        },
        {
          "@type": "Person",
          "name": "Tianyu Wang"
        },
        {
          "@type": "Person",
          "name": "Wengao Ye",
          "affiliation": {
            "@type": "Organization",
            "name": "University of Oxford"
          }
        },
        {
          "@type": "Person",
          "name": "Yue Zhang",
          "affiliation": {
            "@type": "Organization",
            "name": "UT Dallas"
          }
        },
        {
          "@type": "Person",
          "name": "Hanwen Xing"
        },
        {
          "@type": "Person",
          "name": "Yiqi Xue",
          "affiliation": {
            "@type": "Organization",
            "name": "USC"
          }
        },
        {
          "@type": "Person",
          "name": "Steven Dillmann",
          "affiliation": {
            "@type": "Organization",
            "name": "Stanford University"
          }
        },
        {
          "@type": "Person",
          "name": "Han-chung Lee"
        }
      ],
      "citation": [
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "eeff713967bfd94da50e4cc8fda888b01f137e90"
          },
          "url": "https://www.semanticscholar.org/paper/eeff713967bfd94da50e4cc8fda888b01f137e90"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "cbdfe7c75676b6ba85f66bdffe162f5991d6f536"
          },
          "url": "https://www.semanticscholar.org/paper/cbdfe7c75676b6ba85f66bdffe162f5991d6f536"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "7c44b7fdcec2e517799f6c54f6ba42bf1a89d2e6"
          },
          "url": "https://www.semanticscholar.org/paper/7c44b7fdcec2e517799f6c54f6ba42bf1a89d2e6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "cb4b0bba67466c22bbc99bbf973dce5e1d9a48b6"
          },
          "url": "https://www.semanticscholar.org/paper/cb4b0bba67466c22bbc99bbf973dce5e1d9a48b6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "19430ba54cc873dc5061bb53601fc576486d5a3c"
          },
          "url": "https://www.semanticscholar.org/paper/19430ba54cc873dc5061bb53601fc576486d5a3c"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f2e0b3d6a02dac33872f0a0b42affdcf454715cb"
          },
          "url": "https://www.semanticscholar.org/paper/f2e0b3d6a02dac33872f0a0b42affdcf454715cb"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "70aa016c1f68fd5c0261f26ad20017b8307650af"
          },
          "url": "https://www.semanticscholar.org/paper/70aa016c1f68fd5c0261f26ad20017b8307650af"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "1c3c531fc0fbe79f97f367ed3648de8467caeeaa"
          },
          "url": "https://www.semanticscholar.org/paper/1c3c531fc0fbe79f97f367ed3648de8467caeeaa"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "ff3e4f7c2481fb6df539f02be5945235101cbc19"
          },
          "url": "https://www.semanticscholar.org/paper/ff3e4f7c2481fb6df539f02be5945235101cbc19"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "53f4fb0e9972989194368faf288ff8e3cba5bd60"
          },
          "url": "https://www.semanticscholar.org/paper/53f4fb0e9972989194368faf288ff8e3cba5bd60"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "94a5f96308729e31c1ffbc0f0618db87795092fe"
          },
          "url": "https://www.semanticscholar.org/paper/94a5f96308729e31c1ffbc0f0618db87795092fe"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "700bd9681f1b9e9e2212e10415d27b11c7e6836b"
          },
          "url": "https://www.semanticscholar.org/paper/700bd9681f1b9e9e2212e10415d27b11c7e6836b"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2069aaaa281eb13bcd9330fc4d43f24f6b436a53"
          },
          "url": "https://www.semanticscholar.org/paper/2069aaaa281eb13bcd9330fc4d43f24f6b436a53"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e4bb1b1f97711a7634bf4bff72c56891be2222e6"
          },
          "url": "https://www.semanticscholar.org/paper/e4bb1b1f97711a7634bf4bff72c56891be2222e6"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "5dbf93a68b7fda600521f046dea35ea8ba9e884f"
          },
          "url": "https://www.semanticscholar.org/paper/5dbf93a68b7fda600521f046dea35ea8ba9e884f"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "0bfc804e31eecfd77f45e4ee7f4d629fffdcd628"
          },
          "url": "https://www.semanticscholar.org/paper/0bfc804e31eecfd77f45e4ee7f4d629fffdcd628"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "e41482f4ee984f17382f6cdd900df094d928be06"
          },
          "url": "https://www.semanticscholar.org/paper/e41482f4ee984f17382f6cdd900df094d928be06"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f94c040b02bdd6cf1b85f374e3912630c66861c3"
          },
          "url": "https://www.semanticscholar.org/paper/f94c040b02bdd6cf1b85f374e3912630c66861c3"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "f197bf0fc2f228483f6af3285000d54d8d97f9eb"
          },
          "url": "https://www.semanticscholar.org/paper/f197bf0fc2f228483f6af3285000d54d8d97f9eb"
        },
        {
          "@type": "ScholarlyArticle",
          "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SemanticScholar",
            "value": "2f3822eb380b5e753a6d579f31dfc3ec4c4a0820"
          },
          "url": "https://www.semanticscholar.org/paper/2f3822eb380b5e753a6d579f31dfc3ec4c4a0820"
        }
      ],
      "additionalProperty": [
        {
          "@type": "PropertyValue",
          "propertyID": "viabilityScore",
          "value": 8
        },
        {
          "@type": "PropertyValue",
          "propertyID": "researchDomain",
          "value": "Agents"
        }
      ]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Home",
          "item": "https://sciencetostartup.com"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Agents",
          "item": "https://sciencetostartup.com/topics"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks",
          "item": "https://sciencetostartup.com/paper/skillsbench-benchmarking-how-well-agent-skills-work-across-diverse-tasks"
        }
      ]
    },
    {
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "What is the startup potential of \"SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks\"?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "SkillsBench evaluates the effectiveness of procedural Skills in boosting LLM agent task performance."
          }
        },
        {
          "@type": "Question",
          "name": "What products could be built from this research?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "To productize SkillsBench, one could develop a SaaS platform offering a customizable set of Skills tailored to enhance various AI applications in industry-specific workflows, leveraging the benchmark's results for validation and improvement."
          }
        },
        {
          "@type": "Question",
          "name": "What are the practical use cases?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "An enterprise AI toolkit that recommends and customizes procedural Skills for optimizing AI agent performance in specific domains like healthcare or software engineering."
          }
        },
        {
          "@type": "Question",
          "name": "What industries could this research disrupt?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "SkillsBench could disrupt the AI model evaluation space by setting a new standard for assessing augmentation strategies, shifting focus from raw model capabilities to the strategic enhancement of tasks via skills."
          }
        }
      ]
    }
  ]
}

0/3 checks · 0%

References 0 / 3 (3+ references)
Sources 0 / 2 (2+ sources)
Coverage 33% / 50% (50%+ coverage)
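
The three checks above appear to be simple threshold comparisons. A minimal Python sketch, assuming the thresholds are 3 references, 2 sources, and 50% coverage as the counters suggest; the `Evidence` class, the constants, and `evidence_checks` are hypothetical names for illustration, not part of the site's API:

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    references: int
    sources: int
    coverage: float  # fraction in [0, 1]


# Assumed thresholds, read off the counters on this page.
MIN_REFERENCES = 3
MIN_SOURCES = 2
MIN_COVERAGE = 0.50


def evidence_checks(ev: Evidence) -> tuple[int, int]:
    """Return (passed, total) for the three evidence checks."""
    results = [
        ev.references >= MIN_REFERENCES,
        ev.sources >= MIN_SOURCES,
        ev.coverage >= MIN_COVERAGE,
    ]
    return sum(results), len(results)


# Values reported for this paper: 0 refs, 0 sources, 33% coverage.
passed, total = evidence_checks(Evidence(references=0, sources=0, coverage=0.33))
print(f"{passed}/{total} checks · {100 * passed // total}%")  # prints "0/3 checks · 0%"
```

Under these assumed thresholds, 33% coverage alone does not pass any check, which matches the 0/3 result shown.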

Build Passport

UNVERIFIED

Build passport pending. Proof Lab budget: no verified cost estimate / $7.00 cap.

status: missing
reason: passport_row_missing
proof status: unverified
cost/budget: No verified cost estimate
confidence: low

next verification path

/api/v1/paper/2602.12670v1/paper-pack
/api/v1/paper/2602.12670v1/build-passport
Generate required assets: Dockerfile.minimal, RUN.sh, EXPECTED_OUTPUT.json, COST_PASSPORT.json
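
Before re-running the verification path, it may help to confirm which of the four required assets are still absent. A minimal Python sketch, assuming the assets sit at the top level of a local working directory; `missing_assets` is a hypothetical helper, and only the four file names come from this page:

```python
import tempfile
from pathlib import Path

# File names taken from the "required assets" line above.
REQUIRED_ASSETS = (
    "Dockerfile.minimal",
    "RUN.sh",
    "EXPECTED_OUTPUT.json",
    "COST_PASSPORT.json",
)


def missing_assets(root) -> list[str]:
    """List required assets that are not regular files directly under root."""
    root = Path(root)
    return [name for name in REQUIRED_ASSETS if not (root / name).is_file()]


# Demo in a throwaway directory that contains only RUN.sh.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "RUN.sh").write_text("#!/bin/sh\n")
    demo_missing = missing_assets(tmp)

print(demo_missing)
# ['Dockerfile.minimal', 'EXPECTED_OUTPUT.json', 'COST_PASSPORT.json']
```

An empty list would mean all four files are present; it says nothing about whether their contents are valid.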

Build artifacts

Brief

missing

Build brief missing until Build Passport data exists.

Source missing: Build Passport payload.

Experiment plan

missing

Experiment plan missing until prototype path is available.

No prototype path attached.

Validation checklist

missing

Validation checklist missing until required assets, cost, and regulatory flags are verified.

No checklist artifact is attached to the Build Passport payload.

Signal Canvas

Derived signals show verified:false until source-backed receipts exist.

Open Signal Canvas

Signal | Source | Strength | Freshness | Owner action

Evidence coverage | OpportunityKernel evidence_receipt | 0 refs / 0 sources / 33% coverage | stale | Verify missing sources before using this as buyer proof. verified:false

Build readiness | BuildPassport EvidenceState | passport absent | stale | Run Proof Lab or inspect typed missing state. verified:false

Artifact maturity | GitHub and Hugging Face maturity payloads | No public artifact surface observed | stale | Open source artifacts or mark the gap as missing. verified:false

Viability breakdown

decision rows

Dimension | Current read | Evidence | Gaps | Next test

Technical feasibility

partial

Current read

Runnable path is not fully verified.

Evidence

No Build Passport payload attached.

Gaps

No verified reproduction transcript attached.

Next test

Run minimal reproduction from the Build Passport prototype path.

Market urgency

missing

Current read

Buyer urgency is not verified from source.

Evidence

0 references, 0 sources, 33% evidence coverage.

Gaps

No customer pull, deployment, or budget-owner source attached.

Next test

Collect buyer interview, deployment evidence, or cited demand signal.

Buyer clarity

missing

Current read

No budget owner is verified for this paper.

Evidence

Build tab has no CRM, procurement, or operator source.

Gaps

Buyer persona and buying trigger are not sourced.

Next test

Map target operator, economic buyer, and procurement trigger.

Defensibility

missing

Current read

Defensibility signals are missing.

Evidence

No defensibility receipt attached.

Gaps

Confidence band missing.

Next test

Refresh defensibility bars with source receipts.

Integration burden

missing

Current read

No public implementation surface observed.

Evidence

No GitHub or Hugging Face payload attached.

Gaps

No integration checklist or owner attached.

Next test

Write integration checklist from prototype path and target workflow.

Capital intensity

missing

Current read

No observed cost estimate is verified.

Evidence

Cost passport has no observed_usd value.

Gaps

No verified cost estimate.

Next test

Run cost passport or mark the cost field not applicable.

Regulatory load

missing

Current read

No regulatory classification is attached.

Evidence

Build Passport ledger does not include regulatory flags.

Gaps

Clinical, privacy, safety, export, and biosecurity flags unclassified.

Next test

Classify regulatory flags before commercialization planning.

Talent buckets

no invented people

Scientific founder

Needed now

No named scientific founder assigned.

Paper authors are not treated as operators without consent.

People

No named person assigned.

Gaps

Founder commitment not verified.

Next verification path

Confirm founder/operator owner.

Translational engineer

Needed now

Prototype owner missing.

Build Passport does not name an implementer.

People

No named person assigned.

Gaps

No repo owner or reproduction engineer attached.

Next verification path

Identify engineer for reproduction path.

Domain operator

Needed later

Operator workflow not sourced.

No buyer or workflow interview attached.

People

No named person assigned.

Gaps

No domain workflow evidence.

Next verification path

Interview target operator.

GTM lead

Needed later

No GTM owner verified.

No CRM or outreach source attached.

People

No named person assigned.

Gaps

No channel evidence.

Next verification path

Define first channel test.

Regulatory/clinical advisor

Needed later

Regulatory need unclassified.

No clinical or regulatory source attached.

People

No named person assigned.

Gaps

Regulatory domain not classified.

Next verification path

Classify regulatory flags.

ARTIFACTS

No public artifacts yet.

DEFENSIBILITY

Defensibility and confidence evidence pending.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Claim map

Constellation map

Competitive landscape

Buzz

PDF

References (40)

Related Papers

Related Resources

Subscribe to the weekly brief

Build artifacts

Brief

Experiment plan

Validation checklist

Scientific founder

Translational engineer

Domain operator

GTM lead

Regulatory/clinical advisor

Timeline
