ARXIV:2604.04323 · LLM AGENTS · SUBMITTED 07 APR · 20:12 UTC · FRESHNESS UNKNOWN

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Yujian Liu · Jiabao Ji · Li An · Tommi Jaakkola · Yang Zhang · Shiyu Chang · arXiv

This research benchmarks LLM agent skill usage in realistic settings and proposes refinement strategies to improve performance, with code available.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: This research benchmarks LLM agent skill usage in realistic settings and proposes refinement strategies to improve performance, with code available.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified


PROBLEM

Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided…

METHOD

Full abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formally benchmarking skill usage performance remains scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly-tailored task-specific skills for each task, whereas in many realistic settings, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well-tailored for the task. In this paper, we conduct the first comprehensive study of skill utility under progressively challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and we show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
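The retrieval setting described in the abstract, where the agent must pick relevant skills from a large collection on its own, can be illustrated with a toy example. The following is a minimal Python sketch, not the authors' method: the `Skill` type, the `retrieve` function, the sample skills, and the Jaccard token-overlap scoring are all assumptions made for illustration, and this page does not say which retriever the paper uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str          # hypothetical identifier for a skill
    description: str   # free-text summary of what the skill covers


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, skills: list[Skill], k: int = 2) -> list[Skill]:
    """Rank skills by Jaccard token overlap with the query.

    Returns at most k skills and drops skills with zero overlap, so the
    result can be empty when nothing in the collection is relevant.
    """
    q = tokenize(query)
    scored = []
    for skill in skills:
        s = tokenize(skill.name + " " + skill.description)
        union = q | s
        score = len(q & s) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, skill))
    # Stable sort: skills with equal scores keep their collection order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:k]]


# Illustrative stand-in for a skill collection (the paper uses 34k real-world skills).
SAMPLE_SKILLS = [
    Skill("git-rebase", "Rebase and squash git commits"),
    Skill("csv-clean", "Clean and validate CSV files with pandas"),
    Skill("docker-debug", "Debug failing docker containers"),
]
```

With the sample data, `retrieve("squash the last three git commits", SAMPLE_SKILLS)` returns only the `git-rebase` skill, and a query that shares no tokens with any skill returns an empty list. The empty case mirrors the situation the abstract highlights, where the agent may not find a well-matched skill at all.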

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. To narrow the gap between idealized and realistic skill usage, the authors study skill refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance…

WHY NOW

LLM Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.


Opportunity summary

Score: 7.0

Pain: This research benchmarks LLM agent skill usage in realistic settings and proposes refinement strategies to improve performance, with code available.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported


Competitive landscape

Segment: LLM Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


