ARXIV:2605.10834 · AGENTS · SUBMITTED 12 MAY · 20:14 UTC · FRESHNESS FRESH

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pedro Conde · Henrique Branquinho · Valerio Mazzone · Bruno Mendes · André Baptista · Nuno Moniz · arXiv

A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified


PROBLEM

A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit…

METHOD

Full abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
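The bipartite resolution and cumulative evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the input format, the function names (`max_bipartite_matching`, `score_run`, `cumulative_recall`), and the sample IDs are all hypothetical. It assumes a semantic matcher (LLM-based in the paper) has already listed, for each agent finding, the ground-truth vulnerabilities it could plausibly correspond to. A maximum one-to-one matching then ensures that an ambiguous finding is credited at most once and that no ground-truth entry is counted twice.

```python
from typing import Dict, List, Set

# Hypothetical input: finding ID -> ground-truth IDs judged compatible
# by a semantic matcher. This format is an assumption, not the paper's.
Candidates = Dict[str, List[str]]


def max_bipartite_matching(candidates: Candidates) -> Dict[str, str]:
    """Return a maximum one-to-one matching finding_id -> ground_truth_id."""
    gt_to_finding: Dict[str, str] = {}

    def try_assign(finding: str, visited: Set[str]) -> bool:
        # Augmenting-path step (Kuhn's algorithm): take a free ground-truth
        # entry, or re-route the finding that currently holds it.
        for gt in candidates.get(finding, []):
            if gt in visited:
                continue
            visited.add(gt)
            if gt not in gt_to_finding or try_assign(gt_to_finding[gt], visited):
                gt_to_finding[gt] = finding
                return True
        return False

    for finding in candidates:
        try_assign(finding, set())
    return {finding: gt for gt, finding in gt_to_finding.items()}


def score_run(candidates: Candidates, n_ground_truth: int) -> dict:
    """Score one agent run: matched findings count as true positives."""
    matching = max_bipartite_matching(candidates)
    tp = len(matching)
    n_findings = len(candidates)
    return {
        "tp": tp,
        "precision": tp / n_findings if n_findings else 0.0,
        "recall": tp / n_ground_truth if n_ground_truth else 0.0,
        "matched": set(matching.values()),
    }


def cumulative_recall(runs: List[Candidates], n_ground_truth: int) -> List[float]:
    """Recall over the union of ground-truth entries found in runs 1..k."""
    seen: Set[str] = set()
    curve = []
    for run in runs:
        seen |= score_run(run, n_ground_truth)["matched"]
        curve.append(len(seen) / n_ground_truth)
    return curve


# Illustrative data only: two runs against a target with 4 known vulnerabilities.
run_a = {"f1": ["V1", "V2"], "f2": ["V1"], "f3": []}  # f3 matches nothing
run_b = {"g1": ["V3"], "g2": ["V3"]}                  # g2 duplicates g1
print(score_run(run_a, 4)["tp"])              # 2
print(cumulative_recall([run_a, run_b], 4))   # [0.5, 0.75]
```

In `run_a`, a greedy first-come assignment would give `V1` to `f1` and leave `f2` unmatched; the augmenting-path step moves `f1` to `V2` so both findings count. The cumulative curve reflects why repeated runs matter for stochastic agents: each run may surface different vulnerabilities, and the union grows even when per-run recall is modest.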

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench. A public repository is linked, so build…

WHY NOW

Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Implementation evidence is present through a linked repository.


Opportunity summary

Score 7.0

Pain A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported

Analysis summary

A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios.


Competitive landscape

A novel evaluation protocol for AI pentesting agents that shifts assessment to validated vulnerability discovery in real-world scenarios.

Segment: Agents

Adoption evidence: Public code linked for build inspection

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


