ARXIV:2604.12875 · AI SAFETY · SUBMITTED 15 APR · 17:01 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

Abiodun A. Solanke · arXiv

AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.

Ship in 2-4 weeks · Score 3.0 · Evidence unverified

Opportunity summary

Pain AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified

PROBLEM

AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety…

METHOD

Full abstract

The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post-publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
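
To put the counts quoted in the abstract on a common scale, the short sketch below converts each one to a share of the 195 catalogued benchmarks. The counts are taken from the abstract; the dictionary keys and the `share` helper are illustrative names chosen here, not fields from the paper's workbook.

```python
# Counts as reported in the abstract. The key names are illustrative
# assumptions, not the column names of the paper's multi-sheet schema.
TOTAL_BENCHMARKS = 195

reported_counts = {
    "medium_complexity": 94,
    "popular_tier": 7,
    "english_only": 165,
    "evaluation_only": 170,
    "stale_github_repo": 137,
    "stale_hf_dataset": 96,
}


def share(count: int, total: int = TOTAL_BENCHMARKS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


# Percentage of the catalogue falling into each reported category.
shares = {name: share(count) for name, count in reported_counts.items()}

for name, pct in shares.items():
    print(f"{name:<20} {reported_counts[name]:>3}/{TOTAL_BENCHMARKS}  {pct:5.1f}%")
```

On these figures, roughly 85% of the catalogued benchmarks are English-only, about 87% are evaluation-only, and about 70% have stale GitHub repositories, while the Popular tier accounts for under 4%.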

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Code…

WHY NOW

AI Safety moved forward this cycle; last verified April 2026. Public score 3.0/10. Production flags indicate code availability.

Opportunity summary

Score 3.0

Pain AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.

Evidence 0 refs | 3 sources | 50% coverage

Blocker no shell-level blocker reported

Analysis summary

AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.

Competitive landscape

AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.

Segment: AI Safety

Adoption evidence: No public code link in the paper record yet

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper record: arXiv 2604.12875 (https://arxiv.org/abs/2604.12875) · Author: Abiodun A. Solanke · Published 2026-04-14 15:26:03 UTC · Research domain: AI Safety · Viability score: 3 · Commercial readiness: code · Free to access
