ARXIV:2603.21454 · LLM EVALUATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: citation fields available (Verified) | Proof: unverified proof status (Partial)

Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Tae-Eun Song · arXiv

A novel black-box method and multi-agent framework to detect benchmark contamination in LLMs, ensuring the credibility of coding benchmarks.

Ship in 2-4 weeks | Score 7.0 | Evidence unverified

Opportunity summary

Pain: A novel black-box method and multi-agent framework to detect benchmark contamination in LLMs, ensuring the credibility of coding benchmarks.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified

PROBLEM

A novel black-box method and multi-agent framework to detect benchmark contamination in LLMs, ensuring the credibility of coding benchmarks. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers…

METHOD

Full abstract

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
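The core CCV idea in the abstract, solving one problem in N independent sessions and measuring how much the solutions differ, can be illustrated with a rough sketch. The abstract does not say which diversity measure the paper uses, so the mean pairwise text dissimilarity below, the function names `pairwise_diversity` and `looks_recalled`, the 0.05 threshold, and the toy patches are all assumptions made for illustration. Five sessions per problem is what 45 trials over 9 problems would give if the trials are split evenly.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_diversity(solutions: list[str]) -> float:
    """Mean pairwise dissimilarity: 0.0 means every session produced the same text."""
    if len(solutions) < 2:
        raise ValueError("need at least two sessions to compare")
    return mean(
        1.0 - SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(solutions, 2)
    )


def looks_recalled(solutions: list[str], threshold: float = 0.05) -> bool:
    """Hypothetical rule: near-identical output across isolated sessions suggests recall."""
    return pairwise_diversity(solutions) < threshold


# Toy data, not taken from the paper: five isolated sessions per problem.
recalled = ["return x.strip().lower()"] * 5
reasoned = [
    "return x.strip().lower()",
    "return x.lower().strip()",
    "y = x.strip()\nreturn y.casefold()",
    "return ' '.join(x.split()).lower()",
    "return str.lower(x.strip())",
]

print(looks_recalled(recalled), looks_recalled(reasoned))
```

A character-level ratio is only a stand-in here. A real pipeline would more plausibly compare normalized diffs or ASTs, and would need each session to start with no shared context so that agreement cannot come from conversation history.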

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U=0, p…
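The reported statistics (U = 0, p ≈ 0.012, r = 1.0) can be sanity-checked with a small standard-library sketch. The abstract does not give the group sizes or say whether the test is one-sided, so the 3 versus 6 split and the one-sided exact test below are assumptions, chosen only because they reproduce a value near 0.012 (1/84). The diversity scores are toy numbers, not data from the paper.

```python
from math import comb


def mann_whitney_u(a: list[float], b: list[float]) -> float:
    """U for sample a: count of (a_i, b_j) pairs with a_i > b_j, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Rank-biserial effect size; 1.0 means the two groups do not overlap at all."""
    return 1.0 - 2.0 * u / (n1 * n2)


def exact_p_at_u_zero(n1: int, n2: int, two_sided: bool = False) -> float:
    """Exact p-value for U = 0 without ties: one ordering out of C(n1 + n2, n1)."""
    p = 1.0 / comb(n1 + n2, n1)
    return min(1.0, 2.0 * p) if two_sided else p


# Toy diversity scores (assumed): low for contaminated problems, higher for genuine ones.
contaminated = [0.00, 0.00, 0.01]
genuine = [0.20, 0.25, 0.30, 0.35, 0.40, 0.50]

u = mann_whitney_u(contaminated, genuine)
r = rank_biserial(u, len(contaminated), len(genuine))
p = exact_p_at_u_zero(len(contaminated), len(genuine))
print(u, r, round(p, 3))
```

Under the same assumed split a two-sided exact test would give 2/84 ≈ 0.024. With only nine problems, complete separation is the strongest result the test can report, so the p-value says more about the sample size than about how far apart the groups are.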

WHY NOW

LLM Evaluation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Evaluation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

arXiv: https://arxiv.org/abs/2603.21454 | Published: 2026-03-23 | Research domain: LLM Evaluation
