ARXIV:2605.21482 · AI AGENTS · SUBMITTED 21 MAY · 20:27 UTC · FRESHNESS STALE

Verified · Source: PDF linked | Verified · PaperPack: citation fields available | Partial · Proof: unverified proof status

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Sixiong Xie · Zhuofan Shi · Haiyang Shen · Jiuzheng Wang · Siqi Zhong · Mugeng Liu · Chongyang Pan · Peilun Jia · Baoqing Sun · Xiang Jing · Yun Ma

DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: Evidence unverified

PROBLEM

DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from…

METHOD

Full abstract

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Code availability…

WHY NOW

AI Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0

Pain: DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.

Evidence: 0 refs | 3 sources | 50% coverage

Blocker: no shell-level blocker reported

Competitive landscape

DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.

Segment: AI Agents
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/deepweb-bench-a-deep-research-benchmark-demanding-massive-cross-source-evidence-and-long-horizon-derivation#webpage", "url": "https://sciencetostartup.com/paper/deepweb-bench-a-deep-research-benchmark-demanding-massive-cross-source-evidence-and-long-horizon-derivation", "name": "DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation", "description": "DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/deepweb-bench-a-deep-research-benchmark-demanding-massive-cross-source-evidence-and-long-horizon-derivation#scholarlyArticle", "headline": "DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation", "description": "DeepWeb-Bench: A challenging benchmark for deep research agents requiring massive evidence collection, cross-source reconciliation, and long-horizon reasoning.", "url": "https://sciencetostartup.com/paper/deepweb-bench-a-deep-research-benchmark-demanding-massive-cross-source-evidence-and-long-horizon-derivation", "sameAs": "https://arxiv.org/abs/2605.21482", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2605.21482" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-05-20T17:59:03.000Z", "author": [ { "@type": "Person", "name": "Sixiong Xie" }, { "@type": "Person", "name": "Zhuofan Shi" }, { "@type": "Person", "name": "Haiyang Shen" }, { "@type": "Person", "name": "Jiuzheng Wang" }, { "@type": "Person", "name": "Siqi Zhong" }, { "@type": "Person", "name": "Mugeng Liu" }, { "@type": "Person", "name": 
"Chongyang Pan" }, { "@type": "Person", "name": "Peilun Jia" }, { "@type": "Person", "name": "Baoqing Sun" }, { "@type": "Person", "name": "Xiang Jing" }, { "@type": "Person", "name": "Yun Ma" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "AI Agents" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "AI Agents", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "DeepWeb-Bench: A Deep Research Benchmark Demanding Massive C", "item": "https://sciencetostartup.com/paper/deepweb-bench-a-deep-research-benchmark-demanding-massive-cross-source-evidence-and-long-horizon-derivation" } ] } ] }
