ARXIV:2605.09698 · DATA SCIENCE AGENTS · SUBMITTED 12 MAY · 20:16 UTC · FRESHNESS FRESH

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

Josefa Lia Stoisser · Marc Boubnovski Martell · Sidsel Boldsen · Kaspar Märtens · Robert Kitchen · arXiv

A benchmark suite to evaluate and improve task-framing accuracy in data-science agents.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A benchmark suite to evaluate and improve task-framing accuracy in data-science agents.

Evidence 0 refs | 0 sources | 0% coverage

Blocker Evidence unverified


PROBLEM

Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task.

METHOD

Full abstract

As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.
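The paired design described in the abstract (each task has a fully specified original and an ambiguous variant, both scored with the source benchmark's evaluator) lends itself to a few simple summary statistics. The following is a minimal sketch, not the authors' code: the `PairedRun` record, its field names, the `summarize` helper, and the toy numbers are all hypothetical, and the sketch assumes a higher-is-better score. It only illustrates how an ambiguity-induced drop, an over-asking rate on clear tasks, and a silent-defaulting rate on ambiguous tasks could be computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PairedRun:
    """One agent run on one task pair (hypothetical schema, not from the paper)."""
    task_id: str
    score_original: float      # evaluator score on the fully specified task (assumed higher is better)
    score_ambiguous: float     # evaluator score on the ambiguous variant
    asked_on_original: bool    # agent asked a clarifying question on the clear task
    asked_on_ambiguous: bool   # agent asked a clarifying question on the ambiguous task


def summarize(runs: list[PairedRun]) -> dict[str, float]:
    """Aggregate paired runs into three illustrative diagnostics."""
    if not runs:
        raise ValueError("need at least one paired run")
    return {
        # Mean per-task score lost when the framing information is removed.
        "mean_drop": mean(r.score_original - r.score_ambiguous for r in runs),
        # Fraction of clear tasks where the agent asked anyway (over-asking).
        "over_ask_rate": mean(1.0 if r.asked_on_original else 0.0 for r in runs),
        # Fraction of ambiguous tasks where the agent did not ask (silent defaulting).
        "silent_default_rate": mean(0.0 if r.asked_on_ambiguous else 1.0 for r in runs),
    }


# Toy, made-up values purely for illustration; these are not results from the paper.
toy_runs = [
    PairedRun("t1", 0.75, 0.25, asked_on_original=False, asked_on_ambiguous=True),
    PairedRun("t2", 0.5, 0.5, asked_on_original=True, asked_on_ambiguous=False),
]
summary = summarize(toy_runs)
print(summary)
```

Keeping the two asking rates separate mirrors the abstract's point that a single prompt setting can trade one failure for the other: a permissive prompt would tend to raise `over_ask_rate`, while a conservative one would tend to raise `silent_default_rate`.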

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations. Code availability is flagged in the production record;…

WHY NOW

Data Science Agents moved forward this cycle; last verified May 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A benchmark suite to evaluate and improve task-framing accuracy in data-science agents.

Evidence 0 refs | 0 sources | 0% coverage

Blocker no shell-level blocker reported



