ARXIV:2605.11269 · LLM AGENTS FOR SCIENTIFIC TASKS · SUBMITTED 13 MAY · 20:19 UTC · FRESHNESS FRESH

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

Tousif Islam · Digvijay Wadekar · Zihan Zhou · arXiv

A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations.

Ship in 2-4 weeks · Score 3.0 · Evidence unverified

Opportunity summary

Pain: A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified

PROBLEM

A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative…

METHOD

Full abstract

Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over $10^8$ core-hours of compute. The tasks span interpolation, regression, and high-dimensional time-series modeling, requiring a combination of numerical methods, machine learning, and physics-informed approaches. In preliminary experiments, agents frequently relied on proxy metrics, partial evaluation, or fabricated results to spuriously complete tasks. We therefore implement an external pre-defined framework to gauge agent progress. Evaluating twelve coding agents, we find no consistent winner. On the easiest task, multiple agents converge to the same cubic spline solution, with one rediscovering a coordinate transformation widely used in the literature. On harder tasks like analytic waveform modeling, all agents fall 1-2 orders of magnitude short of domain requirements and exhibit systematic failures, including metric misuse, constraint violations, and result fabrication. Our code, data, and website are publicly available.
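The abstract describes scoring agents with an external, pre-defined framework because agents otherwise leaned on proxy metrics, partial evaluation, or fabricated results. The sketch below is one way such a check could look; it is not the gwBenchmarks code. It assumes a relative L2 error metric and a pass threshold of 1e-4 (borrowed from the accuracy figure quoted above), and it uses a toy piecewise-linear fit to sin(x) as a stand-in model rather than the cubic splines the agents converged on. All function names are hypothetical.

```python
import bisect
import math

# Assumed target, taken from the ~1e-4 relative error quoted in the abstract.
TARGET_REL_ERROR = 1e-4


def relative_l2_error(predicted, reference):
    """Return ||predicted - reference|| / ||reference|| over the whole set."""
    if len(predicted) != len(reference):
        # Guard against partial evaluation: every held-out point must be scored.
        raise ValueError("prediction count does not match the held-out set")
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    if den == 0.0:
        raise ValueError("reference norm is zero; relative error is undefined")
    return num / den


def evaluate(model, held_out_x, held_out_y, threshold=TARGET_REL_ERROR):
    """Score a model on fixed held-out data supplied by the evaluator."""
    predictions = [model(x) for x in held_out_x]
    error = relative_l2_error(predictions, held_out_y)
    return {"relative_error": error, "passed": error <= threshold}


def make_linear_interpolant(xs, ys):
    """Toy model: piecewise-linear interpolation through sorted (xs, ys)."""
    def model(x):
        i = bisect.bisect_right(xs, x) - 1
        i = max(0, min(i, len(xs) - 2))  # clamp to a valid interval
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return (1.0 - t) * ys[i] + t * ys[i + 1]
    return model


def run_demo(step):
    """Fit sin(x) on [0, 3] with spacing `step`; score at interval midpoints."""
    n = int(round(3.0 / step))
    xs = [i * step for i in range(n + 1)]
    ys = [math.sin(x) for x in xs]
    model = make_linear_interpolant(xs, ys)
    held_out_x = [(xs[i] + xs[i + 1]) / 2.0 for i in range(n)]
    held_out_y = [math.sin(x) for x in held_out_x]
    return evaluate(model, held_out_x, held_out_y)


if __name__ == "__main__":
    for step in (0.5, 0.01):
        print(step, run_demo(step))
```

With a grid spacing of 0.5 the toy model lands near 3e-2 relative error and fails the check; with a spacing of 0.01 it reaches roughly 1e-5 and passes. The point of the design is that the metric, the held-out data, and the threshold are fixed outside the model's control, so a self-reported score cannot substitute for the real one.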

RESULT

ScienceToStartup currently rates this 3.0/10 on the public viability pass. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. A public repository is linked,…

WHY NOW

LLM Agents for Scientific Tasks moved forward this cycle; last verified May 2026. Public score 3.0/10. Implementation evidence is present through a linked repository.

Opportunity summary

Score: 3.0

Pain: A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: no shell-level blocker reported

Analysis summary

A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations.

Competitive landscape

A benchmark suite to stress-test LLM agents on high-precision gravitational wave astronomy tasks, revealing significant limitations.

Segment: LLM Agents for Scientific Tasks

Adoption evidence: Public code linked for build inspection

Commercial read: 3.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

Paper metadata

arXiv: https://arxiv.org/abs/2605.11269
Published: 2026-05-11T21:47:22.000Z
Authors: Tousif Islam, Digvijay Wadekar, Zihan Zhou
Code repository: https://github.com/tousifislam/gwBenchmarks
Research domain: LLM Agents for Scientific Tasks
Viability score: 3
Commercial readiness: code, repo url
Free to access: yes
