ARXIV:2604.01532 · INDUSTRIAL AI AGENTS · SUBMITTED 03 APR · 20:50 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

Ayan Das · Dhaval Patel · arXiv

A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities.

Evidence 0 refs | 0 sources | 33% coverage

Blocker Evidence unverified


PROBLEM

A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents…

METHOD

Full abstract

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.
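The task-commensurate metrics named in the abstract (MAE/RMSE for regression, F1-score for classification, categorical matching for health assessments) can be illustrated with a minimal sketch. This is not PHMForge's evaluator code: the function names `mae`, `rmse`, `binary_f1`, and `categorical_match` are hypothetical, the sample values are made up for illustration, and the single-positive-class F1 is an assumption, since the abstract does not state which F1 averaging the benchmark uses.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, e.g. for RUL predictions measured in cycles."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def binary_f1(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 for one positive class (assumption: the paper may average differently)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def categorical_match(expected: str, actual: str) -> bool:
    """Case- and whitespace-insensitive match of a health category label."""
    return expected.strip().lower() == actual.strip().lower()


# Illustrative values only; not taken from the benchmark.
rul_true, rul_pred = [100.0, 80.0, 60.0], [90.0, 85.0, 60.0]
fault_true = ["fault", "ok", "fault", "ok"]
fault_pred = ["fault", "fault", "ok", "ok"]

print(mae(rul_true, rul_pred))
print(rmse(rul_true, rul_pred))
print(binary_f1(fault_true, fault_pred, positive="fault"))
print(categorical_match("Degraded ", "degraded"))
```

An execution-based evaluator in this spirit would compare the values an agent obtains through tool calls against ground truth, rather than grading its free-text answer.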

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score…

WHY NOW

Industrial AI Agents moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities.

Evidence 0 refs | 0 sources | 33% coverage

Blocker no shell-level blocker reported

Analysis summary

A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities.


Competitive landscape

A benchmark for evaluating LLM agents in industrial maintenance tasks, revealing significant gaps in current capabilities.

Segment

Industrial AI Agents

Adoption evidence

No public code link in the paper record yet

Commercial read

7.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


