ARXIV:2604.06742 · LLM SOFTWARE GENERATION · SUBMITTED 10 APR · 00:14 UTC · FRESHNESS STALE

Source: PDF linked (Verified) | PaperPack: citation fields available (Verified) | Proof: unverified proof status (Partial)

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Ruida Hu · Xinchen Wang · Chao Peng · Cuiyun Gao · David Lo · arXiv

CLI-Tool-Bench is a new benchmark for evaluating LLM-based 0-to-1 software generation of CLI tools, revealing current limitations in generating complete and robust applications.

Ship in 2-4 weeks | Score 7.0 | Evidence unverified

Opportunity summary

Pain: CLI-Tool-Bench is a new benchmark for evaluating LLM-based 0-to-1 software generation of CLI tools, revealing current limitations in generating complete and robust applications.

Evidence: 53 refs | 3 sources | 67% coverage

Blocker: Evidence unverified

PROBLEM

CLI-Tool-Bench is a new benchmark for evaluating LLM-based 0-to-1 software generation of CLI tools, revealing current limitations in generating complete and robust applications. However, existing benchmarks fail to assess this 0-to-1 generation capability due…

METHOD

Full abstract

Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations: reliance on predefined scaffolds that ignore repository structure planning, and rigid white-box unit testing that lacks end-to-end behavioral validation. To bridge this gap, we introduce CLI-Tool-Bench, a structure-agnostic benchmark for evaluating the ground-up generation of Command-Line Interface (CLI) tools. It features 100 diverse real-world repositories evaluated via a black-box differential testing framework. Agent-generated software is executed in sandboxes, comparing system side effects and terminal outputs against human-written oracles using multi-tiered equivalence metrics. Evaluating seven state-of-the-art LLMs, we reveal that top models achieve under 43% success, highlighting the ongoing challenge of 0-to-1 generation. Furthermore, higher token consumption does not guarantee better performance, and agents tend to generate monolithic code.
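The black-box differential testing idea in the abstract can be illustrated with a minimal sketch. This is not the paper's framework: the function names (`observe`, `compare`, `snapshot`), the four comparison tiers, and the use of a temporary directory in place of a real sandbox are all assumptions made for illustration. The sketch runs an oracle command and a candidate command in separate scratch directories, records exit code, terminal output, and files written, and then compares the two observations at several levels of strictness.

```python
import hashlib
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Observation:
    exit_code: int
    stdout: str
    files: dict  # relative path -> SHA-256 of file contents


def snapshot(root: Path) -> dict:
    """Hash every regular file under root so side effects can be compared."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def observe(argv: list, timeout: float = 10.0) -> Observation:
    """Run argv in a fresh temporary directory and record what it did.

    A temporary directory is only a stand-in for a sandbox (assumption):
    it isolates the working directory, not the whole system.
    """
    with tempfile.TemporaryDirectory() as tmp:
        proc = subprocess.run(
            argv, cwd=tmp, capture_output=True, text=True, timeout=timeout
        )
        return Observation(proc.returncode, proc.stdout, snapshot(Path(tmp)))


def _normalize(text: str) -> str:
    """Collapse all runs of whitespace so formatting noise is ignored."""
    return " ".join(text.split())


def compare(oracle: Observation, candidate: Observation) -> dict:
    """Hypothetical equivalence tiers; the paper's actual metrics may differ."""
    return {
        "exit_code": oracle.exit_code == candidate.exit_code,
        "stdout_exact": oracle.stdout == candidate.stdout,
        "stdout_normalized": _normalize(oracle.stdout) == _normalize(candidate.stdout),
        "side_effects": oracle.files == candidate.files,
    }


if __name__ == "__main__":
    # Two stand-in "tools": same file side effect, output differs only in whitespace.
    oracle_cmd = [sys.executable, "-c",
                  "open('out.txt', 'w').write('42'); print('result: 42')"]
    candidate_cmd = [sys.executable, "-c",
                     "open('out.txt', 'w').write('42'); print('result:  42 ')"]
    print(compare(observe(oracle_cmd), observe(candidate_cmd)))
```

In this toy case the candidate passes the exit-code, normalized-output, and side-effect checks but fails the exact-output check, which is the kind of distinction a tiered metric is meant to surface. Because only observable behavior is compared, the check does not depend on how the generated repository is laid out.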

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Evaluating seven state-of-the-art LLMs, we reveal that top models achieve under 43% success, highlighting the ongoing challenge of 0-to-1 generation. Code availability is flagged…

WHY NOW

LLM Software Generation moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Competitive landscape

Segment: LLM Software Generation

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified
