ARXIV:2603.23611 · LLM TESTING · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified)
PaperPack: citation fields available (Verified)
Proof: unverified proof status (Partial)

LLMORPH: Automated Metamorphic Testing of Large Language Models

Steven Cho · Stefano Ruberto · Valerio Terragni · arXiv

LLMORPH automates LLM testing by generating new test cases from existing ones, uncovering model inconsistencies without human labels.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: LLMORPH automates LLM testing by generating new test cases from existing ones, uncovering model inconsistencies without human labels.
Evidence: 0 refs | 0 sources | 17% coverage
Blocker: Evidence unverified

PROBLEM

LLMORPH automates LLM testing by generating new test cases from existing ones, uncovering model inconsistencies without human labels. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages…

METHOD

Full abstract

Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test inputs, enabling detection of inconsistencies in model outputs without the need for expensive labelled data. LLMORPH is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMORPH, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we applied 36 MRs across four NLP benchmarks, testing three state-of-the-art LLMs: GPT-4, LLAMA3, and HERMES 2. This produced over 561,000 test executions. Results demonstrate LLMORPH's effectiveness in automatically exposing inconsistencies.
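To make the metamorphic testing idea in the abstract concrete, here is a minimal Python sketch. It is not LLMORPH's actual API: every name in it (`MetamorphicRelation`, `run_mt`, `toy_sentiment_model`) is hypothetical, the two relations are illustrative rather than drawn from the paper's 36 MRs, and a brittle keyword classifier stands in for a real LLM call. It shows the core loop only: derive a follow-up input from a source input, run the model on both, and flag a violation when the expected relation between the two outputs fails. No ground-truth label is consulted.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical; this is not LLMORPH's API.
Model = Callable[[str], str]


@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]       # source input -> follow-up input
    relation: Callable[[str, str], bool]  # must hold between the two outputs


@dataclass
class Violation:
    mr: str
    source_input: str
    follow_up_input: str
    source_output: str
    follow_up_output: str


def run_mt(model: Model, mrs: List[MetamorphicRelation],
           inputs: List[str]) -> List[Violation]:
    """Apply every MR to every source input and collect relation failures."""
    violations = []
    for text in inputs:
        source_output = model(text)
        for mr in mrs:
            follow_up = mr.transform(text)
            follow_up_output = model(follow_up)
            # The oracle is the relation between outputs, not a human label.
            if not mr.relation(source_output, follow_up_output):
                violations.append(Violation(mr.name, text, follow_up,
                                            source_output, follow_up_output))
    return violations


def toy_sentiment_model(text: str) -> str:
    """Stand-in for an LLM call: deliberately brittle and case-sensitive."""
    return "positive" if "good" in text else "negative"


def same_label(source_output: str, follow_up_output: str) -> bool:
    """Output relation: the transformation should not change the label."""
    return source_output == follow_up_output


MRS = [
    MetamorphicRelation("uppercase", str.upper, same_label),
    MetamorphicRelation("trailing_whitespace", lambda s: s + "  ", same_label),
]

if __name__ == "__main__":
    sources = ["The food was good.", "The service was slow."]
    for v in run_mt(toy_sentiment_model, MRS, sources):
        print(f"[{v.mr}] {v.source_input!r} -> {v.source_output}, "
              f"{v.follow_up_input!r} -> {v.follow_up_output}")
```

With the toy model, only the uppercase relation on the first sentence is violated, because the keyword match is case-sensitive. Swapping `toy_sentiment_model` for a function that calls a real LLM would leave the loop unchanged, which reflects the kind of extensibility the abstract describes.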

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Results demonstrate LLMORPH's effectiveness in automatically exposing inconsistencies. Code availability is flagged in the production record; the public repository link still needs proof alignment.

WHY NOW

LLM Testing moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.

Opportunity summary

Score: 7.0
Pain: LLMORPH automates LLM testing by generating new test cases from existing ones, uncovering model inconsistencies without human labels.
Evidence: 0 refs | 0 sources | 17% coverage
Blocker: no shell-level blocker reported

Competitive landscape

Segment: LLM Testing
Adoption evidence: No public code link in the paper record yet
Commercial read: 7.0/10 public viability
Direct: not classified
Adjacent: not classified
Substitute: not classified
Unknown: not classified

arXiv: https://arxiv.org/abs/2603.23611 · Published 2026-03-24T18:01:02Z
