ARXIV:2603.27862 · IMAGE GENERATION BENCHMARKING · SUBMITTED 31 MAR · 20:21 UTC · FRESHNESS STALE

Verified Source: PDF linked · Verified PaperPack: citation fields available · Partial Proof: unverified proof status

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

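Samin Mahdizadeh Sani · Max Ku · Nima Jamali · Matina Mahdizadeh Sani · Paria Khoshtab · Wei-Chieh Sun · Parnian Fazel · Zhi Rui Tam · Thomas Chong · Edisy Kin Wai Chan · Donald Wai Tong Tsang · Chiao-Wei Hsu · Ting Wai Lam · Ho Yin Sam Ng · Chiafeng Chu · Chak-Wing Mak · Keming Wu · Hiu Tung Wong · Yik Chun Ho · Chi Ruan · Zhuofeng Li · I-Sheng Fang · Shih-Ying Yeh · Ho Kei Cheng · Ping Nie · Wenhu Chen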

ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.

Evidence 38 refs | 4 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

Existing benchmarks remain limited: they either focus on isolated tasks, cover only…

METHOD

Full abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more in editing tasks than in generation tasks, especially in local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
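The abstract reports that VLM-based metrics reach Kendall accuracies up to 0.79 against human rankings. As a rough illustration of what such a number measures, the sketch below computes a pairwise ranking accuracy: the fraction of item pairs that an automated metric orders the same way as human scores. This is an assumption about the definition, not the paper's exact protocol. The function name `kendall_pairwise_accuracy`, the tie handling, and the example scores are all hypothetical.

```python
from itertools import combinations


def kendall_pairwise_accuracy(human, metric):
    """Fraction of item pairs that the metric orders the same way as humans.

    Assumptions (not taken from the paper): pairs tied in the human
    scores are skipped, and a pair tied only in the metric scores
    counts as a disagreement.
    """
    if len(human) != len(metric):
        raise ValueError("human and metric must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human)), 2):
        h = human[i] - human[j]
        m = metric[i] - metric[j]
        if h == 0:
            continue  # no human preference to agree with
        total += 1
        if h * m > 0:  # same sign: both rank the pair the same way
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return agree / total


# Hypothetical scores for four generated images (not data from the paper).
human_scores = [0.9, 0.4, 0.7, 0.2]
metric_scores = [0.8, 0.5, 0.3, 0.1]
print(kendall_pairwise_accuracy(human_scores, metric_scores))
```

In this toy example the metric disagrees with the human ordering on one of the six pairs. When there are no ties, this quantity relates to Kendall's tau by accuracy = (tau + 1) / 2, so 0.5 corresponds to chance-level ranking and 1.0 to perfect agreement.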

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. Modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. Code availability is…

WHY NOW

Image Generation Benchmarking moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score 7.0

Pain ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.

Evidence 38 refs | 4 sources | 50% coverage

Blocker no shell-level blocker reported


Competitive landscape

ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.

Segment: Image Generation Benchmarking

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/imagenworld-stress-testing-image-generation-models-with-explainable-human-evaluation-on-open-ended-real-world-tasks#webpage", "url": "https://sciencetostartup.com/paper/imagenworld-stress-testing-image-generation-models-with-explainable-human-evaluation-on-open-ended-real-world-tasks", "name": "ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks", "description": "ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/imagenworld-stress-testing-image-generation-models-with-explainable-human-evaluation-on-open-ended-real-world-tasks#scholarlyArticle", "headline": "ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks", "description": "ImagenWorld provides a comprehensive and explainable benchmark for stress-testing image generation models across diverse real-world tasks and domains, identifying specific failure modes.", "url": "https://sciencetostartup.com/paper/imagenworld-stress-testing-image-generation-models-with-explainable-human-evaluation-on-open-ended-real-world-tasks", "sameAs": "https://arxiv.org/abs/2603.27862", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2603.27862" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-03-29T20:42:05.000Z", "author": [ { "@type": "Person", "name": "Samin Mahdizadeh Sani" }, { "@type": "Person", "name": "Max Ku" }, { "@type": "Person", "name": "Nima Jamali" }, { "@type": "Person", "name": "Matina Mahdizadeh Sani" }, { 
"@type": "Person", "name": "Paria Khoshtab" }, { "@type": "Person", "name": "Wei-Chieh Sun" }, { "@type": "Person", "name": "Parnian Fazel" }, { "@type": "Person", "name": "Zhi Rui Tam" }, { "@type": "Person", "name": "Thomas Chong" }, { "@type": "Person", "name": "Edisy Kin Wai Chan" }, { "@type": "Person", "name": "Donald Wai Tong Tsang" }, { "@type": "Person", "name": "Chiao-Wei Hsu" }, { "@type": "Person", "name": "Ting Wai Lam" }, { "@type": "Person", "name": "Ho Yin Sam Ng" }, { "@type": "Person", "name": "Chiafeng Chu" }, { "@type": "Person", "name": "Chak-Wing Mak" }, { "@type": "Person", "name": "Keming Wu" }, { "@type": "Person", "name": "Hiu Tung Wong" }, { "@type": "Person", "name": "Yik Chun Ho" }, { "@type": "Person", "name": "Chi Ruan" }, { "@type": "Person", "name": "Zhuofeng Li" }, { "@type": "Person", "name": "I-Sheng Fang" }, { "@type": "Person", "name": "Shih-Ying Yeh" }, { "@type": "Person", "name": "Ho Kei Cheng" }, { "@type": "Person", "name": "Ping Nie" }, { "@type": "Person", "name": "Wenhu Chen" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 7 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "Image Generation Benchmarking" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "Image Generation Benchmarking", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "ImagenWorld: Stress-Testing Image Generation Models with Exp", "item": "https://sciencetostartup.com/paper/imagenworld-stress-testing-image-generation-models-with-explainable-human-evaluation-on-open-ended-real-world-tasks" } ] } ] }

