GUITestBench is an interactive benchmark for evaluating Multi-modal Large Language Model (MLLM) agents on autonomous exploratory Graphical User Interface (GUI) testing. It comprises 143 distinct tasks and incorporates 26 unique software defects. The benchmark was developed to address two critical failure modes observed in existing MLLM agents: "Goal-Oriented Masking," where agents prioritize task completion over defect reporting, and "Execution-Bias Attribution," where agents misattribute system defects to their own execution errors. By providing a standardized, challenging environment, GUITestBench lets researchers and ML engineers rigorously test, compare, and improve MLLM-based agents, advancing automated software quality assurance and reducing the high manual cost of exploratory GUI testing.
In simpler terms, GUITestBench is an interactive test suite for AI agents that automatically check software interfaces for bugs. It helps researchers measure how well such agents find defects, especially since previous models struggled to prioritize bug reporting over completing tasks. This makes the benchmark a useful tool for improving automated software quality assurance.
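The failure modes above suggest how an evaluation might be scored. The sketch below is purely illustrative and assumes a hypothetical scoring scheme: the names `Task`, `AgentResult`, and `score_episode` are not the benchmark's actual API, and the scoring rule is an assumption about how one could reward defect reporting over bare task completion.

```python
# Hypothetical sketch of a GUITestBench-style scoring rule. All names and the
# scoring logic are illustrative assumptions, not the benchmark's real API.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    has_injected_defect: bool  # whether this task's app contains a known bug


@dataclass
class AgentResult:
    task_completed: bool
    defect_reported: bool


def score_episode(task: Task, result: AgentResult) -> float:
    """Reward defect reporting over bare task completion, countering
    'Goal-Oriented Masking' (finishing the task while ignoring the bug)."""
    if task.has_injected_defect:
        # Credit only goes to agents that surface the bug.
        return 1.0 if result.defect_reported else 0.0
    # No defect present: a defect report here is a false positive.
    if result.defect_reported:
        return 0.0
    return 1.0 if result.task_completed else 0.0
```

Under a rule like this, an agent that completes a buggy task without reporting the defect scores zero, which directly penalizes the goal-oriented masking behavior described above.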