GUITestBench is an interactive benchmark for evaluating Multi-modal Large Language Model (MLLM) agents on autonomous exploratory Graphical User Interface (GUI) testing. It comprises 143 distinct tasks and incorporates 26 unique software defects. The benchmark was developed to address two critical failure modes observed in existing MLLM agents: "Goal-Oriented Masking," where agents prioritize task completion over defect reporting, and "Execution-Bias Attribution," where agents misattribute system defects to their own execution errors. By providing a standardized, challenging environment, GUITestBench lets researchers and ML engineers rigorously test, compare, and improve MLLM-based agents, advancing automated software quality assurance and reducing the high manual cost of exploratory GUI testing.
In simpler terms, GUITestBench is an interactive test suite for AI agents that automatically check software interfaces for bugs. It helps researchers measure how well such agents find defects, especially since previous models struggled to prioritize bug reporting over completing tasks. This makes the benchmark a useful tool for improving automated software quality assurance.
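The failure modes above suggest how an evaluation might be scored. The sketch below is purely illustrative and assumes a hypothetical scoring scheme: the names `Task`, `AgentResult`, and `score_episode` are not the benchmark's actual API, and the scoring rule is an assumption about how one could reward defect reporting over bare task completion.

```python
# Hypothetical sketch of a GUITestBench-style scoring rule. All names and the
# scoring logic are illustrative assumptions, not the benchmark's real API.
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    has_injected_defect: bool  # whether this task's app contains a known bug


@dataclass
class AgentResult:
    task_completed: bool
    defect_reported: bool


def score_episode(task: Task, result: AgentResult) -> float:
    """Reward defect reporting over bare task completion, countering
    'Goal-Oriented Masking' (finishing the task while ignoring the bug)."""
    if task.has_injected_defect:
        # Credit only goes to agents that surface the bug.
        return 1.0 if result.defect_reported else 0.0
    # No defect present: a defect report here is a false positive.
    if result.defect_reported:
        return 0.0
    return 1.0 if result.task_completed else 0.0
```

Under a rule like this, an agent that completes a buggy task without reporting the defect scores zero, which directly penalizes the goal-oriented masking behavior described above.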