TAM-Eval (Test Automated Maintenance Evaluation) is a framework and benchmark designed to rigorously evaluate Large Language Models (LLMs) across the full spectrum of test suite maintenance tasks: creation, repair, and updating. It operates at the test file level with full repository context, reflecting real-world software engineering workflows.
This whole-file, in-repository setting is intended to be more realistic than earlier evaluation approaches that work on isolated snippets, and initial results show that current LLMs struggle with these complex maintenance tasks.
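To make the task setup concrete, the sketch below shows one way a file-level maintenance task with full repository context might be represented and turned into a model prompt. The names (`TaskType`, `MaintenanceTask`, `build_prompt`) and their fields are illustrative assumptions for this sketch, not TAM-Eval's actual data format or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class TaskType(Enum):
    """The three maintenance tasks the benchmark covers."""
    CREATE = "create"   # write a new test file for code that lacks tests
    REPAIR = "repair"   # fix a failing or broken test file
    UPDATE = "update"   # adapt an existing test file to changed source code


@dataclass
class MaintenanceTask:
    """One benchmark instance: a whole test file plus its repository context."""
    task_type: TaskType
    repo_files: Dict[str, str]            # path -> source text (repository context)
    target_test_path: str                 # the test file the model must create or edit
    existing_test: Optional[str] = None   # present for repair/update, absent for creation


def build_prompt(task: MaintenanceTask, max_context_chars: int = 20_000) -> str:
    """Assemble a file-level prompt: repository context first, then the target test file."""
    parts = [f"Task: {task.task_type.value} the test file {task.target_test_path}"]
    budget = max_context_chars
    for path, source in task.repo_files.items():
        chunk = f"# File: {path}\n{source}"
        if len(chunk) > budget:
            break                          # stop once the context budget is spent
        parts.append(chunk)
        budget -= len(chunk)
    if task.existing_test is not None:
        parts.append(f"# Current contents of {task.target_test_path}:\n{task.existing_test}")
    return "\n\n".join(parts)


# Example: an "update" task where a renamed function broke an existing test.
task = MaintenanceTask(
    task_type=TaskType.UPDATE,
    repo_files={"calc.py": "def add_numbers(a, b):\n    return a + b\n"},
    target_test_path="tests/test_calc.py",
    existing_test="from calc import add\n\ndef test_add():\n    assert add(1, 2) == 3\n",
)
print(build_prompt(task))
```

In a setup like this, the model's output for the target test file would then be checked against the repository, for example by running the produced tests and comparing outcomes.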