OctoBench is a benchmark designed to evaluate Large Language Models' (LLMs) ability to follow scaffold-specified, heterogeneous instructions in repository-grounded agentic coding. It features diverse environments, tasks, and an automated scoring toolkit to assess compliance and disentangle it from task-solving.
In plainer terms, OctoBench tests how well AI coding assistants follow specific rules and instructions, especially when those rules are complex and apply across many steps. Evaluations on the benchmark show that even advanced models often fail to follow such rules consistently, even when they can complete the coding task itself, which points to a need for better training methods for AI coding agents.
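The key idea of disentangling compliance from task-solving can be sketched as scoring the two on independent axes. This is a minimal illustrative sketch, not OctoBench's actual toolkit; the names `InstructionCheck` and `score_run` are assumptions.

```python
# Hypothetical sketch of disentangled scoring: instruction compliance
# and task success are reported separately, never merged into one number.
# All names here are illustrative, not OctoBench's real API.
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionCheck:
    """One scaffold-specified rule, verified automatically on the agent's trace."""
    name: str
    passed: bool


def score_run(checks: List[InstructionCheck], task_tests_passed: bool) -> dict:
    """Report compliance and task-solving as independent axes."""
    compliance = sum(c.passed for c in checks) / len(checks) if checks else 1.0
    return {
        "compliance_rate": compliance,     # fraction of rules followed
        "task_solved": task_tests_passed,  # did the solution pass its tests?
    }


# Example: the agent solved the task but violated one of three rules.
result = score_run(
    [
        InstructionCheck("use-only-stdlib", True),
        InstructionCheck("no-network-calls", True),
        InstructionCheck("commit-message-format", False),
    ],
    task_tests_passed=True,
)
```

Keeping the two scores separate is what lets an evaluation say "the model solved the task but only followed 2 of 3 rules" rather than collapsing both into a single pass/fail.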