OctoBench is a benchmark designed to evaluate Large Language Models' (LLMs) ability to follow scaffold-specified, heterogeneous instructions in repository-grounded agentic coding. It features diverse environments, tasks, and an automated scoring toolkit to assess compliance and disentangle it from task-solving.
In plainer terms, OctoBench tests how well AI coding assistants follow specific rules and instructions, especially when those rules are complex and apply across many steps. Evaluations on the benchmark show that even advanced models often fail to follow such rules consistently, even when they can complete the coding task itself, which points to a need for better training methods for AI coding agents.
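The key idea of disentangling compliance from task-solving can be sketched as scoring the two on independent axes. This is a minimal illustrative sketch, not OctoBench's actual toolkit; the names `InstructionCheck` and `score_run` are assumptions.

```python
# Hypothetical sketch of disentangled scoring: instruction compliance
# and task success are reported separately, never merged into one number.
# All names here are illustrative, not OctoBench's real API.
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionCheck:
    """One scaffold-specified rule, verified automatically on the agent's trace."""
    name: str
    passed: bool


def score_run(checks: List[InstructionCheck], task_tests_passed: bool) -> dict:
    """Report compliance and task-solving as independent axes."""
    compliance = sum(c.passed for c in checks) / len(checks) if checks else 1.0
    return {
        "compliance_rate": compliance,     # fraction of rules followed
        "task_solved": task_tests_passed,  # did the solution pass its tests?
    }


# Example: the agent solved the task but violated one of three rules.
result = score_run(
    [
        InstructionCheck("use-only-stdlib", True),
        InstructionCheck("no-network-calls", True),
        InstructionCheck("commit-message-format", False),
    ],
    task_tests_passed=True,
)
```

Keeping the two scores separate is what lets an evaluation say "the model solved the task but only followed 2 of 3 rules" rather than collapsing both into a single pass/fail.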