MT-Bench is a multi-turn benchmark for evaluating large language models (LLMs) using strong LLMs as judges, designed to assess conversational abilities, reasoning, and instruction following across various domains.
MT-Bench is a widely used test for large AI models, especially those designed for conversation. It uses a powerful AI (such as GPT-4) as a judge to score how well other models respond to a series of two-turn questions, helping researchers identify which models are strongest at conversation, reasoning, and instruction following.
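The LLM-as-a-judge step can be illustrated with a minimal sketch: format a grading prompt, send it to a judge model, and parse a numeric rating from the reply. The `stub_judge` function below is a hypothetical placeholder for a real judge-model API call; the `[[rating]]` output format mirrors the single-answer grading style used by MT-Bench.

```python
import re

def build_judge_prompt(question: str, answer: str) -> str:
    # Ask the judge to rate the response 1-10 and emit the score as [[rating]].
    return (
        "Please act as an impartial judge and rate the response below "
        "on a scale of 1 to 10. Output the rating as [[rating]].\n\n"
        f"[Question]\n{question}\n\n[Response]\n{answer}"
    )

def parse_score(judge_output: str):
    # Extract the [[N]] rating; return None if the judge did not comply.
    m = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(m.group(1)) if m else None

def stub_judge(prompt: str) -> str:
    # Hypothetical stand-in for a real judge-model API call.
    return "The response is clear and accurate. Rating: [[8]]"

prompt = build_judge_prompt(
    "Explain recursion.",
    "Recursion is when a function calls itself to solve smaller subproblems.",
)
score = parse_score(stub_judge(prompt))
print(score)  # 8.0
```

In practice the judge's score for each turn is averaged across questions to produce a model's overall MT-Bench score.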
Related terms: MT-Bench Leaderboard, LLM-as-a-Judge Benchmark