RoundTripCodeEval (RTCE) is a specialized benchmark suite developed to assess the "round-trip consistency" of Large Language Models (LLMs) when performing code execution reasoning tasks. It specifically targets the ability of LLMs to maintain coherent and consistent logic when processing information in both forward and backward directions, such as encoding and decoding operations. RTCE operates by presenting four distinct code execution reasoning tasks, employing an execution-free, exact-match evaluation method to determine the fidelity of bijective mappings. This approach is crucial because it uncovers fundamental limitations in current LLMs, demonstrating their struggle with internal coherence and consistent reasoning, which are vital for trustworthy code generation and understanding. Researchers and ML engineers working on Code-LLMs utilize RTCE to identify weaknesses in model reasoning that are not captured by traditional I/O prediction or natural language benchmarks, guiding the development of more robust and reliable AI coding assistants.
RoundTripCodeEval (RTCE) is a new benchmark that tests how consistently large AI models handle code, especially when reversing operations like encoding and decoding. It shows that even advanced AI models struggle to maintain consistent logic, highlighting a key weakness in their ability to reason reliably about code. This benchmark provides novel insights not captured by existing evaluation methods.
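The round-trip consistency idea described above can be sketched in a few lines. This is a minimal illustration, not RTCE's actual harness: `model_predict` is a hypothetical stand-in for querying an LLM (here simulated by a toy string-reversal model), and the check mirrors RTCE's execution-free, exact-match criterion by comparing the backward prediction against the original input.

```python
# Minimal sketch of a round-trip exact-match consistency check,
# in the spirit of RTCE. "model_predict" is a hypothetical stand-in
# for an LLM query; RTCE is execution-free, so consistency is judged
# purely by exact string comparison, never by running code.

def model_predict(task: str, payload: str) -> str:
    # Hypothetical model: encodes by reversing a string and
    # decodes by reversing it back (a simple bijective mapping).
    if task == "encode":
        return payload[::-1]
    if task == "decode":
        return payload[::-1]
    raise ValueError(f"unknown task: {task}")

def round_trip_consistent(original: str) -> bool:
    """Forward pass (encode), then backward pass (decode); the model
    is round-trip consistent only if the input is recovered exactly."""
    encoded = model_predict("encode", original)
    decoded = model_predict("decode", encoded)
    return decoded == original  # exact-match criterion

print(round_trip_consistent("hello world"))
```

For a real LLM, `model_predict` would prompt the model for the forward and backward directions separately; inconsistency appears whenever the decode of the model's own encode fails the exact-match test.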