RelayLLM is a framework for efficient reasoning that combines small language models (SLMs) with large language models (LLMs). The SLM acts as a controller, dynamically invoking the LLM only for critical tokens, which significantly reduces computational cost and latency while narrowing the performance gap with LLM-only decoding.
RelayLLM is a system that makes powerful AI language models (LLMs) more efficient by teaming them up with smaller, faster models (SLMs). Instead of handing an entire problem to the LLM, the SLM asks for help only on specific, difficult words or 'tokens,' saving a lot of computing power and speeding things up while keeping most of the LLM's accuracy.
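The token-level collaboration described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the actual RelayLLM implementation: the `slm_propose` and `llm_generate` functions are hypothetical toy stand-ins for real models, and the confidence-threshold trigger is one simple way an SLM might decide when to relay a token to the LLM.

```python
# Minimal sketch of token-level SLM->LLM collaborative decoding.
# Both "models" below are hypothetical toy stand-ins, not real LLMs:
# each maps a context string to the next token (plus a confidence for the SLM).

def slm_propose(context):
    # Toy SLM: confident on easy tokens, unsure on the one "hard" token.
    table = {
        "": ("The", 0.95),
        "The": ("answer", 0.90),
        "The answer": ("is", 0.92),
        "The answer is": ("unclear", 0.40),  # low confidence -> relay to the LLM
    }
    return table.get(context, ("<eos>", 0.99))

def llm_generate(context):
    # Toy LLM: invoked only for the critical token the SLM is unsure about.
    return "42"

def relay_decode(max_tokens=8, threshold=0.5):
    tokens, llm_calls = [], 0
    for _ in range(max_tokens):
        context = " ".join(tokens)
        token, conf = slm_propose(context)
        if conf < threshold:           # SLM is unsure: invoke the LLM for this token
            token = llm_generate(context)
            llm_calls += 1
        if token == "<eos>":
            break
        tokens.append(token)
    return " ".join(tokens), llm_calls

text, calls = relay_decode()
print(text, calls)  # the LLM is called once, for the single hard token
```

The key property the sketch illustrates is that the expensive model runs for one token out of four; the SLM handles the rest, which is where the cost and latency savings come from.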
Token-level collaborative decoding, SLM-LLM collaboration, dynamic LLM invocation, help-seeking LLM, hybrid LLM system