RelayLLM is a framework for efficient reasoning that combines small language models (SLMs) with large language models (LLMs). The SLM acts as a controller, dynamically invoking the LLM only for critical tokens, which significantly reduces computational cost and latency while narrowing the performance gap with LLM-only decoding.
RelayLLM is a system that makes powerful AI language models (LLMs) more efficient by teaming them up with smaller, faster models (SLMs). Instead of handing an entire problem to the LLM, the SLM asks for help only on specific, difficult words or 'tokens,' saving a lot of computing power and speeding things up while keeping most of the LLM's accuracy.
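The token-level collaboration described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the actual RelayLLM implementation: the `slm_propose` and `llm_generate` functions are hypothetical toy stand-ins for real models, and the confidence-threshold trigger is one simple way an SLM might decide when to relay a token to the LLM.

```python
# Minimal sketch of token-level SLM->LLM collaborative decoding.
# Both "models" below are hypothetical toy stand-ins, not real LLMs:
# each maps a context string to the next token (plus a confidence for the SLM).

def slm_propose(context):
    # Toy SLM: confident on easy tokens, unsure on the one "hard" token.
    table = {
        "": ("The", 0.95),
        "The": ("answer", 0.90),
        "The answer": ("is", 0.92),
        "The answer is": ("unclear", 0.40),  # low confidence -> relay to the LLM
    }
    return table.get(context, ("<eos>", 0.99))

def llm_generate(context):
    # Toy LLM: invoked only for the critical token the SLM is unsure about.
    return "42"

def relay_decode(max_tokens=8, threshold=0.5):
    tokens, llm_calls = [], 0
    for _ in range(max_tokens):
        context = " ".join(tokens)
        token, conf = slm_propose(context)
        if conf < threshold:           # SLM is unsure: invoke the LLM for this token
            token = llm_generate(context)
            llm_calls += 1
        if token == "<eos>":
            break
        tokens.append(token)
    return " ".join(tokens), llm_calls

text, calls = relay_decode()
print(text, calls)  # the LLM is called once, for the single hard token
```

The key property the sketch illustrates is that the expensive model runs for one token out of four; the SLM handles the rest, which is where the cost and latency savings come from.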
Token-level collaborative decoding, SLM-LLM collaboration, dynamic LLM invocation, help-seeking LLM, hybrid LLM system