Recent advances in large language model (LLM) training focus on enhancing efficiency and reliability, addressing both computational cost and the challenge of hallucination. Techniques such as mixture-of-depths attention aim to improve signal retention in deeper layers, while new fine-tuning datasets instill epistemic humility, helping models recognize the limits of their knowledge and reduce inaccuracies. Knowledge distillation frameworks are evolving to decouple teacher and student model architectures, enabling faster and more effective model compression. Methods such as memory-aware adaptive replay combat catastrophic forgetting during continual fine-tuning, keeping models adaptable in dynamic environments. Together, these innovations aim to make LLMs both more efficient to run and more reliable in their outputs, meeting commercial needs in sectors where accuracy and resource management are paramount.
Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships. However, these models encounter challenges in main...
Deploying Large Language Models (LLMs) for discriminative workloads is often limited by inference latency, compute, and API costs at scale. Active distillation reduces these costs by querying an LLM o...
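The selective-querying idea behind active distillation can be illustrated with a toy sketch (the function names here are illustrative, not from the paper): a cheap student model scores unlabeled examples by predictive entropy, and only the most uncertain examples are sent to the expensive LLM teacher for labeling, keeping the query budget small.

```python
import numpy as np

def entropy(probs):
    # Predictive entropy of each row of class probabilities.
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def select_for_teacher(student_probs, budget):
    """Return indices of the `budget` most uncertain examples.
    Only these would be forwarded to the costly LLM teacher;
    the rest are labeled by the student alone."""
    scores = entropy(student_probs)
    return np.argsort(-scores)[:budget]

# Example: three unlabeled examples, a query budget of 1.
probs = np.array([[0.9, 0.1],   # confident
                  [0.5, 0.5],   # maximally uncertain
                  [0.8, 0.2]])  # fairly confident
chosen = select_for_teacher(probs, budget=1)  # picks index 1
```

Entropy is only one possible acquisition score; margin-based or disagreement-based criteria slot into the same loop.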
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually dilut...
Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are ...
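As a rough illustration of how learned tokenizers build vocabularies from raw text, here is one merge step of a BPE-style algorithm (a minimal sketch, not any specific tokenizer's implementation): count adjacent symbol pairs across the corpus and fuse the most frequent pair into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus, where `words` maps a
    tuple of symbols to its corpus frequency, and return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with the concatenated symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged
```

Repeating these two steps until a target vocabulary size is reached yields the merge table a BPE tokenizer applies at inference time.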
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked...
Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, ...
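For context, the standard KD objective that such frameworks build on can be sketched in a few lines (a minimal NumPy version of Hinton-style distillation with temperature scaling; this is the classic baseline, not the paper's method):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)               # soft teacher targets
    log_q = np.log(softmax(student_logits, T))   # student log-probs
    kl = np.sum(p * (np.log(p) - log_q), axis=-1)
    return float((T ** 2) * kl.mean())
```

A higher temperature softens the teacher distribution, exposing the relative probabilities of incorrect classes ("dark knowledge") that a hard label would discard.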
Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce $\textit{HypoTerm...
Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While s...
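A common baseline that replay-based approaches to catastrophic forgetting build on is a reservoir-sampled buffer of past-task examples, with a fraction of each new-task batch drawn from the buffer. A minimal sketch follows (class and parameter names are illustrative, not from the paper):

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer maintained by reservoir sampling, so every
    example seen so far has an equal chance of being retained."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def mix_batch(self, new_batch, replay_frac=0.25):
        """Append replayed past-task examples to a new-task batch,
        up to `replay_frac` of the new batch size."""
        k = min(int(len(new_batch) * replay_frac), len(self.items))
        return new_batch + self.rng.sample(self.items, k)
```

Adaptive variants adjust `replay_frac` or the retention policy per example (e.g. by forgetting risk), rather than sampling uniformly as done here.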
Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not s...
We present experimental results from seven controlled runs of nanoFMT, a Free-Market Algorithm (FMA) orchestrated transformer with dynamic Mixture-of-Experts (MoE) management. The experiments address ...