Adam (Adaptive Moment Estimation) is a popular optimization algorithm in deep learning that adaptively adjusts the learning rate for each parameter using estimates of the first and second moments of the gradients. It is widely used for its fast convergence across many deep learning tasks, though it can sometimes generalize worse than plain SGD by converging to sharp minima. Ongoing research focuses on understanding its theoretical properties and on developing variants with better generalization.
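The update rule described above can be sketched in plain Python. This is a minimal, framework-free illustration (not a production implementation): `adam_step`, the flat parameter lists, and the toy objective f(x) = x² are all assumptions made for the example; the moment estimates, bias correction, and per-parameter scaling follow the standard Adam formulation.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of parameters and gradients.

    m, v hold the running first- and second-moment estimates and are
    updated in place; t is the 1-based step count used for bias correction.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first moment (mean of gradients)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m[i] / (1 - beta1 ** t)              # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)              # bias-corrected second moment
        # per-parameter adaptive step: large v_hat shrinks the effective learning rate
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

# toy usage: minimize f(x) = x^2 starting from x = 5.0
params, m, v = [5.0], [0.0], [0.0]
for t in range(1, 501):
    grads = [2 * params[0]]                          # gradient of x^2
    params = adam_step(params, grads, m, v, t)
```

After a few hundred steps the iterate settles near the minimum at 0; note that because the effective step is roughly `lr` regardless of gradient magnitude, Adam can oscillate slightly around a minimum rather than stopping exactly on it.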
Related terms: InvAdam, DualAdam, AdamW