Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba

2019-07-19 | NeurIPS 2019 | Machine Translation, Image Classification, Translation, Stochastic Optimization

Abstract

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
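
The two-weight update described in the abstract can be illustrated with a small sketch. It assumes that the slow weights move toward the fast weights by linear interpolation with a step size alpha after every k inner steps, and it uses plain SGD as the inner optimizer on a toy quadratic. The function names, the toy objective, and the default values (k=5, alpha=0.5, lr=0.1) are illustrative assumptions, not the authors' reference implementation.

```python
from typing import Callable, List

Vector = List[float]


def sgd_step(weights: Vector, grad: Vector, lr: float) -> Vector:
    """One plain SGD step; stands in for the inner optimizer (assumption)."""
    return [w - lr * g for w, g in zip(weights, grad)]


def lookahead(
    grad_fn: Callable[[Vector], Vector],
    init: Vector,
    k: int = 5,
    alpha: float = 0.5,
    lr: float = 0.1,
    outer_steps: int = 50,
) -> Vector:
    """Sketch of Lookahead: k fast steps forward, 1 slow step back."""
    slow = list(init)
    for _ in range(outer_steps):
        # Fast weights start each round from the current slow weights.
        fast = list(slow)
        # k steps forward with the inner optimizer.
        for _ in range(k):
            fast = sgd_step(fast, grad_fn(fast), lr)
        # 1 step back: interpolate slow weights toward the fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the toy objective 0.5 * (w0^2 + 10 * w1^2)."""
    return [w[0], 10.0 * w[1]]


final = lookahead(quadratic_grad, [1.0, 1.0])
print(final)  # both coordinates should end up close to 0
```

With alpha = 1 the slow weights simply copy the fast weights, so the sketch reduces to running the inner optimizer on its own. The only extra state is one copy of the weights, and the only extra work is one interpolation every k steps.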

Results

Task	Dataset	Metric	Value	Model
Stochastic Optimization	CIFAR-10 ResNet-18 - 200 Epochs	Accuracy	95.27	Lookahead
Stochastic Optimization	CIFAR-10 ResNet-18 - 200 Epochs	Accuracy	95.23	SGD
Stochastic Optimization	CIFAR-10 ResNet-18 - 200 Epochs	Accuracy	94.84	Adam
Stochastic Optimization	ImageNet ResNet-50 - 60 Epochs	Top 5 Accuracy	92.53	Lookahead
Stochastic Optimization	ImageNet ResNet-50 - 60 Epochs	Top 5 Accuracy	92.56	SGD
