SGD Can Converge to Local Maxima

Liu Ziyin, Botao Li, James B Simon, Masahito Ueda

2021-09-29 · ICLR 2022

Abstract

Stochastic gradient descent (SGD) is widely used for the nonlinear, nonconvex problem of training deep neural networks, but its behavior remains poorly understood. Many theoretical works have studied SGD, but they commonly rely on restrictive and unrealistic assumptions about the nature of its noise. In this work, we construct example optimization problems illustrating that, if these assumptions are relaxed, SGD can exhibit many strange behaviors that run counter to the established wisdom of the field. Our constructions show that (1) SGD can converge to local maxima, (2) SGD might only escape saddle points arbitrarily slowly, (3) SGD can prefer sharp minima over flat ones, and (4) AMSGrad can converge to local maxima. We realize our most surprising results in a simple neural network-like construction, suggesting their relevance to deep learning.
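The headline claim (1) can be illustrated with a toy simulation. The sketch below is not the paper's exact construction; it assumes a one-dimensional loss L(x) = -x²/2, whose only critical point x = 0 is a local maximum, and a multiplicative (state-dependent) gradient noise model: the stochastic gradient g(x) = -x·z is unbiased (E[z] = 1), but its noise vanishes at x = 0. Although the iterate grows in expectation, E[log|1 + η·z|] < 0 for this choice of z and η, so |x_t| → 0 almost surely and SGD collapses onto the maximum.

```python
import random

def sgd_to_maximum(x0=1.0, eta=1.0, steps=500, seed=0):
    """Run SGD on L(x) = -x**2 / 2 with multiplicative gradient noise.

    The stochastic gradient -x * z is unbiased since E[z] = 1, yet the
    update x <- x * (1 + eta * z) has E[log|1 + eta * z|] < 0 here
    (0.5 * log(0.1) + 0.5 * log(3.9) ~= -0.47), so |x| shrinks to 0,
    the local MAXIMUM, almost surely.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        z = rng.choice([-0.9, 2.9])  # E[z] = 1, but high variance
        g = -x * z                   # unbiased stochastic gradient of -x**2/2
        x = x - eta * g              # standard SGD update
    return x

print(abs(sgd_to_maximum()))  # tiny: the iterate has converged to the maximum
```

The mechanism is that the noise covariance shrinks with distance to the critical point, which is exactly the kind of loss-dependent noise structure that the restrictive assumptions in prior analyses (e.g., bounded or isotropic noise) rule out.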