Regularizing and Optimizing LSTM Language Models

Stephen Merity, Nitish Shirish Keskar, Richard Socher

2017-08-07, ICLR 2018
Tasks: Image Classification, Translation, Language Modelling


Abstract

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve even lower state-of-the-art perplexities of 52.8 on Penn Treebank and 52.0 on WikiText-2.
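
The two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the rescaling of surviving weights by 1 / (1 - p), and the default `nonmono` window are assumptions made for illustration. `drop_connect` zeroes individual entries of a hidden-to-hidden weight matrix (DropConnect drops weights rather than activations), and `should_start_averaging` shows one way a non-monotonic trigger might be checked: switch to averaging once the latest validation perplexity fails to beat the best value recorded more than `nonmono` checks ago.

```python
import random


def drop_connect(weights, p, rng=random):
    """Return a copy of `weights` with each entry zeroed with probability p.

    Surviving entries are rescaled by 1 / (1 - p). The rescaling is an
    assumption here (it mirrors standard inverted dropout) and keeps the
    expected value of each weight unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [
        [0.0 if rng.random() < p else w * scale for w in row]
        for row in weights
    ]


def should_start_averaging(val_history, nonmono=5):
    """Hypothetical helper for a non-monotonic averaging trigger.

    `val_history` lists validation perplexities, oldest first, with the
    latest check last. Returns True when the latest value is worse than
    the best value recorded more than `nonmono` checks ago.
    """
    if len(val_history) < 2:
        return False
    *earlier, latest = val_history
    if len(earlier) <= nonmono:
        # Not enough history yet to apply the condition.
        return False
    # Ignore the most recent `nonmono` earlier checks when taking the best.
    return latest > min(earlier[: len(earlier) - nonmono])
```

In a training loop under these assumptions, the mask from `drop_connect` would be sampled once per forward pass and reused at every timestep, and `should_start_averaging` would be called after each validation check until it first returns True.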

Results

Task	Dataset	Metric	Value	Model
Language Modelling	Penn Treebank (Word Level)	Test perplexity	52.8	AWD-LSTM + continuous cache pointer
Language Modelling	Penn Treebank (Word Level)	Validation perplexity	53.9	AWD-LSTM + continuous cache pointer
Language Modelling	Penn Treebank (Word Level)	Test perplexity	57.3	AWD-LSTM
Language Modelling	Penn Treebank (Word Level)	Validation perplexity	60.0	AWD-LSTM
Language Modelling	WikiText-2	Test perplexity	52.0	AWD-LSTM + continuous cache pointer
Language Modelling	WikiText-2	Validation perplexity	53.8	AWD-LSTM + continuous cache pointer
Language Modelling	WikiText-2	Test perplexity	65.8	AWD-LSTM
Language Modelling	WikiText-2	Validation perplexity	68.6	AWD-LSTM
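
For readers unfamiliar with the metric, word-level perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below shows that calculation; the function name and its input format (a list of natural-log token probabilities) are assumptions for illustration and are not taken from the paper's code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp(-(1/N) * sum(log p_i)) over the N tokens supplied.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Under this definition, a model that assigns probability 1/50 to every token has a perplexity of 50, so lower values in the table indicate a model that is less uncertain about the next word.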
