Test-Time Training on Nearest Neighbors for Large Language Models

Moritz Hardt, Yu Sun

2023-05-29Prompt Engineering Retrieval Language Modelling

Abstract

Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.

Results

Task	Dataset	Metric	Value	Model
Language Modelling	The Pile	Bits per byte	0.85	GPT-2 Large 774M (test-time training on nearest neighbors)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21 Leveraging Language Prior for Infrared Small Target Detection2025-07-17 Emotional Support with LLM-based Empathetic Dialogue Generation2025-07-17 From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals2025-07-17 A Survey of Context Engineering for Large Language Models2025-07-17 MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval2025-07-17 Making Language Model a Hierarchical Classifier and Generator2025-07-17