Description
Adam-mini is a memory-efficient variant of Adam that matches or outperforms AdamW while using 45% to 50% less memory. It saves memory by cutting down the learning-rate resources in Adam (i.e., 1/√v, where v is Adam's second-moment estimate). The authors find that ≥ 90% of the learning rates in v can be harmlessly removed if they (1) carefully partition the parameters into blocks following their proposed principle on Hessian structure and (2) assign a single but good learning rate to each parameter block. They further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out.
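
To make the idea of one learning rate per parameter block concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the blocks are already given (the Hessian-based partitioning principle is not reproduced here) and that each block's single second-moment value is the running mean of its squared gradients. The function name blockwise_adam_step and all argument names are hypothetical.

```python
import math


def blockwise_adam_step(params, grads, m, v, step, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update with a single second-moment scalar per block.

    params, grads, m: lists of blocks, each block a list of floats.
    v: list with ONE float per block (Adam would keep one per parameter).
    The block layout is an assumption of this sketch, not the paper's rule.
    """
    new_params, new_m, new_v = [], [], []
    for p_blk, g_blk, m_blk, v_blk in zip(params, grads, m, v):
        # First moment: kept per parameter, exactly as in Adam.
        m_blk = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m_blk, g_blk)]
        # Second moment: one scalar per block, from the mean squared gradient.
        mean_sq = sum(gi * gi for gi in g_blk) / len(g_blk)
        v_blk = beta2 * v_blk + (1 - beta2) * mean_sq
        # Bias correction, as in Adam.
        v_hat = v_blk / (1 - beta2 ** step)
        denom = math.sqrt(v_hat) + eps  # shared by the whole block
        new_params.append([
            pi - lr * (mi / (1 - beta1 ** step)) / denom
            for pi, mi in zip(p_blk, m_blk)
        ])
        new_m.append(m_blk)
        new_v.append(v_blk)
    return new_params, new_m, new_v


# Toy example: two blocks holding 2 and 3 parameters.
params = [[1.0, 2.0], [3.0, 4.0, 5.0]]
grads = [[0.5, -0.5], [1.0, 0.0, -1.0]]
m = [[0.0, 0.0], [0.0, 0.0, 0.0]]
v = [0.0, 0.0]
params, m, v = blockwise_adam_step(params, grads, m, v, step=1)
```

In this toy setup the optimizer state holds 5 first-moment values plus 2 second-moment values, where Adam would hold 5 plus 5. The saving grows as blocks get larger, since v shrinks from one entry per parameter to one entry per block.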