Inverse Square Root is a learning rate schedule $1 / \sqrt{\max(n, k)}$, where $n$ is the current training iteration and $k$ is the number of warm-up steps. This sets a constant learning rate of $1 / \sqrt{k}$ for the first $k$ steps, then decays the learning rate in proportion to $1 / \sqrt{n}$ (a polynomial decay, not an exponential one) until pre-training is over.
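
As a rough illustration, the schedule can be sketched in a few lines of Python. This assumes the schedule has the form 1 / sqrt(max(n, k)), with n the current training iteration and k the number of warm-up steps. The function name `inverse_sqrt_lr` and the `scale` multiplier are hypothetical additions for illustration and are not part of the definition above.

```python
import math


def inverse_sqrt_lr(n: int, k: int, scale: float = 1.0) -> float:
    """Return the learning rate at iteration n for k warm-up steps.

    `scale` is an assumed, optional multiplier (for example a base
    learning rate); with scale=1.0 this is exactly 1 / sqrt(max(n, k)).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # For n <= k, max(n, k) == k, so the rate is constant at scale / sqrt(k).
    # For n > k, the rate decays as scale / sqrt(n).
    return scale / math.sqrt(max(n, k))


if __name__ == "__main__":
    # Print the schedule at a few iterations with 100 warm-up steps.
    for step in (1, 50, 100, 400, 1600):
        print(step, inverse_sqrt_lr(step, k=100))
```

With k = 100, the rate stays at 0.1 through iteration 100 and falls to 0.05 by iteration 400, so quadrupling the iteration count halves the learning rate.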