The Importance of Preserving Training Dynamics in Machine Learning
Preserving training dynamics in machine learning is key to trading off batch size against wall-clock training time. Scaling rules, such as linearly scaling the learning rate with the batch size in stochastic gradient descent (SGD), enable this trade-off. Another important tool is the model exponential moving average (EMA), which can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL).
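As a concrete illustration of the two pieces above, here is a minimal sketch of the linear learning-rate scaling rule and a standard model EMA update. The function names and the base values are illustrative, not from the source; parameters are represented as plain floats for clarity.

```python
def scale_lr(base_lr, base_batch, new_batch):
    # Linear scaling rule for SGD: multiply the learning rate
    # by kappa = new_batch / base_batch.
    return base_lr * new_batch / base_batch

def ema_update(ema_params, params, momentum=0.999):
    # Model EMA: theta_ema <- momentum * theta_ema + (1 - momentum) * theta,
    # applied elementwise to each parameter.
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, params)]
```

For example, going from a base batch size of 256 at learning rate 0.1 to a batch size of 1024 gives kappa = 4 and a scaled learning rate of 0.4.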
Optimizing the Model EMA
Prior scaling rules have not accounted for the model EMA, so scaling the batch size changes the training dynamics and degrades model performance. In this work, a scaling rule for optimization in the presence of a model EMA is derived and demonstrated across a range of architectures, optimizers, and data modalities. The rule is also shown to enable the training of EMA-based pseudo-labeling and SSL methods at both small and large batch sizes.
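The passage above can be made concrete with a sketch of how such a rule is applied. This assumes the rule takes the form of pairing the linear learning-rate scaling with an exponential adjustment of the EMA momentum (when the batch size scales by kappa, the momentum rho becomes rho**kappa, so the EMA decays at the same rate per unit of data); the exact form is not stated in this summary, so treat the code as an assumed illustration.

```python
def scale_ema_momentum(base_momentum, kappa):
    # Assumed EMA scaling rule: rho_hat = rho ** kappa, so that kappa
    # fewer EMA updates per epoch still produce the same overall decay.
    return base_momentum ** kappa

def scale_hyperparams(base_lr, base_momentum, base_batch, new_batch):
    # Jointly scale the SGD learning rate (linearly) and the EMA
    # momentum (exponentially) when changing the batch size.
    kappa = new_batch / base_batch
    return base_lr * kappa, scale_ema_momentum(base_momentum, kappa)
```

Doubling the batch size (kappa = 2) from a base of lr = 0.1, rho = 0.999 would then give lr = 0.2 and rho = 0.999**2 ≈ 0.998.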
Implications for Self-Supervised Learning
For SSL, the rule enables training BYOL at batch sizes up to 24,576 without sacrificing performance, a 6× wall-clock time reduction under idealized hardware settings.