Optimizing Model EMA: Scaling Rule for Enhanced Machine Learning Performance

The Importance of Preserving Training Dynamics in Machine Learning

Preserving training dynamics across batch sizes is crucial for trading batch size against wall-clock time. Scaling rules, such as linearly scaling the learning rate with the batch size in stochastic gradient descent (SGD), enable this trade-off. Another important tool in machine learning is the model EMA, an exponential moving average of the model's parameters, which can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL).
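To make the model EMA concrete, here is a minimal sketch of the standard EMA update, in which the EMA parameters track the target model's parameters with momentum ρ (the function name and the toy values are illustrative, not from the paper):

```python
import numpy as np

def ema_update(ema_params, model_params, momentum=0.999):
    """One EMA step: theta_ema <- rho * theta_ema + (1 - rho) * theta."""
    return momentum * ema_params + (1.0 - momentum) * model_params

# Toy example: the EMA lags behind a fixed target, closing the gap
# by a factor of (1 - momentum) each step.
theta_ema = np.zeros(3)
theta = np.ones(3)
for _ in range(100):
    theta_ema = ema_update(theta_ema, theta, momentum=0.99)
# After 100 steps, theta_ema = 1 - 0.99**100 (elementwise).
```

The momentum controls how quickly the EMA forgets old parameters, which is exactly the quantity that interacts with batch-size scaling.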

Optimizing the Model EMA

Prior works have not considered how to adjust the model EMA when scaling the batch size, leading to different training dynamics across batch sizes and lower model performance. This work provides a scaling rule for optimization in the presence of a model EMA and demonstrates it across a range of architectures, optimizers, and data modalities. The rule is also shown to enable the training of EMA-based pseudo-labeling and SSL methods at both small and large batch sizes.
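The article does not spell out the rule itself, but assuming it takes the form reported in the underlying paper, scaling the batch size by a factor κ pairs the linear learning-rate rule with an exponentiation of the EMA momentum, ρ → ρ^κ, so the EMA forgets at the same rate per unit of data. A hedged sketch (function name and example values are illustrative):

```python
def scale_hyperparams(base_lr, base_momentum, kappa):
    """Scale optimizer hyperparameters when the batch size grows by kappa.

    - Learning rate: linear scaling rule, lr -> kappa * lr.
    - EMA momentum: assumed EMA scaling rule, rho -> rho ** kappa,
      keeping the total EMA decay per epoch unchanged.
    """
    return kappa * base_lr, base_momentum ** kappa

# Example: going from batch size 256 to 1024 (kappa = 4).
lr, rho = scale_hyperparams(0.1, 0.999, kappa=4)
```

Note that for momentum close to 1, ρ^κ ≈ 1 − κ(1 − ρ), so the rule roughly scales the EMA's "forgetting rate" linearly, mirroring the learning-rate rule.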

Implications for Self-Supervised Learning

For SSL, the rule enables training BYOL up to a batch size of 24,576 without sacrificing performance, yielding a 6× wall-clock time reduction under idealized hardware settings.
