Understanding the Success of Multi-View Self-Supervised Learning
Multi-view self-supervised learning (MVSSL) is remarkably effective, yet we still understand little about why it works. While contrastive MVSSL methods have been analyzed through the lens of InfoNCE, a lower bound on the mutual information (MI), the connection between the other MVSSL families and MI remains unclear.
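For context, InfoNCE lower-bounds the MI between the representations of two views, $Z_1$ and $Z_2$. One standard form, for a batch of $K$ positive pairs and a learned critic $f$, is

$$ I(Z_1; Z_2) \;\geq\; \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K}\log\frac{e^{f(z_1^{(i)},\, z_2^{(i)})}}{\frac{1}{K}\sum_{j=1}^{K} e^{f(z_1^{(i)},\, z_2^{(j)})}}\right], $$

which is why maximizing the InfoNCE objective can be read as maximizing MI.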
The Role of Entropy and Reconstruction
Here, we introduce an alternative lower bound on MI that combines an entropy term and a reconstruction term (the ER bound), providing fresh insight into the main MVSSL families. Using the ER bound, we show that clustering-based methods such as DeepCluster and SwAV also maximize MI.
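Concretely, the ER bound follows from decomposing the MI into an entropy and a conditional-entropy term, and upper-bounding the latter by a cross-entropy under any variational reconstruction distribution $q$:

$$ I(Z_1; Z_2) \;=\; H(Z_2) - H(Z_2 \mid Z_1) \;\geq\; \underbrace{H(Z_2)}_{\text{entropy}} \;+\; \underbrace{\mathbb{E}\left[\log q_{Z_2 \mid Z_1}(z_2 \mid z_1)\right]}_{\text{reconstruction}}. $$

Maximizing both terms therefore maximizes a lower bound on the MI between the two views' representations.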
Reinterpreting Distillation-Based Approaches
We also take a closer look at distillation-based approaches, namely BYOL and DINO, and offer a new perspective on their mechanisms. Our analysis shows that these methods explicitly maximize the reconstruction term while implicitly keeping the entropy of the representations stable, a reading we verify empirically.
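To make this reading concrete, below is a minimal PyTorch-style sketch of a BYOL-like update; the cosine-similarity loss and function names are illustrative choices, not the papers' exact code. The predictor loss plays the role of the explicit reconstruction term, while the stop-gradient and the slow EMA target are the mechanisms that implicitly keep the entropy of the targets stable.

```python
import torch
import torch.nn.functional as F

def byol_style_loss(online_pred, target_proj):
    """Reconstruction term of a BYOL-style objective: the online
    predictor output is pushed toward the (stop-gradient) target
    projection via cosine similarity."""
    online_pred = F.normalize(online_pred, dim=-1)
    target_proj = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient
    return -(online_pred * target_proj).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """EMA update of the target network; in the ER reading, this slowly
    moving target is what implicitly stabilizes the targets' entropy."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)
```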
The Benefits of the ER Bound
By replacing the objectives of these MVSSL methods with the ER bound, we obtain competitive downstream performance. Moreover, this substitution makes the methods stable even when training with smaller batch sizes.
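As an illustration, here is a minimal sketch of an ER-style objective, assuming discrete (cluster-style) projections as in SwAV or DINO; the batch-level entropy estimator and the exact loss shape are our simplifying assumptions, not necessarily those used in the experiments.

```python
import torch
import torch.nn.functional as F

def er_loss(logits_v1, logits_v2, eps=1e-8):
    """Illustrative ER-style objective for discrete (cluster) projections.

    logits_v1, logits_v2: [batch, K] logits for two views of each image.
    Maximizes H(Z2) + E[log q(Z2 | Z1)] by minimizing its negation:
    the entropy of the mean assignment (a batch estimate of H(Z2)) plus
    the cross-entropy of view 2's code predicted from view 1.
    """
    p1 = F.softmax(logits_v1, dim=-1)
    p2 = F.softmax(logits_v2, dim=-1)
    # Batch estimate of the marginal entropy of the second view's codes.
    p2_mean = p2.mean(dim=0)
    entropy = -(p2_mean * (p2_mean + eps).log()).sum()
    # Reconstruction: predict view 2's code from view 1's logits.
    reconstruction = (p2 * F.log_softmax(logits_v1, dim=-1)).sum(dim=-1).mean()
    return -(entropy + reconstruction)
```

Minimizing this loss raises the batch-level entropy of the codes, which counteracts collapse, while making one view's code predictable from the other; because the entropy term is explicit, it does not rely on large batches of negatives the way InfoNCE-style objectives do.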