A paper accepted to the NeurIPS 2023 Machine Learning for Audio Workshop explores the synergy between diffusion models and Masked Audio-Video Learners (MAViL).
The Significance of Audio-Visual Representations
Over the past few years, combining audio and visual signals has produced richer audio-visual representations. The web offers vast amounts of unlabeled video, which self-supervised frameworks have exploited to achieve impressive results across a variety of audio and video tasks.
MAViL and Diffusion Models
MAViL has emerged as a leading pre-training framework, using contrastive learning and masked autoencoding to reconstruct audio spectrograms and video frames by combining information from both modalities. By incorporating diffusion models into MAViL and applying training-efficiency techniques such as a masking ratio curriculum and adaptive batch sizing, the researchers achieved a 32% reduction in pre-training FLOPs and an 18% decrease in pre-training wall-clock time. Importantly, this increased efficiency does not compromise the model's performance on downstream audio-classification tasks relative to the original MAViL.
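To make the two efficiency techniques concrete, here is a minimal sketch of how a masking ratio curriculum and adaptive batch sizing might interact. The function names, schedule shape (linear ramp), and all parameter values are illustrative assumptions, not details from the paper; the idea is simply that as the mask ratio rises, fewer visible tokens are processed per sample, so more samples can fit into the same compute budget.

```python
def masking_ratio_schedule(step, total_steps, start_ratio=0.6, end_ratio=0.9):
    """Linearly ramp the fraction of masked patches over pre-training.

    Hypothetical schedule: easier (less-masked) reconstruction early,
    harder later. The endpoints 0.6 and 0.9 are illustrative only.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + frac * (end_ratio - start_ratio)


def adaptive_batch_size(mask_ratio, base_batch=64, base_mask_ratio=0.6):
    """Scale batch size inversely with the number of visible tokens.

    Assumption: encoder cost is roughly proportional to visible tokens,
    so halving the visible fraction lets roughly twice as many samples
    fit in the same compute budget. A simplifying linear rule.
    """
    visible = 1.0 - mask_ratio
    base_visible = 1.0 - base_mask_ratio
    return max(1, int(base_batch * base_visible / visible))


# Early in training: low mask ratio, baseline batch size.
early_ratio = masking_ratio_schedule(step=0, total_steps=1000)
early_batch = adaptive_batch_size(early_ratio)

# Late in training: high mask ratio, larger batch at similar cost.
late_ratio = masking_ratio_schedule(step=1000, total_steps=1000)
late_batch = adaptive_batch_size(late_ratio)
```

Under this sketch the batch size grows as the curriculum increases masking, which is one plausible way the reported FLOPs and wall-clock savings could be realized without changing total samples seen.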
This study shows the potential benefits of combining diffusion models with MAViL, leading to improved efficiency without sacrificing performance in audio tasks.