Efficiency Boost: Fusion of Diffusion Models and MAViL for Audio-Visual Learning

A paper exploring the synergy between diffusion models and Masked Audio-Video Learners (MAViL) has been accepted at the NeurIPS 2023 Machine Learning for Audio Workshop.

The Significance of Audio-Visual Representations

Over the past few years, combining audio and visual signals has led to richer audio-visual representations. The vast amount of unlabeled video content available online has been used to train self-supervised frameworks, yielding impressive results on a variety of audio and video tasks.

MAViL and Diffusion Models

MAViL has emerged as a leading pre-training framework, using contrastive learning and masked autoencoding to reconstruct audio spectrograms and video frames by combining information from both modalities. By incorporating diffusion models into MAViL and applying training-efficiency techniques such as a masking ratio curriculum and adaptive batch sizing, researchers achieved a 32% reduction in pre-training FLOPs and an 18% decrease in pre-training wall-clock time. Importantly, this increased efficiency does not compromise the model's performance on downstream audio-classification tasks compared to MAViL.
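To give a flavor of the two efficiency techniques mentioned above, here is a minimal sketch. The article does not specify the actual schedules used in the paper, so the linear ramp, the 1/(1 − ratio) batch-scaling heuristic, and all function names and default values below are illustrative assumptions, not the authors' method.

```python
def masking_ratio(step: int, total_steps: int,
                  start: float = 0.5, end: float = 0.8) -> float:
    """Hypothetical masking ratio curriculum: linearly ramp the fraction
    of masked patches from `start` to `end` over pre-training.
    A higher ratio means fewer visible tokens, so less encoder compute."""
    if total_steps <= 1:
        return end
    t = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * t


def adaptive_batch_size(base_batch: int, base_ratio: float,
                        ratio: float) -> int:
    """Hypothetical adaptive batch sizing: scale the batch so the number
    of *visible* (unmasked) tokens per batch stays roughly constant as
    the masking ratio changes."""
    return max(1, round(base_batch * (1 - base_ratio) / (1 - ratio)))
```

For example, with a base batch of 256 calibrated at a 0.5 masking ratio, raising the ratio to 0.8 would allow a batch of 640 under this heuristic, since each sample then contributes far fewer visible tokens.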

This study demonstrates the benefits of integrating diffusion models into MAViL: substantially cheaper pre-training without sacrificing performance on audio tasks.
