MIT researchers, in collaboration with the MIT-IBM Watson AI Lab and IBM Research, have developed a new technique for learning from unlabeled audio and visual data that could improve machine learning models used in applications such as speech recognition and object detection. The technique, called the Contrastive Audio-Visual Masked Autoencoder (CAV-MAE), combines two self-supervised learning approaches, contrastive learning and masked data modeling, so that models can be trained at scale without annotated data. Trained on large datasets of audio and video clips from YouTube, CAV-MAE learns to extract meaningful latent representations and map them into a high-dimensional space. The researchers believe the technique could be applied to action recognition in areas such as sports, education, entertainment, motor vehicles, and public safety, and could eventually be extended to modalities beyond audio and video.
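To make the idea of combining a contrastive objective with a masked-autoencoder objective more concrete, here is a minimal NumPy sketch. It is not the researchers' implementation: the function names, the temperature, the loss weight, and the array shapes are all assumptions chosen for illustration, and the encoders and decoder that would produce these arrays are left out entirely.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_loss(audio_emb, video_emb, temperature=0.05):
    """InfoNCE-style loss. Row i of each array is assumed to come from the same clip.

    The temperature value is an illustrative assumption.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(video_emb)
    logits = a @ v.T / temperature                        # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching audio/video pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error over masked patches only (mask is True where a patch was hidden)."""
    per_patch = ((prediction - target) ** 2).mean(axis=-1)  # (batch, patches)
    return per_patch[mask].mean()

def combined_loss(audio_emb, video_emb, target, prediction, mask,
                  contrastive_weight=0.01):
    """Weighted sum of the two objectives. The weight is a hypothetical value."""
    reconstruction = masked_reconstruction_loss(target, prediction, mask)
    contrastive = contrastive_loss(audio_emb, video_emb)
    return reconstruction + contrastive_weight * contrastive

# Toy inputs standing in for encoder and decoder outputs (random, for illustration only).
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(4, 16))        # 4 clips, 16-dimensional audio embeddings
video_emb = rng.normal(size=(4, 16))        # matching video embeddings
target = rng.normal(size=(4, 8, 32))        # 8 patches per clip, 32 values per patch
prediction = target + 0.1 * rng.normal(size=target.shape)
mask = rng.random((4, 8)) < 0.75            # hide roughly three quarters of the patches

print("combined loss:", combined_loss(audio_emb, video_emb, target, prediction, mask))
```

In this sketch the contrastive term pulls the audio and video embeddings of the same clip together while pushing apart embeddings from different clips, and the reconstruction term rewards correctly filling in the hidden patches. Neither term uses a label.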