Title: Enhancing Multimodal Deep Learning with video2dataset
Breakthroughs in multimodal deep learning such as CLIP, Stable Diffusion, and Flamingo have significantly advanced the field in recent years. Joint text-image modeling has become a central part of modern artificial intelligence, allowing models to generate striking imagery and solve complex problems. Other modalities such as video and audio have not seen comparable progress, mainly because high-quality, large-scale annotated datasets are scarce. Addressing this data problem is crucial for further advances in multimodal research.
video2dataset: A Solution for Video and Audio Dataset Curation
To address the limitations in video and audio datasets, researchers have introduced video2dataset. This open-source program allows for fast and extensive curation of video and audio datasets. It has been tested successfully on several large video datasets and offers a wide range of transformations and features.
Architectural Features of video2dataset
Similar to its image counterpart, img2dataset, video2dataset takes a list of URLs and associated metadata to create a WebDataset that can be easily loaded with a single command. The WebDataset can be further processed and modified while preserving the shard contents. This flexibility enables researchers to make additional changes to the dataset as needed.
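To make the output format concrete: a WebDataset shard is an ordinary tar archive in which files that share a basename form one sample. The sketch below uses only the Python standard library to write and read such a shard in memory. It is not video2dataset's own code; the sample keys, the placeholder payloads, and the example.com URLs are assumptions chosen for illustration.

```python
import io
import json
import tarfile
from collections import defaultdict


def write_shard(samples):
    """Pack samples into an in-memory tar, one member per (key, extension)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_shard(shard_bytes):
    """Group tar members back into samples by their shared basename."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar.getmembers():
            key, ext = member.name.split(".", 1)
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder payloads stand in for real video bytes (illustrative only).
samples = {
    "00000": {
        "mp4": b"<video bytes 0>",
        "txt": b"a cat jumping",
        "json": json.dumps({"url": "https://example.com/a.mp4"}).encode(),
    },
    "00001": {
        "mp4": b"<video bytes 1>",
        "txt": b"a dog running",
        "json": json.dumps({"url": "https://example.com/b.mp4"}).encode(),
    },
}

shard = write_shard(samples)
restored = read_shard(shard)
```

Because each sample is just a group of files inside a tar, a later processing pass can rewrite one field, for example the video, while leaving the captions and metadata of the same sample in place.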
How Does video2dataset Work?
The process begins by partitioning the input data into shards that are distributed evenly among workers. The input shards are cached temporarily so that a run can resume if it terminates unexpectedly. Multiple distribution modes, including multiprocessing, pyspark, and slurm, offer scalability options for different scenarios.
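The partitioning step can be pictured with a small standard-library sketch. It splits a URL list into fixed-size shards, spreads shard ids across workers round-robin, and on restart skips shards that are recorded as finished. All function names here are hypothetical, and the resume logic is an assumption about how cached shards might be used, not a description of video2dataset's internals.

```python
def make_shards(rows, shard_size):
    """Split rows into consecutive shards of at most shard_size items."""
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def assign_to_workers(shard_ids, num_workers):
    """Round-robin assignment, so per-worker shard counts differ by at most one."""
    return [shard_ids[w::num_workers] for w in range(num_workers)]


def pending_shards(shard_ids, completed):
    """On restart, keep only the shards that have not finished yet."""
    return [s for s in shard_ids if s not in completed]


# Illustrative input: ten made-up URLs.
urls = [f"https://example.com/video_{i}.mp4" for i in range(10)]

shards = make_shards(urls, shard_size=3)
shard_ids = list(range(len(shards)))
plan = assign_to_workers(shard_ids, num_workers=2)

# Suppose shards 0 and 2 completed before the run was interrupted.
todo = pending_shards(shard_ids, completed={0, 2})
```

Working at shard granularity keeps the bookkeeping small: a restarted run only needs to know which shard ids are done, not the state of each individual video.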
Different reading strategies are used depending on the format of the incoming dataset. If the data consists of URLs, video2dataset fetches the videos from the internet and adds them to the dataset. If the input is an existing WebDataset, the data loader reads the video bytes or frames in tensor format.
Subsampling plays a vital role in video2dataset, enabling operations such as downsampling frame rate and resolution, identifying scenes, and extracting metadata. Researchers can easily define new subsamplers or modify existing ones to add new transformations.
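As a rough illustration of the subsampler idea, the sketch below chains two toy transformations over frames held as nested Python lists: one lowers the frame rate and one lowers the resolution. The class names, constructor parameters, and call signature are hypothetical and do not reflect video2dataset's actual subsampler interface, which operates on encoded video rather than plain lists.

```python
from dataclasses import dataclass


@dataclass
class FrameRateSubsampler:
    """Hypothetical subsampler: keeps evenly spaced frames to reach target_fps."""
    source_fps: int
    target_fps: int

    def __call__(self, frames):
        if self.target_fps >= self.source_fps:
            return list(frames)
        step = self.source_fps / self.target_fps
        count = int(len(frames) / step)
        return [frames[int(i * step)] for i in range(count)]


@dataclass
class ResolutionSubsampler:
    """Hypothetical subsampler: keeps every factor-th row and column."""
    factor: int

    def __call__(self, frames):
        return [
            [row[::self.factor] for row in frame[::self.factor]]
            for frame in frames
        ]


def apply_subsamplers(frames, subsamplers):
    """Run the subsamplers in order, feeding each one the previous output."""
    for subsampler in subsamplers:
        frames = subsampler(frames)
    return frames


# One second of toy 4x4 video at 30 fps; every pixel holds the frame index.
frames = [[[t] * 4 for _ in range(4)] for t in range(30)]

out = apply_subsamplers(
    frames,
    [FrameRateSubsampler(source_fps=30, target_fps=10), ResolutionSubsampler(factor=2)],
)
```

Treating each transformation as a small callable makes it straightforward to add a new one or to reorder the chain without touching the rest of the pipeline.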
Efficient Logging and Integration
video2dataset maintains detailed logs at various stages of the process. Each completed shard generates an associated “_stats.json” file, capturing information such as the number of successfully handled samples, errors encountered, and overall progress. Integration with Weights & Biases (wandb) provides further performance reporting and metrics for benchmarking and cost estimation.
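Per-shard stats files lend themselves to a simple roll-up after a run. The sketch below writes two made-up stats files into a temporary folder and sums them. The JSON keys (count, successes, failures) and the file-name prefix are assumptions for illustration; the real files may use different field names.

```python
import json
import tempfile
from pathlib import Path


def aggregate_stats(folder):
    """Sum the counters of every *_stats.json file found in folder."""
    totals = {"count": 0, "successes": 0, "failures": 0}
    for path in sorted(Path(folder).glob("*_stats.json")):
        stats = json.loads(path.read_text())
        for key in totals:
            totals[key] += stats.get(key, 0)
    # Guard against an empty folder to avoid dividing by zero.
    totals["success_rate"] = (
        totals["successes"] / totals["count"] if totals["count"] else 0.0
    )
    return totals


with tempfile.TemporaryDirectory() as folder:
    # Two illustrative shards with hypothetical counters.
    for shard_id, (ok, failed) in enumerate([(95, 5), (80, 20)]):
        stats = {"count": ok + failed, "successes": ok, "failures": failed}
        (Path(folder) / f"{shard_id:05d}_stats.json").write_text(json.dumps(stats))

    summary = aggregate_stats(folder)
```

A roll-up like this gives a quick overall success rate for a run, which is the kind of figure that feeds into benchmarking and cost estimates.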
Storing and Reprocessing
video2dataset stores the transformed samples in output shards at specified locations for subsequent training or reprocessing. The dataset can be written in various formats, including folders, tar files, TFRecords, and parquet files. Reprocessing allows researchers to apply new transformations to previously generated datasets efficiently, avoiding redundant downloads of large datasets.
Future Plans and Exciting Possibilities
Researchers plan to conduct a comprehensive study using the dataset curated with video2dataset, followed by the public sharing of the study results. Synthetic captioning for videos and innovative methods using image captioning models and LLMs are areas ripe for further improvement. The software is also enabling the extraction of text tokens from podcasts, leading to the creation of a publicly available text dataset. The research community eagerly awaits developments in curating datasets for video and audio modalities, as it encourages experimentation and advancements in the field.
video2dataset is an open-source tool that addresses the data problem in multimodal research, particularly in video and audio domains. With its efficient dataset curation capabilities, it opens up opportunities for groundbreaking initiatives, improved pre-trained models, and various applications. Researchers are actively developing video2dataset and welcome contributions from the community. To learn more about video2dataset, visit the project’s GitHub page and explore the blog for further insights.