Introducing DataComp: A New Testbed for Multimodal Datasets
Multimodal datasets have been central to recent advances in AI, such as Stable Diffusion and GPT-4, yet far less research attention goes into how they're designed. That's why a group of researchers created DataComp: a benchmark where participants can test new filtering techniques or curate new data sources, then train a model with standardized CLIP training code to see how well their dataset performs across a broad suite of downstream tests.
A key feature of DataComp is that it runs at several levels of computing power, so researchers with very different budgets can participate and compare results. So far the results have been promising: the best dataset to date, DataComp-1B, trains a model that outperforms a comparable model from OpenAI.
With DataComp, researchers gain a systematic way to build better training sets for AI models, and better data could drive the next round of breakthroughs.
The Importance of Multimodal Datasets in AI
Multimodal datasets are a critical component in recent AI breakthroughs, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, researchers have introduced DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Testing and Scaling with DataComp
Participants in the DataComp benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running standardized CLIP training code and testing the resulting model on 38 downstream test sets. The benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
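To make the filtering track concrete, one simple family of baselines keeps only the image-text pairs whose CLIP image and text embeddings are similar (a high "CLIP score" suggests the caption actually describes the image). The sketch below is illustrative only: the embeddings, threshold, and `clip_score_filter` helper are hypothetical stand-ins, not DataComp's actual tooling.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score_filter(pairs, threshold=0.3):
    """Keep pairs whose image/text embeddings have cosine similarity
    above `threshold`. Each pair is (image_emb, text_emb, url).
    The threshold value here is arbitrary, chosen for illustration."""
    return [url for img, txt, url in pairs if cosine(img, txt) > threshold]

# Toy 3-d embeddings standing in for real CLIP features.
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], "aligned.jpg"),   # caption matches image
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], "mismatch.jpg"),  # caption unrelated
]
print(clip_score_filter(pairs))  # → ['aligned.jpg']
```

In practice a participant would sweep the threshold (or combine it with other signals, such as language or resolution filters) and let the downstream evaluations decide which choice wins.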
Better Training Sets with DataComp
Baseline experiments show that the DataComp workflow leads to better training sets. In particular, the best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute.
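Zero-shot accuracy here means the model is never fine-tuned on ImageNet: each class name is turned into a text prompt, and an image is assigned to the class whose prompt embedding is most similar to the image embedding. The toy sketch below uses made-up 2-d embeddings to show the mechanics; the real evaluation uses CLIP's image and text encoders and prompt ensembles.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def zero_shot_accuracy(image_embs, labels, class_embs):
    """Predict each image's class as the prompt embedding with the
    highest cosine similarity, then report accuracy. No training step
    touches the evaluation labels."""
    correct = 0
    for emb, label in zip(image_embs, labels):
        pred = max(range(len(class_embs)),
                   key=lambda c: cosine(emb, class_embs[c]))
        correct += (pred == label)
    return correct / len(labels)

# Hypothetical embeddings: class 0 prompts point along x, class 1 along y.
class_embs = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
labels = [0, 1, 0]
print(zero_shot_accuracy(images, labels, class_embs))  # → 1.0
```

Because no task-specific head is trained, this number directly reflects how well the pretraining dataset taught the model to align images with language, which is exactly what DataComp's curation experiments are probing.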