Article – DALL·E 2 and Deduplication: Solving Image Regurgitation
Introduction: Solving the Problem of Image Reproduction
During the development of DALL·E 2, we discovered that our earlier models would occasionally reproduce training images exactly. This was undesirable because we wanted DALL·E 2 to generate original images, rather than just stitching together existing ones. Exact replication of training images also raised concerns about copyright infringement, ownership, and privacy. We therefore set out to understand why it happened and how to prevent it.
Understanding Image Regurgitation and the Dataset Analysis
To understand image replication, we first built a dataset of prompts that frequently produced duplicated images. We used a trained model to generate samples for 50,000 prompts taken from our training dataset, then sorted the samples by perceptual similarity to the corresponding training image. Manual inspection of the top matches revealed only a few hundred true duplicate pairs among the 50,000 prompts, which puts the regurgitation rate below 1%. Given the concerns described above, however, we aimed to reduce the rate to 0%.
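As a rough illustration of the ranking step, the sketch below scores each generated sample against the training image for the same prompt and sorts the results so that the most suspicious pairs come first. The feature vectors, the use of cosine similarity as a stand-in for perceptual similarity, and the names `rank_by_similarity` and `regurgitation_rate` are all assumptions made for this example. The count of 300 is a hypothetical placeholder for "a few hundred", not a reported figure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(pairs):
    """pairs: list of (prompt_id, generated_vec, training_vec).

    Returns (similarity, prompt_id) tuples, most similar first, so a
    human reviewer can inspect the top of the list for true duplicates.
    """
    scored = [(cosine_similarity(g, t), pid) for pid, g, t in pairs]
    return sorted(scored, reverse=True)

def regurgitation_rate(num_confirmed_duplicates, num_prompts):
    """Fraction of prompts whose sample was confirmed as a duplicate."""
    return num_confirmed_duplicates / num_prompts

# Toy, made-up feature vectors (hypothetical, for illustration only).
pairs = [
    ("clock", [1.0, 0.0, 0.2], [1.0, 0.0, 0.2]),    # identical features
    ("forest", [0.1, 0.9, 0.3], [0.8, 0.1, 0.5]),   # clearly different
    ("logo", [0.5, 0.5, 0.0], [0.5, 0.49, 0.01]),   # very close
]
ranked = rank_by_similarity(pairs)
for score, prompt_id in ranked:
    print(f"{prompt_id}: {score:.4f}")

# Hypothetical count standing in for "a few hundred" confirmed duplicates.
print(regurgitation_rate(300, 50_000))
```

Any count below 500 confirmed duplicates out of 50,000 prompts corresponds to a rate under 1%.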
Analyzing the regurgitated images revealed two patterns. First, the replicated images were mostly simple vector graphics, which are easy to memorize because of their low information content. Second, these images had many near-duplicates in the training dataset. For example, a vector graphic of a clock showing 1 o’clock would have counterparts showing 2 o’clock, 3 o’clock, and so on. To check this, we ran a distributed nearest neighbor search, which confirmed that every regurgitated image had perceptually similar duplicates in the dataset. Other work has reported a similar phenomenon in large language models, finding a strong connection between data duplication and memorization.
The Deduplication Solution
To solve the issue of image replication, we planned to deduplicate our dataset by using a neural network to identify groups of similar images and retaining only one image per group. However, this requires checking each image against every other image, and the number of pairwise comparisons grows quadratically with dataset size. For a dataset of hundreds of millions of images, that amounts to tens of quadrillions of pairs or more. While this is technically feasible on a large compute cluster, we found a more efficient alternative that offered similar results at a significantly reduced cost.
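To make the scale concrete, the number of unordered pairs among n images is n(n - 1)/2. The short sketch below evaluates this for two dataset sizes; both sizes are illustrative assumptions, since the text only says "hundreds of millions".

```python
def num_pairs(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# Illustrative dataset sizes only; the exact size is not stated in the text.
for n in (100_000_000, 500_000_000):
    print(f"{n:,} images -> {num_pairs(n):.2e} pairs")
```

At 100 million images the count is already about 5 × 10^15 pairs, and a fivefold increase in images multiplies the number of pairs by roughly 25.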
Cluster-Based Deduplication Approach
We realized that if we clustered the dataset before deduplicating it, most duplicate pairs would fall within a single cluster rather than straddle a cluster boundary, because near-duplicate images lie close to one another. We could therefore deduplicate the samples within each cluster without checking for duplicates outside it, which requires far fewer comparisons. This approach may miss the small fraction of duplicate pairs that are split across clusters, but it is much cheaper than the naive method of checking every pair of images.
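The idea can be sketched in a few lines. The example below is a minimal illustration, not the pipeline described in this article: it assumes each image has already been turned into a feature vector, uses plain Euclidean distance with a hypothetical `threshold` in place of a learned similarity model, runs a tiny k-means, and greedily keeps the first image seen in each group of near-duplicates. All function names are made up for the example.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group ends up empty.
        centroids = [
            [sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def dedup_within_clusters(points, centroids, threshold):
    """Return sorted indices to keep, comparing points only inside their own cluster."""
    clusters = {}
    for idx, p in enumerate(points):
        clusters.setdefault(nearest(p, centroids), []).append(idx)
    keep = []
    for members in clusters.values():
        kept_here = []
        for idx in members:
            # Keep a point only if it is not within `threshold` of one already kept.
            if all(math.dist(points[idx], points[j]) > threshold for j in kept_here):
                kept_here.append(idx)
        keep.extend(kept_here)
    return sorted(keep)

# Toy 2-D "feature vectors": two tight groups of near-duplicates plus one outlier.
points = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.02], [5.0, 5.0], [5.01, 5.0], [9.0, 0.0]]
centroids = kmeans(points, k=2)
kept = dedup_within_clusters(points, centroids, threshold=0.05)
print(kept)
```

If the K clusters were roughly equal in size, within-cluster comparison would need about 1/K as many comparisons as the all-pairs approach. The price is that a near-duplicate pair separated by a cluster boundary is never compared, so both images survive.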
Empirical Testing and Success Rate
In empirical tests on a subset of the data, the cluster-based approach found 85% of all duplicate pairs when using K=1024 clusters. To raise the success rate, we took advantage of an observation: clustering different random subsets of a dataset often yields different cluster decision boundaries, so a duplicate pair that is split by one clustering may fall inside a single cluster in another. Using five different clusterings, we searched for duplicates of each image within the union of the five clusters containing it. This found 97% of all duplicate pairs on the tested data subset.
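The effect of combining clusterings can be shown on a toy example. In the sketch below, searching the union of an image's clusters is expressed equivalently as comparing every pair that shares a cluster in at least one clustering. The one-dimensional points, the two hand-picked sets of centroids, the distance threshold, and the function names are all hypothetical; in practice each set of centroids would come from clustering a different random subset of the data.

```python
import math
from itertools import combinations

def assign(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def duplicate_pairs_bruteforce(points, threshold):
    """Ground truth: compare every pair of points."""
    return {
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= threshold
    }

def duplicate_pairs_clustered(points, clusterings, threshold):
    """Duplicate pairs whose members share a cluster in at least one clustering."""
    found = set()
    for centroids in clusterings:
        buckets = {}
        for idx, p in enumerate(points):
            buckets.setdefault(assign(p, centroids), []).append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if math.dist(points[i], points[j]) <= threshold:
                    found.add((i, j))
    return found

# Three near-duplicate pairs on a line (toy data).
points = [[10.0], [11.0], [20.0], [21.0], [30.0], [31.0]]
clustering_a = [[10.0], [31.0]]  # boundary at 20.5 splits the pair (20, 21)
clustering_b = [[0.0], [21.0]]   # boundary at 10.5 splits the pair (10, 11)

truth = duplicate_pairs_bruteforce(points, threshold=1.5)
for name, clusterings in [("A only", [clustering_a]),
                         ("B only", [clustering_b]),
                         ("A and B", [clustering_a, clustering_b])]:
    found = duplicate_pairs_clustered(points, clusterings, threshold=1.5)
    print(name, f"recall = {len(found) / len(truth):.2f}")
```

Each clustering on its own misses one of the three pairs, but the pair missed by one is caught by the other, so together they recover all of them. The cost grows with the number of clusterings, which is the trade-off against higher recall.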
Benefits and Model Performance Evaluation
Surprisingly, deduplication removed almost a quarter of our dataset. When we examined the near-duplicate pairs that were found, many of them contained meaningful variations. For instance, images of the same clock at different times of day may encourage the model to memorize that clock, but they could also help it learn to tell different clock faces apart. Given how much data was removed, we worried that dropping such images might hurt performance. However, a model trained on the deduplicated dataset performed no worse. In fact, human evaluators slightly preferred the deduplicated model, suggesting that the large amount of redundancy in the dataset had been hurting performance.
Confirmation of Deduplication’s Effectiveness
To validate the effectiveness of our deduplication approach, we conducted additional tests. The new model, trained on deduplicated data, never reproduced a training image when given the exact prompt for that image from the training dataset. We then went a step further and ran a nearest neighbor search over the entire training dataset for each of the 50,000 generated images, in case the model had reproduced a training image other than the one associated with its prompt. Even this more thorough check found no cases of image replication.
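A brute-force version of this final check might look like the sketch below. It assumes hypothetical feature vectors for the generated and training images and a hypothetical distance `threshold` below which a generated image counts as a copy. At real scale the search would need to be distributed or approximate, which this example does not attempt.

```python
import math

def nearest_neighbor(query, dataset):
    """Return (index, distance) of the dataset vector closest to `query`."""
    best = min(range(len(dataset)), key=lambda i: math.dist(query, dataset[i]))
    return best, math.dist(query, dataset[best])

def find_regurgitations(generated, training, threshold):
    """List (generated_idx, training_idx, distance) for suspiciously close matches."""
    flagged = []
    for g_idx, g in enumerate(generated):
        t_idx, distance = nearest_neighbor(g, training)
        if distance <= threshold:
            flagged.append((g_idx, t_idx, distance))
    return flagged

# Toy data: the second generated vector nearly coincides with training vector 1.
training = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
generated = [[2.0, 2.5], [1.0, 1.01]]
flagged = find_regurgitations(generated, training, threshold=0.05)
print(flagged)
```

An empty result corresponds to the outcome reported above: no generated image sits close enough to any training image to count as a reproduction.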
In conclusion, deduplication addressed the problem of image regurgitation in DALL·E 2. Removing near-duplicate images from the training data pushed the model toward generating original content and mitigated the associated copyright, ownership, and privacy concerns. It did so without hurting output quality: human evaluators slightly preferred the deduplicated model, and our follow-up tests found no remaining cases of regurgitation.