The Limitations of Pre-Trained Language Models in Auto-Regressive Text-to-Image Generation
A recent paper accepted at the I Can’t Believe It’s Not Better! (ICBINB) workshop at NeurIPS 2023 explores the limitations of pre-trained language models in auto-regressive text-to-image generation. The paper addresses a gap in the literature: image tokenizers such as VQ-VAE convert images into sequences of discrete tokens, which has made text-to-image generation possible with auto-regressive methods, yet it has remained unclear how much pre-trained language models actually contribute to this setup.
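To make the setup concrete, the sketch below shows how such a pipeline is typically wired together: a caption is tokenized into text ids, a VQ-VAE-style tokenizer maps the image to a grid of discrete codebook indices, and a single decoder-only transformer models the concatenated sequence left to right. This is a minimal illustration, not the paper's implementation; the vocabulary sizes, model dimensions, and the placeholder image codes are all assumptions.

```python
# Minimal sketch of auto-regressive text-to-image modeling over discrete tokens.
# Assumptions: PyTorch; random integers stand in for VQ-VAE codebook indices.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000   # text vocabulary size (assumed)
IMAGE_VOCAB = 8_192   # VQ-VAE codebook size (assumed)
D_MODEL = 512

class TextToImageLM(nn.Module):
    """Decoder-only transformer over a joint [text tokens; image tokens] sequence."""
    def __init__(self):
        super().__init__()
        # Shared embedding table: text ids occupy [0, TEXT_VOCAB),
        # image codebook ids are offset into [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB).
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)                    # next-token logits

# Usage: concatenate caption ids with offset image codes, train with cross-entropy.
text_ids = torch.randint(0, TEXT_VOCAB, (2, 16))          # caption tokens
image_codes = torch.randint(0, IMAGE_VOCAB, (2, 256))     # 16x16 VQ grid (placeholder)
sequence = torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=1)
logits = TextToImageLM()(sequence)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), sequence[:, 1:].reshape(-1)
)
```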
Challenges in Utilizing Pre-Trained Language Models
The study finds that pre-trained language models offer only limited help in auto-regressive text-to-image generation. The analysis shows that image tokens carry semantics very different from text tokens, so pre-trained language models model them no more effectively than randomly initialized ones. In addition, the text tokens in image-text datasets are far simpler than typical language-model pre-training data, which causes catastrophic degradation of the language model's original capabilities.
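The comparison behind the first finding can be pictured as follows: take the same architecture once with pre-trained weights and once randomly initialized, extend its vocabulary with image codebook entries, and train both on image-token sequences. The sketch below, which assumes Hugging Face transformers with GPT-2 as a stand-in pre-trained model and random image codes as placeholder data, only illustrates that setup and is not the paper's experiment.

```python
# Rough sketch: pre-trained vs. randomly initialized language model on image tokens.
# Assumptions: Hugging Face `transformers`; GPT-2 as the stand-in pre-trained model.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

IMAGE_VOCAB = 8_192   # VQ-VAE codebook size (assumed)
TEXT_VOCAB = GPT2Config().vocab_size  # GPT-2's original text vocabulary

pretrained = GPT2LMHeadModel.from_pretrained("gpt2")   # pre-trained weights
random_init = GPT2LMHeadModel(GPT2Config())            # same architecture, random weights

for model in (pretrained, random_init):
    # Extend the vocabulary so image codebook indices get their own embeddings.
    model.resize_token_embeddings(TEXT_VOCAB + IMAGE_VOCAB)

# A placeholder batch of image-token sequences (offset past the text vocabulary).
image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 256)) + TEXT_VOCAB

# Identical training signal for both models; the paper's observation is that the
# pre-trained start gives no measurable advantage on objectives like this one.
for name, model in [("pretrained", pretrained), ("random", random_init)]:
    loss = model(input_ids=image_tokens, labels=image_tokens).loss
    print(name, float(loss))
```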
Implications for AI Development
This research sheds light on the challenges in leveraging pre-trained language models for auto-regressive text-to-image generation. As the field of AI continues to advance, understanding these limitations is crucial for developing more effective methods for generating images from text.