Interest in generative AI models that can both process natural language and generate images has been growing. A new study introduces CM3leon, a multimodal model that can generate both text and images.
CM3leon is trained with a recipe adapted from text-only language models: a large-scale retrieval-augmented pre-training stage followed by a multitask supervised fine-tuning stage. Its architecture is similar to that of popular text-based models, but it stands out because it can handle both text and images. Despite being trained with less compute than previous transformer-based approaches, CM3leon achieves state-of-the-art performance for text-to-image generation.
Features of CM3leon
CM3leon combines the flexibility and power of autoregressive models with efficient training and inference. As a causal masked mixed-modal model, it can generate sequences of text and images conditioned on arbitrary text and image input, which sets it apart from earlier models that could perform only one of these tasks.
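The term "causal masked" can be illustrated with a toy sketch. Assuming the objective follows the CM3 family that the model's name refers to, a span of tokens is cut out, a sentinel is left in its place, and the span is appended to the end of the sequence, so that a left-to-right model also learns to fill in the missing part. The sentinel name, the image token strings, and the function `causal_mask` are hypothetical and chosen for illustration; this is not CM3leon's actual tokenizer or training code.

```python
from typing import List, Sequence

MASK = "<mask:0>"  # hypothetical sentinel token, not the model's real vocabulary


def causal_mask(tokens: Sequence[str], start: int, end: int) -> List[str]:
    """Move tokens[start:end] to the end of the sequence, leaving a sentinel behind."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and lie inside the sequence")
    span = list(tokens[start:end])
    # prefix + sentinel + suffix, then the sentinel again followed by the moved span
    return list(tokens[:start]) + [MASK] + list(tokens[end:]) + [MASK] + span


# A mixed-modal document: text tokens with three made-up image tokens in the middle.
doc = ["a", "photo", "of", "<img_17>", "<img_4>", "<img_99>", "on", "a", "beach"]
masked = causal_mask(doc, 3, 6)
print(masked)
```

Because the masked span ends up at the end, an ordinary next-token predictor sees the context on both sides of the gap before it has to generate the span. In this toy case the span happens to be the image tokens, so the example reads as image infilling conditioned on the surrounding text.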
To enhance CM3leon’s performance, the researchers applied large-scale multitask instruction tuning. This significantly improves performance on tasks such as image caption generation, visual question answering, text-based editing, and conditional image generation. Additionally, an independently trained super-resolution stage is added to produce higher-resolution images.
State-of-the-art Performance
CM3leon outperforms Google’s Parti text-to-image model, achieving a new state-of-the-art FID score of 4.88 (lower is better) on zero-shot MS-COCO, a widely used image generation benchmark. Despite being trained on a dataset consisting of only three billion text tokens, CM3leon’s zero-shot performance is competitive with that of larger models trained on larger datasets. This highlights the effectiveness of retrieval augmentation and scaling techniques in improving the output of autoregressive models.
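For readers unfamiliar with the metric, the standard definition of FID (Fréchet Inception Distance) is reproduced here. It compares the mean and covariance of Inception-network features extracted from real images with those from generated images. The symbols are notation assumed for this note rather than taken from the study: $\mu_r, \Sigma_r$ are the feature statistics of the real images and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two sets of statistics coincide and grows as they diverge, which is why a lower FID indicates generated images that are statistically closer to real ones.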
CM3leon’s impressive performance across various tasks gives the researchers hope that they can continue to improve image generation and comprehension accuracy.