GILL: Generating Images with Large Language Models
OpenAI’s latest release, GPT-4, brings multimodality to large language models: unlike its predecessor GPT-3.5, which only accepts text inputs in ChatGPT, GPT-4 accepts both text and images. Along the same lines, a team of researchers from Carnegie Mellon University has introduced Generating Images with Large Language Models (GILL), a method that extends multimodal language models to generate unique and impressive images.
GILL processes mixed sequences of images and text to produce text, retrieve images, and create new images. It does this by mapping the output embedding space of a frozen text-only LLM onto that of a frozen image-generation model, without requiring interleaved image-text data: only a small number of parameters are fine-tuned on image-caption pairs to learn the mapping.
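The paper’s exact training recipe isn’t reproduced here, but the core idea of tuning only a small mapping module on image-caption pairs, while both backbone models stay frozen, can be sketched roughly as follows. The model checkpoints, the use of a single summary hidden state (rather than the special learned image tokens the paper uses), and the plain MSE objective are illustrative assumptions, not the authors’ exact configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, CLIPTextModel, CLIPTokenizer

# Frozen text-only LLM (facebook/opt-1.3b is an assumption, chosen for illustration).
llm_tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
llm = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").eval()
for p in llm.parameters():
    p.requires_grad = False

# Frozen text encoder of the image generator (Stable Diffusion's CLIP text encoder).
sd_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
sd_text_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()
for p in sd_text_enc.parameters():
    p.requires_grad = False

# The only trainable part: a small mapper from the LLM's hidden size into the
# diffusion model's conditioning space (77 tokens x 768 dims for SD 1.x).
NUM_COND_TOKENS, COND_DIM = 77, 768
mapper = nn.Sequential(
    nn.Linear(llm.config.hidden_size, COND_DIM),
    nn.GELU(),
    nn.Linear(COND_DIM, NUM_COND_TOKENS * COND_DIM),
)
opt = torch.optim.AdamW(mapper.parameters(), lr=1e-4)

def training_step(caption: str) -> float:
    # 1) Run the frozen LLM on the caption; use the final token's last hidden
    #    state as a summary of the text.
    ids = llm_tok(caption, return_tensors="pt")
    hidden = llm(**ids, output_hidden_states=True).hidden_states[-1][:, -1]   # [1, H]

    # 2) Map it into the conditioning space expected by the image generator.
    pred = mapper(hidden).view(1, NUM_COND_TOKENS, COND_DIM)

    # 3) Target: the frozen Stable Diffusion text-encoder embeddings of the same caption.
    sd_ids = sd_tok(caption, padding="max_length", max_length=NUM_COND_TOKENS,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        target = sd_text_enc(**sd_ids).last_hidden_state                      # [1, 77, 768]

    # 4) Align the two embedding spaces; only the mapper's parameters get gradients.
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because the target is just the text encoder’s embedding of the caption, the diffusion model itself never has to run during this alignment step.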
The team explains that GILL combines a frozen text-only large language model with pre-trained image encoder and decoder models. This enables a wide range of multimodal capabilities, including image retrieval, novel image generation, and multimodal dialogue. By mapping the embedding spaces of the different modalities into one another, GILL fuses them: the method accepts arbitrarily mixed image and text inputs and produces coherent, readable outputs.
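On the input side, images reach the frozen LLM by projecting visual features into its token-embedding space. The sketch below assumes a frozen CLIP vision encoder and a single learned linear projection producing a handful of “soft tokens”; the specific encoder checkpoint and the number of visual tokens are illustrative choices:

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor

class VisualPrompt(nn.Module):
    """Maps an image to a few soft tokens in the LLM's input embedding space."""

    def __init__(self, llm_hidden_size: int, num_visual_tokens: int = 4):
        super().__init__()
        # Frozen vision encoder (CLIP ViT-L/14 is an illustrative choice).
        self.encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
        # The only trained piece: a linear layer producing num_visual_tokens embeddings.
        self.num_visual_tokens = num_visual_tokens
        self.llm_hidden_size = llm_hidden_size
        self.proj = nn.Linear(self.encoder.config.hidden_size,
                              num_visual_tokens * llm_hidden_size)

    def forward(self, pil_image):
        pixels = self.processor(images=pil_image, return_tensors="pt").pixel_values
        with torch.no_grad():
            feats = self.encoder(pixels).pooler_output                 # [1, D_vis]
        tokens = self.proj(feats)                                      # [1, K * H]
        return tokens.view(1, self.num_visual_tokens, self.llm_hidden_size)

# Usage idea: concatenate these visual tokens with the text token embeddings and feed
# the interleaved sequence to the frozen LLM via its `inputs_embeds` argument.
```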
The key to GILL’s strong image generation performance is its mapping network. This network links the LLM to a text-to-image generation model, converting the LLM’s hidden text representations into the embedding space of the visual model. By leveraging the LLM’s powerful text representations, GILL produces aesthetically consistent outputs.
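At inference time, the mapped embeddings can be handed to the frozen text-to-image model as its conditioning. A rough sketch using the Hugging Face diffusers Stable Diffusion pipeline, reusing the frozen LLM, the mapper, and the constants from the training sketch above, is shown below; routing the embeddings through `prompt_embeds` is an assumption about how one might wire this up, and GILL’s own decoder integration may differ:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

@torch.no_grad()
def generate_from_llm(prompt_text: str):
    # 1) Get the frozen LLM's summary hidden state for the prompt (as in the training sketch).
    ids = llm_tok(prompt_text, return_tensors="pt")
    hidden = llm(**ids, output_hidden_states=True).hidden_states[-1][:, -1]

    # 2) Map it into Stable Diffusion's conditioning space with the trained mapper.
    cond = mapper(hidden).view(1, NUM_COND_TOKENS, COND_DIM)

    # 3) Feed the mapped embeddings to the diffusion model in place of a text prompt.
    image = pipe(prompt_embeds=cond, num_inference_steps=30).images[0]
    return image
```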
Furthermore, GILL not only creates new images but can also retrieve images from a candidate dataset. At inference time, a learned decision module operating on the LLM’s hidden representations decides whether to generate an image or retrieve one. The approach is also computationally efficient, since the image generation model never has to run during training.
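A minimal sketch of such a decision head follows. The classifier shape, the 512-dimensional retrieval space, the threshold-free argmax decision, and the way the generation branch is handed off are assumptions for illustration, not the authors’ exact design:

```python
import torch
import torch.nn as nn

class DecisionModule(nn.Module):
    """Tiny head over the frozen LLM's hidden state: retrieve an image or generate one?"""

    def __init__(self, llm_hidden_size: int, retrieval_dim: int = 512):
        super().__init__()
        self.classifier = nn.Linear(llm_hidden_size, 2)                  # logits: [retrieve, generate]
        self.retrieval_proj = nn.Linear(llm_hidden_size, retrieval_dim)  # into a CLIP-like image space

    def forward(self, hidden):                                           # hidden: [1, H]
        return self.classifier(hidden), self.retrieval_proj(hidden)


@torch.no_grad()
def produce_image(hidden, decision, candidate_embs, candidate_paths, generate_fn):
    """Decide at inference time whether to retrieve or generate an image.

    candidate_embs:  [N, D] precomputed embeddings of a retrieval image set.
    candidate_paths: the N file paths (or URLs) those embeddings came from.
    generate_fn:     callable turning `hidden` into a new image (e.g. mapper + diffusion).
    """
    logits, query = decision(hidden)
    if logits.argmax(dim=-1).item() == 0:
        # Retrieval branch: nearest neighbour by cosine similarity, no generator needed.
        sims = torch.nn.functional.cosine_similarity(query, candidate_embs)  # [N]
        return "retrieved", candidate_paths[sims.argmax().item()]
    # Generation branch: hand the hidden state to the mapping network + diffusion model.
    return "generated", generate_fn(hidden)
```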
GILL outperforms baseline generation models, particularly on tasks that require longer and more sophisticated language: it surpasses Stable Diffusion when processing longer-form text such as dialogue and discourse. GILL also excels at dialogue-conditioned image generation compared to non-LLM-based generation models, benefiting from multimodal context to generate images that better align with the given text. And unlike traditional text-to-image models that only process textual input, GILL can handle arbitrarily interleaved image-text inputs.
In conclusion, GILL demonstrates promising capabilities compared to previous multimodal language models. Its ability to outperform non-LLM-based generation models on context-dependent text-to-image tasks makes it a powerful option for multimodal applications.
Check out the Paper and Project Page for more details.