Large Language Models (LLMs) have become widely popular for their capabilities in Natural Language Processing and Natural Language Understanding. Models like OpenAI’s ChatGPT are reshaping human-computer interaction by holding realistic conversations and performing tasks such as question answering, content generation, code completion, machine translation, and text summarization.
While LLMs excel in capturing deep conceptual knowledge through lexical embeddings, researchers are exploring ways to enable them to handle visual tasks. One approach is to use a vector quantizer that translates images into a language that LLMs can understand, allowing them to generate and understand images without specific training on image-text pairs.
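The quantization idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook below is a random stand-in for an LLM's word-embedding table, and the hypothetical `quantize_to_vocab` helper simply assigns each image-patch feature to its nearest vocabulary embedding, yielding a token id per patch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for illustration: a small "LLM vocabulary" embedding table
# and a 4x4 grid of image-patch feature vectors from some image encoder.
vocab_size, dim = 1000, 64
vocab_embeddings = rng.normal(size=(vocab_size, dim))
patch_features = rng.normal(size=(16, dim))

def quantize_to_vocab(features, codebook):
    """Map each feature vector to the index of its nearest codebook
    entry (here, the nearest word embedding by Euclidean distance)."""
    # Squared distances between every feature and every codebook vector
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

token_ids = quantize_to_vocab(patch_features, vocab_embeddings)
print(token_ids.shape)  # one vocabulary token id per image patch: (16,)
```

Because the resulting ids index real words in the LLM's vocabulary, the image becomes a sequence the frozen model can consume directly.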
To facilitate this cross-modal task, researchers from Google Research and Carnegie Mellon University have introduced Semantic Pyramid AutoEncoder (SPAE), an autoencoder for multimodal generation with frozen LLMs. SPAE produces a lexical word sequence that carries rich semantics while retaining fine details for signal reconstruction. It combines an autoencoder architecture with a hierarchical pyramid structure and encodes images into an interpretable discrete latent space.
The pyramid-shaped representation of SPAE tokens spans multiple scales: lower layers capture fine appearance details, while upper layers carry the central semantic concepts. This structure allows the token length to be adjusted dynamically to the demands of each task. Notably, SPAE is trained independently, without backpropagating through the frozen language model.
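The coarse-to-fine layering can be made concrete with a toy sketch. The words and layer sizes (1, 4, 16) below are invented for illustration; the point is that truncating the sequence at a token budget drops fine-detail layers first while keeping the semantic ones.

```python
# Illustrative token pyramid, coarsest layer first (sizes are made up).
pyramid = [
    ["dog"],                                # 1 token: central concept
    ["dog", "grass", "sunny", "running"],   # 4 tokens: scene semantics
    [f"detail_{i}" for i in range(16)],     # 16 tokens: fine details
]

def tokens_for_budget(pyramid, budget):
    """Concatenate whole layers coarse-to-fine until the token budget
    is spent: short sequences keep semantics, long ones add detail."""
    out = []
    for layer in pyramid:
        if len(out) + len(layer) > budget:
            break
        out.extend(layer)
    return out

print(tokens_for_budget(pyramid, 5))   # first two (semantic) layers only
print(len(tokens_for_budget(pyramid, 21)))  # 21: all three layers fit
```

A semantic task such as captioning can thus use a short budget, while reconstruction uses the full pyramid.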
To evaluate SPAE’s effectiveness, the researchers conducted experiments on image understanding tasks such as classification, captioning, and visual question answering. The results showcased LLMs’ capabilities in handling visual modalities, with applications in content generation, design support, and interactive storytelling. In-context denoising experiments were also used to demonstrate the image-generating abilities of frozen LLMs.
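The in-context setup can be sketched as prompt assembly. This is a hedged illustration, not the paper's exact template: the hypothetical `build_denoising_prompt` helper lays out a few (corrupted tokens → clean tokens) demonstrations and then presents the query, so the frozen LLM can complete the final output by analogy.

```python
def build_denoising_prompt(examples, query_tokens):
    """Assemble a few-shot prompt of (corrupted -> clean) token
    sequences; the Input/Output format is illustrative only."""
    lines = []
    for corrupted, clean in examples:
        lines.append(f"Input: {' '.join(corrupted)}")
        lines.append(f"Output: {' '.join(clean)}")
    # The query's clean sequence is left for the LLM to complete.
    lines.append(f"Input: {' '.join(query_tokens)}")
    lines.append("Output:")
    return "\n".join(lines)

demo = build_denoising_prompt(
    [(["dog", "noise", "grass"], ["dog", "run", "grass"])],
    ["cat", "noise", "sofa"],
)
print(demo)
```

The completed token sequence would then be decoded back into pixels by SPAE's decoder.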
In summary, SPAE is a significant step toward bridging the gap between language models and visual understanding, demonstrating the potential of frozen LLMs to handle cross-modal tasks. For full details, refer to the paper.