Title: Introducing REVEAL: A Visual-Language Model for Answering Knowledge-Intensive Queries
Large-scale AI models like T5, GPT-3, PaLM, Flamingo, and PaLI have shown impressive knowledge storage capabilities when trained on massive datasets. However, these models require extensive training data and have high computational demands. Additionally, they can become outdated as new information emerges. To address these challenges, our research team at Google has developed REVEAL, a visual-language model that harnesses a multi-source, multimodal “memory” to answer knowledge-intensive queries.
Memory Construction from Multimodal Knowledge Corpora:
In REVEAL, we use neural representation learning to convert diverse knowledge sources into a unified memory structure of key-value pairs: each key is an embedding that indexes a memory item for retrieval, and the corresponding value holds the item's encoded content. We construct this memory from multimodal knowledge corpora including WikiData, Wikipedia passages and images, web image-text pairs, and visual question answering data.
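The key-value layout can be sketched as follows. This is a toy illustration only: the `encode` function, the corpus snippets, and the dimension `D` are stand-ins we invented for the example, not REVEAL's actual encoder or data.

```python
import hashlib
import numpy as np

D = 64  # embedding dimension (illustrative choice, not the paper's)

def encode(item: str) -> np.ndarray:
    """Stand-in for a learned multimodal encoder: deterministically
    maps a knowledge item (here, text) to a D-dimensional embedding."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(D)

# Toy stand-ins for the multi-source corpora named above.
corpora = {
    "wikidata":  ["(Paris, capital_of, France)"],
    "wikipedia": ["Paris is the capital and largest city of France."],
    "web_pairs": ["photo of the Eiffel Tower at night"],
}

# Keys index memory items for retrieval; values hold the items themselves.
memory_keys, memory_values = [], []
for source, items in corpora.items():
    for item in items:
        memory_keys.append(encode(item))
        memory_values.append((source, item))

memory_keys = np.stack(memory_keys)  # shape: (num_items, D)
```

Because every source is encoded into the same embedding space, a single retrieval step can rank items from all corpora at once.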
Scaling Memory Using Compression:
To tackle the challenge of storing and processing massive amounts of token sequences, we employ the Perceiver architecture. Its learned latent queries cross-attend to each knowledge item and compress a token sequence of arbitrary length into a fixed, much shorter representation. This keeps the memory compact, so retrieving the top-k most relevant memory entries for a given query remains efficient.
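A minimal sketch of this idea, assuming a single cross-attention step with randomly initialized latents standing in for the trained Perceiver module (all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 32, 4  # embedding dim and number of learned latents (assumed sizes)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

latents = rng.standard_normal((L, D))  # learned latent queries

def perceiver_compress(tokens: np.ndarray) -> np.ndarray:
    """One cross-attention step: L latents attend to an arbitrary-length
    token sequence of shape (T, D) and return a fixed (L, D) summary."""
    attn = softmax(latents @ tokens.T / np.sqrt(D))  # (L, T)
    return attn @ tokens                              # (L, D)

# Knowledge items of very different lengths compress to the same shape.
short_item = rng.standard_normal((5, D))
long_item  = rng.standard_normal((200, D))
mem = np.stack([perceiver_compress(short_item).mean(0),
                perceiver_compress(long_item).mean(0)])  # (2, D) memory keys

# Top-k retrieval: score every compressed key against a query embedding.
query = rng.standard_normal(D)
k = 1
topk = np.argsort(mem @ query)[::-1][:k]  # indices of the k best entries
```

Because every entry is reduced to the same fixed size, retrieval is a single matrix-vector product over the whole memory.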
Large-scale Pre-training on Image-Text Pairs:
We train REVEAL on a large-scale corpus of three billion image alt-text caption pairs collected from the web. After filtering out shorter captions, we train the model on this dataset with a text-generation objective. To overcome the cold-start problem, we warm-start the model with an initial retrieval dataset containing pseudo-ground-truth knowledge drawn from the WIT dataset.
The workflow of REVEAL involves four primary steps. First, the model encodes the multimodal input into token and query embeddings. Second, it encodes the multi-source knowledge entries into key-value pairs. Third, it retrieves the most relevant knowledge entries. Finally, it fuses the retrieved entries into the generator via attentive knowledge fusion, which allows the memory, encoder, retriever, and generator to be trained concurrently, end to end.
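The steps above can be sketched as follows. This is a simplified stand-in for the paper's fusion mechanism: the shapes, random embeddings, and the way retrieval scores modulate attention are our assumptions for illustration, but they capture the key point that the generator's attention depends on the retrieval scores, so gradients flow back to the retriever.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, T = 32, 3, 6  # embedding dim, retrieved entries, tokens per entry (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

query = rng.standard_normal(D)              # step 1: encoded input query
entries = rng.standard_normal((K, T, D))    # step 2: retrieved key-value entries
scores = softmax(entries.mean(axis=1) @ query)  # step 3: retrieval scores, (K,)

# Step 4, attentive knowledge fusion: attention over all retrieved tokens,
# with each token's weight modulated by its entry's retrieval score.
tokens = entries.reshape(K * T, D)              # (K*T, D)
attn = softmax(tokens @ query / np.sqrt(D))     # (K*T,) token-level attention
attn = attn * np.repeat(scores, T)              # inject retrieval scores
attn = attn / attn.sum()                        # renormalize
fused = attn @ tokens                           # (D,) fused knowledge vector
```

In a trained model this fused representation conditions the generator, and because `scores` sits inside the attention weights, the generation loss also trains the retriever and the memory encoders.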
We evaluated REVEAL on knowledge-based visual question answering benchmarks, where it outperforms previous approaches.
REVEAL is a groundbreaking visual-language model that effectively addresses the challenges of knowledge-intensive queries. By leveraging a multi-source, multimodal memory, REVEAL optimizes memory storage, retrieval, and reasoning. Our results demonstrate its capability to outperform existing models in knowledge-based tasks.