Google Research has recently explored how multimodal capabilities can be built into AI systems for medical applications. In a blog post, Greg Corrado, Head of Health AI, and Yossi Matias, VP of Engineering and Research, outline three approaches to adding multimodal capabilities to large language models (LLMs) and highlight the benefits and challenges of each.
The first approach is “tool use,” where an LLM outsources the analysis of different modalities to specialized software subsystems. For example, a medical LLM could call an API to obtain the analysis of a chest X-ray from a radiology AI system. This approach keeps the subsystems flexible and independent of one another, but it requires careful design of how they communicate.
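The tool-use pattern can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the blog post: the `chest_xray_analyzer` tool, the `ToolResult` type, and the canned summary merely stand in for a real radiology AI system reached over an API. The point is only that the LLM sees the tool's text output, never the image itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolResult:
    tool: str      # name of the subsystem that produced the result
    summary: str   # text the LLM will read in place of the raw image


def radiology_tool(image_id: str) -> ToolResult:
    # Hypothetical stand-in for an API call to an external radiology AI system.
    # A real system would analyze the image; this returns a canned summary.
    return ToolResult(
        tool="chest_xray_analyzer",
        summary=f"No acute findings reported for {image_id}.",
    )


# Registry of available tools (assumption: tools are looked up by name).
TOOLS: Dict[str, Callable[[str], ToolResult]] = {
    "chest_xray_analyzer": radiology_tool,
}


def build_prompt(question: str, tool_name: str, image_id: str) -> str:
    """Run the named tool and place its text output ahead of the question."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    result = TOOLS[tool_name](image_id)
    return f"[{result.tool}] {result.summary}\nQuestion: {question}"


prompt = build_prompt(
    "Is there evidence of pneumonia?", "chest_xray_analyzer", "xray-001"
)
print(prompt)
```

Because the two systems only exchange text, either one can be replaced independently, but anything the tool's summary omits is invisible to the LLM.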
The second approach is “model grafting,” which involves adapting a neural network specialized for a specific domain so that it plugs directly into the LLM. In recent papers, Google Research has shown this to be feasible, describing how data from new modalities, such as spirograms (used to assess breathing ability) or radiology images, can be mapped into the LLM’s input word embedding space. This approach leverages existing validated models in each data domain and makes it easier to test and debug individual components.
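The idea of mapping a new modality into the LLM's input embedding space can be sketched with NumPy. This is an illustration under stated assumptions, not the method from the papers: the dimensions, the random stand-in features, and the single linear adapter (`W`, `b`) are all hypothetical. It shows only the data flow, in which encoder features are projected to the embedding width and placed alongside ordinary token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

D_ENC = 8     # hypothetical feature width of the domain-specific encoder
D_MODEL = 16  # hypothetical width of the LLM's input embeddings
N_SOFT = 4    # hypothetical number of feature vectors for one input
N_TOKENS = 5  # hypothetical number of text tokens in the prompt

# Stand-ins for real model outputs (random values, for shape illustration only).
encoder_features = rng.normal(size=(N_SOFT, D_ENC))     # from the domain encoder
token_embeddings = rng.normal(size=(N_TOKENS, D_MODEL))  # from the LLM's embedding table

# Adapter parameters; in practice these would be learned (assumption: linear map).
W = rng.normal(size=(D_ENC, D_MODEL)) * 0.1
b = np.zeros(D_MODEL)


def graft(features: np.ndarray, tokens: np.ndarray) -> np.ndarray:
    """Project encoder features to embedding width and prepend them to the tokens."""
    soft_tokens = features @ W + b  # shape: (N_SOFT, D_MODEL)
    return np.concatenate([soft_tokens, tokens], axis=0)


llm_input = graft(encoder_features, token_embeddings)
print(llm_input.shape)
```

The encoder and the LLM stay separate components in this arrangement, which is what allows each to be tested and debugged on its own.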
The third approach is building a fully integrated, “generalist” system capable of handling information from all modalities. Google Research has developed a multimodal model called Med-PaLM M, which combines a large language model with a vision encoder. This model can interpret biomedical data, including clinical language, imaging, and genomics, using a single set of model weights. It offers flexibility and information transfer between modalities but comes at a higher computational cost.
Google Research emphasizes the importance of evaluating these technologies in collaboration with the medical community and healthcare ecosystem. Adding multimodal capabilities to AI systems makes it possible to develop assistive technologies that benefit professional medicine, medical research, and consumer applications.
In summary, the three approaches to incorporating multimodal capabilities into LLMs are tool use, model grafting, and generalist systems. Each approach has its advantages and challenges, and Google Research’s work highlights the potential of these approaches for advancing medical AI systems.