Large Language Models (LLMs) have become popular for their ability to generate and understand text. On their own, however, they work only with text and cannot handle other media such as images and video, which restricts their use in many applications. To overcome this limitation, researchers from Alibaba Group have developed the Qwen-VL series, a family of large-scale vision-language models.
The series comes in two versions: Qwen-VL and Qwen-VL-Chat. Qwen-VL is a pre-trained model that pairs a visual encoder with a language model, giving the language model visual capabilities. It can understand and analyze visual information at different levels of detail. Qwen-VL-Chat is an interactive model built on Qwen-VL that allows more flexible interaction, such as multiple image inputs, multi-round dialogue, and localization.
The Qwen-VL models offer several features:
1. Strong performance: These models outperform current open-source Large Vision-Language Models (LVLMs) on a range of benchmarks, including zero-shot captioning, VQA, DocVQA, and grounding.
2. Multilingual support: Qwen-VL supports both English and Chinese, enabling bilingual and multilingual conversation.
3. Multi-image conversations: Users can compare multiple images, ask questions about them, and engage in multi-image storytelling.
4. Accurate recognition and comprehension: The Qwen-VL models use a higher input resolution (448×448) than competing LVLMs (224×224), which allows for finer text recognition, document question answering, and bounding box detection.
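To give a rough sense of what the resolution difference in point 4 means, the short sketch below counts how many image patches a vision encoder would see at each resolution. The 14-pixel patch size is an assumption made for illustration and is not stated in this article; the helper name `patch_count` is likewise hypothetical.

```python
def patch_count(side_px: int, patch_px: int = 14) -> int:
    """Number of square patches in a square image of side_px pixels.

    patch_px=14 is an illustrative assumption, not a figure from the article.
    """
    if side_px % patch_px != 0:
        raise ValueError("image side must be a multiple of the patch size")
    per_side = side_px // patch_px
    return per_side * per_side


# Resolutions quoted in the article.
low_res_patches = patch_count(224)   # 16 patches per side
high_res_patches = patch_count(448)  # 32 patches per side

# Doubling the side length quadruples the pixel count (and the patch count).
pixel_ratio = (448 * 448) / (224 * 224)

print(low_res_patches, high_res_patches, pixel_ratio)
```

Under that assumption, the higher resolution gives the encoder four times as many patches covering the same image, which is one plausible reason small text and tight bounding boxes become easier to resolve.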
To learn more about the Qwen-VL models, see the research paper and the GitHub repository. Credit for this research goes to the researchers on the project.
Author: Aneesh Tickoo