Mobile-Agent: A Vision-Centric Solution for Mobile Device Operations


The rise of Multimodal Large Language Models (MLLMs) has paved the way for exciting advancements in artificial intelligence. With their exceptional visual comprehension capabilities, MLLM-based agents have expanded into diverse applications, including the emergence of mobile device agents. These agents are designed to operate mobile devices based on screen content and user instructions.

Introducing Mobile-Agent

Researchers from Beijing Jiaotong University and Alibaba Group have developed Mobile-Agent, an autonomous multimodal mobile device agent. The approach uses visual perception tools to accurately identify and locate visual and textual elements within an app’s front-end interface. Leveraging this perceived visual context, Mobile-Agent autonomously plans and decomposes complex operation tasks, navigating through mobile apps step by step. Because it operates purely from screenshots, it eliminates any reliance on XML files or mobile system metadata, enhancing adaptability across diverse mobile operating environments.
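The step-by-step operation described above can be sketched as a perceive-plan-act loop. The helper names and data structures below are illustrative assumptions for this article, not the paper's actual interfaces: `perceive` stands in for the screenshot-based vision tools, and `decide` for the MLLM planner.

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    kind: str      # "text" or "icon"
    content: str   # recognized text or an icon description
    box: tuple     # (x1, y1, x2, y2) pixel coordinates on the screenshot

def center(elem: ScreenElement):
    """Tap target: the center of an element's bounding box."""
    x1, y1, x2, y2 = elem.box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

def run_agent(instruction, perceive, decide, act, max_steps=10):
    """Perceive -> decide -> act cycle over raw screenshots.

    `perceive` returns located ScreenElements (no XML/view metadata),
    `decide` picks the next action given the instruction and history,
    `act` executes it on the device. Returns the action history.
    """
    history = []
    for _ in range(max_steps):
        elements = perceive()
        action = decide(instruction, elements, history)
        history.append(action)
        if action[0] == "stop":
            break
        act(action)
    return history
```

Keeping the history visible to the planner at every step is what enables the self-reflection behavior described later: the model can see its own previous actions and revise course.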

Utilizing Advanced Tools

Mobile-Agent employs Optical Character Recognition (OCR) tools for text localization and OpenAI’s CLIP for icon localization. With a framework defining eight operations, the agent can perform tasks such as opening apps, clicking text or icons, typing, and navigating, exercising its iterative self-planning and self-reflection capabilities along the way.
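As a rough illustration, the eight-operation action space might be modeled as an enum plus a small parser for the planner's textual outputs. The operation names below paraphrase the categories the article mentions and are assumptions; they are not guaranteed to match the paper's exact definitions.

```python
from enum import Enum

class Op(Enum):
    # Illustrative eight-operation action space (names are assumptions).
    OPEN_APP = "open_app"       # launch an app by name
    CLICK_TEXT = "click_text"   # tap text located via OCR
    CLICK_ICON = "click_icon"   # tap an icon located via CLIP matching
    TYPE = "type"               # enter text into the focused field
    PAGE_UP = "page_up"         # scroll up
    PAGE_DOWN = "page_down"     # scroll down
    BACK = "back"               # return to the previous page
    STOP = "stop"               # end the task

def parse_action(line: str):
    """Parse a planner output like 'click_text(Settings)' into (Op, arg)."""
    name, _, rest = line.partition("(")
    arg = rest.rstrip(")") if rest else None
    return Op(name), arg
```

A closed action space like this is what lets the vision-only agent stay grounded: the planner may describe anything it sees, but it can only ever emit one of the eight operations.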

Evaluating Mobile-Agent

The researchers also presented Mobile-Eval, a benchmark of 10 popular mobile apps with three instructions each, to comprehensively evaluate Mobile-Agent. With completion rates of 91%, 82%, and 82% across the three instructions and a high Process Score of around 80%, the study highlights the effectiveness of Mobile-Agent and showcases its self-reflective ability to correct errors during execution.

Overall, Mobile-Agent represents a notable step forward for AI-powered mobile device assistants. Its vision-centric approach and robust performance demonstrate its potential as a versatile, adaptable, and language-agnostic solution for interacting with mobile applications. For more information about this research, see the Paper and GitHub.
