Voice assistants help people with a wide range of tasks, such as making phone calls, sending messages, creating calendar events, and navigating. However, these assistants often struggle to fully understand the context of their users' requests. This is where our work comes in: we have developed a feature that lets users easily reference phone numbers, addresses, email addresses, URLs, and dates displayed on their phone screens.
Our focus is on reference understanding, which becomes especially important when multiple similar pieces of text appear on the screen at once, a problem closely related to visual grounding. To address it, we created a dataset and developed a lightweight, general-purpose model.
One of the main challenges we faced was the high computational cost of directly analyzing screen pixels. Our system therefore relies on text extracted from the user interface (UI) instead, which saves resources and makes the system more efficient.
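To make the idea concrete, here is a minimal illustrative sketch (not our actual model) of how extracted UI text could be turned into typed entities and matched against a spoken reference such as "call the second number". All function names, regex patterns, and the ordinal table are assumptions for illustration only:

```python
import re

# Hypothetical entity patterns for text extracted from the UI.
# A production system would use far more robust detectors.
PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\-\s]{7,}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
}

def extract_entities(ui_texts):
    """Return (type, value) pairs found in the extracted UI strings."""
    entities = []
    for text in ui_texts:
        for etype, pattern in PATTERNS.items():
            for match in pattern.findall(text):
                entities.append((etype, match))
    return entities

# Toy mapping from spoken ordinals to indices (illustrative only).
ORDINALS = {"first": 0, "second": 1, "third": 2}

def resolve(entities, etype, ordinal_word="first"):
    """Pick the n-th on-screen entity of the requested type, if any."""
    of_type = [value for t, value in entities if t == etype]
    index = ORDINALS.get(ordinal_word, 0)
    return of_type[index] if index < len(of_type) else None
```

For example, given the screen strings `["Call 415-555-0132 or 415-555-0199", "support@example.com"]`, resolving `("phone", "second")` returns the second phone number. The point of the sketch is that, once text is extracted from the UI, reference resolution can operate on lightweight string features rather than raw pixels.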
Our model is modular, so it can be customized and adapted to different scenarios. This flexibility improves the interpretability of its results and the efficiency of the system, allowing the model to stay within the available runtime memory without degrading performance.
In conclusion, our work improves the reference understanding capabilities of voice assistants. We developed a lightweight model that lets users easily reference important information displayed on their phone screens, and by relying on text extracted from the UI rather than raw pixels, our system gains both efficiency and flexibility. With this feature, voice assistants can offer a more personalized and user-friendly experience.