Google’s Language Models (LMs) play a critical role in applications like Gboard, enhancing the typing experience with features such as next-word prediction, Smart Compose, and slide to type. Google has now made advances in protecting user privacy while training these LMs directly on user data.
**Gboard On-Device Language Models**
With the deployment of federated learning (FL) in 2017 and the addition of formal differential privacy (DP) guarantees in 2022, Google has been able to train Gboard LMs directly on user data while preserving privacy. In this approach, mobile devices collaboratively train a shared model while raw typing data stays on each device, and data anonymization techniques are applied to the aggregated updates.
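To make the FL setup concrete, here is a minimal sketch of one federated averaging round under simplified assumptions: `client_update` fakes local training with a random gradient, and only the resulting model delta leaves the device. All names and parameters are illustrative, not Gboard’s actual training stack.

```python
import numpy as np

def client_update(global_weights, lr=0.1):
    """Hypothetical on-device step: compute a model delta from local data.

    Real FL would run several SGD steps on the user's typing data; a random
    gradient stands in here so the sketch stays self-contained.
    """
    fake_gradient = np.random.randn(*global_weights.shape)
    return -lr * fake_gradient  # the delta is the only thing that leaves the device

def federated_round(global_weights, num_clients=100):
    """One round of federated averaging: the server only sees aggregated deltas."""
    deltas = [client_update(global_weights) for _ in range(num_clients)]
    return global_weights + np.mean(deltas, axis=0)

weights = np.zeros(8)  # toy global model
weights = federated_round(weights)
```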
**Privacy Principles and Practices**
Google applies principles of transparency, data minimization, data anonymization, and auditability to ensure user data is protected. The code used for privacy accounting is open-sourced for public review, underscoring Google’s commitment to user privacy.
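As an illustration of what open privacy accounting looks like, the sketch below uses Google’s open-source `dp-accounting` Python package to compose a subsampled Gaussian mechanism across training rounds and report the resulting ε at a fixed δ. The parameter values are made up for the example, and this generic DP-SGD-style composition is not the exact accounting used for Gboard’s DP-FTRL training.

```python
# pip install dp-accounting
import dp_accounting

# Illustrative values only; not Gboard's actual training configuration.
noise_multiplier = 1.0        # Gaussian noise stddev / clipping norm
sampling_probability = 0.01   # fraction of clients participating per round
num_rounds = 2000
target_delta = 1e-10

# Each round is modeled as a Poisson-subsampled Gaussian mechanism.
event = dp_accounting.PoissonSampledDpEvent(
    sampling_probability,
    dp_accounting.GaussianDpEvent(noise_multiplier))

# Compose the per-round events with an RDP accountant and convert to (epsilon, delta).
accountant = dp_accounting.rdp.RdpAccountant()
accountant.compose(event, num_rounds)
print(f"epsilon at delta={target_delta}: {accountant.get_epsilon(target_delta):.2f}")
```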
**Differential Privacy by Default**
Google’s approach to training LMs with DP guarantees involves pre-training on a multilingual dataset, running simulation experiments to find a noise-to-signal ratio that is large enough for strong privacy yet small enough to preserve utility, and using advanced algorithms such as Matrix Factorization DP-FTRL (MF-DP-FTRL). These practices have yielded strong DP guarantees, with some models reaching ε ≤ 1, which corresponds to Tier 1 privacy guarantees.
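To make the noise-to-signal trade-off concrete, here is a minimal sketch of a DP aggregation step, assuming simplified independent Gaussian noise rather than the correlated noise that MF-DP-FTRL generates across rounds; `clip_norm` and `noise_multiplier` are illustrative parameters, not Gboard’s production values.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Bound each client's contribution (the per-client 'signal')."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / norm)

def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """Clip, sum, and add Gaussian noise scaled to the clipping norm.

    MF-DP-FTRL would add noise *correlated* across rounds via a matrix
    factorization; independent per-round noise keeps this sketch short.
    """
    rng = np.random.default_rng(seed)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(client_updates)

updates = [np.random.randn(8) for _ in range(100)]  # stand-in client deltas
noisy_mean = dp_aggregate(updates)
```

Raising `noise_multiplier` strengthens the guarantee (lower ε) but degrades model utility, which is precisely the ratio the simulation experiments are used to tune.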
Overall, Google’s incorporation of differential privacy into its language models demonstrates a strong commitment to protecting user privacy while still delivering high utility. As research continues, Google plans to further improve the privacy-utility trade-off through algorithmic advancements and empirical privacy auditing.