Large machine learning models are used in many applications, such as spam filters, recommender systems, and virtual assistants. These models perform well in part because they are trained on large amounts of data. However, that training data can contain private information that needs to be protected, and this is where Differential Privacy (DP) comes in.
DP is a formal framework for limiting how much any single individual's data can influence the output of a computation, such as a trained ML model. It guarantees that adding or removing one user's contribution will not significantly change the resulting model. The strength of this guarantee is expressed by a pair of parameters (ε, δ), where smaller values mean stronger privacy.
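For reference (the formal definition is not spelled out above), a randomized training algorithm M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single unit of data (one example or one user, depending on the privacy unit) and any set of possible outputs S,

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S] + δ.

Intuitively, δ is the small probability with which the multiplicative exp(ε) bound is allowed to fail.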
While there are successful examples of using DP to protect training data, it can be challenging to achieve good utility with differentially private machine learning (DP-ML) techniques. There are inherent trade-offs between privacy, utility, and computation that can limit a model’s performance, and there is little practical guidance on how to tune a model’s architecture and hyperparameters under DP.
In our research paper “How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy,” we discuss the current state of DP-ML research. We explore common techniques for achieving DP-ML models and address the challenges, mitigation techniques, and open questions in this field. We will also be presenting tutorials based on this work at ICML 2023 and KDD 2023.
DP can be introduced at three different stages of the ML development process: at the input data level, during training, or at inference (prediction) time. Each choice protects a different scope. Making the input data itself differentially private means that any model trained on that data, and any other analysis run on it, inherits the privacy guarantee. Introducing DP during training means that only that particular model (and anything derived from it) carries the guarantee. Applying DP only at inference protects the released predictions, but not the model itself.
In practice, DP is most often introduced at the training stage (DP-training). For complex models such as deep neural networks, gradient-noise-injection methods like DP-SGD or DP-FTRL are currently the most practical way to obtain DP guarantees. These methods clip each example’s gradient to bound its sensitivity and add calibrated noise to the aggregated gradients. However, they come with costs: a loss of utility, slower training, and a larger memory footprint.
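To make the clip-then-add-noise recipe concrete, here is a minimal DP-SGD sketch for a toy logistic-regression model. It is an illustration only, not code from the paper; the data, model, and hyperparameter values are assumptions, and real workloads would use a library such as TensorFlow Privacy or Opacus together with a privacy accountant.

```python
# Minimal DP-SGD sketch for logistic regression (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 examples, 20 features, binary labels (made up for the example).
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(float)

w = np.zeros(20)
clip_norm = 1.0          # C: per-example gradient norm bound
noise_multiplier = 1.1   # sigma: noise std is sigma * C
batch_size = 100
learning_rate = 0.1

def per_example_grads(w, X_batch, y_batch):
    """Per-example logistic-loss gradients, shape (batch, features)."""
    probs = 1.0 / (1.0 + np.exp(-(X_batch @ w)))
    return (probs - y_batch)[:, None] * X_batch

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    grads = per_example_grads(w, X[idx], y[idx])

    # 1. Clip each example's gradient to norm at most clip_norm (bounds sensitivity).
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    # 2. Sum the clipped gradients and add Gaussian noise calibrated to the clip norm.
    noisy_sum = grads.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape)

    # 3. Average and take an ordinary SGD step on the noisy gradient.
    w -= learning_rate * noisy_sum / batch_size
```

The per-example clipping step is what drives most of the extra compute and memory relative to standard SGD, since gradients must be materialized (or recomputed) example by example before aggregation.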
To mitigate the loss of utility, it is recommended to use larger batch sizes and more training iterations. Hyperparameter tuning under DP is also crucial for getting the best model performance. The slower training and larger memory footprint can be addressed by adding computation resources or by emulating large batches with gradient accumulation, as sketched below.
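The snippet below continues the DP-SGD sketch above (it reuses `per_example_grads`, `w`, `X`, `y`, `clip_norm`, `noise_multiplier`, `learning_rate`, and `rng`) and shows one way gradient accumulation can emulate a large logical batch without a matching increase in peak memory. The micro-batch size and accumulation count are made-up values, and the privacy accounting must still be done with respect to how the full logical batch is sampled.

```python
# Gradient accumulation: build one large logical batch from several micro-batches.
micro_batch_size = 100
accumulation_steps = 10            # logical batch = 10 * 100 = 1,000 examples

accumulated = np.zeros_like(w)
for _ in range(accumulation_steps):
    idx = rng.choice(len(X), size=micro_batch_size, replace=False)
    grads = per_example_grads(w, X[idx], y[idx])
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    accumulated += grads.sum(axis=0)   # only the clipped sum is kept in memory

# Noise is added once per logical batch, so the effective batch size grows
# while each micro-batch's per-example gradients stay small enough to fit in memory.
logical_batch = micro_batch_size * accumulation_steps
noisy_sum = accumulated + rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
w -= learning_rate * noisy_sum / logical_batch
```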
There are several best practices for achieving rigorous DP guarantees with good model utility. Choosing the right privacy unit for the type of data, such as example-level or user-level protection, is important. Practitioners should aim for reasonable privacy guarantees, typically ε ≤ 10. Hyperparameter tuning should weigh the trade-offs between model utility, privacy cost ε, and computation cost. Finally, reporting comprehensive privacy guarantees is essential for comparing different DP methods: the DP setting, how the DP definition is instantiated (the privacy unit and the values of ε and δ), and the privacy accounting details (illustrated below).
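As a toy illustration of what "privacy accounting details" means (this example is not from the paper), the sketch below converts a number of Gaussian-mechanism training steps into an (ε, δ) guarantee using Rényi DP. It deliberately treats every step as a full-batch Gaussian mechanism and ignores privacy amplification by subsampling, so it overestimates ε relative to the accountants used for DP-SGD with Poisson sampling; the noise multiplier, step count, and δ are made-up values, and production code would use a library accountant (e.g., in TensorFlow Privacy or Opacus).

```python
# Minimal Renyi-DP accounting sketch (conservative: no subsampling amplification).
import math

def epsilon_for_gaussian_composition(noise_multiplier, steps, delta,
                                     orders=tuple(range(2, 256))):
    """Bound epsilon for `steps` compositions of the Gaussian mechanism.

    Uses the standard facts that the Gaussian mechanism with noise multiplier sigma
    satisfies (alpha, alpha / (2 * sigma**2))-RDP, that RDP composes additively, and
    the classic RDP -> (epsilon, delta) conversion
    epsilon = rdp + log(1 / delta) / (alpha - 1), minimized over the order alpha.
    """
    best = float("inf")
    for alpha in orders:
        rdp = steps * alpha / (2.0 * noise_multiplier ** 2)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1)
        best = min(best, eps)
    return best

# Example: 1,000 full-batch steps with noise multiplier 50 and delta = 1e-5.
# Prints an epsilon of roughly 3.2 for these made-up parameters.
print(epsilon_for_gaussian_composition(noise_multiplier=50.0, steps=1000, delta=1e-5))
```

A complete privacy statement would report, alongside such an ε and δ, the privacy unit, the sampling scheme, and which accountant produced the numbers, since different accountants can give noticeably different ε for the same training run.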
In conclusion, DP is a crucial technology for protecting the privacy of training data in machine learning models. While achieving good utility with DP-ML techniques remains challenging, following these best practices and carefully choosing, accounting for, and reporting privacy guarantees can help researchers and practitioners get strong results.