Suppose a team of scientists has developed a machine-learning model that can predict whether a patient has cancer from lung scan images. They want to share this model with hospitals so doctors can use it for diagnosis. However, there’s a problem. The model was trained on millions of real lung scan images, and someone with bad intentions could potentially extract this sensitive data from the model. To prevent this, the scientists can add noise to the model, making it harder for anyone to guess the original data. But noise also tends to reduce a model’s accuracy, so the less that has to be added, the better. MIT researchers have come up with a technique called Probably Approximately Correct (PAC) Privacy that lets users add the smallest amount of noise possible while still protecting the sensitive data.
PAC Privacy is a new privacy metric, and the researchers built a framework around it. The framework can automatically determine the minimal amount of noise needed to protect the data without knowing the inner workings of the model or its training process. The researchers show that, in several cases, PAC Privacy requires less noise than other approaches. This could help engineers create machine-learning models that hide their training data while maintaining accuracy in real-world settings.
PAC Privacy asks how hard it would be for an adversary to reconstruct any part of the sensitive data after noise has been added, rather than only whether the data can be distinguished. The researchers developed an algorithm that tells users how much noise to add to the model to prevent an adversary from confidently reconstructing a close approximation of the sensitive data. This guarantee holds even if the adversary has infinite computing power.
Unlike other privacy approaches, the PAC Privacy algorithm doesn’t require knowledge of the model’s inner workings or the training process. Users specify their desired level of confidence, and the algorithm determines the optimal amount of noise required to achieve that level of privacy. The algorithm has limitations, however. It doesn’t tell users how much accuracy the model will lose once the noise is added. And because it has to run the training process repeatedly, it can be computationally expensive.
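The black-box idea can be illustrated with a minimal sketch. This is not the researchers' algorithm: it assumes, for illustration only, that the noise is Gaussian and scaled to how much the training output varies across random subsamples of the data. The names `estimate_output_spread`, `release_with_noise`, `mean_model`, and the `noise_multiplier` knob are all hypothetical; in particular, the article does not say how a desired confidence level maps to a noise scale, so `noise_multiplier` is only a stand-in for that step.

```python
import random
import statistics


def estimate_output_spread(train, data, n_trials=200, subsample_frac=0.5, seed=0):
    """Run the black-box `train` on random subsamples of `data`.

    Returns the standard deviation of each output coordinate across trials.
    `train` is only ever called, never inspected.
    """
    rng = random.Random(seed)
    k = max(1, int(len(data) * subsample_frac))
    outputs = [train(rng.sample(data, k)) for _ in range(n_trials)]
    # zip(*outputs) groups the i-th coordinate of every output together.
    return [statistics.pstdev(list(coord)) for coord in zip(*outputs)]


def release_with_noise(train, data, noise_multiplier=1.0, seed=0, **kwargs):
    """Train on the full data, then add Gaussian noise scaled to the spread.

    ASSUMPTION: `noise_multiplier` is a hypothetical placeholder for whatever
    rule turns a user's confidence level into a noise scale.
    """
    spread = estimate_output_spread(train, data, seed=seed, **kwargs)
    rng = random.Random(seed + 1)
    clean = train(data)
    noisy = [c + rng.gauss(0.0, noise_multiplier * s) for c, s in zip(clean, spread)]
    return noisy, spread


def mean_model(rows):
    """Toy 'training' procedure: the model is the mean of each column."""
    return [statistics.fmean(list(col)) for col in zip(*rows)]
```

Calling `release_with_noise(mean_model, rows)` on a list of equal-length numeric tuples returns the noisy parameters together with the measured spread. Each trial is a full training run, which is why this kind of procedure gets expensive for large models.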
One way to improve PAC Privacy is to make the machine-learning training process more stable. A stable process produces outputs that vary less from one subsample of the data to another, so less noise needs to be added. More stable models also tend to have less generalization error, meaning they can make more accurate predictions on new data.
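A toy comparison can make the link between stability and noise concrete. The sketch below is an illustration under stated assumptions, not the researchers' method: it treats two hypothetical "training" procedures, `raw_mean` and `clipped_mean`, as black boxes and measures how much each one's output varies across random subsamples. Clipping is used here only as one simple example of a stabilizing step, and the data set, including its single extreme outlier, is made up.

```python
import random
import statistics


def subsample_spread(train, data, n_trials=200, subsample_frac=0.5, seed=0):
    """Standard deviation of a scalar-valued `train` across random subsamples."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * subsample_frac))
    outputs = [train(rng.sample(data, k)) for _ in range(n_trials)]
    return statistics.pstdev(outputs)


def raw_mean(values):
    """Less stable: one extreme record can shift the output a lot."""
    return statistics.fmean(values)


def clipped_mean(values, low=0.0, high=1.0):
    """More stable: each record is clipped to [low, high] before averaging."""
    return statistics.fmean([min(max(v, low), high) for v in values])


# Made-up data: 99 values in [0, 1) plus one extreme outlier.
_rng = random.Random(42)
data = [_rng.random() for _ in range(99)] + [1000.0]

raw_spread = subsample_spread(raw_mean, data)
clipped_spread = subsample_spread(clipped_mean, data)
print(f"raw mean spread:     {raw_spread:.4f}")
print(f"clipped mean spread: {clipped_spread:.4f}")
```

The clipped procedure varies far less across subsamples, so a noise scale tied to that variation would be correspondingly smaller.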
The researchers are interested in exploring the relationship between stability, privacy, and generalization error further in the future. This research is funded by DSTA Singapore, Cisco Systems, Capital One, and a MathWorks Fellowship.