
Enhancing Privacy and Efficiency in Large Embedding Models with DP-AdaFEST


Large Embedding Models: DP-AdaFEST

Large embedding models have become essential in the world of AI. They power recommendation systems and natural language processing, but training them raises the challenge of protecting user privacy. One popular safeguard is differential privacy (DP), widely used because it protects individual data while still allowing meaningful analysis. However, applying DP to large embedding models destroys the natural sparsity of their gradients, which makes private training much less efficient. This is where DP-AdaFEST comes in.

Challenges of DP-SGD

Large embedding models map non-numerical (categorical) feature fields into dense vectors. Since each training example activates only a small fraction of the embedding rows, the resulting gradients are naturally sparse. DP-SGD, a commonly used private training algorithm, adds noise to every coordinate of the gradient and therefore completely eliminates this sparsity. That is a significant barrier to private training of these large models: without gradient sparsity, each update is far more expensive than in non-private training.
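
To make this concrete, here is a minimal NumPy sketch (our illustration, not the paper's implementation; the table size, clipping norm, and noise multiplier are made-up values, and the whole toy gradient is treated as a single clipped example for brevity). It shows how DP-SGD's per-coordinate noise turns a sparse embedding gradient fully dense:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding gradient: only a handful of rows are touched by the batch.
vocab_size, dim = 100_000, 16
grad = np.zeros((vocab_size, dim))
touched_rows = [3, 17, 42]
grad[touched_rows] = rng.normal(size=(len(touched_rows), dim))

# DP-SGD: clip the gradient, then add Gaussian noise to EVERY coordinate.
clip_norm, noise_mult = 1.0, 1.1
grad *= min(1.0, clip_norm / np.linalg.norm(grad))
noised = grad + rng.normal(scale=noise_mult * clip_norm, size=grad.shape)

print(np.count_nonzero(grad) / grad.size)      # ~3e-5: sparse before noising
print(np.count_nonzero(noised) / noised.size)  # ~1.0: fully dense afterwards
```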

The DP-AdaFEST Algorithm

Our solution to the gradient sparsity problem is DP-AdaFEST. It extends standard DP-SGD with an additional mechanism that privately selects the “hot features”: those activated by multiple training examples in the current mini-batch. Sparsity is preserved because only features whose noisy activation count exceeds a threshold receive a gradient update.
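
Below is a minimal NumPy sketch of this idea (our illustration; the function name `dp_adafest_step`, the thresholds, and the noise scales are hypothetical assumptions, not the paper's exact mechanism or privacy calibration):

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_adafest_step(per_example_rows, per_example_grads, vocab_size, dim,
                    threshold, count_noise, grad_noise, clip_norm=1.0):
    """One sparsity-preserving noisy update (illustrative sketch).

    per_example_rows:  list of 1-D arrays, the embedding rows each example touches
    per_example_grads: list of (n_rows, dim) arrays, the matching row gradients
    """
    # 1) Private hot-feature selection: count how many examples touch each
    #    row, noise the counts, and keep rows above the threshold.
    counts = np.zeros(vocab_size)
    for rows in per_example_rows:
        counts[np.unique(rows)] += 1.0
    noisy_counts = counts + rng.normal(scale=count_noise, size=vocab_size)
    hot = np.flatnonzero(noisy_counts > threshold)
    hot_pos = {int(r): i for i, r in enumerate(hot)}

    # 2) Accumulate clipped per-example gradients on the hot rows only.
    update = np.zeros((len(hot), dim))
    for rows, grads in zip(per_example_rows, per_example_grads):
        scale = min(1.0, clip_norm / max(np.linalg.norm(grads), 1e-12))
        for r, g in zip(rows, grads):
            if int(r) in hot_pos:
                update[hot_pos[int(r)]] += scale * g

    # 3) Add Gaussian noise only to the selected coordinates; the final
    #    update stays sparse in the full embedding table.
    update += rng.normal(scale=grad_noise * clip_norm, size=update.shape)
    return hot, update

# Usage: two toy examples with one overlapping ("hot") row.
rows = [np.array([3, 17]), np.array([17, 42])]
grads = [rng.normal(size=(2, 4)), rng.normal(size=(2, 4))]
hot, update = dp_adafest_step(rows, grads, vocab_size=1_000, dim=4,
                              threshold=1.5, count_noise=0.25, grad_noise=1.1)
print(hot)  # very likely just row 17, the one touched by both examples
```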

The theory behind DP-AdaFEST is that it trades a slight increase in bias for a reduction in variance: because noise is added only to the small set of selected coordinates, far less total noise enters each update, while the dropped, unselected coordinates contribute a small bias. In our experiments, DP-AdaFEST proved effective at reducing gradient size while preserving utility: on ad prediction datasets it achieved a significant reduction in gradient computation cost at utility comparable to DP-SGD.
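
As a back-of-the-envelope illustration of the variance side (our numbers, all hypothetical, not results from the paper): with per-coordinate Gaussian noise of scale σ, the total injected noise variance is proportional to the number of noised coordinates, so noising only a small selected set of rows instead of the whole table shrinks it by orders of magnitude:

```python
# Total injected noise variance scales with the number of noised coordinates.
vocab_size, dim, sigma = 1_000_000, 16, 1.1   # hypothetical sizes and scale
hot_rows = 500                                # rows kept by private selection

dense_var = vocab_size * dim * sigma**2       # DP-SGD: noise on every row
sparse_var = hot_rows * dim * sigma**2        # DP-AdaFEST-style: hot rows only
print(sparse_var / dense_var)                 # 0.0005 -> 2000x less noise
```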

