MediaPipe FaceStylizer: Lightweight Face Stylization with Few-Shot Learning
Augmented reality (AR) experiences that use real-time face feature generation and editing have become increasingly popular in recent years. These experiences, found in mobile apps, short videos, virtual reality, and gaming, have created demand for lightweight face generation and editing models. Enter MediaPipe FaceStylizer, an efficient face stylization solution that addresses the twin challenges of model complexity and limited training data while adhering to Google’s responsible AI principles.
The Model: Face Generator and Encoder
MediaPipe FaceStylizer is composed of a face generator and a face encoder. The face generator uses generative adversarial network (GAN) techniques to create high-quality images, while the face encoder acts as a GAN inversion module, mapping input images into the generator’s latent space. To ensure high generation quality, we designed a mobile-friendly synthesis network for the face generator, featuring auxiliary heads that convert intermediate features to RGB images at each level of the generator. Carefully designed loss functions on these auxiliary heads, combined with common GAN loss functions, distill the student generator from a teacher StyleGAN model, yielding a lightweight model that maintains the teacher’s quality.
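To make the multi-level distillation idea concrete, here is a minimal PyTorch sketch of an auxiliary RGB head and a coarse-to-fine distillation loss that compares each level’s RGB output against the teacher’s image downsampled to the matching resolution. This is an illustration under stated assumptions, not MediaPipe’s actual implementation; all module and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxRGBHead(nn.Module):
    """1x1 convolution that maps intermediate generator features to RGB."""
    def __init__(self, in_channels):
        super().__init__()
        self.to_rgb = nn.Conv2d(in_channels, 3, kernel_size=1)

    def forward(self, features):
        # Squash to [-1, 1], a common image range for GAN outputs.
        return torch.tanh(self.to_rgb(features))

def multiscale_distillation_loss(student_rgbs, teacher_image):
    """Sum of per-level L1 losses between each auxiliary RGB output and the
    teacher image resized to that level's resolution (coarse to fine)."""
    loss = 0.0
    for rgb in student_rgbs:
        target = F.interpolate(teacher_image, size=rgb.shape[-2:],
                               mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(rgb, target)
    return loss
```

In practice this distillation term would be combined with standard adversarial losses, as the section above describes.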
Few-Shot Face Stylization
To adapt the MediaPipe FaceStylizer to different styles, we created an end-to-end pipeline that enables users to fine-tune the model with just a few examples. During the fine-tuning process, the encoder module is frozen, and only the generator is adjusted. By sampling multiple latent codes close to the encoding output of the input style images, the generator learns to reconstruct images of a person’s face in the chosen style. With this customization, the FaceStylizer can stylize test images of real human faces.
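The following is a minimal PyTorch sketch of this fine-tuning loop, assuming hypothetical `encoder` and `generator` modules: the encoder is frozen, latent codes are sampled in a small neighborhood of the encoded style exemplars, and only the generator’s weights are updated. Hyperparameter names and the simple L1 reconstruction objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def finetune_generator(encoder, generator, style_images,
                       steps=100, noise_std=0.1, lr=1e-3):
    """Few-shot style fine-tuning sketch: freeze the encoder, perturb the
    encoded style codes, and train the generator to reconstruct the style
    exemplars from nearby latent codes."""
    for p in encoder.parameters():
        p.requires_grad_(False)                  # encoder stays frozen

    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)

    with torch.no_grad():
        style_codes = encoder(style_images)      # latent codes of style images

    for _ in range(steps):
        # Sample latent codes close to the style embeddings.
        z = style_codes + noise_std * torch.randn_like(style_codes)
        reconstructed = generator(z)
        # Reconstruction toward the style exemplars; a full pipeline would
        # also include adversarial losses.
        loss = F.l1_loss(reconstructed, style_images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return generator
```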
Generator: BlazeStyleGAN
BlazeStyleGAN, the FaceStylizer generator, builds on the widely used StyleGAN architecture and is designed for efficient on-device face generation. It reduces complexity by shrinking the latent feature dimensions and attaching multiple auxiliary heads that evaluate perceptual quality from coarse to fine. Distilling BlazeStyleGAN from the teacher StyleGAN model transfers the teacher’s high-fidelity generation capability while mitigating artifacts the teacher can produce, maintaining high visual quality in a much smaller model.
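As a rough illustration of what one resolution level of such a lightweight generator might look like, the sketch below shows an upsampling block with reduced channel widths and an auxiliary RGB head attached for coarse-to-fine supervision. Style modulation and noise injection from StyleGAN are omitted for brevity; this is an assumed simplification, not the published BlazeStyleGAN architecture.

```python
import torch.nn as nn

class LiteSynthesisBlock(nn.Module):
    """One resolution level of a lightweight StyleGAN-like generator:
    upsample, two small convolutions, and an auxiliary RGB head."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.aux_head = nn.Conv2d(out_ch, 3, 1)  # features -> RGB here

    def forward(self, x):
        x = self.up(x)
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        # Return features for the next level plus this level's RGB output,
        # which feeds the per-level distillation loss.
        return x, self.aux_head(x)
```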
Efficient GAN Inversion with Encoder
To support image-to-image stylization, we introduced an efficient GAN inversion encoder that maps input images into the latent space of the generator. Built on a MobileNetV2 backbone, the encoder performs well on natural face images. Its training loss combines a perceptual image-quality term, a style-similarity term, a latent embedding distance, and an L1 loss between the input and reconstructed images.
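A minimal sketch of how these four terms could be combined is shown below. The `perceptual` and `gram_style` callables stand in for a perceptual metric (e.g., VGG-feature or LPIPS distance) and a Gram-matrix style distance, and the weights are illustrative assumptions rather than published values.

```python
import torch.nn.functional as F

def inversion_loss(x, x_rec, z, z_rec, perceptual, gram_style,
                   w_perc=1.0, w_style=1.0, w_embed=1.0, w_l1=1.0):
    """Combined GAN-inversion loss sketch: perceptual quality, style
    similarity, latent embedding distance, and pixel-wise L1.

    x / x_rec : input image and its reconstruction through the generator
    z / z_rec : encoder output for x and for the reconstructed image
    """
    loss = w_perc * perceptual(x_rec, x)       # perceptual image quality
    loss = loss + w_style * gram_style(x_rec, x)  # style similarity
    loss = loss + w_embed * F.mse_loss(z_rec, z)  # embedding distance
    loss = loss + w_l1 * F.l1_loss(x_rec, x)      # pixel reconstruction
    return loss
```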
On-Device Performance and Fairness Evaluation
MediaPipe FaceStylizer was benchmarked for parameter count, FLOPs, and inference time on a range of devices. Both BlazeStyleGAN-256 and BlazeStyleGAN-512 achieved real-time performance on mobile GPUs, and BlazeStyleGAN-256 also performed well on iOS CPUs. A fairness evaluation further demonstrated the model’s balanced performance across a diverse set of human faces.
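For readers who want to reproduce a simple latency measurement on their own hardware, the following sketch times an exported TFLite model on CPU using the standard TensorFlow Lite interpreter API. The model filename and the float32 input assumption are placeholders; substitute your own exported file.

```python
import time
import numpy as np
import tensorflow as tf

# Path is illustrative; point this at your exported TFLite model.
interpreter = tf.lite.Interpreter(model_path="face_stylizer.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Random input matching the model's expected shape (assumed float32).
dummy = np.random.rand(*inp["shape"]).astype(np.float32)
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()                      # warm-up run

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) / runs * 1e3
print(f"Average CPU inference time: {elapsed_ms:.1f} ms")
```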
Face Stylization Visualization
Finally, the effectiveness of MediaPipe FaceStylizer was demonstrated through face stylization visualizations. Given a style image and a natural face image, the FaceStylizer produced high-quality stylization results that blend the characteristics of both inputs.
With MediaPipe FaceStylizer, users get lightweight, customizable face stylization through few-shot learning. The model’s efficiency, generation quality, and adherence to responsible AI principles make it a valuable tool for AR experiences, gaming, and beyond.