Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, Victor Lempitsky
We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person's appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.
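The two-layer composition described in the abstract can be illustrated with a minimal sketch. It assumes single-channel images stored as nested lists, a warp given as per-pixel `(row, col)` sampling coordinates into the texture, and nearest-neighbor sampling. The function names `warp_nearest` and `compose`, the data layout, and the sampling scheme are all hypothetical choices for illustration and are not taken from the authors' implementation.

```python
def warp_nearest(texture, coords):
    """Resample a 2D texture at the given coordinates.

    coords[r][c] is a (row, col) pair saying where output pixel (r, c)
    reads from in the texture. Coordinates are rounded to integers and
    clamped to the texture borders (illustrative assumption).
    """
    height, width = len(texture), len(texture[0])
    warped = []
    for coord_row in coords:
        out_row = []
        for y, x in coord_row:
            yi = min(max(int(round(y)), 0), height - 1)
            xi = min(max(int(round(x)), 0), width - 1)
            out_row.append(texture[yi][xi])
        warped.append(out_row)
    return warped


def compose(coarse, texture, coords):
    """Add the warped pose-independent texture to the pose-dependent coarse image."""
    warped = warp_nearest(texture, coords)
    return [
        [c + t for c, t in zip(coarse_row, warped_row)]
        for coarse_row, warped_row in zip(coarse, warped)
    ]


# Toy 2x2 example: a flat coarse image plus a texture of signed detail values,
# sampled through a warp that mirrors the texture horizontally.
coarse = [[100, 100], [100, 100]]
texture = [[5, -5], [10, -10]]
mirror = [[(0, 1), (0, 0)], [(1, 1), (1, 0)]]
result = compose(coarse, texture, mirror)
print(result)
```

In this sketch only `coarse` and `coords` would change from frame to frame, while `texture` is computed once per person. That is the property the abstract relies on when it says the texture is generated offline.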
Results on VoxCeleb2 - 1-shot learning. The same values are listed under each of the following tasks: Facial Recognition and Modelling, Image Generation, Talking Head Generation, Face Generation, Face Reconstruction, 3D, 3D Face Modelling, 3D Face Reconstruction, 10-shot image generation.

| Model | CSIM | LPIPS | Normalized Pose Error | SSIM | Inference time (ms) |
|---|---|---|---|---|---|
| Fast Bi-layer Avatars (medium size) | 0.653 | 0.358 | 43.3 | 0.508 | 4 |
| First Order Motion Model (medium size) | 0.638 | 0.311 | 47.8 | 0.553 | 13 |
| Few-shot Vid-to-vid (medium size) | 0.604 | 0.368 | 46.1 | 0.419 | 22 |
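
The inference-time rows above can be turned into speedup ratios and frame rates with simple arithmetic. The sketch below uses only the reported per-frame times; the helper names `speedup_over` and `frames_per_second` are hypothetical. The table does not state the hardware used, so the ratios describe the reported setup only. The quality metrics also differ between models (First Order Motion Model reports better LPIPS and SSIM), so this is not an equal-quality comparison.

```python
# Per-frame inference times in milliseconds, copied from the results table.
INFERENCE_MS = {
    "Fast Bi-layer Avatars (medium size)": 4,
    "First Order Motion Model (medium size)": 13,
    "Few-shot Vid-to-vid (medium size)": 22,
}

REFERENCE = "Fast Bi-layer Avatars (medium size)"


def speedup_over(baseline, reference=REFERENCE):
    """How many times faster `reference` is than `baseline` per frame."""
    return INFERENCE_MS[baseline] / INFERENCE_MS[reference]


def frames_per_second(model):
    """Frame rate implied by the per-frame time, ignoring any other overhead."""
    return 1000.0 / INFERENCE_MS[model]


for name in INFERENCE_MS:
    print(f"{name}: {speedup_over(name):.2f}x slower than reference, "
          f"{frames_per_second(name):.1f} frames/s")
```

With the reported numbers, the bi-layer model is 3.25 times faster per frame than First Order Motion Model and 5.5 times faster than Few-shot Vid-to-vid.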