Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, Fang Wen
How can we learn a universal facial representation that boosts all face analysis tasks? This paper takes one step toward this goal. We study the transfer performance of pre-trained models on face analysis tasks and introduce a framework, called FaRL, for general Facial Representation Learning in a visual-linguistic manner. On one hand, the framework uses a contrastive loss to learn high-level semantic meaning from image-text pairs. On the other hand, we propose exploring low-level information simultaneously to further enhance the face representation by adding a masked image modeling objective. We perform pre-training on LAION-FACE, a dataset containing a large number of face image-text pairs, and evaluate the representation capability on multiple downstream tasks. We show that FaRL achieves better transfer performance than previous pre-trained models. We also verify its superiority in the low-data regime. More importantly, our model surpasses the state-of-the-art methods on face analysis tasks including face parsing and face alignment.
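To make the contrastive term concrete, the following is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the FaRL implementation: the function names, the plain-list embedding format, and the temperature value of 0.07 are all hypothetical choices, and the masked image modeling term is not covered here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched image-text pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature is an assumed constant, not a value taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each image embedding is closest to its own caption embedding and large when the pairs are shuffled, which is the behavior that pushes the encoder toward caption-level semantics.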
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Facial Recognition and Modelling | WFLW (Extra Data) | AUC@10 (inter-ocular) | 61.16 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | WFLW (Extra Data) | FR@10 (inter-ocular) | 1.76 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | WFLW (Extra Data) | NME (inter-ocular) | 3.96 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | AFLW-19 | AUC_box@0.07 (%, Full) | 81.3 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | AFLW-19 | NME_box (%, Full) | 1.334 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | AFLW-19 | NME_diag (%, Frontal) | 0.821 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | AFLW-19 | NME_diag (%, Full) | 0.943 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Challenge) | 4.42 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Common) | 2.5 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Full) | 2.88 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Challenge) | 6.38 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Common) | 3.46 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Full) | 4.05 | FaRL-B (epoch 64) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Challenge) | 4.45 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Common) | 2.56 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-ocular (%, Full) | 2.93 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Challenge) | 6.42 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Common) | 3.53 | FaRL-B (epoch 16) |
| Facial Recognition and Modelling | 300W | NME_inter-pupil (%, Full) | 4.11 | FaRL-B (epoch 16) |
| Scene Parsing | CelebAMask-HQ | Mean F1 | 89.56 | FaRL-B |
| Scene Parsing | LaPa | Mean F1 | 93.88 | FaRL-B |
| Face Reconstruction | 300W | NME_inter-ocular (%, Challenge) | 4.42 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-ocular (%, Common) | 2.5 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-ocular (%, Full) | 2.88 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Challenge) | 6.38 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Common) | 3.46 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Full) | 4.05 | FaRL-B (epoch 64) |
| Face Reconstruction | 300W | NME_inter-ocular (%, Challenge) | 4.45 | FaRL-B (epoch 16) |
| Face Reconstruction | 300W | NME_inter-ocular (%, Common) | 2.56 | FaRL-B (epoch 16) |
| Face Reconstruction | 300W | NME_inter-ocular (%, Full) | 2.93 | FaRL-B (epoch 16) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Challenge) | 6.42 | FaRL-B (epoch 16) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Common) | 3.53 | FaRL-B (epoch 16) |
| Face Reconstruction | 300W | NME_inter-pupil (%, Full) | 4.11 | FaRL-B (epoch 16) |
| Face Reconstruction | WFLW (Extra Data) | AUC@10 (inter-ocular) | 61.16 | FaRL-B (epoch 16) |
| Face Reconstruction | WFLW (Extra Data) | FR@10 (inter-ocular) | 1.76 | FaRL-B (epoch 16) |
| Face Reconstruction | WFLW (Extra Data) | NME (inter-ocular) | 3.96 | FaRL-B (epoch 16) |
| Face Reconstruction | AFLW-19 | AUC_box@0.07 (%, Full) | 81.3 | FaRL-B (epoch 16) |
| Face Reconstruction | AFLW-19 | NME_box (%, Full) | 1.334 | FaRL-B (epoch 16) |
| Face Reconstruction | AFLW-19 | NME_diag (%, Frontal) | 0.821 | FaRL-B (epoch 16) |
| Face Reconstruction | AFLW-19 | NME_diag (%, Full) | 0.943 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-ocular (%, Challenge) | 4.42 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-ocular (%, Common) | 2.5 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-ocular (%, Full) | 2.88 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-pupil (%, Challenge) | 6.38 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-pupil (%, Common) | 3.46 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-pupil (%, Full) | 4.05 | FaRL-B (epoch 64) |
| 3D | 300W | NME_inter-ocular (%, Challenge) | 4.45 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-ocular (%, Common) | 2.56 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-ocular (%, Full) | 2.93 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-pupil (%, Challenge) | 6.42 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-pupil (%, Common) | 3.53 | FaRL-B (epoch 16) |
| 3D | 300W | NME_inter-pupil (%, Full) | 4.11 | FaRL-B (epoch 16) |
| 3D | WFLW (Extra Data) | AUC@10 (inter-ocular) | 61.16 | FaRL-B (epoch 16) |
| 3D | WFLW (Extra Data) | FR@10 (inter-ocular) | 1.76 | FaRL-B (epoch 16) |
| 3D | WFLW (Extra Data) | NME (inter-ocular) | 3.96 | FaRL-B (epoch 16) |
| 3D | AFLW-19 | AUC_box@0.07 (%, Full) | 81.3 | FaRL-B (epoch 16) |
| 3D | AFLW-19 | NME_box (%, Full) | 1.334 | FaRL-B (epoch 16) |
| 3D | AFLW-19 | NME_diag (%, Frontal) | 0.821 | FaRL-B (epoch 16) |
| 3D | AFLW-19 | NME_diag (%, Full) | 0.943 | FaRL-B (epoch 16) |
| 3D Face Modelling | WFLW (Extra Data) | AUC@10 (inter-ocular) | 61.16 | FaRL-B (epoch 16) |
| 3D Face Modelling | WFLW (Extra Data) | FR@10 (inter-ocular) | 1.76 | FaRL-B (epoch 16) |
| 3D Face Modelling | WFLW (Extra Data) | NME (inter-ocular) | 3.96 | FaRL-B (epoch 16) |
| 3D Face Modelling | AFLW-19 | AUC_box@0.07 (%, Full) | 81.3 | FaRL-B (epoch 16) |
| 3D Face Modelling | AFLW-19 | NME_box (%, Full) | 1.334 | FaRL-B (epoch 16) |
| 3D Face Modelling | AFLW-19 | NME_diag (%, Frontal) | 0.821 | FaRL-B (epoch 16) |
| 3D Face Modelling | AFLW-19 | NME_diag (%, Full) | 0.943 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Challenge) | 4.42 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Common) | 2.5 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Full) | 2.88 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Challenge) | 6.38 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Common) | 3.46 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Full) | 4.05 | FaRL-B (epoch 64) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Challenge) | 4.45 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Common) | 2.56 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-ocular (%, Full) | 2.93 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Challenge) | 6.42 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Common) | 3.53 | FaRL-B (epoch 16) |
| 3D Face Modelling | 300W | NME_inter-pupil (%, Full) | 4.11 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | WFLW (Extra Data) | AUC@10 (inter-ocular) | 61.16 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | WFLW (Extra Data) | FR@10 (inter-ocular) | 1.76 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | WFLW (Extra Data) | NME (inter-ocular) | 3.96 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | AFLW-19 | AUC_box@0.07 (%, Full) | 81.3 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | AFLW-19 | NME_box (%, Full) | 1.334 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | AFLW-19 | NME_diag (%, Frontal) | 0.821 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | AFLW-19 | NME_diag (%, Full) | 0.943 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Challenge) | 4.42 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Common) | 2.5 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Full) | 2.88 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Challenge) | 6.38 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Common) | 3.46 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Full) | 4.05 | FaRL-B (epoch 64) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Challenge) | 4.45 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Common) | 2.56 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-ocular (%, Full) | 2.93 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Challenge) | 6.42 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Common) | 3.53 | FaRL-B (epoch 16) |
| 3D Face Reconstruction | 300W | NME_inter-pupil (%, Full) | 4.11 | FaRL-B (epoch 16) |
| 2D Semantic Segmentation | CelebAMask-HQ | Mean F1 | 89.56 | FaRL-B |
| 2D Semantic Segmentation | LaPa | Mean F1 | 93.88 | FaRL-B |
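
The face alignment rows report NME, FR@10 and AUC@10. As a reading aid, here is a hedged sketch of how these metrics are commonly computed from predicted and ground-truth landmarks. The function names, the tuple-based landmark format, and the eye-landmark indices are assumptions for illustration; the exact normalizing landmarks differ per dataset, and this is not the evaluation code behind the table. The functions return fractions, whereas the table reports percentages.

```python
import math


def nme(pred, gt, left_eye_idx, right_eye_idx):
    """Normalized mean error for one face.

    pred and gt are equal-length lists of (x, y) landmarks. The mean point-to-point
    error is divided by the distance between two ground-truth eye landmarks
    (outer eye corners for inter-ocular, pupil centers for inter-pupil).
    """
    norm = math.dist(gt[left_eye_idx], gt[right_eye_idx])
    mean_err = sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
    return mean_err / norm


def failure_rate(nmes, threshold=0.10):
    """Fraction of faces whose NME exceeds the threshold (FR@10 uses 0.10)."""
    return sum(e > threshold for e in nmes) / len(nmes)


def auc(nmes, threshold=0.10, steps=1000):
    """Area under the cumulative error distribution up to the threshold,
    normalized to [0, 1] and integrated with the trapezoid rule."""
    def ced(t):
        # Fraction of faces with NME at or below t
        return sum(e <= t for e in nmes) / len(nmes)

    dt = threshold / steps
    area = sum(0.5 * (ced(k * dt) + ced((k + 1) * dt)) * dt for k in range(steps))
    return area / threshold
```

Lower is better for NME and FR, higher is better for AUC, which is why the table can show a low failure rate alongside a high AUC for the same model.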