Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, Xin Geng
Convolutional Neural Networks (ConvNets) have achieved excellent performance in various visual recognition tasks. A large labeled training set is one of the most important factors in their success. However, it is difficult to collect sufficient training images with precise labels in some domains, such as apparent age estimation, head pose estimation, multi-label classification and semantic segmentation. Fortunately, there is ambiguous information among labels, which makes these tasks different from traditional classification. Based on this observation, we convert the label of each image into a discrete label distribution, and learn the label distribution by minimizing the Kullback-Leibler divergence between the predicted and ground-truth label distributions using deep ConvNets. The proposed DLDL (Deep Label Distribution Learning) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which helps prevent the network from over-fitting even when the training set is small. Experimental results show that the proposed approach produces significantly better results than state-of-the-art methods for age estimation and head pose estimation. It also improves recognition performance for multi-label classification and semantic segmentation tasks.
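The two steps named in the abstract, converting a label into a discrete label distribution and minimizing a Kullback-Leibler divergence against the prediction, can be illustrated with a small NumPy sketch. The Gaussian shape, the `sigma` value, the 0 to 100 age range and all function names below are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import numpy as np

def gaussian_label_distribution(label, bins, sigma=2.0):
    """Turn a scalar label into a discrete distribution over `bins`.

    Assumption: a Gaussian centred on the label with a hand-picked sigma.
    """
    weights = np.exp(-0.5 * ((bins - label) / sigma) ** 2)
    return weights / weights.sum()

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return float(np.sum(target * (np.log(target + eps) - np.log(predicted + eps))))

# Hypothetical setup: ages 0..100 as the label space, ground-truth age 30.
bins = np.arange(0, 101, dtype=float)
target = gaussian_label_distribution(30.0, bins)

# Stand-in for the network output: all-zero logits give a uniform prediction.
predicted = softmax(np.zeros_like(bins))

# Training would minimize this quantity with respect to the network weights.
loss = kl_divergence(target, predicted)

# One way to read a scalar estimate back out: the expectation over the bins.
expected_age = float(np.sum(bins * target))
```

In a real training loop the logits would come from the last layer of the ConvNet, and the loss would be averaged over a mini-batch; the sketch only shows the per-sample computation.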
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | PASCAL VOC 2012 | Mean IoU | 67.1 | DLDL-8s+CRF |
| Semantic Segmentation | PASCAL VOC 2011 | Mean IoU | 67.6 | DLDL-8s+CRF |
| Head Pose Estimation | Pointing'04 | MAE | 4.64 | DLDL (KL) |
| Head Pose Estimation | AFLW | MAE | 9.78 | DLDL (KL) |
| Head Pose Estimation | BJUT-3D | MAE | 0.09 | DLDL (KL) |
| Multi-Label Classification | PASCAL VOC 2012 | mAP | 92.4 | PF-DLDL |
| Multi-Label Classification | PASCAL VOC 2007 | mAP | 93.4 | PF-DLDL |
| Age Estimation | ChaLearn 2015 | MAE | 3.51 | DLDL+VGG-Face |
| Age Estimation | ChaLearn 2015 | ε-error | 0.31 | DLDL+VGG-Face |
| Age Estimation | MORPH Album2 | MAE | 2.42 | DLDL+VGG-Face (KL, Max)3 |