Mingyang Shang, Dawei Xiang, Zhicheng Wang, Erjin Zhou
Occlusion remains a major challenge in pedestrian detection. In this paper, we propose a simple yet effective method named V2F-Net, which explicitly decomposes occluded pedestrian detection into visible region detection and full-body estimation. V2F-Net consists of two sub-networks: a Visible region Detection Network (VDN) and a Full body Estimation Network (FEN). VDN localizes visible regions, and FEN estimates the full-body box on the basis of the visible box. Moreover, to further improve full-body estimation, we propose a novel Embedding-based Part-aware Module (EPM). By supervising the visibility of each part, the network is encouraged to extract features with essential part information. We demonstrate the effectiveness of V2F-Net through experiments on two challenging datasets. V2F-Net achieves a 5.85% AP gain on CrowdHuman and a 2.24% MR⁻² improvement on CityPersons compared to the FPN baseline. Besides, the consistent gains on both one-stage and two-stage detectors validate the generalizability of our method.
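
To make the visible-to-full decomposition concrete, here is a minimal sketch of how a full-body box could be decoded from a visible box. It assumes the standard Faster R-CNN style delta parametrization with the visible box used as the reference; the abstract does not specify FEN's actual regression targets, and the function name `decode_full_box` is hypothetical.

```python
import math


def decode_full_box(visible_box, deltas):
    """Apply (dx, dy, dw, dh) deltas to a visible box given as (x1, y1, x2, y2).

    Assumption: center offsets are scaled by the visible box size and the
    width/height are scaled exponentially, as in Faster R-CNN box decoding.
    This is an illustration, not the parametrization confirmed by the paper.
    """
    x1, y1, x2, y2 = visible_box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    dx, dy, dw, dh = deltas
    full_cx = cx + dx * w          # shift the center horizontally
    full_cy = cy + dy * h          # shift the center vertically
    full_w = w * math.exp(dw)      # rescale the width
    full_h = h * math.exp(dh)      # rescale the height

    return (
        full_cx - 0.5 * full_w,
        full_cy - 0.5 * full_h,
        full_cx + 0.5 * full_w,
        full_cy + 0.5 * full_h,
    )
```

For example, if only the upper half of a pedestrian is visible in the box `(10, 20, 50, 60)`, the deltas `(0.0, 0.5, 0.0, math.log(2))` move the center down by half the visible height and double the height, giving a full-body box of approximately `(10, 20, 50, 100)`.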
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | CrowdHuman (full body) | AP | 91.03 | V2F-Net |
| Object Detection | CrowdHuman (full body) | Recall | 84.2 | V2F-Net |
| Object Detection | CrowdHuman (full body) | mMR | 42.28 | V2F-Net |
| Object Detection | CityPersons | mMR | 10.08 | V2F-Net |
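
The mMR rows report the log-average miss rate (often written MR⁻²), which averages the miss rate in log space over false positives per image (FPPI) in the range [10⁻², 10⁰]. The sketch below shows one common way to compute it from a miss-rate/FPPI curve. The function name `log_average_miss_rate` is hypothetical, and the sampling convention (nine reference points, taking the last curve point at or below each reference) is an assumption; evaluation toolkits differ in these details, so the numbers in the table may have been produced with a slightly different rule.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at FPPI points log-spaced in [1e-2, 1e0].

    `fppi` must be sorted in ascending order, with `miss_rate` aligned to it.
    """
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    samples = []
    for ref in refs:
        # Assumed convention: use the miss rate of the last curve point whose
        # FPPI does not exceed the reference; if there is none, count it as 1.0.
        sampled = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                sampled = m
            else:
                break
        # Clamp to avoid log(0) when the miss rate reaches zero.
        samples.append(max(sampled, 1e-10))
    return math.exp(sum(math.log(s) for s in samples) / len(samples))
```

A curve with a constant miss rate, such as `log_average_miss_rate([0.001, 0.1, 1.0], [0.5, 0.5, 0.5])`, yields that same miss rate (up to floating-point error), which is a quick sanity check for the averaging step.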