Ping Hu, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Stan Sclaroff, Federico Perazzi
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. At each time step, we therefore only need to perform a lightweight computation to extract one group of sub-features from a single sub-network. The full features used for segmentation are then recomposed by applying a novel attention propagation module that compensates for geometric deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both the full-feature and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
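
The scheduling idea in the abstract (one shallow sub-network per frame, with the full feature recomposed from the most recent sub-feature groups) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `make_subnetwork` is a toy stand-in for a shallow CNN, `TemporallyDistributedExtractor` is an illustrative class, and the element-wise sum is only a placeholder for the attention propagation module, which in the actual method also compensates for motion between frames.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Features = List[float]
SubNetwork = Callable[[Sequence[float]], Features]


def make_subnetwork(scale: float) -> SubNetwork:
    """Hypothetical stand-in for one shallow sub-network (assumption, not the paper's model)."""
    def run(frame: Sequence[float]) -> Features:
        return [scale * x for x in frame]
    return run


class TemporallyDistributedExtractor:
    """Runs exactly one sub-network per frame and recomposes recent sub-features."""

    def __init__(self, subnetworks: Sequence[SubNetwork]) -> None:
        self.subnetworks = list(subnetworks)
        # Keep only the latest sub-feature group from each of the m sub-networks.
        self.window: Deque[Features] = deque(maxlen=len(self.subnetworks))
        self.step = 0
        self.calls = 0  # number of sub-network evaluations so far

    def process(self, frame: Sequence[float]) -> Features:
        # Round-robin: frame t is handled by sub-network (t mod m).
        index = self.step % len(self.subnetworks)
        self.window.append(self.subnetworks[index](frame))
        self.step += 1
        self.calls += 1
        # Placeholder recomposition: element-wise sum over the window.
        # The paper uses an attention propagation module here instead.
        return [sum(values) for values in zip(*self.window)]


if __name__ == "__main__":
    extractor = TemporallyDistributedExtractor(
        [make_subnetwork(s) for s in (1.0, 2.0, 3.0, 4.0)]
    )
    for t in range(5):
        print(t, extractor.process([1.0, 1.0]))
```

With four sub-networks, each frame costs a single sub-network evaluation, and from the fourth frame onward the recomposed feature draws on one group from every sub-network.
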
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Scene Parsing | Cityscapes val | mIoU | 79.9 | TDNet-50 [9] |
| Scene Parsing | CamVid | Mean IoU | 76.2 | TDNet-50 |
| Semantic Segmentation | NYU Depth v2 | Mean IoU | 43.5 | TD2-PSP50 |
| Semantic Segmentation | NYU Depth v2 | Mean IoU | 37.4 | TD4-PSP18 |
| Semantic Segmentation | UrbanLF | mIoU (Real) | 76.48 | TDNet (ResNet-50) |
| Semantic Segmentation | UrbanLF | mIoU (Syn) | 74.71 | TDNet (ResNet-50) |
| Semantic Segmentation | Cityscapes test | Time (ms) | 21 | TD4-BISE18 |
| Semantic Segmentation | CamVid | Time (ms) | 90 | TD2-PSP50 |
| Semantic Segmentation | CamVid | mIoU | 76 | TD2-PSP50 |
| Semantic Segmentation | CamVid | Time (ms) | 40 | TD4-PSP18 |
| Semantic Segmentation | CamVid | mIoU | 72.6 | TD4-PSP18 |
| Semantic Segmentation | NYU Depth v2 | Speed (ms/frame) | 35 | TD2-PSP50 |
| Semantic Segmentation | NYU Depth v2 | mIoU | 43.5 | TD2-PSP50 |
| Semantic Segmentation | NYU Depth v2 | Speed (ms/frame) | 19 | TD4-PSP18 |
| Semantic Segmentation | NYU Depth v2 | mIoU | 37.4 | TD4-PSP18 |
| Video Semantic Segmentation | Cityscapes val | mIoU | 79.9 | TDNet-50 [9] |
| Video Semantic Segmentation | CamVid | Mean IoU | 76.2 | TDNet-50 |
| Scene Understanding | Cityscapes val | mIoU | 79.9 | TDNet-50 [9] |
| Scene Understanding | CamVid | Mean IoU | 76.2 | TDNet-50 |
| 2D Semantic Segmentation | Cityscapes val | mIoU | 79.9 | TDNet-50 [9] |
| 2D Semantic Segmentation | CamVid | Mean IoU | 76.2 | TDNet-50 |