Khurram Azeem Hashmi, Alain Pagani, Didier Stricker, Muhammad Zeshan Afzal
We present a new, simple yet effective approach to improve video object detection. We observe that prior works operate on instance-level feature aggregation, which inherently neglects the refined pixel-level representation and results in confusion among objects that share similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which learns discriminative representations by incorporating class-aware pixel-level information. We simply treat bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be integrated into any region-based detector to boost detection. Extensive experiments on the ImageNet VID and EPIC-KITCHENS datasets demonstrate consistent and significant improvements when we plug our BoxMask module into numerous recent state-of-the-art methods.
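The idea of treating box annotations as a coarse mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boxes_to_coarse_mask`, the `(x1, y1, x2, y2, class_id)` box layout with exclusive right and bottom edges, the background label `0`, and the rule that smaller boxes overwrite larger ones where boxes overlap are all assumptions made for illustration. The sketch only shows how a class-aware pixel-level target might be derived from box labels alone.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, class_id) in pixel coordinates,
# with x2 and y2 treated as exclusive bounds.
Box = Tuple[int, int, int, int, int]


def boxes_to_coarse_mask(
    boxes: Sequence[Box],
    height: int,
    width: int,
    background: int = 0,
) -> List[List[int]]:
    """Rasterize box annotations into a class-aware label map (a coarse mask)."""
    mask = [[background] * width for _ in range(height)]

    # Assumed overlap policy: paint large boxes first so that smaller boxes
    # win where boxes overlap.
    ordered = sorted(
        boxes,
        key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
        reverse=True,
    )

    for x1, y1, x2, y2, class_id in ordered:
        # Clip the box to the image so out-of-range annotations are safe.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        for y in range(y1, y2):
            for x in range(x1, x2):
                mask[y][x] = class_id

    return mask


if __name__ == "__main__":
    demo = boxes_to_coarse_mask([(0, 0, 4, 4, 1), (1, 1, 3, 3, 2)], height=4, width=5)
    for row in demo:
        print(row)
```

In a real detector, a label map of this kind would typically be resized to the feature-map resolution and used as the target of a per-pixel classification loss; that wiring depends on the detector and is left out here.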
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | ImageNet VID | mAP | 84.8 | BoxMask (ResNeXt-101) |
| Object Detection | ImageNet VID | mAP | 80.7 | BoxMask (ResNet-50) |