Christopher Choy, JunYoung Gwak, Silvio Savarese
In many robotics and VR/AR applications, 3D-videos (continuous sequences of depth images or LIDAR scans) are readily available sources of input. However, these 3D-videos are typically processed frame-by-frame, either through 2D convnets or 3D perception algorithms. In this work, we propose 4-dimensional convolutional neural networks for spatio-temporal perception that can directly process such 3D-videos using high-dimensional convolutions. For this, we adopt sparse tensors and propose the generalized sparse convolution, which encompasses all discrete convolutions. To implement the generalized sparse convolution, we create an open-source auto-differentiation library for sparse tensors that provides extensive functions for high-dimensional convolutional neural networks. We create 4D spatio-temporal convolutional neural networks using the library and validate them on various 3D semantic segmentation benchmarks and on proposed 4D datasets for 3D-video perception. To overcome challenges in the 4D space, we propose the hybrid kernel, a special case of the generalized sparse convolution, and the trilateral-stationary conditional random field, which enforces spatio-temporal consistency in the 7D space-time-chroma space. Experimentally, we show that convolutional neural networks with only generalized 3D sparse convolutions can outperform 2D or 2D-3D hybrid methods by a large margin. We also show that, on 3D-videos, 4D spatio-temporal convolutional neural networks are robust to noise, outperform 3D convolutional neural networks, and in some cases are faster than their 3D counterparts.
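To make the idea of a convolution over sparse coordinates concrete, here is a minimal NumPy sketch. It is not the authors' library or its API: the function names `sparse_conv` and `hybrid_offsets` are hypothetical, output coordinates are assumed to equal the input coordinates, and the "hybrid" shape shown (a 3x3x3 cube in space plus a cross along time) is an assumption about what such a kernel could look like, not a detail stated in the abstract. The kernel is simply an arbitrary set of integer offsets, each with its own weight matrix, and only occupied coordinates are visited.

```python
import itertools
import numpy as np


def hybrid_offsets():
    """Hypothetical 4D kernel shape over (x, y, z, t) coordinates.

    Assumption for illustration: a 3x3x3 cube in space at t = 0,
    plus the two temporal neighbours of the centre.
    """
    spatial = [(dx, dy, dz, 0)
               for dx, dy, dz in itertools.product((-1, 0, 1), repeat=3)]
    temporal = [(0, 0, 0, -1), (0, 0, 0, 1)]
    return spatial + temporal


def sparse_conv(coords, feats, weights, offsets):
    """Convolve features stored only at occupied coordinates.

    coords:  (N, D) integer array of unique occupied coordinates
    feats:   (N, C_in) float array, one feature row per coordinate
    weights: dict mapping each offset tuple to a (C_in, C_out) matrix
    offsets: list of D-dimensional integer offset tuples (the kernel shape)
    Returns an (N, C_out) array defined at the same coordinates.
    """
    # Hash map from coordinate to row index, so neighbour lookup is O(1).
    index = {tuple(c): i for i, c in enumerate(coords.tolist())}
    c_out = next(iter(weights.values())).shape[1]
    out = np.zeros((len(coords), c_out), dtype=feats.dtype)
    for offset in offsets:
        w = weights[offset]
        for u, i in index.items():
            neighbour = tuple(a + b for a, b in zip(u, offset))
            j = index.get(neighbour)
            if j is not None:  # empty space contributes nothing
                out[i] += feats[j] @ w
    return out
```

Because the kernel is just a list of offsets, a dense cubic kernel, a cross-shaped kernel, or a mixed shape all go through the same code path; only the `offsets` list and the matching `weights` entries change.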
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ScanNet | test mIoU | 73.4 | MinkowskiNet |
| Semantic Segmentation | ScanNet | val mIoU | 72.2 | MinkowskiNet |
| Semantic Segmentation | S3DIS Area5 | mAcc | 71.7 | MinkowskiNet |
| Semantic Segmentation | S3DIS Area5 | mIoU | 65.4 | MinkowskiNet |
| Semantic Segmentation | S3DIS | Mean IoU | 65.4 | MinkowskiNet |
| Semantic Segmentation | S3DIS | Params (M) | 37.9 | MinkowskiNet |
| Semantic Segmentation | ScanNet200 | test mIoU | 25.3 | MinkUNet |
| Semantic Segmentation | ScanNet200 | val mIoU | 25 | MinkUNet |
| Semantic Segmentation | STPLS3D | mIoU | 51.3 | MinkowskiNet |
| Semantic Segmentation | WildScenes | mIoU | 36.53 | MinkUNet |
| Semantic Segmentation | WildScenes | mIoU (Env DA) | 30.78 | MinkUNet |
| Semantic Segmentation | WildScenes | mIoU (Temporal DA) | 27.2 | MinkUNet |
| Semantic Segmentation | ScanNet++ | Top-1 IoU | 0.456 | SpUNet (MinkowskiNet) |
| Semantic Segmentation | ScanNet++ | Top-3 IoU | 0.683 | SpUNet (MinkowskiNet) |
| Semantic Segmentation | ScribbleKITTI | mIoU | 55 | MinkowskiNet |
| 3D Semantic Segmentation | ScanNet200 | test mIoU | 25.3 | MinkUNet |
| 3D Semantic Segmentation | ScanNet200 | val mIoU | 25 | MinkUNet |
| 3D Semantic Segmentation | STPLS3D | mIoU | 51.3 | MinkowskiNet |
| 3D Semantic Segmentation | WildScenes | mIoU | 36.53 | MinkUNet |
| 3D Semantic Segmentation | WildScenes | mIoU (Env DA) | 30.78 | MinkUNet |
| 3D Semantic Segmentation | WildScenes | mIoU (Temporal DA) | 27.2 | MinkUNet |
| 3D Semantic Segmentation | ScanNet++ | Top-1 IoU | 0.456 | SpUNet (MinkowskiNet) |
| 3D Semantic Segmentation | ScanNet++ | Top-3 IoU | 0.683 | SpUNet (MinkowskiNet) |
| 3D Semantic Segmentation | ScribbleKITTI | mIoU | 55 | MinkowskiNet |