Kyle Min, Jason J. Corso
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: an encoder network that extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and a prediction network that decodes the encoded features spatially while aggregating all the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames. Frame-wise saliency maps can be predicted by applying TASED-Net to a video in a sliding-window fashion. The proposed approach assumes that the saliency map of any frame can be predicted by considering a limited number of past frames. The results of our extensive experiments on video saliency detection validate this assumption and demonstrate that our fully-convolutional model with temporal aggregation is effective. TASED-Net significantly outperforms previous state-of-the-art approaches on all three major large-scale datasets for video saliency detection: DHF1K, Hollywood2, and UCFSports. A qualitative analysis of the results shows that our model is especially good at attending to salient moving objects.
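
The sliding-window inference described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `predict_clip` is a hypothetical stand-in for the network (any callable that maps a clip to a single map), `clip_len` is a free parameter, and padding the start of the video by repeating the first frame is an assumption made here so that every frame has enough past frames.

```python
import numpy as np

def sliding_window_saliency(frames, predict_clip, clip_len):
    """Predict one saliency map per frame from the clip that ends at that frame.

    frames:       array of shape (N, H, W, C)
    predict_clip: callable mapping a (clip_len, H, W, C) clip to an (H, W) map
                  (hypothetical placeholder for the trained network)
    clip_len:     number of consecutive frames per input clip
    """
    n = frames.shape[0]
    # Assumption: repeat the first frame so the earliest frames also have
    # clip_len - 1 predecessors. Other padding strategies are possible.
    pad = np.repeat(frames[:1], clip_len - 1, axis=0)
    padded = np.concatenate([pad, frames], axis=0)

    maps = []
    for t in range(n):
        # The window covers padded[t .. t + clip_len - 1]; its last frame is
        # original frame t, so only past frames contribute.
        clip = padded[t : t + clip_len]
        maps.append(predict_clip(clip))
    return np.stack(maps, axis=0)  # shape (N, H, W)

def mean_intensity_model(clip):
    # Toy placeholder: averages over time and channels to get one (H, W) map.
    return clip.mean(axis=(0, 3))

video = np.random.rand(10, 4, 5, 3)
saliency = sliding_window_saliency(video, mean_intensity_model, clip_len=4)
print(saliency.shape)
```

Each output map depends only on the current frame and the `clip_len - 1` frames before it, which mirrors the assumption that a limited number of past frames suffices.
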
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Saliency Detection | DHF1K | NSS | 2.667 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | AUC-J | 0.852 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | CC | 0.71 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | FPS | 1.85 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | KLDiv | 0.538 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | NSS | 1.96 | TASED-Net |
| Saliency Detection | MSU Video Saliency Prediction | SIM | 0.61 | TASED-Net |
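
For readers unfamiliar with the metrics in the table, two of them, NSS (Normalized Scanpath Saliency) and CC (linear correlation coefficient), can be sketched from their standard definitions as below. This is an illustrative sketch only: the function names, the `eps` stabilizer, and the input conventions are assumptions, and the exact evaluation protocol of each benchmark (resizing, averaging over frames, and so on) may differ.

```python
import numpy as np

def nss(saliency, fixations, eps=1e-8):
    """Mean of the standardized saliency map at human fixation locations.

    saliency:  (H, W) predicted saliency map
    fixations: (H, W) binary map, nonzero where a fixation was recorded
    """
    standardized = (saliency - saliency.mean()) / (saliency.std() + eps)
    return standardized[fixations.astype(bool)].mean()

def cc(saliency, gt_density, eps=1e-8):
    """Pearson correlation between a predicted map and a ground-truth density map."""
    a = saliency - saliency.mean()
    b = gt_density - gt_density.mean()
    return (a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps)

# Toy example: a single predicted peak that coincides with the only fixation.
pred = np.zeros((4, 4))
pred[1, 2] = 1.0
fix = np.zeros((4, 4))
fix[1, 2] = 1
print(nss(pred, fix), cc(pred, pred))
```

Higher is better for both metrics; a prediction identical to the ground-truth density gives a CC of approximately 1.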