Yiling Zhang, Erkut Akdag, Egor Bondarev, Peter H. N. de With
Detecting anomalous events is relevant to public safety and requires combining fine-grained motion information with contextual events at variable time scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art (SotA) methods on the UCF-Crime dataset, achieving an anomaly detection performance of 89.78% AUC. Moreover, it achieves results complementary to the SotA, with 95.32% AUC on the ShanghaiTech dataset and 84.57% AP on the XD-Violence dataset. Furthermore, we build an extension of the UCF-Crime dataset for development and evaluation on a wider range of anomalies, named the Video Anomaly Detection Dataset (VADD). It comprises 2,591 videos in 18 classes with extensive coverage of realistic anomalies.
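The multi-timescale idea can be illustrated by cutting the same video into tubelets of three different temporal lengths. The sketch below is only an illustration and not the authors' implementation: the tubelet lengths (8, 32, and 64 frames), the non-overlapping split, and the function names are assumptions, since the abstract does not specify them. It computes frame index ranges only; feature extraction with a Video Swin Transformer is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical tubelet lengths in frames (assumption: not given in the abstract).
TUBELET_LENGTHS: Dict[str, int] = {"short": 8, "medium": 32, "long": 64}


def split_into_tubelets(num_frames: int, length: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of non-overlapping tubelets, end exclusive.

    A trailing remainder shorter than `length` is kept as a final, shorter tubelet.
    """
    if num_frames < 0 or length <= 0:
        raise ValueError("num_frames must be >= 0 and length must be > 0")
    return [
        (start, min(start + length, num_frames))
        for start in range(0, num_frames, length)
    ]


def multi_timescale_tubelets(num_frames: int) -> Dict[str, List[Tuple[int, int]]]:
    """Split one video into tubelets at every configured timescale."""
    return {
        name: split_into_tubelets(num_frames, length)
        for name, length in TUBELET_LENGTHS.items()
    }


if __name__ == "__main__":
    for name, ranges in multi_timescale_tubelets(100).items():
        print(name, len(ranges), ranges[-1])
```

In this sketch each timescale covers the whole video, so a feature extractor applied per tubelet would see the same content at fine, intermediate, and coarse temporal granularity.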
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Anomaly Detection | VADD | ROC AUC | 88.42 | MTFL (VST, finetuned on VADD) |
| Anomaly Detection | ShanghaiTech Weakly Supervised | AUC-ROC | 95.7 | MTFL (VST, finetuned on VADD) |
| Anomaly Detection | ShanghaiTech Weakly Supervised | AUC-ROC | 95.32 | MTFL (VST) |
| Anomaly Detection | UCF-Crime | ROC AUC | 89.78 | MTFL (VST, finetuned on VADD) |
| Anomaly Detection | UCF-Crime | ROC AUC | 87.16 | MTFL (VST) |
| Anomaly Detection | XD-Violence | AP | 84.57 | MTFL (VST) |
| Anomaly Detection | XD-Violence | AP | 79.4 | MTFL (VST, finetuned on VADD) |
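
The two metrics in the table, ROC AUC and average precision (AP), can be computed from per-frame binary labels and anomaly scores. The following is a minimal pure-Python sketch on toy data, not the evaluation code used for the reported numbers. The function names are hypothetical, and the AP routine assumes scores without ties. Both functions return fractions in [0, 1], while the table reports percentages.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random anomalous frame outscores
    a random normal frame (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP as the mean precision at the rank of each anomalous frame.

    Assumes distinct scores; tied scores are ordered arbitrarily here.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]            # 1 = anomalous frame (toy data)
    toy_scores = [0.1, 0.4, 0.35, 0.8]   # predicted anomaly scores (toy data)
    print(f"ROC AUC: {100 * roc_auc(toy_labels, toy_scores):.2f}%")
    print(f"AP:      {100 * average_precision(toy_labels, toy_scores):.2f}%")
```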