Chao Yang, Huizhou Li, Fangting Lin, Bin Jiang, Hao Zhao
Deep learning-based models have recently shown remarkable performance in image manipulation detection. However, most of them rely on handcrafted or predetermined features that generalize poorly. They also focus only on manipulation localization and overlook manipulation classification. To address these issues, we propose a coarse-to-fine architecture named Constrained R-CNN for complete and accurate image forensics. First, the learnable manipulation feature extractor learns a unified feature representation directly from data. Second, the attention region proposal network effectively discriminates manipulated regions for the subsequent manipulation classification and coarse localization. Then, the skip structure fuses low-level and high-level information to refine the global manipulation features. Finally, the coarse localization information guides the model to learn finer local features and segment out the tampered region. Experimental results show that our model achieves state-of-the-art performance. In particular, the F1 score increases by 28.4%, 73.2%, and 13.3% on the NIST16, COVERAGE, and Columbia datasets, respectively.
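
The abstract does not spell out what makes the feature extractor "constrained". One plausible reading, stated here as an assumption, is a constrained convolution in which each first-layer kernel is projected after every update so that its centre weight is -1 and its remaining weights sum to 1, which turns the filter into a learnable prediction-error (high-pass) filter. The function name `project_constrained_kernel` is hypothetical and the sketch is not the authors' implementation.

```python
def project_constrained_kernel(kernel):
    """Project a square, odd-sized kernel onto an assumed constraint set.

    Assumed constraint (not stated in the abstract):
      - the centre weight equals -1
      - all other weights sum to 1
    `kernel` is a list of rows of floats; a new kernel is returned.
    """
    size = len(kernel)
    if size % 2 == 0 or any(len(row) != size for row in kernel):
        raise ValueError("kernel must be square with an odd side length")
    centre = size // 2

    # Sum of every weight except the centre one.
    off_centre_sum = sum(
        kernel[i][j]
        for i in range(size)
        for j in range(size)
        if (i, j) != (centre, centre)
    )
    if off_centre_sum == 0:
        raise ValueError("off-centre weights sum to zero; cannot normalise")

    # Rescale the off-centre weights to sum to 1, then fix the centre at -1.
    projected = [[w / off_centre_sum for w in row] for row in kernel]
    projected[centre][centre] = -1.0
    return projected


example = project_constrained_kernel([[1.0, 1.0, 1.0],
                                      [1.0, 5.0, 1.0],
                                      [1.0, 1.0, 1.0]])
```

Under this assumption the projected weights sum to zero, so the filter suppresses smooth image content and responds mainly to local residuals, which is where manipulation traces tend to show up.
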
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Manipulation Detection | COVERAGE | AUC | 0.553 | CR-CNN |
| Image Manipulation Detection | COVERAGE | Balanced Accuracy | 0.391 | CR-CNN |
| Image Manipulation Detection | Columbia | AUC | 0.755 | CR-CNN |
| Image Manipulation Detection | Columbia | Balanced Accuracy | 0.631 | CR-CNN |
| Image Manipulation Detection | CocoGlide | AUC | 0.589 | CR-CNN |
| Image Manipulation Detection | CocoGlide | Balanced Accuracy | 0.447 | CR-CNN |
| Image Manipulation Detection | DSO-1 | AUC | 0.576 | CR-CNN |
| Image Manipulation Detection | DSO-1 | Balanced Accuracy | 0.289 | CR-CNN |
| Image Manipulation Detection | Casia V1+ | AUC | 0.67 | CR-CNN |
| Image Manipulation Detection | Casia V1+ | Balanced Accuracy | 0.481 | CR-CNN |
| Image Manipulation Localization | Columbia | Average Pixel F1 (Fixed threshold) | 0.631 | CR-CNN |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (Fixed threshold) | 0.391 | CR-CNN |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (Fixed threshold) | 0.481 | CR-CNN |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (Fixed threshold) | 0.447 | CR-CNN |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (Fixed threshold) | 0.289 | CR-CNN |
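
The three metrics in the table can be computed as in the minimal sketch below. The page does not state the evaluation protocol, so the threshold of 0.5, the convention of scoring an empty prediction on an empty mask as 0, and all function names are assumptions for illustration rather than the benchmark's actual code.

```python
def pixel_f1(score_map, mask, threshold=0.5):
    """Pixel-level F1 for one image; threshold 0.5 is an assumed default."""
    tp = fp = fn = 0
    for score_row, mask_row in zip(score_map, mask):
        for score, truth in zip(score_row, mask_row):
            predicted = score >= threshold
            if predicted and truth:
                tp += 1
            elif predicted:
                fp += 1
            elif truth:
                fn += 1
    denom = 2 * tp + fp + fn
    # Assumed convention: no positives anywhere counts as F1 = 0.
    return 2 * tp / denom if denom else 0.0


def average_pixel_f1(score_maps, masks, threshold=0.5):
    """Mean of per-image pixel F1 at one fixed threshold."""
    values = [pixel_f1(s, m, threshold) for s, m in zip(score_maps, masks)]
    return sum(values) / len(values)


def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: one 2x2 score map with its mask, and four image-level labels.
f1 = pixel_f1([[0.9, 0.2], [0.6, 0.1]], [[1, 0], [0, 1]])
bacc = balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
roc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A fixed threshold is the stricter setting for localization: tuning the threshold per image or per dataset would typically give higher F1 values than one threshold shared across all images.
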