Kostas Triaridis, Konstantinos Tsigos, Vasileios Mezaris
Recent image manipulation localization and detection techniques typically leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM or Bayar convolution. In this paper, we show that the different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. We therefore explore ways of combining the outputs of such filters to exploit the complementary nature of the produced artifacts for image manipulation localization and detection (IMLD). We assess two distinct combination methods: one that produces independent features from each forensic filter and then fuses them (late fusion), and one that mixes the outputs of the different modalities early and produces combined features (early fusion). We use the latter as a feature encoding mechanism, accompanied by a new decoding mechanism that encompasses feature re-weighting, to formulate the proposed MMFusion architecture. We demonstrate that MMFusion achieves competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several image and video datasets. We also further investigate the contribution of each forensic filter within MMFusion for addressing different types of manipulations, building on recent AI explainability measures.
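The difference between the two combination strategies can be illustrated with a toy sketch. This is not the MMFusion implementation: the per-pixel linear `encoder`, the random weights, and the arrays standing in for the RGB input and the noise-filter outputs (`srm_like`, `bayar_like`) are all hypothetical placeholders chosen only to show where the mixing happens.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, weights):
    """Toy per-pixel linear encoder with ReLU: (H, W, C_in) -> (H, W, C_out)."""
    return np.maximum(x @ weights, 0.0)

def early_fusion(modalities, weights):
    """Mix the modalities first (channel concatenation), then encode them jointly."""
    stacked = np.concatenate(modalities, axis=-1)
    return encoder(stacked, weights)

def late_fusion(modalities, weights_per_modality):
    """Encode each modality independently, then fuse the resulting features."""
    feats = [encoder(m, w) for m, w in zip(modalities, weights_per_modality)]
    return np.concatenate(feats, axis=-1)

# Hypothetical inputs: random arrays standing in for an RGB image and
# the outputs of two noise-sensitive filters (assumed 3 channels each).
H, W = 8, 8
rgb = rng.random((H, W, 3))
srm_like = rng.random((H, W, 3))
bayar_like = rng.random((H, W, 3))
modalities = [rgb, srm_like, bayar_like]

# Early fusion: one encoder sees all 9 input channels at once.
early = early_fusion(modalities, rng.standard_normal((9, 16)))

# Late fusion: three separate encoders, features concatenated afterwards.
late = late_fusion(modalities, [rng.standard_normal((3, 16)) for _ in modalities])

print(early.shape, late.shape)
```

In the early variant a single set of weights can learn interactions between filters at the input level, whereas in the late variant each filter gets its own features and the interaction only happens once they are combined.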
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Manipulation Detection | COVERAGE | AUC | 0.839 | Early Fusion |
| Image Manipulation Detection | COVERAGE | Balanced Accuracy | 0.77 | Early Fusion |
| Image Manipulation Detection | COVERAGE | AUC | 0.792 | Late Fusion |
| Image Manipulation Detection | COVERAGE | Balanced Accuracy | 0.72 | Late Fusion |
| Image Manipulation Detection | Columbia | AUC | 0.996 | Early Fusion |
| Image Manipulation Detection | Columbia | Balanced Accuracy | 0.962 | Early Fusion |
| Image Manipulation Detection | Columbia | AUC | 0.977 | Late Fusion |
| Image Manipulation Detection | Columbia | Balanced Accuracy | 0.822 | Late Fusion |
| Image Manipulation Detection | CocoGlide | AUC | 0.76 | Late Fusion |
| Image Manipulation Detection | CocoGlide | Balanced Accuracy | 0.677 | Late Fusion |
| Image Manipulation Detection | CocoGlide | AUC | 0.755 | Early Fusion |
| Image Manipulation Detection | CocoGlide | Balanced Accuracy | 0.66 | Early Fusion |
| Image Manipulation Detection | DSO-1 | AUC | 0.966 | Early Fusion |
| Image Manipulation Detection | DSO-1 | Balanced Accuracy | 0.935 | Early Fusion |
| Image Manipulation Detection | DSO-1 | AUC | 0.958 | Late Fusion |
| Image Manipulation Detection | DSO-1 | Balanced Accuracy | 0.83 | Late Fusion |
| Image Manipulation Detection | Casia V1+ | AUC | 0.93 | Late Fusion |
| Image Manipulation Detection | Casia V1+ | Balanced Accuracy | 0.86 | Late Fusion |
| Image Manipulation Detection | Casia V1+ | AUC | 0.929 | Early Fusion |
| Image Manipulation Detection | Casia V1+ | Balanced Accuracy | 0.845 | Early Fusion |
| Image Manipulation Localization | Columbia | Average Pixel F1 (fixed threshold) | 0.888 | Early Fusion |
| Image Manipulation Localization | Columbia | Average Pixel F1 (fixed threshold) | 0.864 | Late Fusion |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (fixed threshold) | 0.663 | Early Fusion |
| Image Manipulation Localization | COVERAGE | Average Pixel F1 (fixed threshold) | 0.641 | Late Fusion |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (fixed threshold) | 0.784 | Early Fusion |
| Image Manipulation Localization | Casia V1+ | Average Pixel F1 (fixed threshold) | 0.775 | Late Fusion |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (fixed threshold) | 0.574 | Late Fusion |
| Image Manipulation Localization | CocoGlide | Average Pixel F1 (fixed threshold) | 0.553 | Early Fusion |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (fixed threshold) | 0.899 | Late Fusion |
| Image Manipulation Localization | DSO-1 | Average Pixel F1 (fixed threshold) | 0.869 | Early Fusion |
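For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how balanced accuracy (image-level detection) and pixel F1 at a fixed threshold (localization) are commonly computed. The threshold of 0.5 and the function names are assumptions made for illustration; the table does not state the threshold that was actually used.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tpr = (y_true & y_pred).sum() / max(y_true.sum(), 1)
    tnr = (~y_true & ~y_pred).sum() / max((~y_true).sum(), 1)
    return 0.5 * (tpr + tnr)

def pixel_f1(mask_true, prob_map, threshold=0.5):
    """Pixel-level F1 after binarizing the predicted map at a fixed threshold.

    The default of 0.5 is an assumed value, not taken from the table.
    """
    mask_true = np.asarray(mask_true, dtype=bool)
    mask_pred = np.asarray(prob_map) >= threshold
    tp = (mask_true & mask_pred).sum()
    fp = (~mask_true & mask_pred).sum()
    fn = (mask_true & ~mask_pred).sum()
    denom = 2 * tp + fp + fn
    # Convention assumed here: an empty mask predicted as empty scores 1.0.
    return 2 * tp / denom if denom else 1.0

# Image-level example: 3 manipulated images, 1 authentic image.
ba = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])

# Pixel-level example on a 2x2 image.
f1 = pixel_f1([[1, 1], [0, 0]], [[0.9, 0.2], [0.7, 0.1]])

print(round(ba, 3), round(f1, 3))
```

An "average" pixel F1, as reported in the table, would then be the mean of `pixel_f1` over all images of a dataset.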