Heitor R. Medeiros, David Latortue, Eric Granger, Marco Pedersoli
In real-world scenarios, using multiple modalities such as visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities: multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we investigate a different way to employ RGB and IR modalities, in which a single shared vision encoder observes only one modality or the other. This realistic setting requires a lower memory footprint and is better suited to applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when a single encoder is trained on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. To this end, we introduce a novel training technique that mixes patches (MiPa) from the two modalities, in conjunction with a patch-wise modality-agnostic module, to learn a common representation of both modalities. Our experiments show that MiPa can learn a representation that reaches competitive results on traditional RGB/IR benchmarks while requiring only a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
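The patch-mixing idea can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows one plausible way to assemble a training image whose non-overlapping patches are each drawn from either the RGB or the IR view of the same scene. The function name `mix_patches`, the `ratio` parameter, and the independent per-patch random draw are assumptions made for illustration, and pixels are treated as opaque values, so the two inputs are assumed to be spatially aligned and already brought to the same shape.

```python
import random


def mix_patches(rgb, ir, patch, ratio=0.5, seed=None):
    """Hypothetical sketch: build one image from patches of two aligned images.

    rgb, ir: 2D lists (height x width) of pixel values with identical shape.
    patch:   side length of the square patches; must divide height and width.
    ratio:   assumed probability that a given patch is taken from `rgb`.
    Returns the mixed image and the boolean patch mask (True means RGB).
    """
    height, width = len(rgb), len(rgb[0])
    if len(ir) != height or len(ir[0]) != width:
        raise ValueError("rgb and ir must have the same shape")
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")

    rng = random.Random(seed)
    # One independent draw per patch decides which modality fills it.
    mask = [
        [rng.random() < ratio for _ in range(width // patch)]
        for _ in range(height // patch)
    ]
    # Copy each pixel from the modality selected for its enclosing patch.
    mixed = [
        [(rgb if mask[y // patch][x // patch] else ir)[y][x] for x in range(width)]
        for y in range(height)
    ]
    return mixed, mask


if __name__ == "__main__":
    rgb_img = [["R"] * 4 for _ in range(4)]
    ir_img = [["I"] * 4 for _ in range(4)]
    mixed_img, _ = mix_patches(rgb_img, ir_img, patch=2, seed=0)
    for row in mixed_img:
        print(" ".join(row))
```

With `ratio=1.0` or `ratio=0.0` the sketch degenerates to a pure RGB or pure IR input, which mirrors the single-modality condition described for inference.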
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Autonomous Vehicles | LLVIP | AP | 0.665 | MiPa |
| Object Detection | FLIR | AP 0.5 | 0.813 | MiPa |
| Object Detection | LLVIP | AP | 0.665 | MiPa |
| 3D | FLIR | AP 0.5 | 0.813 | MiPa |
| 3D | LLVIP | AP | 0.665 | MiPa |
| 2D Classification | FLIR | AP 0.5 | 0.813 | MiPa |
| 2D Classification | LLVIP | AP | 0.665 | MiPa |
| Pedestrian Detection | LLVIP | AP | 0.665 | MiPa |
| 2D Object Detection | FLIR | AP 0.5 | 0.813 | MiPa |
| 2D Object Detection | LLVIP | AP | 0.665 | MiPa |
| 16k | FLIR | AP 0.5 | 0.813 | MiPa |
| 16k | LLVIP | AP | 0.665 | MiPa |