Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

What Makes Training Multi-Modal Classification Networks Hard?

Weiyao Wang, Du Tran, Matt Feiszli

Published 2019-05-29 · CVPR 2020

Tasks: Multi-modal Classification, Action Classification, Event Detection, General Classification, Action Recognition, Classification, Action Recognition In Videos, Temporal Action Localization

Abstract

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
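
For readers who want the gist in code, below is a minimal PyTorch-style sketch of the offline Gradient Blending recipe as the abstract describes it: each modality branch (plus the fused branch) keeps its own classifier head, and the per-branch losses are mixed with weights proportional to each branch's generalization gain divided by its squared growth in overfitting, measured between two training checkpoints. All names (gblend_weights, GBlendLoss) and the numbers in the usage snippet are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


def gblend_weights(losses, eps=1e-8):
    """Estimate per-branch Gradient Blending weights (offline variant, sketch).

    `losses` maps a branch name to its train/validation losses measured at
    two checkpoints. Overfitting is taken as the train/val gap
    O = L_val - L_train, generalization as the drop in validation loss,
    and each branch is weighted by delta_G / delta_O**2, then normalized.
    """
    raw = {}
    for name, l in losses.items():
        o_start = l["val_start"] - l["train_start"]
        o_end = l["val_end"] - l["train_end"]
        delta_o = max(o_end - o_start, eps)                 # growth in overfitting
        delta_g = max(l["val_start"] - l["val_end"], 0.0)   # val-loss improvement
        raw[name] = delta_g / delta_o ** 2
    z = sum(raw.values()) or 1.0
    return {name: w / z for name, w in raw.items()}


class GBlendLoss(nn.Module):
    """Weighted sum of per-branch cross-entropy losses.

    Each uni-modal branch and the fused branch keep their own classifier
    head, so each contributes its own supervised loss (and gradient).
    """

    def __init__(self, weights):
        super().__init__()
        self.weights = weights
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits_per_branch, target):
        # logits_per_branch: e.g. {"rgb": Tensor, "audio": Tensor, "fused": Tensor}
        return sum(self.weights[k] * self.ce(logits, target)
                   for k, logits in logits_per_branch.items())


# Illustrative numbers: the audio branch's validation loss barely improves
# while its train/val gap explodes, so it receives a small weight.
losses = {
    "rgb":   {"train_start": 1.9, "val_start": 2.1, "train_end": 1.2, "val_end": 1.8},
    "audio": {"train_start": 2.4, "val_start": 2.6, "train_end": 1.0, "val_end": 2.5},
    "fused": {"train_start": 1.7, "val_start": 2.0, "train_end": 0.6, "val_end": 1.9},
}
criterion = GBlendLoss(gblend_weights(losses))
```

In the paper's online variant the same idea applies, with the weights re-estimated every few epochs as the branches' overfitting behavior changes.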

Results

Task                         | Dataset      | Metric      | Value | Model
Video                        | Kinetics-400 | Acc@1       | 78.9  | G-Blend (Sports-1M pretrain)
Video                        | Kinetics-400 | Acc@1       | 77.7  | G-Blend
Activity Recognition         | miniSports   | Clip Hit@1  | 49.7  | G-Blend
Activity Recognition         | miniSports   | Video Hit@1 | 62.8  | G-Blend
Activity Recognition         | miniSports   | Video Hit@5 | 85.5  | G-Blend
Activity Recognition         | Sports-1M    | Video Hit@1 | 74.8  | G-Blend
Activity Recognition         | Sports-1M    | Video Hit@5 | 92.4  | G-Blend
Action Recognition           | miniSports   | Clip Hit@1  | 49.7  | G-Blend
Action Recognition           | miniSports   | Video Hit@1 | 62.8  | G-Blend
Action Recognition           | miniSports   | Video Hit@5 | 85.5  | G-Blend
Action Recognition           | Sports-1M    | Video Hit@1 | 74.8  | G-Blend
Action Recognition           | Sports-1M    | Video Hit@5 | 92.4  | G-Blend
Action Recognition In Videos | miniSports   | Clip Hit@1  | 49.7  | G-Blend
Action Recognition In Videos | miniSports   | Video Hit@1 | 62.8  | G-Blend
Action Recognition In Videos | miniSports   | Video Hit@5 | 85.5  | G-Blend
Action Recognition In Videos | Sports-1M    | Video Hit@1 | 74.8  | G-Blend
Action Recognition In Videos | Sports-1M    | Video Hit@5 | 92.4  | G-Blend

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
Fuzzy Classification Aggregation for a Continuum of Agents (2025-07-06)
Hybrid-View Attention for csPCa Classification in TRUS (2025-07-04)