Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

Dohwan Ko, Joonmyung Choi, Hyeong Kyu Choi, Kyoung-Woon On, Byungseok Roh, Hyunwoo J. Kim

2023-03-23 · CVPR 2023
Tasks: Question Answering · Video Retrieval · Sentiment Analysis · Text to Video Retrieval · Video Question Answering · Video Captioning · TGIF-Transition · Retrieval · Visual Question Answering (VQA) · TGIF-Action · TGIF-Frame · Multimodal Sentiment Analysis
Paper · PDF · Code (official)

Abstract

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
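
For readers who want the abstract's two ingredients in code, the sketch below is a minimal, hedged illustration: a small Transformer that non-linearly combines several task losses into one, and a bi-level update in which the combiner is trained on the primary task's post-update loss. It is not the authors' implementation (that is in the linked repository). The class name MetaLossTransformer, all layer sizes, the softplus readout, and the dummy model and losses are assumptions; a single unrolled inner step stands in for the paper's AID-based outer gradient; PyTorch 2.x is assumed for torch.func.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

class MetaLossTransformer(nn.Module):
    """Toy MELTR-style module (illustrative, not the authors' code):
    embeds each of the K task losses as a token, mixes the tokens with a
    small Transformer encoder, and reads out one unified scalar loss."""

    def __init__(self, num_losses: int, d_model: int = 64, nhead: int = 4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                # scalar loss -> token
        self.loss_id = nn.Embedding(num_losses, d_model)  # identifies each loss
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.readout = nn.Linear(d_model, 1)

    def forward(self, losses: torch.Tensor) -> torch.Tensor:
        # losses: shape (K,), one scalar per pretraining/auxiliary objective
        tokens = self.embed(losses.unsqueeze(-1)) + self.loss_id.weight
        mixed = self.encoder(tokens.unsqueeze(0)).squeeze(0)
        return F.softplus(self.readout(mixed)).sum()  # non-linear combination

# One bi-level step. `model` is a stand-in for the video foundation model,
# and the three losses below are dummies for e.g. retrieval/captioning/
# alignment objectives.
model = nn.Linear(8, 8)
meltr = MetaLossTransformer(num_losses=3)
outer_opt = torch.optim.Adam(meltr.parameters(), lr=1e-4)
x = torch.randn(4, 8)

params = {k: v.detach().clone().requires_grad_()
          for k, v in model.named_parameters()}
out = functional_call(model, params, (x,))
losses = torch.stack([out.pow(2).mean(), (out - 1).abs().mean(), out.var()])

# Inner level: one gradient step on the unified loss, kept differentiable.
grads = torch.autograd.grad(meltr(losses), list(params.values()),
                            create_graph=True)
updated = {k: p - 1e-2 * g for (k, p), g in zip(params.items(), grads)}

# Outer level: the primary task's loss after the inner step updates MELTR.
primary = functional_call(model, updated, (x,)).pow(2).mean()
outer_opt.zero_grad()
primary.backward()  # gradients reach MELTR through the inner update
outer_opt.step()
```

In the paper's setting, the inner level fine-tunes the foundation model on the unified loss while AID supplies the outer gradient without fully unrolling the inner optimization; the one-step unroll above is just the cheapest stand-in that shows how gradients reach the loss combiner.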

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 41.3 | All-in-one + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 82.5 | All-in-one + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 73.5 | All-in-one + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 3 | VIOLET + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 35.5 | VIOLET + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 78.4 | VIOLET + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 67.2 | VIOLET + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | UniVL + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 31.1 | UniVL + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 68.3 | UniVL + MELTR
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.7 | UniVL + MELTR
Video Retrieval | YouCook2 | text-to-video Median Rank | 3 | UniVL + MELTR
Video Retrieval | YouCook2 | text-to-video R@1 | 33.7 | UniVL + MELTR
Video Retrieval | YouCook2 | text-to-video R@10 | 74.8 | UniVL + MELTR
Video Retrieval | YouCook2 | text-to-video R@5 | 63.1 | UniVL + MELTR
Video Retrieval | MSR-VTT | text-to-video R@1 | 38.6 | All-in-one + MELTR
Video Retrieval | MSR-VTT | text-to-video R@10 | 84.7 | All-in-one + MELTR
Video Retrieval | MSR-VTT | text-to-video R@5 | 74.4 | All-in-one + MELTR
Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | VIOLET + MELTR
Video Retrieval | MSR-VTT | text-to-video R@1 | 33.6 | VIOLET + MELTR
Video Retrieval | MSR-VTT | text-to-video R@10 | 77.8 | VIOLET + MELTR
Video Retrieval | MSR-VTT | text-to-video R@5 | 63.7 | VIOLET + MELTR
Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | UniVL + MELTR
Video Retrieval | MSR-VTT | text-to-video R@1 | 28.5 | UniVL + MELTR
Video Retrieval | MSR-VTT | text-to-video R@10 | 67.6 | UniVL + MELTR
Video Retrieval | MSR-VTT | text-to-video R@5 | 55.5 | UniVL + MELTR
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.517 | VIOLET + MELTR
Sentiment Analysis | CMU-MOSI | Acc-2 | 85.3 | UniVL + MELTR
Sentiment Analysis | CMU-MOSI | Corr | 0.789 | UniVL + MELTR
Sentiment Analysis | CMU-MOSI | F1 | 85.4 | UniVL + MELTR
Sentiment Analysis | CMU-MOSI | MAE | 0.759 | UniVL + MELTR
Video Captioning | MSR-VTT | BLEU-4 | 44.17 | UniVL + MELTR
Video Captioning | MSR-VTT | CIDEr | 52.77 | UniVL + MELTR
Video Captioning | MSR-VTT | METEOR | 29.26 | UniVL + MELTR
Video Captioning | MSR-VTT | ROUGE-L | 62.35 | UniVL + MELTR
Video Captioning | YouCook2 | BLEU-3 | 24.12 | UniVL + MELTR
Video Captioning | YouCook2 | BLEU-4 | 17.92 | UniVL + MELTR
Video Captioning | YouCook2 | CIDEr | 1.9 | UniVL + MELTR
Video Captioning | YouCook2 | METEOR | 22.56 | UniVL + MELTR
Video Captioning | YouCook2 | ROUGE-L | 47.04 | UniVL + MELTR

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)