Dohwan Ko, Joonmyung Choi, Hyeong Kyu Choi, Kyoung-Woon On, Byungseok Roh, Hyunwoo J. Kim
Foundation models have shown outstanding performance and generalization capabilities across domains. Because most studies on foundation models focus on the pretraining phase, fine-tuning typically adopts a naive strategy: minimizing a single task-specific loss. Such fine-tuning does not fully leverage other losses that are potentially beneficial for the target task. We therefore propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning of the target task via auxiliary learning. We formulate auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to several video foundation models (UniVL, VIOLET, and All-in-one) and show significant performance gains on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
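To make the bi-level formulation concrete, the following is a minimal toy sketch in Python, not the authors' implementation. It rests on several assumptions that are not in the abstract: the model is a single scalar `w`, each auxiliary loss is a quadratic `(w - t_i)^2`, the Transformer-based combiner is replaced by a plain softmax-weighted sum with meta-parameters `phi`, and the inverse Hessian in the implicit gradient is approximated with a truncated Neumann series (one common way to realize approximate implicit differentiation). All function names are hypothetical.

```python
import math

def softmax(phi):
    m = max(phi)
    exps = [math.exp(p - m) for p in phi]
    z = sum(exps)
    return [e / z for e in exps]

def grad_w(w, phi, targets):
    # d/dw of the combined training loss sum_i a_i(phi) * (w - t_i)^2
    a = softmax(phi)
    return sum(2.0 * ai * (w - t) for ai, t in zip(a, targets))

def hess_ww(phi):
    # d^2/dw^2 of the combined loss; equals 2 because the softmax weights sum to 1
    return 2.0 * sum(softmax(phi))

def mixed_w_phi(w, phi, targets):
    # d^2 / (dw dphi_j) of the combined loss, one entry per meta-parameter
    a = softmax(phi)
    g = [2.0 * (w - t) for t in targets]
    avg = sum(ai * gi for ai, gi in zip(a, g))
    return [ai * (gi - avg) for ai, gi in zip(a, g)]

def inner_solve(phi, targets, steps=50, lr=0.25):
    # Lower level: fit w to the combined loss by gradient descent
    w = 0.0
    for _ in range(steps):
        w -= lr * grad_w(w, phi, targets)
    return w

def neumann_inverse(h, alpha=0.25, terms=50):
    # 1/h ~= alpha * sum_k (1 - alpha*h)^k, valid when |1 - alpha*h| < 1
    total, power = 0.0, 1.0
    for _ in range(terms):
        total += power
        power *= 1.0 - alpha * h
    return alpha * total

def hypergradient(w, phi, targets, primary_target):
    # Implicit function theorem: dL_val/dphi = -dL_val/dw * H^{-1} * d2L_train/(dw dphi)
    dval_dw = 2.0 * (w - primary_target)
    h_inv = neumann_inverse(hess_ww(phi))
    return [-dval_dw * h_inv * m for m in mixed_w_phi(w, phi, targets)]

def meta_train(targets, primary_target, outer_steps=200, meta_lr=0.5):
    # Upper level: adjust phi so the fitted w minimizes the primary loss (w - primary_target)^2
    phi = [0.0 for _ in targets]
    for _ in range(outer_steps):
        w = inner_solve(phi, targets)
        g = hypergradient(w, phi, targets, primary_target)
        phi = [p - meta_lr * gi for p, gi in zip(phi, g)]
    return phi, inner_solve(phi, targets)

phi, w = meta_train(targets=[0.0, 1.0, 3.0], primary_target=1.0)
print(softmax(phi), w)
```

In this toy setting the learned weights shift away from the auxiliary loss whose target conflicts most with the primary objective, which is the intuition behind learning how to combine losses rather than fixing the combination by hand.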
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.517 | VIOLET + MELTR |
| Sentiment Analysis | CMU-MOSI | Acc-2 | 85.3 | UniVL + MELTR |
| Sentiment Analysis | CMU-MOSI | Corr | 0.789 | UniVL + MELTR |
| Sentiment Analysis | CMU-MOSI | F1 | 85.4 | UniVL + MELTR |
| Sentiment Analysis | CMU-MOSI | MAE | 0.759 | UniVL + MELTR |
| Video Captioning | MSR-VTT | BLEU-4 | 44.17 | UniVL + MELTR |
| Video Captioning | MSR-VTT | CIDEr | 52.77 | UniVL + MELTR |
| Video Captioning | MSR-VTT | METEOR | 29.26 | UniVL + MELTR |
| Video Captioning | MSR-VTT | ROUGE-L | 62.35 | UniVL + MELTR |
| Video Captioning | YouCook2 | BLEU-3 | 24.12 | UniVL + MELTR |
| Video Captioning | YouCook2 | BLEU-4 | 17.92 | UniVL + MELTR |
| Video Captioning | YouCook2 | CIDEr | 1.9 | UniVL + MELTR |
| Video Captioning | YouCook2 | METEOR | 22.56 | UniVL + MELTR |
| Video Captioning | YouCook2 | ROUGE-L | 47.04 | UniVL + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 41.3 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 82.5 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 73.5 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 3 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 35.5 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 78.4 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 67.2 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 4 | UniVL + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 31.1 | UniVL + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 68.3 | UniVL + MELTR |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.7 | UniVL + MELTR |
| Video Retrieval | YouCook2 | text-to-video Median Rank | 3 | UniVL + MELTR |
| Video Retrieval | YouCook2 | text-to-video R@1 | 33.7 | UniVL + MELTR |
| Video Retrieval | YouCook2 | text-to-video R@10 | 74.8 | UniVL + MELTR |
| Video Retrieval | YouCook2 | text-to-video R@5 | 63.1 | UniVL + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 38.6 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 84.7 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 74.4 | All-in-one + MELTR |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 3 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 33.6 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 77.8 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 63.7 | VIOLET + MELTR |
| Video Retrieval | MSR-VTT | text-to-video Median Rank | 4 | UniVL + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 28.5 | UniVL + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 67.6 | UniVL + MELTR |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 55.5 | UniVL + MELTR |
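For readers unfamiliar with the retrieval metrics in the table, the sketch below shows one common way to compute R@K and Median Rank from a text-to-video similarity matrix. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the correct video for query `i` is assumed to be video `i`, and ties are resolved in favor of the correct video.

```python
from statistics import median

def retrieval_ranks(similarity):
    # similarity[i][j]: score between text query i and video j.
    # Assumption: the ground-truth video for query i is video i.
    ranks = []
    for i, row in enumerate(similarity):
        # Rank = 1 + number of videos scored strictly higher than the correct one
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks

def recall_at_k(ranks, k):
    # Percentage of queries whose correct video appears in the top k
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank(ranks):
    # Median position of the correct video; lower is better
    return median(ranks)

sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
ranks = retrieval_ranks(sim)
print(ranks, recall_at_k(ranks, 1), median_rank(ranks))
```

Higher R@K is better, while lower Median Rank is better, so the two kinds of rows in the table should be read in opposite directions.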