ViLCo-Bench
Modalities: Images, Texts, Videos · License: MIT · Introduced: 2024-06-19
We propose the first standardized benchmark for multimodal continual learning on video data, defining training protocols and evaluation metrics. This standardized framework allows researchers to compare models on equal footing, driving advances in AI systems that can continuously learn from diverse data sources.
We define the continual learning setup for three recent multimodal tasks: Moment Query (MQ), Natural Language Query (NLQ), and Visual Query (VQ). We also provide systematic insights into the challenges, gaps, and limitations of each video-text continual learning task.
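To illustrate what evaluating models across a task sequence like MQ → NLQ → VQ involves, here is a minimal sketch of two standard continual-learning metrics, average score and forgetting, computed from a matrix of per-task results. The function names and the exact metric definitions are assumptions for illustration, not necessarily ViLCo-Bench's own protocol.

```python
# Hypothetical sketch of generic continual-learning metrics, not
# necessarily ViLCo-Bench's exact evaluation protocol.
# R[i][j] = evaluation score on task j after finishing training on task i.

def average_score(R):
    """Mean score over all tasks after training on the final task."""
    final = R[-1]
    return sum(final) / len(final)

def forgetting(R):
    """Mean drop from each task's best earlier score to its final score."""
    T = len(R)
    drops = []
    for j in range(T - 1):  # the last task cannot have been forgotten yet
        best = max(R[i][j] for i in range(T - 1))
        drops.append(best - R[-1][j])
    return sum(drops) / len(drops)

# Example: scores after each of three sequential tasks (e.g. MQ, NLQ, VQ).
R = [
    [0.60, 0.00, 0.00],  # after training on task 1
    [0.55, 0.50, 0.00],  # after training on task 2
    [0.50, 0.45, 0.40],  # after training on task 3
]
```

With this example matrix, `average_score(R)` is 0.45 and `forgetting(R)` is 0.075, capturing both final performance and how much earlier tasks degraded.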