Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Elysium: Exploring Object-level Perception in Videos via MLLM

Han Wang, Yanjie Wang, YongJie Ye, Yuxiang Nie, Can Huang

2024-03-25

Tasks: Zero-Shot Video Question Answer · Visual Object Tracking · Referring Expression · Referring Expression Generation · Referring Expression Comprehension · Video Question Answering · Zero-Shot Single Object Tracking · Object Tracking

Paper · PDF · Code (official)

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application to video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. First, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Second, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset supporting three tasks: Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we train MLLMs and propose a token-compression model, T-Selector, to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that attempts to conduct object-level tasks in videos without requiring any additional plug-in or expert models. All code and datasets are available at https://github.com/Hon-Wong/Elysium.
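The paper's T-Selector is a learned token-compression module whose exact architecture is described in the paper itself. The toy sketch below only illustrates the bottleneck it addresses: each frame produces many visual tokens, so feeding raw tokens for dozens of frames quickly exhausts an LLM's context window, and compressing the per-frame tokens makes multi-frame input tractable. Mean pooling here is a hypothetical stand-in for the learned compressor, and the token counts (256 per frame, compressed to 16) are illustrative assumptions, not the paper's numbers.

```python
# Toy illustration of per-frame visual token compression (NOT the
# actual T-Selector): mean-pool contiguous groups of token vectors
# so each frame contributes far fewer tokens to the LLM context.

def compress_frame_tokens(tokens, k):
    """Reduce a frame's list of token vectors to k vectors by
    mean-pooling contiguous groups (stand-in for a learned module)."""
    n = len(tokens)
    group = max(1, n // k)  # tokens pooled into each output vector
    compressed = []
    for i in range(0, n, group):
        chunk = tokens[i:i + group]
        dim = len(chunk[0])
        compressed.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return compressed[:k]

# 8 frames x 256 tokens each = 2048 visual tokens for the LLM;
# compressing each frame to 16 tokens leaves 128 (a 16x reduction).
frames = [[[float(t)] * 4 for t in range(256)] for _ in range(8)]
compressed = [compress_frame_tokens(f, 16) for f in frames]
total_before = sum(len(f) for f in frames)
total_after = sum(len(f) for f in compressed)
print(total_before, total_after)  # 2048 128
```

The design point is the same one the abstract makes: the compression ratio, not the pooling scheme, is what lets many frames fit in a fixed context window; a learned selector can additionally decide *which* visual information to keep per frame.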

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Question Answering | MSVD-QA | Accuracy | 75.8 | Elysium |
| Video Question Answering | MSVD-QA | Confidence Score | 3.7 | Elysium |
| Video Question Answering | TGIF-QA | Accuracy | 66.6 | Elysium |
| Video Question Answering | TGIF-QA | Confidence Score | 3.6 | Elysium |
| Video Question Answering | MSRVTT-QA | Accuracy | 67.5 | Elysium |
| Video Question Answering | MSRVTT-QA | Confidence Score | 3.2 | Elysium |
| Video Question Answering | ActivityNet-QA | Accuracy | 43.4 | Elysium |
| Video Question Answering | ActivityNet-QA | Confidence Score | 2.9 | Elysium |
| Visual Object Tracking | LaSOT | AUC | 56.1 | Elysium |
| Visual Object Tracking | LaSOT | Normalized Precision | 61.0 | Elysium |
| Visual Object Tracking | LaSOT | Precision | 50.1 | Elysium |

Related Papers

- MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
- YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association (2025-07-16)
- HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
- Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking (2025-07-07)
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions (2025-07-01)
- Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
- Visual and Memory Dual Adapter for Multi-Modal Object Tracking (2025-06-30)
- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)