Video Question Answering on NExT-QA (Efficient)

Metric: 1:1 Accuracy (higher is better)

#  Model                1:1 Accuracy  Extra Data  Paper                                                Date        Code
1  ViLA (3B, 4 frames)  74.4          No          ViLA: Efficient Video-Language Alignment for Vid...  2023-12-13  Code
2  SeViLA (4 frames)    73.8          No          Self-Chained Image-Language Model for Video Loca...  2023-05-11  Code