Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

InternVideo2-6B

Reported on 107 benchmarks across 13 tasks · 1 paper · 74 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Computer Vision · 87 results

  • Video on VATEX
    video-to-text R@1 · uses extra data · 2024-03-22
    89.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on ActivityNet
    text-to-video R@1 · uses extra data · 2024-03-22
    74.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on ActivityNet
    video-to-text R@1 · uses extra data · 2024-03-22
    69.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on DiDeMo
    text-to-video R@1 · uses extra data · 2024-03-22
    74.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on DiDeMo
    video-to-text R@1 · uses extra data · 2024-03-22
    71.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on LSMDC
    text-to-video R@1 · uses extra data · 2024-03-22
    46.4
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on LSMDC
    video-to-text R@1 · uses extra data · 2024-03-22
    46.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on MSVD
    text-to-video R@1 · uses extra data · 2024-03-22
    61.4
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on MSVD
    video-to-text R@1 · uses extra data · 2024-03-22
    85.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on QVHighlights
    R@1,IoU=0.5 · 2024-03-22
    71.42
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on QVHighlights
    R@1,IoU=0.7 · 2024-03-22
    56.45
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on Kinetics-700
    Top-1 Accuracy · uses extra data · 2024-03-22
    85.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on Kinetics-400
    Acc@1 · uses extra data · 2024-03-22
    92.1
    best: 93.6 (OmniVec2)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on Kinetics-600
    Top-1 Accuracy · uses extra data · 2024-03-22
    91.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on MIT
    Top 1 Accuracy · 2024-03-22
    51.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on VATEX
    video-to-text R@1 · uses extra data · 2024-03-22
    89.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on ActivityNet
    text-to-video R@1 · uses extra data · 2024-03-22
    74.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on ActivityNet
    video-to-text R@1 · uses extra data · 2024-03-22
    69.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on DiDeMo
    text-to-video R@1 · uses extra data · 2024-03-22
    74.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on DiDeMo
    video-to-text R@1 · uses extra data · 2024-03-22
    71.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on LSMDC
    text-to-video R@1 · uses extra data · 2024-03-22
    46.4
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on LSMDC
    video-to-text R@1 · uses extra data · 2024-03-22
    46.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on MSVD
    text-to-video R@1 · uses extra data · 2024-03-22
    61.4
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on MSVD
    video-to-text R@1 · uses extra data · 2024-03-22
    85.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on QVHighlights
    R@1,IoU=0.5 · 2024-03-22
    71.42
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on QVHighlights
    R@1,IoU=0.7 · 2024-03-22
    56.45
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Moment Retrieval on Charades-STA
    R@1 IoU=0.5 · 2024-03-22
    70.03
    best: 71.1 (SG-DETR (w/ PT))
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Moment Retrieval on Charades-STA
    R@1 IoU=0.7 · 2024-03-22
    48.95
    best: 52.8 (SG-DETR (w/ PT))
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Moment Retrieval on QVHighlights
    R@1 IoU=0.5 · uses extra data · 2024-03-22
    71.42
    best: 76.59 (LLaVA-MR)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Moment Retrieval on QVHighlights
    R@1 IoU=0.7 · uses extra data · 2024-03-22
    56.45
    best: 61.48 (LLaVA-MR)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Moment Retrieval on QVHighlights
    mAP · uses extra data · 2024-03-22
    49.24
    best: 58.8 (SG-DETR (w/ PT))
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Grounding on QVHighlights
    R@1,IoU=0.5 · 2024-03-22
    71.42
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Grounding on QVHighlights
    R@1,IoU=0.7 · 2024-03-22
    56.45
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@1 · uses extra data · 2024-03-22
    71.5
    best: 83.9 (GRAM)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@10 · uses extra data · 2024-03-22
    97.1
    best: 99.5 (GRAM)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    text-to-video R@5 · uses extra data · 2024-03-22
    94
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@10 · uses extra data · 2024-03-22
    99.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@5 · uses extra data · 2024-03-22
    97.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    text-to-video R@1 · uses extra data · 2024-03-22
    55.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    text-to-video R@10 · uses extra data · 2024-03-22
    85.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    text-to-video R@5 · uses extra data · 2024-03-22
    78.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    video-to-text R@1 · uses extra data · 2024-03-22
    53.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    video-to-text R@10 · uses extra data · 2024-03-22
    84.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSR-VTT
    video-to-text R@5 · uses extra data · 2024-03-22
    77.5
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    text-to-video R@1 · uses extra data · 2024-03-22
    59.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    text-to-video R@10 · uses extra data · 2024-03-22
    89.6
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    text-to-video R@5 · uses extra data · 2024-03-22
    84.4
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    text-to-video R@1 · uses extra data · 2024-03-22
    57.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    text-to-video R@5 · uses extra data · 2024-03-22
    80
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    video-to-text R@1 · uses extra data · 2024-03-22
    57.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    video-to-text R@10 · uses extra data · 2024-03-22
    85
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    video-to-text R@5 · uses extra data · 2024-03-22
    79.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    text-to-video R@1 · uses extra data · 2024-03-22
    33.8
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    text-to-video R@10 · uses extra data · 2024-03-22
    62.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    text-to-video R@5 · uses extra data · 2024-03-22
    55.9
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    video-to-text R@1 · uses extra data · 2024-03-22
    30.1
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    video-to-text R@10 · uses extra data · 2024-03-22
    54.8
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on LSMDC
    video-to-text R@5 · uses extra data · 2024-03-22
    47.7
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@1 · uses extra data · 2024-03-22
    63.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@10 · uses extra data · 2024-03-22
    92.5
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    text-to-video R@5 · uses extra data · 2024-03-22
    85.6
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@1 · uses extra data · 2024-03-22
    56.5
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@10 · uses extra data · 2024-03-22
    90.3
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on ActivityNet
    video-to-text R@5 · uses extra data · 2024-03-22
    82.8
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on HACS
    Average-mAP · 2024-03-22
    43.3
    best: 45.8 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on ActivityNet-1.3
    mAP · 2024-03-22
    41.2
    best: 42.9 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on FineAction
    mAP · 2024-03-22
    27.7
    best: 29.6 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on THUMOS’14
    Avg mAP (0.3:0.7) · uses extra data · 2024-03-22
    72
    best: 76.9 (AdaTAD (VideoMAEv2-giant))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on VATEX
    text-to-video R@1 · uses extra data · 2024-03-22
    75.5
    best: 87.7 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on MSR-VTT
    text-to-video R@1 · uses extra data · 2024-03-22
    62.8
    best: 64 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video on MSR-VTT
    video-to-text R@1 · uses extra data · 2024-03-22
    60.2
    best: 64.8 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Temporal Action Localization on HACS
    Average-mAP · 2024-03-22
    43.3
    best: 45.8 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Temporal Action Localization on ActivityNet-1.3
    mAP · 2024-03-22
    41.2
    best: 42.9 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Temporal Action Localization on FineAction
    mAP · 2024-03-22
    27.7
    best: 29.6 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Temporal Action Localization on THUMOS’14
    Avg mAP (0.3:0.7) · uses extra data · 2024-03-22
    72
    best: 76.9 (AdaTAD (VideoMAEv2-giant))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Localization on HACS
    Average-mAP · 2024-03-22
    43.3
    best: 45.8 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Localization on ActivityNet-1.3
    mAP · 2024-03-22
    41.2
    best: 42.9 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Localization on FineAction
    mAP · 2024-03-22
    27.7
    best: 29.6 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Localization on THUMOS’14
    Avg mAP (0.3:0.7) · uses extra data · 2024-03-22
    72
    best: 76.9 (AdaTAD (VideoMAEv2-giant))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on VATEX
    text-to-video R@1 · uses extra data · 2024-03-22
    75.5
    best: 87.7 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on MSR-VTT
    text-to-video R@1 · uses extra data · 2024-03-22
    62.8
    best: 64 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Video Retrieval on MSR-VTT
    video-to-text R@1 · uses extra data · 2024-03-22
    60.2
    best: 64.8 (GRAM)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on VATEX
    video-to-text R@1 · uses extra data · 2024-03-22
    85.3
    best: 85.4 (InternVideo2-1B)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    video-to-text R@1 · uses extra data · 2024-03-22
    83.1
    best: 83.3 (InternVideo2-1B)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    video-to-text R@10 · uses extra data · 2024-03-22
    97
    best: 97.9 (LanguageBind(ViT-L/14))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on MSVD
    video-to-text R@5 · uses extra data · 2024-03-22
    94.2
    best: 94.3 (InternVideo2-1B)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Video Retrieval on DiDeMo
    text-to-video R@10 · uses extra data · 2024-03-22
    84.6
    best: 85.1 (InternVideo2-1B)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Robots · 6 results

  • Activity Recognition on HACS
    Top 1 Accuracy · uses extra data · 2024-03-22
    97
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Activity Recognition on Something-Something V2
    GFLOPs · uses extra data · 2024-03-22
    13321
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Activity Recognition on Something-Something V2
    Parameters · uses extra data · 2024-03-22
    2131
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Activity Recognition on Something-Something V2
    Top-1 Accuracy · uses extra data · 2024-03-22
    1
    best: 77.3 (MVD (Kinetics400 pretrain, ViT-H, 16 frame))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Activity Recognition on Something-Something V2
    Top-5 Accuracy · uses extra data · 2024-03-22
    12
    best: 96.3 (DejaVid)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Activity Recognition on ActivityNet
    mAP · uses extra data · 2024-03-22
    95.9
    best: 96.9 (Text4Vis (w/ ViT-L))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Time Series · 6 results

  • Action Recognition on HACS
    Top 1 Accuracy · uses extra data · 2024-03-22
    97
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Recognition on Something-Something V2
    GFLOPs · uses extra data · 2024-03-22
    13321
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Recognition on Something-Something V2
    Parameters · uses extra data · 2024-03-22
    2131
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Recognition on Something-Something V2
    Top-1 Accuracy · uses extra data · 2024-03-22
    1
    best: 77.3 (MVD (Kinetics400 pretrain, ViT-H, 16 frame))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Recognition on Something-Something V2
    Top-5 Accuracy · uses extra data · 2024-03-22
    12
    best: 96.3 (DejaVid)
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Action Recognition on ActivityNet
    mAP · uses extra data · 2024-03-22
    95.9
    best: 96.9 (Text4Vis (w/ ViT-L))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Methodology · 4 results

  • Zero-Shot Learning on HACS
    Average-mAP · 2024-03-22
    43.3
    best: 45.8 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Learning on ActivityNet-1.3
    mAP · 2024-03-22
    41.2
    best: 42.9 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Learning on FineAction
    mAP · 2024-03-22
    27.7
    best: 29.6 (RDFA-S6 (InternVideo2-6B))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Zero-Shot Learning on THUMOS’14
    Avg mAP (0.3:0.7) · uses extra data · 2024-03-22
    72
    best: 76.9 (AdaTAD (VideoMAEv2-giant))
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Audio · 2 results

  • Text to Audio Retrieval on AudioCaps
    R@1 · uses extra data · 2024-03-22
    55.2
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377
  • Text to Audio Retrieval on Clotho
    R@1 · uses extra data · 2024-03-22
    27.2
    best: 27.69 (PaSST-RoBERTa & Estimated Audio–Caption Correspondences)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Natural Language Processing · 1 result

  • Question Answering on EgoSchema (fullset)
    Accuracy · 2024-03-22
    60.2
    best: 71.14 (BIMBA-LLaVA-Qwen2-7B)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377

Reasoning · 1 result

  • Video Question Answering on EgoSchema (fullset)
    Accuracy · 2024-03-22
    60.2
    best: 71.14 (BIMBA-LLaVA-Qwen2-7B)
    SOTA
    InternVideo2: Scaling Foundation Models for Multimodal Video Understanding · arXiv:2403.15377