


SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

Tao Wu, Runyu He, Gangshan Wu, Limin Wang

2024-04-06 · CVPR 2024

Tasks: Scene Graph Generation · Video Visual Relation Detection · Video Understanding · Video Scene Graph Generation · Graph Generation

Paper · PDF · Code (official)

Abstract

Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder progress in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types in existing datasets have relatively low-level semantics and can often be recognized from appearance or simple prior information alone, without detailed spatio-temporal context reasoning. Yet comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address these limitations, we propose a new video visual relation detection task, video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
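As the abstract describes it, the task annotates ordered (subject, object) pairs of people on a keyframe with one of 34 interaction classes, and the proposed baseline works in two stages (detect humans, then classify interactions between pairs). Below is a minimal sketch of what one annotation record and the pair-enumeration step of such a two-stage pipeline might look like; the field names, keyframe ID format, and helper function are illustrative assumptions, not the dataset's actual schema or the authors' code.

from dataclasses import dataclass
from itertools import permutations
from typing import Tuple

# Hypothetical annotation record: the real SportsHHI schema is not shown on
# this page, so these field names are illustrative assumptions.
@dataclass
class InteractionInstance:
    keyframe_id: str                        # keyframe the instance is annotated on
    subject_box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) of the acting person
    object_box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) of the other person
    interaction_class: int                  # index into the 34 interaction classes

def candidate_pairs(num_people: int):
    """Ordered (subject, object) index pairs among detected people.

    A two-stage detector of the kind the abstract outlines would first detect
    humans, then score each ordered pair against the interaction classes;
    this helper only enumerates the candidates for the second stage.
    """
    return list(permutations(range(num_people), 2))

# Example: one annotated instance and the candidate pairs for 3 detections.
inst = InteractionInstance("game01_f00042", (10, 20, 50, 120), (60, 25, 100, 130), 7)
print(inst.interaction_class)   # 7
print(candidate_pairs(3))       # [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]

Note that the pairs are ordered, since interactions are directional (who passes versus who receives); with N detected people the second stage must score N(N-1) candidates, which is why multi-person sports scenes make the task demanding.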

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation (2025-07-17)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation (2025-07-10)
SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning (2025-07-08)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)