SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

2404.04565

Published 4/9/2024 by Tao Wu, Runyu He, Gangshan Wu, Limin Wang

Abstract

Video-based visual relation detection tasks, such as video scene graph generation, play important roles in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First, they do not explore complex human-human interactions in multi-person scenarios. Second, the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this issue, we propose a new video visual relation detection task: video human-human interaction detection, and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.

Overview

This paper introduces SportsHHI, a new dataset for detecting human-human interactions in sports videos.
The dataset covers 34 high-level interaction classes from basketball and volleyball, with 118,075 human bounding boxes and 50,649 interaction instances annotated on 11,398 keyframes.
The authors aim to advance research in human-human interaction detection, which has applications in sports analysis, video surveillance, and other domains.

Plain English Explanation

The researchers who created this dataset are trying to help computers better understand what's happening in sports videos. Specifically, they want computers to be able to detect when players are interacting with each other - things like physical contact, passing the ball, or running together. This is a challenging computer vision problem, but it could be useful for things like analyzing sports games, monitoring security footage, and more.

To help solve this problem, the researchers put together a new dataset called SportsHHI. This dataset contains footage from basketball and volleyball games that has been carefully labeled to show the different types of interactions happening between the players. By giving computers access to this labeled data, the researchers hope to advance the field of human-human interaction detection and enable more sophisticated video analysis applications.

Technical Explanation

The SportsHHI dataset [1] was created to address the challenge of detecting human-human interactions (HHI) in sports videos. Previous datasets [2,3] focused on human-object interactions or general human activity recognition, but lacked the nuanced examples of interpersonal interactions found in sports.

The SportsHHI dataset contains 34 high-level interaction classes drawn from two sports, basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. The authors designed a robust annotation process to ensure high-quality labels.
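To make the annotation structure concrete, here is a minimal sketch of how a single annotated keyframe could be represented, together with the per-keyframe averages implied by the counts quoted in the abstract. The class names, field names, and the idea of storing each interaction as an ordered pair of person IDs are assumptions made for illustration; they are not the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PersonBox:
    person_id: int                          # hypothetical per-keyframe identifier
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Interaction:
    subject_id: int  # first person in the pair (assumed to be ordered)
    object_id: int   # second person in the pair
    label: str       # one of the 34 interaction classes (name is illustrative)


@dataclass
class Keyframe:
    video_id: str
    frame_index: int
    people: List[PersonBox] = field(default_factory=list)
    interactions: List[Interaction] = field(default_factory=list)


# Dataset-level counts as quoted in the abstract.
NUM_KEYFRAMES = 11_398
NUM_BOXES = 118_075
NUM_INTERACTIONS = 50_649


def per_keyframe_averages() -> Tuple[float, float]:
    """Return (average boxes per keyframe, average interactions per keyframe)."""
    return NUM_BOXES / NUM_KEYFRAMES, NUM_INTERACTIONS / NUM_KEYFRAMES


if __name__ == "__main__":
    avg_boxes, avg_interactions = per_keyframe_averages()
    print(f"{avg_boxes:.2f} boxes and {avg_interactions:.2f} interactions per keyframe")
```

The averages work out to roughly 10 people and 4 to 5 interaction instances per keyframe, which illustrates the multi-person setting the abstract emphasizes.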

The dataset is intended to serve as a benchmark for video human-human interaction detection. To benchmark the task, the authors propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. The setting differs from the human-object interaction work in [4,5]: both participants in each relation are people, and the interaction classes carry high-level semantics that call for detailed spatio-temporal context reasoning.
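As a rough illustration of what a two-stage design can look like, the sketch below takes person boxes as given (standing in for a first-stage person detector), enumerates ordered pairs of people, and scores each pair. This is not the authors' baseline: the function names are hypothetical, and `toy_pair_scorer` is a placeholder that uses only the distance between box centers, where a real second stage would classify pairs from learned spatio-temporal features.

```python
import math
from itertools import permutations
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def candidate_pairs(num_people: int) -> List[Tuple[int, int]]:
    """Stage 2 input: every ordered pair of distinct person indices."""
    return list(permutations(range(num_people), 2))


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def toy_pair_scorer(a: Box, b: Box) -> Dict[str, float]:
    """Placeholder scorer (an assumption): closer pairs score higher."""
    score = 1.0 / (1.0 + center_distance(a, b) / 100.0)
    return {"interaction": score, "no_interaction": 1.0 - score}


def detect_interactions(
    boxes: List[Box],
    scorer: Callable[[Box, Box], Dict[str, float]] = toy_pair_scorer,
    threshold: float = 0.5,
) -> List[Tuple[int, int, str, float]]:
    """Score every ordered pair and keep confident non-background labels."""
    results = []
    for i, j in candidate_pairs(len(boxes)):
        scores = scorer(boxes[i], boxes[j])
        label, score = max(scores.items(), key=lambda kv: kv[1])
        if label != "no_interaction" and score >= threshold:
            results.append((i, j, label, score))
    return results


if __name__ == "__main__":
    people = [(0, 0, 10, 10), (10, 0, 20, 10), (1000, 1000, 1010, 1010)]
    for subject, obj, label, score in detect_interactions(people):
        print(subject, obj, label, round(score, 3))
```

With about ten people per keyframe, enumerating ordered pairs yields on the order of 90 candidates, most of which involve no interaction, so the pair classifier has to separate a few positives from many negatives.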

Critical Analysis

The SportsHHI dataset represents an important contribution to the field of computer vision and video analysis. By focusing specifically on human-human interactions in sports, it addresses a gap in existing datasets and provides a valuable testbed for advancing the state-of-the-art.

One potential limitation of the dataset is the relatively narrow scope of sports included. SportsHHI covers only two sports, basketball and volleyball, so expanding it to a wider range of athletic activities could broaden its applicability.

Additionally, the authors note that the baseline models they tested struggled to achieve high performance on the dataset. This suggests that significant advancements in HHI detection algorithms are still needed to fully leverage the potential of this data. Further research could explore novel neural network architectures, multi-modal fusion techniques, or other innovative approaches to tackle this challenging problem.

Conclusion

The SportsHHI dataset represents an important step forward in the field of human-human interaction detection. By providing a large, high-quality dataset focused specifically on sports videos, the authors have created a valuable resource for researchers and developers working on video analysis applications. While more work is needed to develop robust HHI detection algorithms, this dataset lays the groundwork for future progress in this area.

[1] SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
[2] HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment
[3] HOI4ABot: Human-Object Interaction Anticipation for Robots
[4] Disentangled Pre-training for Human-Object Interaction Detection
[5] Template-free Reconstruction of Human-Object Interaction and Procedural

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment

Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang

Humans naturally interact with both others and the surrounding multiple objects, engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research about multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding codes, and pre-trained models will be disseminated to the community for future research.

4/3/2024

cs.CV

Geometric Features Enhanced Human-Object Interaction Detection

Manli Zhu, Edmond S. L. Ho, Shuang Chen, Longzhi Yang, Hubert P. H. Shum

Cameras are essential vision instruments to capture images for pattern detection and measurement. Human-object interaction (HOI) detection is one of the most popular pattern detection approaches for captured human-centric visual scenes. Recently, Transformer-based models have become the dominant approach for HOI detection due to their advanced network architectures and thus promising results. However, most of them follow the one-stage design of vanilla Transformer, leaving rich geometric priors under-exploited and leading to compromised performance especially when occlusion occurs. Given that geometric features tend to outperform visual ones in occluded scenarios and offer information that complements visual cues, we propose a novel end-to-end Transformer-style HOI detection model, i.e., geometric features enhanced HOI detector (GeoHOI). One key part of the model is a new unified self-supervised keypoint learning method named UniPointNet that bridges the gap of consistent keypoint representation across diverse object categories, including humans. GeoHOI effectively upgrades a Transformer-based HOI detector benefiting from the keypoints similarities measuring the likelihood of human-object interactions as well as local keypoint patches to enhance interaction query representation, so as to boost HOI predictions. Extensive experiments show that the proposed method outperforms the state-of-the-art models on V-COCO and achieves competitive performance on HICO-DET. Case study results on the post-disaster rescue with vision-based instruments showcase the applicability of the proposed GeoHOI in real-world applications.

6/28/2024

cs.CV

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang

Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions. Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .

4/22/2024

cs.CV

Exploring Interactive Semantic Alignment for Efficient HOI Detection with Vision-language Model

Jihao Dong, Renjie Pan, Hua Yang

Human-Object Interaction (HOI) detection aims to localize human-object pairs and comprehend their interactions. Recently, two-stage transformer-based methods have demonstrated competitive performance. However, these methods frequently focus on object appearance features and ignore global contextual information. In addition, the vision-language model CLIP, which effectively aligns visual and text embeddings, has shown great potential in zero-shot HOI detection. Based on these observations, we introduce a novel HOI detector named ISA-HOI, which extensively leverages knowledge from CLIP, aligning interactive semantics between visual and textual features. We first extract the global context of the image and local features of objects to Improve interaction Features in images (IF). We also propose a Verb Semantic Improvement (VSI) module to enhance textual features of verb labels via cross-modal fusion. Ultimately, our method achieves competitive results on the HICO-DET and V-COCO benchmarks with far fewer training epochs, and outperforms the state-of-the-art under zero-shot settings.

5/27/2024

cs.CV