Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Read original: arXiv:2402.08875 - Published 7/17/2024 by Yang Qian, Yinan Sun, Ali Kargarandehkordi, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

Overview

This paper introduces a new video dataset called TikTokActions, which is derived from TikTok videos and designed for human action recognition research.
The dataset contains a diverse set of everyday actions and interactions captured in short, user-generated videos.
The authors propose this dataset as a complement to existing action recognition benchmarks, arguing that it better reflects the real-world diversity and challenges of human behavior.

Plain English Explanation

The researchers have created a new dataset of short videos called TikTokActions that they believe can help improve computer vision systems for recognizing human actions and activities. This dataset is unique because it comes directly from the popular social media platform TikTok, where people share all kinds of everyday videos of themselves doing various things.

Existing datasets for action recognition research often use professional-quality videos that don't fully capture the messy reality of how people move and interact in the real world. In contrast, TikTokActions contains a much wider variety of actions, perspectives, and visual styles, which the authors argue makes it a better test of how well AI models can handle the complexities of human behavior.

By using this real-world dataset, the researchers hope to spur progress in developing computer vision systems that are more robust and practical for real-world applications, like smart home assistants or autonomous vehicles that need to understand and respond appropriately to human actions.

Technical Explanation

The paper describes the process of constructing the TikTokActions dataset from TikTok video submissions. The authors developed a custom data collection pipeline to filter, annotate, and organize a diverse set of short video clips depicting a wide range of everyday human actions and interactions.

The final dataset includes 283,582 unique, unlabeled video clips categorized into 386 hashtags, with each clip averaging around 5 seconds in length. The authors note that this scale and diversity exceed those of many previous human action recognition datasets, which tended to focus on a narrower range of activities in more controlled settings.

To test the value of this data, the researchers pre-trained VideoMAE V2, a model that combines Masked Autoencoders (MAE) with Vision Transformers (ViT), on the unlabeled clips and then fine-tuned it on established action recognition benchmarks. With the ViT-giant backbone, they report 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2. They also find that adding pre-training data helps substantially at first, but the gains diminish as the dataset keeps growing.
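
The paper's abstract describes pre-training VideoMAE V2, which combines Masked Autoencoders with Vision Transformers. The idea behind masked video pre-training is to hide most patches of a clip and train the model to reconstruct them from the few that remain visible. The snippet below is a minimal sketch of that masking step only, not the authors' code: it builds a "tube" mask in the style of the VideoMAE family, where the same spatial patches are hidden in every frame. The function name `tube_mask`, the 8-frame clip, the 14x14 patch grid, and the 0.9 mask ratio are all assumptions chosen for illustration.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=0):
    """Build a tube mask: one random spatial mask repeated across all frames.

    Returns a list of `num_frames` lists, each holding grid_h * grid_w booleans
    where True means "this patch is hidden from the encoder".
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which spatial patch positions to hide, once for the whole clip.
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [i in masked_positions for i in range(num_patches)]

    # Repeat the same spatial mask for every frame so a hidden region stays
    # hidden over time and cannot be copied from a neighbouring frame.
    return [list(frame_mask) for _ in range(num_frames)]


mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9)
visible = sum(not hidden for hidden in mask[0])
print(f"visible patches per frame: {visible} of {14 * 14}")
```

In a full pipeline the encoder would see only the visible patches and a lightweight decoder would predict the hidden ones; no labels are needed, which is why unlabeled clips can serve as pre-training data.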

The authors conclude that the TikTokActions dataset provides a valuable new resource for advancing the field of human action recognition, as it better captures the nuances and challenges of understanding natural human behavior across diverse contexts. They encourage the research community to leverage this dataset to develop more robust and practical computer vision systems.

Critical Analysis

The TikTokActions dataset represents an important step forward in creating more realistic and challenging benchmarks for human action recognition research. By drawing from the wealth of user-generated content on TikTok, the authors have assembled a dataset that reflects the true diversity and messiness of human behavior in the real world.

However, one potential limitation of the dataset is the reliance on TikTok as the sole source. While this platform provides a rich trove of everyday videos, it may also introduce certain biases or skew the distribution of actions and contexts represented. The authors acknowledge this and suggest exploring other social media platforms as additional sources for future iterations of the dataset.

Additionally, the paper does not delve deeply into the ethical considerations of using user-generated content for research purposes. While the data was collected with appropriate consent, there may be concerns around privacy, representation, and the potential misuse of this information that merit further discussion.

Overall, the TikTokActions dataset is a valuable contribution to the field, but researchers should approach it with a critical eye and consider its strengths, limitations, and implications. Continued efforts to develop diverse, realistic, and responsible benchmarks will be crucial for advancing computer vision systems that can truly understand and interact with human behavior in the real world.

Conclusion

The TikTokActions dataset represents a significant advancement in the field of human action recognition research. By leveraging the wealth of user-generated content on TikTok, the authors have created a dataset that better captures the diversity, complexity, and challenges of natural human behavior.

This dataset provides a valuable complement to existing benchmarks, which have tended to focus on more controlled, professional-quality videos. By pre-training on these unlabeled clips and fine-tuning on established benchmarks, the researchers report state-of-the-art results, while also finding that the benefit of adding more pre-training data diminishes as the dataset grows.

Looking ahead, the TikTokActions dataset holds the potential to drive progress in developing more robust and practical computer vision systems. As researchers and developers leverage this resource to tackle the challenges of understanding human actions and interactions, it could lead to significant improvements in applications ranging from smart home assistants to autonomous vehicles.

However, the authors also acknowledge the need to consider the ethical implications of using user-generated content for research purposes. Ongoing efforts to create diverse, responsible, and representative benchmarks will be crucial for ensuring that the advancements in human action recognition benefit society as a whole.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Yang Qian, Yinan Sun, Ali Kargarandehkordi, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

The increasing variety and quantity of tagged multimedia content on a variety of online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Our model, fine-tuned on established action recognition benchmarks such as UCF101 and HMDB51, achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone. These results highlight the potential of using unstructured and unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset size continues to expand. Our findings emphasize two critical axioms in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets and (2) quality is more important than quantity in self-supervised learning, especially when building foundation models.

7/17/2024

HabitAction: A Video Dataset for Human Habitual Behavior Recognition

Hongwu Li, Zhenliang Zhang, Wei Wang

Human Action Recognition (HAR) is a very crucial task in computer vision. It helps to carry out a series of downstream tasks, like understanding human behaviors. Due to the complexity of human behaviors, many highly valuable behaviors are not yet encompassed within the available datasets for HAR, e.g., human habitual behaviors (HHBs). HHBs hold significant importance for analyzing a person's personality, habits, and psychological changes. To solve these problems, in this work, we build a novel video dataset to demonstrate various HHBs. These behaviors in the proposed dataset are able to reflect internal mental states and specific emotions of the characters, e.g., crossing arms suggests to shield oneself from perceived threats. The dataset contains 30 categories of habitual behaviors including more than 300,000 frames and 6,899 action instances. Since these behaviors usually appear at small local parts of human action videos, it is difficult for existing action recognition methods to handle these local features. Therefore, we also propose a two-stream model using both human skeletons and RGB appearances. Experimental results demonstrate that our proposed method has much better performance in action recognition than the existing methods on the proposed dataset.

8/27/2024

Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models

Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg

Foundation models (FMs) are large neural networks trained on broad datasets, excelling in downstream tasks with minimal fine-tuning. Human activity recognition in video has advanced with FMs, driven by competition among different architectures. However, high accuracies on standard benchmarks can draw an artificially rosy picture, as they often overlook real-world factors like changing camera perspectives. Popular benchmarks, mostly from YouTube or movies, offer diverse views but only coarse actions, which are insufficient for use-cases needing fine-grained, domain-specific actions. Domain-specific datasets (e.g., for industrial assembly) typically use data from limited static perspectives. This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition. We compare multiple backbone architectures and design choices, including image- and video- based models, and various strategies for temporal information fusion, including commonly used score averaging and more novel attention-based temporal aggregation mechanisms. This is the first systematic study of different foundation models and specific design choices for human activity recognition from unknown views, conducted with the goal to provide guidance for backbone- and temporal- fusion scheme selection. Code and models will be made publicly available to the community.

7/23/2024

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin

Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.

7/30/2024