Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Read original: arXiv:2407.14834 - Published 7/23/2024 by Harsh Lunia

Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Overview

Large language models (LLMs) have shown impressive capabilities in visual reasoning tasks, suggesting they could be used for action recognition in videos.
The paper investigates whether video-language models (VLMs) can be effectively applied to action recognition in videos.
Key findings include that VLMs can achieve competitive performance on action recognition benchmarks and provide insights into the reasoning process.

Plain English Explanation

Visual reasoning is the ability to understand and reason about visual information. Large language models (LLMs) have demonstrated strong visual reasoning capabilities, leading researchers to wonder if they could also be used for action recognition in videos.

Action recognition is the task of identifying the actions or activities happening in a video. For example, if a video shows someone playing basketball, an action recognition system should be able to detect and classify that activity.

The researchers in this paper investigated whether video-language models (VLMs), which combine language and visual understanding, can be effective at action recognition. They found that VLMs can achieve competitive performance on standard action recognition benchmarks, and can also provide insights into how they are reasoning about the videos.

This suggests that VLMs, which leverage the powerful capabilities of large language models, could be a promising approach for video understanding tasks like action recognition. By combining visual and language understanding, VLMs may be able to better comprehend the complex events and activities captured in videos.

Technical Explanation

The researchers evaluated the ability of video-language models (VLMs) to perform action recognition on videos. They fine-tuned several VLM architectures, including CLIP and LAVA, on standard action recognition datasets like Kinetics-400 and Something-Something V2.

The results showed that VLMs can achieve competitive performance on action recognition, outperforming or matching specialized video models. Additionally, the researchers analyzed the attention maps produced by the VLMs to gain insights into their reasoning process. They found that the models were able to focus on the relevant spatial and temporal regions of the videos to recognize the actions.

The paper suggests that VLMs are effective at action recognition because they can leverage their strong language understanding capabilities to reason about the semantics and context of the actions, in addition to processing the visual information. This indicates that VLMs may be a promising approach for video understanding tasks that require high-level reasoning.

Critical Analysis

The paper provides a compelling demonstration of the potential of VLMs for action recognition in videos. However, there are a few limitations and areas for further research:

The experiments were conducted on relatively constrained datasets, and it would be valuable to evaluate the performance of VLMs on more diverse and challenging video datasets.
The analysis of the VLMs' attention maps offers insights into their reasoning process, but a more thorough investigation of their internal representations and decision-making could yield additional valuable insights.
While the VLMs achieved strong performance, it's unclear how they compare to specialized video models that have been optimized for action recognition. A more comprehensive benchmarking against state-of-the-art video models would help contextualize the results.
The paper does not discuss the computational and memory efficiency of the VLMs for action recognition, which could be an important consideration for real-world applications.

Overall, the research presented in this paper is a promising step towards understanding the potential of VLMs for video understanding tasks. Further investigation and comparison to specialized video models could help solidify the role of these powerful language-vision models in video analysis.

Conclusion

This paper explores the use of video-language models (VLMs) for action recognition in videos. The key findings are that VLMs can achieve competitive performance on standard action recognition benchmarks, and their attention maps provide insights into their reasoning process.

The results suggest that VLMs, which leverage the strong language understanding capabilities of large language models, may be a powerful approach for video understanding tasks that require high-level reasoning. By combining visual and language understanding, VLMs can potentially better comprehend the complex events and activities captured in videos.

While the paper demonstrates the promise of VLMs for action recognition, further research is needed to fully understand their capabilities, limitations, and trade-offs compared to specialized video models. Expanding the evaluation to more diverse datasets, analyzing the internal representations in depth, and assessing computational efficiency are all important next steps.

Overall, this research contributes to the growing body of work exploring the potential of language-vision models for advanced video analysis and understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Can VLMs be used on videos for action recognition? LLMs are Visual Reasoning Coordinators

Harsh Lunia

Recent advancements have introduced multiple vision-language models (VLMs) demonstrating impressive commonsense reasoning across various domains. Despite their individual capabilities, the potential of synergizing these complementary VLMs remains underexplored. The Cola Framework addresses this by showcasing how a large language model (LLM) can efficiently coordinate multiple VLMs through natural language communication, leveraging their distinct strengths. We have verified this claim on the challenging A-OKVQA dataset, confirming the effectiveness of such coordination. Building on this, our study investigates whether the same methodology can be applied to surveillance videos for action recognition. Specifically, we explore if leveraging the combined knowledge base of VLMs and LLM can effectively deduce actions from a video when presented with only a few selectively important frames and minimal temporal information. Our experiments demonstrate that LLM, when coordinating different VLMs, can successfully recognize patterns and deduce actions in various scenarios despite the weak temporal signals. However, our findings suggest that to enhance this approach as a viable alternative solution, integrating a stronger temporal signal and exposing the models to slightly more frames would be beneficial.

7/23/2024

Open-vocabulary Temporal Action Localization using VLMs

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Video action localization aims to find timings of a specific action from a long video. Although existing learning-based approaches have been successful, those require annotating videos that come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess a frame that is considered to be closest to the start/end of the action. Iterating this process by narrowing a sampling time window results in finding a specific frame of start and end of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos. A sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.

9/10/2024

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang

In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit strong performance for problems involving understanding the literal visual content, which we noted as visual commonsense understanding (VCU). For problems where the goal is to infer conclusions beyond image content, which we noted as visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well. We empirically validate this by letting LLMs classify VCR problems into these two categories and show the significant difference between VLM and LLM with image caption decision pipelines on two subproblems. Moreover, we identify a challenge with VLMs' passive perception, which may miss crucial context information, leading to incorrect reasoning by LLMs. Based on these, we suggest a collaborative approach, named ViCor, where pre-trained LLMs serve as problem classifiers to analyze the problem category, then either use VLMs to answer the question directly or actively instruct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain fine-tuning.

5/20/2024

Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.

7/15/2024