ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition

Read original: arXiv:2404.08937 - Published 4/16/2024 by Otto Brookes, Majid Mirmehdi, Hjalmar Kuhl, Tilo Burghardt

Overview

Presents a novel approach for recognizing chimpanzee behaviors using a Vision-Language Model (VLM) enhanced with an ethogram-based training process
Demonstrates improved performance compared to existing methods for chimpanzee behavior recognition
Explores the value of incorporating detailed behavioral knowledge into VLM training for improved animal behavior understanding

Plain English Explanation

This research paper describes a new way to automatically identify and classify the behaviors of chimpanzees. The researchers developed a Vision-Language Model that draws not only on camera trap video of chimpanzees, but also on detailed behavioral information known as an "ethogram."

An ethogram is essentially a comprehensive catalog of all the different behaviors that a particular animal species is known to display. By incorporating this rich behavioral knowledge into the training process, the Vision-Language Model was able to better recognize and categorize the various actions and movements of chimpanzees.

The results showed that this approach outperformed existing methods for chimpanzee behavior recognition, demonstrating the value of combining Vision-Language Models with domain-specific behavioral knowledge. This kind of knowledge-enhanced behavior recognition could have important applications in fields like animal behavior research and conservation.

Technical Explanation

The researchers developed a Vision-Language Model called ChimpVLM that combines visual features from camera trap videos with an embedding of text descriptions of chimpanzee behaviours drawn from an ethogram. The authors report that giving the visual architecture access to this text improves behaviour understanding compared with using visual data alone.

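According to the paper's abstract, visual features are extracted directly from camera trap videos and passed to a multi-modal decoder, which processes a set of query tokens representing behaviours and outputs class predictions. The query tokens are initialised from a standardised ethogram of chimpanzee behaviour rather than randomly or from class names. The authors also explore initialising the query tokens with a masked language model fine-tuned on a text corpus of known behavioural patterns.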
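The paper's abstract describes behaviour query tokens that are initialised from ethogram text and decoded against visual features. The following is a minimal sketch of that idea, not the authors' implementation: `embed_text` is a hypothetical hashed bag-of-words stand-in for a real language-model embedding, the ethogram descriptions and visual tokens are made-up toy values, the learned projections of a real decoder are omitted, and scoring each class by the dot product of its query with the attended features is an assumption made for illustration.

```python
import math
import zlib

DIM = 8  # toy embedding width; an assumption for this sketch


def embed_text(text, dim=DIM):
    """Hypothetical stand-in for a language-model embedding:
    a hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, visual_tokens):
    """Single-head cross-attention without learned projections:
    the query token takes a weighted average of the visual tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in visual_tokens])
    return [
        sum(w * tok[i] for w, tok in zip(weights, visual_tokens))
        for i in range(len(query))
    ]


def decode(ethogram, visual_tokens):
    """Return one score per behaviour class."""
    logits = {}
    for behaviour, description in ethogram.items():
        # Query initialised from the ethogram description,
        # not randomly and not from the class name alone.
        query = embed_text(description)
        attended = cross_attend(query, visual_tokens)
        logits[behaviour] = dot(query, attended)  # assumed scoring rule
    return logits


# Made-up ethogram entries, for illustration only.
ETHOGRAM = {
    "feeding": "individual picks up food and brings it to the mouth",
    "travel": "individual walks or runs along the ground",
    "resting": "individual sits or lies still",
}

# Toy visual features standing in for the output of a video backbone.
VISUAL_TOKENS = [[math.sin(i + j) for j in range(DIM)] for i in range(4)]

logits = decode(ETHOGRAM, VISUAL_TOKENS)
predicted = max(logits, key=logits.get)
print(predicted, logits)
```

In a trained model the queries and visual tokens would live in a shared learned space and pass through several decoder layers; the sketch only shows where the ethogram text enters the computation.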

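Experiments on the PanAf500 and PanAf20K datasets cover multi-class and multi-label recognition, respectively. ChimpVLM achieves state-of-the-art performance over vision and vision-language models in top-1 accuracy (+6.34%) on PanAf500, and in overall (+1.1%) and tail-class (+2.26%) mean average precision on PanAf20K. The researchers attribute this improved performance to the multi-modal decoding approach and the ethogram-based query initialisation, and report ablations that corroborate the gains.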
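The paper's abstract reports results as top-1 accuracy for the multi-class task and mean average precision (mAP) for the multi-label task. As a rough illustration of how these two metrics are commonly computed, here is a small sketch on made-up scores. The function names and the optional `classes` argument (used here to restrict mAP to a subset such as tail classes) are assumptions for this example; how the paper defines its tail classes is not stated in this summary.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def average_precision(scores, targets):
    """AP for one class: mean of the precision values at each rank
    where a positive sample appears, ranking by descending score."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, target) in enumerate(ranked, start=1):
        if target:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, target_matrix, classes=None):
    """Mean of per-class AP, optionally over a subset of class indices."""
    if classes is None:
        classes = range(len(score_matrix[0]))
    aps = [
        average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in target_matrix],
        )
        for c in classes
    ]
    return sum(aps) / len(aps)


# Made-up multi-class scores: 3 clips, 2 classes.
mc_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
mc_labels = [1, 0, 0]

# Made-up multi-label scores: 4 clips, 2 classes.
ml_scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.1], [0.6, 0.9]]
ml_targets = [[1, 0], [0, 1], [1, 0], [0, 1]]

print(top1_accuracy(mc_scores, mc_labels))
print(mean_average_precision(ml_scores, ml_targets))
print(mean_average_precision(ml_scores, ml_targets, classes=[1]))
```

Because mAP averages over classes rather than samples, rare behaviours weigh as much as common ones, which is why a separate tail-class figure is informative on a long-tailed dataset.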

Critical Analysis

The paper makes a compelling case for the value of incorporating domain-specific knowledge, in this case an ethogram of chimpanzee behaviors, into the training of Vision-Language Models for improved animal behavior recognition. However, the research is limited to a single species (chimpanzees) and does not explore the generalizability of this approach to other animals.

Additionally, the paper does not address potential biases or limitations in the ethogram data itself. Ethograms are typically curated by human experts and may reflect anthropocentric perspectives or miss certain nuances in animal behavior. Incorporating such biases into the training process could lead to skewed model performance.

Further research is needed to understand how the ethogram-enhanced training process can be applied to a wider range of animal species and how to mitigate potential issues with the underlying behavioral data. Nonetheless, this work represents an important step towards leveraging the power of large Vision-Language Models for more accurate and nuanced understanding of animal behavior.

Conclusion

The ChimpVLM model presented in this paper demonstrates the value of incorporating domain-specific knowledge, in the form of an ethogram, into the training of Vision-Language Models for animal behavior recognition. By leveraging the rich behavioral context provided by the ethogram, the model was able to achieve superior performance compared to existing methods.

This research highlights the potential of integrating large-scale Vision-Language Models with specialized domain knowledge. While further work is needed to explore the generalizability of this approach, this study represents an important step forward in the field of computer vision for animal behavior understanding.

Related Papers

ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition

Otto Brookes, Majid Mirmehdi, Hjalmar Kuhl, Tilo Burghardt

We show that chimpanzee behaviour understanding from camera traps can be enhanced by providing visual architectures with access to an embedding of text descriptions that detail species behaviours. In particular, we present a vision-language model which employs multi-modal decoding of visual features extracted directly from camera trap videos to process query tokens representing behaviours and output class predictions. Query tokens are initialised using a standardised ethogram of chimpanzee behaviour, rather than using random or name-based initialisations. In addition, the effect of initialising query tokens using a masked language model fine-tuned on a text corpus of known behavioural patterns is explored. We evaluate our system on the PanAf500 and PanAf20K datasets and demonstrate the performance benefits of our multi-modal decoding approach and query initialisation strategy on multi-class and multi-label recognition tasks, respectively. Results and ablations corroborate performance improvements. We achieve state-of-the-art performance over vision and vision-language models in top-1 accuracy (+6.34%) on PanAf500 and overall (+1.1%) and tail-class (+2.26%) mean average precision on PanAf20K. We share complete source code and network weights for full reproducibility of results and easy utilisation.

4/16/2024

From Forest to Zoo: Great Ape Behavior Recognition with ChimpBehave

Michael Fuchs, Emilie Genty, Adrian Bangerter, Klaus Zuberbuhler, Paul Cotofrei

This paper addresses the significant challenge of recognizing behaviors in non-human primates, specifically focusing on chimpanzees. Automated behavior recognition is crucial for both conservation efforts and the advancement of behavioral research. However, it is significantly hindered by the labor-intensive process of manual video annotation. Despite the availability of large-scale animal behavior datasets, the effective application of machine learning models across varied environmental settings poses a critical challenge, primarily due to the variability in data collection contexts and the specificity of annotations. In this paper, we introduce ChimpBehave, a novel dataset featuring over 2 hours of video (approximately 193,000 video frames) of zoo-housed chimpanzees, meticulously annotated with bounding boxes and behavior labels for action recognition. ChimpBehave uniquely aligns its behavior classes with existing datasets, allowing for the study of domain adaptation and cross-dataset generalization methods between different visual settings. Furthermore, we benchmark our dataset using a state-of-the-art CNN-based action recognition model, providing the first baseline results for both within and cross-dataset settings. The dataset, models, and code can be accessed at: https://github.com/MitchFuchs/ChimpBehave

5/31/2024

GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

Yiqi Wu, Xiaodan Hu, Ziming Fu, Siling Zhou, Jiangong Li

Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, namely Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five dimensions, including counting, actor referring, semantic correspondence, time perception, and robustness, we found that while current multimodal LLMs require improvement in semantic correspondence and time perception, they have initially demonstrated visual perception capabilities for animal activity recognition. Notably, GPT-4o showed outstanding performance, with Video-Chat2 and GPT-4o exhibiting significantly better semantic correspondence and time perception in close-up video clips compared to full-shot clips. The initial evaluation experiments in this study validate the potential of multimodal large language models in livestock scene video understanding and provide new directions and references for future research on animal behavior video understanding. Furthermore, by deeply exploring the influence of visual prompts on multimodal large language models, we expect to enhance the accuracy and efficiency of animal behavior recognition in livestock scenarios through human visual processing methods.

6/17/2024

EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, directly feeding it into the language models alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model to minimize computational costs while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.

7/22/2024