Automatic Sound Event Detection and Classification of Great Ape Calls Using Neural Networks

Read original: arXiv:2301.02214 - Published 6/24/2024 by Zifan Jiang, Adrian Soldati, Isaac Schamberg, Adriano R. Lameira, Steven Moran

Overview

The paper presents a novel approach to automatically detect and classify great ape calls from continuous raw audio recordings collected during field research.
The method leverages deep pre-trained and sequential neural networks, including wav2vec 2.0 and LSTM, and is validated on three data sets from three different great ape lineages (orangutans, chimpanzees, and bonobos).
The recordings were collected by different researchers and follow different annotation schemes; the pipeline preprocesses them and trains on them in a uniform fashion.
Call detection and classification both attain high accuracy, and the method is intended to generalize to other animal species and to sound event detection tasks more broadly.

Plain English Explanation

The researchers have developed a new way to automatically identify and categorize the calls made by different great ape species, such as orangutans, chimpanzees, and bonobos, from continuous audio recordings collected in the field. They used advanced deep learning models, including wav2vec 2.0 and LSTM, to analyze the audio data. The recordings were collected by different researchers, so the data had varying annotation schemes, but the researchers' pipeline was able to preprocess and train the models in a consistent way.

The method proved to be highly accurate at detecting and classifying the great ape calls. The researchers believe this approach could be applied to other animal species and used for sound event detection more broadly. To help further research in this area, the researchers have made their pipeline and methods publicly available.

Technical Explanation

The paper presents a novel deep learning-based approach for automated detection and classification of great ape calls from continuous audio recordings collected during field research. The method leverages powerful pre-trained models, such as wav2vec 2.0, and sequential neural networks like LSTM to process the raw audio data.

The researchers validated their approach on three datasets from three different great ape lineages (orangutans, chimpanzees, and bonobos). These datasets were collected by different researchers and included varying annotation schemes, which the researchers' preprocessing pipeline was able to handle in a uniform fashion.
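One way to picture the "uniform fashion" described above is to convert every annotation, whatever its original scheme, into one label per fixed-length audio frame. The sketch below is illustrative only and is not taken from the authors' pipeline: the `Annotation` class, the `label_map` argument, the `background` label, and the example call names are assumptions. The 20 ms frame length matches the stride at which wav2vec 2.0 emits features for 16 kHz audio.

```python
from dataclasses import dataclass

FRAME_MS = 20               # wav2vec 2.0 yields about one feature frame per 20 ms
BACKGROUND = "background"   # hypothetical label for frames that contain no call


@dataclass
class Annotation:
    """One annotated call: start and end in seconds, plus the annotator's label."""
    start: float
    end: float
    label: str


def frame_labels(annotations, duration_s, label_map=None):
    """Return one label per frame for a recording of `duration_s` seconds.

    `label_map` (hypothetical) renames scheme-specific labels to shared names,
    so that data sets annotated differently end up with the same vocabulary.
    """
    label_map = label_map or {}
    n_frames = round(duration_s * 1000) // FRAME_MS
    labels = [BACKGROUND] * n_frames
    for ann in annotations:
        start_ms = round(ann.start * 1000)
        end_ms = round(ann.end * 1000)
        first = max(0, start_ms // FRAME_MS)
        last = min(n_frames, -(-end_ms // FRAME_MS))  # ceiling division
        for i in range(first, last):
            labels[i] = label_map.get(ann.label, ann.label)
    return labels


if __name__ == "__main__":
    calls = [Annotation(0.04, 0.10, "PH"), Annotation(0.14, 0.17, "scream")]
    print(frame_labels(calls, 0.2, {"PH": "pant-hoot"}))
```

Every frame that overlaps an annotation, even partially, takes that annotation's label. A stricter rule, such as requiring the frame centre to fall inside the call, would be an equally reasonable choice.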

The results demonstrate that the proposed method achieves high accuracy in both call detection and call classification across the three great ape lineages. The researchers emphasize that their approach is designed to be generalizable to other animal species and to sound event detection applications more broadly.
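To make the two tasks concrete: detection amounts to locating where calls occur in a recording, and classification to naming each call. The following sketch shows how a sequence of per-frame predictions could be turned into timed events and scored. It is a minimal illustration under assumptions, not the paper's evaluation code: the 20 ms frame length, the `background` label, and the plain frame-level accuracy metric are all choices made here for the example.

```python
FRAME_MS = 20               # assumed frame length, matching wav2vec 2.0's stride
BACKGROUND = "background"   # hypothetical label for frames without a call


def frames_to_events(labels):
    """Merge runs of identical non-background frame labels into events.

    Returns a list of (start_seconds, end_seconds, label) tuples.
    """
    events = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A run ends at the end of the sequence or where the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != BACKGROUND:
                events.append((start * FRAME_MS / 1000, i * FRAME_MS / 1000, labels[start]))
            start = i
    return events


def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


if __name__ == "__main__":
    predicted = ["background", "call", "call", "background", "call"]
    reference = ["background", "call", "background", "background", "call"]
    print(frames_to_events(predicted))
    print(frame_accuracy(predicted, reference))
```

Frame-level accuracy is easy to compute but can look high on recordings that are mostly background, so event-level precision and recall are a common complement in sound event detection.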

Critical Analysis

The paper presents a comprehensive and well-executed study, with strong technical implementation and robust validation across multiple datasets. However, the researchers acknowledge certain limitations and areas for further research.

One potential caveat is the reliance on pre-trained models such as wav2vec 2.0, which was pre-trained on human speech and may not be optimized for all animal species or recording conditions. The researchers suggest that further fine-tuning or adaptation of these models may be necessary for optimal performance in some scenarios.

Additionally, the paper does not delve deeply into the interpretability or explainability of the deep learning models used. While the methods demonstrate high accuracy, there may be a need for more transparent and interpretable models to gain deeper insights into the acoustic features and patterns that the models are learning.

Lastly, the researchers highlight the need for further validation and testing on a broader range of great ape species, recording conditions, and real-world deployment scenarios to fully assess the generalizability and robustness of the proposed approach.

Conclusion

This paper presents a promising and innovative approach to automated bioacoustic monitoring of great apes in the wild. The researchers have developed a deep learning-based pipeline that can accurately detect and classify great ape calls from continuous audio recordings, even when the data is collected by different researchers and has varying annotation schemes.

The high accuracy of the method, along with its potential for generalization to other animal species and sound event detection tasks, makes it a valuable contribution to the field of bioacoustic monitoring and animal behavior research. By making their pipeline and methods publicly available, the researchers are fostering further advancements in this area and enabling other researchers to build upon their work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Advanced Framework for Animal Sound Classification With Features Optimization

Qiang Yang, Xiuying Chen, Changsheng Ma, Carlos M. Duarte, Xiangliang Zhang

The automatic classification of animal sounds presents an enduring challenge in bioacoustics, owing to the diverse statistical properties of sound signals, variations in recording equipment, and prevalent low Signal-to-Noise Ratio (SNR) conditions. Deep learning models like Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) have excelled in human speech recognition but have not been effectively tailored to the intricate nature of animal sounds, which exhibit substantial diversity even within the same domain. We propose an automated classification framework applicable to general animal sound classification. Our approach first optimizes audio features from Mel-frequency cepstral coefficients (MFCC) including feature rearrangement and feature reduction. It then uses the optimized features for the deep learning model, i.e., an attention-based Bidirectional LSTM (Bi-LSTM), to extract deep semantic features for sound classification. We also contribute an animal sound benchmark dataset encompassing oceanic animals and birds. Extensive experimentation with real-world datasets demonstrates that our approach consistently outperforms baseline methods by over 25% in precision, recall, and accuracy, promising advancements in animal sound classification.

7/8/2024

On Feature Learning for Titi Monkey Activity Detection

Aditya Ravuri, Jen Muir, Neil D. Lawrence

This paper, a technical summary of our preceding publication, introduces a robust machine learning framework for the detection of vocal activities of Coppery titi monkeys. Utilizing a combination of MFCC features and a bidirectional LSTM-based classifier, we effectively address the challenges posed by the small amount of expert-annotated vocal data available. Our approach significantly reduces false positives and improves the accuracy of call detection in bioacoustic research. Initial results demonstrate an accuracy of 95% on instance predictions, highlighting the effectiveness of our model in identifying and classifying complex vocal patterns in environmental audio recordings. Moreover, we show how call classification can be done downstream, paving the way for real-world monitoring.

7/2/2024

On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis

Eklavya Sarkar, Mathew Magimai.-Doss

Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.

7/25/2024