UrBAN: Urban Beehive Acoustics and PheNotyping Dataset

Read original: arXiv:2406.03657 - Published 6/21/2024 by Mahsa Abdollahi, Yi Zhu, Heitor R. Guimarães, Nico Coallier, Ségolène Maucourt, Pierre Giovenazzo, Tiago H. Falk

Overview

This paper introduces the UrBAN dataset, which contains audio recordings and phenotype data from urban beehives.
The dataset aims to support research on monitoring beehive health and behavior using acoustic analysis.
The data was collected from 2021 to 2022 at an apiary of 10 beehives in Montréal, Quebec, Canada.

Plain English Explanation

The UrBAN dataset provides a collection of audio recordings and other data from beehives located in cities. The goal is to help researchers study the health and behavior of urban bees by analyzing the sounds they make.

Bees make a variety of buzzing and humming noises that can reveal information about their colony's status. By recording these sounds and combining them with colony-level observations (the "phenotypes" in the dataset's name), such as population size, queen condition, and Varroa mite infestation rates, scientists can build models to automatically monitor bee health. This could be useful for identifying issues like pesticide exposure or for detecting diseases early on.

The UrBAN dataset was collected from beehives in an urban setting over multiple seasons. This allowed the researchers to capture how bee sounds and characteristics change over time and in different environmental conditions. Having this real-world dataset can help develop practical audio-based monitoring systems for urban beekeepers and conservationists.

Technical Explanation

The UrBAN dataset consists of audio recordings and phenotype data collected from urban beehives between 2021 and 2022. The audio was recorded using microphones installed inside the hives, alongside sensors capturing temperature and humidity. The phenotype data does not describe individual bees. It comes from periodic hive inspections, which monitored changes in colony population, assessed queen-related conditions, and documented overall hive health, together with health metrics such as Varroa mite infestation rates and winter mortality assessments.
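To make "acoustic analysis" of in-hive audio more concrete, here is a minimal sketch of three simple frame-level descriptors: RMS energy, zero-crossing rate, and the power at one chosen frequency (via the Goertzel recurrence). It is not the feature set used by the dataset's authors. The signal is a synthetic tone standing in for a recording, and the sample rate, tone frequency, frame sizes, and all function names are assumptions chosen for illustration.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into fixed-length frames; a ragged tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def rms(frame):
    """Root-mean-square amplitude, a rough loudness measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def goertzel_power(frame, sample_rate, freq):
    """Squared magnitude of the Fourier sum at one frequency (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s_prev, s_prev2 = x + coeff * s_prev - s_prev2, s_prev
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def synthetic_hum(freq, sample_rate, seconds, amplitude=0.5, phase=0.1):
    """Stand-in for a recording: a pure tone, NOT real hive audio."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq * i / sample_rate + phase)
            for i in range(n)]


SAMPLE_RATE = 8000  # assumed for this sketch, not the dataset's sample rate
signal = synthetic_hum(250.0, SAMPLE_RATE, seconds=1.0)
frames = frame_signal(signal, frame_len=800, hop=400)

# One feature vector per frame: [loudness, zero-crossing rate, power at 250 Hz]
features = [[rms(f), zero_crossing_rate(f), goertzel_power(f, SAMPLE_RATE, 250.0)]
            for f in frames]
print(len(features), "frames;", "first frame features:", features[0])
```

On real recordings, descriptors like these would typically be computed per frame and then summarized over longer windows before being paired with inspection records.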

By combining the acoustic and phenotypic information, the UrBAN dataset aims to support research on computational analysis of animal behavior, specifically for monitoring the health and status of urban bee colonies. The dataset can be used to develop machine learning models that can automatically detect changes in bee behavior or colony health based on the audio signals.
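The paper's abstract describes predicting colony population from audio features, using the number of frames of bees as a proxy. As a hedged illustration of that kind of setup, the sketch below fits a one-variable least-squares line from a single audio feature to a frames-of-bees count. The choice of a linear model is an assumption rather than the authors' method, the numbers are invented toy values rather than dataset values, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has no variance; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, xs):
    """Apply the fitted line to new feature values."""
    return [slope * x + intercept for x in xs]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# TOY values invented for this sketch, NOT taken from the UrBAN dataset.
audio_feature = [1.0, 2.0, 3.0, 4.0]    # e.g. a per-recording energy summary
frames_of_bees = [3.0, 5.0, 7.0, 9.0]   # inspection label used as population proxy

slope, intercept = fit_line(audio_feature, frames_of_bees)
predictions = predict(slope, intercept, audio_feature)
print("slope:", slope, "intercept:", intercept)
print("training MAE:", mean_absolute_error(frames_of_bees, predictions))
```

In practice the error should be measured on hives or dates held out from training, since recordings from the same hive on nearby days are strongly correlated.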

The researchers collected data from an apiary of 10 beehives in Montréal, Quebec, Canada. Recordings were made across 2021 and 2022 and amount to more than 2000 hours of raw audio, allowing the researchers to capture seasonal and environmental variations in bee sounds and colony condition.

Critical Analysis

The UrBAN dataset provides a valuable resource for researchers studying urban bee populations and developing acoustic-based monitoring systems. By including both audio recordings and hive inspection data, the dataset enables the exploration of connections between bee sounds and the underlying biological state of the colony.

One potential limitation of the dataset is the specific geographical and environmental context of the data collection, which was limited to a single urban area. While this provides a realistic dataset for that locale, the applicability of models trained on UrBAN data to other urban environments may be limited. Expanding the dataset to include more diverse urban settings could increase its broader utility.

Additionally, the dataset focuses on beehive-level acoustics and phenotypes, rather than individual bee-level data. Incorporating individual bee identification and tracking could provide more granular insights into bee behavior and health. This could be an area for future research and dataset expansion.

Overall, the UrBAN dataset represents an important step forward in enabling the use of acoustic analysis for monitoring urban bee populations. The dataset's availability can spur further advancements in computational analysis of animal behavior and the development of practical, non-invasive tools for beekeepers and conservationists.

Conclusion

The UrBAN dataset provides a collection of audio recordings, in-hive temperature and humidity readings, and hive inspection records from urban beehives, with the goal of supporting research on acoustic-based monitoring of bee health and behavior. By combining these multimodal data sources, the dataset enables the development of machine learning models that can potentially detect changes in bee colonies early on, helping urban beekeepers and conservationists respond to issues like disease, pest infestations, or environmental stressors.

The dataset's focus on real-world, multi-seasonal data from an urban setting makes it a valuable resource for advancing the field of computational analysis of animal behavior and building practical audio-based monitoring systems for bees. While the dataset has some limitations in its geographic scope, it represents an important step forward in leveraging acoustics and phenotyping to support the protection and management of urban bee populations.

Related Papers

UrBAN: Urban Beehive Acoustics and PheNotyping Dataset

Mahsa Abdollahi, Yi Zhu, Heitor R. Guimarães, Nico Coallier, Ségolène Maucourt, Pierre Giovenazzo, Tiago H. Falk

In this paper, we present a multimodal dataset obtained from a honey bee colony in Montréal, Quebec, Canada, spanning the years of 2021 to 2022. This apiary comprised 10 beehives, with microphones recording more than 2000 hours of high quality raw audio, and also sensors capturing temperature and humidity. Periodic hive inspections involved monitoring colony honey bee population changes, assessing queen-related conditions, and documenting overall hive health. Additionally, health metrics, such as Varroa mite infestation rates and winter mortality assessments, were recorded, offering valuable insights into factors affecting hive health status and resilience. In this study, we first outline the data collection process, sensor data description, and dataset structure. Furthermore, we demonstrate a practical application of this dataset by extracting various features from the raw audio to predict colony population using the number of frames of bees as a proxy.

6/21/2024

HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models

Aakash Tripathi, Asim Waqas, Yasin Yilmaz, Ghulam Rasool

Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.

6/14/2024

ApisTox: a new benchmark dataset for the classification of small molecules toxicity on honey bees

Jakub Adamczyk, Jakub Poziemski, Paweł Siedlecki

The global decline in bee populations poses significant risks to agriculture, biodiversity, and environmental stability. To bridge the gap in existing data, we introduce ApisTox, a comprehensive dataset focusing on the toxicity of pesticides to honey bees (Apis mellifera). This dataset combines and leverages data from existing sources such as ECOTOX and PPDB, providing an extensive, consistent, and curated collection that surpasses the previous datasets. ApisTox incorporates a wide array of data, including toxicity levels for chemicals, details such as time of their publication in literature, and identifiers linking them to external chemical databases. This dataset may serve as an important tool for environmental and agricultural research, but also can support the development of policies and practices aimed at minimizing harm to bee populations. Finally, ApisTox offers a unique resource for benchmarking molecular property prediction methods on agrochemical compounds, facilitating advancements in both environmental science and cheminformatics. This makes it a valuable tool for both academic research and practical applications in bee conservation.

9/4/2024

BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/zahrag/BIOSCAN-5M.

6/26/2024