Wake Vision: A Large-scale, Diverse Dataset and Benchmark Suite for TinyML Person Detection

Read original: arXiv:2405.00892 - Published 6/7/2024 by Colby Banbury, Emil Njor, Matthew Stewart, Pete Warden, Manjunath Kudlur, Nat Jeffries, Xenofon Fafoutis, Vijay Janapa Reddi

Wake Vision: A Large-scale, Diverse Dataset and Benchmark Suite for TinyML Person Detection

Overview

Presents a large-scale, diverse dataset called "Wake Vision" for TinyML person detection
Introduces a benchmark suite to evaluate the performance of TinyML models on person detection tasks
Evaluates the effectiveness of various TinyML models on the dataset and provides insights into their strengths and limitations

Plain English Explanation

This paper presents a new dataset called "Wake Vision" that can be used to train and evaluate machine learning models for detecting people in images. The dataset is large and diverse, containing a wide range of images of people in different settings, poses, and lighting conditions. The researchers also developed a set of benchmarks, or standardized tests, that can be used to measure how well different machine learning models perform on the person detection task.

One of the key goals of this research is to support the development of TinyML models, which are machine learning models that are designed to run on small, low-power devices like smartphones or embedded sensors. These types of models are becoming increasingly important as our world becomes more connected and we need to process data closer to the source, rather than sending it to a central server.

The researchers evaluated the performance of several different TinyML models on the Wake Vision dataset, and they found that some models performed better than others. They also identified areas where the models struggled, such as detecting people in challenging lighting conditions or when they are partially obscured. These insights can help researchers and developers improve the performance of TinyML models for person detection and other similar tasks.

Technical Explanation

The Wake Vision dataset is a large-scale, diverse dataset containing over 1 million images of people in a wide range of settings, including indoor and outdoor scenes, different lighting conditions, and various poses and occlusions. The dataset was designed to be challenging and representative of real-world person detection scenarios.

The researchers also developed a suite of benchmarks to evaluate the performance of TinyML models on the person detection task. These benchmarks include metrics such as detection accuracy, inference speed, and energy consumption, which are all important factors for deploying TinyML models in real-world applications.

The researchers evaluated several state-of-the-art TinyML models, including MobileNetV2, EfficientNet-Lite, and [MnasNet], on the Wake Vision dataset. They found that the models had varying levels of performance, with some struggling in challenging lighting conditions or when detecting partially occluded people.

Critical Analysis

The Wake Vision dataset and benchmark suite provide a valuable resource for researchers and developers working on TinyML person detection models. The dataset's diversity and scale make it a more realistic and challenging test bed than many existing person detection datasets.

However, the paper does not provide detailed information on the dataset's annotation process or the specific challenges faced in its creation. It would be helpful to know more about the trade-offs and design decisions made by the researchers, as well as the potential biases or limitations of the dataset.

Additionally, the paper focuses primarily on evaluating the performance of existing TinyML models, but it does not propose any novel model architectures or training techniques to address the identified challenges. Future research could explore ways to improve the robustness and efficiency of TinyML person detection models, particularly in scenarios with complex lighting, occlusions, or other real-world challenges.

Conclusion

The Wake Vision dataset and benchmark suite represent an important contribution to the field of TinyML person detection. By providing a large-scale, diverse dataset and a standardized set of benchmarks, the researchers have created a valuable tool for researchers and developers working to create efficient, high-performing machine learning models for real-world applications.

The insights gained from evaluating existing TinyML models on the Wake Vision dataset can help guide the development of more robust and effective person detection models, which could have far-reaching implications for a wide range of applications, from smart home systems to autonomous vehicles.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

Wake Vision: A Large-scale, Diverse Dataset and Benchmark Suite for TinyML Person Detection

Colby Banbury, Emil Njor, Matthew Stewart, Pete Warden, Manjunath Kudlur, Nat Jeffries, Xenofon Fafoutis, Vijay Janapa Reddi

Tiny machine learning (TinyML), which enables machine learning applications on extremely low-power devices, suffers from limited size and quality of relevant datasets. To address this issue, we introduce Wake Vision, a large-scale, diverse dataset tailored for person detection, the canonical task for TinyML visual sensing. Wake Vision comprises over 6 million images, representing a hundredfold increase compared to the previous standard, and has undergone thorough quality filtering. We provide two Wake Vision training sets: Wake Vision (Large) and Wake Vision (Quality), a smaller set with higher-quality labels. Our results demonstrate that using the Wake Vision (Quality) training set produces more accurate models than the Wake Vision (Large) training set, strongly suggesting that label quality is more important than quantity in our setting. We find use for the large training set for pre-training and knowledge distillation. To minimize label errors that can obscure true model performance, we manually label the validation and test sets, improving the test set error rate from 7.8% in the prior standard to only 2.2%. In addition to the dataset, we provide a collection of five detailed benchmark sets to facilitate the evaluation of model quality in challenging real world scenarios that are often ignored when focusing solely on overall accuracy. These novel fine-grained benchmarks assess model performance on specific segments of the test data, such as varying lighting conditions, distances from the camera, and demographic characteristics of subjects. Our results demonstrate that using Wake Vision for training results in a 2.49% increase in accuracy compared to the established dataset. We also show the importance of dataset quality for low-capacity models and the value of dataset size for high-capacity models. wakevision.ai

6/7/2024

MindSet: Vision. A toolbox for testing DNNs on key psychological experiments

Valerio Biscione, Dong Yin, Gaurav Malhotra, Marin Dujmovic, Milton L. Montero, Guillermo Puebla, Federico Adolfi, Rachel F. Heaton, John E. Hummel, Benjamin D. Evans, Karim Habashy, Jeffrey S. Bowers

Multiple benchmarks have been developed to assess the alignment between deep neural networks (DNNs) and human vision. In almost all cases these benchmarks are observational in the sense they are composed of behavioural and brain responses to naturalistic images that have not been manipulated to test hypotheses regarding how DNNs or humans perceive and identify objects. Here we introduce the toolbox MindSet: Vision, consisting of a collection of image datasets and related scripts designed to test DNNs on 30 psychological findings. In all experimental conditions, the stimuli are systematically manipulated to test specific hypotheses regarding human visual perception and object recognition. In addition to providing pre-generated datasets of images, we provide code to regenerate these datasets, offering many configurable parameters which greatly extend the dataset versatility for different research contexts, and code to facilitate the testing of DNNs on these image datasets using three different methods (similarity judgments, out-of-distribution classification, and decoder method), accessible at https://github.com/MindSetVision/mindset-vision. We test ResNet-152 on each of these methods as an example of how the toolbox can be used.

4/9/2024

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, Weisi Lin

The rapid development of Multi-modality Large Language Models (MLLMs) has navigated a paradigm shift in computer vision, moving towards versatile foundational models. However, evaluating MLLMs in low-level visual perception and understanding remains a yet-to-explore domain. To this end, we design benchmark settings to emulate human language responses related to low-level vision: the low-level visual perception (A1) via visual question answering related to low-level attributes (e.g. clarity, lighting); and the low-level visual description (A2), on evaluating MLLMs for low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we further extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we carry out the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs for low-level descriptions on 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting score, by employing a softmax-based approach to enable all MLLMs to generate quantifiable quality ratings, tested against human opinions in 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than single image evaluations (like humans). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. Datasets will be available at https://github.com/Q-Future/Q-Bench.

8/13/2024

VoiceWukong: Benchmarking Deepfake Voice Detection

Ziwei Yan, Yanjie Zhao, Haoyu Wang

With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio, where different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the LALM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at {https://voicewukong.github.io}.

9/11/2024