Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition

Read original: arXiv:2406.02554 - Published 6/6/2024 by Shijian Deng, Erin E. Kosloski, Siddhi Patel, Zeke A. Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T. Doan, Matthew Wang, Harsh Singh and 2 others

Overview

Introduces the new problem of audio-visual autism behavior recognition
Presents AV-ASD, a large audio-visual dataset for autism screening using a behavioral approach
Explores foundation models and multimodal large language models across different modalities
Shows that integrating audio, visual, and speech modalities significantly enhances performance in autism behavior recognition
Investigates a post-hoc to ad-hoc pipeline in a multimodal large language model that could augment the model's explanatory capability

Plain English Explanation

The researchers have developed a new way to recognize behaviors associated with autism spectrum disorder (ASD) using both audio and visual cues. This is an important advance, as previous AI-assisted autism screening research has often overlooked the social aspects of autism.

To enable this new research direction, the team collected a large dataset called AV-ASD, which contains video recordings of a wide range of autism-related behaviors, including those related to social communication and interaction. This is currently the largest video dataset available for this type of autism screening approach.

The researchers then explored using advanced machine learning models, known as foundation models and multimodal large language models, to analyze the audio and visual data. Their experiments showed that combining information from the audio, visual, and speech modalities significantly improves the ability to recognize autism-related behaviors.

Additionally, the researchers investigated a technique called a "post-hoc to ad-hoc pipeline" within a multimodal large language model. This approach has the potential to make the model's decision-making process more transparent and easier to understand when it is used for autism behavior recognition.

Overall, this research represents an important step forward in developing more comprehensive and effective AI-assisted tools for autism screening and understanding.

Technical Explanation

The researchers introduced the problem of audio-visual autism behavior recognition, which uses both audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. The task includes social behavior recognition, an essential aspect that earlier AI-assisted autism screening research omitted.

To facilitate this new research direction, the team collected the AV-ASD dataset, which is currently the largest video dataset for autism screening using a behavioral approach. The dataset covers a wide range of autism-associated behaviors, including those related to social communication and interaction.

The researchers then explored leveraging foundation models and multimodal large language models across different modalities to tackle the audio-visual autism behavior recognition problem. Their experiments on the AV-ASD dataset demonstrated that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition.
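One simple way to picture combining modalities is late fusion, where each modality produces its own per-behavior scores and the scores are averaged. The sketch below is only an illustration under stated assumptions: the summary does not say which fusion method the paper uses, and the label names, the `late_fusion` and `predict` helpers, the weights, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List

# Hypothetical label names; the summary does not list the AV-ASD categories.
LABELS = ["behavior_a", "behavior_b", "behavior_c"]


def late_fusion(scores_by_modality: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of per-label scores across modalities."""
    total_weight = sum(weights[m] for m in scores_by_modality)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(LABELS)
    for modality, scores in scores_by_modality.items():
        if len(scores) != len(LABELS):
            raise ValueError(f"{modality}: expected {len(LABELS)} scores")
        for i, score in enumerate(scores):
            fused[i] += weights[modality] * score / total_weight
    return fused


def predict(fused: List[float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every label whose fused score passes the threshold."""
    return [label for label, p in zip(LABELS, fused) if p >= threshold]


# Made-up scores standing in for the outputs of three separate models.
example_scores = {
    "audio": [0.2, 0.8, 0.1],
    "visual": [0.6, 0.6, 0.2],
    "speech": [0.1, 0.7, 0.9],
}
equal_weights = {"audio": 1.0, "visual": 1.0, "speech": 1.0}
fused_scores = late_fusion(example_scores, equal_weights)
predicted = predict(fused_scores)
```

In this toy example the speech model alone is confident about the third label, but the fused score falls below the threshold once audio and visual evidence is averaged in. Setting a modality's weight to zero (while leaving at least one positive weight) gives a quick way to compare single-modality and multimodal behavior.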

Additionally, the researchers explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition.

Critical Analysis

The researchers have made significant progress in expanding the scope of AI-assisted autism screening by introducing the novel problem of audio-visual autism behavior recognition. However, the paper does not provide extensive details on the specific behavioral aspects covered in the AV-ASD dataset or how they were selected and annotated.

Additionally, while the researchers demonstrated the benefits of integrating multiple modalities, they did not delve deeply into the challenges and limitations of this approach. For example, the paper does not discuss potential issues with data quality, variability in participant behavior, or the generalizability of the models to diverse populations.

The exploration of the post-hoc to ad-hoc pipeline is an interesting approach, but the paper does not provide a thorough evaluation of its effectiveness in improving the model's explanatory capability. Further research is needed to understand the practical implications and real-world applicability of this technique.

Overall, this research represents an important step forward, but more work is needed to fully address the complexities and challenges of using audio-visual data for comprehensive autism screening and understanding.

Conclusion

This research has introduced a novel problem of audio-visual autism behavior recognition, which leverages both audio and visual cues to recognize a broad range of autism-related behaviors, including those related to social communication and interaction. The researchers have collected a large dataset (AV-ASD) to enable this new research direction and have explored the use of foundation models and multimodal large language models to tackle the problem.

The results demonstrate that integrating multiple modalities, such as audio, visual, and speech, can significantly enhance the performance in autism behavior recognition. Additionally, the researchers have investigated the potential of a post-hoc to ad-hoc pipeline to augment the explanatory capability of the models.

This work represents an important step forward in developing more comprehensive and effective AI-assisted tools for autism screening and understanding. However, further research is needed to address the remaining challenges and limitations, particularly in terms of dataset composition, model generalizability, and the practical application of the proposed techniques.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition

Shijian Deng, Erin E. Kosloski, Siddhi Patel, Zeke A. Barnett, Yiyang Nan, Alexander Kaplan, Sisira Aarukapalli, William T. Doan, Matthew Wang, Harsh Singh, Pamela R. Rollins, Yapeng Tian

In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.

6/6/2024

A Novel Dataset for Video-Based Autism Classification Leveraging Extra-Stimulatory Behavior

Manuel Serna-Aguilera, Xuan Bac Nguyen, Han-Seok Seo, Khoa Luu

Autism Spectrum Disorder (ASD) can affect individuals at varying degrees of intensity, from challenges in overall health, communication, and sensory processing, and this often begins at a young age. Thus, it is critical for medical professionals to be able to accurately diagnose ASD in young children, but doing so is difficult. Deep learning can be responsibly leveraged to improve productivity in addressing this task. The availability of data, however, remains a considerable obstacle. Hence, in this work, we introduce the Video ASD dataset--a dataset that contains video frame convolutional and attention map feature data--to foster further progress in the task of ASD classification. The original videos showcase children reacting to chemo-sensory stimuli, among auditory, touch, and vision. This dataset contains the features of the frames spanning 2,467 videos, for a total of approximately 1.4 million frames. Additionally, head pose angles are included to account for head movement noise, as well as full-sentence text labels for the taste and smell videos that describe how the facial expression changes before, immediately after, and long after interaction with the stimuli. In addition to providing features, we also test foundation models on this data to showcase how movement noise affects performance and the need for more data and more complex labels.

9/10/2024

Versatile audio-visual learning for emotion recognition

Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. It is also a challenge to robustly handle missing or partial information while allowing direct switch between regression or classification tasks. This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression or emotion classification tasks. We implement an audio-visual framework that can be trained even when audio and visual paired data is not available for part of the training set (i.e., audio only or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on the CREMA-D, MSP-IMPROV, and CMU-MOSEI corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus.

7/31/2024

Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder

Marie Huynh (Stanford University), Aaron Kline (Stanford University), Saimourya Surabhi (Stanford University), Kaitlyn Dunlap (Stanford University), Onur Cezmi Mutlu (Stanford University), Mohammadmahdi Honarmand (Stanford University), Parnian Azizian (Stanford University), Peter Washington (University of Hawaii at Manoa), Dennis P. Wall (Stanford University)

Early detection of autism, a neurodevelopmental disorder marked by social communication challenges, is crucial for timely intervention. Recent advancements have utilized naturalistic home videos captured via the mobile application GuessWhat. Through interactive games played between children and their guardians, GuessWhat has amassed over 3,000 structured videos from 382 children, both diagnosed with and without Autism Spectrum Disorder (ASD). This collection provides a robust dataset for training computer vision models to detect ASD-related phenotypic markers, including variations in emotional expression, eye contact, and head movements. We have developed a protocol to curate high-quality videos from this dataset, forming a comprehensive training set. Utilizing this set, we trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%, respectively. To boost diagnostic accuracy, we applied late fusion techniques to create ensemble models, improving the overall AUC to 90%. This approach also yielded more equitable results across different genders and age groups. Our methodology offers a significant step forward in the early detection of ASD by potentially reducing the reliance on subjective assessments and making early identification more accessible and equitable.

8/26/2024