Localizing Moments of Actions in Untrimmed Videos of Infants with Autism Spectrum Disorder

Read original: arXiv:2404.05849 - Published 4/10/2024 by Halil Ismail Helvaci, Sen-ching Samson Cheung, Chen-Nee Chuah, Sally Ozonoff

Overview

This paper presents a method for localizing moments of actions in untrimmed videos of infants with Autism Spectrum Disorder (ASD).
The researchers developed a novel deep learning model that can accurately identify and temporally localize specific actions within longer videos.
The model was trained and evaluated on a new dataset of untrimmed videos of infants, some with ASD and some typically developing.
The results demonstrate the model's ability to pinpoint the start and end times of target actions, which could aid in early ASD diagnosis and intervention.

Plain English Explanation

Researchers have created a new AI system that can watch videos of babies and identify specific actions or behaviors, even in longer, unedited videos. This is particularly useful for studying infants with Autism Spectrum Disorder (ASD), as detecting certain behaviors early on can help with diagnosis and providing support.

The system works by using deep learning, a type of artificial intelligence that can learn patterns from data. The researchers trained the AI on a dataset of videos showing both typically developing infants and infants with ASD. By analyzing these videos, the AI learned to recognize when particular actions or movements occurred, and where they started and ended in the footage.

This allows the system to watch a longer video of a baby and pinpoint the exact moments when they show certain behaviors, like looking at a face, smiling, or vocalizing. Being able to identify these moments precisely, without having to manually review all the footage, could be a big help for researchers and clinicians working with ASD.

The paper reports baseline accuracies between 65% and 79% depending on the behavior, suggesting the approach could become a valuable tool for early ASD detection and monitoring. By automatically analyzing videos, doctors and therapists may be able to more easily spot early signs of developmental differences and get infants the support they need.

Technical Explanation

The researchers developed a deep learning model for temporal action localization (TAL) in untrimmed videos of infants, some with Autism Spectrum Disorder (ASD) and some typically developing. According to the abstract, the model is self-attention based and favors simplicity and efficiency over complex modeling. It takes an untrimmed video as input and outputs the start and end times of target behaviors.

The model was trained and evaluated on a new dataset collected by the researchers, comprising over 100 hours of naturalistic infant videos. Human annotators marked the temporal boundaries of specific infant behaviors; the abstract names four: look face, look object, smile, and vocalization.

The abstract reports baseline results of 70% accuracy for look face, 79% for look object, 72% for smile, and 65% for vocalization. The authors describe the study as the first to conduct end-to-end temporal action localization in untrimmed videos of infants with ASD, which points to its potential for early ASD screening and monitoring.
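This summary does not describe how the model's outputs are decoded or scored, so the following is only a minimal sketch of what "outputting start and end times" can look like in a temporal action localization pipeline. It assumes a model that emits one score per frame, merges consecutive above-threshold frames into segments, and compares a predicted segment with a human annotation using temporal intersection over union (a common TAL measure, not necessarily the one used in the paper). The function names, the threshold, the frame rate, and the scores are all hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def frames_to_segments(scores: List[float], fps: float, threshold: float = 0.5) -> List[Segment]:
    """Merge consecutive frames scoring at or above the threshold into segments."""
    segments: List[Segment] = []
    start: Optional[int] = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new segment opens at this frame
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # segment closes before frame i
            start = None
    if start is not None:
        # The video ended while a segment was still open.
        segments.append((start / fps, len(scores) / fps))
    return segments


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical per-frame scores for one behavior, sampled at 2 frames per second.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.9, 0.2]
predicted = frames_to_segments(scores, fps=2.0)
annotated = (1.0, 3.0)  # hypothetical human annotation, in seconds
overlap = temporal_iou(predicted[0], annotated)
```

With these made-up scores, the first predicted segment runs from 1.0 s to 2.5 s and overlaps the annotated interval with a temporal IoU of 0.75. A real system would typically add smoothing and a minimum segment length before reporting boundaries.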

Critical Analysis

The researchers acknowledge several limitations in their work. First, the dataset, while sizable, may not capture the full diversity of infant behaviors, particularly for the ASD population. Further data collection and annotation efforts could help expand the model's capabilities.

Additionally, the paper does not explore the model's performance on real-world, clinical applications. While the results are promising, more research is needed to understand how well the system would function in practical settings, such as supporting clinicians in their ASD assessments.

Another potential concern is the interpretability of the model's inner workings. As a deep learning system, it may be challenging to fully explain how it arrives at its predictions. Developing more transparent and explainable models could be an important next step.

Overall, this work represents a valuable step forward in leveraging computer vision and audio analysis to assist in early ASD detection and intervention. With further refinement and validation, the proposed approach could become a powerful tool in the hands of clinicians and researchers.

Conclusion

This paper presents a novel deep learning model for localizing moments of actions in untrimmed videos of infants, including those with Autism Spectrum Disorder. The model's ability to pinpoint the start and end times of target behaviors, such as looking at a face or smiling, could aid in early ASD diagnosis and monitoring.

While the research has some limitations, it demonstrates the potential for advanced AI techniques to transform the way clinicians and researchers analyze infant development. By automating the detection of key behavioral markers, this work could lead to more efficient and effective interventions for children with ASD, ultimately improving their long-term outcomes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Localizing Moments of Actions in Untrimmed Videos of Infants with Autism Spectrum Disorder

Halil Ismail Helvaci, Sen-ching Samson Cheung, Chen-Nee Chuah, Sally Ozonoff

Autism Spectrum Disorder (ASD) presents significant challenges in early diagnosis and intervention, impacting children and their families. With prevalence rates rising, there is a critical need for accessible and efficient screening tools. Leveraging machine learning (ML) techniques, in particular Temporal Action Localization (TAL), holds promise for automating ASD screening. This paper introduces a self-attention based TAL model designed to identify ASD-related behaviors in infant videos. Unlike existing methods, our approach simplifies complex modeling and emphasizes efficiency, which is essential for practical deployment in real-world scenarios. Importantly, this work underscores the importance of developing computer vision methods capable of operating in naturalistic environments with little equipment control, addressing key challenges in ASD screening. This study is the first to conduct end-to-end temporal action localization in untrimmed videos of infants with ASD, offering promising avenues for early intervention and support. We report baseline results of behavior detection using our TAL model. We achieve 70% accuracy for look face, 79% accuracy for look object, 72% for smile and 65% for vocalization.

4/10/2024

Test-Time Zero-Shot Temporal Action Localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.

4/12/2024

Open-vocabulary Temporal Action Localization using VLMs

Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Video action localization aims to find the timing of a specific action in a long video. Although existing learning-based approaches have been successful, they require annotated videos, which come with a considerable labor cost. This paper proposes a learning-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLM). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames into a concatenated image with frame index labels, making a VLM guess the frame that is considered to be closest to the start/end of the action. Iterating this process while narrowing the sampling time window results in finding the specific start and end frames of an action. We demonstrate that this sampling technique yields reasonable results, illustrating a practical extension of VLMs for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/.

9/10/2024
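The iterative narrowing described in the abstract above can be illustrated with a small sketch. This is not the authors' code: the VLM is replaced by a hypothetical `pick_closest` callback that returns the index of the sampled timestamp it considers closest to the action boundary, and the number of samples, shrink factor, and iteration count are assumed values chosen for illustration.

```python
from typing import Callable, List


def localize_boundary(
    duration: float,
    pick_closest: Callable[[List[float]], int],
    num_samples: int = 5,
    shrink: float = 0.5,
    iterations: int = 6,
) -> float:
    """Repeatedly sample a window and re-center it on the chosen timestamp."""
    center, width = duration / 2.0, duration
    for _ in range(iterations):
        # Clamp the current window to the video's time range.
        lo = max(0.0, center - width / 2.0)
        hi = min(duration, center + width / 2.0)
        step = (hi - lo) / (num_samples - 1)
        times = [lo + k * step for k in range(num_samples)]
        # In the paper's setting a VLM would make this choice from labeled frames.
        center = times[pick_closest(times)]
        width *= shrink  # narrow the window for the next round
    return center


# Stand-in for the VLM: pick the sampled time nearest a known start at 37 s.
true_start = 37.0
estimate = localize_boundary(
    120.0,
    lambda ts: min(range(len(ts)), key=lambda k: abs(ts[k] - true_start)),
)
```

In this toy run on a 120-second video, the estimate lands within one second of the assumed true start. The same loop would be run a second time to find the end of the action.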

Ensemble Modeling of Multiple Physical Indicators to Dynamically Phenotype Autism Spectrum Disorder

Marie Huynh (Stanford University), Aaron Kline (Stanford University), Saimourya Surabhi (Stanford University), Kaitlyn Dunlap (Stanford University), Onur Cezmi Mutlu (Stanford University), Mohammadmahdi Honarmand (Stanford University), Parnian Azizian (Stanford University), Peter Washington (University of Hawaii at Manoa), Dennis P. Wall (Stanford University)

Early detection of autism, a neurodevelopmental disorder marked by social communication challenges, is crucial for timely intervention. Recent advancements have utilized naturalistic home videos captured via the mobile application GuessWhat. Through interactive games played between children and their guardians, GuessWhat has amassed over 3,000 structured videos from 382 children, both diagnosed with and without Autism Spectrum Disorder (ASD). This collection provides a robust dataset for training computer vision models to detect ASD-related phenotypic markers, including variations in emotional expression, eye contact, and head movements. We have developed a protocol to curate high-quality videos from this dataset, forming a comprehensive training set. Utilizing this set, we trained individual LSTM-based models using eye gaze, head positions, and facial landmarks as input features, achieving test AUCs of 86%, 67%, and 78%, respectively. To boost diagnostic accuracy, we applied late fusion techniques to create ensemble models, improving the overall AUC to 90%. This approach also yielded more equitable results across different genders and age groups. Our methodology offers a significant step forward in the early detection of ASD by potentially reducing the reliance on subjective assessments and making early identification more accessible and equitable.

8/26/2024