Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals

Read original: arXiv:2309.05927 - Published 4/22/2024 by Ran Liu, Ellen L. Zippi, Hadi Pouransari, Chris Sandino, Jingping Nie, Hanlin Goh, Erdrin Azemi, Ali Moin

Overview

Multimodal biosignals, such as physiological and behavioral data, are crucial for understanding people's physical and mental states.
However, these multimodal signals often exhibit significant distributional shifts between pretraining and inference datasets, due to changes in task specifications or variations in modality compositions.
To address this challenge, the researchers propose a frequency-aware masked autoencoder (bioFAME) that learns to represent biosignals in the frequency domain.

Plain English Explanation

Monitoring biological signals such as heart rate and brain activity can provide valuable insight into a person's physical and mental well-being. This kind of "multimodal" data, which combines several signal types, is useful for understanding and tracking an individual's overall health and behavior.

However, the data collected from these different sensors can sometimes look quite different when used for training AI models versus when those models are applied in the real world. This is because the original training data may have been collected in a specific setting, while the real-world data might come from a different context with different characteristics.

To overcome this challenge, the researchers developed a new AI model called bioFAME. This model learns to represent biological signals in terms of their underlying frequencies rather than only as raw time-domain samples. Because it focuses on frequency content, the model can better handle variations in the length or sampling rate of the input signals.

The key idea is that important physiological patterns show up in the frequency domain and not only in the time domain. By explicitly modeling frequency information, bioFAME can learn more robust and generalizable representations of multimodal biosignals. This allows the model to be used across a wider range of applications and scenarios, even when the characteristics of the input data change.
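
As a small illustration of why a frequency view tolerates changes in length and sampling rate, the sketch below samples the same 10 Hz sine wave at two different rates and durations and recovers the same dominant frequency from both. It is not taken from the paper: the helper names (dft_magnitudes, dominant_frequency_hz, sine), the 10 Hz tone, and the sampling rates are all assumptions chosen for the example, and it uses a naive DFT from the Python standard library for clarity, not speed.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for bins 0..N//2 (naive O(N^2))."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency_hz(signal, sampling_rate_hz):
    """Physical frequency (Hz) of the strongest non-DC bin."""
    mags = dft_magnitudes(signal)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])  # skip the DC bin
    # Bin k corresponds to k * sampling_rate / N hertz.
    return peak_bin * sampling_rate_hz / len(signal)


def sine(freq_hz, sampling_rate_hz, duration_s):
    """Sample a unit-amplitude sine wave (hypothetical test signal)."""
    n = int(round(sampling_rate_hz * duration_s))
    return [math.sin(2 * math.pi * freq_hz * t / sampling_rate_hz) for t in range(n)]


# Same underlying rhythm, recorded with different rates and lengths.
slow = sine(10.0, 100.0, 2.0)  # 200 samples at 100 Hz
fast = sine(10.0, 250.0, 1.0)  # 250 samples at 250 Hz

f_slow = dominant_frequency_hz(slow, 100.0)
f_fast = dominant_frequency_hz(fast, 250.0)
print(f_slow, f_fast)
```

The two recordings have different sample counts, yet the peak lands at the same physical frequency once the bin index is scaled by the sampling rate and divided by the length.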

Technical Explanation

The researchers propose the frequency-aware masked autoencoder (bioFAME) to address the challenge of distributional shifts in multimodal biosignals. The core components of bioFAME include:

Frequency-Aware Transformer: This module uses a fixed-size Fourier-based operator to enable global token mixing in the frequency domain, independently of the length and sampling rate of the input signals.
Frequency-Maintain Pretraining: To preserve the important frequency components within each input channel, bioFAME employs a pretraining strategy that performs masked autoencoding directly in the latent space.
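
One way to picture a fixed-size Fourier-based operator is a filter that holds a fixed number of weights, one per low-frequency mode, and applies them to the spectrum of a token sequence of any length. The sketch below is a simplified, single-channel illustration under that assumption and is not the bioFAME layer itself; rdft, irdft, fourier_token_mixing, and the example weights are hypothetical names and values.

```python
import cmath
import math


def rdft(x):
    """Naive DFT of a real sequence, returning bins 0..N//2."""
    n = len(x)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
        for k in range(n // 2 + 1)
    ]


def irdft(spectrum, n):
    """Inverse of rdft: rebuild a real length-n sequence from bins 0..N//2."""
    out = []
    for t in range(n):
        total = spectrum[0].real
        for k in range(1, len(spectrum)):
            term = spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
            # Every bin except DC and (for even n) Nyquist has a mirrored twin.
            weight = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
            total += weight * term.real
        out.append(total / n)
    return out


def fourier_token_mixing(tokens, mode_weights):
    """Scale the lowest len(mode_weights) frequency modes and zero the rest.

    The number of weights is fixed and does not depend on len(tokens).
    """
    spectrum = rdft(tokens)
    kept = min(len(mode_weights), len(spectrum))
    filtered = [
        spectrum[k] * mode_weights[k] if k < kept else 0j
        for k in range(len(spectrum))
    ]
    return irdft(filtered, len(tokens))


# Four fixed weights (assumed values; in a model these would be learned).
weights = [1.0, 0.5, 0.25, 0.125]

short_tokens = [math.sin(0.4 * t) for t in range(16)]
long_tokens = [math.sin(0.4 * t) for t in range(50)]

mixed_short = fourier_token_mixing(short_tokens, weights)
mixed_long = fourier_token_mixing(long_tokens, weights)
print(len(mixed_short), len(mixed_long))
```

The same four weights are reused for sequences of 16 and 50 tokens, and the output always has the input's length. Because each output token is rebuilt from the whole spectrum, it can depend on every input token, which is the sense in which the mixing is global. Note that in this simplified version a given mode index maps to a different physical frequency when the length or sampling rate changes.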

By operating in the frequency domain, bioFAME can effectively leverage multimodal information during pretraining, and then be seamlessly adapted to diverse tasks and modalities at test time, regardless of input size and order.
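
The masked autoencoding objective mentioned above can be sketched as follows: hide a random subset of latent tokens, let a decoder predict them, and score the reconstruction only on the hidden positions. This is a generic masked-autoencoder illustration, not the paper's training code; random_mask, masked_mse, the 50% mask ratio, and the zero-predicting stand-in decoder are assumptions made for the example.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted (visible, masked) index lists for the given mask ratio."""
    num_masked = int(round(num_tokens * mask_ratio))
    shuffled = list(range(num_tokens))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked


def masked_mse(targets, predictions, masked):
    """Mean squared error computed over masked token positions only."""
    if not masked:
        return 0.0
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


rng = random.Random(0)

# Eight latent tokens of width 2 (made-up values standing in for encoder output).
latents = [[float(i), 0.5 * float(i)] for i in range(8)]
visible, masked = random_mask(len(latents), 0.5, rng)

# Stand-in "decoder": copies visible tokens and predicts zeros for hidden ones.
predictions = [
    latents[i] if i in visible else [0.0, 0.0] for i in range(len(latents))
]

loss = masked_mse(latents, predictions, masked)
print(visible, masked, loss)
```

In a real model the stand-in decoder would be replaced by a trained network, and the loss would be minimized so that hidden latent tokens can be inferred from the visible ones.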

The researchers evaluate bioFAME in a range of transfer learning experiments on unimodal time series data and report an average improvement of 5.5% in classification accuracy over previous state-of-the-art models. They also show that bioFAME is robust in modality mismatch scenarios, such as unexpected modality dropout or substitution, which supports its practical utility in real-world applications.
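
This summary does not say how bioFAME combines modalities, so the following is only one simple way to obtain a representation that ignores modality order and tolerates a missing modality: average the per-modality embeddings that are present. The function fuse_modalities, the modality names (eeg, emg, eog), and the embedding values are hypothetical.

```python
def fuse_modalities(embeddings_by_modality):
    """Average the embeddings of whichever modalities are present.

    A value of None marks a dropped modality. Mean pooling is an assumption
    made for this sketch, not the fusion method described in the paper.
    """
    present = [e for e in embeddings_by_modality.values() if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    width = len(present[0])
    return [sum(e[d] for e in present) / len(present) for d in range(width)]


full = fuse_modalities({"eeg": [1.0, 2.0], "emg": [3.0, 4.0], "eog": [5.0, 6.0]})
reordered = fuse_modalities({"eog": [5.0, 6.0], "eeg": [1.0, 2.0], "emg": [3.0, 4.0]})
dropped = fuse_modalities({"eeg": [1.0, 2.0], "emg": None, "eog": [5.0, 6.0]})
print(full, reordered, dropped)
```

The fused vector has the same width whether two or three modalities are supplied, and reordering the inputs does not change it, so a downstream classifier can consume it unchanged.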

Critical Analysis

The paper presents a well-designed and thoroughly evaluated approach for leveraging multimodal biosignals in a way that is robust to distributional shifts. The key strength of the bioFAME model is its ability to learn representations in the frequency domain, which allows it to handle variations in signal characteristics that often arise in real-world scenarios.

One potential limitation, however, is the reliance on a fixed-size Fourier-based operator for the transformer module. While this design choice enables efficient global token mixing, it may limit the model's ability to capture more complex frequency-domain relationships. Exploring alternative frequency-aware transformer architectures, such as those based on learnable spectral filters, could be an interesting direction for future research.

Additionally, the paper does not provide a detailed analysis of the types of distributional shifts that the bioFAME model is most effective at handling. Further investigation into the specific scenarios where the frequency-domain approach shines would help practitioners better understand the scope and limitations of the proposed solution.

Overall, the bioFAME model represents a valuable contribution to the field of multimodal biosignal processing, demonstrating the benefits of frequency-aware representations for improving model robustness and generalization. As the researchers suggest, the techniques employed in this work could potentially be extended to other domains beyond biosignals, making it an intriguing area for future exploration.

Conclusion

The frequency-aware masked autoencoder (bioFAME) proposed in this paper provides a robust and effective approach for leveraging multimodal biosignals, even in the presence of significant distributional shifts between pretraining and inference datasets.

By focusing on the frequency domain representation of the input signals, bioFAME can learn generalizable and adaptable models that maintain performance across a wide range of real-world applications. The researchers have demonstrated the practical utility of this approach through extensive experiments, showcasing its ability to outperform previous state-of-the-art methods.

As the use of multimodal biosensors continues to grow, solutions like bioFAME will become increasingly important for building comprehensive and reliable systems that can accurately monitor and understand human physical and mental states. This work represents an important step forward in this direction, paving the way for further advancements in the field of multimodal signal processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals

Ran Liu, Ellen L. Zippi, Hadi Pouransari, Chris Sandino, Jingping Nie, Hanlin Goh, Erdrin Azemi, Ali Moin

Leveraging multimodal information from biosignals is vital for building a comprehensive representation of people's physical and mental states. However, multimodal biosignals often exhibit substantial distributional shifts between pretraining and inference datasets, stemming from changes in task specification or variations in modality compositions. To achieve effective pretraining in the presence of potential distributional shifts, we propose a frequency-aware masked autoencoder (bioFAME) that learns to parameterize the representation of biosignals in the frequency space. bioFAME incorporates a frequency-aware transformer, which leverages a fixed-size Fourier-based operator for global token mixing, independent of the length and sampling rate of inputs. To maintain the frequency components within each input channel, we further employ a frequency-maintain pretraining strategy that performs masked autoencoding in the latent space. The resulting architecture effectively utilizes multimodal information during pretraining, and can be seamlessly adapted to diverse tasks and modalities at test time, regardless of input size and order. We evaluated our approach on a diverse set of transfer experiments on unimodal time series, achieving an average of 5.5% improvement in classification accuracy over the previous state-of-the-art. Furthermore, we demonstrated that our architecture is robust in modality mismatch scenarios, including unpredicted modality dropout or substitution, proving its practical utility in real-world applications. Code is available at https://github.com/apple/ml-famae.

4/22/2024

Neuro-BERT: Rethinking Masked Autoencoding for Self-supervised Neurological Pretraining

Di Wu, Siyuan Li, Jie Yang, Mohamad Sawan

Deep learning associated with neurological signals is poised to drive major advancements in diverse fields such as medical diagnostics, neurorehabilitation, and brain-computer interfaces. The challenge in harnessing the full potential of these signals lies in the dependency on extensive, high-quality annotated data, which is often scarce and expensive to acquire, requiring specialized infrastructure and domain expertise. To address the appetite for data in deep learning, we present Neuro-BERT, a self-supervised pre-training framework of neurological signals based on masked autoencoding in the Fourier domain. The intuition behind our approach is simple: frequency and phase distribution of neurological signals can reveal intricate neurological activities. We propose a novel pre-training task dubbed Fourier Inversion Prediction (FIP), which randomly masks out a portion of the input signal and then predicts the missing information using the Fourier inversion theorem. Pre-trained models can be potentially used for various downstream tasks such as sleep stage classification and gesture recognition. Unlike contrastive-based methods, which strongly rely on carefully hand-crafted augmentations and siamese structure, our approach works reasonably well with a simple transformer encoder with no augmentation requirements. By evaluating our method on several benchmark datasets, we show that Neuro-BERT improves downstream neurological-related tasks by a large margin.

7/8/2024

Spatio-Temporal Encoding of Brain Dynamics with Surface Masked Autoencoders

Simon Dahan, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Emma C. Robinson

The development of robust and generalisable models for encoding the spatio-temporal dynamics of human brain activity is crucial for advancing neuroscientific discoveries. However, significant individual variation in the organisation of the human cerebral cortex makes it difficult to identify population-level trends in these signals. Recently, Surface Vision Transformers (SiTs) have emerged as a promising approach for modelling cortical signals, yet they face some limitations in low-data scenarios due to the lack of inductive biases in their architecture. To address these challenges, this paper proposes the surface Masked AutoEncoder (sMAE) and video surface Masked AutoEncoder (vsMAE) for multivariate and spatio-temporal pre-training of cortical signals over regular icosahedral grids. These models are trained to reconstruct cortical feature maps from masked versions of the input by learning strong latent representations of cortical structure and function. Such representations translate into better modelling of individual phenotypes and enhanced performance in downstream tasks. The proposed approach was evaluated on cortical phenotype regression using data from the young adult Human Connectome Project (HCP) and developing HCP (dHCP). Results show that (v)sMAE pre-trained models improve phenotyping prediction performance on multiple tasks by ≥26%, and offer faster convergence relative to models trained from scratch. Finally, we show that pre-training vision transformers on large datasets, such as the UK Biobank (UKB), supports transfer learning to low-data regimes. Our code and pre-trained models are publicly available at https://github.com/metrics-lab/surface-masked-autoencoders.

6/12/2024

MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, MultiMAE-DER is accomplished through simple, straightforward finetuning. The performance of MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on the CREMA-D dataset. Furthermore, when compared with the state-of-the-art model of multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.

5/17/2024