Multi-View Spectrogram Transformer for Respiratory Sound Classification

2311.09655

Published 5/31/2024 by Wentao He, Yuchen Yan, Jianfeng Ren, Ruibin Bai, Xudong Jiang

Abstract

Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image while overlooking its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into different sized patches, representing the multi-view acoustic elements of a respiratory sound. These patches and positional embeddings are then fed into transformer encoders to extract the attentional information among patches through a self-attention mechanism. Finally, a gated fusion scheme is designed to automatically weigh the multi-view features to highlight the best one in a specific scenario. Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.

Overview

  • This paper proposes a Multi-View Spectrogram Transformer (MVST) for classifying respiratory sounds from audio spectrograms.
  • Existing models often treat the spectrogram as a synthetic image, overlooking its physical characteristics.
  • The MVST embeds different views of time-frequency characteristics into a vision transformer to better capture the acoustic elements of respiratory sounds.

Plain English Explanation

The paper describes a new way to analyze audio recordings of breathing sounds using a type of artificial intelligence called a "transformer." Typically, when analyzing these types of audio recordings, researchers will convert the audio into a visual representation called a "spectrogram," which shows how the different frequencies in the sound change over time.

However, the authors argue that existing models that use spectrograms often don't fully account for the physical properties of the sound. To address this, the Multi-View Spectrogram Transformer (MVST) splits the spectrogram into "patches" of different sizes, each representing a different aspect of the acoustic elements in the respiratory sound. These patches are then fed into a transformer, which can identify important patterns and relationships between the different parts of the spectrogram.

By taking this multi-view approach and using a transformer architecture, the researchers were able to significantly outperform existing methods for classifying different types of respiratory sounds, as demonstrated on the ICBHI dataset. This suggests that their MVST model captures the nuanced characteristics of respiratory sounds better than previous techniques that treated the spectrogram more simplistically.

Technical Explanation

The Multi-View Spectrogram Transformer (MVST) proposed in this paper aims to better leverage the physical properties of audio spectrograms for respiratory sound classification. Unlike previous approaches that treated the spectrogram as a synthetic image, the MVST splits the mel-spectrogram into patches of different sizes to represent various time-frequency characteristics of the acoustic elements.
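The multi-view patching idea can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the spectrogram shape, the three patch sizes, the view names, and the helper `split_into_patches` are all assumptions made for the example, since this summary does not state the actual values.

```python
import numpy as np

def split_into_patches(spec, patch_f, patch_t):
    """Cut a (freq, time) spectrogram into non-overlapping, flattened patches."""
    n_f = spec.shape[0] // patch_f
    n_t = spec.shape[1] // patch_t
    # Crop any remainder so the patch grid divides evenly (a choice of this sketch).
    cropped = spec[:n_f * patch_f, :n_t * patch_t]
    # Axes after reshape: (freq block, freq offset, time block, time offset).
    grid = cropped.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return grid.reshape(n_f * n_t, patch_f * patch_t)

# Hypothetical mel-spectrogram: 128 mel bins x 256 time frames.
spec = np.arange(128 * 256, dtype=np.float32).reshape(128, 256)

# Hypothetical view sizes as (freq, time); the paper's real sizes are not given here.
views = {"square": (16, 16), "wide_time": (8, 32), "tall_freq": (32, 8)}
patches = {name: split_into_patches(spec, pf, pt) for name, (pf, pt) in views.items()}

for name, p in patches.items():
    print(name, p.shape)
```

A patch that is wide in time and narrow in frequency covers a longer stretch of a few mel bands, while a tall patch covers many bands over a short interval, which is one way to read "different views of time-frequency characteristics".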

These multi-view patches, along with positional embeddings, are then fed into transformer encoders. The transformer's self-attention mechanism allows the model to extract important relational information between the different acoustic patches. Finally, a gated fusion scheme is used to automatically weight the contributions of the multi-view features, highlighting the most relevant aspects for a given classification scenario.
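As a rough illustration of these two steps, the sketch below applies single-head scaled dot-product self-attention to each view's tokens and then combines the pooled view features with a softmax gate. Several things here are assumptions rather than details from the paper: the gate is a simple learned vector `w_gate`, the views share one set of attention weights for brevity, positional embeddings are omitted, and all weights are random instead of trained.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n_tokens, d) inputs."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def gated_fusion(view_features, w_gate):
    """Weigh per-view feature vectors with a softmax gate and sum them.

    view_features: (n_views, d); w_gate: (d,) hypothetical gate parameters.
    """
    gates = softmax(view_features @ w_gate)  # one weight per view, summing to 1
    fused = gates @ view_features            # weighted sum, shape (d,)
    return fused, gates

rng = np.random.default_rng(0)
d = 32
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
w_gate = rng.normal(size=d)

# Three hypothetical views with different numbers of patch tokens.
view_tokens = [rng.normal(size=(n, d)) for n in (128, 64, 256)]

# Mean-pool the attended tokens into one feature vector per view.
view_features = np.stack(
    [self_attention(t, w_q, w_k, w_v).mean(axis=0) for t in view_tokens]
)
fused, gates = gated_fusion(view_features, w_gate)
print("gate weights per view:", gates)
```

Because the gate weights sum to one, a view whose features score highly under the gate dominates the fused representation, which matches the idea of highlighting the most relevant view for a given input.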

The researchers evaluated the MVST on the ICBHI dataset, and found that it significantly outperformed state-of-the-art methods for classifying respiratory sounds. This suggests that the MVST's ability to better capture the physical characteristics of the spectrogram, through its multi-view and transformer-based approach, is a key advantage over previous techniques.

Critical Analysis

The paper presents a novel and promising approach for respiratory sound classification using audio spectrograms. By explicitly modeling the multi-view, time-frequency characteristics of the spectrogram, the MVST seems to offer advantages over methods that treat the spectrogram more simplistically as a synthetic image.

However, the paper does not provide a detailed analysis of the limitations or potential issues with the MVST approach. For example, it is unclear how the MVST model would perform on more diverse or challenging respiratory sound datasets beyond the ICBHI dataset used in the experiments. Additionally, the paper does not discuss the computational complexity or inference time of the MVST, which could be important considerations for real-world deployment.

Further research could also explore the interpretability of the MVST's internal representations and attention mechanisms, as this could provide valuable insights into how the model is making its classification decisions. Comparing the MVST's performance to other transformer-based approaches for audio classification, such as SleepVST or 3D Convolution-Guided Spectral-Spatial Transformer, could also help situate the MVST's capabilities within the broader context of transformer-based audio processing.

Conclusion

The Multi-View Spectrogram Transformer (MVST) proposed in this paper represents a promising approach for respiratory sound classification from audio spectrograms. By explicitly modeling the multi-view, time-frequency characteristics of the spectrogram using a transformer-based architecture, the MVST was able to significantly outperform existing methods on the ICBHI dataset.

This work highlights the importance of considering the physical properties of audio data, rather than simply treating spectrograms as synthetic images. The MVST's ability to capture these nuanced acoustic elements could have valuable applications in respiratory health monitoring and diagnosis. Further research to explore the MVST's generalization, interpretability, and performance on a wider range of respiratory sound datasets could help solidify its potential contributions to the field.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Abnormal Respiratory Sound Identification Using Audio-Spectrogram Vision Transformer

Whenty Ariyanti, Kai-Chun Liu, Kuan-Yu Chen, Yu Tsao

Respiratory disease, the third leading cause of deaths globally, is considered a high-priority ailment requiring significant research on identification and treatment. Stethoscope-recorded lung sounds and artificial intelligence-powered devices have been used to identify lung disorders and aid specialists in making accurate diagnoses. In this study, audio-spectrogram vision transformer (AS-ViT), a new approach for identifying abnormal respiration sounds, was developed. The sounds of the lungs are converted into visual representations called spectrograms using a technique called short-time Fourier transform (STFT). These images are then analyzed using a model called vision transformer to identify different types of respiratory sounds. The classification was carried out using the ICBHI 2017 database, which includes various types of lung sounds with different frequencies, noise levels, and backgrounds. The proposed AS-ViT method was evaluated using three metrics and achieved 79.1% and 59.8% for 60:40 split ratio and 86.4% and 69.3% for 80:20 split ratio in terms of unweighted average recall and overall scores respectively for respiratory sound detection, surpassing previous state-of-the-art results.

5/15/2024

SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers

Jonathan F. Carter, João Jorge, Oliver Gibson, Lionel Tarassenko

Advances in camera-based physiological monitoring have enabled the robust, non-contact measurement of respiration and the cardiac pulse, which are known to be indicative of the sleep stage. This has led to research into camera-based sleep monitoring as a promising alternative to gold-standard polysomnography, which is cumbersome, expensive to administer, and hence unsuitable for longer-term clinical studies. In this paper, we introduce SleepVST, a transformer model which enables state-of-the-art performance in camera-based sleep stage classification (sleep staging). After pre-training on contact sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of 0.75 and 0.77 respectively. We then show that SleepVST can be successfully transferred to cardio-respiratory waveforms extracted from video, enabling fully contact-free sleep staging. Using a video dataset of 50 nights, we achieve a total accuracy of 78.8% and a Cohen's kappa of 0.71 in four-class video-based sleep staging, setting a new state-of-the-art in the domain.

4/8/2024

BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification

June-Woo Kim, Miika Toikkanen, Yera Choi, Seoung-Eun Moon, Ho-Young Jung

Respiratory sound classification (RSC) is challenging due to varied acoustic signatures, primarily influenced by patient demographics and recording environments. To address this issue, we introduce a text-audio multimodal model that utilizes metadata of respiratory sounds, which provides useful complementary information for RSC. Specifically, we fine-tune a pretrained text-audio multimodal model using free-text descriptions derived from the sound samples' metadata, which includes the gender and age of patients, type of recording devices, and recording location on the patient's body. Our method achieves state-of-the-art performance on the ICBHI dataset, surpassing the previous best result by a notable margin of 1.17%. This result validates the effectiveness of leveraging metadata and respiratory sound samples in enhancing RSC performance. Additionally, we investigate the model performance in the case where metadata is partially unavailable, which may occur in real-world clinical settings.

6/17/2024

Efficient Multi-View Fusion and Flexible Adaptation to View Missing in Cardiovascular System Signals

Qihan Hu, Daomiao Wang, Hong Wu, Jian Liu, Cuiwei Yang

The progression of deep learning and the widespread adoption of sensors have facilitated automatic multi-view fusion (MVF) about the cardiovascular system (CVS) signals. However, prevalent MVF model architecture often amalgamates CVS signals from the same temporal step but different views into a unified representation, disregarding the asynchronous nature of cardiovascular events and the inherent heterogeneity across views, leading to catastrophic view confusion. Efficient training strategies specifically tailored for MVF models to attain comprehensive representations need simultaneous consideration. Crucially, real-world data frequently arrives with incomplete views, an aspect rarely noticed by researchers. Thus, the View-Centric Transformer (VCT) and Multitask Masked Autoencoder (M2AE) are specifically designed to emphasize the centrality of each view and harness unlabeled data to achieve superior fused representations. Additionally, we systematically define the missing-view problem for the first time and introduce prompt techniques to aid pretrained MVF models in flexibly adapting to various missing-view scenarios. Rigorous experiments involving atrial fibrillation detection, blood pressure estimation, and sleep staging (typical health monitoring tasks) demonstrate the remarkable advantage of our method in MVF compared to prevailing methodologies. Notably, the prompt technique requires finetuning less than 3% of the entire model's data, substantially fortifying the model's resilience to view missing while circumventing the need for complete retraining. The results demonstrate the effectiveness of our approaches, highlighting their potential for practical applications in cardiovascular health monitoring. Codes and models are released at URL.

6/14/2024