DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation

Read original: arXiv:2409.08610 - Published 9/16/2024 by Ziqian Wang, Jiayao Sun, Zihan Zhang, Xingchen Li, Jie Liu, Lei Xie

Overview

  • Provides guidelines for authors submitting manuscripts to the SLT 2024 conference
  • Covers formatting requirements, title section details, and other key submission information

Plain English Explanation

The provided document outlines the guidelines for authors who want to submit a manuscript to the SLT 2024 conference. It covers the expected formatting of the paper, including details about the title section. The guidelines ensure a consistent structure and presentation for all submissions, which helps the conference organizers review the papers effectively.

Technical Explanation

The paper describes the required formatting for manuscripts submitted to the SLT 2024 conference. This includes specifications for the page layout, font styles, and section structure. The title section must include the paper title, author names, affiliations, and contact information. Other sections cover the abstract, body text, references, and supplementary materials. These guidelines ensure a consistent presentation across all accepted papers, facilitating the review process for the conference organizers.

Critical Analysis

The guidelines provided seem comprehensive and well-structured to ensure a professional and organized submission process for the SLT 2024 conference. The clear formatting rules and title section requirements help maintain a high standard for all papers. However, the guidelines do not address potential issues like author anonymity during the review process or the handling of sensitive data or materials. Additionally, the guidelines could be improved by providing more detailed instructions for incorporating figures, tables, and mathematical equations into the manuscript.

Conclusion

These author guidelines establish a solid framework for submitting manuscripts to the SLT 2024 conference. By standardizing the formatting and structure of papers, the guidelines promote a consistent and efficient review process. The detailed specifications cover the key elements needed for a successful submission, helping authors prepare their work in the expected format. Overall, these guidelines support the conference's goal of showcasing high-quality research in the field of speech and language technology.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation

Ziqian Wang, Jiayao Sun, Zihan Zhang, Xingchen Li, Jie Liu, Lei Xie

Advancements in deep learning and voice-activated technologies have driven the development of human-vehicle interaction. Distributed microphone arrays are widely used in in-car scenarios because they can accurately capture the voices of passengers from different speech zones. However, the increase in the number of audio channels, coupled with the limited computational resources and low latency requirements of in-car systems, presents challenges for in-car multi-channel speech separation. To mitigate these problems, we propose a lightweight framework that cascades digital signal processing (DSP) and neural networks (NN). We utilize fixed beamforming (BF) to reduce computational costs and independent vector analysis (IVA) to provide a spatial prior. We employ dual encoders for dual-branch modeling, with a spatial encoder capturing spatial cues and a spectral encoder preserving spectral information, facilitating spatial-spectral fusion. Our proposed system supports both streaming and non-streaming modes. Experimental results demonstrate the superiority of the proposed system across various metrics. With only 0.83M parameters and a real-time factor (RTF) of 0.39 on an Intel Core i7 (2.6GHz) CPU, it effectively separates speech into distinct speech zones. Our demos are available at https://honee-w.github.io/DualSep/.
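
The real-time factor quoted in the abstract is conventionally the processing time divided by the duration of the audio processed, so a value below 1 means the system runs faster than real time. The Python sketch below illustrates only that ratio. The function names and the placeholder workload are assumptions made for illustration and are not taken from the DualSep paper or its code.

```python
import time
from typing import Callable, Sequence


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process: Callable[[Sequence[float]], object],
                samples: Sequence[float],
                sample_rate: int) -> float:
    """Time one call to `process` and return its real-time factor.

    `process` is a hypothetical stand-in for a separation model.
    """
    if sample_rate <= 0 or len(samples) == 0:
        raise ValueError("need a positive sample rate and non-empty audio")
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    process(samples)
    elapsed = time.perf_counter() - start
    return rtf(elapsed, audio_seconds)


if __name__ == "__main__":
    # Placeholder workload (assumption): sum of squares over 1 s of silence.
    one_second = [0.0] * 16000
    value = measure_rtf(lambda x: sum(v * v for v in x), one_second, 16000)
    print(f"RTF of placeholder workload: {value:.4f}")
```

Under this definition, an RTF of 0.39 corresponds to roughly 3.9 seconds of compute for every 10 seconds of audio on the stated CPU.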

9/16/2024

A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

Thomas Serre (S2A, IDS), Mathieu Fontaine (S2A, IDS), Éric Benhaim (S2A, IDS), Geoffroy Dutour (S2A, IDS), Slim Essid (S2A, IDS)

Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, a SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while preserving minimal computational overhead.

4/15/2024

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

Since the success of time-domain speech separation, further improvements have been made by expanding the length and channel of a feature sequence to increase the amount of computation. When temporally expanded to a long sequence, the feature is segmented into chunks as a dual-path model in most studies of speech separation. In particular, it is common for the process of separating features corresponding to each speaker to be located in the final stage of the network. However, it is more advantageous and intuitive to proactively expand the feature sequence to include the number of speakers as an extra dimension. In this paper, we present an asymmetric strategy in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and the output of the encoder is split into the number of speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, as a Siamese network, in addition to cross-speaker processing. By using the Siamese network in the decoder, without using speaker information, the network directly learns to discriminate the features using a separation objective. With a common split layer, intermediate encoder features for skip connections are also split for the reconstruction decoder based on the U-Net structure. In addition, instead of segmenting the feature into chunks as in dual-path models, we design global and local Transformer blocks to directly process long sequences. The experimental results demonstrated that this separation-and-reconstruction framework is effective and that the combination of the proposed global and local Transformer blocks can sufficiently replace the role of inter- and intra-chunk processing in the dual-path structure. Finally, the presented model including both of these achieved state-of-the-art performance with less computation than before on various benchmark datasets.

6/11/2024

End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Giovanni Morrone, Samuele Cornell, Luca Serafini, Enrico Zovato, Alessio Brutti, Stefano Squartini

Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Part 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
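
The separate-then-detect idea described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the separated streams are taken as given, and a simple frame-energy threshold stands in for the VAD module. The abstract does not specify the actual VAD, so every function name and parameter here is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def frame_energy_vad(stream: Sequence[float], frame_len: int,
                     threshold: float) -> List[bool]:
    """Mark each full frame as active if its mean energy exceeds threshold.

    Assumption: an energy threshold replaces the paper's VAD module.
    """
    flags = []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


def flags_to_segments(flags: Sequence[bool], frame_len: int,
                      sample_rate: int) -> List[Tuple[float, float]]:
    """Merge consecutive active frames into (start_s, end_s) segments."""
    segments = []
    seg_start = None
    for i, active in enumerate(flags):
        if active and seg_start is None:
            seg_start = i
        elif not active and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:
        # Close a segment that runs to the end of the stream.
        segments.append((seg_start * frame_len / sample_rate,
                         len(flags) * frame_len / sample_rate))
    return segments


def diarize_separated(streams: Sequence[Sequence[float]], frame_len: int,
                      sample_rate: int, threshold: float
                      ) -> Dict[int, List[Tuple[float, float]]]:
    """Run the toy VAD on each separated stream; one speaker per stream."""
    return {
        spk: flags_to_segments(
            frame_energy_vad(stream, frame_len, threshold),
            frame_len, sample_rate)
        for spk, stream in enumerate(streams)
    }


if __name__ == "__main__":
    # Two already-separated toy streams at 8 samples per second.
    first = [1.0] * 8 + [0.0] * 8
    second = [0.0] * 8 + [1.0] * 8
    print(diarize_separated([first, second], frame_len=4,
                            sample_rate=8, threshold=0.5))
```

Because each separated stream is assigned to one speaker, the per-stream activity segments directly form the diarization output. Residual leakage between streams would show up as false alarms in a detector like this one, which is the problem the leakage removal step in the abstract targets.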

5/24/2024