Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Read original: arXiv:2409.08797 - Published 9/16/2024 by Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu

Overview

  • Provides author guidelines for submitting manuscripts to the Blind SLT 2024 conference
  • Covers formatting requirements, structure, and other important details for preparing a paper

Plain English Explanation

The provided document outlines the guidelines and instructions for authors who wish to submit a paper to the Blind SLT 2024 conference. It covers the expected formatting, structure, and other important details that authors should follow when preparing their manuscripts.

The introduction section explains the purpose of the guidelines and the importance of following them. The formatting section goes into detail on things like page layout, font styles, and citation formatting. The page title section describes how the title, author information, and other front matter should be structured.

This information is crucial for ensuring that all submissions to the conference adhere to a common standard, which makes the review and selection process more efficient and fair. Following these guidelines closely will help authors present their work in the best possible way.

Technical Explanation

The introduction provides context on the Blind SLT 2024 conference and the need for a standardized set of author guidelines. It explains that these guidelines are intended to ensure consistent formatting and structure across all submitted papers.

The formatting section specifies the required page size, margins, font styles, and other layout details. It also covers the citation formatting, including guidance on referencing previous works. Proper formatting is essential for the efficient review and publication of accepted papers.

The page title section describes how the title, author information, and other front matter should be structured. This includes requirements around the use of blind review, where author identities are concealed from reviewers. Adherence to these guidelines ensures a fair, unbiased evaluation process.

Critical Analysis

The guidelines appear to be thorough and well-designed to support the conference's blind review process. The formatting requirements are detailed and comprehensive, which should help ensure a consistent presentation across all submissions.

One potential limitation is the lack of guidance on the structure and organization of the paper content itself. While the formatting is specified, there are no details on recommended section headings, the flow of information, or other high-level structural elements. Some authors may benefit from additional guidance in this area.

Additionally, the guidelines do not address the use of figures, tables, or other visual elements. Clear instructions on incorporating and formatting these types of content would be a valuable addition.

Overall, these guidelines seem well-suited to facilitating a fair and efficient review process for the Blind SLT 2024 conference. With a few potential enhancements, they could provide even more comprehensive support for authors.

Conclusion

The provided author guidelines for the Blind SLT 2024 conference offer a detailed set of instructions and requirements for formatting and structuring manuscript submissions. By ensuring a consistent presentation across all papers, these guidelines support a fair and efficient review process.

The guidelines cover essential elements like page layout, font styles, citation formatting, and blind review requirements. This level of standardization is crucial for conferences aimed at maintaining high-quality, unbiased evaluations of research work.

While the guidelines could be expanded with more guidance on overall paper structure and the use of visual elements, they otherwise appear to be a comprehensive and well-designed set of instructions for authors. Following them closely will help ensure that submissions to the Blind SLT 2024 conference are presented in the best possible way.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu

Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or the current utterance's internal contexts alone, or both at the same time, is demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete-token-based cross-utterance context features outperforms the baseline using utterance internal context only, with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WERs of 11.15% and 11.14% were obtained on the dev and test sets, respectively. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context_ASR.

9/16/2024
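
The abstract above quotes WER reductions in both absolute and relative terms. The short sketch below shows how the two are related. The function name is illustrative, and the baseline WERs are not stated in the abstract: they are reconstructed here under the assumption that baseline = reported WER + absolute reduction, with 0.32% paired to dev and 0.41% paired to test (the abstract does not say which reduction belongs to which set).

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates expressed in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Reported best WERs (from the abstract).
dev_new, test_new = 11.15, 11.14

# ASSUMPTION: baselines reconstructed as reported WER + absolute reduction;
# the dev/test pairing of 0.32 and 0.41 is a guess, not stated in the source.
dev_baseline = dev_new + 0.32
test_baseline = test_new + 0.41

for name, base, new in [("dev", dev_baseline, dev_new),
                        ("test", test_baseline, test_new)]:
    rel = relative_wer_reduction(base, new)
    print(f"{name}: {base - new:.2f}% absolute, {rel:.2f}% relative")
```

Under these assumptions the computed relative reductions land close to the reported 2.78% to 3.54% range; small differences are expected from rounding in the published figures.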

Exploring SSL Discrete Tokens for Multilingual ASR

Mingyu Cui, Daxin Tan, Yifan Yang, Dingdong Wang, Huimeng Wang, Xiao Chen, Xie Chen, Xunying Liu

With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However, previous studies primarily focused on multilingual ASR with Fbank features or English ASR with discrete tokens, leaving a gap in adapting discrete tokens for multilingual ASR scenarios. This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. We aim to explore the performance and efficiency of speech discrete tokens across multiple language domains for both monolingual and multilingual ASR scenarios. Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on Fbank features in ASR tasks across seven language domains, with an average word error rate (WER) reduction of 0.31% and 1.76% absolute (2.80% and 15.70% relative) on dev and test sets respectively, and a particularly large WER reduction of 6.82% absolute (41.48% relative) on the Polish test set.

9/16/2024

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

4/30/2024

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis on building ASR systems with discrete codes. We investigate different methods for codec training such as quantization schemes and time-domain vs spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark despite being smaller in size and pretrained on significantly less data.

7/8/2024