DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks

Read original: arXiv:2407.15624 - Published 7/23/2024 by Shahan Nercessian, Alexey Lukin, Johannes Imort

Overview

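This paper proposes a dual-stage deep learning architecture for bandwidth extension (BWE) that raises the effective sampling rate of speech signals from 8 kHz to 48 kHz.
The method splits the task into a locally-conditioned excitation stage, which broadens the spectrum of the input, and a linear time-varying (LTV) filter stage, which shapes that spectrum using predicted acoustic features.
The authors evaluate their approach with exciters built from the SEANet and HiFi-GAN generators and compare it to previous methods, including HiFi-GAN-2.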

Plain English Explanation

The paper describes a new way to take a speech recording with a limited frequency range and expand it to have a wider, more natural-sounding frequency range. This is called "bandwidth extension."

The key idea is to use machine learning models with two specialized components:

A locally-conditioned excitation network that generates additional high-frequency content to add to the original recording.
A linear time-varying filter network that shapes this new high-frequency content to make it sound more natural and coherent with the original lower frequencies.

By combining these two specialized networks, the authors were able to produce bandwidth-extended speech that sounded better than previous approaches. They tested their method on a standard dataset of speech recordings and showed it outperformed other bandwidth extension techniques.

The advantage of this approach is that it can intelligently generate and shape the new high-frequency content, rather than just blindly copying or extrapolating the original lower frequencies. This allows for more natural-sounding and higher-quality bandwidth extension.
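To make "blindly copying" concrete, here is a minimal sketch of a classical baseline, spectral folding by zero insertion. It is not taken from the paper; the helper names, the 1 kHz test tone, and the band edges are assumptions chosen for illustration. Upsampling an 8 kHz signal to 48 kHz by inserting zeros, with no filtering, leaves mirrored copies of the low band in the new high band.

```python
import numpy as np

def zero_insert_upsample(x, factor):
    """Upsample by placing factor-1 zeros between samples (no anti-imaging filter)."""
    y = np.zeros(len(x) * factor, dtype=float)
    y[::factor] = x
    return y

def band_energy(x, sr, f_lo, f_hi):
    """Sum of squared FFT magnitudes for frequencies in [f_lo, f_hi)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return float(power[(freqs >= f_lo) & (freqs < f_hi)].sum())

sr_in, factor = 8000, 6            # 8 kHz input, 48 kHz output
sr_out = sr_in * factor

# Assumed test signal: one second of a 1 kHz tone sampled at 8 kHz.
t = np.arange(sr_in) / sr_in
narrowband = np.sin(2 * np.pi * 1000 * t)

folded = zero_insert_upsample(narrowband, factor)

# The tone now also appears at 7, 9, 15, 17 and 23 kHz (images around
# multiples of 8 kHz), each with the same energy as the original.
low = band_energy(folded, sr_out, 0, 4000)
high = band_energy(folded, sr_out, 4000, 24000)
print(f"high-band / low-band energy: {high / low:.3f}")
```

The high band is filled, but only with rigid copies of the low band, which is the kind of unshaped content that a learned excitation-plus-filter design aims to improve on.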

Technical Explanation

The paper introduces a deep learning-based bandwidth extension system that uses two key subnetworks:

A locally-conditioned excitation network that broadens the spectrum of the narrowband input, supplying the raw high-frequency content that the input lacks. Local conditioning means the network receives time-varying side information derived from the input speech, so the content it generates stays coherent with that input.
A linear time-varying (LTV) filter network that shapes the broadened spectrum, based on outputs from an acoustic feature predictor, so that the new content blends with the original low-frequency components. Because the filter changes over time, the shaping adapts to the characteristics of each input speech segment.
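The two-stage structure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `toy_exciter` (a fixed nonlinearity) and `toy_filter_gains` (a hand-written spectral tilt) are hypothetical stand-ins for the neural exciter and the acoustic feature predictor, and the frame size, the 4 kHz cutoff, and the choice to add the filtered excitation back onto the input are all assumptions.

```python
import numpy as np

SR = 48000            # target sample rate (Hz)
CUTOFF = 4000         # assumed upper edge of the 8 kHz-sampled input band (Hz)
FRAME, HOP = 512, 256

def toy_exciter(x):
    """Hypothetical stand-in for the neural exciter: a memoryless
    nonlinearity that adds harmonics above the input band."""
    return np.tanh(3.0 * x)

def toy_filter_gains(n_frames):
    """Hypothetical stand-in for the acoustic feature predictor: one
    magnitude response per frame, zero below CUTOFF, tilted above it,
    with an overall level that rises over time."""
    freqs = np.fft.rfftfreq(FRAME, d=1.0 / SR)
    tilt = np.exp(-(freqs - CUTOFF) / 8000.0) * (freqs >= CUTOFF)
    level = np.linspace(0.1, 1.0, n_frames)[:, None]
    return level * tilt[None, :]

def ltv_filter(x, gains):
    """Linear time-varying filter: scale each windowed frame's spectrum
    by that frame's gains, then overlap-add the results."""
    window = np.hanning(FRAME + 1)[:-1]   # periodic Hann, sums to 1 at 50% overlap
    y = np.zeros(len(x))
    for i, g in enumerate(gains):
        s = i * HOP
        spec = np.fft.rfft(x[s:s + FRAME] * window)
        y[s:s + FRAME] += np.fft.irfft(spec * g, n=FRAME)
    return y

def band_energy(x, f_lo, f_hi):
    """Sum of squared FFT magnitudes for frequencies in [f_lo, f_hi)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / SR)
    return float(power[(freqs >= f_lo) & (freqs < f_hi)].sum())

# Assumed test input: a 1.5 kHz tone, band-limited but already at 48 kHz.
n = 16384
time_axis = np.arange(n) / SR
narrowband = np.sin(2 * np.pi * 1500 * time_axis)
n_frames = 1 + (n - FRAME) // HOP

# Stage 1 broadens the spectrum; stage 2 shapes it frame by frame.
highband = ltv_filter(toy_exciter(narrowband), toy_filter_gains(n_frames))
extended = narrowband + highband

low_in = band_energy(narrowband, 0, CUTOFF)
high_in = band_energy(narrowband, CUTOFF, SR / 2)
high_out = band_energy(extended, CUTOFF, SR / 2)
print(f"high/low energy before: {high_in / low_in:.2e}, after: {high_out / low_in:.2e}")
```

The point of the split is that the exciter only has to produce spectrally rich content, while all time-dependent shaping sits in the filter gains. Multiplying frame spectra directly, as done here, is a simplification that ignores circular-convolution effects; a real system would design its filters more carefully.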

According to the abstract, the inductive bias added by this explicit excitation and filter structure improves BWE results when the generators from either SEANet or HiFi-GAN are used as exciters. The authors also report that adapting the processing with acoustic feature predictions in this way is more effective than the mechanism used in HiFi-GAN-2.

Critical Analysis

The paper provides a thorough evaluation of the proposed bandwidth extension system, including comparisons to several previous methods. The authors acknowledge some limitations, such as the potential impact of the training data distribution on performance and the need for further research on efficient real-time implementation.

One possible extension would be to use multi-channel information to further improve bandwidth extension quality. The approach could also be applied to personalized speech enhancement scenarios.

Overall, the paper presents a novel and promising deep learning-based approach for speech bandwidth extension, with a solid technical foundation and experimental validation. The use of specialized subnetworks for excitation and filtering is a compelling strategy that could inspire further research in this area.

Conclusion

This paper introduces a deep learning-based speech bandwidth extension system that leverages locally-conditioned excitation and linear time-varying filter subnetworks to generate high-quality extended-bandwidth speech. The authors demonstrate the effectiveness of their approach through experiments on a standard speech dataset, outperforming previous methods. While the paper acknowledges some limitations, the proposed techniques represent an interesting and potentially impactful contribution to the field of speech enhancement and bandwidth extension.

Related Papers

DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks

Shahan Nercessian, Alexey Lukin, Johannes Imort

In this paper, we propose a dual-stage architecture for bandwidth extension (BWE) increasing the effective sampling rate of speech signals from 8 kHz to 48 kHz. Unlike existing end-to-end deep learning models, our proposed method explicitly models BWE using excitation and linear time-varying (LTV) filter stages. The excitation stage broadens the spectrum of the input, while the filtering stage properly shapes it based on outputs from an acoustic feature predictor. To this end, an acoustic feature loss term can implicitly promote the excitation subnetwork to produce white spectra in the upper frequency band to be synthesized. Experimental results demonstrate that the added inductive bias provided by our approach can improve upon BWE results using the generators from both SEANet or HiFi-GAN as exciters, and that our means of adapting processing with acoustic feature predictions is more effective than that used in HiFi-GAN-2. Secondary contributions include extensions of the SEANet model to accommodate local conditioning information, as well as the application of HiFi-GAN-2 for the BWE problem.

7/23/2024

Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control

Ye-Xin Lu, Yang Ai, Zheng-Yan Sheng, Zhen-Hua Ling

The majority of existing speech bandwidth extension (BWE) methods operate under the constraint of fixed source and target sampling rates, which limits their flexibility in practical applications. In this paper, we propose a multi-stage speech BWE model named MS-BWE, which can handle a set of source and target sampling rate pairs and achieve flexible extensions of frequency bandwidth. The proposed MS-BWE model comprises a cascade of BWE blocks, with each block featuring a dual-stream architecture to realize amplitude and phase extension, progressively painting the speech frequency bands stage by stage. The teacher-forcing strategy is employed to mitigate the discrepancy between training and inference. Experimental results demonstrate that our proposed MS-BWE is comparable to state-of-the-art speech BWE methods in speech quality. Regarding generation efficiency, the one-stage generation of MS-BWE can achieve over one thousand times real-time on GPU and about sixty times on CPU.

6/5/2024

Vector Quantized Diffusion Model Based Speech Bandwidth Extension

Yuan Fang, Jinglin Bai, Jiajie Wang, Xueliang Zhang

Recent advancements in neural audio codec (NAC) unlock new potential in audio signal processing. Studies have increasingly explored leveraging the latent features of NAC for various speech signal processing tasks. This paper introduces the first approach to speech bandwidth extension (BWE) that utilizes the discrete features obtained from NAC. By restoring high-frequency details within highly compressed discrete tokens, this approach enhances speech intelligibility and naturalness. Based on Vector Quantized Diffusion, the proposed framework combines the strengths of advanced NAC, diffusion models, and Mamba-2 to reconstruct high-frequency speech components. Extensive experiments demonstrate that this method exhibits superior performance across both log-spectral distance and ViSQOL, significantly improving speech quality.

9/17/2024

Speech enhancement deep-learning architecture for efficient edge processing

Monisankha Pal, Arvind Ramanathan, Ted Wada, Ashutosh Pandey

Deep learning has become a de facto method of choice for speech enhancement tasks with significant improvements in speech quality. However, real-time processing with reduced size and computations for low-power edge devices drastically degrades speech quality. Recently, transformer-based architectures have greatly reduced the memory requirements and provided ways to improve the model performance through local and global contexts. However, the transformer operations remain computationally heavy. In this work, we introduce WaveUNet squeeze-excitation Res2 (WSR)-based metric generative adversarial network (WSR-MGAN) architecture that can be efficiently implemented on low-power edge devices for noise suppression tasks while maintaining speech quality. We utilize multi-scale features using Res2Net blocks that can be related to spectral content used in speech-processing tasks. In the generator, we integrate squeeze-excitation blocks (SEB) with multi-scale features for maintaining local and global contexts along with gated recurrent units (GRUs). The proposed approach is optimized through a combined loss function calculated over raw waveform, multi-resolution magnitude spectrogram, and objective metrics using a metric discriminator. Experimental results in terms of various objective metrics on VoiceBank+DEMAND and DNS-2020 challenge datasets demonstrate that the proposed speech enhancement (SE) approach outperforms the baselines and achieves state-of-the-art (SOTA) performance in the time domain.

5/28/2024