Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Read original: arXiv:2406.12998 - Published 8/22/2024 by Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

Overview

This paper presents a novel speech coding technique called "Articulatory Encodec" that uses vocal tract kinematics as the underlying representation for speech.
The researchers demonstrate that this articulatory-based approach can outperform traditional acoustic-based speech codecs in terms of speech quality, bit-rate, and latency.
The proposed system leverages recent advancements in speech inversion and articulatory synthesis to encode speech as a sequence of vocal tract parameters instead of raw acoustic features.

Plain English Explanation

The paper describes a new way to encode and compress speech that is based on how the human vocal tract moves rather than just the resulting sound waves. Instead of representing speech as a series of audio samples, the "Articulatory Encodec" system captures the movements of the lips, tongue, and other parts of the vocal system.

This approach is inspired by how humans produce speech - by precisely controlling the shape of the vocal tract. The researchers show that by encoding speech this way, they can achieve higher quality, lower bit-rates, and lower latency compared to traditional speech coding techniques that focus only on the acoustic signal.

The key insight is that the movements of the articulators (the parts of the vocal tract) contain more efficient and meaningful information for representing speech than just the acoustic waveform. By modeling this articulatory information directly, the system can transmit speech using fewer bits without losing important details.

This articulatory-based coding strategy could enable new applications like higher fidelity voice calls, more natural text-to-speech, and better speech recognition - especially in noisy environments. It's an exciting innovation that taps into the underlying biomechanics of human speech production.

Technical Explanation

The Articulatory Encodec system works by first estimating the time-varying vocal tract parameters from the input speech signal using a speech inversion model. This maps the acoustic features to a sequence of articulatory configurations, represented by variables like lip opening, tongue position, velum height, etc.
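As a rough illustration of the data involved, the sketch below models one frame of articulatory parameters and the interface of an inversion step. The field names, the 50 Hz frame rate, and the `invert` function are assumptions made for illustration and are not taken from the paper. The body of `invert` is an arbitrary stand-in for a trained neural network; it only shows that the mapping produces one articulatory frame per acoustic frame.

```python
from dataclasses import dataclass
from typing import List, Sequence

FRAME_RATE_HZ = 50  # assumed frame rate, not taken from the paper


@dataclass
class ArticulatoryFrame:
    """One time step of (hypothetical) vocal tract parameters."""
    lip_aperture: float
    tongue_tip_height: float
    tongue_body_height: float
    velum_height: float
    pitch_hz: float  # source feature, alongside the kinematic ones


def invert(acoustic_frames: Sequence[Sequence[float]]) -> List[ArticulatoryFrame]:
    """Placeholder for a speech inversion model.

    A real system would run a trained network here. This stand-in uses an
    arbitrary rule (mean absolute feature value -> lip aperture) purely to
    produce one output frame per input frame.
    """
    frames = []
    for feats in acoustic_frames:
        energy = sum(abs(x) for x in feats) / max(len(feats), 1)
        frames.append(ArticulatoryFrame(
            lip_aperture=energy,
            tongue_tip_height=0.0,
            tongue_body_height=0.0,
            velum_height=0.0,
            pitch_hz=0.0,
        ))
    return frames


# Two toy acoustic frames with three features each.
acoustic = [[0.1, -0.2, 0.3], [0.0, 0.0, 0.0]]
trajectory = invert(acoustic)
print(len(trajectory), "frames,", len(trajectory) / FRAME_RATE_HZ, "seconds")
```

The point of the structure is that each field has a physical meaning, which is what makes the representation interpretable and, in principle, editable.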

These articulatory parameters are then encoded using a neural network-based codec. The encoder maps the articulatory trajectories to a compact latent representation, which is transmitted. On the decoder side, another network reconstructs the full articulatory motion from the latent code and uses an articulatory synthesis model to generate the final speech waveform.
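To make the encode, transmit, decode idea concrete, here is a minimal sketch that swaps the neural codec for a plain uniform scalar quantizer. This is not the method described above; it only shows how a low-dimensional trajectory turns into a small number of bits per second. The channel count, value range, bit depth, and frame rate are all assumed values.

```python
from typing import List


def quantize(value: float, lo: float, hi: float, bits: int) -> int:
    """Map a value in [lo, hi] to an integer code with 2**bits levels."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code: int, lo: float, hi: float, bits: int) -> float:
    """Map an integer code back to the centre of its quantization cell."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


def encode(trajectory: List[List[float]], lo: float, hi: float, bits: int) -> List[List[int]]:
    """Quantize every channel of every frame independently."""
    return [[quantize(v, lo, hi, bits) for v in frame] for frame in trajectory]


def decode(codes: List[List[int]], lo: float, hi: float, bits: int) -> List[List[float]]:
    """Invert encode() up to quantization error."""
    return [[dequantize(c, lo, hi, bits) for c in frame] for frame in codes]


def bitrate_bps(channels: int, bits: int, frame_rate_hz: int) -> int:
    """Raw bit-rate of the quantized stream, before any entropy coding."""
    return channels * bits * frame_rate_hz


# Assumed setup: normalised parameters in [-1, 1], 8 bits per value.
LO, HI, BITS = -1.0, 1.0, 8
trajectory = [[0.25, -0.5, 0.9], [0.3, -0.45, 0.85]]
restored = decode(encode(trajectory, LO, HI, BITS), LO, HI, BITS)

# Worst-case round-trip error is half a quantization step.
half_step = (HI - LO) / ((1 << BITS) - 1) / 2
max_err = max(abs(a - b) for fa, fb in zip(trajectory, restored) for a, b in zip(fa, fb))
print("max error:", max_err, "bound:", half_step)

# Example arithmetic: 14 channels at 8 bits and 50 frames per second.
print(bitrate_bps(14, 8, 50), "bits per second")
```

A learned codec would replace `encode` and `decode` with networks and a learned latent space, but the bit-rate accounting (values per frame times bits per value times frames per second) follows the same pattern.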

The researchers report that this articulatory-based approach outperforms traditional acoustic codecs such as Opus in terms of perceived speech quality, bit-rate efficiency, and algorithmic latency. They also show that the system is robust to background noise and can maintain high quality even at very low bit-rates.

Critical Analysis

The key strength of the Articulatory Encodec system is its ability to leverage the underlying biomechanics of speech production to achieve more efficient and robust speech coding. By modeling the movement of the articulators directly, the system can capture nuanced aspects of speech that are difficult to represent purely through the acoustic signal.

However, the reliance on accurate speech inversion - estimating articulator positions from audio - is a potential limitation. The performance of the overall codec is heavily dependent on the quality of the inversion model, which can be challenging to train, especially for speakers with atypical vocal tract anatomy or articulation patterns.

Additionally, the system currently requires access to electromagnetic articulography (EMA) data, which is an invasive and expensive data collection method. Extending the approach to work with other articulatory data sources, such as real-time MRI or ultrasound, could broaden its applicability.

Finally, the paper does not provide a detailed analysis of the computational complexity and memory requirements of the Articulatory Encodec system, which would be important considerations for real-world deployment, especially in resource-constrained scenarios like mobile devices or low-power IoT applications.

Conclusion

The Articulatory Encodec paper presents a promising new direction for speech coding by shifting the representation from the acoustic domain to the articulatory domain. This biomimetic approach leverages our understanding of human speech production to achieve higher quality, lower bit-rate, and lower latency speech transmission compared to traditional codecs.

While there are some technical challenges to overcome, this research demonstrates the value of integrating articulatory phonetics into speech technology. By modeling the underlying mechanisms of speech, we can develop more efficient and robust coding systems that could enable a wide range of new applications, from enhanced telecommunications to more natural text-to-speech synthesis.

As related work on phonetics-enhanced language modeling and on variational-autoencoder-based variability encoding continues to advance, the Articulatory Encodec system represents an exciting step towards a deeper integration of speech production and speech processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli

Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- Articulatory Encodec. Articulatory Encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.

8/22/2024

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.

6/18/2024

Decoding Vocal Articulations from Acoustic Latent Representations

Mateo Cámara, Fernando Marcos, José Luis Blanco

We present a novel neural encoder system for acoustic-to-articulatory inversion. We leverage the Pink Trombone voice synthesizer that reveals articulatory parameters (e.g., tongue position and vocal cord configuration). Our system is designed to identify the articulatory features responsible for producing specific acoustic characteristics contained in a neural latent representation. To generate the necessary latent embeddings, we employed two main methodologies. The first was a self-supervised variational autoencoder trained from scratch to reconstruct the input signal at the decoder stage. We conditioned its bottleneck layer with a subnetwork called the projector, which decodes the voice synthesizer's parameters. The second methodology utilized two pretrained models: EnCodec and Wav2Vec. They eliminate the need to train the encoding process from scratch, allowing us to focus on training the projector network. This approach aimed to explore the potential of these existing models in the context of acoustic-to-articulatory inversion. By reusing the pretrained models, we significantly simplified the data processing pipeline, increasing efficiency and reducing computational overhead. The primary goal of our project was to demonstrate that these neural architectures can effectively encapsulate both acoustic and articulatory features. This prediction-based approach is much faster than traditional methods focused on acoustic feature-based parameter optimization. We validated our models by predicting six different parameters and evaluating them with objective metrics and the ViSQOL subjective-equivalent metric using both synthesizer- and human-generated sounds. The results show that the predicted parameters can generate human-like vowel sounds when input into the synthesizer. We provide the dataset, code, and detailed findings to support future research in this field.

6/21/2024

Simulating Articulatory Trajectories with Phonological Feature Interpolation

Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux

As a first step towards a complete computational model of speech learning involving perception-production loops, we investigate the forward mapping between pseudo-motor commands and articulatory trajectories. Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence. Different interpolation techniques are compared to generate smooth trajectories in these feature spaces, with a potential optimisation of the target value and timing to capture co-articulation effects. We report the Pearson correlation between a linear projection of the generated trajectories and articulatory data derived from a multi-speaker dataset of electromagnetic articulography (EMA) recordings. A correlation of 0.67 is obtained with an extended feature set based on generative phonology and a linear interpolation technique. We discuss the implications of our results for our understanding of the dynamics of biological motion.

8/9/2024