Preserving spoken content in voice anonymisation with character-level vocoder conditioning

Read original: arXiv:2408.04306 - Published 8/9/2024 by Michele Panariello, Massimiliano Todisco, Nicholas Evans


Overview

Discusses a voice anonymization technique that preserves the spoken content while modifying the speaker's voice
Uses a character-level vocoder conditioning approach to achieve this
Aims to maintain the intelligibility and naturalness of the anonymized voice

Plain English Explanation

This paper describes a method for voice anonymization: modifying a person's voice in a recording while preserving the words they spoke. The key idea is to condition the vocoder, the component that synthesises the output waveform, on character-level information about what was said. According to the paper's abstract, that information comes from an auxiliary automatic speech recognition (ASR) model.

By conditioning the vocoder on the characters of the spoken content, the researchers were able to anonymize the voice while keeping the intelligibility and natural flow of the speech. This is important for applications like speaker anonymization where you want to protect someone's identity but still preserve the original meaning and delivery of their message.

The approach aims to be better than simply pitch-shifting or applying other basic voice modifications, which can make the speech sound unnatural or robotic. By conditioning the vocoder on the actual text, this method can produce anonymized voices that sound more natural and human-like.

Technical Explanation

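The key technical innovation in this work is character-level conditioning of the vocoder. A vocoder in an anonymization system typically synthesises the waveform from acoustic and speaker representations; here it additionally receives character-level information about the spoken text, which the abstract describes as the output of an auxiliary ASR model mapped through a set of learnable embedding dictionaries.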

The overall architecture consists of an encoder that takes the input speech signal and extracts acoustic features, and a decoder that generates the anonymized speech waveform. Crucially, the decoder also takes in the text characters corresponding to the spoken content, which allows it to preserve the original meaning and flow of the speech.
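
As a rough illustration of what character-level conditioning could look like, the sketch below adds a per-character embedding to each frame of acoustic features before they would reach a vocoder. It is a minimal toy under stated assumptions, not the authors' implementation: the vocabulary, the `make_embedding_table` and `condition_on_characters` names, the additive combination, and the random (untrained) embedding table are all hypothetical choices made for illustration.

```python
import random

# Hypothetical character inventory; index 0 is reserved for "no character".
VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz' ")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a stand-in embedding dictionary (random here, learned in practice)."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
    table[0] = [0.0] * dim  # blank frames leave the acoustic features unchanged
    return table


def condition_on_characters(acoustic_feats, frame_chars, table):
    """Add the embedding of each frame's character to that frame's features.

    acoustic_feats: list of frames, each a list of `dim` floats.
    frame_chars:    one character (or "<blank>") per frame.
    table:          embedding dictionary, one row of `dim` floats per character.
    """
    if len(acoustic_feats) != len(frame_chars):
        raise ValueError("need exactly one character label per frame")
    conditioned = []
    for frame, ch in zip(acoustic_feats, frame_chars):
        embed = table[CHAR_TO_ID[ch]]
        conditioned.append([x + e for x, e in zip(frame, embed)])
    return conditioned


dim = 4
table = make_embedding_table(len(VOCAB), dim)

# Hand-written per-frame labels standing in for a recogniser's output.
frame_chars = ["<blank>", "h", "h", "i", "<blank>"]
feat_rng = random.Random(1)
acoustic_feats = [[feat_rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in frame_chars]

conditioned = condition_on_characters(acoustic_feats, frame_chars, table)
print(len(conditioned), len(conditioned[0]))  # 5 4
```

In a trained system the embedding table would be learned jointly with the vocoder rather than drawn at random, and the per-frame character labels would come from a recogniser rather than being written by hand.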

The researchers trained and evaluated the conditioned vocoder on several public speech datasets. According to the abstract, the technique reduces the word error rate (WER) computed on anonymized utterances by almost 60% relative to a baseline approach, at only a modest cost in anonymization performance.
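
Content preservation in this kind of evaluation is reported as word error rate (WER), the metric cited in the paper's abstract: the word-level edit distance between a reference transcript and an ASR transcript of the anonymized audio, divided by the number of reference words. The sketch below is a generic implementation of that standard definition, not the authors' evaluation code; the function names and the example numbers are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")  # 0.333

# Made-up WER values, chosen only to show how a relative reduction is computed.
print(f"{relative_reduction(0.10, 0.04):.0%}")  # 60%
```

A relative reduction compares two WER values against each other, so a 60% relative reduction does not mean that 60% of words were previously wrong.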

Critical Analysis

One potential limitation of this approach is that it depends on a character-level transcription of the speech to condition the vocoder. According to the abstract, this transcription is produced by an auxiliary ASR model rather than supplied by hand, so a human transcript is not required, but the quality of the conditioning is presumably tied to the accuracy of that recogniser.

Additionally, while the results show a large improvement in content preservation over the baseline, the abstract notes that it comes at a modest cost in anonymization performance. The size of that trade-off would be important for real-world voice anonymization applications.

Overall, this work demonstrates a novel and promising direction for preserving spoken content during voice anonymization. Further research is needed to address some of the practical limitations and continue improving the quality and robustness of the anonymized voices.

Conclusion

This paper presents a character-level vocoder conditioning approach for voice anonymization that aims to preserve the spoken content. By conditioning on a character-level transcription of the speech produced by an ASR model, the system can generate anonymized voices that maintain the original meaning and flow of the message.

While there are some limitations to the current implementation, this work represents an important step forward in the field of speaker anonymization. The ability to protect someone's identity while still conveying the full meaning and intent of their speech has significant applications in areas like privacy-preserving communication, content moderation, and voice-based assistants.

Further research and development in this area could lead to even more advanced voice anonymization techniques that strike a better balance between protecting the speaker's identity and preserving the nuances of their spoken expression.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers

Preserving spoken content in voice anonymisation with character-level vocoder conditioning

Michele Panariello, Massimiliano Todisco, Nicholas Evans

Voice anonymisation can be used to help protect speaker privacy when speech data is shared with untrusted others. In most practical applications, while the voice identity should be sanitised, other attributes such as the spoken content should be preserved. There is always a trade-off; all approaches reported thus far sacrifice spoken content for anonymisation performance. We report what is, to the best of our knowledge, the first attempt to actively preserve spoken content in voice anonymisation. We show how the output of an auxiliary automatic speech recognition model can be used to condition the vocoder module of an anonymisation system using a set of learnable embedding dictionaries in order to preserve spoken content. Relative to a baseline approach, and for only a modest cost in anonymisation performance, the technique is successful in decreasing the word error rate computed from anonymised utterances by almost 60%.

8/9/2024

Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

Rui Wang, Liping Chen, Kong Aik Lee, Zhen-Hua Ling

Voice anonymization has been developed as a technique for preserving privacy by replacing the speaker's voice in a speech signal with that of a pseudo-speaker, thereby obscuring the original voice attributes from machine recognition and human perception. In this paper, we focus on altering the voice attributes against machine recognition while retaining human perception. We referred to this as the asynchronous voice anonymization. To this end, a speech generation framework incorporating a speaker disentanglement mechanism is employed to generate the anonymized speech. The speaker attributes are altered through adversarial perturbation applied on the speaker embedding, while human perception is preserved by controlling the intensity of perturbation. Experiments conducted on the LibriSpeech dataset showed that the speaker attributes were obscured with their human perception preserved for 60.71% of the processed utterances.

6/14/2024

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin

A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.

8/13/2024

Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization

Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we developed various speaker anonymization pipelines and find that approaches either excel at anonymization or preserving emotion state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we found that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.

9/6/2024