HLTCOE JHU Submission to the Voice Privacy Challenge 2024

Read original: arXiv:2409.08913 - Published 9/18/2024 by Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

Overview

This paper describes the submission of HLTCOE JHU to the Voice Privacy Challenge 2024.
The Voice Privacy Challenge aims to improve voice anonymization techniques to protect speaker privacy.
The paper presents the methods and results of HLTCOE JHU's approach to this challenge.

Plain English Explanation

The Voice Privacy Challenge 2024 is a competition focused on developing techniques to anonymize voice recordings. This helps protect the privacy of speakers by hiding their identity while still preserving the content of the speech.

The paper describes the approach taken by researchers at HLTCOE JHU to participate in this challenge. They developed a system that can take a person's voice recording and transform it to sound like a different, artificial speaker. This prevents the original speaker's identity from being recognized.

The key idea is to use machine learning models to learn the characteristics of the original speaker's voice, and then generate a new voice that has different vocal traits. This anonymized voice still conveys the same words and meaning as the original, but sounds like it comes from a different person.

The paper outlines the specific techniques and architectures the HLTCOE JHU team used to build their voice anonymization system. They describe how they trained the models, the data they used, and the various components involved. The results show that their approach was effective at preserving speech quality while successfully anonymizing the speaker's identity.

Technical Explanation

The HLTCOE JHU submission to the Voice Privacy Challenge 2024 describes a voice anonymization system composed of several key components:

Acoustic Feature Extraction: The system first extracts various acoustic features from the input speech audio, such as spectral and prosodic characteristics.
Speaker Embedding Extraction: A speaker embedding model captures the unique voiceprint of the original speaker, giving the system an explicit representation of the identity to be replaced.
Voice Conversion: The acoustic features and speaker embedding are then passed to a voice conversion model. This model learns to generate a new speech waveform that has the same content as the original, but with a different speaker identity.
Post-Processing: Finally, the generated anonymized speech is post-processed to ensure high perceptual quality and naturalness.
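
The paper's abstract names kNN-VC as one of the voice conversion systems the team submitted. The following is a minimal sketch of the frame-matching idea behind that method, not the authors' implementation: each frame of the source utterance is replaced by the average of its k most similar frames from a pool of target-speaker frames. The function name `knn_convert`, the feature dimension, the value of k, and the random arrays standing in for real speech features are all assumptions made for illustration.

```python
import numpy as np

def knn_convert(source_feats, target_pool, k=4):
    """Replace each source frame with the mean of its k nearest target-pool frames.

    source_feats: (n_src, dim) frame features of the utterance to anonymize.
    target_pool:  (n_pool, dim) frame features from the target (pseudo) speaker.
    """
    # Cosine similarity between every source frame and every pool frame.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    similarity = src @ pool.T  # shape (n_src, n_pool)

    # Indices of the k most similar pool frames for each source frame.
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Average the selected (unnormalised) pool frames: shape (n_src, dim).
    return target_pool[nearest].mean(axis=1)

# Random arrays stand in for real frame-level speech features (assumption).
rng = np.random.default_rng(0)
source = rng.normal(size=(120, 16))   # 120 frames from the original speaker
target = rng.normal(size=(500, 16))   # 500 frames from the target speaker
converted = knn_convert(source, target, k=4)
print(converted.shape)  # (120, 16)
```

In a full system the converted features would still need to be turned back into a waveform by a vocoder; that step is omitted here.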

The researchers trained these models using large speech datasets, including recordings from many different speakers. By learning the patterns and nuances of human voices, the system can effectively transform a given voice into a new, synthetic one.

Experiments show that this approach is able to preserve the linguistic content of the speech while successfully obfuscating the original speaker's identity. The anonymized voices were rated as sounding natural and intelligible by human evaluators.
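
The abstract listed under Related Papers reports results as an equal error rate (EER) and an unweighted average recall (UAR). As a rough illustration of what those two numbers measure, the sketch below computes EER from speaker-verification scores (a higher EER means the attacker finds it harder to match anonymized speech to the speaker) and UAR from emotion labels (the mean of per-class recall). The function names and the simple threshold sweep are assumptions for illustration, not the challenge's official evaluation scripts.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the point where false accepts and false rejects are equal."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False acceptance rate: different-speaker trials scored at or above threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # False rejection rate: same-speaker trials scored below threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy inputs: a verifier that cannot separate the trials, and four emotion labels.
eer = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
uar = unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

With identical score distributions the verifier is at chance, which is the ideal outcome for an anonymization system.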

Critical Analysis

The paper provides a thorough technical description of the HLTCOE JHU voice anonymization system and its performance on the Voice Privacy Challenge 2024. However, it does not address some potential limitations and areas for further research:

The system was only evaluated on a limited dataset, so its robustness and generalization to diverse real-world speech data are unclear. More extensive testing would be needed.
The paper does not discuss potential biases or fairness issues that could arise, such as the system performing better or worse for certain demographic groups or accents.
While the anonymized voices were rated as natural, there may be subtle perceptual differences compared to human-generated speech that could be detected by more sensitive evaluations.
The ethical implications of voice anonymization, such as potential misuse for deception or avoiding accountability, are not explored.

Addressing these types of concerns could strengthen the research and help ensure the responsible development of voice anonymization technologies.

Conclusion

The HLTCOE JHU submission to the Voice Privacy Challenge 2024 presents a compelling approach for anonymizing speaker identity while preserving speech content. By leveraging machine learning techniques like acoustic feature extraction, speaker embedding, and voice conversion, the system is able to transform voices into new, synthetic ones.

This work represents an important step forward in developing privacy-preserving voice technologies. As voice-based interfaces become increasingly ubiquitous, tools like this will be crucial for safeguarding individual privacy. However, further research is needed to fully understand the limitations and ethical considerations of such systems.

Overall, the HLTCOE JHU submission demonstrates the potential of voice anonymization to balance the benefits of speech technology with the need to protect sensitive personal information. As the field progresses, striking this balance will be a key challenge and priority.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

HLTCOE JHU Submission to the Voice Privacy Challenge 2024

Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

We present a number of systems for the Voice Privacy Challenge, including voice conversion based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance out the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.

9/18/2024
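
The abstract above does not spell out how the random admixture system mixes its components. One plausible reading, shown here purely as an assumption, is that each utterance is routed at random to either a voice conversion system or a TTS-based system. The names `make_admixture`, `vc_system`, and `tts_system`, and the 50/50 weights, are hypothetical stand-ins.

```python
import random

def make_admixture(systems, weights, seed=None):
    """Return an anonymizer that picks one system at random for each utterance."""
    rng = random.Random(seed)

    def anonymize(utterance):
        # Draw a single system according to the mixing weights (assumed scheme).
        system = rng.choices(systems, weights=weights, k=1)[0]
        return system(utterance)

    return anonymize

# Hypothetical stand-ins that only tag the input instead of processing audio.
def vc_system(utterance):
    return f"vc({utterance})"

def tts_system(utterance):
    return f"tts({utterance})"

anonymize = make_admixture([vc_system, tts_system], weights=[0.5, 0.5], seed=0)
outputs = [anonymize(f"utt{i}") for i in range(6)]
```

Under this reading, the mixing weights would act as a dial between the stronger emotion preservation of voice conversion and the stronger anonymization of TTS.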

The VoicePrivacy 2024 Challenge Evaluation Plan

Natalia Tomashenko, Xiaoxiao Miao, Pierre Champion, Sarina Meyer, Xin Wang, Emmanuel Vincent, Michele Panariello, Nicholas Evans, Junichi Yamagishi, Massimiliano Todisco

The task of the challenge is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content and emotional states. The organizers provide development and evaluation datasets and evaluation scripts, as well as baseline anonymization systems and a list of training resources formed on the basis of the participants' requests. Participants apply their developed anonymization systems, run evaluation scripts and submit evaluation results and anonymized speech data to the organizers. Results will be presented at a workshop held in conjunction with Interspeech 2024 to which all participants are invited to present their challenge systems and to submit additional workshop papers.

6/13/2024

NPU-NTU System for Voice Privacy 2024 Challenge

Jixun Yao, Nikita Kuzmin, Qing Wang, Pengcheng Guo, Ziqian Ning, Dake Guo, Kong Aik Lee, Eng-Siong Chng, Lei Xie

Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper, we describe our proposed speaker anonymization system for VPC 2024. Our system employs a disentangled neural codec architecture and a serial disentanglement strategy to gradually disentangle the global speaker identity and time-variant linguistic content and paralinguistic information. We introduce multiple distillation methods to disentangle linguistic content, speaker identity, and emotion. These methods include semantic distillation, supervised speaker distillation, and frame-level emotion distillation. Based on these distillations, we anonymize the original speaker identity using a weighted sum of a set of candidate speaker identities and a randomly generated speaker identity. Our system achieves the best trade-off of privacy protection and emotion preservation in VPC 2024.

9/9/2024
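
The NPU-NTU abstract above describes forming the anonymized identity from a weighted sum of candidate speaker identities and a randomly generated one. A minimal sketch of that idea follows, assuming identities are fixed-length embedding vectors. The function name `pseudo_speaker`, the Dirichlet mixing weights, the Gaussian random identity, and the `random_weight` parameter are all assumptions for illustration and are not taken from that paper.

```python
import numpy as np

def pseudo_speaker(candidates, rng, random_weight=0.5):
    """Blend candidate speaker embeddings with a random vector into one identity.

    candidates: (n_candidates, dim) array of speaker embeddings.
    """
    # Random convex weights over the candidate speakers (assumed choice).
    mix_weights = rng.dirichlet(np.ones(len(candidates)))
    blended = mix_weights @ candidates  # shape (dim,)

    # A randomly generated identity of the same dimensionality (assumed Gaussian).
    random_identity = rng.standard_normal(candidates.shape[1])

    embedding = (1.0 - random_weight) * blended + random_weight * random_identity
    # Length-normalise, as is common for speaker embeddings.
    return embedding / np.linalg.norm(embedding)

rng = np.random.default_rng(0)
candidates = rng.standard_normal((10, 32))  # 10 hypothetical candidate speakers
anon = pseudo_speaker(candidates, rng)
```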

The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation

Michele Panariello, Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Pierre Champion, Hubert Nourtel, Massimiliano Todisco, Nicholas Evans, Emmanuel Vincent, Junichi Yamagishi

The VoicePrivacy Challenge promotes the development of voice anonymisation solutions for speech technology. In this paper we present a systematic overview and analysis of the second edition held in 2022. We describe the voice anonymisation task and datasets used for system development and evaluation, present the different attack models used for evaluation, and the associated objective and subjective metrics. We describe three anonymisation baselines, provide a summary description of the anonymisation systems developed by challenge participants, and report objective and subjective evaluation results for all. In addition, we describe post-evaluation analyses and a summary of related work reported in the open literature. Results show that solutions based on voice conversion better preserve utility, that an alternative which combines automatic speech recognition with synthesis achieves greater privacy, and that a privacy-utility trade-off remains inherent to current anonymisation solutions. Finally, we present our ideas and priorities for future VoicePrivacy Challenge editions.

7/17/2024