The Database and Benchmark for Source Speaker Verification Against Voice Conversion

Read original: arXiv:2406.04951 - Published 6/10/2024 by Ze Li, Yuke Lin, Tian Yao, Hongbin Suo, Ming Li

Overview

This paper introduces a new database and benchmark for evaluating source speaker verification against voice conversion.
The database contains audio recordings from multiple speakers, both before and after voice conversion.
The benchmark is designed to test whether speaker verification systems can identify the source (original) speaker behind a converted voice, rather than the target speaker the voice was converted to sound like.

Plain English Explanation

This research paper presents a new database and evaluation system for testing how well speaker verification algorithms can identify the original speaker, even when their voice has been altered through a process called voice conversion. Voice conversion is a technique that can change someone's voice to sound like a different person.

The database includes audio recordings from multiple speakers. For each speaker, there are recordings of their natural voice, as well as recordings where their voice has been converted to sound like a different person. This allows researchers to evaluate how well speaker verification systems can detect when a converted voice is actually the original speaker, and not the new person it's made to sound like.

This is an important problem, as voice conversion technology is becoming more advanced and accessible. Being able to reliably verify a speaker's identity, even when their voice has been altered, has applications in areas like speech-based authentication and text-to-speech systems. The benchmark provided in this paper gives researchers a standardized way to test and compare different speaker verification approaches in this challenging scenario.

Technical Explanation

The paper introduces a new speech corpus called VoiceConversionEvaluation (VCE), which contains audio recordings from 100 speakers. For each speaker, the database includes:

Natural speech recordings, capturing the speaker's unmodified voice.
Voice conversion recordings, where the speaker's voice has been altered to sound like a different person using voice conversion techniques.

This dataset allows researchers to evaluate speaker verification systems in the context of voice conversion attacks, where an imposter tries to impersonate a target speaker by converting their voice. The authors define a benchmark task to test the ability of speaker verification models to correctly identify the original speaker, even when their voice has been converted.
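To make the benchmark task more concrete, here is a minimal sketch of how source speaker verification trials might be scored, assuming each utterance has already been mapped to a fixed-length embedding. The toy vectors, the function names, and the choice of cosine scoring are illustrative assumptions rather than the authors' implementation. The equal error rate (EER) used here is a common summary metric for verification systems.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    labels: 1 = both utterances share a source speaker, 0 = they do not.
    A trial is accepted when its score is >= the threshold.
    """
    targets = [s for s, l in zip(scores, labels) if l == 1]
    nontargets = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        far = sum(s >= threshold for s in nontargets) / len(nontargets)  # false accepts
        frr = sum(s < threshold for s in targets) / len(targets)         # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical toy trials: (embedding_a, embedding_b, label).
# In practice the embeddings would come from a speaker embedding extractor
# applied to converted speech.
trials = [
    ([1.0, 0.1], [0.9, 0.2], 1),
    ([0.2, 1.0], [0.1, 0.8], 1),
    ([1.0, 0.1], [0.1, 0.9], 0),
    ([0.9, 0.3], [0.2, 1.0], 0),
]
scores = [cosine(a, b) for a, b, _ in trials]
labels = [label for _, _, label in trials]
print(f"EER on toy trials: {equal_error_rate(scores, labels):.3f}")
```

The threshold sweep is deliberately simple. A real evaluation would typically use many thousands of trials and an interpolated EER.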

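The paper also presents baseline results. According to the abstract, the authors train a batch of baseline systems based on the MFA-Conformer architecture for the source speaker verification task. They also introduce a related task, conversion method recognition, which they address with an adapter-based multi-task learning approach that does not compromise source speaker verification performance. For conversion methods not seen during training, they apply an open-set nearest neighbor approach.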
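The paper's abstract describes handling unseen conversion methods with an open-set nearest neighbor approach. As a rough illustration of the general idea, the sketch below uses a generic distance-thresholded nearest-neighbor rule: a query is assigned the label of its closest known example, unless that example is too far away, in which case it is rejected as unknown. The function names, the labels, the Euclidean distance, and the threshold value are all hypothetical, and the authors' actual formulation may differ.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def open_set_nearest_neighbor(query, gallery, threshold):
    """Return the label of the nearest gallery item, or "unknown".

    gallery: list of (embedding, label) pairs for known conversion methods.
    threshold: maximum distance at which a match is still accepted
               (an assumed tuning parameter, not taken from the paper).
    """
    best_label, best_dist = None, float("inf")
    for embedding, label in gallery:
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else "unknown"

# Hypothetical gallery of embeddings for two known conversion methods.
gallery = [
    ([0.0, 0.0], "method_A"),
    ([10.0, 10.0], "method_B"),
]
print(open_set_nearest_neighbor([0.5, 0.0], gallery, threshold=1.0))
print(open_set_nearest_neighbor([5.0, 5.0], gallery, threshold=1.0))
```

The threshold trades off the two kinds of error: a larger value accepts more queries as known methods, while a smaller value rejects more of them as unknown.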

Critical Analysis

The authors acknowledge that the VCE dataset, while a valuable resource, has some limitations. The voice conversion process used to create the altered recordings may not fully capture the nuances and artifacts that could arise from real-world voice conversion attacks. Additionally, the dataset only includes recordings in English, so the generalization of the benchmark to other languages is not yet clear.

Further research could explore expanding the VCE dataset to include more diverse speakers, languages, and voice conversion techniques. Investigating the robustness of speaker verification models to more advanced or tailored voice conversion attacks would also be a valuable direction for future work.

Conclusion

This paper presents a new database and benchmark for evaluating speaker verification systems against voice conversion attacks. The VCE dataset provides a standardized way for researchers to test the ability of speaker verification models to correctly identify the original speaker, even when their voice has been altered. The baseline systems, which the abstract describes as based on the MFA-Conformer architecture, give future work a reference point for source speaker verification on converted speech. This work contributes an important new resource for advancing research in speaker verification and voice conversion security.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

The Database and Benchmark for Source Speaker Verification Against Voice Conversion

Ze Li, Yuke Lin, Tian Yao, Hongbin Suo, Ming Li

Voice conversion systems can transform audio to mimic another speaker's voice, thereby attacking speaker verification systems. However, ongoing studies on source speaker verification are hindered by limited data availability and methodological constraints. In this paper, we generate a large-scale converted speech database and train a batch of baseline systems based on the MFA-Conformer architecture to promote the source speaker verification task. In addition, we introduce a related task called conversion method recognition. An adapter-based multi-task learning approach is employed to achieve effective conversion method recognition without compromising source speaker verification performance. Additionally, we investigate and effectively address the open-set conversion method recognition problem through the implementation of an open-set nearest neighbor approach.

6/10/2024

Who is Authentic Speaker

Qiang Huang

Voice conversion (VC) using deep learning technologies can now generate high quality one-to-many voices and thus has been used in some practical application fields, such as entertainment and healthcare. However, voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes. Moreover, it is a big challenge to find who are real speakers from the converted voices as the acoustic characteristics of source speakers are changed greatly. In this paper we attempt to explore the feasibility of identifying authentic speakers from converted voices. This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices. Therefore our experiments are geared towards recognising the source speakers given the converted voices, which are generated by using FragmentVC on the randomly paired utterances from source and target speakers. To improve the robustness against converted voices, our recognition model is constructed by using hierarchical vector of locally aggregated descriptors (VLAD) in deep neural networks. The authentic speaker recognition system is mainly tested in two aspects, including the impact of quality of converted voices and the variations of VLAD. The dataset used in this work is VCTK corpus, where source and target speakers are randomly paired. The results obtained on the converted utterances show promising performances in recognising authentic speakers from converted voices.

5/2/2024

Speaker Contrastive Learning for Source Speaker Tracing

Qing Wang, Hongmei Guo, Jian Kang, Mengjie Du, Jie Li, Xiao-Lei Zhang, Lei Xie

As a form of biometric authentication technology, the security of speaker verification (SV) systems is of utmost importance. However, SV systems are inherently vulnerable to various types of attacks that can compromise their accuracy and reliability. One such attack is voice conversion, which modifies a person's speech to sound like another person by altering various vocal characteristics. This poses a significant threat to SV systems. To address this challenge, the Source Speaker Tracing Challenge (SSTC) in IEEE SLT2024 aims to identify the source speaker information in manipulated speech signals. Specifically, SSTC focuses on source speaker verification against voice conversion to determine whether two converted speech samples originate from the same source speaker. In this study, we propose a speaker contrastive learning-based approach for source speaker tracing to learn the latent source speaker information in converted speech. To learn a more source-speaker-related representation, we employ speaker contrastive loss during the training of the embedding extractor. This speaker contrastive loss helps identify the true source speaker embedding among several distractor speaker embeddings, enabling the embedding extractor to learn the source speaker information potentially present in the converted speech. Experiments demonstrate that our proposed speaker contrastive learning system achieves the lowest EER of 16.788% on the challenge test set, securing first place in the challenge.

9/17/2024

Source-Free Domain Adaptation for Speaker Verification in Data-Scarce Languages and Noisy Channels

Shlomo Salo Elia, Aviad Malachi, Vered Aharonson, Gadi Pinkas

Domain adaptation is often hampered by exceedingly small target datasets and inaccessible source data. These conditions are prevalent in speech verification, where privacy policies and/or languages with scarce speech resources limit the availability of sufficient data. This paper explored techniques of source-free domain adaptation unto a limited target speech dataset for speaker verification in data-scarce languages. Both language and channel mismatch between source and target were investigated. Fine-tuning methods were evaluated and compared across different sizes of labeled target data. A novel iterative cluster-learn algorithm was studied for unlabeled target datasets.

6/11/2024