Voice Passing : a Non-Binary Voice Gender Prediction System for evaluating Transgender voice transition

Read original: arXiv:2404.15176 - Published 4/24/2024 by David Doukhan, Simon Devauchelle, Lucile Girard-Monneron, Mía Chávez Ruz, V. Chaddouk, Isabelle Wagner, Albert Rilliard


Overview

This paper presents a software tool that can describe voices using a continuous Voice Femininity Percentage (VFP).
The system is intended to help transgender speakers during their voice transition and voice therapists supporting them.
The researchers recorded a corpus of 41 French cis- and transgender speakers and had 57 participants estimate the VFP for each voice.
Machine learning models were trained on external data to predict gender and were then calibrated to estimate VFP, achieving higher accuracy than models based on pitch or vocal tract length.

Plain English Explanation

The researchers have developed a software tool that can analyze someone's voice and provide a numerical estimate of how "feminine" or "masculine" it sounds on a continuous scale. This "Voice Femininity Percentage" (VFP) is intended to be useful for transgender individuals going through a voice transition, as well as the speech therapists who work with them.

To create this system, the researchers recorded voices from 41 French speakers, both cisgender (non-transgender) and transgender. They then had 57 human participants listen to the voices and estimate each one's VFP on a scale. The researchers then trained machine learning models on other datasets to predict the gender of voices. They calibrated these models to output the VFP values, and found them to be more accurate than simpler models based just on pitch or vocal tract length.

VFP estimation was affected by the speaking style of the training data and by the neural network architecture, and model accuracy varied with the age of the speakers. The authors take this as evidence that style, age, and whether gender is conceived as binary all matter when building statistical representations of cultural concepts such as gender.

Technical Explanation

The key elements of this paper are:

Corpus Collection: The researchers recorded a corpus of 41 French cis- and transgender speakers, which was used to calibrate and evaluate their models. The underlying gender classifiers were trained on external data, not on this corpus.
Perceptual Evaluation: 57 participants listened to the voice recordings and provided estimates of the Voice Femininity Percentage (VFP) for each one on a continuous scale.
Gender Prediction Models: The researchers trained binary gender classification models on external gender-balanced datasets. They then used these models on overlapping windows of the voice recordings to obtain average gender prediction scores, which were calibrated to output the VFP values.
Model Evaluation: The researchers found that the calibrated gender prediction models achieved higher accuracy in estimating VFP compared to simpler models based on features like fundamental frequency (F0) or vocal tract length.
Impact of Factors: The researchers observed that VFP estimation was affected by the speaking style of the training data and by the DNN architecture, and that model accuracy varied with speakers' age. They conclude that style, age, and whether gender is conceived as binary all matter when building statistical representations of cultural concepts.
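
The windowing, averaging, and calibration steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window and hop sizes, and the toy classifier are all hypothetical, and the linear least-squares calibration is an assumption, since the summary does not say which calibration method the paper uses.

```python
import math
from statistics import fmean


def sliding_windows(samples, win_len, hop_len):
    """Split a sequence into overlapping windows (overlap when hop_len < win_len)."""
    if len(samples) < win_len:
        raise ValueError("recording is shorter than one window")
    last_start = len(samples) - win_len
    return [samples[s:s + win_len] for s in range(0, last_start + 1, hop_len)]


def mean_window_score(samples, classify_window, win_len, hop_len):
    """Average a per-window binary-gender score over the whole recording."""
    scores = [classify_window(w) for w in sliding_windows(samples, win_len, hop_len)]
    return fmean(scores)


def fit_linear_calibration(raw_scores, perceived_vfp):
    """Least-squares line mapping raw scores to listener VFP (assumed method)."""
    mx, my = fmean(raw_scores), fmean(perceived_vfp)
    sxx = sum((x - mx) ** 2 for x in raw_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(raw_scores, perceived_vfp))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_vfp(raw_score, slope, intercept):
    """Apply the calibration and clip the result to the 0-100 percentage range."""
    return min(100.0, max(0.0, slope * raw_score + intercept))


def toy_classifier(window):
    """Placeholder for a trained binary gender model; NOT a real voice classifier."""
    return 1.0 / (1.0 + math.exp(-fmean(window)))


if __name__ == "__main__":
    # Hypothetical per-speaker raw scores and mean listener VFP ratings.
    raw = [0.1, 0.5, 0.9]
    perceived = [10.0, 50.0, 90.0]
    slope, intercept = fit_linear_calibration(raw, perceived)

    recording = [0.0] * 10  # stand-in for audio samples or features
    score = mean_window_score(recording, toy_classifier, win_len=4, hop_len=2)
    print(predict_vfp(score, slope, intercept))
```

In practice the per-window score would come from a classifier trained on external gender-balanced data, and the calibration targets would be the listeners' VFP estimates for each voice in the 41-speaker corpus.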

Critical Analysis

The paper acknowledges some important limitations and areas for further research. For example, the corpus was limited to French speakers, so the generalization to other languages and cultures is unclear. Additionally, the perceptual evaluation involved a relatively small number of raters, and their own biases and conceptions of gender may have influenced the VFP labels.

One potential issue not addressed in the paper is the ethical implications of such a system. While it may be useful for transgender individuals and their therapists, there are concerns about the potential for misuse, such as in surveillance or discrimination. The researchers could have discussed safeguards or guidelines for the responsible development and deployment of this technology.

Furthermore, the paper does not delve deeply into the sociocultural factors that shape perceptions of voice femininity. A more thorough exploration of these complex issues could provide valuable insights and inform the design of more inclusive and equitable voice analysis systems.

Conclusion

This paper presents a novel software tool that can estimate a continuous Voice Femininity Percentage (VFP) for voice recordings. The system is intended to support transgender individuals and their voice therapists during the transition process. The researchers found that gender classification models calibrated to predict VFP were more accurate than simpler models based on fundamental frequency (F0) or vocal tract length.

The study highlights the importance of factors such as speaking style, age, and whether gender is conceived as binary when building AI systems to represent cultural concepts like gender. Further research is needed to address the ethical implications of such technology and to explore more deeply the sociocultural complexities underlying perceptions of voice femininity.

Overall, this work represents an important step towards developing more inclusive and nuanced voice analysis tools, with potential applications in speech-to-speech and text-to-speech systems, as well as voice privacy and multimodal applications.



Related Papers


Voice Passing : a Non-Binary Voice Gender Prediction System for evaluating Transgender voice transition

David Doukhan, Simon Devauchelle, Lucile Girard-Monneron, Mía Chávez Ruz, V. Chaddouk, Isabelle Wagner, Albert Rilliard

This paper presents software for describing voices using a continuous Voice Femininity Percentage (VFP). This system is intended for transgender speakers during their voice transition and for voice therapists supporting them in this process. A corpus of 41 French cis- and transgender speakers was recorded. A perceptual evaluation allowed 57 participants to estimate the VFP for each voice. Binary gender classification models were trained on external gender-balanced data and used on overlapping windows to obtain average gender prediction estimates, which were calibrated to predict VFP and obtained higher accuracy than $F_0$ or vocal tract length-based models. Training data speaking style and DNN architecture were shown to impact VFP estimation. Accuracy of the models was affected by speakers' age. This highlights the importance of style, age, and the conception of gender as binary or not, to build adequate statistical representations of cultural concepts.

4/24/2024

Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology

Robin Netzorg, Alyssa Cote, Sumi Koshin, Klo Vivienne Garoute, Gopala Krishna Anumanchipalli

As experts in voice modification, trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity. To demonstrate this, we present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes. The VVD illustrates that current approaches in speaker modeling, based on categorical notions of gender and a static understanding of vocal texture, fail to account for the flexibility of the vocal tract. Utilizing publicly-available speaker embeddings, we demonstrate that gender classification systems are highly sensitive to voice modification, and speaker verification systems fail to identify voices as coming from the same speaker as voice modification becomes more drastic. As one path towards moving beyond categorical and static notions of speaker identity, we propose modeling individual qualities of vocal texture such as pitch, resonance, and weight.

7/11/2024


Evolution of Voices in French Audiovisual Media Across Genders and Age in a Diachronic Perspective

Albert Rilliard, David Doukhan, Rémi Uro, Simon Devauchelle

We present a diachronic acoustic analysis of the voice of 1023 speakers from French media archives. The speakers are spread across 32 categories based on four periods (years 1955/56, 1975/76, 1995/96, 2015/16), four age groups (20-35; 36-50; 51-65, >65), and two genders. The fundamental frequency ($F_0$) and the first four formants (F1-4) were estimated. Procedures used to ensure the quality of these estimations on heterogeneous data are described. From each speaker's $F_0$ distribution, the base-$F_0$ value was calculated to estimate the register. Average vocal tract length was estimated from formant frequencies. Base-$F_0$ and vocal tract length were fit by linear mixed models to evaluate how they may have changed across time periods and genders, corrected for age effects. Results show an effect of the period with a tendency to lower voices, independently of gender. A lowering of pitch is observed with age for female but not male speakers.

4/26/2024

Voice Disorder Analysis: a Transformer-based Approach

Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Giovanni Succo, Erika Crosetti, Tania Cerquitelli, Elena Baralis

Voice disorders are pathologies significantly affecting patient quality of life. However, non-invasive automated diagnosis of these pathologies is still under-explored, due to both a shortage of pathological voice data, and diversity of the recording types used for the diagnosis. This paper proposes a novel solution that adopts transformers directly working on raw voice signals and addresses data shortage through synthetic data generation and data augmentation. Further, we consider many recording types at the same time, such as sentence reading and sustained vowel emission, by employing a Mixture of Expert ensemble to align the predictions on different data types. The experimental results, obtained on both public and private datasets, show the effectiveness of our solution in the disorder detection and classification tasks and largely improve over existing approaches.

6/24/2024