KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis

Read original: arXiv:2404.01033 - Published 4/11/2024 by Adal Abilbekov, Saida Mussakhojayeva, Rustem Yeshpanov, Huseyin Atakan Varol

Overview

  • This paper introduces the KazEmoTTS dataset, a new dataset for Kazakh emotional text-to-speech (TTS) synthesis.
  • The dataset pairs audio recordings from three Kazakh narrators (one female, two male) with the corresponding text and one of six emotion labels.
  • The paper describes how the dataset was constructed and evaluates a TTS model trained on it, using both objective and subjective measures.

Plain English Explanation

The researchers who created this dataset wanted to support the development of Kazakh text-to-speech (TTS) systems that can convey emotions. Many TTS systems speak in a single neutral voice. By providing recordings of Kazakh narrators expressing different emotions, the researchers hope to let developers train TTS models that sound more natural and convey the intended emotion when converting text to speech.

The KazEmoTTS dataset contains audio recordings of three Kazakh narrators reading text in six emotional styles: neutral, angry, happy, sad, scared, and surprised. Each recording is paired with its text and labeled with the intended emotion. The researchers collected and processed this data so that it can be used to train and test Kazakh emotional TTS systems.

By making this dataset publicly available, the researchers hope it will spur further development of Kazakh language TTS capabilities that can communicate more expressively and naturally. This could have applications in areas like Kazakh language virtual assistants, machine translation, and question answering systems that need to generate more human-like speech output. Overall, this dataset is an important contribution to advancing Kazakh language technology and natural language processing.

Technical Explanation

The researchers constructed the KazEmoTTS dataset by recording three narrators, one female and two male, reading text in different emotional styles. The recordings cover six emotion categories: neutral, angry, happy, sad, scared, and surprised. Each text was recorded with the narrator conveying the intended emotion.

The researchers then processed the raw audio recordings, segmented them, and added corresponding transcripts and emotion labels. This resulted in 54,760 audio-text pairs with a total duration of 74.85 hours: 34.23 hours from the female narrator and 40.62 hours from the two male narrators. The dataset was split into training, validation, and test sets to enable benchmarking of Kazakh emotional TTS systems.
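As a quick sanity check on the corpus figures quoted in the paper's abstract (54,760 audio-text pairs, 74.85 hours in total, 34.23 hours female and 40.62 hours male), the short Python sketch below confirms that the per-narrator hours add up and derives the average clip length. The constant and function names are illustrative and do not come from the authors' code.

```python
# Figures taken from the paper's abstract; names are illustrative only.
FEMALE_HOURS = 34.23   # hours recorded by the female narrator
MALE_HOURS = 40.62     # hours recorded by the two male narrators
TOTAL_HOURS = 74.85    # total duration reported for the corpus
NUM_PAIRS = 54_760     # number of audio-text pairs


def mean_clip_seconds(total_hours: float, num_pairs: int) -> float:
    """Average clip length in seconds for a corpus of the given size."""
    if num_pairs <= 0:
        raise ValueError("num_pairs must be positive")
    return total_hours * 3600.0 / num_pairs


def share(part_hours: float, total_hours: float) -> float:
    """Fraction of the corpus duration contributed by one group."""
    return part_hours / total_hours


# The per-narrator hours should add up to the reported total.
assert abs(FEMALE_HOURS + MALE_HOURS - TOTAL_HOURS) < 1e-9

print(f"mean clip length: {mean_clip_seconds(TOTAL_HOURS, NUM_PAIRS):.2f} s")
print(f"female share of audio: {share(FEMALE_HOURS, TOTAL_HOURS):.1%}")
```

The average works out to roughly five seconds per clip, which suggests sentence-length utterances rather than long passages.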

The researchers also trained a TTS model on the KazEmoTTS dataset and evaluated the synthesized speech both objectively and subjectively. The objective metric, mel-cepstral distortion (MCD), ranged from 6.02 to 7.67, and the subjective mean opinion score (MOS) ranged from 3.51 to 3.57. The code, pre-trained model, and dataset are available in the authors' GitHub repository.
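The paper's abstract reports two quality measures for the synthesized speech: mel-cepstral distortion (MCD) and a mean opinion score (MOS). The Python sketch below shows how such numbers are typically computed. It assumes the commonly used MCD definition, frames that are already time-aligned, and mel-cepstral vectors with the 0th (energy) coefficient dropped by the caller. The abstract does not state which MCD variant or alignment the authors used, and the function names here are hypothetical.

```python
import math


def mcd_frame(ref: list[float], syn: list[float]) -> float:
    """MCD in dB between two mel-cepstral vectors of equal length.

    Assumed definition: (10 / ln 10) * sqrt(2 * sum_d (ref_d - syn_d)^2).
    """
    if len(ref) != len(syn):
        raise ValueError("coefficient vectors must have the same length")
    squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


def mcd(ref_frames: list[list[float]], syn_frames: list[list[float]]) -> float:
    """Mean MCD over frame pairs that are assumed to be time-aligned already."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need the same non-zero number of frames on both sides")
    total = sum(mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


def mean_opinion_score(ratings: list[float]) -> float:
    """MOS as the plain average of listener ratings (typically on a 1-5 scale)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(ratings) / len(ratings)


# Toy data: identical frames give zero distortion.
reference = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
print(mcd(reference, reference))  # 0.0
print(mean_opinion_score([4, 3, 4, 3]))
```

Lower MCD means the synthesized spectrum is closer to the reference recording, while higher MOS means listeners rated the speech as better. In practice the two utterances differ in length, so an alignment step such as dynamic time warping usually precedes the frame-wise comparison.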

Critical Analysis

The KazEmoTTS dataset represents an important step forward for Kazakh language technology, as it provides a valuable resource for developing more expressive and natural-sounding Kazakh text-to-speech systems. However, the dataset covers only three narrators and six emotion categories.

While this provides a good starting point, further expansion of the dataset with more diverse content, speakers, and emotional nuances would be beneficial. Additionally, the paper does not address potential biases or limitations in the dataset, such as the demographics of the speakers or the quality of the emotion annotations.

Future work could also explore ways to incorporate the KazEmoTTS dataset into end-to-end Kazakh TTS systems, rather than just using it for fine-tuning. Integrating the emotional expressiveness directly into the TTS model architecture may lead to even more natural-sounding Kazakh speech output.

Conclusion

The KazEmoTTS dataset is a valuable contribution to the field of Kazakh language technology, providing a high-quality resource for developing emotional text-to-speech systems. By making this dataset publicly available, the researchers have opened up new avenues for advancing Kazakh TTS capabilities, which could have far-reaching impacts on Kazakh language virtual assistants, machine translation, and question answering systems.

While the dataset has some limitations, it represents an important step forward and lays the groundwork for future research and development in Kazakh emotional speech technology. As the field continues to evolve, datasets like KazEmoTTS will play a crucial role in pushing the boundaries of what is possible for Kazakh language AI systems.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis

Adal Abilbekov, Saida Mussakhojayeva, Rustem Yeshpanov, Huseyin Atakan Varol

This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs, with a total duration of 74.85 hours, featuring 34.23 hours delivered by a female narrator and 40.62 hours by two male narrators. The list of the emotions considered include neutral, angry, happy, sad, scared, and surprised. We also developed a TTS model trained on the KazEmoTTS dataset. Objective and subjective evaluations were employed to assess the quality of synthesized speech, yielding an MCD score within the range of 6.02 to 7.67, alongside a MOS that spanned from 3.51 to 3.57. To facilitate reproducibility and inspire further research, we have made our code, pre-trained model, and dataset accessible in our GitHub repository.

4/11/2024

KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes

Rustem Yeshpanov, Huseyin Atakan Varol

This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5, providing a quantitative representation of customer attitudes. The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models trained for both polarity classification and score classification. Experimental analysis included evaluation of the results considering both balanced and imbalanced scenarios. The most successful model attained an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets. The dataset and fine-tuned models are open access and available for download under the Creative Commons Attribution 4.0 International License (CC BY 4.0) through our GitHub repository.

4/11/2024

nEMO: Dataset of Emotional Speech in Polish

Iwona Christop

Speech emotion recognition has become increasingly important in recent years due to its potential applications in healthcare, customer service, and personalization of dialogue systems. However, a major issue in this field is the lack of datasets that adequately represent basic emotional states across various language families. As datasets covering Slavic languages are rare, there is a need to address this research gap. This paper presents the development of nEMO, a novel corpus of emotional speech in Polish. The dataset comprises over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material used was carefully selected to represent the phonetics of the Polish language adequately. The corpus is freely available under the terms of a Creative Commons license (CC BY-NC-SA 4.0).

4/10/2024

ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages

Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based TTS model, achieving a Mean Opinion Score (MOS) of 3.76, which is remarkably close to the MOS of 3.86 for the utterances generated by the same vocoder and natural spectrogram, and the MOS of 4.01 for the natural waveform, demonstrating the exceptional quality and effectiveness of the corpus.

9/12/2024