Vokan

Maintainer: ShoukanLabs

Total Score: 54

Last updated: 7/18/2024

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • GitHub link: No GitHub link provided
  • Paper link: No paper link provided

Model overview

Vokan is a fine-tuned StyleTTS2 model designed for authentic and expressive zero-shot speech synthesis. It was created by ShoukanLabs and trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, giving it broad coverage of accents and speaking contexts and helping it generate natural, high-quality synthesized speech.

Model inputs and outputs

Inputs

  • Text to be converted to speech

Outputs

  • Synthesized speech audio

Capabilities

Vokan captures a wide range of vocal characteristics, which is what makes its output expressive and natural-sounding. Trained on more than six days of audio spanning 672 diverse and expressive speakers, the model handles a broad array of accents and contexts.

What can I use it for?

Vokan can be used in a variety of applications that require high-quality text-to-speech (TTS) capabilities, such as audiobook production, voice assistants, and multimedia content creation. Its expressive and natural-sounding synthesis makes it a compelling choice for projects that require a human-like voice.

Things to try

Experiment with Vokan by providing it with different types of text, ranging from formal to informal, to see how it handles various styles and tones. Additionally, you can explore its potential by integrating it into your own projects and observing its performance in real-world scenarios.
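To make that experiment concrete, here is a minimal sketch that loops a few contrasting text styles through a StyleTTS2-style inference call. It uses the third-party styletts2 Python package, and the class name, constructor arguments, and checkpoint paths are all assumptions (Vokan's weights would first need to be downloaded from its HuggingFace page), so treat it as a starting point rather than a verified recipe.

```python
# pip install styletts2  (community StyleTTS2 wrapper; its interface is assumed here)
from styletts2 import tts

# Placeholder paths: download the Vokan checkpoint and config from the
# ShoukanLabs page on HuggingFace and point to them locally.
engine = tts.StyleTTS2(
    model_checkpoint_path="checkpoints/vokan.pth",
    config_path="checkpoints/vokan_config.yml",
)

# Contrast a formal and an informal register to hear how the delivery shifts.
samples = {
    "formal": "Good evening. Tonight's briefing covers three agenda items.",
    "informal": "Hey! You won't believe what happened at lunch today.",
}

for style, text in samples.items():
    # Input: plain text. Output: synthesized speech written to a WAV file.
    engine.inference(text, output_wav_file=f"vokan_{style}.wav")
```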



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

kotoba-whisper-v1.0

Maintainer: kotoba-tech

Total Score: 46

kotoba-whisper-v1.0 is a distilled version of the Whisper model for Japanese automatic speech recognition (ASR). It was developed through a collaboration between Asahi Ushio and Kotoba Technologies. The model is based on the distil-whisper approach, which uses knowledge distillation to create a smaller and faster model while retaining performance. kotoba-whisper-v1.0 is 6.3x faster than the openai/whisper-large-v3 model, while achieving comparable or better character error rate (CER) and word error rate (WER) on Japanese speech recognition tasks.

Model inputs and outputs

Inputs

  • Audio data in the form of PCM waveforms at a sampling rate of 16kHz

Outputs

  • Japanese text transcriptions of the input audio

Capabilities

kotoba-whisper-v1.0 demonstrates strong performance on Japanese speech recognition tasks, outperforming the larger openai/whisper-large-v3 model on the ReazonSpeech test set. It also achieves competitive results on out-of-domain datasets like the JSUT basic 5000 and the Japanese subset of CommonVoice 8.0.

What can I use it for?

The kotoba-whisper-v1.0 model can be used for a variety of Japanese speech-to-text applications, such as:

  • Transcribing audio recordings of meetings, lectures, or other spoken content
  • Powering voice-controlled interfaces for Japanese-speaking users
  • Improving accessibility by providing captions or subtitles for Japanese audio and video

The model's speed and efficiency make it a good choice for deployment in production environments where low latency is important.

Things to try

One interesting aspect of kotoba-whisper-v1.0 is its use of a WER-based filter to ensure the quality of the training data. By removing examples with high word error rates, the model is able to learn from a more accurate set of transcriptions, which likely contributes to its strong performance. You could experiment with applying similar data filtering techniques when fine-tuning the model on your own datasets. Additionally, the training and evaluation code for kotoba-whisper-v1.0 is available on GitHub, which provides a good starting point for reproducing the model or adapting it to your specific needs.
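As a rough sketch of how you might run the model with the Hugging Face transformers pipeline (the audio filename and generation settings below are assumptions; the model card on HuggingFace documents the maintainers' recommended usage):

```python
# pip install transformers  (plus an audio backend such as soundfile/librosa)
from transformers import pipeline

# Load the distilled Japanese ASR model from the Hugging Face Hub.
asr = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",
)

# "sample_16khz.wav" is a placeholder: supply your own 16 kHz recording.
result = asr(
    "sample_16khz.wav",
    generate_kwargs={"language": "japanese", "task": "transcribe"},  # assumed settings
)
print(result["text"])
```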

styletts2

Maintainer: adirik

Total Score: 4.2K

styletts2 is a text-to-speech (TTS) model developed by Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, and Nima Mesgarani. It leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. Unlike its predecessor, styletts2 models styles as a latent random variable through diffusion models, allowing it to generate the most suitable style for the text without requiring reference speech. It also employs large pre-trained SLMs, such as WavLM, as discriminators, together with a novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness.

Model inputs and outputs

styletts2 takes in text and generates high-quality speech audio.

Inputs

  • Text: The text to be converted to speech.
  • Beta: Controls the prosody of the generated speech; lower values sample style from previous or reference speech, higher values sample more from the text.
  • Alpha: Controls the timbre of the generated speech; lower values sample style from previous or reference speech, higher values sample more from the text.
  • Reference: Optional reference speech audio to copy the style from.
  • Diffusion Steps: The number of diffusion steps used during generation; higher values give better quality at the cost of longer generation time.
  • Embedding Scale: A scaling factor for the text embedding, which can be used to produce more pronounced emotion in the generated speech.

Outputs

  • Audio: The generated speech audio, returned as a URI.

Capabilities

styletts2 achieves human-level TTS synthesis on both single-speaker and multi-speaker datasets. It surpasses human recordings on the LJSpeech dataset and matches human performance on the VCTK dataset. When trained on the LibriTTS dataset, styletts2 also outperforms previous publicly available models for zero-shot speaker adaptation.

What can I use it for?

styletts2 can be used for applications that require high-quality text-to-speech generation, such as audiobook production, voice assistants, and language learning tools. The ability to control the prosody and timbre of the generated speech, as well as the option to use reference audio, makes styletts2 a versatile tool for creating personalized and expressive speech output.

Things to try

One interesting aspect of styletts2 is its ability to perform zero-shot speaker adaptation on the LibriTTS dataset: the model can generate speech in the style of speakers it has not been explicitly trained on by leveraging the diverse speech synthesis offered by the diffusion model. Developers could explore the limits of this zero-shot adaptation and experiment with fine-tuning the model on new speakers to further improve the quality and diversity of the generated speech.
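For the Replicate deployment, a run might look roughly like the sketch below. The model identifier, input keys, and values are assumptions inferred from the parameter descriptions above, so check the model's API spec for the authoritative schema.

```python
# pip install replicate  (expects a REPLICATE_API_TOKEN environment variable)
import replicate

# Identifier and input keys are assumed from the parameters described above;
# verify them against the model's API spec before relying on this.
output = replicate.run(
    "adirik/styletts2",
    input={
        "text": "Style diffusion lets the model pick a fitting speaking style.",
        "alpha": 0.3,            # timbre: lower leans on reference/previous style
        "beta": 0.7,             # prosody: higher samples more from the text
        "diffusion_steps": 10,   # more steps = better quality, slower generation
        "embedding_scale": 1.5,  # larger values = more pronounced emotion
    },
)
print(output)  # URI of the generated audio
```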

kan-bayashi_ljspeech_vits

Maintainer: espnet

Total Score: 201

The kan-bayashi/ljspeech_vits model is an ESPnet2 text-to-speech (TTS) model trained on the LJSpeech dataset. It uses the VITS architecture (Variational Inference with adversarial learning for end-to-end Text-to-Speech), an end-to-end approach that generates audio waveforms directly from the input text rather than relying on a separate vocoder. The model was developed by the ESPnet team, a group of researchers building an open-source end-to-end speech processing toolkit. Similar TTS models include mio/amadeus and facebook/fastspeech2-en-ljspeech, both also trained on the LJSpeech dataset; these use different architectures, such as FastSpeech 2 paired with a HiFi-GAN vocoder, to generate speech from text.

Model inputs and outputs

Inputs

  • Text: The text to be converted into an audio waveform.

Outputs

  • Audio waveform: The synthesized speech.

Capabilities

The kan-bayashi/ljspeech_vits model generates high-quality, natural-sounding speech from input text. Because VITS is end-to-end, the model produces audio directly from text without a separate vocoder model.

What can I use it for?

This TTS model can be used to build applications that require text-to-speech functionality, such as audiobook creation, voice assistants, or other text-to-speech tools. Since it was trained on the single-speaker LJSpeech dataset, it is best suited to generating speech in a female, English-speaking voice.

Things to try

Experiment with the kan-bayashi/ljspeech_vits model by generating audio from different types of text, such as news articles, books, or user-generated content. You can also compare its output to other TTS models, such as fastspeech2-en-ljspeech or tts-tacotron2-ljspeech, in terms of speech quality and naturalness.
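A minimal ESPnet2 inference sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub as espnet/kan-bayashi_ljspeech_vits and that the espnet and espnet_model_zoo packages are installed; treat the identifier and output handling as assumptions to verify against the model card.

```python
# pip install espnet espnet_model_zoo soundfile  (package set is an assumption)
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Hub identifier is assumed from the model name; verify it on HuggingFace.
text2speech = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")

# VITS is end-to-end: text goes in, a waveform comes out (no separate vocoder).
output = text2speech("ESPnet makes end-to-end speech synthesis straightforward.")
sf.write("ljspeech_vits_demo.wav", output["wav"].numpy(), text2speech.fs)
```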
