canary-1b

Maintainer: nvidia

Total Score

191

Last updated 5/28/2024


  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The canary-1b model is part of the NVIDIA NeMo Canary family of multilingual, multi-tasking models. With 1 billion parameters, Canary-1B supports automatic speech recognition (ASR) in four languages (English, German, French, Spanish) and speech-to-text translation from English into German/French/Spanish and from German/French/Spanish into English, with or without punctuation and capitalization (PnC) in the output. The model uses a FastConformer-Transformer encoder-decoder architecture, pairing a FastConformer encoder with a Transformer decoder.

Model inputs and outputs

Inputs

  • Audio files or a jsonl manifest file containing audio data (see the example manifest below)

Outputs

  • Transcribed text in the specified language (English, German, French, Spanish)
  • Translated text to/from the specified language pair
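
If you drive the model from a manifest instead of a plain list of files, each line of the jsonl file describes one utterance and its task. The snippet below writes a minimal two-entry manifest; the field names and task labels follow the conventions shown on the Canary-1B model card, but treat them as illustrative and verify the exact schema against the card for your NeMo version.

```python
import json

# Illustrative manifest entries: one ASR request and one English->German
# speech translation request. Field names/values mirror the Canary-1B model
# card conventions but should be double-checked for your NeMo release.
entries = [
    {
        "audio_filepath": "/data/sample_en.wav",
        "duration": 4.2,               # seconds (approximate is fine)
        "taskname": "asr",             # transcribe in the source language
        "source_lang": "en",
        "target_lang": "en",
        "pnc": "yes",                  # keep punctuation and capitalization
        "answer": "na",
    },
    {
        "audio_filepath": "/data/sample_en.wav",
        "duration": 4.2,
        "taskname": "s2t_translation", # speech translation (label may differ)
        "source_lang": "en",
        "target_lang": "de",
        "pnc": "yes",
        "answer": "na",
    },
]

with open("input_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```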

Capabilities

The Canary-1B model demonstrates state-of-the-art performance on multiple benchmarks for ASR and translation tasks in the supported languages. It can handle various accents, background noise, and technical language well.

What can I use it for?

The canary-1b model is well-suited for research on robust, multi-lingual speech recognition and translation. It can also be fine-tuned on specific datasets to improve performance for particular domains or applications. Developers may find it useful as a pre-trained model for building ASR or translation tools, especially for the supported languages.

Things to try

You can experiment with the canary-1b model by loading it using the NVIDIA NeMo toolkit. Try transcribing or translating audio samples in different languages, and compare the results to your expectations or other models. You can also fine-tune the model on your own data to see how it performs on specific tasks or domains.
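
For example, a minimal transcription run with the NeMo toolkit looks roughly like the sketch below. The class name and transcribe() arguments follow the usage shown on the Canary-1B model card at the time of writing; they can differ slightly between NeMo releases, so treat this as a starting point rather than a definitive recipe.

```python
# pip install "nemo_toolkit[asr]"
from nemo.collections.asr.models import EncDecMultiTaskModel

# Download the checkpoint from HuggingFace and build the model.
canary_model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# Optional: switch to greedy decoding for faster inference.
decode_cfg = canary_model.cfg.decoding
decode_cfg.beam.beam_size = 1
canary_model.change_decoding_strategy(decode_cfg)

# Plain English ASR on a list of 16 kHz mono WAV files.
predicted_text = canary_model.transcribe(
    paths2audio_files=["/data/sample_en.wav"],
    batch_size=16,
)
print(predicted_text)

# For translation, or per-utterance task control, pass the jsonl manifest
# described above instead of a list of file paths:
# predicted_text = canary_model.transcribe("input_manifest.json", batch_size=16)
```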



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


parakeet-rnnt-1.1b

nvidia

Total Score

98

The parakeet-rnnt-1.1b is an ASR (Automatic Speech Recognition) model developed jointly by the NVIDIA NeMo and Suno.ai teams. It uses the FastConformer Transducer architecture, which is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. This XXL model has around 1.1 billion parameters and can transcribe speech in lower case English alphabet with high accuracy. The model is similar to other high-performing ASR models like Canary-1B, which also uses the FastConformer architecture but supports multiple languages. In contrast, the parakeet-rnnt-1.1b is focused solely on English speech transcription.

Model Inputs and Outputs

Inputs

  • 16000 Hz mono-channel audio (WAV files)

Outputs

  • Transcribed speech as a string for a given audio sample

Capabilities

The parakeet-rnnt-1.1b model demonstrates state-of-the-art performance on English speech recognition tasks. It was trained on a large, diverse dataset of 85,000 hours of speech data from various public and private sources, including LibriSpeech, Fisher Corpus, Switchboard, and more.

What Can I Use It For?

The parakeet-rnnt-1.1b model is well-suited for a variety of speech-to-text applications, such as voice transcription, dictation, and audio captioning. It could be particularly useful in scenarios where high-accuracy English speech recognition is required, such as in media production, customer service, or educational applications.

Things to Try

One interesting aspect of the parakeet-rnnt-1.1b model is its ability to handle a wide range of audio inputs, from clear studio recordings to noisier real-world audio. You could experiment with feeding it different types of audio samples and observe how it performs in terms of transcription accuracy and robustness. Additionally, since the model was trained on a large and diverse dataset, you could try fine-tuning it on a more specialized domain or genre of audio to see if you can further improve its performance for your specific use case.
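
Like Canary-1B, this checkpoint is loaded through the NeMo toolkit. A minimal sketch, assuming the EncDecRNNTBPEModel class NeMo uses for transducer checkpoints (verify the class and arguments against the model card for your NeMo version):

```python
# pip install "nemo_toolkit[asr]"
import nemo.collections.asr as nemo_asr

# Load the RNNT (transducer) checkpoint from HuggingFace.
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    model_name="nvidia/parakeet-rnnt-1.1b"
)

# Transcribe one or more 16 kHz mono WAV files.
transcripts = asr_model.transcribe(["/data/sample_en.wav"])
print(transcripts)
```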


parakeet-tdt-1.1b

nvidia

Total Score

61

The parakeet-tdt-1.1b is an ASR (Automatic Speech Recognition) model that transcribes speech in lower case English alphabet. This model is jointly developed by the NVIDIA NeMo and Suno.ai teams. It uses a FastConformer-TDT architecture, which is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model has around 1.1 billion parameters.

Similar models include the parakeet-rnnt-1.1b, which is also a large ASR model developed by NVIDIA and Suno.ai. It uses a FastConformer Transducer architecture and has similar performance characteristics.

Model inputs and outputs

Inputs

  • 16000 Hz mono-channel audio (wav files)

Outputs

  • Transcribed speech as a string for a given audio sample

Capabilities

The parakeet-tdt-1.1b model is capable of transcribing English speech with high accuracy. It was trained on a large corpus of speech data, including 64K hours of English speech from various public and private datasets.

What can I use it for?

You can use the parakeet-tdt-1.1b model for a variety of speech-to-text applications, such as transcribing audio recordings, live speech recognition, or integrating it into your own voice-enabled products and services. The model can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset using the NVIDIA NeMo toolkit.

Things to try

One interesting thing to try with the parakeet-tdt-1.1b model is to experiment with fine-tuning it on a specific domain or dataset. This could help improve the model's performance on your particular use case. You could also try combining the model with other components, such as language models or audio preprocessing modules, to further enhance its capabilities.
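
Loading follows the same NeMo pattern as the other Parakeet checkpoints. A hedged sketch using NeMo's generic ASRModel loader, which resolves the right model class from the checkpoint (exact names may differ by NeMo version):

```python
import nemo.collections.asr as nemo_asr

# Generic loader: resolves the TDT model class from the HuggingFace checkpoint.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-1.1b"
)

# Transcribe 16 kHz mono WAV files.
transcripts = asr_model.transcribe(["/data/sample_en.wav"])
print(transcripts)
```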


GPT-2B-001

nvidia

Total Score

191

GPT-2B-001 is a transformer-based language model developed by NVIDIA. It is part of the GPT family of models, similar to GPT-2 and GPT-3, with a total of 2 billion trainable parameters. The model was trained on 1.1 trillion tokens using NVIDIA's NeMo toolkit. Compared to similar models like gemma-2b-it, prometheus-13b-v1.0, and bge-reranker-base, GPT-2B-001 features several architectural improvements, including the use of the SwiGLU activation function, rotary positional embeddings, and a longer maximum sequence length of 4,096.

Model inputs and outputs

Inputs

  • Text prompts of variable length, up to a maximum of 4,096 tokens

Outputs

  • Continuation of the input text, generated in an autoregressive manner

The model can be used for a variety of text-to-text tasks, such as language modeling, text generation, and question answering.

Capabilities

GPT-2B-001 is a powerful language model capable of generating human-like text on a wide range of topics. It can be used for tasks such as creative writing, summarization, and even code generation. The model's large size and robust training process allow it to capture complex linguistic patterns and produce coherent, contextually relevant output.

What can I use it for?

GPT-2B-001 can be used for a variety of natural language processing tasks, including:

  • Content generation: The model can be used to generate articles, stories, dialogue, and other forms of text. This can be useful for writers, content creators, and marketers.
  • Question answering: The model can be fine-tuned to answer questions on a wide range of topics, making it useful for building conversational agents and knowledge-based applications.
  • Summarization: The model can be used to generate concise summaries of longer text, which can be helpful for researchers, students, and business professionals.
  • Code generation: The model can be used to generate code snippets and even complete programs, which can assist developers in their work.

Things to try

One interesting aspect of GPT-2B-001 is its ability to generate text that is both coherent and creative. Try prompting the model with a simple sentence or phrase and see how it expands upon the idea, generating new and unexpected content. You can also experiment with fine-tuning the model on specific datasets to see how it performs on more specialized tasks. Another fascinating area to explore is the model's capability for reasoning and logical inference. Try presenting the model with prompts that require deductive or inductive reasoning, and observe how it approaches the problem and formulates its responses.
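
The SwiGLU activation mentioned in the overview replaces the standard ReLU/GELU feed-forward block with a gated variant: one linear projection of the input is passed through a Swish (SiLU) activation and multiplied element-wise with a second projection. A minimal PyTorch sketch of the idea, with arbitrary layer sizes; this illustrates the technique, not NVIDIA's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Gated feed-forward block: out = (SiLU(x W) * (x V)) W_out."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)    # gate projection
        self.v = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.out = nn.Linear(d_ff, d_model, bias=False)  # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.w(x)) * self.v(x))


x = torch.randn(2, 16, 1024)          # (batch, sequence, hidden)
print(SwiGLU(1024, 4096)(x).shape)    # -> torch.Size([2, 16, 1024])
```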


Nemotron-4-340B-Base

nvidia

Total Score

132

Nemotron-4-340B-Base is a large language model (LLM) developed by NVIDIA that can be used as part of a synthetic data generation pipeline. With 340 billion parameters and support for a context length of 4,096 tokens, this multilingual model was pre-trained on a diverse dataset of over 50 natural languages and 40 coding languages. After an initial pre-training phase of 8 trillion tokens, the model underwent continuous pre-training on an additional 1 trillion tokens to improve quality. Similar models include the Nemotron-3-8B-Base-4k, a smaller enterprise-ready 8 billion parameter model, and the GPT-2B-001, a 2 billion parameter multilingual model with architectural improvements.

Model Inputs and Outputs

Nemotron-4-340B-Base is a powerful text generation model that can be used for a variety of natural language tasks. The model accepts textual inputs and generates corresponding text outputs.

Inputs

  • Textual prompts in over 50 natural languages and 40 coding languages

Outputs

  • Coherent, contextually relevant text continuations based on the input prompts

Capabilities

Nemotron-4-340B-Base excels at a range of natural language tasks, including text generation, translation, code generation, and more. The model's large scale and broad multilingual capabilities make it a versatile tool for researchers and developers looking to build advanced language AI applications.

What Can I Use It For?

Nemotron-4-340B-Base is well-suited for use cases that require high-quality, diverse language generation, such as:

  • Synthetic data generation for training custom language models
  • Multilingual chatbots and virtual assistants
  • Automated content creation for websites, blogs, and social media
  • Code generation and programming assistants

By leveraging the NVIDIA NeMo Framework and tools like Parameter-Efficient Fine-Tuning and Model Alignment, users can further customize Nemotron-4-340B-Base to their specific needs.

Things to Try

One interesting aspect of Nemotron-4-340B-Base is its ability to generate text in a wide range of languages. Try prompting the model with inputs in different languages and observe the quality and coherence of the generated outputs. You can also experiment with combining the model's multilingual capabilities with tasks like translation or cross-lingual information retrieval. Another area worth exploring is the model's potential for synthetic data generation. By fine-tuning Nemotron-4-340B-Base on specific datasets or domains, you can create custom language models tailored to your needs, while leveraging the broad knowledge and capabilities of the base model.
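
To make the synthetic-data use case concrete, the sketch below shows the general shape of such a pipeline: prompt the model once per topic and collect the completions as jsonl training data. The `generate` callable here is a hypothetical stand-in for whatever client actually serves Nemotron-4-340B-Base in your environment; only the surrounding plumbing is shown.

```python
import json
from typing import Callable, List


def make_synthetic_qa(generate: Callable[[str], str], topics: List[str], out_path: str) -> None:
    """Draft one question/answer pair per topic and append each as a JSON line."""
    with open(out_path, "w") as f:
        for topic in topics:
            prompt = (
                f"Write one challenging question about {topic}, "
                "then answer it concisely.\nQuestion:"
            )
            completion = generate(prompt)  # hypothetical call to the serving client
            f.write(json.dumps({"topic": topic, "text": completion}) + "\n")


# Stand-in generator for illustration; swap in a real Nemotron-4-340B-Base client.
make_synthetic_qa(lambda p: "<model output>", ["optics", "Rust lifetimes"], "synthetic_qa.jsonl")
```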
