Microsoft

Models by this creator

🌀

phi-2

microsoft

Total Score

3.2K

The phi-2 is a 2.7 billion parameter Transformer model developed by Microsoft. It was trained on an augmented version of the same data sources used for the Phi-1.5 model, including additional NLP synthetic texts and filtered websites. The model has demonstrated near state-of-the-art performance on benchmarks testing common sense, language understanding, and logical reasoning among models with less than 13 billion parameters. Similar models in the Phi family include the Phi-1.5 and Phi-3-mini-4k-instruct. The Phi-1.5 model has 1.3 billion parameters and was trained on a subset of the phi-2 data sources. The Phi-3-mini-4k-instruct is a 3.8 billion parameter model that has been fine-tuned for instruction following and safety.

Model inputs and outputs

The phi-2 model takes text as input and generates text as output. It is designed to handle prompts in a variety of formats, including question-answering (QA), chat-style conversations, and code generation.

Inputs

Text prompts: The model can accept freeform text prompts, such as questions, statements, or instructions.

Outputs

Generated text: The model produces text continuations in response to the input prompt, with capabilities spanning tasks like answering questions, engaging in dialogues, and generating code.

Capabilities

The phi-2 model has shown impressive performance on a range of natural language understanding and reasoning tasks. It can provide detailed analogies, maintain coherent conversations, and generate working code snippets. The model's strength lies in its ability to understand context and formulate concise, relevant responses.

What can I use it for?

The phi-2 model is well-suited for research projects and applications that require a capable, open-source language model. Potential use cases include virtual assistants, dialogue systems, code generation tools, and educational applications. Due to the model's strong reasoning abilities, it could also be valuable for tasks like question-answering, logical inference, and common sense reasoning.

Things to try

One interesting aspect of the phi-2 model is its attention overflow issue when used in FP16 mode. Users can experiment with enabling or disabling autocast on the PhiAttention.forward() function to see if it resolves any performance issues. Additionally, the model's capabilities in handling different input formats, such as QA, chat, and code, make it a versatile tool for exploring language model applications across a variety of domains.
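As a concrete starting point, here is a minimal sketch of loading phi-2 with the Hugging Face transformers library and prompting it in the QA-style "Instruct:/Output:" format; the checkpoint name, dtype choice, and generation settings are assumptions based on common usage rather than details from this listing.

```python
# Hedged sketch: assumes the "microsoft/phi-2" checkpoint on the Hugging Face Hub
# and a recent transformers release; loading in float32 sidesteps the FP16
# attention-overflow issue mentioned above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32)

prompt = "Instruct: Explain why the sky appears blue.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```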

Updated 5/28/2024

🧠

Phi-3-mini-128k-instruct

microsoft

Total Score

1.3K

The Phi-3-mini-128k-instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained using the Phi-3 datasets. These datasets include both synthetic data and filtered publicly available website data, with an emphasis on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family, with the Mini version available in two variants, 4K and 128K, which refer to the context length (in tokens) it can support. After initial training, the model underwent a post-training process that involved supervised fine-tuning and direct preference optimization to enhance its ability to follow instructions and adhere to safety measures. When evaluated against benchmarks testing common sense, language understanding, mathematics, coding, long-term context, and logical reasoning, the Phi-3-mini-128k-instruct demonstrated robust and state-of-the-art performance among models with fewer than 13 billion parameters.

Model inputs and outputs

Inputs

Text prompts

Outputs

Generated text responses

Capabilities

The Phi-3-mini-128k-instruct model is designed to excel in memory/compute constrained environments, latency-bound scenarios, and tasks requiring strong reasoning skills, especially in areas like code, math, and logic. It can be used to accelerate research on language and multimodal models, serving as a building block for generative AI-powered features.

What can I use it for?

The Phi-3-mini-128k-instruct model is intended for commercial and research use in English. It can be particularly useful for applications that require efficient performance in resource-constrained settings or low-latency scenarios, such as mobile devices or edge computing environments. Given its strong reasoning capabilities, the model can be leveraged for tasks involving coding, mathematical reasoning, and logical problem-solving.

Things to try

One interesting aspect of the Phi-3-mini-128k-instruct model is its ability to perform well on benchmarks testing common sense, language understanding, and logical reasoning, even with a relatively small parameter count compared to larger language models. This suggests it could be a useful starting point for exploring ways to build efficient and capable AI assistants that can understand and reason about the world in a robust manner.
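As a rough illustration of the chat-style usage this family is designed for, the sketch below loads the model with Hugging Face transformers and applies its chat template; the checkpoint name, trust_remote_code flag, and generation settings are assumptions rather than details stated in this listing.

```python
# Hedged sketch: assumes the "microsoft/Phi-3-mini-128k-instruct" checkpoint and
# a transformers version that understands its chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Summarize the plot of Beowulf in three sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

output_ids = model.generate(input_ids, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```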

Updated 5/28/2024

phi-1_5

microsoft

Total Score

1.3K

phi-1.5 is a 1.3 billion parameter Transformer language model developed by Microsoft. It was trained on the same data sources as the phi-1 model, with an additional synthetic NLP data source. The model demonstrates state-of-the-art performance on benchmarks testing common sense, language understanding, and logical reasoning, compared to other models under 10 billion parameters. Unlike phi-1, phi-1.5 was not fine-tuned for instruction following or through reinforcement learning from human feedback. Instead, the intention was to provide the research community with an open-source small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, and enhancing controllability. The model's training data was carefully curated to exclude generic web-crawl sources, which helps prevent direct exposure to potentially harmful online content. However, the model is still vulnerable to generating harmful content, and the researchers hope the model can help further study the safety of language models.

Model inputs and outputs

Inputs

Text prompts in a variety of formats, including QA, chat, and code

Outputs

Generative text responses, such as poems, emails, stories, summaries, and Python code

Capabilities

phi-1.5 can perform a wide range of natural language generation tasks, including writing poems, drafting emails, creating stories, summarizing texts, and generating Python code. The model is particularly well-suited for prompts in the QA, chat, and code formats.

What can I use it for?

The phi-1.5 model can be useful for researchers and developers exploring language model safety challenges, such as reducing toxicity, understanding biases, and enhancing controllability. The model's open-source nature and relatively small size make it an accessible option for these types of investigations.

Things to try

One interesting aspect of phi-1.5 is its exclusion of generic web-crawl data sources during training, which aims to prevent direct exposure to potentially harmful online content. Researchers could explore how this design choice affects the model's behavior and safety compared to models trained on broader web data. Another area to investigate is the model's performance on prompts that require logical reasoning or common sense understanding, given its strong results on related benchmarks. Developers could experiment with using phi-1.5 for applications that rely on these cognitive capabilities.
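To see the code-format prompting mentioned above in practice, here is a small sketch that asks phi-1.5 to complete a Python function with transformers; the checkpoint name and generation length are assumptions based on common usage.

```python
# Hedged sketch: assumes the "microsoft/phi-1_5" checkpoint and a recent
# transformers release with built-in Phi support.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

# Code-format prompt: the model continues the function body.
prompt = 'def print_primes(n):\n    """Print all prime numbers between 1 and n."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```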

Updated 5/28/2024


bringing-old-photos-back-to-life

microsoft

Total Score

874

The bringing-old-photos-back-to-life model is a powerful AI tool developed by Microsoft that can breathe new life into old, faded photographs. This model stands out from similar face restoration models like GFPGAN and CodeFormer by its ability to handle not just facial regions, but entire old photos with various types of degradation, including scratches and uneven lighting. Unlike more generative models like Stable Diffusion, this model focuses on restoring and enhancing existing old photos rather than generating new images from scratch.

Model inputs and outputs

The bringing-old-photos-back-to-life model takes in old, degraded photos and outputs restored, high-quality versions. The model can handle both regular and high-resolution input images, as well as those with or without visible scratches.

Inputs

image: The input old photo to be restored

HR: Whether the input image is high-resolution

with_scratch: Whether the input image has visible scratches

Outputs

Output: The restored, high-quality version of the input old photo

Capabilities

The bringing-old-photos-back-to-life model can effectively restore a wide range of old, degraded photographs. It can handle various types of degradation, including scratches, uneven lighting, and overall fading and quality loss. The model leverages advanced deep learning techniques to seamlessly blend facial features, textures, and colors, resulting in stunning restorations that breathe new life into old photos.

What can I use it for?

This model is a game-changer for anyone looking to breathe new life into their family photo albums or historical archives. Whether you have old, cherished photos of loved ones or valuable historical images, the bringing-old-photos-back-to-life model can help restore them to their former glory. The model's capabilities also make it a valuable tool for businesses and institutions that work with digitizing and preserving old photographs, such as museums, archives, and photo restoration services.

Things to try

One exciting aspect of the bringing-old-photos-back-to-life model is its ability to handle high-resolution input images. This opens up the possibility of restoring large, detailed old photos, allowing you to uncover hidden details and preserve them in stunning quality. Additionally, the model's robust handling of scratches makes it a valuable tool for restoring damaged historical photos or family heirlooms. By experimenting with different types of old photos, you can unlock the full potential of this powerful AI model and breathe new life into your cherished memories.
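Since this model is typically run as a hosted service, one plausible way to call it is through the Replicate Python client, as sketched below; the exact model identifier and version string should be taken from the model's own page, and the file name here is only a placeholder.

```python
# Hedged sketch: assumes the Replicate Python client and an API token in the
# REPLICATE_API_TOKEN environment variable; the model reference may also need
# an explicit version hash (owner/name:version) depending on the model page.
import replicate

output = replicate.run(
    "microsoft/bringing-old-photos-back-to-life",
    input={
        "image": open("old_family_photo.jpg", "rb"),  # placeholder file name
        "HR": False,            # set True for high-resolution inputs
        "with_scratch": True,   # enable scratch detection and repair
    },
)
print(output)  # typically a URL pointing to the restored photo
```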

Updated 7/2/2024

🔮

Phi-3-vision-128k-instruct

microsoft

Total Score

741

Phi-3-vision-128k-instruct is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures. Similar models in the Phi-3 family include the Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct. These models have fewer parameters (3.8B) than the full Phi-3-vision-128k-instruct but share the same training approach and underlying architecture.

Model inputs and outputs

Inputs

Text: The model accepts text input and is best suited for prompts using a chat format.

Images: The model can process visual inputs in addition to text.

Outputs

Generated text: The model generates text in response to the input, aiming to provide safe, ethical and accurate information.

Capabilities

The Phi-3-vision-128k-instruct model is designed for broad commercial and research use, with capabilities that include general image understanding, OCR, and chart and table understanding. It can be used to accelerate research on efficient language and multimodal models, and as a building block for generative AI powered features.

What can I use it for?

The Phi-3-vision-128k-instruct model is well-suited for applications that involve memory/compute constrained environments, latency-bound scenarios, or general image and text understanding. Example use cases include:

Visual question answering: Given an image and a text question about the image, the model can generate a relevant response.

Image captioning: The model can generate captions describing the contents of an image.

Multimodal task automation: Combining text and image inputs, the model can be used to automate tasks like form filling, document processing, or data extraction.

Things to try

To get a sense of the model's capabilities, you can try prompting it with a variety of multimodal tasks, such as:

Asking it to describe the contents of an image in detail

Posing questions about the objects, people, or activities depicted in an image

Requesting the model to summarize the key information from a document containing both text and figures/tables

Asking it to generate steps for a visual instruction manual or recipe

The model's robust reasoning abilities, combined with its understanding of both text and vision, make it a powerful tool for tackling a wide range of multimodal challenges.
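For the visual question answering use case above, a transformers-based sketch might look like the following; the image URL is a placeholder, and the `<|image_1|>` placeholder token, processor calls, and generation settings are assumptions based on how Phi-3-vision checkpoints are commonly loaded.

```python
# Hedged sketch: assumes the "microsoft/Phi-3-vision-128k-instruct" checkpoint,
# trust_remote_code support, and an image reachable at the placeholder URL.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
messages = [{"role": "user", "content": "<|image_1|>\nWhat trend does this chart show?"}]

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=200)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the new tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```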

Updated 6/20/2024

🖼️

Florence-2-large

microsoft

Total Score

717

The Florence-2 model is an advanced vision foundation model from Microsoft that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model. The model comes in both base and large versions, with the large version having 0.77 billion parameters. There are also fine-tuned versions of both the base and large models available. The Florence-2-large-ft model in particular has been fine-tuned on a collection of downstream tasks.

Model inputs and outputs

Florence-2 can interpret simple text prompts to perform a variety of vision tasks, including captioning, object detection, and segmentation. The model takes in an image and a text prompt as input, and generates text or bounding boxes/segmentation maps as output, depending on the task.

Inputs

Image: The model takes in an image as input.

Text prompt: The model accepts a text prompt that describes the desired task, such as "Detect the objects in this image" or "Caption this image".

Outputs

Text: For tasks like captioning, the model will generate text describing the image contents.

Bounding boxes and labels: For object detection tasks, the model will output bounding boxes around detected objects along with class labels.

Segmentation masks: The model can also output pixel-wise segmentation masks for semantic segmentation tasks.

Capabilities

Florence-2 is capable of performing a wide range of vision and vision-language tasks through its prompt-based approach. For example, the model can be used for image captioning, where it generates descriptive text about an image. It can also be used for object detection, where it identifies and localizes objects in an image. Additionally, the model can be used for semantic segmentation, where it assigns a class label to every pixel in the image. One key capability of Florence-2 is its ability to adapt to different tasks through the use of prompts. By simply changing the text prompt, the model can be directed to perform different tasks, without requiring any additional fine-tuning.

What can I use it for?

The Florence-2 model can be useful in a variety of applications that involve vision and language understanding, such as:

Content creation: The image captioning and object detection capabilities of Florence-2 can be used to automatically generate descriptions or annotations for images, which can be helpful for tasks like image search, visual storytelling, and content organization.

Accessibility: The model's ability to generate captions and detect objects can be leveraged to improve accessibility for visually impaired users, by providing detailed descriptions of visual content.

Robotics and autonomous systems: Florence-2's perception and language understanding capabilities can be integrated into robotic systems to enable them to better interact with and make sense of their visual environments.

Education and research: Researchers and educators can use Florence-2 to explore the intersection of computer vision and natural language processing, and to develop new applications that leverage the model's unique capabilities.

Things to try

One interesting aspect of Florence-2 is its ability to handle a diverse range of vision tasks through the use of prompts. You can experiment with different task prompts to see how the model's outputs change. For example, you could try prompts such as "<CAPTION>", "<OD>", or "<DENSE_REGION_CAPTION>" to have the model generate captions, object detection results, or dense region captions, respectively. Another thing to try is fine-tuning the model on your own dataset. The Florence-2-large-ft model demonstrates the potential for further improving the model's performance on specific tasks through fine-tuning.
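Below is a small sketch of the task-prompt workflow just described, following the usual pattern for loading Florence-2 with transformers; the image URL is a placeholder and the post-processing call is an assumption about the model's bundled (trust_remote_code) processor.

```python
# Hedged sketch: assumes the "microsoft/Florence-2-large" checkpoint with
# trust_remote_code enabled and an image reachable at the placeholder URL.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
task_prompt = "<OD>"  # object detection; try "<CAPTION>" or "<DENSE_REGION_CAPTION>" as well

inputs = processor(text=task_prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# The bundled processor converts the raw output into labels and boxes.
result = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(result)
```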

Updated 7/2/2024

🏷️

Orca-2-13b

microsoft

Total Score

651

Orca-2-13b is a research model developed by Microsoft that aims to enhance the reasoning capabilities of small language models. It is a fine-tuned version of the LLAMA-2 base model, trained on a synthetic dataset created to improve its reasoning abilities. The model is not optimized for chatting and is best used after being fine-tuned for a specific task or after further training with RLHF or DPO. Similar models include StableBeluga2, which is a LLAMA2 70B model fine-tuned on an Orca-style dataset, and llama2-13b-orca-8k-3319, which is a fine-tuning of the LLAMA-2 13B model with an 8K context size on a long-conversation variant of the Dolphin dataset.

Model inputs and outputs

Orca-2-13b is designed for research purposes and provides single-turn responses in tasks such as reasoning over user-given data, reading comprehension, math problem-solving, and text summarization. The model is particularly focused on enhancing reasoning capabilities.

Inputs

User-provided data or instructions for the model to reason about and respond to

Outputs

Single-turn responses from the model, demonstrating its reasoning and problem-solving abilities

Capabilities

Orca-2-13b is focused on improving the reasoning capabilities of small language models. It has been evaluated on a wide range of tasks, including BigBench-Hard and AGIEval, and has shown significant improvements over its base LLAMA-2 model.

What can I use it for?

Orca-2-13b is intended for research purposes, to allow the research community to assess its abilities and provide a foundation for building better frontier models. The model could be useful for researchers and developers working on enhancing the reasoning capabilities of language models, as well as for applications that require strong reasoning skills, such as question-answering, math problem-solving, or text summarization.

Things to try

Researchers and developers could explore fine-tuning Orca-2-13b on specific datasets or tasks to further improve its performance. They could also investigate the model's capabilities in different areas, such as multi-step reasoning, logical inference, or grounding in real-world knowledge.
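As a sketch of single-turn usage with transformers, the prompt below follows a ChatML-style format of the kind the Orca-2 release describes; the system message wording, hardware settings, and generation arguments are illustrative assumptions rather than details from this listing.

```python
# Hedged sketch: assumes the "microsoft/Orca-2-13b" checkpoint; a 13B model
# generally needs a GPU, hence device_map="auto" and half precision.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Orca-2-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

system = "You are Orca, an AI language model. Reason step by step and answer carefully."
user = "A train leaves at 3 pm travelling at 60 km/h. How far has it gone by 5:30 pm?"
prompt = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Print only the assistant's continuation, not the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```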

Updated 5/28/2024

🚀

Phi-3-mini-4k-instruct

microsoft

Total Score

603

The Phi-3-mini-4k-instruct is a compact, 3.8 billion parameter language model developed by Microsoft. It is part of the Phi-3 family of models, which includes both the 4K and 128K variants that differ in their maximum context length. This model was trained on a combination of synthetic data and filtered web data, with a focus on reasoning-dense content. When evaluated on benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, the Phi-3-mini-4k-instruct demonstrated robust and state-of-the-art performance among models with fewer than 13 billion parameters. The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety. This aligns it with human preferences for helpfulness and safety. Similar models include the Phi-3-mini-128k-instruct and the Meta-Llama-3-8B-Instruct, which are also compact, instruction-tuned language models.

Model inputs and outputs

Inputs

The Phi-3-mini-4k-instruct model accepts text as input.

Outputs

The model generates text, including natural language and code.

Capabilities

The Phi-3-mini-4k-instruct model can be used for a variety of language-related tasks, such as summarization, question answering, and code generation. It has demonstrated strong performance on benchmarks testing common sense, language understanding, math, code, and logical reasoning. The model's compact size and instruction-following capabilities make it suitable for use in memory and compute-constrained environments, as well as latency-bound scenarios.

What can I use it for?

The Phi-3-mini-4k-instruct model can be a valuable tool for researchers and developers working on language models and generative AI applications. Its strong performance on a range of tasks, coupled with its small footprint, makes it an attractive option for building AI-powered features in resource-constrained environments. Potential use cases include chatbots, question-answering systems, and code generation tools.

Things to try

One interesting aspect of the Phi-3-mini-4k-instruct model is its ability to reason about complex topics and provide step-by-step solutions. Try prompting the model with math or coding problems and see how it approaches the task. Additionally, the model's instruction-following capabilities could be explored by providing it with detailed prompts or templates for specific tasks, such as writing business emails or creating an outline for a research paper.
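One hedged way to try the step-by-step reasoning mentioned above is the transformers text-generation pipeline with chat messages, as sketched here; the checkpoint name and generation arguments are assumptions, and the pipeline-with-messages pattern requires a reasonably recent transformers release.

```python
# Hedged sketch: assumes the "microsoft/Phi-3-mini-4k-instruct" checkpoint and a
# transformers version whose text-generation pipeline accepts chat messages.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Solve 2x + 3 = 7 and show each step of your reasoning."},
]
out = pipe(messages, max_new_tokens=256, do_sample=False, return_full_text=False)
print(out[0]["generated_text"])
```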

Updated 5/28/2024

🎯

speecht5_tts

microsoft

Total Score

540

The speecht5_tts model is a text-to-speech (TTS) model fine-tuned from the SpeechT5 model introduced in the paper "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing". Developed by researchers at Microsoft, this model demonstrates the potential of encoder-decoder pre-training for speech and text representation learning.

Model inputs and outputs

The speecht5_tts model takes text as input and generates audio as output, making it capable of high-quality text-to-speech conversion. This can be particularly useful for applications like virtual assistants, audiobook narration, and speech synthesis for accessibility.

Inputs

Text: The text to be converted to speech.

Outputs

Audio: The generated speech audio corresponding to the input text.

Capabilities

The speecht5_tts model leverages the success of the T5 (Text-To-Text Transfer Transformer) architecture to achieve state-of-the-art performance on a variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, and more. By pre-training on large-scale unlabeled speech and text data, the model is able to learn a unified representation that can effectively model the sequence-to-sequence transformation between speech and text.

What can I use it for?

The speecht5_tts model can be a valuable tool for developers and researchers working on speech-based applications. Some potential use cases include:

Virtual assistants: Integrate the model into virtual assistant systems to provide high-quality text-to-speech capabilities.

Audiobook narration: Use the model to automatically generate audiobook narrations from text.

Accessibility tools: Leverage the model's speech synthesis abilities to improve accessibility for visually impaired or low-literacy users.

Language learning: Incorporate the model into language learning applications to provide realistic speech output for language practice.

Things to try

One interesting aspect of the speecht5_tts model is its ability to perform zero-shot translation, where it can translate speech from one language to text in another language. This opens up possibilities for building multilingual speech-to-text or speech-to-speech translation systems. Additionally, as the model was pre-trained on a large and diverse dataset, it may exhibit strong performance on lesser-known languages or accents. Experimenting with the model on a variety of languages and domains could uncover interesting capabilities or limitations.
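A minimal text-to-speech sketch with transformers is shown below; it assumes the microsoft/speecht5_hifigan vocoder and a publicly hosted x-vector dataset for the speaker embedding, which are common pairings rather than details stated in this listing.

```python
# Hedged sketch: assumes the speecht5_tts checkpoint, the matching HiFi-GAN
# vocoder, and the CMU ARCTIC x-vector dataset for a speaker embedding.
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Welcome to the audiobook narration demo.", return_tensors="pt")

# The x-vector selects the voice; index 7306 is just one arbitrary speaker.
xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```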

Updated 5/28/2024

🔄

Phi-3-mini-4k-instruct-gguf

microsoft

Total Score

348

The Phi-3-mini-4k-instruct is a 3.8 billion-parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and filtered publicly available website data with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family, with the Mini version available in two variants, 4K and 128K, which refer to the context length (in tokens) it can support. The Phi-3-mini-128k-instruct is a similar model with a 128K context length. Both models have undergone a post-training process that incorporates supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Model inputs and outputs

The Phi-3-mini-4k-instruct model takes text as input and generates text as output. The model is best suited for prompts using the chat format, where the user's turn follows the `<|user|>` token and the model generates its response after the `<|assistant|>` token.

Inputs

Text prompts, typically in a chat format like:

<|user|>
How to explain the Internet to a medieval knight?<|end|>
<|assistant|>

Outputs

Generated text responses, for example:

To explain the Internet to a medieval knight, you could say that it is a magical network of interconnected "talking scrolls" that allow people to share information and communicate across vast distances, almost as if by magic. Just as a messenger on horseback can carry news and messages between distant keeps, the Internet allows information to travel quickly between far-flung locations. However, instead of a single messenger, the Internet has millions of these "talking scrolls" all connected together, allowing information to flow freely between them. You could liken the different websites on the Internet to the various fiefs, manors, and castles of the realm, each with their own unique content and purpose. And just as a knight might travel between these different places to gather news and resources, a user on the Internet can navigate between websites to find the information they seek. Of course, the technology behind the Internet is far more advanced than anything a medieval knight would be familiar with. But hopefully this analogy helps convey the core concept of how this new magical network functions and allows for the rapid sharing of information across vast distances.

Capabilities

The Phi-3-mini-4k-instruct model shows robust and state-of-the-art performance on a variety of benchmarks testing common sense, language understanding, math, code, long context, and logical reasoning, particularly among models with fewer than 13 billion parameters. It is a strong fit for memory/compute constrained environments, latency-bound scenarios, and applications requiring strong reasoning skills.

What can I use it for?

The Phi-3-mini-4k-instruct model is intended for commercial and research use in English. It can be used as a building block for developing generative AI-powered features and applications, especially those with requirements around memory/compute constraints, low latency, or strong reasoning abilities. Some potential use cases include:

Language model-powered chatbots and virtual assistants

Content generation for education, journalism, or creative writing

Code generation and programming assistance tools

Reasoning-intensive applications like question-answering systems or intelligent tutoring systems

Things to try

One interesting aspect of the Phi-3-mini-4k-instruct model is its ability to engage in multi-turn, chat-like conversations using the provided chat format. This allows you to explore the model's conversational capabilities and see how it responds to follow-up questions or requests. Additionally, you can experiment with prompts that require strong reasoning skills, such as math problems or logic puzzles, to assess the model's capabilities in these areas.
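Because this listing is for the GGUF weights, a natural way to run them locally is with llama-cpp-python, sketched below; the quantized file name is an assumption and must match whichever GGUF file you actually download from the repository.

```python
# Hedged sketch: assumes llama-cpp-python and a locally downloaded GGUF file,
# e.g. a 4-bit quantized variant from the microsoft/Phi-3-mini-4k-instruct-gguf repo.
from llama_cpp import Llama

llm = Llama(model_path="./Phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)  # file name is a placeholder

prompt = "<|user|>\nHow to explain the Internet to a medieval knight?<|end|>\n<|assistant|>"
output = llm(prompt, max_tokens=256, stop=["<|end|>"], echo=False)
print(output["choices"][0]["text"])
```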

Updated 5/28/2024