Salesforce

Models by this creator

blip-image-captioning-large

Salesforce

Total Score

868

The blip-image-captioning-large model is part of the BLIP (Bootstrapping Language-Image Pre-training) family of vision-language models developed by Salesforce. It uses a large Vision Transformer (ViT) backbone as the image encoder and is pre-trained on the COCO dataset for image captioning. It can be contrasted with the smaller blip-image-captioning-base model, which uses a ViT base backbone. Both BLIP models are designed to excel at a range of vision-language tasks such as image captioning, visual question answering, and multimodal conversation.

Model inputs and outputs

Inputs

- **Image**: A raw image to be captioned.
- **Text**: An optional text prompt to condition the caption, such as "a photography of".

Outputs

- **Caption**: A natural language description of the input image, generated by the model.

Capabilities

The blip-image-captioning-large model generates high-quality captions for a wide variety of images. It achieves state-of-the-art performance on the COCO image captioning benchmark, outperforming previous models by 2.8% in CIDEr score. The model also demonstrates strong generalization, performing well on tasks like visual question answering and zero-shot video-language understanding.

What can I use it for?

You can use the blip-image-captioning-large model for a variety of computer vision and multimodal applications, such as:

- **Image captioning**: Generate natural language descriptions of images for content moderation, accessibility, and image retrieval.
- **Visual question answering**: Answer questions about the content of an image, enabling more natural human-AI interactions.
- **Multimodal conversation**: Engage in chat-like conversations by feeding the model an image and previous dialogue history as input.

See the Salesforce creator profile for more information about the company behind this model.

Things to try

One interesting aspect of the BLIP models is their ability to leverage noisy web data for pre-training. By "bootstrapping" the captions (using a captioner to generate synthetic captions and a filter to remove noisy ones), the authors improved performance on a wide range of vision-language tasks. You could experiment with the model's robustness to different types of image data and text prompts, or explore how it compares to other state-of-the-art vision-language models.
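
As a concrete illustration, here is a minimal captioning sketch using the Hugging Face transformers BLIP classes; the example image URL is a placeholder, and generation settings are assumptions you can tune.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Large captioning checkpoint on the Hugging Face Hub
model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

# Placeholder image URL; any RGB image works
url = "https://example.com/dog.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Conditional captioning: the prompt becomes the start of the generated caption
inputs = processor(image, "a photography of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```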

Read more

Updated 5/28/2024

👁️

blip-image-captioning-base

Salesforce

Total Score

423

The blip-image-captioning-base model is a state-of-the-art image captioning model developed by Salesforce. It uses the Bootstrapping Language-Image Pre-training (BLIP) framework, which can effectively utilize noisy web data by "bootstrapping" captions: a captioner generates synthetic captions and a filter removes the noisy ones. This allows BLIP to achieve strong performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, and VQA. Similar models like t5-base and vit-base-patch16-224 have also made advances in language and vision understanding, but BLIP stands out by generalizing well and transferring to both understanding and generation tasks.

Model inputs and outputs

Inputs

- **Image**: The image to be captioned, which the model encodes and processes.
- **Text prompt (optional)**: An optional text prompt that guides the generated caption.

Outputs

- **Image caption**: A generated caption describing the contents of the input image.

Capabilities

The blip-image-captioning-base model generates high-quality, context-aware image captions. It handles a wide variety of subjects and scenes, and the captions it produces are usually both accurate and natural-sounding. Its ability to leverage noisy web data through the "bootstrapping" technique allows it to achieve state-of-the-art results on image captioning benchmarks.

What can I use it for?

The blip-image-captioning-base model can be used wherever images need to be described in natural language, such as:

- **Assistive technology**: Generate captions for visually impaired users to help them understand the contents of images.
- **Content moderation**: Automatically caption images so that inappropriate or harmful content can be detected and filtered.
- **Multimedia indexing and retrieval**: Improve the searchability and discoverability of image-based content with accurate captions.
- **Creative applications**: Generate novel and interesting captions as part of creative workflows or generative art projects.

Things to try

One interesting aspect of the blip-image-captioning-base model is that it supports both conditional and unconditional captioning: you can let the model describe an image freely, or steer the caption by supplying a text prefix along with the image. Try generating captions for a variety of images and compare how the output changes when you provide a text prompt versus letting the model caption the image without guidance; a comparison of the two modes is sketched below. You could also experiment with different kinds of prompts to see how they influence the result.

Another direction to explore is the model's performance on specialized or niche domains. While it was trained on a large and diverse dataset, it may still have biases or limitations for certain types of images or subject matter. Trying the model on a range of image types can help you understand its strengths and weaknesses.
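
Here is a minimal sketch of the two captioning modes with the Hugging Face transformers BLIP classes; the image path is a placeholder you would replace with your own file.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # placeholder path

# Unconditional captioning: only the image is provided
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Conditional captioning: the text prompt becomes the start of the caption
inputs = processor(image, "a photography of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```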

Read more

Updated 5/28/2024

🔎

xgen-7b-8k-base

Salesforce

Total Score

315

XGen-7B-8K-Base is a large language model developed by Salesforce AI Research. It is part of the XGen family of models, which are trained on long sequences of up to 8,000 tokens. The XGen-7B-8K-Base model has 7 billion parameters and is pre-trained on a large corpus of text data.

The XGen models are designed for tasks that require processing long input sequences, such as multi-turn conversation, question answering, and summarization. The 8,000-token context length lets the model maintain coherence and capture long-range dependencies in the input, making XGen-7B-8K-Base more versatile than models with shorter input lengths. Salesforce has also released an instruction-finetuned version, XGen-7B-8K-Inst, which is tailored for following instructions and generating helpful, informative responses.

Model inputs and outputs

Inputs

- Text of up to 8,000 tokens, in the form of a prompt, question, or partially generated text that the model should continue.

Outputs

- Text continuations that complete the input prompt or answer the input question. The output length can be controlled by specifying a maximum number of new tokens to generate.

Capabilities

The XGen-7B-8K-Base model can understand and generate long-form text across a variety of domains. It can be used for multi-turn dialogue, question answering, summarization, and open-ended text generation, and its long context length lets it stay coherent and consistent over long inputs. For example, the model could carry an extended conversation while maintaining flow and context over many turns, summarize long documents or articles while capturing the key points and high-level structure, or generate detailed, coherent responses to open-ended questions on a wide range of topics.

What can I use it for?

The XGen-7B-8K-Base model fits applications that involve processing and generating long-form text, for example:

- **Conversational AI**: Powering chatbots and virtual assistants that engage in multi-turn dialogue with users.
- **Question answering**: Building systems that provide detailed, contextual answers to complex questions.
- **Summarization**: Automatically summarizing long documents, articles, or reports to extract the key information.
- **Content generation**: Generating coherent, long-form text for creative writing, content creation, or storytelling.

As with any large language model, the outputs may contain inaccuracies, biases, or inappropriate content, so evaluate the model's behavior carefully for your specific use case before deploying it in production.

Things to try

One interesting aspect of the XGen-7B-8K-Base model is its ability to stay coherent and consistent over long input sequences. Try giving it a complex, multi-part prompt and check that the continuation is logically consistent and flows naturally from the initial input. Another experiment is to probe its open-ended, creative side: provide a high-level topic or scenario and see how detailed and imaginative the continuation is.

Additionally, you could investigate the model's performance on specific domains or tasks, such as question answering on specialized subjects or summarizing technical documents. By testing the model in various contexts, you can better understand its strengths, limitations, and potential applications.
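
For reference, here is a minimal generation sketch with Hugging Face transformers. The model ID and the trust_remote_code=True flag on the tokenizer are assumptions based on how the published XGen checkpoints are typically loaded; verify them against the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Salesforce/xgen-7b-8k-base"  # assumed Hugging Face model ID
# The XGen tokenizer ships as custom code, hence trust_remote_code=True (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Summarize the following meeting notes:\n..."  # up to ~8k tokens of context
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```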

Read more

Updated 5/28/2024

blip2-opt-2.7b

Salesforce

Total Score

267

The blip2-opt-2.7b model is a multimodal vision-language model developed by Salesforce. It builds on the OPT-2.7b large language model and adds a CLIP-like image encoder and a Querying Transformer (Q-Former) to enable tasks like image captioning, visual question answering, and chat-like conversations that combine an image with previous text. The Q-Former acts as a bridge between the image encoder and the language model, allowing the model to use both modalities effectively.

Model inputs and outputs

Inputs

- **Image**: The image the model should describe or reason about.
- **Optional text**: Additional text, such as a prompt or previous conversation.

Outputs

- **Conditional text generation**: Text generated conditioned on the input image and optional text.

Capabilities

The blip2-opt-2.7b model can be used for a variety of multimodal tasks, including image captioning, visual question answering, and chat-like conversations that combine image and text inputs. For example, it can generate captions for images, answer questions about the contents of an image, or continue a conversational exchange that involves both visual and textual information.

What can I use it for?

You can use the blip2-opt-2.7b model for conditional text generation tasks that involve both images and text, for example an image captioning application, a visual question answering system, or a multimodal chatbot. Its ability to combine visual and textual information makes it a powerful tool for a variety of real-world applications.

Things to try

One interesting aspect of the blip2-opt-2.7b model is how it blends information from the image encoder and the language model to produce relevant, coherent text. Experiment with different types of images and prompts and observe how the outputs change with the inputs. You could also fine-tune the model on more specialized datasets or tasks to see how it performs in those contexts.
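
Below is a minimal visual question answering sketch with the Hugging Face transformers BLIP-2 classes; the placeholder image path and the "Question: ... Answer:" prompt style are assumptions you may need to adapt.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder image path
prompt = "Question: how many people are in the picture? Answer:"  # assumed VQA-style prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True).strip())
```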

Read more

Updated 5/28/2024

👨‍🏫

SFR-Embedding-Mistral

Salesforce

Total Score

226

SFR-Embedding-Mistral is a text embedding model developed by Salesforce, trained on top of E5-mistral-7b-instruct and Mistral-7B-v0.1. The model is intended for research purposes and can be used for a variety of text retrieval and similarity tasks.

Model inputs and outputs

The SFR-Embedding-Mistral model takes text as input and produces a dense embedding vector for it. For retrieval, a query is paired with a short task instruction, and relevance is computed by comparing the query embedding against the embeddings of candidate passages.

Inputs

- **Task description**: A one-sentence instruction describing the task, such as "Given a web search query, retrieve relevant passages that answer the query".
- **Query**: The actual text input, such as "How to bake a chocolate cake" or "Symptoms of the flu".

Outputs

- **Embeddings**: Dense vector representations of queries and passages, whose similarity scores are used to rank and retrieve the passages that best answer a given query.

Capabilities

The SFR-Embedding-Mistral model performs well on a variety of text embedding tasks, with a particular focus on retrieval. It can be used to find relevant passages that answer a given query, making it useful for open-domain question answering and information retrieval.

What can I use it for?

The SFR-Embedding-Mistral model can be used in applications that involve text retrieval or question answering. For example, it could power a search engine that returns relevant passages in response to user queries, or back a chatbot that grounds its answers in retrieved documents.

Things to try

One interesting thing to try with SFR-Embedding-Mistral is experimenting with different task descriptions to see how they affect retrieval quality. By customizing the task description, you can adapt the model to specific use cases or domains. You could also combine it with other models or techniques, such as E5-mistral-7b-instruct, to build more powerful and versatile retrieval and question answering systems.
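
As an illustration, here is a retrieval sketch that assumes the model follows the E5-Mistral usage pattern (an "Instruct: ...\nQuery: ..." prefix for queries and last-token pooling); treat those details as assumptions and check the model card for the exact recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "Salesforce/SFR-Embedding-Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # make batch padding possible (assumption)

def last_token_pool(hidden_states, attention_mask):
    # Use the final non-padding token of each sequence as its embedding
    left_padded = bool(attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padded:
        return hidden_states[:, -1]
    lengths = attention_mask.sum(dim=1) - 1
    return hidden_states[torch.arange(hidden_states.shape[0], device=hidden_states.device), lengths]

task = "Given a web search query, retrieve relevant passages that answer the query"
texts = [
    f"Instruct: {task}\nQuery: Symptoms of the flu",  # query with task instruction (assumed format)
    "Influenza commonly causes fever, cough, sore throat, and muscle aches.",  # relevant passage
    "Chocolate cake batter is baked at 180 C for roughly 30 minutes.",         # unrelated passage
]
batch = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt").to(model.device)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state
embeddings = F.normalize(last_token_pool(hidden, batch["attention_mask"]), p=2, dim=-1)
print(embeddings[0] @ embeddings[1:].T)  # higher score = more relevant to the query
```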

Read more

Updated 5/28/2024

🎲

xgen-mm-phi3-mini-instruct-r-v1

Salesforce

Total Score

143

xgen-mm-phi3-mini-instruct-r-v1 belongs to a series of foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. The series advances upon the successful designs of the BLIP family, incorporating fundamental enhancements that ensure a more robust and superior foundation. The pretrained foundation model, xgen-mm-phi3-mini-base-r-v1, achieves state-of-the-art performance under 5 billion parameters and demonstrates strong in-context learning capabilities. The instruction fine-tuned model, xgen-mm-phi3-mini-instruct-r-v1, also achieves state-of-the-art performance among open-source and closed-source Vision-Language Models (VLMs) under 5 billion parameters.

Model inputs and outputs

The xgen-mm-phi3-mini-instruct-r-v1 model is designed for image-to-text tasks: it takes in images and generates corresponding textual descriptions.

Inputs

- **Images**: The model can accept high-resolution images as input.

Outputs

- **Textual descriptions**: Captions or answers describing the input images.

Capabilities

The xgen-mm-phi3-mini-instruct-r-v1 model performs strongly on image captioning, outperforming other models of similar size on benchmarks like COCO, NoCaps, and TextCaps. It also shows robust open-ended visual question answering on datasets like OKVQA and TextVQA.

What can I use it for?

The xgen-mm-phi3-mini-instruct-r-v1 model can be used in applications that generate textual descriptions from images, such as:

- **Image captioning**: Automatically generate captions for images to aid indexing, search, and accessibility.
- **Visual question answering**: Build applications that answer questions about the content of images.
- **Image-based task automation**: Build systems that understand image-based instructions and perform related tasks.

The model's state-of-the-art performance and efficiency make it a compelling choice for customers looking to incorporate advanced computer vision and language capabilities into their products and services.

Things to try

One interesting aspect of the xgen-mm-phi3-mini-instruct-r-v1 model is its support for flexible high-resolution image encoding with efficient visual token sampling, which lets it produce detailed captions across a wide range of image sizes and resolutions. Try feeding the model images of different sizes and complexities and compare the outputs. Its strong in-context learning also suggests it may suit few-shot or zero-shot tasks, where the model must adapt to new scenarios with limited examples; prompts that require it to follow instructions or reason about unfamiliar concepts are a fruitful area to explore.
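
The checkpoint ships with custom modeling code, so (as an assumption based on how similar trust_remote_code checkpoints are typically loaded) a starting point looks like the following; consult the model card for the exact preprocessing and generation calls.

```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"

# trust_remote_code=True pulls in the model's custom classes (assumed requirement)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# From here, preprocess an image with image_processor, tokenize the prompt,
# and call model.generate(...) following the recipe in the model card.
```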

Read more

Updated 6/11/2024

🛠️

blip3-phi3-mini-instruct-r-v1

Salesforce

Total Score

143

blip3-phi3-mini-instruct-r-v1 is a large multimodal language model developed by Salesforce AI Research. It is part of the BLIP3 series of foundational multimodal models trained at scale on high-quality image caption datasets and interleaved image-text data. The pretrained version, blip3-phi3-mini-base-r-v1, achieves state-of-the-art performance under 5 billion parameters and demonstrates strong in-context learning capabilities. The instruction-tuned version, blip3-phi3-mini-instruct-r-v1, also achieves state-of-the-art performance among open-source and closed-source vision-language models under 5 billion parameters, and it supports flexible high-resolution image encoding with efficient visual token sampling.

Model inputs and outputs

Inputs

- **Images**: The model can accept high-resolution images as input.
- **Text**: Prompts or questions about the image.

Outputs

- **Image captions**: Descriptions of the contents of an image.
- **Visual question answers**: Answers to questions about the contents of an image.

Capabilities

The blip3-phi3-mini-instruct-r-v1 model performs strongly on a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering. It can generate detailed, accurate captions for images and provide informative answers to visual questions.

What can I use it for?

The blip3-phi3-mini-instruct-r-v1 model suits applications that understand and generate natural language in the context of visual information, such as:

- **Image captioning**: Automatically describe images for photo organization, content moderation, and accessibility.
- **Visual question answering**: Let users ask questions about the contents of images and receive informative answers, useful for educational, assistive, or exploratory applications.
- **Multimodal search and retrieval**: Let users find relevant images or documents from natural language queries.

Things to try

One notable aspect of blip3-phi3-mini-instruct-r-v1 is that it performs well across a range of tasks while remaining relatively lightweight (under 5 billion parameters). That makes it a useful building block for more specialized or constrained vision-language applications, such as those targeting memory- or latency-constrained environments. Try fine-tuning or adapting the model to your specific use case to take advantage of its strong underlying capabilities.

Read more

Updated 6/9/2024

📶

codegen25-7b-multi_P

Salesforce

Total Score

131

CodeGen2.5-7B-multi is part of the CodeGen2.5 family of autoregressive language models for program synthesis, developed by Salesforce. It builds upon the previous CodeGen2 model and achieves results competitive with the much larger StarCoderBase-15.5B while being less than half its size. Like CodeGen2, the model is capable of infilling and supports multiple programming languages. It was trained on the StarCoderData dataset for 1.4T tokens. Salesforce then continued training on Python and on instruction data, releasing three versions:

- **CodeGen2.5-7B-multi** (this repository): Trained on StarCoderData.
- **CodeGen2.5-7B-mono**: Further trained on additional Python tokens.
- **CodeGen2.5-7B-instruct**: Further trained from CodeGen2.5-7B-mono on instruction data (for research purposes only).

Model inputs and outputs

Inputs

- **Code context**: The code context the model uses to generate, infill, or complete code.

Outputs

- **Code completion**: Code completions generated from a given context.
- **Code infilling**: Code filled into a gap in a partially completed snippet.

Capabilities

CodeGen2.5-7B-multi can be used for a variety of code-related tasks, such as code generation, code completion, and code infilling. Because it understands multiple programming languages, it can assist with automatic code generation, code refactoring, and code optimization.

What can I use it for?

Developers and data scientists can use CodeGen2.5-7B-multi to streamline their programming workflows, for example to generate boilerplate code, complete partially written functions, or suggest improvements to existing code. This saves time and lets developers focus on more complex problem-solving and the creative aspects of their work. The model's ability to understand and generate code in multiple languages also makes it valuable for cross-language projects or for developers working with unfamiliar languages.

Things to try

One interesting thing to try with CodeGen2.5-7B-multi is the infilling capability: by inserting the model's mask sentinel token at a specific point in your code, the model can generate appropriate code to fill that gap, helping you iterate quickly on your programming tasks. Another intriguing aspect is its multi-language support: try providing the model with a mix of code snippets in different languages and see how it handles the context and generates new content.
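
Here is a minimal completion sketch with Hugging Face transformers. The model ID and the trust_remote_code=True flag on the tokenizer reflect how the CodeGen2.5 checkpoints are typically loaded and are assumptions to verify against the model card.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Salesforce/codegen25-7b-multi_P"  # assumed Hugging Face model ID
# CodeGen2.5 ships a custom tokenizer, hence trust_remote_code=True (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def hello_world():"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```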

Read more

Updated 5/28/2024

🛠️

codegen-16B-multi

Salesforce

Total Score

119

The codegen-16B-multi model is a large autoregressive language model developed by Salesforce for program synthesis. It was initialized from the codegen-nl-16b model and further pre-trained on a dataset of multiple programming languages, including C, C++, Go, Java, JavaScript, and Python, totaling 119.2B tokens. The model uses cross-entropy loss to maximize the likelihood of sequential inputs and was trained on multiple TPU-v4-512 devices.

Similar models include CodeGen2.5-7B-multi, a smaller but capable program synthesis model also developed by Salesforce, and the StarCoder and StarCoderBase models from BigCode, which were trained on a broader set of 80+ programming languages.

Model inputs and outputs

The codegen-16B-multi model takes natural language and programming language text as input and generates executable code. It can also complete partially generated code, making it useful for tasks like code autocompletion.

Inputs

- Natural language prompts or comments related to the desired code.
- Partially generated code snippets.

Outputs

- Executable code in a variety of programming languages, including C, C++, Go, Java, JavaScript, and Python.

Capabilities

The codegen-16B-multi model generates high-quality, executable code in multiple programming languages from natural language prompts. It can understand the context and intent behind text-based instructions and translate them into functional code, and it performs well on benchmarks like HumanEval and MTPB.

What can I use it for?

The codegen-16B-multi model can be a powerful tool for developers, engineers, and data scientists who need to generate code quickly. Potential use cases include:

- Automating repetitive coding tasks
- Generating boilerplate code or scaffolding
- Prototyping new ideas and concepts
- Assisting with programming education and learning

By leveraging the model's understanding of natural language and programming constructs, users can save time and increase their productivity on software projects.

Things to try

One interesting aspect of the codegen-16B-multi model is its ability to complete partially generated code, which is useful for code autocompletion: the model can suggest the next logical step in a programming workflow. Try providing a code snippet with a missing section at the end and see how the model continues it. Another thing to explore is its performance across programming languages. While the model was trained on a diverse set of languages, its proficiency may vary among them, so try prompting it with tasks in different languages and observe how it responds.
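
A short completion sketch using the standard transformers causal-LM interface is shown below; note that the 16B checkpoint needs substantial GPU memory, and the smaller codegen-2B-multi or codegen-350M-multi checkpoints are drop-in substitutes for experimentation.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Salesforce/codegen-16B-multi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Complete a partially written JavaScript function from a comment plus a stub
prompt = "// return the n-th Fibonacci number\nfunction fib(n) {"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```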

Read more

Updated 5/28/2024

🤯

codegen-16B-mono

Salesforce

Total Score

116

CodeGen is a family of autoregressive language models for program synthesis from Salesforce. The checkpoint in this repository is denoted CodeGen-Mono 16B, where "Mono" means the model was initialized from CodeGen-Multi 16B and further pre-trained on a Python programming language dataset, and "16B" refers to the number of trainable parameters. A smaller codegen-350M-mono model with 350M parameters is also available, and the related codegen-16B-multi model is pre-trained on multiple programming languages including C, C++, Go, Java, JavaScript, and Python.

Another related family is CodeT5+, a set of open-code large language models with an encoder-decoder architecture that can operate in different modes to support a wide range of code understanding and generation tasks. The codet5p-16b and instructcodet5p-16b checkpoints are 16B-parameter versions of CodeT5+ that are instruction-tuned to align with natural language prompts. Finally, the codegen25-7b-multi_P model is part of the CodeGen2.5 family, a smaller but highly capable model trained on the StarCoderData dataset that supports multiple programming languages and can perform both code generation and infilling.

Model inputs and outputs

Inputs

- **Natural language prompts**: The models generate code from natural language descriptions, typically given as a comment string.
- **Partially generated code**: The models can also complete partially generated code.

Outputs

- **Executable code**: Code in various programming languages, generated from the input prompts.

Capabilities

The CodeGen and CodeT5+ models are highly capable at program synthesis, i.e. generating executable code from natural language prompts. They outperform many large language models on a variety of code generation benchmarks, in some cases even surpassing closed-source models. The multi-lingual variants handle a diverse set of programming languages, and all of the models can complete partially generated code, which makes them useful for code editing and autocompletion. The CodeT5+ models are additionally designed to be flexible, supporting encoder-only, decoder-only, and encoder-decoder modes for a wide range of code understanding and generation tasks.

What can I use it for?

These models are well suited to applications that involve generating or understanding code, such as:

- **Code generation**: Automatically generating code from natural language descriptions, which helps with prototyping, automating repetitive tasks, or assisting developers.
- **Code completion**: Completing partially written code to boost developer productivity.
- **Code understanding**: The CodeT5+ models can be used for tasks like code search, code summarization, and code translation.

By leveraging these capabilities, developers and researchers can build applications that automate or assist with programming tasks, potentially boosting productivity and expanding the reach of AI in software development.

Things to try

One interesting aspect of these models is their ability to generate code in multiple programming languages. Try providing prompts that mix natural language and code snippets in different languages and see how the models handle cross-lingual generation. Another exercise is to explore their few-shot or zero-shot capabilities on specific programming tasks or benchmarks; by fine-tuning or prompting the models in creative ways, you may unlock use cases beyond standard code generation. Finally, compare the different variants (for example, codegen-16B-mono versus codegen-16B-multi) to understand how the pretraining data and model architecture choices affect performance and capabilities.

Read more

Updated 5/28/2024