OpenBMB

Models by this creator

🧪

MiniCPM-Llama3-V-2_5

openbmb

Total Score

1.2K

MiniCPM-Llama3-V-2_5 is the latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct for a total of 8B parameters. It exhibits significant performance improvements over the previous MiniCPM-V 2.0 model. With only 8B parameters, it achieves leading performance on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, surpassing widely used proprietary models like GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3. It also demonstrates strong OCR capabilities, scoring over 700 on OCRBench and outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro.

Model inputs and outputs

Inputs

* **Images**: The model can process images with any aspect ratio up to 1.8 million pixels.
* **Text**: The model can engage in multimodal interactions, accepting text prompts and queries.

Outputs

* **Text**: The model generates text responses to user prompts and queries, leveraging its multimodal understanding.
* **Extracted text**: The model can perform full-text OCR extraction from images, converting printed or handwritten text into editable markdown.
* **Structured data**: The model can convert tabular information in images into markdown format.

Capabilities

MiniCPM-Llama3-V-2_5 exhibits trustworthy multimodal behavior, achieving a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%). The model also supports over 30 languages, including German, French, Spanish, Italian, and Russian, through the VisCPM cross-lingual generalization technology. Additionally, the model has been optimized for efficient deployment on edge devices, realizing a 150-fold acceleration in multimodal image encoding on mobile phones with Qualcomm chips.

What can I use it for?

MiniCPM-Llama3-V-2_5 can be used for a variety of multimodal tasks, such as visual question answering, document understanding, and image-to-text generation. Its strong OCR capabilities make it particularly useful for tasks involving text extraction and structured data processing from images, such as digitizing forms, receipts, or whiteboards. The model's multilingual support also enables cross-lingual applications, allowing users to interact with the system in their preferred language.

Things to try

Experiment with MiniCPM-Llama3-V-2_5's capabilities by providing it with a diverse set of images and prompts. Test its ability to accurately extract and convert text from high-resolution, complex images. Explore its cross-lingual functionality by interacting with the model in different languages. Additionally, assess the model's trustworthiness by monitoring its behavior on potential hallucination tasks.
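As a starting point, here is a minimal inference sketch using the Hugging Face remote-code chat interface that the MiniCPM-V models ship with; the image path, prompt, and sampling settings are placeholders to adapt for your own task.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer with the custom code bundled in the repo
model_id = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Ask a question about a local image (placeholder path)
image = Image.open("receipt.jpg").convert("RGB")
msgs = [{"role": "user", "content": "Extract all text from this receipt as markdown."}]

response = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,      # assumed sampling settings; adjust as needed
    temperature=0.7,
)
print(response)
```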

Read more

Updated 6/17/2024

🖼️

MiniCPM-V-2

openbmb

Total Score

509

MiniCPM-V-2 is a strong multimodal large language model developed by openbmb for efficient end-side deployment. It is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. The latest version, MiniCPM-V 2.0, has several notable features. It achieves state-of-the-art performance on multiple benchmarks, even outperforming strong models like Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. It also shows strong OCR capability, achieving comparable performance to Gemini Pro in scene-text understanding and state-of-the-art performance on OCRBench among open-source models. Additionally, MiniCPM-V 2.0 is the first end-side LMM aligned via multimodal RLHF for trustworthy behavior, allowing it to match GPT-4V in preventing hallucinations on Object HalBench. The model can also accept high-resolution 1.8-million-pixel images at any aspect ratio.

Model inputs and outputs

Inputs

* **Text**: The model can take in text inputs.
* **Images**: MiniCPM-V 2.0 can accept high-resolution 1.8-million-pixel images at any aspect ratio.

Outputs

* **Text**: The model generates text outputs.

Capabilities

MiniCPM-V 2.0 demonstrates state-of-the-art performance on a wide range of multimodal benchmarks, including OCRBench, TextVQA, MME, MMB, and MathVista. It outperforms even larger models like Qwen-VL-Chat 9.6B and Yi-VL 34B on the comprehensive OpenCompass evaluation. The model's strong OCR capabilities make it well-suited for tasks like scene-text understanding. Additionally, MiniCPM-V 2.0 is the first end-side LMM to be aligned via multimodal RLHF for trustworthy behavior, preventing hallucinations on Object HalBench. This makes it a reliable choice for applications where accuracy and safety are paramount.

What can I use it for?

The high performance and trustworthy behavior of MiniCPM-V 2.0 make it a good choice for a variety of multimodal applications. Some potential use cases include:

* **Multimodal question answering**: The model's strong performance on benchmarks like TextVQA and MME suggests it could be useful for tasks that involve answering questions based on a combination of text and images.
* **Scene-text understanding**: MiniCPM-V 2.0's state-of-the-art OCR capabilities make it well-suited for applications that involve extracting and understanding text from images, such as document digitization or visual search.
* **Multimodal content generation**: The model's ability to generate text conditioned on images could enable applications like image captioning or visual storytelling.

Things to try

One interesting aspect of MiniCPM-V 2.0 is its ability to accept high-resolution 1.8-million-pixel images at any aspect ratio. This enables better perception of fine-grained visual information, such as small objects and optical characters, which is useful for applications like optical character recognition or detailed image understanding. Additionally, the model's alignment via multimodal RLHF for trustworthy behavior is a notable feature. Developers could explore ways to leverage this capability to build AI systems that are reliable and safe, particularly in sensitive domains where accurate and unbiased outputs are critical.
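A quick way to probe the scene-text abilities is the same Hugging Face remote-code chat interface used across the MiniCPM-V series. The bfloat16 setup and call signature below follow the pattern documented for these models and should be read as a sketch; earlier MiniCPM-V versions may return the reply together with a conversation-context object, so the raw return value is printed as-is.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model = model.to(device="cuda", dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Ask about text visible in a street scene (placeholder image path)
image = Image.open("storefront.jpg").convert("RGB")
msgs = [{"role": "user", "content": "What does the sign in this photo say?"}]

# Depending on the model version, chat() may return just the reply
# or a tuple of (reply, context, ...); printing the raw value works either way.
output = model.chat(image=image, msgs=msgs, context=None,
                    tokenizer=tokenizer, sampling=True, temperature=0.7)
print(output)
```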

Read more

Updated 5/28/2024

👀

MiniCPM-2B-sft-fp32

openbmb

Total Score

296

MiniCPM-2B-sft-fp32 is an end-side large language model (LLM) developed by ModelBest Inc. and TsinghuaNLP, with only 2.4B parameters excluding embeddings. It is built upon the MiniCPM architecture and has achieved impressive performance, outperforming larger models such as Llama2-13B, MPT-30B, and Falcon-40B on various benchmarks, especially in Chinese, mathematics, and coding tasks. The model has also been fine-tuned using both SFT (supervised fine-tuning) and DPO (direct preference optimization), further enhancing its capabilities.

Model inputs and outputs

Inputs

* **Natural language text**: The model accepts natural language input for text generation tasks.

Outputs

* **Natural language text**: The model generates coherent and contextually relevant text outputs.

Capabilities

MiniCPM-2B-sft-fp32 has demonstrated strong performance across a variety of tasks, including language understanding, generation, and reasoning. After SFT, the model performs very close to the larger Mistral-7B on open-source general benchmarks, with better abilities in Chinese, mathematics, and coding. After further improvement through DPO, it outperforms larger models such as Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha on the MTBench benchmark.

What can I use it for?

MiniCPM-2B-sft-fp32 can be used for a wide range of natural language processing tasks, such as text generation, language understanding, and even coding- and mathematics-related tasks. The model's compact size and high efficiency make it a suitable choice for deployment on mobile devices and in resource-constrained environments. Potential use cases include chatbots, virtual assistants, content generation, and task-oriented language models.

Things to try

One interesting aspect of MiniCPM-2B-sft-fp32 is its ability to perform well on Chinese, mathematics, and coding tasks. Developers could explore using the model for applications that require these specialized capabilities, such as AI-powered programming assistants or language models tailored for scientific and technical domains. Additionally, the model's efficient design and the availability of quantized versions, such as MiniCPM-2B-SFT/DPO-Int4, could be investigated for deployment on low-power devices or in edge computing scenarios.
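Here is a minimal chat sketch following the Hugging Face remote-code interface documented for the MiniCPM-2B models, whose chat() helper returns the reply together with the running history; the prompt and sampling parameters are placeholders to tune for your task.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/MiniCPM-2B-sft-fp32"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # fp32 weights; the bf16 variant would use torch.bfloat16
    device_map="cuda",
    trust_remote_code=True,      # the chat() helper ships with the model repo
)

# Single-turn chat; the helper returns the reply and the conversation history
prompt = "Write a short Python function that checks whether a number is prime."
response, history = model.chat(tokenizer, prompt, temperature=0.5, top_p=0.8)
print(response)
```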

Read more

Updated 5/28/2024

👁️

MiniCPM-Llama3-V-2_5-gguf

openbmb

Total Score

172

MiniCPM-Llama3-V-2_5-gguf is a GGUF-format release of MiniCPM-Llama3-V 2.5, the latest model in the MiniCPM-V series developed by openbmb. It is built on the SigLip-400M and Llama3-8B-Instruct models, resulting in a total of 8B parameters. Compared to the previous MiniCPM-V 2.0 model, MiniCPM-Llama3-V 2.5 achieves significant performance improvements across a range of benchmarks, surpassing several widely used proprietary models. The model exhibits strong capabilities in areas like OCR, language understanding, and trustworthy behavior. It also supports over 30 languages through minimal instruction-tuning and has been optimized for efficient deployment on edge devices. This model builds upon the work of the VisCPM, RLHF-V, LLaVA-UHD, and RLAIF-V projects from the openbmb team.

Model inputs and outputs

Inputs

* **Images**: MiniCPM-Llama3-V-2_5-gguf can process images with any aspect ratio up to 1.8 million pixels.
* **Text**: The model can engage in interactive conversations, processing user messages as input.

Outputs

* **Text**: The model generates relevant and coherent text responses to user inputs.
* **Multimodal understanding**: The model combines its understanding of the input image and text to provide comprehensive, multimodal outputs.

Capabilities

MiniCPM-Llama3-V-2_5-gguf has demonstrated leading performance on a range of benchmarks, including TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench. With 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3. The model has also shown strong OCR capabilities, achieving a score of over 700 on OCRBench and outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro. Additionally, it exhibits trustworthy behavior, with a hallucination rate of 10.3% on Object HalBench, lower than GPT-4V-1106 (13.6%).

What can I use it for?

MiniCPM-Llama3-V-2_5-gguf can be used for a variety of multimodal tasks, such as visual question answering, document understanding, and interactive language-image applications. Its strong OCR capabilities make it well-suited for tasks like text extraction from images, document processing, and table-to-markdown conversion. The model's multilingual support and efficient deployment on edge devices also open up opportunities for developing language-agnostic applications and integrating the model into mobile and IoT solutions.

Things to try

One exciting aspect of MiniCPM-Llama3-V-2_5-gguf is its ability to engage in interactive, multimodal conversations. You can provide the model with a series of messages and images and observe how it leverages its understanding of both modalities to generate coherent and informative responses. Additionally, its versatile OCR capabilities allow you to experiment with tasks like extracting text from images of varying complexity, such as documents, receipts, or handwritten notes, and to explore its ability to understand and reason about the contents of these images in a multimodal context.

Read more

Updated 6/27/2024

✅

cpm-bee-10b

openbmb

Total Score

166

The cpm-bee-10b is a fully open-source, commercially usable Chinese-English bilingual base model with ten billion parameters. Developed by OpenBMB, it is the second milestone achieved through the training process of CPM-live. Using a Transformer auto-regressive architecture, CPM-Bee has been pre-trained on an extensive corpus of trillion-scale tokens, giving it strong foundational capabilities.

Model inputs and outputs

The cpm-bee-10b model is a text-to-text generative model that can be used for a variety of natural language processing tasks. It takes textual input and generates relevant and coherent text outputs.

Inputs

* **Text**: The model accepts raw text as input, which can include natural language, programming code, or any other textual data.

Outputs

* **Generated text**: The model outputs generated text that is relevant and coherent with the provided input. This could include continuations of the input text, answers to questions, or summaries of longer passages.

Capabilities

The cpm-bee-10b model performs well in both Chinese and English, thanks to careful curation and balancing of the pre-training data. It is also supported by the OpenBMB ecosystem of tools and scripts for high-performance pre-training, adaptation, compression, deployment, and tool development. Additionally, the model has been fine-tuned to have strong conversational and tool-usage capabilities.

What can I use it for?

The cpm-bee-10b model can be used for a wide range of natural language processing tasks, such as text generation, question answering, and language translation. Because it is open-source and commercially usable, developers can leverage the model to build various applications and services.

Things to try

One interesting aspect of the cpm-bee-10b model is its support for sequence lengths of up to 8,192 tokens, enabled by ALiBi position embeddings. This allows the model to handle tasks that require processing lengthy input or output, such as summarizing long documents or generating coherent multi-paragraph text.
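CPM-Bee is prompted with a structured, dictionary-style input in which the field to be generated is marked by an "<ans>" placeholder. The sketch below assumes the remote-code generate interface described on the model card; the example text and the exact call pattern are assumptions to verify against the official documentation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openbmb/cpm-bee-10b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).cuda()

# The input is a dict; "<ans>" marks the slot the model should fill in.
# Richer structured fields (e.g. task-specific keys) are described in the
# model's own documentation.
request = {
    "input": "It was a bright cold day in April, and",
    "<ans>": "",
}
result = model.generate(request, tokenizer)
print(result)
```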

Read more

Updated 5/27/2024

🌀

MiniCPM-2B-sft-bf16

openbmb

Total Score

113

MiniCPM-2B-sft-bf16 is a large language model developed by OpenBMB and TsinghuaNLP, with only 2.4 billion parameters excluding embeddings. It is an "end-side" LLM, meaning it is designed for efficient deployment even on resource-constrained devices like smartphones. After supervised fine-tuning (SFT), it achieves performance very close to the larger Mistral-7B on open-source benchmarks, with better abilities in Chinese, mathematics, and coding, and its overall performance exceeds that of Llama2-13B, MPT-30B, and Falcon-40B. After further training using direct preference optimization (DPO), the MiniCPM-2B model outperforms even larger models like Llama2-70B-Chat, Vicuna-33B, Mistral-7B-Instruct-v0.1, and Zephyr-7B-alpha on the MTBench evaluation. The MiniCPM-V variant, based on the MiniCPM-2B architecture, achieves the best overall performance among multimodal models of a similar scale, surpassing existing multimodal models built on Phi-2 and even matching the performance of the 9.6B Qwen-VL-Chat model on some tasks.

Model inputs and outputs

Inputs

* Text input for language understanding and generation tasks

Outputs

* Generated text based on the input
* Multimodal outputs (e.g. image captions, VQA answers) for the MiniCPM-V variant

Capabilities

MiniCPM-2B-sft-bf16 demonstrates strong performance across a variety of benchmarks, including open-domain language understanding, mathematics, coding, and Chinese language tasks. The MiniCPM-V variant extends these capabilities to multimodal tasks like image captioning and visual question answering.

One key advantage of the MiniCPM models is their efficient deployment. They can run on devices as small as smartphones, with MiniCPM-V being the first multimodal model of its kind that can be deployed on mobile phones. The models also have a low development cost, requiring only a single 1080/2080 GPU for parameter-efficient fine-tuning and a 3090/4090 GPU for full-parameter fine-tuning.

What can I use it for?

The MiniCPM models are well suited to a variety of natural language processing and multimodal applications, such as:

* General language understanding and generation
* Domain-specific applications (e.g. legal, medical, mathematical)
* Multimodal tasks like image captioning and visual question answering
* Conversational AI and virtual assistants
* Mobile and edge computing applications

Thanks to their efficient design and deployment, the MiniCPM models can be particularly useful in resource-constrained environments or for applications that require low latency, such as on-device inference.

Things to try

One interesting aspect of the MiniCPM models is their ability to perform well on Chinese language tasks in addition to their strengths in English. This makes them a compelling choice for multilingual applications or for users who require Chinese language capabilities.

Additionally, the MiniCPM-V variant's strong multimodal performance, combined with its efficient deployment, opens up opportunities for applications that integrate vision and language, such as mobile-based visual question answering or image-guided dialogue systems. Researchers and developers may also be interested in the technical details of the MiniCPM models, such as the use of supervised fine-tuning and direct preference optimization, to better understand how to build performant and efficient large language models.

Read more

Updated 5/28/2024

🌀

MiniCPM-V

openbmb

Total Score

112

MiniCPM-V is an efficient and high-performing multimodal language model developed by the OpenBMB team. It is an improved version of the MiniCPM-2.4B model, with several notable features.

Firstly, MiniCPM-V can be efficiently deployed on most GPUs and even mobile phones, thanks to its compressed image representation. It encodes images into just 64 tokens, significantly fewer than other models that typically use over 512 tokens. This allows MiniCPM-V to operate with much less memory and higher inference speed.

Secondly, MiniCPM-V demonstrates state-of-the-art performance on multiple benchmarks, such as MMMU, MME, and MMBench, surpassing existing models of comparable size. It even achieves comparable or better results than the larger 9.6B Qwen-VL-Chat model.

Lastly, MiniCPM-V is the first end-deployable large language model that supports bilingual multimodal interaction in both English and Chinese. This is enabled by a technique from the VisCPM ICLR 2024 paper that generalizes multimodal capabilities across languages.

Model inputs and outputs

Inputs

* **Images**: MiniCPM-V can accept images as inputs for tasks such as visual question answering and image description generation.
* **Text**: The model can also take text inputs, allowing for multimodal interactions and conversations.

Outputs

* **Text**: Based on the provided inputs, MiniCPM-V can generate relevant text responses, such as answers to questions about images or descriptions of their contents.

Capabilities

MiniCPM-V demonstrates strong multimodal understanding and generation capabilities. It can accurately caption images and answer questions about them, as evidenced by its high performance on benchmarks like MMMU and MMBench.

What can I use it for?

Given its strong multimodal abilities, MiniCPM-V can be useful for a variety of applications, such as:

* **Visual question answering**: The model can be used to build applications that allow users to ask questions about images and receive relevant responses.
* **Image captioning**: MiniCPM-V can be integrated into systems that automatically generate descriptions for images.
* **Multimodal conversational assistants**: The model's bilingual support and multimodal capabilities make it a good candidate for building conversational AI assistants that can understand and respond to both text and images.

Things to try

One interesting aspect of MiniCPM-V is its efficient visual encoding technique, which allows the model to operate with much lower memory requirements than other large multimodal models. This could enable deployment of MiniCPM-V on resource-constrained devices, such as mobile phones, opening up new possibilities for on-the-go multimodal interactions.

Additionally, the model's bilingual support allows for multimodal communication in both English and Chinese. Developers could explore building applications that leverage this capability, such as cross-language visual question answering or image-based translation services.

Read more

Updated 5/28/2024

🔄

UltraLM-13b

openbmb

Total Score

70

The UltraLM-13b model is a chat language model fine-tuned from LLaMA-13B on the UltraChat dataset. It is maintained by openbmb. Similar models include the 34b-beta model, a 34B-parameter CausalLM model, and the Llama-2-13b-chat-german model, a variant of the Llama 2 13B Chat model fine-tuned on German language data.

Model inputs and outputs

The UltraLM-13b model is a text-to-text model, meaning it takes text as input and generates text as output. The input follows a multi-turn chat format, with the user providing instructions or prompts and the model generating responses.

Inputs

* User instructions or prompts, formatted as a multi-turn chat

Outputs

* Model responses to the user's prompts, also formatted as a multi-turn chat

Capabilities

The UltraLM-13b model is capable of engaging in open-ended dialogue and task-oriented conversations. It can understand and respond to user prompts on a wide range of topics, drawing upon its extensive training data. The model is particularly adept at tasks like question answering, summarization, and language generation.

What can I use it for?

The UltraLM-13b model can be used for a variety of applications, such as building chatbots, virtual assistants, or interactive language models. It could be integrated into customer service platforms, educational tools, or creative writing applications. Additionally, the model's capabilities could be leveraged for research purposes, such as exploring the limits of language understanding and generation.

Things to try

One interesting thing to try with the UltraLM-13b model is exploring its multi-turn chat capabilities. Provide the model with a series of related prompts and see how it maintains context and continuity in its responses. You could also experiment with prompting the model to engage in specific tasks, such as summarizing long passages of text or answering follow-up questions. Lastly, consider comparing the model's performance to similar language models, such as the 34b-beta or Llama-2-13b-chat-german models, to gain insights into its unique strengths and limitations.
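Because UltraLM-13b is a standard LLaMA-based causal LM, a plain transformers generation loop is enough to try it. Note two assumptions in the sketch below: the original release may ship as delta weights that must first be merged with LLaMA-13B, and the "User:" / "Assistant:" multi-turn template is inferred from the model's described chat format, so check the model card for the exact prompt convention.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes fully merged weights are available at this path/repo id.
model_id = "openbmb/UltraLM-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed multi-turn template: alternating "User:" and "Assistant:" turns.
prompt = (
    "User: Summarize the plot of Romeo and Juliet in two sentences.</s>\n"
    "Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

# Strip the prompt tokens so only the model's reply is printed
reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```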

Read more

Updated 5/28/2024

🧠

OmniLMM-12B

openbmb

Total Score

55

OmniLMM-12B is the most capable version of OmniLMM, a powerful multimodal AI model created by openbmb. It is built upon EVA02-5B and Zephyr-7B-β, connecting them with a perceiver resampler layer and training on diverse multimodal data. OmniLMM-12B stands out for its strong performance, trustworthy behavior, and real-time multimodal interaction capabilities.

It achieves leading results on multiple benchmarks like MME, MMBench, and SEED-Bench, surpassing many established large language models. Notably, OmniLMM-12B is the first state-of-the-art open-source model aligned via multimodal RLHF for trustworthy behavior, ranking #1 on MMHal-Bench and outperforming GPT-4V on Object HalBench. The model can also be combined with GPT-3.5 to create a real-time multimodal interactive assistant that can handle video and speech inputs.

Model Inputs and Outputs

Inputs

* **Images**: OmniLMM-12B can accept high-resolution images up to 1.8 million pixels (e.g. 1344x1344) in any aspect ratio.
* **Text**: The model can process natural language text inputs.
* **Multimodal Prompts**: OmniLMM-12B supports prompts that combine images and text.

Outputs

* **Generated Text**: The model can produce human-like text outputs in response to prompts.
* **Multimodal Responses**: OmniLMM-12B can generate outputs that combine text with other modalities.

Capabilities

OmniLMM-12B has shown impressive capabilities across a range of tasks, from understanding and generating text to perceiving and reasoning about visual information. It can describe images, answer questions, and complete other multimodal tasks with high accuracy, faithfully describing the contents of an image and even detecting and discussing small or fine-grained details.

What Can I Use It For?

OmniLMM-12B is a versatile model that can be applied to a wide variety of multimodal applications. Some potential use cases include:

* **Intelligent assistants**: Integrate OmniLMM-12B into conversational AI agents to enable rich, multimodal interactions.
* **Content generation**: Use the model to generate informative, human-like text descriptions for images or other visual content.
* **Multimodal question answering**: Build systems that answer questions by combining information from text and visual inputs.
* **Multimodal reasoning**: Leverage OmniLMM-12B's strong multimodal capabilities to tackle complex reasoning tasks that require understanding across modalities.

Things to Try

One interesting aspect of OmniLMM-12B is its ability to handle high-resolution images at any aspect ratio. This enables the model to perceive fine-grained visual details that may be missed by models restricted to lower resolutions or fixed aspect ratios. Developers could experiment with using OmniLMM-12B for tasks like fine-grained object detection, text extraction from images, or visual question answering on complex scenes.

Another key feature is the model's trustworthy behavior, achieved through multimodal RLHF alignment. Researchers and developers could investigate how this alignment affects the model's outputs and explore ways to further enhance its safety and reliability for real-world applications.

Overall, OmniLMM-12B's strong performance and diverse capabilities make it a compelling model for a range of multimodal AI projects, from intelligent assistants to content generation and multimodal reasoning.

Read more

Updated 6/1/2024

👁️

MiniCPM-Llama3-V-2_5-int4

openbmb

Total Score

50

The MiniCPM-Llama3-V-2_5-int4 model is an int4-quantized version of MiniCPM-Llama3-V 2.5, developed by openbmb. The quantization compresses the model so that it uses less GPU memory, approximately 9 GB, while maintaining performance. It is an image-to-text model capable of generating text descriptions for images.

Model inputs and outputs

The MiniCPM-Llama3-V-2_5-int4 model takes two main inputs: an image and a set of conversational messages. The image provides the visual context, while the messages provide the textual context for the model to generate a relevant response.

Inputs

* **Image**: The model accepts an image in RGB format, which provides the visual information for the task.
* **Messages**: A list of conversational messages in the format {'role': 'user', 'content': 'message text'}. These messages give the model additional context to generate an appropriate response.

Outputs

* **Generated text**: The model outputs a text response that describes the content of the input image, based on the provided conversational context.

Capabilities

The MiniCPM-Llama3-V-2_5-int4 model is capable of generating text descriptions for images, leveraging both the visual information and the conversational context. This is useful for tasks like image captioning, visual question answering, and interactive image-based dialogues.

What can I use it for?

The MiniCPM-Llama3-V-2_5-int4 model can be used in a variety of applications that involve generating text descriptions for images, such as:

* **Image captioning**: Automatically generating captions for images to aid accessibility or for search and retrieval purposes.
* **Visual question answering**: Answering questions about the contents of an image by generating relevant text responses.
* **Interactive image-based dialogues**: Building conversational interfaces that can discuss and describe images in a natural way.

Things to try

One interesting aspect of the MiniCPM-Llama3-V-2_5-int4 model is its ability to generate text responses while considering both the visual and conversational context. You could try providing the model with a variety of image-message pairs to see how it responds, and observe how the generated text changes based on the provided context.
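Loading the int4 checkpoint follows the same remote-code chat interface as the full-precision MiniCPM-Llama3-V 2.5, but without passing a dtype, since the quantization configuration ships with the repository (a bitsandbytes install is likely required). Treat the sketch below as a starting point rather than a definitive recipe; the image path and prompt are placeholders.

```python
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-Llama3-V-2_5-int4"

# The int4 weights and quantization config are bundled in the repo,
# so no torch_dtype is specified; expect roughly 9 GB of GPU memory.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model.eval()

image = Image.open("photo.jpg").convert("RGB")  # placeholder image path
msgs = [{"role": "user", "content": "Describe this image in one paragraph."}]

response = model.chat(image=image, msgs=msgs, tokenizer=tokenizer)
print(response)
```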

Read more

Updated 6/29/2024