MiniCPM-Llama3-V-2_5-gguf

Maintainer: openbmb

Total Score

172

Last updated 6/27/2024


  • Model Link: View on HuggingFace
  • API Spec: View on HuggingFace
  • Github Link: No Github link provided
  • Paper Link: No paper link provided


Model overview

MiniCPM-Llama3-V-2_5-gguf is the GGUF-format release of the latest model in the MiniCPM-V series developed by openbmb, packaged for local inference with runtimes such as llama.cpp. It is built on SigLip-400M and Llama3-8B-Instruct, for a total of 8B parameters. Compared to the previous MiniCPM-V 2.0 model, MiniCPM-Llama3-V-2_5-gguf achieves significant performance improvements across a range of benchmarks, surpassing several widely used proprietary models.

The model exhibits strong capabilities in areas like OCR, language understanding, and trustworthy behavior. It also supports over 30 languages through minimal instruction-tuning, and has been optimized for efficient deployment on edge devices. This model builds upon the work of the VisCPM, RLHF-V, LLaVA-UHD, and RLAIF-V projects from the openbmb team.

Model inputs and outputs

Inputs

  • Images: MiniCPM-Llama3-V-2_5-gguf can process images with any aspect ratio up to 1.8 million pixels.
  • Text: The model can engage in interactive conversations, processing user messages as input.

Outputs

  • Text: The model generates relevant and coherent text responses to user inputs.
  • Multimodal understanding: The model can combine its understanding of the input image and text to provide comprehensive, multimodal outputs.
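As a rough illustration of the input-size limit noted above, a caller might pre-check image dimensions before sending them to the model. The helper names and the proportional-downscale policy below are assumptions for illustration, not part of the model's API; only the ~1.8 million pixel figure comes from the model description.

```python
MAX_PIXELS = 1_800_000  # stated input limit: ~1.8 million pixels, any aspect ratio

def fits_input_limit(width: int, height: int) -> bool:
    """Return True if an image of this size is within the stated pixel budget."""
    return width * height <= MAX_PIXELS

def downscale_to_limit(width: int, height: int) -> tuple[int, int]:
    """Proportionally shrink an oversized image so it fits the pixel budget."""
    pixels = width * height
    if pixels <= MAX_PIXELS:
        return width, height
    scale = (MAX_PIXELS / pixels) ** 0.5  # preserve the aspect ratio
    return int(width * scale), int(height * scale)

# A 1200x1500 image (exactly 1.8 MP) fits; a 4000x3000 photo must be shrunk.
print(fits_input_limit(1200, 1500))  # True
w, h = downscale_to_limit(4000, 3000)
print(w * h <= MAX_PIXELS)           # True
```

Whether an oversized image should be downscaled client-side or left to the model's own preprocessing depends on the serving setup; this sketch only makes the stated constraint concrete.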

Capabilities

MiniCPM-Llama3-V-2_5-gguf has demonstrated leading performance on a range of benchmarks, including TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3.

The model has also shown strong OCR capabilities, achieving a score of over 700 on OCRBench, outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro. Additionally, MiniCPM-Llama3-V-2_5-gguf exhibits trustworthy behavior, with a hallucination rate of 10.3% on Object HalBench, lower than GPT-4V-1106 (13.6%).

What can I use it for?

MiniCPM-Llama3-V-2_5-gguf can be used for a variety of multimodal tasks, such as visual question answering, document understanding, and interactive language-image applications. Its strong OCR capabilities make it well-suited for tasks like text extraction from images, document processing, and table-to-markdown conversion.

The model's multilingual support and efficient deployment on edge devices also open up opportunities for developing language-agnostic applications and integrating the model into mobile and IoT solutions.
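As a sketch of the table-to-markdown use case: the model returns free text, so in practice one would prompt it to emit markdown directly, but a small post-processing helper like the following (entirely hypothetical, not part of any MiniCPM API) shows the target format for rows recovered from an image of a table.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render a list of cell rows (first row = header) as a markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

# Example: cells a model might recover from a photographed receipt.
print(rows_to_markdown([["Item", "Price"], ["Coffee", "3.50"], ["Tea", "2.75"]]))
```

The output is a standard pipe-delimited markdown table, ready to paste into any markdown document.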

Things to try

One exciting aspect of MiniCPM-Llama3-V-2_5-gguf is its ability to engage in interactive, multimodal conversations. You can try providing the model with a series of messages and images, and observe how it leverages its understanding of both modalities to generate coherent and informative responses.

Additionally, the model's versatile OCR capabilities allow you to experiment with tasks like extracting text from images of varying complexity, such as documents, receipts, or handwritten notes. You can also explore its ability to understand and reason about the contents of these images in a multimodal context.



This summary was produced with help from an AI and may contain inaccuracies; check out the links to read the original source documents!

Related Models


MiniCPM-Llama3-V-2_5

openbmb

Total Score

1.2K

MiniCPM-Llama3-V-2_5 is the latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits significant performance improvements over the previous MiniCPM-V 2.0 model. The model achieves leading performance on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, surpassing widely used proprietary models like GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3 despite having only 8B parameters. It also demonstrates strong OCR capabilities, scoring over 700 on OCRBench and outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro.

Model inputs and outputs

Inputs

  • Images: The model can process images with any aspect ratio up to 1.8 million pixels.
  • Text: The model can engage in multimodal interactions, accepting text prompts and queries.

Outputs

  • Text: The model generates text responses to user prompts and queries, leveraging its multimodal understanding.
  • Extracted text: The model can perform full-text OCR extraction from images, converting printed or handwritten text into editable markdown.
  • Structured data: The model can convert tabular information in images into markdown format.

Capabilities

MiniCPM-Llama3-V-2_5 exhibits trustworthy multimodal behavior, achieving a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%). The model also supports over 30 languages, including German, French, Spanish, Italian, and Russian, through the VisCPM cross-lingual generalization technology. Additionally, the model has been optimized for efficient deployment on edge devices, realizing a 150-fold acceleration in multimodal large model image encoding on mobile phones with Qualcomm chips.

What can I use it for?

MiniCPM-Llama3-V-2_5 can be used for a variety of multimodal tasks, such as visual question answering, document understanding, and image-to-text generation. Its strong OCR capabilities make it particularly useful for tasks involving text extraction and structured data processing from images, such as digitizing forms, receipts, or whiteboards. The model's multilingual support also enables cross-lingual applications, allowing users to interact with the system in their preferred language.

Things to try

Experiment with MiniCPM-Llama3-V-2_5's capabilities by providing it with a diverse set of images and prompts. Test its ability to accurately extract and convert text from high-resolution, complex images. Explore its cross-lingual functionality by interacting with the model in different languages. Additionally, assess the model's trustworthiness by monitoring its behavior on potential hallucination tasks.



ggml_llava-v1.5-7b

mys

Total Score

95

The ggml_llava-v1.5-7b model is a GGUF conversion of the llava-v1.5-7b vision-language model, created by mys. It can be used with the llama.cpp library for end-to-end inference without any extra dependencies. This model is similar to other GGUF-formatted models like codellama-7b-instruct-gguf, llava-v1.6-vicuna-7b, and llama-2-7b-embeddings.

Model inputs and outputs

The ggml_llava-v1.5-7b model takes text (and, for multimodal use, an image) as input and generates text as output. The input can be a prompt, question, or any other natural language text. The output is the model's generated response, which can be used for a variety of text-based tasks.

Inputs

  • Text prompt or natural language input, optionally paired with an image

Outputs

  • Generated text response

Capabilities

The ggml_llava-v1.5-7b model can be used for a range of text-generation tasks, such as language generation, question answering, and text summarization. It has been trained on a large corpus of text data and can generate coherent and contextually relevant responses.

What can I use it for?

The ggml_llava-v1.5-7b model can be used for a variety of applications, such as chatbots, virtual assistants, and content generation. It can be particularly useful for companies looking to automate customer service, generate product descriptions, or create marketing content. Additionally, the model's ability to understand and generate text can be leveraged for educational or research purposes.

Things to try

Experiment with the model by providing various types of input prompts, such as open-ended questions, task-oriented instructions, or creative writing prompts. Observe how the model responds and evaluate the coherence, relevance, and quality of the generated text. Additionally, you can explore using the model in combination with other AI tools or frameworks to create more complex applications.



MiniCPM-V

openbmb

Total Score

112

MiniCPM-V is an efficient and high-performing multimodal language model developed by the OpenBMB team. It is an improved version of the MiniCPM-2.4B model, with several notable features. Firstly, MiniCPM-V can be efficiently deployed on most GPUs and even mobile phones, thanks to its compressed image representation: it encodes images into just 64 tokens, significantly fewer than the 512+ tokens other models typically use, allowing it to operate with much less memory and higher inference speed. Secondly, MiniCPM-V demonstrates state-of-the-art performance on multiple benchmarks, such as MMMU, MME, and MMBench, surpassing existing models of comparable size; it even achieves comparable or better results than the larger 9.6B Qwen-VL-Chat model. Lastly, MiniCPM-V is the first end-deployable large language model that supports bilingual multimodal interaction in both English and Chinese, enabled by a technique from the VisCPM ICLR 2024 paper that generalizes multimodal capabilities across languages.

Model inputs and outputs

Inputs

  • Images: MiniCPM-V can accept images as inputs for tasks such as visual question answering and image description generation.
  • Text: The model can also take text inputs, allowing for multimodal interactions and conversations.

Outputs

  • Text: Based on the provided inputs, MiniCPM-V can generate relevant text responses, such as answering questions about images or describing their contents.

Capabilities

MiniCPM-V demonstrates strong multimodal understanding and generation capabilities. For example, it can accurately caption images, as shown in the provided GIFs of a mushroom and a snake. The model is also able to answer questions about images, as evidenced by its high performance on benchmarks like MMMU and MMBench.

What can I use it for?

Given its strong multimodal abilities, MiniCPM-V can be useful for a variety of applications, such as:

  • Visual question answering: The model can be used to build applications that allow users to ask questions about images and receive relevant responses.
  • Image captioning: MiniCPM-V can be integrated into systems that automatically generate descriptions for images.
  • Multimodal conversational assistants: The model's bilingual support and multimodal capabilities make it a good candidate for building conversational AI assistants that can understand and respond to both text and images.

Things to try

One interesting aspect of MiniCPM-V is its efficient visual encoding technique, which allows the model to operate with much lower memory requirements compared to other large multimodal models. This could enable the deployment of MiniCPM-V on resource-constrained devices, such as mobile phones, opening up new possibilities for on-the-go multimodal interactions. Additionally, the model's bilingual support is a noteworthy feature, as it allows for seamless multimodal communication in both English and Chinese. Developers could explore building applications that leverage this capability, such as cross-language visual question answering or image-based translation services.
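The practical impact of the 64-token image encoding can be sketched with simple arithmetic: within a fixed context window, fewer visual tokens leave more room for the conversation itself. Only the 64-vs-512 token counts come from the description above; the 4096-token context window is an assumed figure for illustration.

```python
CONTEXT_WINDOW = 4096  # assumed context size, for illustration only

def text_budget(image_tokens: int, context: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the prompt and response after encoding one image."""
    return context - image_tokens

# A 512-token image encoding consumes 8x more of the window than 64 tokens.
print(text_budget(512))  # 3584
print(text_budget(64))   # 4032
```

The same proportional saving applies to the KV cache and to per-token image-encoding work, which is what makes the compressed representation attractive on mobile hardware.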



MiniCPM-V-2

openbmb

Total Score

509

MiniCPM-V-2 is a strong multimodal large language model developed by openbmb for efficient end-side deployment. It is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. The latest version, MiniCPM-V 2.0, has several notable features. It achieves state-of-the-art performance on multiple benchmarks, even outperforming strong models like Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. It also shows strong OCR capability, achieving comparable performance to Gemini Pro in scene-text understanding and state-of-the-art performance on OCRBench among open-source models. Additionally, MiniCPM-V 2.0 is the first end-side LMM aligned via multimodal RLHF for trustworthy behavior, allowing it to match GPT-4V in preventing hallucinations on Object HalBench. The model can also accept high-resolution 1.8 million pixel images at any aspect ratio.

Model inputs and outputs

Inputs

  • Text: The model can take in text inputs.
  • Images: MiniCPM-V 2.0 can accept high-resolution 1.8 million pixel images at any aspect ratio.

Outputs

  • Text: The model generates text outputs.

Capabilities

MiniCPM-V 2.0 demonstrates state-of-the-art performance on a wide range of multimodal benchmarks, including OCRBench, TextVQA, MME, MMB, and MathVista. It outperforms even larger models like Qwen-VL-Chat 9.6B and Yi-VL 34B on the comprehensive OpenCompass evaluation. The model's strong OCR capabilities make it well-suited for tasks like scene-text understanding. Additionally, MiniCPM-V 2.0 is the first end-side LMM to be aligned via multimodal RLHF for trustworthy behavior, preventing hallucinations on the Object HalBench. This makes it a reliable choice for applications where accuracy and safety are paramount.

What can I use it for?

The high performance and trustworthy nature of MiniCPM-V 2.0 make it a great choice for a variety of multimodal applications. Some potential use cases include:

  • Multimodal question answering: The model's strong performance on benchmarks like TextVQA and MME suggests it could be useful for tasks that involve answering questions based on a combination of text and images.
  • Scene text understanding: MiniCPM-V 2.0's state-of-the-art OCR capabilities make it well-suited for applications that involve extracting and understanding text from images, such as document digitization or visual search.
  • Multimodal content generation: The model's ability to generate text conditioned on images could enable applications like image captioning or visual storytelling.

Things to try

One interesting aspect of MiniCPM-V 2.0 is its ability to accept high-resolution 1.8 million pixel images at any aspect ratio. This enables better perception of fine-grained visual information, such as small objects and optical characters, which could be useful for applications like optical character recognition or detailed image understanding. Additionally, the model's alignment via multimodal RLHF for trustworthy behavior is a notable feature. Developers could explore ways to leverage this capability to build AI systems that are reliable and safe, particularly in sensitive domains where accurate and unbiased outputs are critical.
