bge-m3

Maintainer: BAAI

Last updated 5/27/2024

Property	Value
Run this model	Run on HuggingFace
API spec	View on HuggingFace
GitHub link	No GitHub link provided
Paper link	No paper link provided

Model overview

bge-m3 is an embedding model developed by BAAI (Beijing Academy of Artificial Intelligence), distinguished by its multi-functionality, multi-linguality, and multi-granularity. A single model performs the three common retrieval functions of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval. It supports more than 100 working languages and processes inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

Compared with similar models such as m3e-large, bge-m3 combines all three retrieval functions in a single model. Related models such as bge_1-5_query_embeddings, bge-large-en-v1.5, bge-reranker-base, and bge-reranker-v2-m3 each cover a narrower function: query embedding generation, text embedding, or re-ranking.

Model inputs and outputs

Inputs

Text sequences of varying length, up to 8192 tokens

Outputs

Dense embeddings for retrieval
Sparse token-level representations for retrieval
Multi-vector representations for retrieval
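
To make the three output types concrete, here is a minimal sketch of how each is commonly turned into a query-document relevance score. It does not call the model: the toy vectors and token weights are hypothetical stand-ins for what bge-m3 would return, and the function names are illustrative rather than part of any library.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors (one vector per text)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens shared by both texts."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Late interaction (MaxSim): each query token vector takes its best match
    among the document token vectors, and the maxima are averaged."""
    if not q_vecs or not d_vecs:
        return 0.0
    best = [max(dense_score(q, d) for d in d_vecs) for q in q_vecs]
    return sum(best) / len(best)


# Hypothetical toy values standing in for real model outputs.
print(dense_score([1.0, 0.0], [1.0, 0.0]))
print(sparse_score({"cat": 0.5, "purr": 0.2}, {"cat": 0.4, "dog": 0.9}))
print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))
```

The dense score compares one vector per text, the sparse score rewards only tokens present in both texts, and the multi-vector score compares token by token, which costs more but can capture finer-grained matches.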

Capabilities

bge-m3 supports dense retrieval, multi-vector retrieval, and sparse retrieval. Having all three in one model lets a pipeline combine the strengths of each method, which can yield higher accuracy and stronger generalization. For example, the model can be used in a hybrid retrieval pipeline that combines embedding-based retrieval with BM25-style lexical matching. Because the sparse token weights are produced alongside the dense embedding, the hybrid incurs no additional encoding cost.
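
One simple way to combine the two signals is a weighted sum of per-document scores. The sketch below assumes the dense and sparse scores have already been computed and are on comparable scales. The function name and the 0.7/0.3 weights are illustrative assumptions, not values recommended by the maintainers.

```python
def hybrid_rank(dense_scores, sparse_scores, w_dense=0.7, w_sparse=0.3):
    """Fuse two score dictionaries (doc id -> score) with a weighted sum.

    A document missing from one dictionary contributes 0 for that signal.
    Returns (doc_id, fused_score) pairs, best first; ties break by doc id.
    """
    doc_ids = set(dense_scores) | set(sparse_scores)
    fused = {
        doc_id: w_dense * dense_scores.get(doc_id, 0.0)
        + w_sparse * sparse_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for three documents.
dense = {"a": 0.9, "b": 0.5}
sparse = {"b": 0.8, "c": 0.4}
for doc_id, score in hybrid_rank(dense, sparse):
    print(doc_id, round(score, 3))
```

If the two score types live on very different scales, normalizing each set first (or fusing ranks instead of scores) is usually worth trying before tuning the weights.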

What can I use it for?

bge-m3 can be leveraged in various applications that require effective text retrieval, such as chatbots, search engines, question-answering systems, and content recommendation engines. By taking advantage of the model's multi-functionality, users can build robust and versatile retrieval pipelines that cater to their specific needs.

Things to try

One interesting aspect of bge-m3 is its ability to process inputs of different granularities, from short sentences to long documents. This feature can be particularly useful in applications that involve working with a diverse range of text sources, such as social media posts, news articles, or research papers. Experiment with inputting text of varying lengths and observe how the model performs across these different scenarios.
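
When experimenting with long inputs, documents that exceed the 8192-token limit need to be truncated or split. The following is a minimal sketch of overlapping windows. It splits on whitespace as a stand-in for the model's real tokenizer, so actual token counts will differ; the function name and the overlap value are assumptions for illustration.

```python
def split_into_windows(tokens, max_tokens=8192, overlap=0):
    """Split a token list into windows of at most max_tokens items.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Whitespace splitting is only a stand-in for the model's tokenizer.
document = "word " * 20000
windows = split_into_windows(document.split(), max_tokens=8192, overlap=128)
print([len(w) for w in windows])
```

Each window can then be embedded separately, and the per-window scores aggregated (for example by taking the maximum) to score the whole document.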

Additionally, the model's support for over 100 languages makes it a valuable tool for building multilingual systems. Consider exploring the model's performance on non-English text and how it compares to language-specific models or other multilingual alternatives.
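
A comparison across languages needs a metric computed per language. The sketch below computes recall@k grouped by language from retrieval results you have already collected. The input layout and function name are assumptions for illustration, and the sample data is made up.

```python
from collections import defaultdict


def recall_at_k_by_language(results, k=10):
    """Average recall@k per language.

    `results` is an iterable of (language, ranked_doc_ids, relevant_doc_ids)
    tuples, one per query. Queries with no relevant documents are skipped.
    """
    per_language = defaultdict(list)
    for language, ranked, relevant in results:
        relevant = set(relevant)
        if not relevant:
            continue
        found = len(relevant & set(ranked[:k]))
        per_language[language].append(found / len(relevant))
    return {lang: sum(vals) / len(vals) for lang, vals in per_language.items()}


# Made-up results for three queries in two languages.
results = [
    ("en", ["d1", "d2", "d3"], ["d1"]),
    ("en", ["d4", "d5"], ["d9"]),
    ("sw", ["d7", "d8"], ["d8", "d6"]),
]
print(recall_at_k_by_language(results, k=2))
```

Running the same queries through bge-m3 and through a language-specific baseline, then comparing the two dictionaries, gives a per-language view of where the multilingual model helps or falls behind.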

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

llm-embedder

BAAI

llm-embedder is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that can map any text to a low-dimensional dense vector. This can be used for tasks like retrieval, classification, clustering, and semantic search. It is part of the FlagEmbedding project, which also includes other models like bge-reranker-base and bge-reranker-large. The project's general-purpose BGE embedding models are available in multiple sizes, including bge-large-en-v1.5, bge-base-en-v1.5, and bge-small-en-v1.5. These models have been optimized to have more reasonable similarity distributions and enhanced retrieval abilities compared to earlier versions.

Model inputs and outputs

Inputs

Text to be embedded

Outputs

Low-dimensional dense vector representation of the input text

Capabilities

The llm-embedder model can generate high-quality embeddings that capture the semantic meaning of text. These embeddings can then be used in a variety of downstream applications, such as:

Information retrieval: Finding relevant documents or passages for a given query
Text classification: Categorizing text into different classes or topics
Clustering: Grouping similar text together
Semantic search: Finding text that is semantically similar to a given query

The model has been shown to achieve state-of-the-art performance on benchmarks like MTEB and C-MTEB.

What can I use it for?

The llm-embedder model can be useful in a wide range of applications that require understanding the semantic content of text, such as:

Building search engines or recommendation systems that can retrieve relevant information based on user queries
Developing chatbots or virtual assistants that can engage in more natural conversations by understanding the context and meaning of user inputs
Improving the accuracy of text classification models for tasks like sentiment analysis, topic modeling, or spam detection
Powering knowledge management systems that can organize and retrieve information based on the conceptual relationships between documents

Additionally, the model can be fine-tuned on domain-specific data to improve its performance for specific use cases.

Things to try

One interesting aspect of the llm-embedder model is its support for retrieval augmentation for large language models (LLMs). The LLM-Embedder variant of the model is designed to provide a unified embedding solution to support diverse retrieval needs for LLMs.

Another interesting direction to explore is the use of the bge-reranker-base and bge-reranker-large models, which are cross-encoder models that can be used to re-rank the top-k documents retrieved by the embedding model. This can help improve the overall accuracy of the retrieval system.
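
The retrieve-then-re-rank pattern described for the reranker models can be sketched with pluggable scoring functions. The two toy scorers below are hypothetical stand-ins: in a real system the first stage would use embedding similarity and the second a cross-encoder such as bge-reranker-base.

```python
def retrieve_then_rerank(query, documents, first_stage_score, rerank_score, top_k=3):
    """Keep the top_k documents by a cheap score, then reorder them by a costlier one."""
    candidates = sorted(
        documents, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:top_k]
    rescored = [(doc, rerank_score(query, doc)) for doc in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def token_hits(query, doc):
    """Toy first stage: how many document tokens also occur in the query."""
    query_tokens = set(query.split())
    return sum(1 for token in doc.split() if token in query_tokens)


def jaccard(query, doc):
    """Toy second stage: overlap of the two vocabularies, penalizing extra words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = ["cats and cats and cats", "cats purr", "stock prices"]
for doc, score in retrieve_then_rerank("cats purr", docs, token_hits, jaccard, top_k=2):
    print(f"{score:.3f}  {doc}")
```

In this toy run the first stage favors the repetitive document, and the second stage moves the exact match to the top, which is the kind of correction a cross-encoder is meant to provide at larger scale.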

Text-to-Text

bge-multilingual-gemma2

BAAI

The bge-multilingual-gemma2 model is a large language model (LLM) based multilingual embedding model developed by BAAI. It is trained on a diverse range of languages and tasks, building on the google/gemma-2-9b model. The model demonstrates strong performance on multilingual benchmarks like MIRACL, MTEB-pl, and MTEB-fr, as well as major evaluations like MTEB, C-MTEB and AIR-Bench.

Model inputs and outputs

Inputs

Text: The model accepts text input, which can be used for tasks like retrieval, classification, and clustering.

Outputs

Text embeddings: The model outputs dense vector representations of the input text, which can be used for downstream applications.

Capabilities

The bge-multilingual-gemma2 model exhibits state-of-the-art performance on a variety of multilingual tasks. It is able to effectively process and represent text in a diverse range of languages, including English, Chinese, Japanese, Korean, and French, among others. The model's capabilities make it well-suited for applications that require cross-lingual understanding and interoperability.

What can I use it for?

The bge-multilingual-gemma2 model can be leveraged for a wide range of natural language processing tasks, such as:

Multilingual text retrieval: Use the model's embeddings to find relevant passages or documents in different languages for a given query.
Cross-lingual classification: Classify text in one language based on training data in another language.
Multilingual semantic similarity: Identify semantically similar text across languages.
Multilingual clustering: Group text documents in different languages based on their semantic content.

By taking advantage of the model's strong multilingual capabilities, you can build applications that seamlessly handle text in multiple languages, opening up new possibilities for global reach and user experiences.

Things to try

One interesting aspect of the bge-multilingual-gemma2 model is its ability to perform well without the need for explicit instruction during inference. While adding instruction to queries can provide a slight boost in retrieval performance, the model is able to generate useful embeddings even without the instruction, making it more convenient to use in certain scenarios. Experiment with using the model both with and without instruction to see which approach works best for your specific use case.
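
To compare the two modes, it helps to route every query through one helper that either leaves it plain or wraps it in an instruction. The sketch below is an assumption-heavy illustration: the default template string is my recollection of the format used on the model's Hugging Face page and should be verified there, and the function name and example task text are hypothetical.

```python
# ASSUMPTION: verify this template against the model's own documentation.
DEFAULT_TEMPLATE = "<instruct>{instruction}\n<query>{query}"


def format_query(query, instruction=None, template=DEFAULT_TEMPLATE):
    """Return the query unchanged, or wrapped in an instruction when one is given."""
    if not instruction:
        return query
    return template.format(instruction=instruction, query=query)


task = "Given a web search query, retrieve relevant passages that answer the query."
print(format_query("how do vaccines work"))
print(format_query("how do vaccines work", instruction=task))
```

Embedding both variants of each query and comparing retrieval metrics on the same document set shows whether the instruction is worth the extra tokens for a given workload. Documents are typically embedded without an instruction in this setup.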

Text-to-Text

bge-small-en

BAAI

The bge-small-en model is a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project. It is one of several bge (BAAI General Embedding) models that achieve state-of-the-art performance on text embedding benchmarks like MTEB and C-MTEB.

The bge-small-en model is a smaller version of the BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5 models, with 384 embedding dimensions compared to 1024 and 768 respectively. Despite its smaller size, the bge-small-en model still provides competitive performance, making it a good choice when computation resources are limited.

Model inputs and outputs

Inputs

Text sentences: The model can take a list of text sentences as input.

Outputs

Sentence embeddings: The model outputs a numpy array of sentence embeddings, where each row corresponds to the embedding of the corresponding input sentence.

Capabilities

The bge-small-en model can be used for a variety of natural language processing tasks that benefit from semantic text representations, such as:

Information retrieval: The embeddings can be used to find relevant passages or documents for a given query, by computing similarity scores between the query and the passages/documents.
Text classification: The embeddings can be used as features for training classification models on text data.
Clustering: The embeddings can be used to group similar text documents into clusters.
Semantic search: The embeddings can be used to find semantically similar text based on their meaning, rather than just lexical matching.

What can I use it for?

The bge-small-en model can be a useful tool for a variety of applications that involve working with English text data. For example, you could use it to build a semantic search engine for your company's knowledge base, or to improve the text classification capabilities of your customer support chatbot. Since the model is smaller and more efficient than the larger bge models, it may be particularly well-suited for deployment on edge devices or in resource-constrained environments. You could also fine-tune the model on your specific text data to further improve its performance for your use case.

Things to try

One interesting thing to try with the bge-small-en model is to compare its performance to the larger bge models, such as BAAI/bge-large-en-v1.5 and BAAI/bge-base-en-v1.5, on your specific tasks. You may find that the smaller model provides nearly the same performance as the larger models, while being more efficient and easier to deploy.

Another thing to try is to fine-tune the bge-small-en model on your own text data, using the techniques described in the FlagEmbedding documentation. This can help the model better capture the semantics of your domain-specific text, potentially leading to improved performance on your tasks.
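
The dimension counts above translate directly into index size, which is one way to quantify "more efficient". The short calculation below assumes vectors are stored as uncompressed 32-bit floats and uses a corpus of one million documents. Both are illustrative assumptions, as are the short labels for the three models.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for num_vectors embeddings of `dim` values each (float32 by default)."""
    return num_vectors * dim * bytes_per_value


# Dimensions as stated above: 384 (small), 768 (base), 1024 (large).
for name, dim in [("small", 384), ("base", 768), ("large", 1024)]:
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    print(f"{name:>5}: {dim:>4} dims -> {size_gb:.3f} GB")
```

Under these assumptions the small model's index is 1.536 GB, against 3.072 GB and 4.096 GB for the base and large models, before any index overhead or quantization.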

Image-to-Text

bge-small-zh-v1.5

BAAI

The bge-small-zh-v1.5 model from BAAI is a small-scale version of the BAAI General Embedding (BGE) model, which can map any text to a low-dimensional dense vector. Unlike previous BGE models, version 1.5 has a more reasonable similarity distribution, enhancing its retrieval ability without the need for instruction. The bge-small-zh-v1.5 model is competitive in performance compared to larger models, making it a good option for projects with computational constraints.

Model inputs and outputs

The bge-small-zh-v1.5 model takes in text as input and outputs a fixed-size embedding vector. This embedding can then be used for tasks like retrieval, classification, clustering, or semantic search. The model supports both Chinese and English text.

Inputs

Text: The model can accept any Chinese or English text as input.

Outputs

Embedding vector: The model outputs a fixed-size vector representation of the input text, which can be used for downstream tasks.

Capabilities

The bge-small-zh-v1.5 model is capable of generating high-quality text embeddings that can be used for a variety of natural language processing tasks. Its performance is competitive with larger BGE models, making it a good choice for projects with limited computational resources. The model's improved similarity distribution helps to better differentiate between similar and dissimilar text.

What can I use it for?

The bge-small-zh-v1.5 embedding can be used in a wide range of applications, such as:

Semantic search: Use the embeddings to find relevant passages or documents for a given query.
Text classification: Train a classifier on top of the embeddings to categorize text into different classes.
Clustering: Group similar text together based on the embeddings.
Recommendation systems: Use the embeddings to find similar items or content for recommendation.

Things to try

One interesting thing to try with the bge-small-zh-v1.5 model is to fine-tune it on your specific data and task. The examples provided by the maintainers show how to prepare data and fine-tune the model to improve performance on your use case. Additionally, you can experiment with using the model in conjunction with the provided reranker models to further enhance retrieval performance.
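
As a rough sketch of the data-preparation step, the snippet below serializes training triples as JSON Lines, one query per line with positive and negative passages. The field names `query`, `pos`, and `neg` reflect my understanding of the FlagEmbedding fine-tuning examples and should be checked against the current documentation. The helper name and the sample texts are hypothetical.

```python
import json


def to_training_lines(examples):
    """Turn (query, positives, negatives) triples into JSON Lines strings.

    ASSUMPTION: the "query" / "pos" / "neg" field names follow the
    FlagEmbedding fine-tuning examples; confirm before training.
    """
    lines = []
    for query, positives, negatives in examples:
        if not positives:
            raise ValueError(f"query {query!r} has no positive passage")
        record = {"query": query, "pos": list(positives), "neg": list(negatives)}
        # ensure_ascii=False keeps Chinese text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


examples = [
    ("如何重置密码", ["在设置页面点击“重置密码”。"], ["我们的办公时间是周一至周五。"]),
]
for line in to_training_lines(examples):
    print(line)
```

Writing these lines to a `.jsonl` file gives one training example per line. Hard negatives (passages that look relevant but are not) generally make for a more useful training signal than random ones.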

Text-to-Text