BAAI

Models by this creator

🧠

bge-m3

BAAI

Total Score

846

bge-m3 is a versatile AI model developed by BAAI (Beijing Academy of Artificial Intelligence), distinguished by its multi-functionality, multi-linguality, and multi-granularity. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval. The model supports more than 100 working languages and can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens. Compared to similar models like m3e-large, bge-m3 offers a unique combination of retrieval functionalities in a single model. Other related models like bge_1-5_query_embeddings, bge-large-en-v1.5, bge-reranker-base, and bge-reranker-v2-m3 provide specific functionalities such as query embedding generation, text embedding, and re-ranking.

Model inputs and outputs

Inputs
- Text sequences of varying length, up to 8192 tokens

Outputs
- Dense embeddings for retrieval
- Sparse token-level representations for retrieval
- Multi-vector representations for retrieval

Capabilities

bge-m3 can effectively handle a wide range of text-related tasks, such as dense retrieval, multi-vector retrieval, and sparse retrieval. Its multi-functionality lets it combine the strengths of different retrieval methods, resulting in higher accuracy and stronger generalization. For example, the model can be used in a hybrid retrieval pipeline that combines embedding-based retrieval with the BM25 algorithm without incurring additional cost.

What can I use it for?

bge-m3 can be leveraged in applications that require effective text retrieval, such as chatbots, search engines, question-answering systems, and content recommendation engines. By taking advantage of the model's multi-functionality, you can build robust and versatile retrieval pipelines tailored to your specific needs.

Things to try

One interesting aspect of bge-m3 is its ability to process inputs of different granularities, from short sentences to long documents. This is particularly useful in applications that work with a diverse range of text sources, such as social media posts, news articles, or research papers. Experiment with inputs of varying lengths and observe how the model performs across these scenarios. The model's support for over 100 languages also makes it a valuable tool for building multilingual systems: consider evaluating it on non-English text and comparing it with language-specific models or other multilingual alternatives.
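To get a feel for the three retrieval modes described above, here is a minimal sketch using the FlagEmbedding package's BGEM3FlagModel wrapper. The method names and return keys follow FlagEmbedding's public documentation, so treat them as assumptions and verify against the current release before relying on them.

```python
# pip install -U FlagEmbedding   (assumed prerequisite)
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # fp16 trades a little accuracy for speed

queries = ["What is BGE-M3?"]
passages = ["BGE-M3 is a multi-functional, multi-lingual, multi-granularity embedding model."]

# Request all three representations at once: dense vectors, sparse (lexical) weights,
# and per-token multi-vector (ColBERT-style) embeddings.
q_out = model.encode(queries, return_dense=True, return_sparse=True, return_colbert_vecs=True)
p_out = model.encode(passages, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Dense score: inner product of the dense vectors.
print("dense:", q_out['dense_vecs'] @ p_out['dense_vecs'].T)

# Sparse (lexical) score via the helper exposed by the wrapper.
print("sparse:", model.compute_lexical_matching_score(q_out['lexical_weights'][0],
                                                      p_out['lexical_weights'][0]))

# Multi-vector (ColBERT-style) score.
print("colbert:", model.colbert_score(q_out['colbert_vecs'][0], p_out['colbert_vecs'][0]))
```

In a hybrid pipeline, the three scores can be combined (for example, by a weighted sum) to rank candidate passages.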


Updated 5/27/2024

💬

bge-large-en-v1.5

BAAI

Total Score

358

FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License For more details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Dense Retrieval**: BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: BGE Reranker Benchmark**: C-MTEB News 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report and massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2\. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). 3\. When does the query instruction need to be used For the bge-*-v1.5, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Usage Usage for Embedding Model Here are some examples for using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction queries = ['query_1', 'query_2'] passages = ["-1", "-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You also can set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable. 
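Because the FlagEmbedding snippet above lost its line breaks and placeholder strings during extraction, here is a consolidated sketch of the same steps. The placeholder sentences and the empty query_instruction_for_retrieval are illustrative; substitute the instruction listed for your model in the Model List.

```python
from FlagEmbedding import FlagModel

sentences_1 = ["sample sentence 1", "sample sentence 2"]   # placeholders
sentences_2 = ["sample sentence 3", "sample sentence 4"]

model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="",      # see the Model List for the matching instruction
                  use_fp16=True)                           # fp16 is faster with a slight quality trade-off

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# For short-query-to-long-passage (s2p) retrieval, encode_queries() prepends the
# instruction to each query; passages are encoded without an instruction.
queries = ['query_1', 'query_2']
passages = ["passage 1", "passage 2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
```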
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["-1", "-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., \[CLS\]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["-1", "-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. 
sentence_embeddings = model_output0 normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage of the ONNX files from optimum.onnxruntime import ORTModelForFeatureExtraction # type: ignore import torch from transformers import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13") model_ort = ORTModelForFeatureExtraction.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13",file_name="onnx/model.onnx") Sentences we want sentence embeddings for sentences = ["-1", "-2"] Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') model_output_ort = model_ort(**encoded_input) Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) model_output and model_output_ort are identical Its also possible to deploy the onnx files with the infinity\_emb pip package. import asyncio from infinity_emb import AsyncEmbeddingEngine, EngineArgs sentences = ["Embed this is sentence via Infinity.", "Paris is in France."] engine = AsyncEmbeddingEngine.from_args( EngineArgs(model_name_or_path = "BAAI/bge-large-en-v1.5", device="cpu", engine="optimum" # or engine="torch" )) async def main(): async with engine: embeddings, usage = await engine.embed(sentences=sentences) asyncio.run(main()) Usage for Reranker Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Using FlagEmbedding pip install -U FlagEmbedding Get relevance scores (higher scores indicate more relevance): from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) Using Huggingface transformers import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) Evaluation baai-general-embedding models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. 
MTEB**: Model Name Dimension Sequence Length Average (56) Retrieval (15) Clustering (11) Pair Classification (3) Reranking (4) STS (10) Summarization (1) Classification (12) BAAI/bge-large-en-v1.5 1024 512 64.23 54.29 46.08 87.12 60.03 83.11 31.61 75.97 BAAI/bge-base-en-v1.5 768 512 63.55 53.25 45.77 86.55 58.86 82.4 31.07 75.53 BAAI/bge-small-en-v1.5 384 512 62.17 51.68 43.82 84.92 58.36 81.59 30.12 74.14 bge-large-en 1024 512 63.98 53.9 46.98 85.8 59.48 81.56 32.06 76.21 bge-base-en 768 512 63.36 53.0 46.32 85.86 58.7 81.84 29.27 75.27 gte-large 1024 512 63.13 52.22 46.84 85.00 59.13 83.35 31.66 73.33 gte-base 768 512 62.39 51.14 46.2 84.57 58.61 82.3 31.17 73.01 e5-large-v2 1024 512 62.25 50.56 44.49 86.03 56.61 82.05 30.19 75.24 bge-small-en 384 512 62.11 51.82 44.31 83.78 57.97 80.72 30.53 74.37 instructor-xl 768 512 61.79 49.26 44.74 86.62 57.29 83.06 32.32 61.79 e5-base-v2 768 512 61.5 50.29 43.80 85.73 55.91 81.05 30.28 73.84 gte-small 384 512 61.36 49.46 44.89 83.54 57.7 82.07 30.42 72.31 text-embedding-ada-002 1536 8192 60.99 49.25 45.9 84.89 56.32 80.97 30.8 70.93 e5-small-v2 384 512 59.93 49.04 39.92 84.67 54.32 80.39 31.16 72.94 sentence-t5-xxl 768 512 59.51 42.24 43.72 85.06 56.42 82.63 30.08 73.42 all-mpnet-base-v2 768 514 57.78 43.81 43.69 83.04 59.36 80.28 27.49 65.07 sgpt-bloom-7b1-msmarco 4096 2048 57.59 48.22 38.93 81.9 55.65 77.74 33.6 66.19 C-MTEB**: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to C\_MTEB for a detailed introduction. Model Embedding dimension Avg Retrieval STS PairClassification Classification Reranking Clustering BAAI/bge-large-zh-v1.5 1024 64.53 70.46 56.25 81.6 69.13 65.84 48.99 BAAI/bge-base-zh-v1.5 768 63.13 69.49 53.72 79.75 68.07 65.39 47.53 BAAI/bge-small-zh-v1.5 512 57.82 61.77 49.11 70.41 63.96 60.92 44.18 BAAI/bge-large-zh 1024 64.20 71.53 54.98 78.94 68.32 65.11 48.39 bge-large-zh-noinstruct 1024 63.53 70.55 53 76.77 68.58 64.91 50.01 BAAI/bge-base-zh 768 62.96 69.53 54.12 77.5 67.07 64.91 47.63 multilingual-e5-large 1024 58.79 63.66 48.44 69.89 67.34 56.00 48.23 BAAI/bge-small-zh 512 58.27 63.07 49.45 70.35 63.64 61.48 45.09 m3e-base 768 57.10 56.91 50.47 63.99 67.52 59.34 47.68 m3e-large 1024 57.05 54.75 50.42 64.3 68.2 59.66 48.88 multilingual-e5-base 768 55.48 61.63 46.49 67.07 65.35 54.35 40.68 multilingual-e5-small 384 55.38 59.95 45.27 66.45 65.85 53.86 45.26 text-embedding-ada-002(OpenAI) 1536 53.02 52.0 43.35 69.56 64.31 54.28 45.68 luotuo 1024 49.37 44.4 42.78 66.62 61 49.25 44.39 text2vec-base 768 47.63 38.79 43.41 67.41 62.19 49.45 37.66 text2vec-large 1024 47.36 41.94 44.97 70.86 60.66 49.16 30.02 Reranking**: See C\_MTEB for evaluation script. 
Model T2Reranking T2RerankingZh2En\* T2RerankingEn2Zh\* MMarcoReranking CMedQAv1 CMedQAv2 Avg text2vec-base-multilingual 64.66 62.94 62.51 14.37 48.46 48.6 50.26 multilingual-e5-small 65.62 60.94 56.41 29.91 67.26 66.54 57.78 multilingual-e5-large 64.55 61.61 54.28 28.6 67.42 67.92 57.4 multilingual-e5-base 64.21 62.13 54.68 29.5 66.23 66.98 57.29 m3e-base 66.03 62.74 56.07 17.51 77.05 76.76 59.36 m3e-large 66.13 62.72 56.1 16.46 77.76 78.27 59.57 bge-base-zh-v1.5 66.49 63.25 57.02 29.74 80.47 84.88 63.64 bge-large-zh-v1.5 65.74 63.39 57.03 28.74 83.45 85.44 63.97 BAAI/bge-reranker-base 67.28 63.95 60.45 35.46 81.26 84.1 65.42 BAAI/bge-reranker-large 67.6 64.03 61.44 37.16 82.15 84.18 66.09 \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks Train BAAI Embedding We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baai\_general\_embedding. BGE Reranker Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). Citation If you find this repository useful, please consider giving a star :star: and citation @misc{bge_embedding, title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff}, year={2023}, eprint={2309.07597}, archivePrefix={arXiv}, primaryClass={cs.CL} } License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.


Updated 5/28/2024

🧪

bge-large-zh-v1.5

BAAI

Total Score

300

FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License For more details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Dense Retrieval**: BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: BGE Reranker Benchmark**: C-MTEB News 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report and massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2\. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). 3\. When does the query instruction need to be used For the bge-*-v1.5, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Usage Usage for Embedding Model Here are some examples for using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction queries = ['query_1', 'query_2'] passages = ["-1", "-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You also can set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable. 
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["-1", "-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., \[CLS\]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["-1", "-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. sentence_embeddings = model_output0 normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage for Reranker Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. 
Using FlagEmbedding pip install -U FlagEmbedding Get relevance scores (higher scores indicate more relevance): from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) Using Huggingface transformers import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) Evaluation baai-general-embedding models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. MTEB**: Model Name Dimension Sequence Length Average (56) Retrieval (15) Clustering (11) Pair Classification (3) Reranking (4) STS (10) Summarization (1) Classification (12) BAAI/bge-large-en-v1.5 1024 512 64.23 54.29 46.08 87.12 60.03 83.11 31.61 75.97 BAAI/bge-base-en-v1.5 768 512 63.55 53.25 45.77 86.55 58.86 82.4 31.07 75.53 BAAI/bge-small-en-v1.5 384 512 62.17 51.68 43.82 84.92 58.36 81.59 30.12 74.14 bge-large-en 1024 512 63.98 53.9 46.98 85.8 59.48 81.56 32.06 76.21 bge-base-en 768 512 63.36 53.0 46.32 85.86 58.7 81.84 29.27 75.27 gte-large 1024 512 63.13 52.22 46.84 85.00 59.13 83.35 31.66 73.33 gte-base 768 512 62.39 51.14 46.2 84.57 58.61 82.3 31.17 73.01 e5-large-v2 1024 512 62.25 50.56 44.49 86.03 56.61 82.05 30.19 75.24 bge-small-en 384 512 62.11 51.82 44.31 83.78 57.97 80.72 30.53 74.37 instructor-xl 768 512 61.79 49.26 44.74 86.62 57.29 83.06 32.32 61.79 e5-base-v2 768 512 61.5 50.29 43.80 85.73 55.91 81.05 30.28 73.84 gte-small 384 512 61.36 49.46 44.89 83.54 57.7 82.07 30.42 72.31 text-embedding-ada-002 1536 8192 60.99 49.25 45.9 84.89 56.32 80.97 30.8 70.93 e5-small-v2 384 512 59.93 49.04 39.92 84.67 54.32 80.39 31.16 72.94 sentence-t5-xxl 768 512 59.51 42.24 43.72 85.06 56.42 82.63 30.08 73.42 all-mpnet-base-v2 768 514 57.78 43.81 43.69 83.04 59.36 80.28 27.49 65.07 sgpt-bloom-7b1-msmarco 4096 2048 57.59 48.22 38.93 81.9 55.65 77.74 33.6 66.19 C-MTEB**: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to C\_MTEB for a detailed introduction. 
Model Embedding dimension Avg Retrieval STS PairClassification Classification Reranking Clustering BAAI/bge-large-zh-v1.5 1024 64.53 70.46 56.25 81.6 69.13 65.84 48.99 BAAI/bge-base-zh-v1.5 768 63.13 69.49 53.72 79.75 68.07 65.39 47.53 BAAI/bge-small-zh-v1.5 512 57.82 61.77 49.11 70.41 63.96 60.92 44.18 BAAI/bge-large-zh 1024 64.20 71.53 54.98 78.94 68.32 65.11 48.39 bge-large-zh-noinstruct 1024 63.53 70.55 53 76.77 68.58 64.91 50.01 BAAI/bge-base-zh 768 62.96 69.53 54.12 77.5 67.07 64.91 47.63 multilingual-e5-large 1024 58.79 63.66 48.44 69.89 67.34 56.00 48.23 BAAI/bge-small-zh 512 58.27 63.07 49.45 70.35 63.64 61.48 45.09 m3e-base 768 57.10 56.91 50.47 63.99 67.52 59.34 47.68 m3e-large 1024 57.05 54.75 50.42 64.3 68.2 59.66 48.88 multilingual-e5-base 768 55.48 61.63 46.49 67.07 65.35 54.35 40.68 multilingual-e5-small 384 55.38 59.95 45.27 66.45 65.85 53.86 45.26 text-embedding-ada-002(OpenAI) 1536 53.02 52.0 43.35 69.56 64.31 54.28 45.68 luotuo 1024 49.37 44.4 42.78 66.62 61 49.25 44.39 text2vec-base 768 47.63 38.79 43.41 67.41 62.19 49.45 37.66 text2vec-large 1024 47.36 41.94 44.97 70.86 60.66 49.16 30.02 Reranking**: See C\_MTEB for evaluation script. Model T2Reranking T2RerankingZh2En\* T2RerankingEn2Zh\* MMarcoReranking CMedQAv1 CMedQAv2 Avg text2vec-base-multilingual 64.66 62.94 62.51 14.37 48.46 48.6 50.26 multilingual-e5-small 65.62 60.94 56.41 29.91 67.26 66.54 57.78 multilingual-e5-large 64.55 61.61 54.28 28.6 67.42 67.92 57.4 multilingual-e5-base 64.21 62.13 54.68 29.5 66.23 66.98 57.29 m3e-base 66.03 62.74 56.07 17.51 77.05 76.76 59.36 m3e-large 66.13 62.72 56.1 16.46 77.76 78.27 59.57 bge-base-zh-v1.5 66.49 63.25 57.02 29.74 80.47 84.88 63.64 bge-large-zh-v1.5 65.74 63.39 57.03 28.74 83.45 85.44 63.97 BAAI/bge-reranker-base 67.28 63.95 60.45 35.46 81.26 84.1 65.42 BAAI/bge-reranker-large 67.6 64.03 61.44 37.16 82.15 84.18 66.09 \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks Train BAAI Embedding We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baai\_general\_embedding. BGE Reranker Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). Citation If you find this repository useful, please consider giving a star :star: and citation @misc{bge_embedding, title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff}, year={2023}, eprint={2309.07597}, archivePrefix={arXiv}, primaryClass={cs.CL} } License FlagEmbedding is licensed under the MIT License. 
The released models can be used for commercial purposes free of charge.
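One detail worth restoring from the Usage text above: the Transformers snippet lost its array indexing during extraction (the line reading sentence_embeddings = model_output0 originally selected the [CLS] token's last hidden state). A minimal reconstruction of that snippet, under that assumption, looks like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

sentences = ["sample sentence 1", "sample sentence 2"]  # placeholders
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: take the last hidden state of the first ([CLS]) token.
    sentence_embeddings = model_output[0][:, 0]

# Normalize so that inner products behave like cosine similarity.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```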


Updated 5/28/2024

🖼️

bge-large-zh

BAAI

Total Score

290

The bge-large-zh model is a state-of-the-art text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is part of the BAAI General Embedding (BGE) family of models, which have achieved top performance on both the MTEB and C-MTEB benchmarks. The bge-large-zh model is specifically designed for Chinese text processing: it maps any Chinese text into a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search. Compared to similar models like BAAI/bge-large-en and BAAI/bge-small-en, bge-large-zh has been optimized for Chinese text and has demonstrated state-of-the-art performance on Chinese benchmarks. The BAAI/llm-embedder model is a more recent addition to the BAAI family, serving as a unified embedding model to support diverse retrieval augmentation needs for large language models (LLMs).

Model inputs and outputs

Inputs
- Text: The bge-large-zh model can take any Chinese text as input, ranging from short queries to long passages.
- Instruction (optional): For retrieval tasks that use short queries to find long related documents, it is recommended to add an instruction to the query to help the model better understand the intent. The instruction should be placed at the beginning of the query text; no instruction is needed for the passage/document text.

Outputs
- Embeddings: The primary output of the bge-large-zh model is a dense vector embedding of the input text. These embeddings can be used for a variety of downstream tasks:
  - Retrieval: find related passages or documents by computing the similarity between the query embedding and the passage/document embeddings.
  - Classification: use the embeddings as features for training classification models.
  - Clustering: group similar text together.
  - Semantic search: find semantically related text.

Capabilities

The bge-large-zh model demonstrates state-of-the-art performance on a range of Chinese text processing tasks. On the Chinese Massive Text Embedding Benchmark (C-MTEB), the bge-large-zh-v1.5 model ranked first overall, with strong results across tasks like retrieval, semantic similarity, and classification. The model is also designed to handle long input text, with a maximum sequence length of 512 tokens, which makes it well suited for tasks that involve lengthy passages or documents, such as research paper retrieval or legal document search.

What can I use it for?

The bge-large-zh model can be used for a variety of Chinese text processing tasks, including:
- Retrieval: find relevant passages or documents given a query, which is helpful for building search engines, Q&A systems, or knowledge management tools.
- Classification: use the model's embeddings as features to train classifiers for tasks like sentiment analysis, topic classification, or intent detection.
- Clustering: group similar Chinese text together, useful for organizing large collections of documents or categorizing user-generated content.
- Semantic search: find semantically related text by comparing embeddings, enabling more advanced search experiences.

Things to try

One interesting aspect of the bge-large-zh model is its ability to handle queries with or without an instruction. While adding an instruction to the query can improve retrieval performance, the v1.5 version has been enhanced to perform well even without it, which makes the model more convenient to use when crafting the perfect query instruction is impractical. Another thing to try is fine-tuning bge-large-zh on your own data: the provided examples show how to prepare data and fine-tune the model to improve performance on a specific use case, which is particularly helpful if you have domain-specific text that the pre-trained model doesn't handle as well.
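As a concrete starting point, the sketch below retrieves Chinese passages with bge-large-zh via the FlagEmbedding package. The Chinese query instruction shown is the one documented in the BGE Model List for the zh models; verify it (and the package API) against the current FlagEmbedding README, as this is an illustrative sketch rather than official usage.

```python
from FlagEmbedding import FlagModel

model = FlagModel(
    'BAAI/bge-large-zh',
    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：",  # documented zh instruction (verify)
    use_fp16=True,  # faster, with a slight accuracy trade-off
)

queries = ["什么是大熊猫？"]                      # short user queries
passages = ["大熊猫是一种生活在中国的熊科动物。"]   # candidate passages

# encode_queries() prepends the instruction to each query; passages need no instruction.
q_emb = model.encode_queries(queries)
p_emb = model.encode(passages)

# Inner-product similarity, used to rank passages for each query.
print(q_emb @ p_emb.T)
```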


Updated 5/28/2024

💬

bge-reranker-large

BAAI

Total Score

246

We have updated the new reranker, supporting larger lengths, more languages, and achieving better performance. FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Citation | License More details please refer to our Github: FlagEmbedding. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Embedding Model**: Visualized-BGE, BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: llm rerankers, BGE Reranker Benchmark**: C-MTEB News 3/18/2024: Release new rerankers, built upon powerful M3 and LLM (GEMMA and MiniCPM, not so large actually) backbones, supporitng multi-lingual processing and larger inputs, massive improvements of ranking performances on BEIR, C-MTEB/Retrieval, MIRACL, LlamaIndex Evaluation. 3/18/2024: Release Visualized-BGE, equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report of BGE has been released 09/15/2023: The massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. Refer to this example for the fine-tuning for reranker 2\. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). 3\. When does the query instruction need to be used For the bge-*-v1.5, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Usage Usage for Embedding Model Here are some examples for using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction queries = ['query_1', 'query_2'] passages = ["-1", "-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You also can set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable. 
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["-1", "-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., \[CLS\]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["-1", "-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. sentence_embeddings = model_output0 normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage for Reranker Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. 
Using FlagEmbedding

pip install -U FlagEmbedding

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)  # use_fp16=True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)

scores = reranker.compute_score([
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']
])
print(scores)
```

Using Huggingface transformers

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

pairs = [['what is panda?', 'hi'],
         ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```

Using the reranker with the ONNX files

```python
from optimum.onnxruntime import ORTModelForSequenceClassification  # type: ignore
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model_ort = ORTModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base', file_name="onnx/model.onnx")

# Query/passage pairs we want relevance scores for
pairs = [['what is panda?', 'hi'],
         ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

# Tokenize pairs
encoded_input = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')

scores_ort = model_ort(**encoded_input, return_dict=True).logits.view(-1, ).float()
# Compute scores with the PyTorch model
with torch.inference_mode():
    scores = model(**encoded_input, return_dict=True).logits.view(-1, ).float()
# scores and scores_ort are identical
```

Using the reranker with infinity

It's also possible to deploy the onnx/torch files with the infinity\_emb pip package.

```python
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

query = 'what is a panda?'
docs = ['The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear', "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-reranker-base",
               device="cpu",
               engine="torch"  # or engine="optimum" for onnx
))

async def main():
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))

asyncio.run(main())
```

Evaluation

baai-general-embedding models achieve state-of-the-art performance on both the MTEB and C-MTEB leaderboards! For more details and evaluation tools, see our scripts.
MTEB**:

| Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 |
| BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 |
| BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 | 51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 |
| bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 |
| bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 |
| gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 |
| instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |

C-MTEB**: We create the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks. Please refer to C\_MTEB for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
| BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
| bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
| multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
| multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
| multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
| text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
| text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |

Reranking**: See C\_MTEB for the evaluation script.
| Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|---|---|
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
| multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
| multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
| m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
| m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
| bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
| bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
| BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
| BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |

\* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.

Train

BAAI Embedding

We pre-train the models using retromae and then train them on large-scale pair data with contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned. For more training details for bge, see baai\_general\_embedding.

BGE Reranker

The cross-encoder performs full attention over the input pair, which is more accurate than the embedding model (i.e., bi-encoder) but more time-consuming. It is therefore well suited to re-ranking the top-k documents returned by the embedding model. We train the cross-encoder on multilingual pair data; the data format is the same as for the embedding model, so you can easily fine-tune it following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md

Citation

If you find this repository useful, please consider giving a star :star: and a citation:

```
@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
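The fine-tuning examples referenced above consume query–positive–negative training triples, with hard negatives mined as described in the FAQ. The exact schema is defined in the linked FlagEmbedding examples; the snippet below is only an illustrative sketch of preparing such a file, assuming a JSONL layout with "query", "pos", and "neg" fields and made-up training examples.

```python
import json

# Hypothetical training examples: a query, passages that answer it (pos),
# and hard negatives mined from the corpus (neg).
examples = [
    {
        "query": "what is a panda?",
        "pos": ["The giant panda is a bear species endemic to China."],
        "neg": ["Paris is in France.", "pandas is a Python library for data analysis."],
    },
]

# Contrastive fine-tuning scripts typically read one JSON object per line.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```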

Read more

Updated 5/28/2024

bge-small-en-v1.5

BAAI

Total Score

181

FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License More details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Dense Retrieval**: BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: BGE Reranker Benchmark**: C-MTEB News 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report of BGE has been released 09/15/2023: The massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2\. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). 3\. When does the query instruction need to be used For the bge-*-v1.5, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Usage Usage for Embedding Model Here are some examples for using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction queries = ['query_1', 'query_2'] passages = ["-1", "-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You also can set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable. 
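For the English v1.5 models, the query instruction listed in the model table is "Represent this sentence for searching relevant passages: ". Below is a minimal sketch of wiring it up for an s2p task with FlagEmbedding; the query and passage texts are illustrative.

```python
from FlagEmbedding import FlagModel

model = FlagModel(
    'BAAI/bge-small-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)

queries = ["how long do pandas live?"]                          # short queries get the instruction
passages = ["Giant pandas live around 20 years in the wild."]   # passages never need it

q_embeddings = model.encode_queries(queries)   # instruction is prepended automatically
p_embeddings = model.encode(passages)          # no instruction
print(q_embeddings @ p_embeddings.T)
```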
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["-1", "-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., \[CLS\]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["-1", "-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. sentence_embeddings = model_output0 normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage for Reranker Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. 
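Because the reranker's relevance score is an unbounded logit, you may want a value in (0, 1) for display or for a fixed threshold. Applying a sigmoid is a common post-processing convention, not an API of the model card; the sketch below shows it on top of the plain transformers usage, with illustrative query/passage pairs.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

pairs = [['what is panda?', 'hi'],
         ['what is panda?', 'The giant panda is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    logits = model(**inputs, return_dict=True).logits.view(-1).float()

# Squash the unbounded logits into (0, 1) for easier thresholding or display.
probs = torch.sigmoid(logits)
print(logits.tolist(), probs.tolist())
```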
Using FlagEmbedding pip install -U FlagEmbedding Get relevance scores (higher scores indicate more relevance): from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) Using Huggingface transformers import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) Usage of the ONNX files from optimum.onnxruntime import ORTModelForFeatureExtraction # type: ignore import torch from transformers import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5') model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5') model_ort = ORTModelForFeatureExtraction.from_pretrained('BAAI/bge-small-en-v1.5', file_name="onnx/model.onnx") Sentences we want sentence embeddings for sentences = ["-1", "-2"] Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') model_output_ort = model_ort(**encoded_input) Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) model_output and model_output_ort are identical Usage via infinity Its also possible to deploy the onnx files with the infinity\_emb pip package. Recommended is device="cuda", engine="torch" with flash attention on gpu, and device="cpu", engine="optimum" for onnx inference. import asyncio from infinity_emb import AsyncEmbeddingEngine, EngineArgs sentences = ["Embed this is sentence via Infinity.", "Paris is in France."] engine = AsyncEmbeddingEngine.from_args( EngineArgs(model_name_or_path = "BAAI/bge-small-en-v1.5", device="cpu", engine="optimum" # or engine="torch" )) async def main(): async with engine: embeddings, usage = await engine.embed(sentences=sentences) asyncio.run(main()) Evaluation baai-general-embedding models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. 
MTEB**: Model Name Dimension Sequence Length Average (56) Retrieval (15) Clustering (11) Pair Classification (3) Reranking (4) STS (10) Summarization (1) Classification (12) BAAI/bge-large-en-v1.5 1024 512 64.23 54.29 46.08 87.12 60.03 83.11 31.61 75.97 BAAI/bge-base-en-v1.5 768 512 63.55 53.25 45.77 86.55 58.86 82.4 31.07 75.53 BAAI/bge-small-en-v1.5 384 512 62.17 51.68 43.82 84.92 58.36 81.59 30.12 74.14 bge-large-en 1024 512 63.98 53.9 46.98 85.8 59.48 81.56 32.06 76.21 bge-base-en 768 512 63.36 53.0 46.32 85.86 58.7 81.84 29.27 75.27 gte-large 1024 512 63.13 52.22 46.84 85.00 59.13 83.35 31.66 73.33 gte-base 768 512 62.39 51.14 46.2 84.57 58.61 82.3 31.17 73.01 e5-large-v2 1024 512 62.25 50.56 44.49 86.03 56.61 82.05 30.19 75.24 bge-small-en 384 512 62.11 51.82 44.31 83.78 57.97 80.72 30.53 74.37 instructor-xl 768 512 61.79 49.26 44.74 86.62 57.29 83.06 32.32 61.79 e5-base-v2 768 512 61.5 50.29 43.80 85.73 55.91 81.05 30.28 73.84 gte-small 384 512 61.36 49.46 44.89 83.54 57.7 82.07 30.42 72.31 text-embedding-ada-002 1536 8192 60.99 49.25 45.9 84.89 56.32 80.97 30.8 70.93 e5-small-v2 384 512 59.93 49.04 39.92 84.67 54.32 80.39 31.16 72.94 sentence-t5-xxl 768 512 59.51 42.24 43.72 85.06 56.42 82.63 30.08 73.42 all-mpnet-base-v2 768 514 57.78 43.81 43.69 83.04 59.36 80.28 27.49 65.07 sgpt-bloom-7b1-msmarco 4096 2048 57.59 48.22 38.93 81.9 55.65 77.74 33.6 66.19 C-MTEB**: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to C\_MTEB for a detailed introduction. Model Embedding dimension Avg Retrieval STS PairClassification Classification Reranking Clustering BAAI/bge-large-zh-v1.5 1024 64.53 70.46 56.25 81.6 69.13 65.84 48.99 BAAI/bge-base-zh-v1.5 768 63.13 69.49 53.72 79.75 68.07 65.39 47.53 BAAI/bge-small-zh-v1.5 512 57.82 61.77 49.11 70.41 63.96 60.92 44.18 BAAI/bge-large-zh 1024 64.20 71.53 54.98 78.94 68.32 65.11 48.39 bge-large-zh-noinstruct 1024 63.53 70.55 53 76.77 68.58 64.91 50.01 BAAI/bge-base-zh 768 62.96 69.53 54.12 77.5 67.07 64.91 47.63 multilingual-e5-large 1024 58.79 63.66 48.44 69.89 67.34 56.00 48.23 BAAI/bge-small-zh 512 58.27 63.07 49.45 70.35 63.64 61.48 45.09 m3e-base 768 57.10 56.91 50.47 63.99 67.52 59.34 47.68 m3e-large 1024 57.05 54.75 50.42 64.3 68.2 59.66 48.88 multilingual-e5-base 768 55.48 61.63 46.49 67.07 65.35 54.35 40.68 multilingual-e5-small 384 55.38 59.95 45.27 66.45 65.85 53.86 45.26 text-embedding-ada-002(OpenAI) 1536 53.02 52.0 43.35 69.56 64.31 54.28 45.68 luotuo 1024 49.37 44.4 42.78 66.62 61 49.25 44.39 text2vec-base 768 47.63 38.79 43.41 67.41 62.19 49.45 37.66 text2vec-large 1024 47.36 41.94 44.97 70.86 60.66 49.16 30.02 Reranking**: See C\_MTEB for evaluation script. 
Model T2Reranking T2RerankingZh2En\* T2RerankingEn2Zh\* MMarcoReranking CMedQAv1 CMedQAv2 Avg text2vec-base-multilingual 64.66 62.94 62.51 14.37 48.46 48.6 50.26 multilingual-e5-small 65.62 60.94 56.41 29.91 67.26 66.54 57.78 multilingual-e5-large 64.55 61.61 54.28 28.6 67.42 67.92 57.4 multilingual-e5-base 64.21 62.13 54.68 29.5 66.23 66.98 57.29 m3e-base 66.03 62.74 56.07 17.51 77.05 76.76 59.36 m3e-large 66.13 62.72 56.1 16.46 77.76 78.27 59.57 bge-base-zh-v1.5 66.49 63.25 57.02 29.74 80.47 84.88 63.64 bge-large-zh-v1.5 65.74 63.39 57.03 28.74 83.45 85.44 63.97 BAAI/bge-reranker-base 67.28 63.95 60.45 35.46 81.26 84.1 65.42 BAAI/bge-reranker-large 67.6 64.03 61.44 37.16 82.15 84.18 66.09 \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks Train BAAI Embedding We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baai\_general\_embedding. BGE Reranker Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). Citation If you find this repository useful, please consider giving a star :star: and citation @misc{bge_embedding, title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff}, year={2023}, eprint={2309.07597}, archivePrefix={arXiv}, primaryClass={cs.CL} } License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.

Read more

Updated 5/28/2024

🔄

bge-large-en

BAAI

Total Score

181

The bge-large-en model is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It is part of the BAAI General Embedding (BGE) family of models, which can map text to low-dimensional dense vectors for tasks like retrieval, classification, and semantic search. The maintainers recommend using the newer BAAI/bge-large-en-v1.5 model, which has a more reasonable similarity distribution and the same usage method. Model inputs and outputs Inputs Text sequences of up to 512 tokens Outputs 1024-dimensional dense vector embeddings Capabilities The bge-large-en model can generate high-quality text embeddings that capture semantic meaning. These embeddings can be used for a variety of downstream tasks, such as: Retrieval**: Finding relevant documents or passages given a query Classification**: Classifying text into predefined categories Clustering**: Grouping similar text documents together Semantic search**: Searching for relevant content based on meaning, not just keywords What can I use it for? The bge-large-en embeddings can be leveraged in various applications that require understanding the semantic meaning of text. For example, you could use them to build a powerful search engine that returns relevant results based on the query's intent, rather than just matching keywords. Another potential use case is intelligent document retrieval and recommendation, where the model can surface the most relevant information to users based on their needs. This could be especially useful in enterprise settings or academic research, where users need to quickly find relevant information among large document collections. Things to try One interesting experiment would be to fine-tune the bge-large-en model on a specific domain or task, such as legal document retrieval or scientific paper recommendation. This could help the model better capture the nuances and specialized vocabulary of your particular use case. You could also explore using the bge-large-en embeddings in combination with other techniques, such as sparse lexical matching or multi-vector retrieval, to create a hybrid search system that leverages the strengths of different approaches.
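One way to experiment with the hybrid idea mentioned above is to fuse a lexical score with the dense cosine score. The sketch below is illustrative only: it uses a naive token-overlap score as a stand-in for BM25, an equal 0.5/0.5 weighting, and the v1.5 model recommended by the maintainers; none of these choices come from the model card.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "The giant panda is a bear species endemic to China.",
    "Paris is in France.",
    "Pandas mainly eat bamboo shoots and leaves.",
]
query = "what do pandas eat"

# Dense scores: cosine similarity from normalized bge embeddings.
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
doc_emb = model.encode(corpus, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)[0]
dense = doc_emb @ q_emb

# Lexical scores: naive token overlap as a stand-in for a real BM25 implementation.
q_tokens = set(query.lower().split())
lexical = np.array([len(q_tokens & set(doc.lower().split())) / len(q_tokens) for doc in corpus])

def minmax(x):
    # Min-max normalize so the two signals are on a comparable scale.
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

hybrid = 0.5 * minmax(dense) + 0.5 * minmax(lexical)  # illustrative weights
for doc, score in sorted(zip(corpus, hybrid), key=lambda t: -t[1]):
    print(f"{score:.3f}  {doc}")
```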

Read more

Updated 5/27/2024

bge-base-en-v1.5

BAAI

Total Score

172

FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License For more details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Dense Retrieval**: BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: BGE Reranker Benchmark**: C-MTEB News 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report and massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2\. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). 3\. When does the query instruction need to be used For the bge-*-v1.5, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Usage Usage for Embedding Model Here are some examples for using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) for s2p(short query to long passage) retrieval task, suggest to use encode_queries() which will automatically add the instruction to each query corpus in retrieval task can still use encode() or encode_corpus(), since they don't need instruction queries = ['query_1', 'query_2'] passages = ["-1", "-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You also can set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable. 
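As noted above, FlagModel uses every visible GPU by default when encoding. If you want to pin encoding to specific devices, set CUDA_VISIBLE_DEVICES before the model is created; a minimal sketch (the device ids are illustrative):

```python
import os

# Restrict encoding to GPUs 0 and 1 (set to "" to make all GPUs unavailable and run on CPU).
# Set this before CUDA is initialized, i.e. at the top of your script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5', use_fp16=True)
embeddings = model.encode(["-1", "-2"])
print(embeddings.shape)
```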
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["-1", "-2"] sentences_2 = ["-3", "-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["-1", "-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., \[CLS\]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["-1", "-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. 
sentence_embeddings = model_output0 normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage of the ONNX files from optimum.onnxruntime import ORTModelForFeatureExtraction # type: ignore import torch from transformers import AutoModel, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13") model_ort = ORTModelForFeatureExtraction.from_pretrained('BAAI/bge-large-en-v1.5', revision="refs/pr/13",file_name="onnx/model.onnx") Sentences we want sentence embeddings for sentences = ["-1", "-2"] Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') for s2p(short query to long passage) retrieval task, add an instruction to query (not add instruction for passages) encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') model_output_ort = model_ort(**encoded_input) Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) model_output and model_output_ort are identical Usage via infinity Its also possible to deploy the onnx files with the infinity\_emb pip package. import asyncio from infinity_emb import AsyncEmbeddingEngine, EngineArgs sentences = ["Embed this is sentence via Infinity.", "Paris is in France."] engine = AsyncEmbeddingEngine.from_args( EngineArgs(model_name_or_path = "BAAI/bge-large-en-v1.5", device="cpu", engine="optimum" # or engine="torch" )) async def main(): async with engine: embeddings, usage = await engine.embed(sentences=sentences) asyncio.run(main()) Usage for Reranker Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Using FlagEmbedding pip install -U FlagEmbedding Get relevance scores (higher scores indicate more relevance): from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) Using Huggingface transformers import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) Evaluation baai-general-embedding models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. 
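Before the benchmark numbers below, here is a quick way to sanity-check the claim above that the ONNX export and the PyTorch model produce matching outputs: pool and normalize both, then compare them numerically. This is a minimal sketch; it uses bge-small-en-v1.5 (loaded without a revision, as in the small-model example earlier on this page), and the tolerance is an illustrative choice.

```python
import torch
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoModel, AutoTokenizer

model_id = 'BAAI/bge-small-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model_ort = ORTModelForFeatureExtraction.from_pretrained(model_id, file_name="onnx/model.onnx")
model.eval()

sentences = ["The giant panda lives in China.", "Paris is in France."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

def cls_embed(output):
    # CLS pooling followed by L2 normalization, as in the examples above.
    emb = output.last_hidden_state[:, 0]
    return torch.nn.functional.normalize(emb, p=2, dim=1)

with torch.no_grad():
    emb_torch = cls_embed(model(**encoded))
emb_onnx = cls_embed(model_ort(**encoded))

print(torch.allclose(emb_torch, emb_onnx, atol=1e-4))  # expect True if the export matches
```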
MTEB**: Model Name Dimension Sequence Length Average (56) Retrieval (15) Clustering (11) Pair Classification (3) Reranking (4) STS (10) Summarization (1) Classification (12) BAAI/bge-large-en-v1.5 1024 512 64.23 54.29 46.08 87.12 60.03 83.11 31.61 75.97 BAAI/bge-base-en-v1.5 768 512 63.55 53.25 45.77 86.55 58.86 82.4 31.07 75.53 BAAI/bge-small-en-v1.5 384 512 62.17 51.68 43.82 84.92 58.36 81.59 30.12 74.14 bge-large-en 1024 512 63.98 53.9 46.98 85.8 59.48 81.56 32.06 76.21 bge-base-en 768 512 63.36 53.0 46.32 85.86 58.7 81.84 29.27 75.27 gte-large 1024 512 63.13 52.22 46.84 85.00 59.13 83.35 31.66 73.33 gte-base 768 512 62.39 51.14 46.2 84.57 58.61 82.3 31.17 73.01 e5-large-v2 1024 512 62.25 50.56 44.49 86.03 56.61 82.05 30.19 75.24 bge-small-en 384 512 62.11 51.82 44.31 83.78 57.97 80.72 30.53 74.37 instructor-xl 768 512 61.79 49.26 44.74 86.62 57.29 83.06 32.32 61.79 e5-base-v2 768 512 61.5 50.29 43.80 85.73 55.91 81.05 30.28 73.84 gte-small 384 512 61.36 49.46 44.89 83.54 57.7 82.07 30.42 72.31 text-embedding-ada-002 1536 8192 60.99 49.25 45.9 84.89 56.32 80.97 30.8 70.93 e5-small-v2 384 512 59.93 49.04 39.92 84.67 54.32 80.39 31.16 72.94 sentence-t5-xxl 768 512 59.51 42.24 43.72 85.06 56.42 82.63 30.08 73.42 all-mpnet-base-v2 768 514 57.78 43.81 43.69 83.04 59.36 80.28 27.49 65.07 sgpt-bloom-7b1-msmarco 4096 2048 57.59 48.22 38.93 81.9 55.65 77.74 33.6 66.19 C-MTEB**: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to C\_MTEB for a detailed introduction. Model Embedding dimension Avg Retrieval STS PairClassification Classification Reranking Clustering BAAI/bge-large-zh-v1.5 1024 64.53 70.46 56.25 81.6 69.13 65.84 48.99 BAAI/bge-base-zh-v1.5 768 63.13 69.49 53.72 79.75 68.07 65.39 47.53 BAAI/bge-small-zh-v1.5 512 57.82 61.77 49.11 70.41 63.96 60.92 44.18 BAAI/bge-large-zh 1024 64.20 71.53 54.98 78.94 68.32 65.11 48.39 bge-large-zh-noinstruct 1024 63.53 70.55 53 76.77 68.58 64.91 50.01 BAAI/bge-base-zh 768 62.96 69.53 54.12 77.5 67.07 64.91 47.63 multilingual-e5-large 1024 58.79 63.66 48.44 69.89 67.34 56.00 48.23 BAAI/bge-small-zh 512 58.27 63.07 49.45 70.35 63.64 61.48 45.09 m3e-base 768 57.10 56.91 50.47 63.99 67.52 59.34 47.68 m3e-large 1024 57.05 54.75 50.42 64.3 68.2 59.66 48.88 multilingual-e5-base 768 55.48 61.63 46.49 67.07 65.35 54.35 40.68 multilingual-e5-small 384 55.38 59.95 45.27 66.45 65.85 53.86 45.26 text-embedding-ada-002(OpenAI) 1536 53.02 52.0 43.35 69.56 64.31 54.28 45.68 luotuo 1024 49.37 44.4 42.78 66.62 61 49.25 44.39 text2vec-base 768 47.63 38.79 43.41 67.41 62.19 49.45 37.66 text2vec-large 1024 47.36 41.94 44.97 70.86 60.66 49.16 30.02 Reranking**: See C\_MTEB for evaluation script. 
Model T2Reranking T2RerankingZh2En\* T2RerankingEn2Zh\* MMarcoReranking CMedQAv1 CMedQAv2 Avg text2vec-base-multilingual 64.66 62.94 62.51 14.37 48.46 48.6 50.26 multilingual-e5-small 65.62 60.94 56.41 29.91 67.26 66.54 57.78 multilingual-e5-large 64.55 61.61 54.28 28.6 67.42 67.92 57.4 multilingual-e5-base 64.21 62.13 54.68 29.5 66.23 66.98 57.29 m3e-base 66.03 62.74 56.07 17.51 77.05 76.76 59.36 m3e-large 66.13 62.72 56.1 16.46 77.76 78.27 59.57 bge-base-zh-v1.5 66.49 63.25 57.02 29.74 80.47 84.88 63.64 bge-large-zh-v1.5 65.74 63.39 57.03 28.74 83.45 85.44 63.97 BAAI/bge-reranker-base 67.28 63.95 60.45 35.46 81.26 84.1 65.42 BAAI/bge-reranker-large 67.6 64.03 61.44 37.16 82.15 84.18 66.09 \* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks Train BAAI Embedding We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baai\_general\_embedding. BGE Reranker Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). Citation If you find this repository useful, please consider giving a star :star: and citation @misc{bge_embedding, title={C-Pack: Packaged Resources To Advance General Chinese Embedding}, author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff}, year={2023}, eprint={2309.07597}, archivePrefix={arXiv}, primaryClass={cs.CL} } License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.

Read more

Updated 5/28/2024

bge-reranker-base

BAAI

Total Score

112

We have updated the new reranker, supporting larger lengths, more languages, and achieving better performance. FlagEmbedding Model List | FAQ | Usage | Evaluation | Train | Citation | License More details please refer to our Github: FlagEmbedding. English | FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: Long-Context LLM**: Activation Beacon Fine-tuning of LM** : LM-Cocktail Embedding Model**: Visualized-BGE, BGE-M3, LLM Embedder, BGE Embedding Reranker Model**: llm rerankers, BGE Reranker Benchmark**: C-MTEB News 3/18/2024: Release new rerankers, built upon powerful M3 and LLM (GEMMA and MiniCPM, not so large actually) backbones, supporitng multi-lingual processing and larger inputs, massive improvements of ranking performances on BEIR, C-MTEB/Retrieval, MIRACL, LlamaIndex Evaluation. 3/18/2024: Release Visualized-BGE, equipping BGE with visual capabilities. Visualized-BGE can be utilized to generate embeddings for hybrid image-text data. 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report 09/15/2023: The technical report of BGE has been released 09/15/2023: The massive training data of BGE has been released 09/12/2023: New models: New reranker model: release cross-encoder models BAAI/bge-reranker-base and BAAI/bge-reranker-large, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. update embedding model: release bge-*-v1.5 embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. More 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. 08/05/2023: Release base-scale and small-scale models, *best performance among the models of the same size * 08/02/2023: Release bge-large-(short for BAAI General Embedding) Models, *rank 1st on MTEB and C-MTEB benchmark!** :tada: :tada: 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. Model List bge is short for BAAI general embedding. 
Model Language Description query instruction for retrieval \[1\] BAAI/bge-m3 Multilingual Inference Fine-tune Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) BAAI/llm-embedder English Inference Fine-tune a unified embedding model to support diverse retrieval augmentation needs for LLMs See README BAAI/bge-reranker-large Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-reranker-base Chinese and English Inference Fine-tune a cross-encoder model which is more accurate but less efficient \[2\] BAAI/bge-large-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-base-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-small-en-v1.5 English Inference Fine-tune version 1.5 with more reasonable similarity distribution Represent this sentence for searching relevant passages: BAAI/bge-large-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-base-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-small-zh-v1.5 Chinese Inference Fine-tune version 1.5 with more reasonable similarity distribution `` BAAI/bge-large-en English Inference Fine-tune :trophy: rank 1st in MTEB leaderboard Represent this sentence for searching relevant passages: BAAI/bge-base-en English Inference Fine-tune a base-scale model but with similar ability to bge-large-en Represent this sentence for searching relevant passages: BAAI/bge-small-en English Inference Fine-tune a small-scale model but with competitive performance Represent this sentence for searching relevant passages: BAAI/bge-large-zh Chinese Inference Fine-tune :trophy: rank 1st in C-MTEB benchmark `` BAAI/bge-base-zh Chinese Inference Fine-tune a base-scale model but with similar ability to bge-large-zh `` BAAI/bge-small-zh Chinese Inference Fine-tune a small-scale model but with competitive performance `` \[1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. \[2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Frequently asked questions 1\. How to fine-tune bge embedding model? Following this example to prepare data and fine-tune your model. Some suggestions: Mine hard negatives following this example, which can improve the retrieval performance. 
If you pre-train bge on your data, the pre-trained model cannot be used directly to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity. If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker. Refer to this example for fine-tuning the reranker. 2. The similarity score between two dissimilar sentences is higher than 0.5 We suggest using bge v1.5, which alleviates the issue of the similarity distribution. Since we fine-tune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is roughly in the interval [0.6, 1]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate threshold based on the similarity distribution of your data (such as 0.8, 0.85, or even 0.9). 3. When does the query instruction need to be used? For bge-*-v1.5, we improved its retrieval ability when no instruction is used; omitting the instruction causes only a slight degradation in retrieval performance compared with using it. So you can generate embeddings without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions to these short queries. The best way to decide whether to add instructions to queries is to choose the setting that achieves better performance on your task. In all cases, the documents/passages do not need the instruction. Usage Usage for Embedding Model Here are some examples of using bge models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. Using FlagEmbedding pip install -U FlagEmbedding If it doesn't work for you, see FlagEmbedding for more methods to install FlagEmbedding. from FlagEmbedding import FlagModel sentences_1 = ["sample data-1", "sample data-2"] sentences_2 = ["sample data-3", "sample data-4"] model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="", use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation embeddings_1 = model.encode(sentences_1) embeddings_2 = model.encode(sentences_2) similarity = embeddings_1 @ embeddings_2.T print(similarity) For an s2p (short query to long passage) retrieval task, we suggest using encode_queries(), which will automatically add the instruction to each query. The corpus in a retrieval task can still use encode() or encode_corpus(), since passages don't need the instruction. queries = ['query_1', 'query_2'] passages = ["sample passage-1", "sample passage-2"] q_embeddings = model.encode_queries(queries) p_embeddings = model.encode(passages) scores = q_embeddings @ p_embeddings.T For the value of the argument query_instruction_for_retrieval, see the Model List. By default, FlagModel will use all available GPUs when encoding. Please set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs. You can also set os.environ["CUDA_VISIBLE_DEVICES"]="" to make all GPUs unavailable.
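Since the snippet above is hard to read in its flattened form, here is the same FlagEmbedding flow as a self-contained sketch; the sentences are placeholders, and the English checkpoint with its query instruction is taken from the Model List.

```python
from FlagEmbedding import FlagModel

sentences_1 = ["sample sentence 1", "sample sentence 2"]
sentences_2 = ["sample sentence 3", "sample sentence 4"]

# use_fp16=True speeds up encoding with a slight performance degradation
model = FlagModel(
    "BAAI/bge-large-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
    use_fp16=True,
)

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T  # inner-product similarity between the two sets
print(similarity)

# For an s2p (short query -> long passage) retrieval task, encode_queries()
# prepends the instruction to each query; passages can use encode() directly.
queries = ["query_1", "query_2"]
passages = ["sample passage 1", "sample passage 2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```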
Using Sentence-Transformers You can also use the bge models with sentence-transformers: pip install -U sentence-transformers from sentence_transformers import SentenceTransformer sentences_1 = ["sample data-1", "sample data-2"] sentences_2 = ["sample data-3", "sample data-4"] model = SentenceTransformer('BAAI/bge-large-zh-v1.5') embeddings_1 = model.encode(sentences_1, normalize_embeddings=True) embeddings_2 = model.encode(sentences_2, normalize_embeddings=True) similarity = embeddings_1 @ embeddings_2.T print(similarity) For an s2p (short query to long passage) retrieval task, each short query should start with an instruction (for instructions, see the Model List). But the instruction is not needed for passages. from sentence_transformers import SentenceTransformer queries = ['query_1', 'query_2'] passages = ["sample passage-1", "sample passage-2"] instruction = "" model = SentenceTransformer('BAAI/bge-large-zh-v1.5') q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True) p_embeddings = model.encode(passages, normalize_embeddings=True) scores = q_embeddings @ p_embeddings.T Using Langchain You can use bge in langchain like this: from langchain.embeddings import HuggingFaceBgeEmbeddings model_name = "BAAI/bge-large-en-v1.5" model_kwargs = {'device': 'cuda'} encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity model = HuggingFaceBgeEmbeddings( model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="" ) model.query_instruction = "" Using HuggingFace Transformers With the transformers package, you can use the model like this: first, pass your input through the transformer model, then select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding. from transformers import AutoTokenizer, AutoModel import torch Sentences we want sentence embeddings for sentences = ["sample data-1", "sample data-2"] Load model from HuggingFace Hub tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5') model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5') model.eval() Tokenize sentences encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') For an s2p (short query to long passage) retrieval task, add an instruction to each query (do not add it to passages): encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt') Compute token embeddings with torch.no_grad(): model_output = model(**encoded_input) Perform pooling. In this case, cls pooling. sentence_embeddings = model_output[0][:, 0] Normalize embeddings sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1) print("Sentence embeddings:", sentence_embeddings) Usage for Reranker Different from the embedding model, the reranker uses a question and a document as input and directly outputs similarity instead of an embedding. You can get a relevance score by inputting a query and a passage to the reranker. The reranker is optimized based on cross-entropy loss, so the relevance score is not bounded to a specific range.
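The Huggingface Transformers flow above, rearranged as a runnable sketch (CLS pooling of the first token followed by L2 normalization); the sentences are placeholders and the English checkpoint is used for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

sentences = ["sample sentence 1", "sample sentence 2"]

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: take the last hidden state of the first token as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]

# Normalize so that inner products correspond to cosine similarity
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```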
Using FlagEmbedding pip install -U FlagEmbedding Get relevance scores (higher scores indicate more relevance): from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(['query', 'passage']) print(score) scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]) print(scores) Using Huggingface transformers import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) Using the reranker with the ONNX files from optimum.onnxruntime import ORTModelForSequenceClassification # type: ignore import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base') model_ort = ORTModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base', file_name="onnx/model.onnx") Query-passage pairs we want relevance scores for pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] Tokenize pairs encoded_input = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt') scores_ort = model_ort(**encoded_input, return_dict=True).logits.view(-1, ).float() Compute scores with the torch model for comparison with torch.inference_mode(): scores = model(**encoded_input, return_dict=True).logits.view(-1, ).float() scores and scores_ort are identical Using the reranker with Infinity It's also possible to deploy the onnx/torch files with the infinity_emb pip package. import asyncio from infinity_emb import AsyncEmbeddingEngine, EngineArgs query='what is a panda?' docs = ['The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear', "Paris is in France."] engine = AsyncEmbeddingEngine.from_args( EngineArgs(model_name_or_path = "BAAI/bge-reranker-base", device="cpu", engine="torch" # or engine="optimum" for onnx )) async def main(): async with engine: ranking, usage = await engine.rerank(query=query, docs=docs) print(list(zip(ranking, docs))) asyncio.run(main()) Evaluation baai-general-embedding models achieve state-of-the-art performance on both the MTEB and C-MTEB leaderboards! For more details and evaluation tools, see our scripts.
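The FlagEmbedding reranker usage above, again as a compact, self-contained sketch (the query and passages are illustrative):

```python
from FlagEmbedding import FlagReranker

# use_fp16=True speeds up computation with a slight performance degradation
reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

# A single query-passage pair -> one relevance score (not bounded to a fixed range)
score = reranker.compute_score(
    ["what is panda?", "The giant panda is a bear species endemic to China."]
)
print(score)

# Several pairs at once -> a list of scores, useful for re-ranking top-k passages
pairs = [
    ["what is panda?", "hi"],
    ["what is panda?", "The giant panda (Ailuropoda melanoleuca) is a bear species endemic to China."],
]
scores = reranker.compute_score(pairs)
print(scores)
```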
MTEB**:

| Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 |
| BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 |
| BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 | 51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 |
| bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 |
| bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 |
| gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 |
| instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |

C-MTEB**: We created the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks. Please refer to C_MTEB for a detailed introduction.

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
| BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
| bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
| multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
| multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
| multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
| text-embedding-ada-002 (OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
| text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |

Reranking**: See C_MTEB for the evaluation script.
| Model | T2Reranking | T2RerankingZh2En* | T2RerankingEn2Zh* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|---|---|
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
| multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
| multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
| m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
| m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
| bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
| bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
| BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
| BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |

*: T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.

Train

BAAI Embedding

We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned. For more training details for bge, see baai_general_embedding.

BGE Reranker

The cross-encoder performs full attention over the input pair, which is more accurate than the embedding model (i.e., bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by the embedding model. We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can fine-tune it easily following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md

Citation

If you find this repository useful, please consider giving a star :star: and citation

@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.

Read more

Updated 5/28/2024

🛠️

bge-reranker-v2-m3

BAAI

Total Score

98

The bge-reranker-v2-m3 model is a lightweight reranker model from BAAI that possesses strong multilingual capabilities. It is built on top of the bge-m3 base model, which is a versatile AI model that can simultaneously perform dense retrieval, multi-vector retrieval, and sparse retrieval. The bge-reranker-v2-m3 model is easy to deploy and provides fast inference, making it suitable for a variety of multilingual contexts. Model inputs and outputs The bge-reranker-v2-m3 model takes as input a query and a passage, and outputs a relevance score that indicates how relevant the passage is to the query. The relevance score is not bounded to a specific range, as the model is optimized based on cross-entropy loss. This allows for more fine-grained ranking of passages compared to models that output similarity scores bounded between 0 and 1. Inputs Query**: The text of the query to be evaluated Passage**: The text of the passage to be evaluated for relevance to the query Outputs Relevance score**: A float value representing the relevance of the passage to the query, with higher scores indicating more relevance. Capabilities The bge-reranker-v2-m3 model is designed to be a powerful and efficient reranker for multilingual contexts. It can be used to rerank the top-k documents retrieved by an embedding model, such as the bge-m3 model, to further improve the relevance of the final results. What can I use it for? The bge-reranker-v2-m3 model is well-suited for a variety of multilingual information retrieval and question-answering tasks. It can be used to rerank results from a search engine, to filter and sort documents for research or analysis, or to improve the relevance of responses in a multilingual chatbot or virtual assistant. Its fast inference and strong multilingual capabilities make it a versatile tool for building language-agnostic applications. Things to try One interesting aspect of the bge-reranker-v2-m3 model is its ability to output relevance scores that are not bounded between 0 and 1. This allows for more nuanced ranking of passages, which could be particularly useful in applications where small differences in relevance are important. Developers could experiment with using these unbounded scores to improve the precision of their retrieval systems, or to surface more contextually relevant information to users. Another interesting thing to try would be to combine the bge-reranker-v2-m3 model with the bge-m3 model in a hybrid retrieval pipeline. By using the bge-m3 model for initial dense retrieval and the bge-reranker-v2-m3 model for reranking, you could potentially achieve higher accuracy and better performance across a range of multilingual use cases.
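As a rough sketch of the hybrid setup suggested above — dense retrieval with bge-m3 followed by re-ranking with bge-reranker-v2-m3 — assuming the FlagEmbedding package's BGEM3FlagModel and FlagReranker interfaces; the documents, the candidate count, and the exact scoring details are illustrative, not a definitive pipeline:

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel, FlagReranker

query = "what is a panda?"
docs = [
    "The giant panda (Ailuropoda melanoleuca) is a bear species endemic to China.",
    "Paris is the capital of France.",
    "Pandas feed almost exclusively on bamboo.",
]

# Stage 1: dense retrieval with bge-m3 (dense vectors only, for simplicity)
retriever = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
q_vec = retriever.encode([query])["dense_vecs"]
d_vecs = retriever.encode(docs)["dense_vecs"]
dense_scores = (q_vec @ d_vecs.T)[0]
top_k = np.argsort(dense_scores)[::-1][:2]  # keep the 2 best candidates

# Stage 2: re-rank the shortlisted candidates with the cross-encoder reranker
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
pairs = [[query, docs[i]] for i in top_k]
rerank_scores = reranker.compute_score(pairs)

for score, idx in sorted(zip(rerank_scores, top_k), reverse=True):
    print(f"{score:.3f}  {docs[idx]}")
```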

Read more

Updated 5/30/2024