bge-small-zh-v1.5

Maintainer: BAAI

Total Score: 43

Last updated: 9/6/2024


  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided


Model overview

The bge-small-zh-v1.5 model from BAAI is a small-scale version of the BAAI General Embedding (BGE) model, which can map any text to a low-dimensional dense vector. Compared with previous BGE releases, version 1.5 has a more reasonable similarity distribution and improved retrieval ability without requiring an instruction prefix on queries. The bge-small-zh-v1.5 model performs competitively with larger models, making it a good option for projects with computational constraints.

Model inputs and outputs

The bge-small-zh-v1.5 model takes in text as input and outputs a fixed-size embedding vector. This embedding can then be used for tasks like retrieval, classification, clustering, or semantic search. The model supports both Chinese and English text.

Inputs

  • Text: The model can accept any Chinese or English text as input.

Outputs

  • Embedding vector: The model outputs a fixed-size vector representation of the input text, which can be used for downstream tasks (a minimal encoding sketch follows).
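
To make this input/output contract concrete, here is a minimal sketch using the sentence-transformers library, one of the usage paths shown on the BGE model cards. The sample sentences are made up, and the exact arguments (including the normalize_embeddings flag) are assumptions worth checking against the official documentation.

```python
# Minimal sketch: encode text into dense vectors with bge-small-zh-v1.5.
# Assumes the sentence-transformers package and the BAAI/bge-small-zh-v1.5
# checkpoint on HuggingFace; verify arguments against the model card.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")

sentences = ["样例文本一", "样例文本二"]

# normalize_embeddings=True produces unit-length vectors, so a dot product
# between two embeddings equals their cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

print(embeddings.shape)  # (2, embedding_dim): one fixed-size vector per input
```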

Capabilities

The bge-small-zh-v1.5 model is capable of generating high-quality text embeddings that can be used for a variety of natural language processing tasks. Its performance is competitive with larger BGE models, making it a good choice for projects with limited computational resources. The model's improved similarity distribution helps to better differentiate between similar and dissimilar text.

What can I use it for?

The bge-small-zh-v1.5 embedding can be used in a wide range of applications, such as:

  • Semantic search: Use the embeddings to find relevant passages or documents for a given query (a ranking sketch appears after this list).
  • Text classification: Train a classifier on top of the embeddings to categorize text into different classes.
  • Clustering: Group similar text together based on the embeddings.
  • Recommendation systems: Use the embeddings to find similar items or content for recommendation.
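
As a concrete illustration of the semantic-search use case, the hedged sketch below ranks a toy corpus of passages against a query by cosine similarity. The corpus, query, and model ID are illustrative assumptions; with normalized embeddings, the dot product is the cosine similarity.

```python
# Semantic-search sketch: rank passages by cosine similarity to a query.
# The corpus and query below are toy examples for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-zh-v1.5")

corpus = [
    "北京是中国的首都。",
    "熊猫主要以竹子为食。",
    "深度学习是机器学习的一个分支。",
]
query = "中国的首都是哪里？"

# Normalized embeddings let us use a plain dot product as cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]

scores = corpus_emb @ query_emb
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```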

Things to try

One interesting thing to try with the bge-small-zh-v1.5 model is to fine-tune it on your specific data and task. The examples provided by the maintainers show how to prepare data and fine-tune the model to improve performance on your use case. Additionally, you can experiment with using the model in conjunction with the provided reranker models to further enhance retrieval performance.
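
If you want to try the reranker combination mentioned above, a common pattern is to retrieve candidate passages with the embedding model and then re-score the (query, passage) pairs with a BGE reranker. The sketch below assumes the FlagEmbedding package and the BAAI/bge-reranker-base checkpoint; treat the class name and arguments as assumptions to verify against the FlagEmbedding documentation.

```python
# Two-stage retrieval sketch: embed-and-retrieve first, then rerank.
# Assumes the FlagEmbedding package; the reranker checkpoint name is an example.
from FlagEmbedding import FlagReranker

query = "中国的首都是哪里？"
# In practice these would be the top-k passages returned by the embedding model.
candidates = ["北京是中国的首都。", "熊猫主要以竹子为食。"]

reranker = FlagReranker("BAAI/bge-reranker-base", use_fp16=True)

# compute_score takes (query, passage) pairs and returns one relevance score per pair.
scores = reranker.compute_score([[query, passage] for passage in candidates])

best_score, best_passage = max(zip(scores, candidates))
print(best_score, best_passage)
```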



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


bge-base-zh-v1.5

BAAI

Total Score: 51

The bge-base-zh-v1.5 model is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It is part of the BAAI General Embedding (BGE) family of models, which can map any text to a low-dimensional dense vector. This can be used for tasks like retrieval, classification, clustering, or semantic search. The bge-base-zh-v1.5 model is the Chinese version of the base-scale BGE model, updated to version 1.5 to have a more reasonable similarity distribution compared to previous versions.

The bge-base-zh-v1.5 model is similar in capability to the BAAI/bge-large-zh-v1.5 model, which is the large-scale Chinese BGE model, but the base-scale model has a smaller embedding size. The BAAI/bge-small-zh-v1.5 model is an even smaller-scale Chinese BGE model, with further reduced embedding size but still competitive performance.

Model inputs and outputs

Inputs

  • Text: The model can take any text as input, such as short queries or long passages.

Outputs

  • Embeddings: The model outputs a low-dimensional dense vector representation (embedding) of the input text.

Capabilities

The bge-base-zh-v1.5 model can effectively map Chinese text to a semantic embedding space. It achieves state-of-the-art performance on the Chinese Massive Text Embedding Benchmark (C-MTEB), ranking 1st in multiple evaluation tasks.

What can I use it for?

The bge-base-zh-v1.5 embedding model can be used in a variety of natural language processing applications that require semantic understanding of text, such as:

  • Retrieval: Use the embeddings to find the most relevant passages or documents for a given query.
  • Classification: Train a classifier on top of the embeddings to categorize text into different classes.
  • Clustering: Group similar text together based on the proximity of their embeddings.
  • Semantic search: Find documents or passages that are semantically similar to a given query.

The model can also be integrated into vector databases to support retrieval-augmented large language models (LLMs).

Things to try

One interesting aspect of the bge-base-zh-v1.5 model is that it has improved retrieval performance without using any instruction in the query, compared to previous versions that required an instruction. This makes it more convenient to use in many applications. You can experiment with using the model with and without instructions to see which setting works best for your specific task. Additionally, you can try fine-tuning the bge-base-zh-v1.5 model on your own data using the provided examples. This can help improve the model's performance on your domain-specific tasks.



bge-base-zh

BAAI

Total Score: 51

The bge-base-zh model is part of the BAAI FlagEmbedding suite, which focuses on retrieval-augmented language models. It is a Chinese-language text embedding model trained by BAAI using contrastive learning on a large-scale dataset. The model can map any Chinese text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search.

The FlagEmbedding project also includes the LLM-Embedder model, which is a unified embedding model designed to support diverse retrieval augmentation needs for large language models (LLMs). Additionally, the project features BGE Reranker models, which are cross-encoder models that are more accurate but less efficient than the embedding models.

Model inputs and outputs

Inputs

  • Chinese text: The model takes arbitrary Chinese text as input and encodes it into a low-dimensional dense vector.

Outputs

  • Embedding vector: The model outputs a low-dimensional (e.g. 768-dimensional) dense vector representation of the input text.

Capabilities

The bge-base-zh model can map Chinese text to a semantic vector space, enabling a variety of downstream tasks. It has been shown to achieve state-of-the-art performance on the Chinese Massive Text Embedding Benchmark (C-MTEB), outperforming other widely used models like multilingual-e5 and text2vec.

What can I use it for?

The bge-base-zh model can be used for a variety of natural language processing tasks, such as:

  • Semantic search: Use the embeddings to find relevant documents or passages given a query.
  • Text classification: Train a classifier on top of the embeddings to categorize text into different classes.
  • Clustering: Group similar text together based on the embedding vectors.
  • Semantic similarity: Compute the similarity between two text snippets using the cosine similarity of their embeddings.

The model can also be fine-tuned on domain-specific data to further improve performance on specialized tasks.

Things to try

One interesting aspect of the bge-base-zh model is its ability to generate embeddings without the need for an instruction prefix, which can simplify usage in some scenarios. However, for retrieval tasks involving short queries and long passages, it is recommended to add an instruction prefix to the query to improve performance. When using the model, it is also important to consider the similarity distribution of the embeddings. The current bge-base-zh model has a similarity distribution in the range of [0.6, 1], so a similarity score greater than 0.5 does not necessarily indicate that the two sentences are similar. For downstream tasks, the relative order of the scores is more important than the absolute value.
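
The instruction-prefix advice above can be made concrete with a small sketch. The Chinese retrieval instruction used here is the one suggested in the BGE documentation for short-query-to-long-passage retrieval, and the model ID and library choice are assumptions to verify; also note that because the absolute similarity values for this model cluster high, the relative ordering of scores is what matters.

```python
# Sketch: compare query-passage similarity with and without the instruction prefix.
# The instruction string is the one suggested in the BGE docs; verify it against
# the model card before relying on it.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh")
instruction = "为这个句子生成表示以用于检索相关文章："

query = "熊猫吃什么？"
passage = "大熊猫的食物中，绝大部分是竹子。"

for q in (query, instruction + query):
    q_emb, p_emb = model.encode([q, passage], normalize_embeddings=True)
    # With unit-length vectors the dot product is the cosine similarity.
    print(f"{q!r} -> cosine similarity {float(q_emb @ p_emb):.3f}")
```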



bge-large-zh

BAAI

Total Score: 290

The bge-large-zh model is a state-of-the-art text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI). It is part of the BAAI General Embedding (BGE) family of models, which have achieved top performance on both the MTEB and C-MTEB benchmarks. The bge-large-zh model is specifically designed for Chinese text processing, and it can map any Chinese text into a low-dimensional dense vector that can be used for tasks like retrieval, classification, clustering, or semantic search.

Compared to similar models like BAAI/bge-large-en and BAAI/bge-small-en, the bge-large-zh model has been optimized for Chinese text and has demonstrated state-of-the-art performance on Chinese benchmarks. The BAAI/llm-embedder model is a more recent addition to the BAAI family, serving as a unified embedding model to support diverse retrieval augmentation needs for large language models (LLMs).

Model inputs and outputs

Inputs

  • Text: The bge-large-zh model can take any Chinese text as input, ranging from short queries to long passages.
  • Instruction (optional): For retrieval tasks that use short queries to find long related documents, it is recommended to add an instruction to the query to help the model better understand the intent. The instruction should be placed at the beginning of the query text. No instruction is needed for the passage/document text.

Outputs

  • Embeddings: The primary output of the bge-large-zh model is a dense vector embedding of the input text. These embeddings can be used for a variety of downstream tasks, such as:
    • Retrieval: The embeddings can be used to find related passages or documents by computing the similarity between the query embedding and the passage/document embeddings.
    • Classification: The embeddings can be used as features for training classification models.
    • Clustering: The embeddings can be used to group similar text together.
    • Semantic search: The embeddings can be used to find semantically related text.

Capabilities

The bge-large-zh model demonstrates state-of-the-art performance on a range of Chinese text processing tasks. On the Chinese Massive Text Embedding Benchmark (C-MTEB), the bge-large-zh-v1.5 model ranked first overall, showing strong results across tasks like retrieval, semantic similarity, and classification. Additionally, the bge-large-zh model has been designed to handle long input text, with a maximum sequence length of 512 tokens. This makes it well-suited for tasks that involve processing lengthy passages or documents, such as research paper retrieval or legal document search.

What can I use it for?

The bge-large-zh model can be used for a variety of Chinese text processing tasks, including:

  • Retrieval: Use the model to find relevant passages or documents given a query. This can be helpful for building search engines, Q&A systems, or knowledge management tools.
  • Classification: Use the model's embeddings as features to train classification models for tasks like sentiment analysis, topic classification, or intent detection.
  • Clustering: Group similar Chinese text together using the model's embeddings, which can be useful for organizing large collections of documents or categorizing user-generated content.
  • Semantic search: Find semantically related text by computing the similarity between the model's embeddings, enabling more advanced search experiences.

Things to try

One interesting aspect of the bge-large-zh model is its ability to handle queries with or without instruction. While adding an instruction to the query can improve retrieval performance, the model's v1.5 version has been enhanced to perform well even without the instruction. This makes it more convenient to use in certain applications, as you don't need to worry about crafting the perfect query instruction.

Another thing to try is fine-tuning the bge-large-zh model on your own data. The provided examples show how you can prepare data and fine-tune the model to improve its performance on your specific use case. This can be particularly helpful if you have domain-specific text that the pre-trained model doesn't handle as well.



bge-large-en

BAAI

Total Score: 181

The bge-large-en model is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). It is part of the BAAI General Embedding (BGE) family of models, which can map text to low-dimensional dense vectors for tasks like retrieval, classification, and semantic search. The maintainers recommend using the newer BAAI/bge-large-en-v1.5 model, which has a more reasonable similarity distribution and the same usage method.

Model inputs and outputs

Inputs

  • Text sequences of up to 512 tokens

Outputs

  • 1024-dimensional dense vector embeddings

Capabilities

The bge-large-en model can generate high-quality text embeddings that capture semantic meaning. These embeddings can be used for a variety of downstream tasks, such as:

  • Retrieval: Finding relevant documents or passages given a query
  • Classification: Classifying text into predefined categories
  • Clustering: Grouping similar text documents together
  • Semantic search: Searching for relevant content based on meaning, not just keywords

What can I use it for?

The bge-large-en embeddings can be leveraged in various applications that require understanding the semantic meaning of text. For example, you could use them to build a powerful search engine that returns relevant results based on the query's intent, rather than just matching keywords. Another potential use case is intelligent document retrieval and recommendation, where the model can surface the most relevant information to users based on their needs. This could be especially useful in enterprise settings or academic research, where users need to quickly find relevant information among large document collections.

Things to try

One interesting experiment would be to fine-tune the bge-large-en model on a specific domain or task, such as legal document retrieval or scientific paper recommendation. This could help the model better capture the nuances and specialized vocabulary of your particular use case. You could also explore using the bge-large-en embeddings in combination with other techniques, such as sparse lexical matching or multi-vector retrieval, to create a hybrid search system that leverages the strengths of different approaches.
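
To experiment with the hybrid-search idea, one option is to combine the dense cosine score with a lexical score. The sketch below uses a toy term-overlap scorer as a stand-in for a real sparse method such as BM25; the weighting, corpus, and model ID are all illustrative assumptions.

```python
# Hybrid-search sketch: weighted combination of a dense cosine score and a
# toy lexical-overlap score (a simple stand-in for BM25 or another sparse method).
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en")

corpus = [
    "The capital of France is Paris.",
    "Pandas mostly eat bamboo.",
    "Deep learning is a branch of machine learning.",
]
query = "What is the capital city of France?"

def tokens(text: str) -> set:
    """Lowercased word tokens; good enough for a toy lexical score."""
    return set(re.findall(r"\w+", text.lower()))

def lexical_score(q: str, doc: str) -> float:
    """Fraction of query tokens that also appear in the document."""
    q_tok = tokens(q)
    return len(q_tok & tokens(doc)) / max(len(q_tok), 1)

# Dense scores: with normalized embeddings the dot product is cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)[0]
dense = corpus_emb @ query_emb

sparse = np.array([lexical_score(query, doc) for doc in corpus])

alpha = 0.7  # illustrative weight between dense and lexical evidence
hybrid = alpha * dense + (1 - alpha) * sparse

for idx in np.argsort(-hybrid):
    print(f"{hybrid[idx]:.3f}  {corpus[idx]}")
```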
