bge-large-en-v1.5

Maintainer: nateraw

Total Score: 202

Last updated: 9/18/2024
  • Run this model: Run on Replicate
  • API spec: View on Replicate
  • Github link: View on Github
  • Paper link: View on Arxiv


Model overview

The bge-large-en-v1.5 is a text embedding model created by BAAI (Beijing Academy of Artificial Intelligence). It is designed to generate high-quality embeddings for text sequences in English. It is related to models such as the bge-reranker-base and multilingual-e5-large, which have also shown strong performance on various language tasks. The bge-large-en-v1.5 model offers enhanced capabilities over earlier BGE releases and is well-suited to a range of natural language processing applications.

Model inputs and outputs

The bge-large-en-v1.5 model takes text sequences as input and generates corresponding embeddings. Users can provide the text either as a path to a file containing JSONL data with a 'text' field, or as a JSON list of strings. The model also accepts a batch size parameter to control the processing of the input data. Additionally, users can choose to normalize the output embeddings and convert the results to a NumPy format.

Inputs

  • Path: Path to a file containing text as JSONL with a 'text' field or a valid JSON string list.
  • Texts: Text to be embedded, formatted as a JSON list of strings.
  • Batch Size: Batch size to use when processing the text data.
  • Convert To Numpy: Option to return the output as a NumPy file instead of JSON.
  • Normalize Embeddings: Option to normalize the generated embeddings.
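A sketch of the two input formats described above. The example strings and the file name are illustrative, not taken from the model's documentation:

```python
import json
import os
import tempfile

# 1. As a JSON list of strings, for the "Texts" input.
texts_input = json.dumps(["The quick brown fox.", "Embeddings capture meaning."])

# 2. As a JSONL file with a 'text' field per line, referenced by the "Path" input.
path = os.path.join(tempfile.mkdtemp(), "inputs.jsonl")
with open(path, "w") as f:
    for t in json.loads(texts_input):
        f.write(json.dumps({"text": t}) + "\n")

# Reading the file back shows the per-line structure the model expects.
with open(path) as f:
    records = [json.loads(line) for line in f]
print(records[0]["text"])  # The quick brown fox.
```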

Outputs

  • The model outputs the text embeddings, which can be returned either as a JSON array or as a NumPy file, depending on the user's preference.
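A minimal sketch of what the normalization option does, using NumPy. The vectors here are toy values, not real model outputs:

```python
import numpy as np

# Normalizing scales each embedding vector to unit L2 norm,
# so dot products between vectors equal cosine similarities.
embeddings = np.array([[3.0, 4.0], [1.0, 0.0]])
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
print(np.linalg.norm(normalized, axis=1))  # [1. 1.]
```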

Capabilities

The bge-large-en-v1.5 model is capable of generating high-quality text embeddings that capture the semantic and contextual meaning of the input text. These embeddings can be utilized in a wide range of natural language processing tasks, such as text classification, semantic search, and content recommendation. The model's performance has been demonstrated in various benchmarks and real-world applications.
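As a sketch of the semantic-search use mentioned above: with normalized embeddings, cosine similarity reduces to a dot product. The vectors are illustrative stand-ins for real model outputs:

```python
import numpy as np

# A query embedding and three document embeddings (all unit-norm),
# hypothetical values standing in for bge-large-en-v1.5 outputs.
query = np.array([0.6, 0.8])
docs = np.array([[0.6, 0.8], [0.8, -0.6], [1.0, 0.0]])

# Dot product against each document gives cosine similarity scores;
# the highest-scoring document is the best semantic match.
scores = docs @ query
best = int(np.argmax(scores))
print(best, scores[best])  # 0 1.0
```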

What can I use it for?

The bge-large-en-v1.5 model can be a valuable tool for developers and researchers working on natural language processing projects. The text embeddings generated by the model can be used as input features for downstream machine learning models, enabling more accurate and efficient text-based applications. For example, the embeddings could be used in sentiment analysis, topic modeling, or to power personalized content recommendations.

Things to try

To get the most out of the bge-large-en-v1.5 model, you can experiment with different input text formats, batch sizes, and normalization options to find the configuration that works best for your specific use case. You can also explore how the model's performance compares to other similar models, such as the bge-reranker-base and multilingual-e5-large models, to determine the most suitable approach for your needs.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models


bge_1-5_query_embeddings

center-for-curriculum-redesign

Total Score: 5

The bge_1-5_query_embeddings model is a query embedding generator developed by the Center for Curriculum Redesign. It is built on top of BAAI's bge-large-en v1.5 embedding model, a powerful encoder for embedding text sequences. Similar models include bge-large-en-v1.5, bge-reranker-base, and multilingual-e5-large.

Model inputs and outputs

The bge_1-5_query_embeddings model takes a list of text queries and generates corresponding embedding vectors for retrieval and comparison purposes. The model automatically formats the input queries for retrieval, so users do not need to preprocess the text.

Inputs

  • Query Texts: A serialized JSON array of strings to be used as text queries for generating embeddings.
  • Normalize: A boolean flag controlling whether the output embeddings are normalized to a magnitude of 1.
  • Precision: The numerical precision to use for the inference computations, either "full" or "half".
  • Batchtoken Max: The maximum number of kibiTokens (1 kibiToken = 1024 tokens) to include in a single batch, to avoid out-of-memory errors.

Outputs

  • Query Embeddings: An array of embedding vectors, where each vector corresponds to one of the input text queries.
  • Extra Metrics: Additional metrics or data associated with the embedding generation process.

Capabilities

The bge_1-5_query_embeddings model generates high-quality text embeddings that can be used for a variety of natural language processing tasks, such as information retrieval, text similarity comparison, and document clustering. The embeddings capture the semantic meaning of the input text, allowing for more effective downstream applications.

What can I use it for?

The bge_1-5_query_embeddings model can be used in a wide range of applications that require text encoding and comparison, such as search engines, recommendation systems, and content analysis tools. By generating embeddings for text queries, you can leverage the model's encoding capabilities to improve the relevance and accuracy of your search or recommendation results.

Things to try

One interesting thing to try with the bge_1-5_query_embeddings model is to experiment with different levels of precision for the inference computations. Depending on your use case and hardware constraints, the "half" precision setting may provide sufficient accuracy while requiring fewer computational resources. You could also explore how the model's performance varies with different normalization strategies for the output embeddings.


bge-reranker-base

ninehills

Total Score: 8

The bge-reranker-base model from BAAI (Beijing Academy of Artificial Intelligence) is a cross-encoder that re-ranks the top-k documents returned by an embedding model. It is more accurate than embedding models like BGE-M3 or LLM Embedder, but less efficient. The model can be fine-tuned on your own data to improve performance on specific tasks.

Model inputs and outputs

Inputs

  • pairs_json: A JSON string containing input pairs, e.g. [["a", "b"], ["c", "d"]].

Outputs

  • scores: An array of scores for the input pairs.
  • use_fp16: A boolean indicating whether the model used FP16 inference.
  • model_name: The name of the model used.

Capabilities

The bge-reranker-base model can effectively re-rank the top-k documents returned by an embedding model, making the final ranking more accurate. This is particularly useful when you need high-precision retrieval results, such as for question answering or other knowledge-intensive tasks.

What can I use it for?

You can use the bge-reranker-base model to re-rank the results of an embedding model like BGE-M3 or LLM Embedder. This can improve the accuracy of your retrieval system, especially for applications where precision is critical.

Things to try

You can try fine-tuning the bge-reranker-base model on your own data to further improve its performance on your specific use case. The provided examples can be a good starting point.
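The re-ranking step can be sketched as follows. The query, candidate documents, and scores are hypothetical; in practice the scores would come from the reranker's scores output for each pair:

```python
# An embedding model returns top-k candidate documents; a cross-encoder
# like bge-reranker-base then scores each (query, document) pair, and
# the candidates are reordered by that score.
candidates = ["doc_a", "doc_b", "doc_c"]
pairs = [["what is BGE?", doc] for doc in candidates]

# Made-up cross-encoder scores, one per pair, for illustration.
scores = [0.12, 0.87, 0.45]

# Sort candidates by score, highest first.
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)  # ['doc_b', 'doc_c', 'doc_a']
```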


multilingual-e5-large

beautyyuyanli

Total Score: 8.6K

The multilingual-e5-large model is a multi-language text embedding model, hosted on Replicate by beautyyuyanli. It is listed alongside other large models such as qwen1.5-72b, llava-13b, qwen1.5-110b, uform-gen, and cog-a1111-ui, which aim to provide large-scale language understanding capabilities across multiple languages.

Model inputs and outputs

The multilingual-e5-large model takes text data as input and generates embeddings, which are numerical representations of the input text. The input text can be provided as a JSON list of strings, and the model also accepts parameters for batch size and whether to normalize the output embeddings.

Inputs

  • texts: Text to embed, formatted as a JSON list of strings (e.g. ["In the water, fish are swimming.", "Fish swim in the water.", "A book lies open on the table."]).
  • batch_size: Batch size to use when processing text data (default is 32).
  • normalize_embeddings: Whether to normalize the output embeddings (default is true).

Outputs

  • An array of arrays, where each inner array is the embedding for the corresponding input text.

Capabilities

The multilingual-e5-large model generates high-quality text embeddings for a wide range of languages, making it a useful tool for natural language processing tasks such as text classification, semantic search, and data analysis.

What can I use it for?

The multilingual-e5-large model can be used in applications that require text embeddings, such as multilingual search engines, recommendation systems, or language translation tools. By leveraging the model's ability to generate embeddings for multiple languages, developers can create more inclusive and accessible applications that serve a global audience.

Things to try

One interesting thing to try with the multilingual-e5-large model is to explore how the generated embeddings capture semantic relationships between words and phrases across different languages. You could experiment with using the embeddings for cross-lingual text similarity or clustering tasks, which could provide valuable insight into the model's language understanding capabilities.


goliath-120b

nateraw

Total Score: 235

goliath-120b is an auto-regressive causal language model created by combining two fine-tuned Llama-2 70B models into one. Hosted on Replicate by nateraw, this large language model (LLM) represents an advancement in the Llama 2 line of models, offering increased capability and scale. Similar models in this space include Mixtral-8x7B and the various CodeLlama models, which focus on coding and conversational abilities.

Model inputs and outputs

goliath-120b is a text-to-text generative model: it takes a prompt as input and generates a response as output. The model allows customization of several key parameters, including temperature, top-k and top-p filtering, maximum new tokens, and presence and frequency penalties.

Inputs

  • Prompt: The text prompt that the model will use to generate a response.
  • Temperature: A value used to modulate the next-token probabilities, controlling the "creativity" of the model's output.
  • Top K: The number of highest-probability tokens to consider when generating the output.
  • Top P: A probability threshold for generating the output, using nucleus filtering.
  • Max New Tokens: The maximum number of tokens the model should generate as output.

Outputs

  • Generated Text: The model's response, generated based on the provided input prompt and parameters.

Capabilities

goliath-120b is a powerful language model capable of a wide range of text generation tasks, from creative writing to task-oriented dialogue. The model's large size and fine-tuning allow it to produce coherent, contextually appropriate text.

What can I use it for?

goliath-120b can be used for natural language processing applications such as chatbots, content generation, and language modeling. Its versatility makes it a valuable tool for businesses and developers looking to incorporate advanced language capabilities into their products or services.

Things to try

Experiment with different prompts and parameter settings to explore the model's full capabilities. Try using goliath-120b for tasks like story generation, question answering, or code completion to probe its strengths and limitations. The model's large scale and fine-tuning can produce impressive results, but it's important to monitor outputs and ensure they align with your intended use case.
