Alibaba-NLP

Models by this creator

🤯

gte-Qwen2-7B-instruct

Alibaba-NLP

Total Score

103

gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family developed by Alibaba-NLP. It ranks #1 in both the English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024. The model is based on the Qwen2-7B large language model and builds upon the previous gte-Qwen1.5-7B-instruct model by incorporating several key advancements, including bidirectional attention mechanisms for enhanced contextual understanding and instruction tuning applied solely on the query side for streamlined efficiency.

Model inputs and outputs

The gte-Qwen2-7B-instruct model takes text inputs and produces contextual embeddings. It can handle a wide range of text, from short queries to lengthy documents, with a maximum input length of 32,000 tokens.

Inputs

- Text data, such as sentences, paragraphs, or documents

Outputs

- Contextual embeddings: high-dimensional vector representations of the input text
- The model outputs embeddings with a dimensionality of 3,584

Capabilities

The gte-Qwen2-7B-instruct model excels at a variety of text-related tasks, including semantic search, text classification, and data augmentation. Its comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios makes it highly applicable across numerous languages and a wide array of downstream tasks.

What can I use it for?

The gte-Qwen2-7B-instruct model can be leveraged for a wide range of applications, such as:

- Semantic search: Use the model's contextual embeddings to power semantic search engines, allowing users to find relevant information based on the meaning of their queries, not just keyword matching.
- Text classification: Fine-tune the model for specialized text classification tasks, such as sentiment analysis, topic classification, or intent detection.
- Data augmentation: Leverage the model's understanding of language to generate synthetic text data, which can be used to expand and diversify training datasets for machine learning models.

Things to try

One interesting aspect of the gte-Qwen2-7B-instruct model is its ability to handle long-form text inputs. Try using the model to generate embeddings for lengthy documents, such as research papers or technical manuals, and explore how the contextual understanding can be applied to tasks like document summarization or knowledge extraction.
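As a rough illustration of the semantic search use case, the sketch below embeds a query and two documents with the sentence-transformers library and ranks the documents by cosine similarity. The repo id Alibaba-NLP/gte-Qwen2-7B-instruct and the prompt_name="query" argument follow the pattern published on the model card, but treat them as assumptions to verify rather than a definitive recipe.

```python
# Minimal semantic-search sketch (assumes the sentence-transformers package
# and the Hugging Face repo id Alibaba-NLP/gte-Qwen2-7B-instruct).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)

query = "how do transformers use attention?"
documents = [
    "Attention lets a transformer weigh every token against every other token.",
    "Convolutional networks slide fixed-size filters over an input image.",
]

# Instruction tuning is applied on the query side only, so only the query
# uses the model's query prompt; documents are encoded as plain text.
query_emb = model.encode([query], prompt_name="query", normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

# Cosine similarity on normalized vectors is a plain dot product.
scores = (query_emb @ doc_embs.T)[0]
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```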

Read more

Updated 7/18/2024

🧪

gte-multilingual-base

Alibaba-NLP

Total Score

84

The gte-multilingual-base model is the latest in the GTE (General Text Embedding) family of models from Alibaba-NLP. It achieves state-of-the-art results in multilingual retrieval tasks and multi-task representation model evaluations compared to models of similar size. Unlike previous GTE models based on a decoder-only LLM architecture (e.g., gte-qwen2-1.5b-instruct), this encoder-only transformer model has lower hardware requirements for inference, offering a 10x increase in speed. It supports text lengths up to 8192 tokens and over 70 languages.

Model inputs and outputs

The gte-multilingual-base model takes in text as input and outputs dense embeddings. It can also generate sparse vectors in addition to the dense representations. The elastic dense embedding output helps reduce storage costs and improve execution efficiency while maintaining effectiveness on downstream tasks.

Inputs

- Text sequences up to 8192 tokens in length

Outputs

- Dense vector embeddings of size 768
- Sparse vector embeddings

Capabilities

The gte-multilingual-base model excels at multilingual text retrieval and representation tasks. It achieves state-of-the-art performance on the MTEB benchmark compared to models of similar size. The model's ability to handle long-form text up to 8192 tokens makes it suitable for applications that require processing lengthy documents or passages.

What can I use it for?

The gte-multilingual-base model is well-suited for a variety of text-based applications that require effective cross-lingual representations, such as:

- Multilingual information retrieval: The model's high performance on multilingual retrieval tasks makes it useful for building search engines or recommender systems that need to handle queries and documents in multiple languages.
- Semantic text similarity: The model's dense embeddings can be used to measure the semantic similarity between texts, enabling applications like paraphrase detection, document clustering, or content-based recommendation.
- Text reranking: The model's effectiveness on reranking tasks makes it applicable for improving the ranking of search results or other text-based content.

Things to try

One interesting aspect of the gte-multilingual-base model is its ability to generate sparse vector embeddings in addition to the dense representations. Sparse vectors can be more efficient to store and transmit, which could be beneficial for applications with storage or bandwidth constraints. Exploring the use of the sparse embeddings and comparing their performance to the dense ones could yield valuable insights.
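To see what the elastic dense output looks like in practice, here is a minimal sketch using plain transformers: it pools the CLS token, truncates the 768-dimensional vector to a smaller size, and normalizes before computing similarities. The repo id, the CLS pooling choice, and the truncation step are assumptions drawn from the model's published usage and should be checked against the model card.

```python
# Sketch: dense embeddings with elastic (truncated) dimensions,
# assuming the repo id Alibaba-NLP/gte-multilingual-base.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-multilingual-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

texts = [
    "what is the capital of China?",
    "北京是中国的首都。",
    "Paris is the capital of France.",
]
batch = tokenizer(texts, max_length=8192, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# CLS-token pooling; truncate to 256 dims to trade a little accuracy for storage.
dim = 256
embeddings = F.normalize(outputs.last_hidden_state[:, 0][:, :dim], p=2, dim=1)

# Cross-lingual similarity of the first text against the other two.
print(embeddings[:1] @ embeddings[1:].T)
```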

Read more

Updated 9/19/2024

🛠️

gte-large-en-v1.5

Alibaba-NLP

Total Score

80

The gte-large-en-v1.5 is a state-of-the-art text embedding model developed by Alibaba-NLP. It is part of the GTE (General Text Embeddings) model series, which are based on the BERT framework and trained on a large-scale corpus of relevant text pairs. This enables the GTE models to perform well on a variety of downstream tasks like information retrieval, semantic textual similarity, and text reranking. The gte-large-en-v1.5 model in particular achieves high scores on the MTEB benchmark, outperforming other popular text embedding models in the same size category. It also performs competitively on the LoCo long-context retrieval tests. Alibaba-NLP has also released other GTE models, including the gte-large-zh for Chinese text and the gte-small and gte-base for English.

Model Inputs and Outputs

The gte-large-en-v1.5 model takes in text inputs and generates dense vector representations, also known as text embeddings. These embeddings capture the semantic meaning of the input text, allowing them to be used in a variety of downstream NLP tasks.

Inputs

- Text data, up to 8192 tokens in length

Outputs

- 1024-dimensional text embeddings for each input

Capabilities

The gte-large-en-v1.5 model is particularly adept at tasks that involve understanding the semantic relationship between texts, such as information retrieval, text ranking, and semantic textual similarity. For example, it can be used to find relevant documents for a given query, or to identify similar paragraphs or sentences across a corpus.

What Can I Use It For?

The gte-large-en-v1.5 model can be a powerful tool for a variety of NLP applications. Some potential use cases include:

- Information retrieval: Use the model to find the most relevant documents or web pages for a given query.
- Semantic search: Leverage the model's ability to understand text semantics to build advanced search engines.
- Text ranking: Apply the model to rank and order text data, such as search results or recommendation lists.
- Text summarization: Combine the model with other techniques to generate concise summaries of longer texts.

Things to Try

One key advantage of the gte-large-en-v1.5 model is its ability to handle long-form text inputs, up to 8192 tokens. This makes it well-suited for tasks that involve analyzing and processing lengthy documents or passages. Try experimenting with the model on tasks that require understanding the overall meaning and context of longer texts, rather than just individual sentences or short snippets. You can also explore how the gte-large-en-v1.5 model compares to other text embedding models, such as the gte-small or gte-base, in terms of performance on your specific use cases. The tradeoffs between model size, speed, and accuracy may vary depending on your requirements.
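A minimal semantic textual similarity sketch is shown below, assuming the sentence-transformers library and the Hugging Face repo id Alibaba-NLP/gte-large-en-v1.5 (the repo ships custom code, hence trust_remote_code=True).

```python
# Semantic textual similarity sketch with gte-large-en-v1.5.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

sentences = [
    "The cat sits on the mat.",
    "A cat is resting on a rug.",
    "Quarterly revenue grew by twelve percent.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)  # 1024-dim vectors

# Pairwise cosine similarities; the first two sentences should score highest.
similarity = embeddings @ embeddings.T
print(similarity.round(3))
```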

Read more

Updated 5/30/2024

📈

gte-Qwen2-1.5B-instruct

Alibaba-NLP

Total Score

72

gte-Qwen2-1.5B-instruct is the latest model in the gte (General Text Embedding) model family. This model is built on the Qwen2-1.5B LLM and uses the same training data and strategies as the gte-Qwen2-7B-instruct model. The model incorporates several key advancements, including the integration of bidirectional attention mechanisms, instruction tuning, and comprehensive training across a vast, multilingual text corpus.

Model inputs and outputs

Inputs

- Text inputs of up to 32,000 tokens

Outputs

- Contextualized text embeddings with a dimension of 1,536

Capabilities

The gte-Qwen2-1.5B-instruct model has been trained to excel at a wide range of natural language processing tasks, including text classification, clustering, retrieval, and similarity measurement. Its robust contextual understanding and multilingual capabilities make it a powerful tool for various applications.

What can I use it for?

The gte-Qwen2-1.5B-instruct model can be used for a variety of applications, such as semantic search, text classification, and text similarity. Its large model size and extensive training make it suitable for tasks that require robust language understanding and generalization, such as document retrieval, question answering, and content recommendation.

Things to try

One interesting aspect of the gte-Qwen2-1.5B-instruct model is its ability to handle long-form text inputs. By supporting a maximum input length of 32,000 tokens, the model can be used for tasks that require processing of lengthy documents or passages, such as summarization or knowledge extraction from research papers or legal contracts.
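One way to exercise the clustering capability described above is to embed a few short texts and group them with k-means. The sketch below assumes scikit-learn and the repo id Alibaba-NLP/gte-Qwen2-1.5B-instruct; it is an illustration, not the model's official usage.

```python
# Clustering sketch: embed short texts and group them with k-means.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

texts = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is your refund policy?",
    "Can I get my money back after 30 days?",
]
embeddings = model.encode(texts, normalize_embeddings=True)  # 1,536-dim vectors

# Two clusters: account-access questions vs. refund questions.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(label, text)
```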

Read more

Updated 8/29/2024

🏅

gte-Qwen1.5-7B-instruct

Alibaba-NLP

Total Score

50

gte-Qwen1.5-7B-instruct is the latest addition to the gte embedding family from Alibaba-NLP. Built upon the robust natural language processing capabilities of the Qwen1.5-7B model, it incorporates several key advancements. These include the integration of bidirectional attention mechanisms to enrich its contextual understanding, as well as instruction tuning applied solely on the query side for streamlined efficiency. The model has also been comprehensively trained across a vast, multilingual text corpus spanning diverse domains and scenarios.

Model Inputs and Outputs

gte-Qwen1.5-7B-instruct is a powerful text embedding model that can handle a wide range of inputs, from short queries to longer text passages. The model supports a maximum input length of 32k tokens, making it suitable for a variety of natural language processing tasks.

Inputs

- Text sequences of up to 32,000 tokens

Outputs

- High-dimensional vector representations (embeddings) of the input text, with a dimension of 4096

Capabilities

The enhancements made to gte-Qwen1.5-7B-instruct allow it to excel at a variety of natural language processing tasks. Its robust contextual understanding and multilingual training make it a versatile tool for applications such as semantic search, text classification, and language generation.

What Can I Use It For?

gte-Qwen1.5-7B-instruct can be leveraged for a wide range of applications, from building personalized recommendations to powering multilingual chatbots. Its state-of-the-art performance on the MTEB benchmark, as demonstrated by the gte-base-en-v1.5 and gte-large-en-v1.5 models, makes it a compelling choice for embedding-based tasks.

Things to Try

Experiment with gte-Qwen1.5-7B-instruct to unlock its full potential. Utilize the model's robust contextual understanding and multilingual capabilities to tackle complex natural language processing challenges, such as cross-lingual information retrieval or multilingual sentiment analysis.
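As a sketch of cross-lingual information retrieval, the example below scores an English query against documents in several languages. The repo id Alibaba-NLP/gte-Qwen1.5-7B-instruct is assumed, and the prompt_name="query" argument mirrors the pattern used by the newer gte-Qwen2 instruct models; both should be verified against the model card.

```python
# Cross-lingual retrieval sketch: an English query against documents in
# several languages, ranked by cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Alibaba-NLP/gte-Qwen1.5-7B-instruct", trust_remote_code=True)

query = "renewable energy adoption in Europe"
documents = [
    "Der Ausbau der Windenergie in Deutschland beschleunigt sich.",  # German, on-topic
    "欧洲多国加快太阳能发电的部署。",                                   # Chinese, on-topic
    "The history of medieval castles in Scotland.",                  # English, off-topic
]

query_emb = model.encode([query], prompt_name="query", normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)

for doc, score in zip(documents, (query_emb @ doc_embs.T)[0]):
    print(f"{score:.3f}  {doc}")
```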

Read more

Updated 5/15/2024

📉

gte-base-en-v1.5

Alibaba-NLP

Total Score

48

The gte-base-en-v1.5 is a text embedding model developed by Alibaba-NLP. It is part of the GTE (General Text Embedding) series of models that aim to provide state-of-the-art performance on a variety of text representation tasks. The gte-base-en-v1.5 model is an upgraded version of the previous gte embeddings, with support for context lengths of up to 8192 tokens and enhanced model performance. It is built upon the transformer++ encoder backbone, which combines BERT, RoPE, and GLU components. Compared to similar models like gte-large-en-v1.5 and gte-Qwen1.5-7B-instruct, the gte-base-en-v1.5 is a smaller model with 137M parameters, but it still achieves state-of-the-art scores on the MTEB benchmark within the same model size category.

Model inputs and outputs

Inputs

- Text: The model accepts text inputs of up to 8192 tokens.

Outputs

- Text embeddings: The model outputs 768-dimensional text embeddings that capture the semantic meaning of the input text. These embeddings can be used for a variety of downstream tasks like text classification, retrieval, and similarity.

Capabilities

The gte-base-en-v1.5 model has demonstrated strong performance on a range of text representation tasks, including:

- Text classification: The model achieves high accuracy on benchmarks like GLUE and SuperGLUE, indicating its ability to capture relevant semantic features for classification tasks.
- Text retrieval: The model performs competitively on long-context retrieval tests like LoCo, showing its effectiveness in encoding relevant information for retrieval.
- Semantic similarity: The model can be used to compute meaningful similarity scores between text inputs, enabling applications like semantic search and recommendation.

What can I use it for?

The gte-base-en-v1.5 model can be a valuable tool for a variety of natural language processing applications. Some potential use cases include:

- Semantic search: Encode text queries and documents into a shared embedding space, enabling efficient and accurate semantic search over large text corpora.
- Content recommendation: Use the model's text embeddings to find similar content or products, powering personalized recommendation systems.
- Text analytics: Leverage the model's semantic understanding to extract insights, classify documents, or cluster text data in various business intelligence and knowledge management applications.

Things to try

One interesting aspect of the gte-base-en-v1.5 model is its ability to handle long-form text inputs. This can be particularly useful for tasks that involve processing lengthy documents, such as research papers, technical manuals, or legal contracts. Developers could experiment with using the model's long-context capabilities to improve the accuracy and robustness of their text processing pipelines. Additionally, the model's strong performance on a wide range of benchmarks suggests that it could be a valuable starting point for transfer learning or fine-tuning on domain-specific tasks. Practitioners could explore adapting the gte-base-en-v1.5 model to their particular use case, potentially unlocking even greater performance gains.
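For a concrete starting point, the sketch below runs a small semantic search with plain transformers, pooling the CLS token into a normalized 768-dimensional vector. The repo id Alibaba-NLP/gte-base-en-v1.5 and the pooling choice are assumptions based on the model's published usage.

```python
# Semantic-search sketch with plain transformers (CLS pooling),
# assuming the repo id Alibaba-NLP/gte-base-en-v1.5.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-base-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

corpus = [
    "GTE models are trained for general-purpose text embeddings.",
    "The Transformer architecture relies on self-attention.",
    "Basil grows best in warm, sunny conditions.",
]
query = "what makes transformers work?"

def embed(texts):
    # Tokenize, run the encoder, and pool the CLS token into a unit vector.
    batch = tokenizer(texts, max_length=8192, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return F.normalize(out.last_hidden_state[:, 0], p=2, dim=1)  # 768-dim

scores = (embed([query]) @ embed(corpus).T)[0]
best = scores.argmax().item()
print(f"Best match ({scores[best].item():.3f}): {corpus[best]}")
```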

Read more

Updated 9/19/2024