m3e-small

Maintainer: moka-ai

Last updated 9/6/2024

Property	Value
Run this model	Run on HuggingFace
API spec	View on HuggingFace
Github link	No Github link provided
Paper link	No paper link provided

Model overview

The m3e-small model is part of the M3E (Moka Massive Mixed Embedding) series of text embedding models developed by moka-ai. M3E models are Chinese text embedding models trained on over 22 million text samples, with capabilities spanning sentence-to-sentence, sentence-to-passage, and sentence-to-code tasks. The m3e-small model is the smaller version, with 24M parameters, while the m3e-base model has 110M parameters. Both models are reported to perform strongly on Chinese NLP benchmarks, outperforming models such as text2vec and openai-ada-002 (OpenAI's text-embedding-ada-002).

Model inputs and outputs

The M3E models are sentence embedding models, meaning they take in natural language sentences as input and produce vector representations as output. These vector representations can then be used for a variety of downstream tasks like text similarity, classification, and retrieval.

Inputs

Natural language sentences in Chinese

Outputs

Numerical vector representations of the input sentences, which capture the semantic meaning of the text
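
As a rough illustration of the input/output contract described above, the sketch below loads the model through sentence-transformers and compares two embeddings with cosine similarity. The Hugging Face model id moka-ai/m3e-small, the helper names, and the sample sentences are assumptions made for this example; the model is only downloaded if demo() is called.

```python
import math
from typing import Sequence


def load_encoder(model_name: str = "moka-ai/m3e-small"):
    """Load the model via sentence-transformers (assumed model id; downloads weights)."""
    # Imported lazily so the similarity helper works without the package installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def demo() -> None:
    """Encode a few Chinese sentences and print pairwise similarities."""
    model = load_encoder()
    sentences = ["今天天气很好", "今天是个晴天", "我喜欢吃苹果"]  # illustrative inputs
    vectors = model.encode(sentences)  # one vector per input sentence
    print(cosine_similarity(vectors[0], vectors[1]))
    print(cosine_similarity(vectors[0], vectors[2]))
```

Calling demo() requires the sentence-transformers package and network access on first use; cosine_similarity itself depends only on the standard library.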

Capabilities

The M3E models excel at capturing the semantic and contextual meaning of Chinese text. They have shown strong performance on tasks like natural language inference, sentence similarity, and information retrieval. For example, on the MTEB-zh benchmark, the m3e-base model achieved an average accuracy of 0.6157, outperforming text2vec (0.5755) and openai-ada-002 (0.5956).

What can I use it for?

The M3E models can be leveraged for a wide range of Chinese NLP applications, such as:

Semantic search: Use the sentence embeddings to perform efficient retrieval of relevant documents or passages from a large corpus.
Text classification: Fine-tune the models on labeled datasets to classify text into different categories.
Recommendation systems: Utilize the sentence representations to compute semantic similarity between items and provide personalized recommendations.
Chatbots and dialogue systems: Use the M3E embeddings to match user utterances to intents and retrieve relevant responses.

Popular libraries and frameworks such as sentence-transformers, chroma, guidance, and semantic-kernel can use the M3E models for these types of applications.
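
The semantic search use case listed above can be sketched as a brute-force nearest-neighbour lookup over precomputed embeddings. This is a minimal illustration, not how any of the named libraries implement search: the function names are hypothetical, and the vectors are assumed to come from an encoder such as m3e-small.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def build_index(corpus_vecs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pre-normalize corpus embeddings so a dot product equals cosine similarity."""
    return [normalize(v) for v in corpus_vecs]


def search(index: List[List[float]], query_vec: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return up to k (document index, similarity) pairs, most similar first."""
    q = normalize(query_vec)
    scored = [(i, sum(a * b for a, b in zip(q, doc))) for i, doc in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is usually adequate for a few thousand documents; larger corpora typically call for a vector store or an approximate nearest-neighbour index.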

Things to try

One interesting aspect of the M3E models is their ability to be fine-tuned on domain-specific datasets using the uniem library. By fine-tuning the m3e-small model on the STS-B dataset, for example, you can further improve its performance on sentence similarity tasks. This flexibility allows the M3E models to be adapted for a wide range of use cases.
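
A fine-tuning run of the kind described above might look like the following sketch. The dataset id shibing624/nli_zh with its STS-B subset, the model id moka-ai/m3e-small, and the FineTuner.from_pretrained and run calls are assumptions based on the uniem project's published example, so check them against the uniem version you install.

```python
def finetune_m3e_small_on_stsb(epochs: int = 1):
    """Fine-tune m3e-small on an STS-B style dataset with uniem (assumed API).

    Requires the `uniem` and `datasets` packages and downloads both the
    model weights and the dataset when called.
    """
    # Imported lazily so that defining this function needs no extra packages.
    from datasets import load_dataset
    from uniem.finetuner import FineTuner

    # Assumed dataset id and subset name for a Chinese STS-B corpus.
    dataset = load_dataset("shibing624/nli_zh", "STS-B")

    # Assumed uniem entry point: wrap the pretrained model together with the dataset.
    finetuner = FineTuner.from_pretrained("moka-ai/m3e-small", dataset=dataset)

    # Returns whatever the library's run() call returns.
    return finetuner.run(epochs=epochs)
```

One epoch is only a starting point; the number of epochs and other training settings would need tuning for a real dataset.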

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

m3e-base

moka-ai

833

The m3e-base model is part of the M3E (Moka Massive Mixed Embedding) series of models developed by Moka AI. M3E models are designed to be versatile, supporting a variety of natural language processing tasks such as dense retrieval, multi-vector retrieval, and sparse retrieval. The m3e-base model has 110 million parameters and a hidden size of 768. M3E models are trained on a massive 2.2 billion+ token corpus, making them well-suited for general-purpose language understanding. The models have demonstrated strong performance on benchmarks like MTEB-zh, outperforming models like openai-ada-002 on tasks like sentence-to-sentence (s2s) accuracy and sentence-to-passage (s2p) nDCG@10. Similar models in the M3E series include the m3e-small and m3e-large versions, which have different parameter sizes and performance characteristics depending on the task.

Model inputs and outputs

Inputs

Text: The m3e-base model can accept text inputs of varying lengths, up to a maximum of 8,192 tokens.

Outputs

Embeddings: The model outputs dense vector representations of the input text, which can be used for a variety of downstream tasks such as similarity search, text classification, and retrieval.

Capabilities

The m3e-base model has demonstrated strong performance on a range of natural language processing tasks, including:

Sentence similarity: The model can be used to compute the semantic similarity between sentences, which is useful for applications like paraphrase detection and text summarization.
Text classification: The embeddings produced by the model can be used as features for training text classification models, such as for sentiment analysis or topic classification.
Retrieval: The model's dense and sparse retrieval capabilities make it well-suited for building search engines and question-answering systems.

What can I use it for?

The versatility of the m3e-base model makes it a valuable tool for a wide range of natural language processing applications. Some potential use cases include:

Semantic search: Use the model's dense embeddings to build a semantic search engine, allowing users to find relevant information based on the meaning of their queries rather than just keyword matching.
Personalized recommendations: Leverage the model's strong text understanding capabilities to build personalized recommendation systems, such as for content or product recommendations.
Chatbots and conversational AI: Integrate the model into chatbot or virtual assistant applications to enable more natural and contextual language understanding and generation.

Things to try

One interesting aspect of the m3e-base model is its ability to perform both dense and sparse retrieval. This hybrid approach can be beneficial for building more robust and accurate retrieval systems. To experiment with the model's retrieval capabilities, you can try integrating it with tools like chroma, guidance, and semantic-kernel. These tools provide abstractions and utilities for building search and question-answering applications using large language models like m3e-base. Additionally, the uniem library provides a convenient interface for fine-tuning the m3e-base model on domain-specific datasets, which can further improve its performance on your specific use case.

Text-to-Text

m3e-large

moka-ai

185

The m3e-large model is part of the M3E (Moka Massive Mixed Embedding) series of text embedding models developed by the Moka AI team. The M3E models are large-scale multilingual text embedding models that can be used for a variety of natural language processing tasks. The m3e-large model is the largest in the series, with 340 million parameters and a 768-dimensional embedding size. The M3E models are designed to provide strong performance on a range of benchmarks, including the MTEB-zh Chinese language benchmark. Compared to similar models like multilingual-e5-large, bge-large-en-v1.5, and moe-llava, the M3E models leverage a massive, mixed-domain training dataset to learn rich and generalizable text representations. The m3e-base model in this series has also shown strong performance, outperforming OpenAI's text-embedding-ada-002 model on several MTEB-zh tasks.

Model inputs and outputs

Inputs

Text sequences: The m3e-large model can accept single sentences or longer text passages as input.

Outputs

Text embeddings: The model outputs fixed-length vector representations (embeddings) of the input text. These embeddings can be used for a variety of downstream tasks, such as semantic search, text classification, and clustering.

Capabilities

The m3e-large model demonstrates strong performance on a variety of text-based tasks, especially those involving semantic understanding and retrieval. For example, it has achieved a 0.6231 accuracy score on the sentence-to-sentence (s2s) task and a 0.7974 NDCG@10 score on the sentence-to-passage (s2p) task in the MTEB-zh benchmark.

What can I use it for?

The m3e-large model can be used for a wide range of natural language processing applications, such as:

Semantic search: The rich text embeddings produced by the model can be used to build powerful semantic search engines, allowing users to find relevant information based on the meaning of their queries rather than just keyword matching.
Text classification: The model's embeddings can be used as features for training high-performance text classification models, such as those for sentiment analysis, topic categorization, or intent detection.
Recommendation systems: The semantic understanding of the m3e-large model can be leveraged to build advanced recommendation systems that suggest relevant content or products based on user preferences and behavior.

Things to try

One interesting aspect of the m3e-large model is its potential for domain-specific fine-tuning. By further training the model on task-specific data using tools like the uniem library, you can likely achieve even stronger performance on specialized applications. Additionally, the model's large size and diverse training data make it a promising starting point for exploring few-shot and zero-shot learning approaches, where the model can leverage its broad knowledge to quickly adapt to new tasks with limited additional training.

Text-to-Text

Phi-3.5-MoE-instruct

microsoft

473

Phi-3.5-MoE-instruct is a lightweight, state-of-the-art open model built upon the datasets used for Phi-3 (synthetic data and filtered publicly available documents), with a focus on very high-quality, reasoning-dense data. The model supports multiple languages and comes with a 128K context length (in tokens). The model underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures. The Phi-3.5-MoE-instruct model is part of the Phi-3 model family, which also includes the Phi-3.5-mini-instruct and Phi-3.5-vision-instruct models.

Model inputs and outputs

Inputs

Text: The model is best suited for prompts using the chat format.

Outputs

Generated text: The model generates text in response to the input.

Capabilities

The Phi-3.5-MoE-instruct model is designed for strong reasoning, especially in areas like code, math, and logic. It performs well in memory/compute-constrained environments and latency-bound scenarios.

What can I use it for?

The Phi-3.5-MoE-instruct model is intended for commercial and research use in multiple languages. It can be used as a building block for general-purpose AI systems and applications that involve memory/compute-constrained environments, latency-bound scenarios, or a need for strong reasoning capabilities.

Things to try

You can try the Phi-3.5-MoE-instruct model using the Try It link provided. This allows you to interactively experiment with the model and see its capabilities in action.

Text-to-Text

piccolo-large-zh

sensenova

The piccolo-large-zh is a general text embedding model for Chinese, powered by the General Model Group from SenseTime Research. Inspired by E5 and GTE, piccolo is trained using a two-stage pipeline. First, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, the model is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss. This approach enables piccolo-large-zh to capture rich semantic information and perform well on a variety of downstream tasks. The piccolo-large-zh model has 1024 embedding dimensions and can handle input sequences up to 512 tokens long. It outperforms other Chinese embedding models like bge-large-zh and piccolo-base-zh on the C-MTEB benchmark, achieving an average score of 64.11 across 35 datasets.

Model inputs and outputs

Inputs

Text sequences up to 512 tokens long

Outputs

1024-dimensional text embeddings that capture the semantic meaning of the input text

Capabilities

The piccolo-large-zh model is highly capable at encoding Chinese text into semantic representations. These embeddings can be used for a variety of downstream tasks, such as:

Information retrieval: The embeddings can be used to find relevant documents or passages given a query.
Semantic search: The model can be used to find similar documents or passages based on their semantic content.
Text classification: The embeddings can be used as features for training text classification models.
Paraphrase detection: The model can be used to identify paraphrases of a given input text.

What can I use it for?

The piccolo-large-zh model can be used in a wide range of applications that involve working with Chinese text. Some potential use cases include:

Search and recommendation: Use the embeddings to build semantic search engines or recommendation systems for Chinese content.
Content clustering and organization: Group related Chinese documents or passages based on their semantic similarity.
Text analytics and insights: Extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
Multilingual applications: Combine piccolo-large-zh with other language models to build cross-lingual applications.

Things to try

One interesting aspect of the piccolo-large-zh model is its ability to handle long input sequences, up to 512 tokens. This makes it well-suited for tasks involving long-form Chinese text, such as document retrieval or question answering. You could try experimenting with the model's performance on such tasks and see how it compares to other Chinese language models.

Another interesting avenue to explore would be to fine-tune the piccolo-large-zh model on domain-specific data, such as scientific literature or legal documents, to see if it can capture specialized semantic knowledge in those areas. This could lead to improved performance on tasks like technical search or legal document classification.
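
The pair softmax contrastive loss mentioned for piccolo's first training stage can be illustrated with a small sketch. It assumes an in-batch-negatives formulation, where each text's positive is the matching entry and every other positive in the batch acts as a negative; the temperature value, the function names, and the use of raw dot products as similarity scores are assumptions for this example, not details taken from the piccolo training code.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product used as the similarity score between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def pair_softmax_contrastive_loss(
    text_vecs: Sequence[Sequence[float]],
    pos_vecs: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, purely illustrative
) -> float:
    """Mean cross-entropy over a batch where pos_vecs[i] is the positive for text_vecs[i]."""
    if len(text_vecs) != len(pos_vecs) or not text_vecs:
        raise ValueError("need two non-empty batches of equal size")
    total = 0.0
    for i, text in enumerate(text_vecs):
        # Similarity of this text to every positive in the batch.
        logits = [dot(text, pos) / temperature for pos in pos_vecs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true positive.
        total += log_sum_exp - logits[i]
    return total / len(text_vecs)
```

The loss is small when each text is most similar to its own positive and grows when another item in the batch scores higher.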

Text-to-Text