Sensenova

Models by this creator

piccolo-large-zh

sensenova

Total Score: 59

The piccolo-large-zh model is a general-purpose Chinese text embedding model developed by the General Model Group at SenseTime Research. Inspired by E5 and GTE, piccolo is trained with a two-stage pipeline. In the first stage, the model is trained on 400 million weakly supervised Chinese text pairs collected from the internet, using a pair (text, text_pos) softmax contrastive loss. In the second stage, it is fine-tuned on 20 million human-labeled Chinese text pairs, using a triplet (text, text_pos, text_neg) contrastive loss. This approach enables piccolo-large-zh to capture rich semantic information and perform well on a variety of downstream tasks.

The model produces 1024-dimensional embeddings and accepts input sequences up to 512 tokens long. It outperforms other Chinese embedding models such as bge-large-zh and piccolo-base-zh on the C-MTEB benchmark, achieving an average score of 64.11 across 35 datasets.

Model Inputs and Outputs

Inputs
Text sequences up to 512 tokens long

Outputs
1024-dimensional text embeddings that capture the semantic meaning of the input text

Capabilities

The piccolo-large-zh model is highly capable at encoding Chinese text into semantic representations. These embeddings can be used for a variety of downstream tasks, such as:

Information retrieval: finding relevant documents or passages given a query.
Semantic search: finding similar documents or passages based on their semantic content.
Text classification: using the embeddings as features for training text classification models.
Paraphrase detection: identifying paraphrases of a given input text.

What Can I Use It For?

The piccolo-large-zh model can be used in a wide range of applications that involve Chinese text. Some potential use cases include:

Search and recommendation: build semantic search engines or recommendation systems for Chinese content.
Content clustering and organization: group related Chinese documents or passages by semantic similarity.
Text analytics and insights: extract meaningful insights from Chinese text data by leveraging the model's ability to capture semantic meaning.
Multilingual applications: combine piccolo-large-zh with other language models to build cross-lingual applications.

Things to Try

One notable aspect of piccolo-large-zh is its ability to handle input sequences of up to 512 tokens, which makes it well suited to tasks involving longer Chinese text, such as document retrieval or question answering. You could experiment with the model on such tasks and compare its performance with other Chinese language models. Another avenue to explore is fine-tuning piccolo-large-zh on domain-specific data, such as scientific literature or legal documents, to see whether it captures specialized semantic knowledge in those areas; this could improve performance on tasks like technical search or legal document classification.
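To make the retrieval use case concrete, here is a minimal sketch of encoding a query and candidate passages and ranking them by cosine similarity. It assumes the model is published on the Hugging Face Hub under the id "sensenova/piccolo-large-zh" and is loadable with the sentence-transformers library; the E5-style "查询:"/"结果:" retrieval prefixes are an assumption drawn from the model's E5-inspired training and should be checked against the model card.

```python
# A minimal sketch, not the official usage: the model id and the retrieval
# prefixes below are assumptions to verify against the published model card.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sensenova/piccolo-large-zh")  # assumed model id

query = "查询: 中国的首都是哪座城市？"              # query text with assumed retrieval prefix
passages = [
    "结果: 北京是中华人民共和国的首都。",            # passages with assumed retrieval prefix
    "结果: 上海是中国最大的经济中心之一。",
]

# Encode to 1024-dimensional, L2-normalized embeddings (inputs up to 512 tokens)
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors
scores = util.cos_sim(query_emb, passage_embs)
print(scores)  # higher score = more semantically relevant passage
```

The same encode-and-compare pattern covers the other listed tasks: for classification the embeddings become input features, and for paraphrase detection the similarity score between two encoded sentences is thresholded.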

Updated 5/28/2024

piccolo-large-zh-v2

sensenova

Total Score: 55

The piccolo-large-zh-v2 model is a Chinese text embedding model developed by the General Model Group at SenseTime Research. This upgraded version of the original Piccolo model aims to improve on general downstream fine-tuning methods. Piccolo2 primarily leverages an efficient multi-task hybrid loss training approach, effectively harnessing textual data and labels from diverse downstream tasks. Additionally, Piccolo2 scales up the embedding dimension and uses MRL (Matryoshka Representation Learning) training to support more flexible vector dimensions. Compared with similar models like piccolo-large-zh and Baichuan2-7B-Base, piccolo-large-zh-v2 relies on this multi-task hybrid loss and larger embedding dimensions to enhance its performance on downstream tasks.

Model inputs and outputs

Inputs
Text: the piccolo-large-zh-v2 model takes text inputs and generates text embeddings.

Outputs
Text embeddings: the model outputs fixed-size vector representations of the input text, which can be used for a variety of downstream NLP tasks such as text classification, retrieval, and similarity matching.

Capabilities

The piccolo-large-zh-v2 model has demonstrated strong performance on the C-MTEB benchmark, outperforming previous BERT models by around 1.9 points. Its key capabilities include:

Effective text representation learning through a multi-task hybrid loss training approach
Support for flexible vector dimensions through MRL training
Robust performance on a wide range of NLP tasks, including text retrieval, classification, and similarity matching

What can I use it for?

The piccolo-large-zh-v2 model can be used for a variety of NLP applications that require high-quality text embeddings, such as:

Semantic search and information retrieval
Text classification and clustering
Recommendation systems
Question-answering and dialog systems

The model's strong performance and efficient architecture make it a suitable choice for a wide range of applications that require high-quality text representations.

Things to try

One interesting aspect of piccolo-large-zh-v2 is its multi-task hybrid loss training approach, which lets the model leverage diverse datasets and task labels and leads to improved performance on downstream tasks. Researchers and developers could apply this training strategy to other NLP models or datasets to see whether similar gains can be achieved. In addition, the model's support for flexible vector dimensions through MRL training opens up more efficient and scalable text representation learning: you can adjust the vector dimension to find the right balance between index size, inference speed, and task-specific performance.
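The sketch below illustrates the flexible-dimension idea in the simplest MRL-style way: encode at full size, then keep only a prefix of each vector and re-normalize it. The model id "sensenova/piccolo-large-zh-v2", its compatibility with sentence-transformers, and the choice of 256 dimensions are assumptions; the dimensions actually supported by the MRL training should be confirmed from the model card.

```python
# A minimal sketch of MRL-style dimension reduction, under the assumptions
# stated above; not the model's official API.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sensenova/piccolo-large-zh-v2")  # assumed model id

texts = ["人工智能正在改变医疗行业。", "深度学习模型需要大量标注数据。"]
full_embs = model.encode(texts, normalize_embeddings=True)  # full-size embeddings

def truncate_and_renormalize(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and rescale to unit length."""
    truncated = embs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Trade a little accuracy for a much smaller index footprint (256 dims assumed)
small_embs = truncate_and_renormalize(full_embs, 256)
print(full_embs.shape, small_embs.shape)
```

Because the truncated vectors stay unit-length, the same cosine-similarity pipeline used for the full embeddings works unchanged, which is what makes the dimension a tunable deployment parameter rather than a retraining decision.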

Updated 8/7/2024