bkai-foundation-models

Models by this creator


vietnamese-bi-encoder


The vietnamese-bi-encoder model is a sentence-transformers model from the bkai-foundation-models team. It maps sentences and paragraphs into a 768-dimensional dense vector space, which is useful for tasks like clustering and semantic search. The model was trained on a merged dataset that includes MS MARCO (translated into Vietnamese), SQuAD v2 (translated into Vietnamese), and 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge, using phobert-base-v2 as its pre-trained backbone. Evaluated on the remaining 20% of the Legal Text Retrieval Zalo 2021 dataset, it outperforms the Vietnamese-SBERT model, reaching Accuracy@1 of 73.28%, Accuracy@10 of 93.59%, and MRR@10 of 80.73%. This suggests the vietnamese-bi-encoder model is a strong option for Vietnamese sentence embedding tasks.

Model inputs and outputs

Inputs

* **Text**: Vietnamese text that has been pre-segmented into words. The maximum sequence length is 128 tokens.

Outputs

* **Sentence embeddings**: A 768-dimensional dense vector representation of the input text, capturing the semantic meaning of the sentence or paragraph.

Capabilities

The vietnamese-bi-encoder model can be used for a variety of tasks that involve processing Vietnamese text, such as:

* **Semantic search**: The sentence embeddings can be used to find semantically similar documents or passages in a corpus.
* **Text clustering**: The vector representations can be used to group similar Vietnamese documents or paragraphs together.
* **Paraphrase identification**: The model can be used to decide whether two Vietnamese sentences have similar meanings.

Short code sketches illustrating semantic search and clustering with the model follow the Things to try section below.

What can I use it for?

The vietnamese-bi-encoder model could be useful for companies or researchers working on Vietnamese natural language processing tasks. Some potential use cases include:

* **Enterprise search**: Indexing Vietnamese documents and enabling semantic search within a company's knowledge base.
* **Recommendation systems**: Clustering Vietnamese content to improve personalized recommendations for users.
* **Question answering**: Matching questions against the most relevant answers in a Vietnamese FAQ or knowledge base using the sentence embeddings.

Things to try

One interesting aspect of the vietnamese-bi-encoder model is its use of phobert-base-v2 as the pre-trained backbone: because the underlying language model was trained specifically on Vietnamese data, the encoder is well suited to Vietnamese-language text. Researchers and developers could fine-tune the vietnamese-bi-encoder model on additional Vietnamese datasets to see whether its performance on specific tasks improves further, or compare it against other Vietnamese sentence embedding models, such as Vietnamese-SBERT, to better understand its relative strengths and weaknesses.
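As an illustration of the inputs and outputs described above, the following sketch encodes a few Vietnamese sentences and ranks them against a query by cosine similarity. It assumes the model is published on Hugging Face under the id bkai-foundation-models/vietnamese-bi-encoder and uses pyvi's ViTokenizer for the required word segmentation; any Vietnamese word segmenter would work, and the example sentences are placeholders.

```python
# Minimal semantic-search sketch. Assumptions: the model is hosted on Hugging Face
# as "bkai-foundation-models/vietnamese-bi-encoder" and pyvi handles word segmentation.
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bkai-foundation-models/vietnamese-bi-encoder")

corpus = [
    "Hà Nội là thủ đô của Việt Nam.",
    "Phở là một món ăn truyền thống của người Việt.",
    "Bóng đá là môn thể thao được yêu thích nhất.",
]

# The model expects word-segmented input (max 128 tokens), so segment first.
corpus_seg = [ViTokenizer.tokenize(s) for s in corpus]
corpus_emb = model.encode(corpus_seg, convert_to_tensor=True)  # shape: (3, 768)

query = "Thủ đô của Việt Nam là thành phố nào?"
query_emb = model.encode(ViTokenizer.tokenize(query), convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for idx in scores.argsort(descending=True):
    print(f"{float(scores[idx]):.3f}  {corpus[int(idx)]}")
```

The same encode-then-compare pattern underlies the paraphrase identification and question answering use cases mentioned above: embed both texts and threshold or rank their cosine similarity.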
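The embeddings can also feed a standard clustering algorithm, as in the text clustering and recommendation scenarios above. The sketch below, which again assumes the Hugging Face id bkai-foundation-models/vietnamese-bi-encoder and pyvi for segmentation, groups a handful of illustrative sentences with scikit-learn's KMeans; the sentences and the number of clusters are examples only.

```python
# Clustering sketch: group Vietnamese sentences by embedding similarity.
# Assumptions as above: Hugging Face model id and pyvi-based word segmentation.
from pyvi import ViTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("bkai-foundation-models/vietnamese-bi-encoder")

sentences = [
    "Giá vàng hôm nay tăng mạnh.",
    "Thị trường chứng khoán giảm điểm.",
    "Đội tuyển Việt Nam thắng trận giao hữu.",
    "Cầu thủ ghi bàn ở phút cuối cùng.",
]

# Segment words, then encode to 768-dimensional vectors (returned as a NumPy array).
embeddings = model.encode([ViTokenizer.tokenize(s) for s in sentences])

# Two illustrative clusters: finance news vs. football news.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```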


Updated 9/6/2024