NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

2405.17428

Published 5/28/2024 by Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

cs.CL cs.AI cs.IR cs.LG

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Abstract

Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For model training, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. Combining these techniques, our NV-Embed model, using only publicly available data, has achieved a record-high score of 69.32, ranking No. 1 on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024), with 56 tasks, encompassing retrieval, reranking, classification, clustering, and semantic textual similarity tasks. Notably, our model also attains the highest score of 59.36 on 15 retrieval tasks in the MTEB benchmark (also known as BEIR). We will open-source the model at: https://huggingface.co/nvidia/NV-Embed-v1.

Create account to get full access

Overview

The paper proposes "NV-Embed," a technique for training large language models (LLMs) as generalist embedding models.
The authors claim that NV-Embed can improve the performance of LLMs on various downstream tasks compared to previous embedding models.
The paper explores techniques for training LLMs to produce high-quality embeddings that capture general semantic information.

Plain English Explanation

The paper introduces a new method called "NV-Embed" for training large language models (LLMs) to create powerful word embeddings. Word embeddings are numerical representations of words that capture their meaning and relationships. The authors argue that their NV-Embed technique can produce better embeddings than previous methods, which can then be used to improve the performance of various AI applications.

Typically, LLMs are trained to perform tasks like answering questions or generating text, but the authors found a way to instead train them to create high-quality word embeddings. These embeddings can then be used as input to other AI models, like those used for information retrieval or to help protect user privacy. The authors claim their technique leads to embeddings that are more "generalist" - meaning they capture a broader range of semantic information - compared to previous approaches.

Technical Explanation

The key innovation in NV-Embed is the use of a novel training objective that encourages the LLM to learn embeddings that are useful for a wide range of downstream tasks, rather than optimizing for a specific task. The authors introduce "neighborhood-visualization" (NV) loss, which aims to ensure that similar words have similar embeddings by minimizing the distance between a word's embedding and the embeddings of its neighboring words in the corpus.

The paper also explores techniques for scaling up the training of NV-Embed models, including distributed training and techniques to reduce the memory footprint. The authors evaluate NV-Embed on a variety of benchmarks, including word similarity, analogies, and probing tasks, and show that it outperforms previous state-of-the-art embedding models like BERT and LLM2Vec.

Critical Analysis

The authors provide a comprehensive evaluation of NV-Embed and demonstrate its advantages over previous approaches. However, the paper does not address some potential limitations or areas for further research. For example, the authors do not discuss the computational cost or training time required for NV-Embed compared to other methods, which could be an important practical consideration.

Additionally, the paper does not explore the impact of the NV-Embed embeddings on specific downstream tasks, such as information retrieval or language understanding. Further research could investigate how NV-Embed embeddings perform in real-world applications compared to other embedding techniques.

Conclusion

Overall, the NV-Embed technique presented in this paper represents an interesting advancement in the field of generalist embedding models. By training LLMs to produce high-quality, task-agnostic embeddings, the authors have developed a approach that could have significant implications for a wide range of AI applications. While the paper does not address all potential limitations, it provides a solid foundation for future research and development in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improving Text Embeddings with Large Language Models

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

In this paper, we introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps. Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets that are often constrained by task diversity and language coverage. We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages. We then fine-tune open-source decoder-only LLMs on the synthetic data using standard contrastive loss. Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data. Furthermore, when fine-tuned with a mixture of synthetic and labeled data, our model sets new state-of-the-art results on the BEIR and MTEB benchmarks.

6/3/2024

cs.CL cs.IR

🚀

Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

Nicholas Harris, Anand Butani, Syed Hashmy

Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counter-factual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment has shown promising results to improve embedding performance, particularly in certain domains. Hence, numerous limitations in the process of embedding can be avoided.

4/19/2024

cs.CL

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding

Mingrui Wu, Sheng Cao

Recently embedding-based retrieval or dense retrieval have shown state of the art results, compared with traditional sparse or bag-of-words based approaches. This paper introduces a model-agnostic doc-level embedding framework through large language model (LLM) augmentation. In addition, it also improves some important components in the retrieval model training process, such as negative sampling, loss function, etc. By implementing this LLM-augmented retrieval framework, we have been able to significantly improve the effectiveness of widely-used retriever models such as Bi-encoders (Contriever, DRAGON) and late-interaction models (ColBERTv2), thereby achieving state-of-the-art results on LoTTE datasets and BEIR datasets.

4/10/2024

cs.IR cs.AI

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.

4/10/2024

cs.CL cs.AI