AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation

Read original: arXiv:2212.08841 - Published 9/19/2024 by Rui Meng, Ye Liu, Semih Yavuz, Divyansh Agarwal, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, Yingbo Zhou

Overview

Dense retrieval models have made significant progress in text retrieval and open-domain question answering.
Most of these achievements have relied heavily on extensive human-annotated supervision.
This study aims to develop unsupervised methods for improving dense retrieval models.
Two annotation-free and scalable training approaches are proposed: query extraction and transferred query generation.

Plain English Explanation

Information retrieval is the process of finding relevant documents or content in response to a user's query. Dense retrieval models, which encode queries and documents as vectors and match them by similarity, have become very good at this task, often outperforming traditional keyword-based search methods. However, these models typically require a lot of labeled data, where humans have carefully annotated which documents are relevant to which queries.

This study explores ways to train dense retrieval models without needing all that labeled data. The researchers propose two new methods:

Query extraction: Selecting important sentences or phrases from a document to use as a "pseudo-query" that the model can learn from.
Transferred query generation: Using other AI models, like summarization systems, to automatically generate queries that are relevant to a document.

By using these techniques, the researchers were able to train dense retrieval models that perform as well as, or even better than, several strong dense retrieval baselines. This is an important step forward, as it makes these powerful AI systems more accessible and scalable, without requiring extensive manual annotation.

Technical Explanation

The key technical contributions of this work are the proposed query extraction and transferred query generation methods for unsupervised dense retrieval model training.

The query extraction approach selects salient spans from the original document to generate "pseudo-queries" that can be paired with the document for training. This leverages the intuition that important sentences or phrases in a document are likely to be good queries that a user would search for.
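The idea of pairing a document with a span taken from itself can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the salience heuristic (average document-level frequency of a sentence's content words), the tiny stopword list, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "it", "on", "for", "that", "this", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(document):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def extract_pseudo_query(document):
    """Return the sentence whose content words are most frequent in the document."""
    sentences = split_sentences(document)
    if not sentences:
        raise ValueError("document is empty")
    doc_counts = Counter(t for t in tokenize(document) if t not in STOPWORDS)

    def salience(sentence):
        tokens = [t for t in tokenize(sentence) if t not in STOPWORDS]
        if not tokens:
            return 0.0
        # Average document-level frequency of the sentence's content words.
        return sum(doc_counts[t] for t in tokens) / len(tokens)

    return max(sentences, key=salience)


def make_training_pair(document):
    """Pair an extracted pseudo-query with its source document."""
    return extract_pseudo_query(document), document
```

Any other span selector (a random crop, the title, a keyphrase extractor) could be swapped in for `extract_pseudo_query`; the only requirement is that the output comes from the document itself, so no annotation is needed.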

The transferred query generation method utilizes pre-trained generation models, such as those for summarization, to automatically produce relevant queries for a given document. The researchers show that these generated queries can be effectively used to train dense retrieval models.
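A rough sketch of how generated queries could be turned into training pairs is shown below. The generator is deliberately pluggable: `lead_words_generator` is a hypothetical stand-in that simply returns the first few words of a document, so the example runs without downloading a model. In practice one would presumably pass in a function that wraps a pre-trained summarization model; that wrapper, and every name here, is an assumption for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def lead_words_generator(document: str, max_words: int = 8) -> str:
    """Stand-in generator: return the first few words of the document.

    Hypothetical placeholder for a pre-trained model such as a summarizer.
    """
    return " ".join(document.split()[:max_words])


def build_pseudo_pairs(
    documents: Iterable[str],
    generate: Callable[[str], str] = lead_words_generator,
) -> List[Tuple[str, str]]:
    """Create (pseudo-query, document) pairs, skipping empty generations."""
    pairs = []
    for doc in documents:
        query = generate(doc).strip()
        if query:  # drop documents for which nothing usable was generated
            pairs.append((query, doc))
    return pairs
```

Keeping the generator behind a plain `Callable[[str], str]` interface means different source tasks (summarization, title generation, and so on) can be compared without changing the pairing code.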

Through extensive experiments, the researchers demonstrate that models trained using these unsupervised augmentation methods can achieve performance comparable to, or better than, multiple strong dense retrieval baselines. Furthermore, combining the two strategies leads to further improvements in unsupervised dense retrieval, unsupervised domain adaptation, and supervised fine-tuning, benchmarked on both BEIR and ODQA datasets.
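This summary does not say which training objective is used on the pseudo query-document pairs. A common choice for dense retrievers is a contrastive loss with in-batch negatives, and the sketch below assumes that setup purely for illustration. The vectors stand in for encoder outputs, and the function names and the `temperature` parameter are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=1.0):
    """Mean cross-entropy over a batch of (query, document) embedding pairs.

    Query i's positive is document i; every other document in the batch
    acts as a negative.
    """
    if len(query_vecs) != len(doc_vecs) or not query_vecs:
        raise ValueError("need a non-empty batch of aligned pairs")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(query_vecs)
```

The loss is lower when each pseudo-query scores highest against its own document, which is the signal that lets automatically built pairs substitute for human relevance labels.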

Critical Analysis

The researchers provide a thorough evaluation of their proposed methods, demonstrating their effectiveness across a range of tasks and datasets. However, some potential limitations and areas for further research are worth considering:

The performance of the query generation models, particularly the transferred generation approach, may be dependent on the quality and relevance of the pre-trained models used. Further investigation into the impact of the generation model's domain and capabilities would be valuable.
While the unsupervised methods achieve strong results, there may still be room for improvement compared to models trained on high-quality human-annotated data. Exploring hybrid approaches that combine unsupervised and supervised techniques could potentially lead to even better performance.
The researchers focused on dense retrieval, but it would be interesting to see how these techniques could be applied to other information retrieval tasks, such as semantic search or document ranking.

Overall, this work makes a significant contribution by demonstrating the viability of unsupervised dense retrieval, which has the potential to make these powerful AI systems more accessible and scalable.

Conclusion

This study presents two novel unsupervised approaches for training dense retrieval models: query extraction and transferred query generation. These methods enable annotation-free and scalable training by creating pseudo query-document pairs, without relying on extensive human-labeled data.

The researchers demonstrate that models trained using these unsupervised augmentation techniques can achieve comparable or even better performance than strong dense retrieval baselines. Furthermore, combining the two strategies leads to further improvements, resulting in superior performance across a range of benchmarks.

This work represents a significant advancement in the field of information retrieval, as it paves the way for more accessible and scalable dense retrieval systems. By reducing the need for human-annotated data, these techniques have the potential to make powerful AI-driven search and question-answering capabilities more widely available.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation

Rui Meng, Ye Liu, Semih Yavuz, Divyansh Agarwal, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, Yingbo Zhou

Dense retrievers have made significant strides in text retrieval and open-domain question answering. However, most of these achievements have relied heavily on extensive human-annotated supervision. In this study, we aim to develop unsupervised methods for improving dense retrieval models. We propose two approaches that enable annotation-free and scalable training by creating pseudo query-document pairs: query extraction and transferred query generation. The query extraction method involves selecting salient spans from the original document to generate pseudo queries. On the other hand, the transferred query generation method utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries. Through extensive experimentation, we demonstrate that models trained using these augmentation methods can achieve comparable, if not better, performance than multiple strong dense baselines. Moreover, combining these strategies leads to further improvements, resulting in superior performance of unsupervised dense retrieval, unsupervised domain adaptation and supervised finetuning, benchmarked on both BEIR and ODQA datasets. Code and datasets are publicly available at https://github.com/salesforce/AugTriever.

9/19/2024

QAEA-DR: A Unified Text Augmentation Framework for Dense Retrieval

Hongming Tan, Shaoxiong Zhan, Hai Lin, Hai-Tao Zheng, Wai Kin (Victor) Chan

In dense retrieval, embedding long texts into dense vectors can result in information loss, leading to inaccurate query-text matching. Additionally, low-quality texts with excessive noise or sparse key information are unlikely to align well with relevant queries. Recent studies mainly focus on improving the sentence embedding model or retrieval process. In this work, we introduce a novel text augmentation framework for dense retrieval. This framework transforms raw documents into information-dense text formats, which supplement the original texts to effectively address the aforementioned issues without modifying embedding or retrieval methodologies. Two text representations are generated via large language models (LLMs) zero-shot prompting: question-answer pairs and element-driven events. We term this approach QAEA-DR: unifying question-answer generation and event extraction in a text augmentation framework for dense retrieval. To further enhance the quality of generated texts, a scoring-based evaluation and regeneration mechanism is introduced in LLM prompting. Our QAEA-DR model has a positive impact on dense retrieval, supported by both theoretical analysis and empirical experiments.

7/30/2024

Retrieval Augmented Generation for Domain-specific Question Answering

Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, Varun Kotte

Question answering (QA) has become an important application in the advanced development of large language models. General pre-trained large language models for question-answering are not trained to properly understand the knowledge or terminology for a specific domain, such as finance, healthcare, education, and customer service for a product. To better cater to domain-specific understanding, we build an in-house question-answering system for Adobe products. We propose a novel framework to compile a large question-answer database and develop the approach for retrieval-aware finetuning of a Large Language model. We showcase that fine-tuning the retriever leads to major improvements in the final generation. Our overall approach reduces hallucinations during generation while keeping in context the latest retrieval information for contextual grounding.

5/30/2024

Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation

Chaoyi Ai, Yong Jiang, Shen Huang, Pengjun Xie, Kewei Tu

Named entity recognition (NER) models often struggle with noisy inputs, such as those with spelling mistakes or errors generated by Optical Character Recognition processes, and learning a robust NER model is challenging. Existing robust NER models utilize both noisy text and its corresponding gold text for training, which is infeasible in many real-world applications in which gold text is not available. In this paper, we consider a more realistic setting in which only noisy text and its NER labels are available. We propose to retrieve relevant text of the noisy text from a knowledge corpus and use it to enhance the representation of the original noisy input. We design three retrieval methods: sparse retrieval based on lexicon similarity, dense retrieval based on semantic similarity, and self-retrieval based on task-specific text. After retrieving relevant text, we concatenate the retrieved text with the original noisy text and encode them with a transformer network, utilizing self-attention to enhance the contextual token representations of the noisy text using the retrieved text. We further employ a multi-view training framework that improves robust NER without retrieving text during inference. Experiments show that our retrieval-augmented model achieves significant improvements in various noisy NER settings.

7/29/2024