DAPR: A Benchmark on Document-Aware Passage Retrieval

Read original: arXiv:2305.13915 - Published 6/11/2024 by Kexin Wang, Nils Reimers, Iryna Gurevych


Overview

The paper focuses on the task of Document-Aware Passage Retrieval (DAPR), where the goal is to find relevant passages within long documents like Wikipedia articles or research papers.
The authors find that current state-of-the-art passage retrieval models struggle with this task, as 53.5% of their errors are due to a lack of understanding of the document context.
To address this, the authors create a new benchmark for DAPR and experiment with extending existing passage retrieval models with document context information.

Plain English Explanation

When searching for information, people often want to find a specific section or passage within a long document, like a Wikipedia article or research paper. This paper explores a new task called Document-Aware Passage Retrieval (DAPR) to address this need.

The authors found that existing passage retrieval models, which are designed to work well on short texts, struggle with long documents. Over half of the errors made by these models are due to a lack of understanding of the full document context.

To better support this type of search, the authors created a new benchmark dataset for DAPR, drawing from a variety of document sources. They then tested ways to improve passage retrieval by incorporating document-level information, such as prepending the document title to the passage or using a hybrid approach that combines passage-level and document-level scoring.

The results show that while the hybrid approach performs best overall, it struggles on the most challenging queries that truly require understanding the document context. The contextualized passage representations, on the other hand, are better able to handle these difficult cases, though they don't perform as strongly overall.

By establishing this new benchmark and experimenting with different approaches, the authors hope to spur further research into building more effective retrieval systems for finding relevant passages within long documents.

Technical Explanation

The paper proposes the task of Document-Aware Passage Retrieval (DAPR), where the goal is to find relevant passages within long documents like Wikipedia articles or research papers, rather than just ranking short texts.

Through an analysis of the errors made by state-of-the-art passage retrieval models, the authors find that the majority of the errors (53.5%) are due to missing document context. This motivates them to create a new benchmark for DAPR, drawing from multiple datasets across heterogeneous domains.

To address the document context challenge, the authors experiment with two approaches:

(1) Hybrid retrieval with BM25, which combines passage-level and document-level scoring.
(2) Contextualized passage representations, which inform the passage representation with document context, for example by prepending the document title to the passage.
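The hybrid idea can be illustrated with a minimal sketch. The summary does not say how the two scores are fused, so everything below is an assumption: the passage scores stand in for the output of a hypothetical neural retriever, the document scores come from a small from-scratch Okapi BM25, and the fusion is a simple weighted sum of min-max normalized scores. The function names `bm25_scores`, `min_max`, and `hybrid_scores` are illustrative and are not taken from the paper's code.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(passage_scores, doc_scores, passage_to_doc, weight=0.5):
    """Weighted sum of a passage's own score and its parent document's score.

    The weighted-sum fusion is an assumption made for illustration.
    """
    p = min_max(passage_scores)
    d = min_max(doc_scores)
    return [
        weight * p[i] + (1 - weight) * d[passage_to_doc[i]]
        for i in range(len(p))
    ]


# Toy data: two documents, three passages (passage i belongs to passage_to_doc[i]).
docs = [
    "venus williams tennis career grand slam titles".split(),
    "python programming language history".split(),
]
passage_to_doc = [0, 1, 1]
# Hypothetical scores from a passage-level neural retriever.
passage_scores = [0.60, 0.62, 0.30]

doc_scores = bm25_scores("venus williams titles".split(), docs)
fused = hybrid_scores(passage_scores, doc_scores, passage_to_doc)
best = max(range(len(fused)), key=fused.__getitem__)
print(best, fused)
```

In this toy case the passage-level scores alone slightly prefer passage 1, while the document-level BM25 signal moves passage 0 to the top because only its parent document matches the query terms.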

The results show that the hybrid approach performs the strongest on the overall benchmark, which contains a mixture of easy and hard queries. However, it completely fails on the hard queries that specifically require document-context understanding.

In contrast, the contextualized passage representations achieve good improvement on these hard queries, but perform rather poorly overall compared to the hybrid approach. The authors suggest that further research is needed to develop retrieval systems that can effectively leverage document-level information.
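Prepending the document title is the simplest form of contextualized passage representation mentioned above, and it can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `" [SEP] "` separator, the dictionary layout of `documents`, and the function names `contextualize` and `build_index_inputs` are all assumptions. The resulting strings would then be passed to whatever passage encoder is in use.

```python
def contextualize(title, passage, sep=" [SEP] "):
    """Prepend the document title to a passage; return the passage unchanged if the title is empty."""
    title = title.strip()
    return f"{title}{sep}{passage}" if title else passage


def build_index_inputs(documents):
    """Return (doc_id, passage_id, text) rows, where text carries the title as context."""
    rows = []
    for doc_id, doc in enumerate(documents):
        for passage_id, passage in enumerate(doc["passages"]):
            rows.append((doc_id, passage_id, contextualize(doc["title"], passage)))
    return rows


# Hypothetical toy corpus: the passage never names its subject, the title does.
documents = [
    {
        "title": "Venus Williams",
        "passages": ["She has won seven Grand Slam singles titles."],
    },
]

for row in build_index_inputs(documents):
    print(row)
```

A passage such as the one in the toy corpus refers to its subject only by pronoun, so without the title a query that names the person has little to match against; this is the kind of query the summary describes as requiring document context.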

Critical Analysis

The paper makes a valuable contribution by defining the new task of Document-Aware Passage Retrieval (DAPR) and creating a corresponding benchmark dataset. This addresses an important real-world need, as users often want to find relevant information within long, complex documents.

The authors' analysis of the errors made by existing passage retrieval models is insightful, highlighting the limitations of these systems when it comes to understanding document context. Their experiments with hybrid retrieval and contextualized passage representations provide a solid starting point for addressing this challenge.

However, the results also reveal the difficulty of the DAPR task. While the hybrid approach performs best overall, its complete failure on the hard queries suggests that more sophisticated techniques are needed to truly leverage document-level information. The contextualized representations, while better suited for the hard cases, still underperform compared to the hybrid method.

Further research could explore more advanced ways of incorporating document context, such as through the use of hierarchical or multi-level retrieval models. Additionally, the benchmark dataset could be expanded to include a wider range of document types and query difficulties to better assess the capabilities of different retrieval approaches.

Overall, this paper lays important groundwork for the DAPR task and highlights the need for continued innovation in document-level information retrieval.

Conclusion

This paper introduces the new task of Document-Aware Passage Retrieval (DAPR), where the goal is to find relevant passages within long documents like Wikipedia articles or research papers. The authors find that existing passage retrieval models struggle with this task due to a lack of understanding of document context.

To address this, the authors create a new benchmark dataset for DAPR and experiment with extending these models by incorporating document-level information. While a hybrid approach that combines passage-level and document-level scoring performs best overall, it fails on the most challenging queries that require deep comprehension of the document context.

Contextualized passage representations, for example prepending the document title to the passage, show promise for handling these difficult cases, but their overall performance is still relatively poor. The authors argue that further research is needed to develop more effective retrieval systems for the DAPR task, which has important real-world applications.

By establishing this new benchmark and exploring different approaches, this paper lays the groundwork for future advancements in document-level information retrieval.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


DAPR: A Benchmark on Document-Aware Passage Retrieval

Kexin Wang, Nils Reimers, Iryna Gurevych

The work of neural retrieval so far focuses on ranking short texts and is challenged with long documents. There are many cases where the users want to find a relevant passage within a long document from a huge corpus, e.g. Wikipedia articles, research papers, etc. We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. This drives us to build a benchmark for this task including multiple datasets from heterogeneous domains. In the experiments, we extend the SoTA passage retrievers with document context via (1) hybrid retrieval with BM25 and (2) contextualized passage representations, which inform the passage representation with document context. We find despite that hybrid retrieval performs the strongest on the mixture of the easy and the hard queries, it completely fails on the hard queries that require document-context understanding. On the other hand, contextualized passage representations (e.g. prepending document titles) achieve good improvement on these hard queries, but overall they also perform rather poorly. Our created benchmark enables future research on developing and comparing retrieval systems for the new task. The code and the data are available at https://github.com/UKPLab/arxiv2023-dapr.

6/11/2024

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee

Spoken Question Answering (SQA) is essential for machines to reply to user's question by finding the answer span within a given spoken passage. SQA has been previously achieved without ASR to avoid recognition errors and Out-of-Vocabulary (OOV) problems. However, the real-world problem of Open-domain SQA (openSQA), in which the machine needs to first retrieve passages that possibly contain the answer from a spoken archive in addition, was never considered. This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. SpeechDPR learns a sentence-level semantic representation by distilling knowledge from the cascading model of unsupervised ASR (UASR) and text dense retriever (TDR). No manually transcribed speech data is needed. Initial experiments showed performance comparable to the cascading model of UASR and TDR, and significantly better when UASR was poor, verifying this approach is more robust to speech recognition errors.

8/27/2024

What are the limits of cross-lingual dense passage retrieval for low-resource languages?

Jie Wu, Zhaochun Ren, Suzan Verberne

In this paper, we analyze the capabilities of the multi-lingual Dense Passage Retriever (mDPR) for extremely low-resource languages. In the Cross-lingual Open-Retrieval Answer Generation (CORA) pipeline, mDPR achieves success on multilingual open QA benchmarks across 26 languages, of which 9 were unseen during training. These results are promising for Question Answering (QA) for low-resource languages. We focus on two extremely low-resource languages for which mDPR performs poorly: Amharic and Khmer. We collect and curate datasets to train mDPR models using Translation Language Modeling (TLM) and question--passage alignment. We also investigate the effect of our extension on the language distribution in the retrieval results. Our results on the MKQA and AmQA datasets show that language alignment brings improvements to mDPR for the low-resource languages, but the improvements are modest and the results remain low. We conclude that fulfilling CORA's promise to enable multilingual open QA in extremely low-resource settings is challenging because the model, the data, and the evaluation approach are intertwined. Hence, all three need attention in follow-up work. We release our code for reproducibility and future work: https://anonymous.4open.science/r/Question-Answering-for-Low-Resource-Languages-B13C/

8/23/2024


Control Token with Dense Passage Retrieval

Juhwan Lee, Jisu Kim

This study addresses the hallucination problem in large language models (LLMs). We adopted Retrieval-Augmented Generation (RAG) (Lewis et al., 2020), a technique that involves embedding relevant information in the prompt to obtain accurate answers. However, RAG also faced inherent issues in retrieving correct information. To address this, we employed the Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) model for fetching domain-specific documents related to user queries. Despite this, the DPR model still lacked accuracy in document retrieval. We enhanced the DPR model by incorporating control tokens, achieving significantly superior performance over the standard DPR model, with a 13% improvement in Top-1 accuracy and a 4% improvement in Top-20 accuracy.

5/24/2024