BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Read original: arXiv:2407.12883 - Published 7/19/2024 by Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang and 5 others


Overview

Existing benchmarks for text retrieval are often based on simple information-seeking queries, where keyword or semantic matching is usually sufficient.
However, many real-world queries require in-depth reasoning to identify relevant documents beyond surface-level matching.
To address this, the researchers introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning.

Plain English Explanation

Most existing benchmarks for text retrieval systems, such as search engines, are built from simple information-seeking queries, like the questions people commonly type into a search box. For these queries, matching keywords or the general meaning of the text is usually enough to find relevant documents.

However, many real-world queries are much more complex and require deeper reasoning to identify the most relevant information. For example, finding documentation to help solve a coding problem requires understanding the logic and syntax of the functions involved, not just matching keywords.

To better evaluate text retrieval systems on these more challenging queries, the researchers created a new benchmark called BRIGHT. BRIGHT is based on 1,398 real-world queries spanning diverse domains like economics, psychology, robotics, and software engineering. These queries were carefully curated from human-generated data, and they require intensive reasoning to determine the most relevant supporting documents.

The researchers found that even state-of-the-art retrieval models struggled on the BRIGHT benchmark, scoring much lower than on other benchmarks. However, they discovered that augmenting the queries with "chain-of-thought" reasoning generated by large language models (LLMs) could improve performance by up to 12.2 points.

The BRIGHT benchmark is designed to push the boundaries of text retrieval systems and encourage the development of more advanced models that can handle complex, real-world queries. By focusing on reasoning-intensive tasks, it aims to better reflect the challenges faced by users in practical applications.

Technical Explanation

The paper introduces BRIGHT, a new text retrieval benchmark that requires intensive reasoning to identify relevant documents. Unlike existing benchmarks that primarily consist of simple information-seeking queries, BRIGHT is constructed from 1,398 real-world queries across diverse domains, including economics, psychology, robotics, and software engineering.

These queries were carefully curated from naturally occurring or human-generated data, ensuring they represent challenging, reasoning-intensive scenarios that go beyond surface-level matching. For example, finding documentation to solve a coding problem requires understanding the logic and syntax of the functions involved, not just keyword matching.

The researchers evaluated state-of-the-art retrieval models on BRIGHT and found that even the leading model on the MTEB leaderboard scored far lower there (nDCG@10 of 18.0) than on MTEB itself (nDCG@10 of 59.0).
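
For readers unfamiliar with the metric, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It uses the common linear-gain form of DCG; the function names and the toy relevance labels are illustrative assumptions, not the paper's evaluation code. Reported scores such as 18.0 are typically this value averaged over all queries and multiplied by 100.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (linear-gain form)."""
    # The item at 0-based position `rank` is discounted by log2(rank + 2).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; documents missing from `relevance` count as 0."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: two relevant documents, retrieved at ranks 2 and 4.
example_relevance = {"d1": 1.0, "d2": 1.0}
example_ranking = ["d3", "d1", "d4", "d2"]
example_score = ndcg_at_k(example_ranking, example_relevance)
```

In this toy example the score is about 0.65: both relevant documents are retrieved, but neither sits at the top of the ranking, so the result falls short of the ideal value of 1.0.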

To improve performance on BRIGHT, the researchers explored augmenting the queries with "chain-of-thought" reasoning generated by large language models (LLMs). This approach led to performance gains of up to 12.2 points.
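
The idea of query augmentation can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the toy token-overlap scorer stands in for a real retriever, `canned_reasoning` is a hypothetical placeholder for an LLM call, and appending the reasoning to the query is just one plausible way to combine the two. It is not the authors' implementation.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, document: str) -> int:
    """Toy lexical score: occurrences in the document of each distinct query token."""
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[token] for token in set(tokenize(query)))


def rank(query: str, corpus: Dict[str, str]) -> List[str]:
    """Return document ids sorted by descending score (ties broken by id)."""
    return sorted(corpus, key=lambda doc_id: (-overlap_score(query, corpus[doc_id]), doc_id))


def augment_query(query: str, reason: Callable[[str], str]) -> str:
    """Append model-generated reasoning to the original query (assumed format)."""
    return f"{query}\n{reason(query)}"


def canned_reasoning(query: str) -> str:
    """Hypothetical stand-in for an LLM that reasons about the query."""
    return ("The price rise follows a supply shock, "
            "so the relevant concept is cost-push inflation.")


# Hypothetical two-document corpus and query.
corpus = {
    "econ_costpush": "Cost-push inflation occurs when a supply shock raises production costs.",
    "news_bread": "Bread price went up: one local bakery story.",
}
query = "Why did the price of bread go up after the drought?"

plain_ranking = rank(query, corpus)
augmented_ranking = rank(augment_query(query, canned_reasoning), corpus)
```

With the plain query, the surface-level match (`news_bread`) ranks first because it shares words like "bread" and "price". With the augmented query, the reasoning text introduces the terms that the explanatory document actually uses, so `econ_costpush` moves to the top. A real system would swap in an actual LLM and a stronger retriever, but the mechanism is the same.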

The researchers also validated that BRIGHT is robust against data leakage during pretraining: performance stayed similar even when documents from the benchmark were included in the training data of the evaluated models. This matters because benchmark scores can be inflated when models have already seen the benchmark's data during training.

Critical Analysis

The BRIGHT benchmark represents an important step forward in evaluating text retrieval systems on more realistic and challenging queries. By focusing on reasoning-intensive tasks, it addresses a crucial limitation of existing benchmarks that primarily test keyword or semantic-based retrieval.

However, the researchers acknowledge that BRIGHT is not a comprehensive solution and may not capture all the challenges faced in real-world applications. There may be additional factors, such as contextual understanding, task-specific knowledge, or multimodal information, that are not fully reflected in the benchmark.

Additionally, while the inclusion of chain-of-thought reasoning improved performance, the researchers did not explore other potential strategies, such as incorporating code-specific information retrieval or leveraging relational knowledge to tackle the complex queries in BRIGHT.

Further research is needed to better understand the limitations of current retrieval models and develop more robust and versatile systems that can handle a wide range of real-world information needs. The BRIGHT benchmark provides a valuable tool for driving progress in this direction, but continued efforts to expand and refine the benchmark will be essential.

Conclusion

The BRIGHT benchmark represents a significant advancement in the field of text retrieval by introducing a benchmark that requires intensive reasoning to identify relevant documents. This addresses a crucial limitation of existing benchmarks, which primarily focus on simple information-seeking queries.

By curating a diverse set of real-world queries that span multiple domains, BRIGHT pushes the boundaries of current retrieval models and highlights the need for more advanced systems that can handle complex, reasoning-intensive tasks. The researchers' findings, that even state-of-the-art models struggle on BRIGHT and that augmenting queries with chain-of-thought reasoning can improve performance, underscore the potential of this benchmark to drive innovation in the field.

As the research community continues to explore more realistic and challenging benchmarks, BRIGHT stands as an important milestone in the ongoing effort to develop text retrieval systems that can better serve the diverse information needs of users in the real world.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu

Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38], which achieves a score of 59.0 nDCG@10, produces a score of nDCG@10 of 18.0 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.

7/19/2024

RAR-b: Reasoning as Retrieval Benchmark

Chenghao Xiao, G Thomas Hudson, Noura Al Moubayed

Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without being specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from being competent for playing the role of assisting LLMs, especially in reasoning-intensive tasks. Moreover, albeit trained to be aware of instructions, instruction-aware IR models are often better off without instructions at inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so to bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.

5/14/2024

LitSearch: A Retrieval Benchmark for Scientific Literature Search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao

Literature search questions, such as "where can I find research on the evaluation of consistency in generated summaries?", pose significant challenges for modern search engines and retrieval systems. These questions often require a deep understanding of research concepts and the ability to reason over entire articles. In this work, we introduce LitSearch, a retrieval benchmark comprising 597 realistic literature search queries about recent ML and NLP papers. LitSearch is constructed using a combination of (1) questions generated by GPT-4 based on paragraphs containing inline citations from research papers and (2) questions about recently published papers, manually written by their authors. All LitSearch questions were manually examined or edited by experts to ensure high quality. We extensively benchmark state-of-the-art retrieval models and also evaluate two LLM-based reranking pipelines. We find a significant performance gap between BM25 and state-of-the-art dense retrievers, with a 24.8% difference in absolute recall@5. The LLM-based reranking strategies further improve the best-performing dense retriever by 4.4%. Additionally, commercial search engines and research tools like Google Search perform poorly on LitSearch, lagging behind the best dense retriever by 32 points. Taken together, these results show that LitSearch is an informative new testbed for retrieval systems while catering to a real-world use case.

7/30/2024

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Answering real-world complex queries, such as complex product search, often requires accurate retrieval from semi-structured knowledge bases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of our synthesized queries. We further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STARK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by large language models (LLMs). Our experiments suggest that STARK presents significant challenges to the current retrieval and LLM systems, indicating the demand for building more capable retrieval systems. The benchmark data and code are available at https://github.com/snap-stanford/stark.

5/22/2024