LitSearch: A Retrieval Benchmark for Scientific Literature Search

Read original: arXiv:2407.18940 - Published 7/30/2024 by Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao

Overview

The paper presents LitSearch, a new benchmark for evaluating retrieval systems on scientific literature search.
The benchmark comprises 597 realistic literature search queries about recent ML and NLP papers, drawn from two sources: questions generated by GPT-4 from paragraphs containing inline citations, and questions written manually by the authors of recently published papers.
The benchmark aims to assess how well retrieval models find the papers that answer a given literature search question.

Plain English Explanation

The researchers have developed a new way to test how well AI systems can find the scientific papers that answer a researcher's question, such as "where can I find research on the evaluation of consistency in generated summaries?". This matters because finding relevant prior work is a key task for AI tools that help researchers search the literature.

The LitSearch benchmark draws its questions from two sources:

Inline-citation Questions: GPT-4 generates a literature search question from a paragraph of a research paper that contains inline citations.
Author-written Questions: Authors of recently published papers manually write questions about their own papers.

By testing AI systems on these questions, the researchers can evaluate how well the systems understand research concepts and reason over entire articles in order to retrieve the right papers. This is an important capability for AI-powered research tools.

Technical Explanation

The LitSearch benchmark consists of 597 literature search queries about recent ML and NLP papers, built from two sources:

Inline-citation Questions: GPT-4 generates questions from paragraphs of research papers that contain inline citations. Answering them tests a model's ability to match a described research concept to the appropriate papers.
Author-written Questions: Authors of recently published papers manually write questions about their papers. Answering them tests a model's ability to reason over the content of entire articles.
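The paper's abstract reports its results as recall@5. As a minimal sketch, assuming the standard definition (the fraction of a query's relevant papers that appear among the top k retrieved results, averaged over queries), the metric can be computed as below. The function names, paper IDs, and toy rankings are hypothetical and are not taken from the LitSearch code or data.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant papers that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant paper")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(rankings[query_id], gold[query_id], k) for query_id in gold]
    return sum(scores) / len(scores)


# Hypothetical toy data: ranked paper IDs returned by some retriever.
rankings = {
    "q1": ["p3", "p1", "p7", "p9", "p2", "p4"],
    "q2": ["p5", "p8", "p6", "p1", "p0", "p2"],
}
# Hypothetical gold labels: the papers each query should retrieve.
gold = {
    "q1": {"p1", "p4"},  # p1 is in the top 5, p4 is ranked 6th: recall 0.5
    "q2": {"p5"},        # p5 is ranked 1st: recall 1.0
}

print(mean_recall_at_k(rankings, gold, k=5))  # 0.75
```

Under this definition, a reported difference in "absolute recall@5" is simply the difference between two such averages.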

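All LitSearch questions were manually examined or edited by experts to ensure high quality. The questions are designed to be challenging, requiring a deep understanding of research concepts rather than simple lexical matching: the paper reports a 24.8% gap in absolute recall@5 between BM25 and state-of-the-art dense retrievers. LLM-based reranking improves the best-performing dense retriever by a further 4.4%, while commercial search engines and research tools such as Google Search lag behind it by 32 points.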
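To make "simple lexical matching" concrete, here is a minimal sketch of BM25, the lexical baseline named in the paper's abstract. It is a textbook formulation and not the paper's implementation: the tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the three toy documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(
    query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: the number of documents containing each term.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # only exact token overlap contributes
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus of paper titles.
docs = {
    "p1": "Evaluating factual consistency of abstractive summaries",
    "p2": "Dense passage retrieval for open-domain question answering",
    "p3": "A survey of neural machine translation",
}

overlapping = bm25_rank("evaluation of consistency in generated summaries", docs)
paraphrased = bm25_rank("measuring faithfulness in summarization", docs)
print(overlapping[0][0])                    # p1
print([score for _, score in paraphrased])  # [0.0, 0.0, 0.0]
```

The second query asks for the same kind of paper as the first but shares no tokens with any title, so every score is zero. This is the kind of gap that dense retrievers, which compare learned representations rather than exact tokens, are intended to close.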

Critical Analysis

The authors acknowledge some limitations of the LitSearch benchmark, including the potential for biases in the dataset selection and the challenges of evaluating citation recommendation at scale.

Additionally, the benchmark measures whether relevant papers are retrieved, but not the quality or appropriateness of those papers as citations. Further research could explore ways to evaluate citation quality beyond retrieval alone.

It would also be valuable to extend the benchmark beyond recent ML and NLP papers, both to other scientific fields and to other types of informational text, since the ability to recommend relevant sources matters in many domains.

Conclusion

The LitSearch benchmark provides a valuable new tool for evaluating the literature retrieval capabilities of AI systems. By testing models on both inline-citation and author-written questions, the benchmark can help drive progress toward more capable, context-aware literature search and citation recommendation systems.

These types of systems have important applications in helping researchers efficiently discover and incorporate relevant prior work into their studies. Continued advancements in this area could significantly improve the research workflow and the quality of scientific literature.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LitSearch: A Retrieval Benchmark for Scientific Literature Search

Anirudh Ajith, Mengzhou Xia, Alexis Chevalier, Tanya Goyal, Danqi Chen, Tianyu Gao

Literature search questions, such as "Where can I find research on the evaluation of consistency in generated summaries?" pose significant challenges for modern search engines and retrieval systems. These questions often require a deep understanding of research concepts and the ability to reason over entire articles. In this work, we introduce LitSearch, a retrieval benchmark comprising 597 realistic literature search queries about recent ML and NLP papers. LitSearch is constructed using a combination of (1) questions generated by GPT-4 based on paragraphs containing inline citations from research papers and (2) questions about recently published papers, manually written by their authors. All LitSearch questions were manually examined or edited by experts to ensure high quality. We extensively benchmark state-of-the-art retrieval models and also evaluate two LLM-based reranking pipelines. We find a significant performance gap between BM25 and state-of-the-art dense retrievers, with a 24.8% difference in absolute recall@5. The LLM-based reranking strategies further improve the best-performing dense retriever by 4.4%. Additionally, commercial search engines and research tools like Google Search perform poorly on LitSearch, lagging behind the best dense retriever by 32 points. Taken together, these results show that LitSearch is an informative new testbed for retrieval systems while catering to a real-world use case.

7/30/2024

REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs

Deepa Tilwani, Yash Saxena, Ali Mohammadi, Edward Raff, Amit Sheth, Srinivasan Parthasarathy, Manas Gaur

Automatic citation generation for sentences in a document or report is paramount for intelligence analysts, cybersecurity, news agencies, and education personnel. In this research, we investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries: (a) Direct Queries, LLMs are asked to provide author names of the given research article, and (b) Indirect Queries, LLMs are asked to provide the title of a mentioned article when given a sentence from a different article. To demonstrate where LLM stands in this task, we introduce a large dataset called REASONS comprising abstracts of the 12 most popular domains of scientific research on arXiv. From around 20K research articles, we make the following deductions on public and proprietary LLMs: (a) State-of-the-art, often called anthropomorphic GPT-4 and GPT-3.5, suffers from high pass percentage (PP) to minimize the hallucination rate (HR). When tested with Perplexity.ai (7B), they unexpectedly made more errors; (b) Augmenting relevant metadata lowered the PP and gave the lowest HR; (c) Advance retrieval-augmented generation (RAG) using Mistral demonstrates consistent and robust citation support on indirect queries and matched performance to GPT-3.5 and GPT-4. The HR across all domains and models decreased by an average of 41.93%, and the PP was reduced to 0% in most cases. In terms of generation quality, the average F1 Score and BLEU were 68.09% and 57.51%, respectively; (d) Testing with adversarial samples showed that LLMs, including the Advance RAG Mistral, struggle to understand context, but the extent of this issue was small in Mistral and GPT-4-Preview. Our study contributes valuable insights into the reliability of RAG for automated citation generation tasks.

5/10/2024

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Answering real-world complex queries, such as complex product search, often requires accurate retrieval from semi-structured knowledge bases that involve a blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. Our benchmark covers three domains/datasets: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties, together with their ground-truth answers (items). We conduct rigorous human evaluation to validate the quality of our synthesized queries. We further enhance the benchmark with high-quality human-generated queries to provide an authentic reference. STARK serves as a comprehensive testbed for evaluating the performance of retrieval systems driven by large language models (LLMs). Our experiments suggest that STARK presents significant challenges to the current retrieval and LLM systems, indicating the demand for building more capable retrieval systems. The benchmark data and code are available on https://github.com/snap-stanford/stark.

5/22/2024

DocReLM: Mastering Document Retrieval with Language Model

Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang

With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further enhance the performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance. The results show that DocReLM achieves a Top 10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and an increase to 36.21% in quantum physics, while that of Google Scholar is 12.96%.

5/21/2024