DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Read original: arXiv:2407.10701 - Published 7/16/2024 by Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

Overview

• This paper introduces DocBench, a new benchmark for evaluating the performance of large language models (LLMs) on document reading tasks.

• DocBench is designed to assess an LLM's ability to understand and reason about long-form documents, going beyond traditional QA benchmarks that focus on short passages.

• The benchmark comprises 229 real documents and 1,102 questions spanning five domains and four major question types, and it was built with both human annotators and synthetically generated questions.

Plain English Explanation

DocBench is a new benchmark that researchers can use to test how well LLM-based systems read and understand documents that users upload. Unlike earlier benchmarks built around short passages of text, DocBench gives a system a raw file together with questions about it and checks the response it returns.

The researchers who created DocBench believe that evaluating LLMs on these more complex, real-world document reading tasks is important for understanding the true capabilities of these powerful AI systems. By testing how well LLMs can handle long-form, multi-paragraph texts, DocBench provides a more realistic and comprehensive assessment of their document understanding abilities.

Technical Explanation

DocBench casts document reading as an end-to-end task: a system receives a raw file and a question, and must return a response. According to the paper's abstract, doing this well requires handling several challenges:

File parsing: Turning the raw uploaded file into content the model can use
Metadata extraction: Recovering information about the document itself rather than its body text
Multi-modal information understanding: Interpreting non-text content in the document
Long-context reading: Locating the answer within a lengthy document

The researchers built the dataset from 229 real documents drawn from five domains, with sources such as academic papers, news articles, and government reports. The 1,102 accompanying questions fall into four major types and were produced through a process that combines human annotators with synthetic question generation.

The paper evaluates two kinds of systems on DocBench: proprietary LLM-based systems accessible through web interfaces or APIs, and a parse-then-read pipeline built on open-source LLMs. The results show noticeable gaps between existing LLM-based document reading systems and human performance.
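
The paper's abstract describes a parse-then-read pipeline, in which a raw file is first converted to text and an LLM then answers questions over that text. The sketch below illustrates the general shape of such a pipeline and of a per-question-type accuracy report. It is not the authors' code: the `Example` fields, the question-type labels, the toy parser, the word-overlap "reader" that stands in for an LLM, and the substring-based correctness check are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One benchmark item (field names are hypothetical)."""
    document_path: str   # path to the raw uploaded file
    question: str
    reference: str       # reference answer
    question_type: str   # hypothetical label such as "text" or "metadata"


def parse_then_read(
    example: Example,
    parse: Callable[[str], str],
    read: Callable[[str, str], str],
) -> str:
    """Step 1: parse the raw file to text. Step 2: let the reader answer."""
    text = parse(example.document_path)
    return read(text, example.question)


def accuracy_by_type(
    examples: List[Example],
    system: Callable[[Example], str],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Fraction of responses judged correct, grouped by question type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        total[ex.question_type] += 1
        if is_correct(system(ex), ex.reference):
            correct[ex.question_type] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# --- Toy stand-ins so the sketch runs without a PDF parser or an LLM ---

TOY_FILES = {
    "report.pdf": "The report was published in 2021.\nRevenue grew by 12 percent.",
}


def toy_parse(path: str) -> str:
    # Assumption: a real system would call a file parser here.
    return TOY_FILES[path]


def _words(s: str) -> set:
    return set(re.findall(r"\w+", s.lower()))


def toy_read(text: str, question: str) -> str:
    # Assumption: a real system would prompt an LLM. This stub returns the
    # line of the parsed text that shares the most words with the question.
    q = _words(question)
    return max(text.splitlines(), key=lambda line: len(q & _words(line)))


def contains_reference(response: str, reference: str) -> bool:
    # Assumption: a simple substring check, not the paper's scoring method.
    return reference.lower() in response.lower()


EXAMPLES = [
    Example("report.pdf", "When was the report published?", "2021", "text"),
    Example("report.pdf", "How much did revenue grow?", "12 percent", "text"),
    Example("report.pdf", "Who is the author of the report?", "Jane Doe", "metadata"),
]

scores = accuracy_by_type(
    EXAMPLES,
    system=lambda ex: parse_then_read(ex, toy_parse, toy_read),
    is_correct=contains_reference,
)
print(scores)
```

In this toy run the metadata question fails because the author's name never appears in the parsed text, which hints at why metadata extraction is treated as a separate challenge from reading the body of a document.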

Critical Analysis

The DocBench benchmark represents an important step forward in evaluating the document reading abilities of LLMs. By moving beyond traditional QA benchmarks, it provides a more realistic assessment of how these models perform on the types of complex, real-world tasks they will encounter in practical applications.

However, the paper acknowledges that DocBench is still a relatively narrow set of tasks and that further work is needed to capture the full range of document understanding capabilities. For example, the benchmark does not currently include tasks related to multi-document reasoning or deep causal/commonsense understanding.

Additionally, the researchers note that the current dataset may not fully represent the diversity of documents that LLMs will need to handle in the real world. Continued efforts to expand and refine the DocBench dataset will be important for ensuring its relevance and usefulness.

Conclusion

DocBench represents an important step forward in the evaluation of LLM capabilities, providing a more comprehensive and realistic assessment of document understanding skills. By focusing on long-form, multi-paragraph texts and a variety of relevant tasks, the benchmark can help drive progress in developing LLMs that can truly excel at reading and comprehending complex documents.

As the field of AI continues to advance, benchmarks like DocBench will become increasingly crucial for understanding the strengths and limitations of these powerful language models and guiding future research and development efforts. The insights gained from DocBench can help ensure that LLMs are well-equipped to tackle the real-world challenges of document-centric tasks.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark exists to evaluate their performance in such scenarios, where a raw file and questions are provided as input, and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area.

7/16/2024

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu, Jinru Ding, Jie Xu, Weiguo Hu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, Pengfei Liu, Xiaofan Zhang, Shanshan Wang, Kang Li, Haofen Wang, Tong Ruan, Xuanjing Huang, Xin Sun, Shaoting Zhang

Ensuring the general efficacy and goodness for human beings from medical large language models (LLM) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLM, especially in the Chinese context, remains to be established. In this work, we introduce MedBench, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the currently largest evaluation dataset (300,901 questions) to cover 43 clinical specialties and performs multi-facet evaluation on medical LLM. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separations for question and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results largely aligning with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

7/17/2024

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

7/11/2024

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

John Mendonça, Alon Lavie, Isabel Trancoso

Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fail to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots.

7/8/2024