SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

2406.10421

Published 6/18/2024 by Tu Anh Dinh, Carlos Mullov, Leonard Barmann, Zhaolin Li, Danni Liu, Simon Rei{ss}, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao and 7 others

cs.CL

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Abstract

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs ability on solving scientific tasks. SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.

Create account to get full access

Overview

• The paper discusses a new benchmark called "SciEx" that evaluates the performance of large language models (LLMs) on scientific exams, using both human expert grading and automatic grading.

• The researchers developed a dataset of scientific exam questions across various disciplines, which was used to test the capabilities of LLMs like GPT-3, PaLM, and Chinchilla.

• The study compared the performance of LLMs to human expert graders, as well as explored the use of automatic grading methods to assess the models' responses.

Plain English Explanation

The researchers created a new way to test how well large language models (AI systems that can understand and generate human-like text) can perform on scientific exams. They developed a dataset of exam questions covering different scientific fields, like biology, physics, and chemistry.

The researchers then had the language models, such as GPT-3, PaLM, and Chinchilla, answer these exam questions. Importantly, the models' answers were graded by both human experts and an automated grading system. This allowed the researchers to compare how well the AI models did compared to human experts.

The key goal was to better understand the capabilities of these powerful language models when it comes to demonstrating scientific knowledge and problem-solving skills. This type of benchmark could help researchers and developers improve language models for scientific and educational applications.

Technical Explanation

The paper introduces the "SciEx" benchmark, which was designed to evaluate the performance of large language models (LLMs) on scientific exams. The researchers created a dataset of exam questions spanning various scientific disciplines, including biology, physics, chemistry, and more.

To assess the LLMs, the researchers had models like GPT-3, PaLM, and Chinchilla generate responses to the exam questions. These responses were then graded by both human experts and an automated grading system. The human expert grading provided a high-quality assessment of the models' scientific understanding, while the automated grading explored the feasibility of scalable evaluation methods.

The results showed varying performance of the LLMs across different scientific domains, with some models performing better than others on specific types of questions. The study also highlighted the challenges in automatically grading open-ended responses to capture nuanced scientific reasoning.

Critical Analysis

The SciEx benchmark represents an important step in evaluating the capabilities of large language models in the scientific domain. By using a diverse set of exam questions and comparing model performance to human experts, the researchers have provided valuable insights into the strengths and limitations of current LLM technology.

However, the paper acknowledges several caveats and areas for further research. For example, the dataset may not capture the full breadth of scientific knowledge, and the human grading process could be subject to biases. Additionally, the automated grading approaches explored in the study were not as comprehensive as the human evaluations, suggesting the need for more advanced techniques to accurately assess scientific reasoning.

Future research could explore ways to expand the SciEx benchmark, such as incorporating more advanced problem-solving tasks or integrating multimedia elements (e.g., diagrams, formulas) into the exam questions. Investigating the performance of LLMs on other scientific benchmarks, such as SciKnowEval or Grade-Like-a-Human, could also provide a more holistic understanding of these models' scientific capabilities.

Conclusion

The SciEx benchmark represents a significant contribution to the field of AI and scientific assessment. By evaluating large language models on a diverse set of scientific exam questions, the researchers have shed light on the current capabilities and limitations of these models in demonstrating scientific knowledge and problem-solving skills.

The findings from this study have important implications for the development of AI systems that can assist in scientific education, research, and knowledge dissemination. As language models continue to advance, benchmarks like SciEx will be crucial for tracking their progress and ensuring they can reliably and accurately engage with scientific content. This work also highlights the value of combining human expert evaluation with automated grading to provide a comprehensive assessment of AI systems' scientific competencies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

New!SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang

Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.

7/1/2024

cs.CL cs.AI cs.LG

SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, Jin Huang, Fang Xi, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke

Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, sparking significant interest in applying them to scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. SciAssess aims to thoroughly assess the efficacy of LLMs by focusing on their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including fundamental science, alloy materials, biomedicine, drug discovery, and organic materials. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, including GPT, Claude, and Gemini, highlighting their strengths and areas for improvement. This evaluation supports the ongoing development of LLM applications in the analysis of scientific literature. SciAssess and its resources are available at url{https://sci-assess.github.io/}.

6/19/2024

cs.CL

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, Huajun Chen

The burgeoning utilization of Large Language Models (LLMs) in scientific research necessitates advanced benchmarks capable of evaluating their understanding and application of scientific knowledge comprehensively. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including knowledge coverage, inquiry and exploration capabilities, reflection and reasoning abilities, ethic and safety considerations, as well as practice proficiency. Specifically, we take biology and chemistry as the two instances of SciKnowEval and construct a dataset encompassing 50K multi-level scientific problems and solutions. By leveraging this dataset, we benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement, particularly in addressing scientific computations and applications. We anticipate that SciKnowEval will establish a comprehensive standard for benchmarking LLMs in science research and discovery, and promote the development of LLMs that integrate scientific knowledge with strong safety awareness. The dataset and code are publicly available at https://github.com/hicai-zju/sciknoweval .

6/14/2024

cs.CL

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a particular step in the grading procedure: grading using predefined rubrics. However, grading is a multifaceted procedure that encompasses other crucial steps, such as grading rubrics design and post-grading review. There has been a lack of systematic research exploring the potential of LLMs to enhance the entire grading~process. In this paper, we propose an LLM-based grading system that addresses the entire grading procedure, including the following key components: 1) Developing grading rubrics that not only consider the questions but also the student answers, which can more accurately reflect students' performance. 2) Under the guidance of grading rubrics, providing accurate and consistent scores for each student, along with customized feedback. 3) Conducting post-grading review to better ensure accuracy and fairness. Additionally, we collected a new dataset named OS from a university operating system course and conducted extensive experiments on both our new dataset and the widely used Mohler dataset. Experiments demonstrate the effectiveness of our proposed approach, providing some new insights for developing automated grading systems based on LLMs.

5/31/2024

cs.AI