CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Read original: arXiv:2409.11363 - Published 9/18/2024 by Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan

Overview

Introduces CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark that measures how accurately AI agents can reproduce the results of published research using the provided code and data
Aims to foster credibility in published research by incentivizing researchers to ensure their work is computationally reproducible
Provides a standardized way to assess an AI agent's ability to reproduce the computational experiments described in a research paper

Plain English Explanation

CORE-Bench is a tool designed to help improve the credibility of scientific research. Often, when researchers publish their work, it can be difficult for others to reproduce the computational experiments they describe. This can undermine confidence in the findings.

CORE-Bench addresses this issue by providing a benchmark that evaluates AI agents on their ability to reproduce a paper's results using the code and data provided with it. The idea is that if an agent can rerun the analysis and arrive at the reported results, readers have more reason to believe that the published numbers follow from the stated methods.

By incentivizing researchers to ensure their work is computationally reproducible, CORE-Bench aims to foster greater trust in published research and encourage more rigorous scientific practices.

Technical Explanation

CORE-Bench (Computational Reproducibility Agent Benchmark) measures how accurately AI agents can reproduce the results of a published study using the code and data provided with it. The benchmark consists of 270 tasks based on 90 scientific papers, organized into three difficulty levels and including both language-only and vision-language tasks.

The key elements of CORE-Bench include:

Paper Selection: The tasks are drawn from papers in three disciplines: computer science, social science, and medicine.
Computational Experiment Extraction: For each paper, the researchers extract the computational experiments it describes, including the data, code, and computational environment required to reproduce them.
Agent Evaluation: AI agents attempt to reproduce the computational experiments for each paper and are scored on accuracy. The evaluation system is fast and parallelizable, which the authors report saves days of evaluation time per run compared to a sequential implementation.
Reproducibility Scoring: Scoring is standardized, so different agents can be compared directly. The authors evaluated two baseline agents, the general-purpose AutoGPT and the task-specific CORE-Agent, each with GPT-4o and GPT-4o-mini. The best agent achieved 21% accuracy on the hardest task.
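
The elements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CORE-Bench harness itself: the names `Task`, `build_tasks`, and `evaluate`, the difficulty labels, and the exact-match scoring rule are all hypothetical. The paper's abstract reports 270 tasks from 90 papers at three difficulty levels, which is consistent with one task per paper per level, and describes a parallelizable evaluation system; the sketch mirrors both ideas with a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Hypothetical labels: the abstract only says there are three difficulty levels.
DIFFICULTIES = ("easy", "medium", "hard")


@dataclass(frozen=True)
class Task:
    paper_id: str    # hypothetical identifier for one paper
    difficulty: str  # one of DIFFICULTIES
    expected: str    # hypothetical reference answer taken from the paper's results


def build_tasks(expected_by_paper: dict[str, str]) -> list[Task]:
    """Create one task per (paper, difficulty) pair."""
    return [
        Task(paper_id, difficulty, expected)
        for paper_id, expected in expected_by_paper.items()
        for difficulty in DIFFICULTIES
    ]


def evaluate(
    agent: Callable[[Task], str], tasks: list[Task], workers: int = 8
) -> dict[str, float]:
    """Run the agent on all tasks in parallel; return accuracy per difficulty."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(agent, tasks))  # map() preserves task order

    correct = {d: 0 for d in DIFFICULTIES}
    total = {d: 0 for d in DIFFICULTIES}
    for task, answer in zip(tasks, answers):
        total[task.difficulty] += 1
        # Assumed scoring rule: exact match against the reference answer.
        correct[task.difficulty] += int(answer == task.expected)
    return {d: correct[d] / total[d] for d in DIFFICULTIES if total[d]}


def toy_agent(task: Task) -> str:
    """Stand-in agent that only succeeds on the easiest level."""
    return task.expected if task.difficulty == "easy" else "unknown"


tasks = build_tasks({f"paper-{i}": str(i) for i in range(90)})
print(len(tasks))                  # 270
print(evaluate(toy_agent, tasks))  # {'easy': 1.0, 'medium': 0.0, 'hard': 0.0}
```

Reporting accuracy per difficulty level, rather than a single pooled number, keeps a result such as "21% on the hardest task" visible instead of averaging it away.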

By providing a standardized benchmark, CORE-Bench aims to incentivize researchers to ensure their work is computationally reproducible, ultimately enhancing the credibility of published research.

Critical Analysis

The CORE-Bench paper acknowledges several caveats and limitations of the approach:

The selection of papers and computational experiments included in the benchmark may not be representative of all scientific domains or computational techniques.
The evaluation of AI agents may be influenced by the specific implementation details of the benchmark, which could introduce biases.
Computational reproducibility is just one aspect of research credibility, and other factors, such as experimental design and data validity, are not directly addressed by CORE-Bench.

Additionally, the paper does not discuss potential issues that could arise from the use of CORE-Bench, such as the risk of researchers gaming the system or the challenges of evaluating complex computational workflows.

Overall, while CORE-Bench represents an important step towards fostering greater credibility in published research, further research and refinement may be needed to address these limitations and ensure the widespread adoption and effectiveness of the benchmark.

Conclusion

CORE-Bench is a novel approach to addressing the issue of computational reproducibility in scientific research. By providing a standardized benchmark for evaluating AI agents' ability to reproduce the computational experiments described in published papers, CORE-Bench aims to incentivize researchers to ensure their work is computationally reproducible.

This, in turn, has the potential to increase the credibility and trustworthiness of published research, which is crucial for advancing scientific knowledge and informing important decisions in fields like healthcare, policy, and technology development. While CORE-Bench has some limitations, it represents a significant step towards creating a more robust and reliable scientific ecosystem.

Related Papers

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

9/18/2024

AI Agents That Matter

Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, Arvind Narayanan

AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.

7/2/2024

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang, Jure Leskovec

A central aspect of machine learning research is experimentation, the process of designing and running experiments, analyzing the results, and iterating towards some positive outcome (e.g., improving accuracy). Could agents driven by powerful language models perform machine learning experimentation effectively? To answer this question, we introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We then construct an agent that can perform ML experimentation based on ReAct framework. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. Our agents also display highly interpretable plans and actions. However, the success rates vary considerably; they span from 100% on well-established older datasets to as low as 0% on recent Kaggle challenges created potentially after the underlying LM was trained. Finally, we identify several key challenges for LM-based agents such as long-term planning and reducing hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.

4/16/2024

BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z. Li, Kaicheng Yu

Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people either rely on direct Question-Answering (QA) to the LLM itself, or in a biomedical experimental manner. How to precisely benchmark biomedical agents from an AI Scientist perspective remains largely unexplored. To this end, we draw inspiration from one most important abilities of scientists, understanding the literature, and introduce BioKGBench. In contrast to traditional evaluation benchmark that only focuses on factual QA, where the LLMs are known to have hallucination issues, we first disentangle Understanding Literature into two atomic abilities, i) Understanding the unstructured text from research papers by performing scientific claim verification, and ii) Ability to interact with structured Knowledge-Graph Question-Answering (KGQA) as a form of Literature grounding. We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation (RAG) to identify the factual errors of existing large-scale knowledge graph databases. We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task. Surprisingly, we discover that state-of-the-art agents, both daily scenarios and biomedical ones, have either failed or inferior performance on our benchmark. We then introduce a simple yet effective baseline, dubbed BKGAgent. On the widely used popular knowledge graph, we discover over 90 factual errors which provide scenarios for agents to make discoveries and demonstrate the effectiveness of our approach. The code and data are available at https://github.com/westlake-autolab/BioKGBench.

7/2/2024