LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

2403.07974

Published 6/7/2024 by Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

cs.SE cs.CL cs.LG

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model

Create account to get full access

Overview

This paper introduces LiveCodeBench, a new benchmark for holistically evaluating the code-related capabilities of large language models (LLMs).
LiveCodeBench aims to provide a comprehensive and contamination-free assessment of an LLM's ability to perform various code-related tasks, including code generation, understanding, and debugging.
The benchmark is designed to measure an LLM's performance on a diverse set of real-world coding challenges, rather than relying on synthetic or limited datasets.

Plain English Explanation

The paper discusses a new benchmark called LiveCodeBench that is designed to thoroughly evaluate the code-related abilities of large language models (LLMs). LLMs are AI systems that can understand and generate human language, and they are being increasingly used for coding-related tasks. However, the existing ways of testing these models' coding capabilities often use artificial or limited datasets, which may not accurately reflect their real-world performance.

LiveCodeBench aims to address this issue by providing a more comprehensive and realistic assessment of an LLM's coding skills. The benchmark includes a wide range of coding challenges, such as generating working code from natural language descriptions, debugging code, and performing cybersecurity tasks. These challenges are based on real-world coding problems, rather than artificially created ones.

The key advantage of LiveCodeBench is that it helps researchers and developers assess the true capabilities of LLMs in a way that is not influenced by data contamination. Data contamination occurs when the training data used to develop an LLM contains information about the test data, which can lead to inflated performance results. LiveCodeBench is designed to avoid this issue, ensuring that the evaluation is truly holistic and unbiased.

Technical Explanation

The paper introduces a new benchmark called LiveCodeBench for comprehensively evaluating the code-related capabilities of large language models (LLMs). The benchmark is designed to provide a holistic assessment of an LLM's performance on a diverse set of real-world coding challenges, including code generation, code understanding, code debugging, and cybersecurity tasks.

The key innovation of LiveCodeBench is its focus on contamination-free evaluation. The authors argue that many existing code-related benchmarks suffer from data contamination, where the training data used to develop the LLM contains information about the test data, leading to inflated performance results. LiveCodeBench addresses this issue by curating a benchmark dataset that is completely separate from the LLM's training data, ensuring a fair and unbiased evaluation.

The benchmark curation process involves several steps, including the collection of real-world coding challenges from various sources, the filtering of challenges to ensure diversity and quality, and the verification that the challenges are not present in the LLM's training data. This process is designed to create a comprehensive and representative benchmark that accurately reflects the real-world coding capabilities of the LLMs being evaluated.

Critical Analysis

The LiveCodeBench paper presents a well-designed and thorough approach to evaluating the code-related capabilities of large language models. The focus on contamination-free evaluation is a significant strength, as it helps to ensure that the benchmark results are not skewed by data leakage.

However, the paper does acknowledge some limitations and areas for further research. For example, the authors note that the current benchmark dataset may not fully capture the diversity of real-world coding challenges, and they encourage the community to contribute additional challenges to expand the benchmark's coverage.

Additionally, the paper does not provide a detailed analysis of the specific coding tasks or the performance of existing LLMs on the benchmark. While the overall framework and methodology are clearly described, the lack of concrete results makes it difficult to fully assess the practical implications of the LiveCodeBench approach.

Further research could also explore the potential for using LiveCodeBench to inform the development and fine-tuning of LLMs for code-related applications. By identifying the strengths and weaknesses of these models on a diverse set of coding challenges, the benchmark could help guide the design of more capable and robust systems.

Conclusion

The LiveCodeBench paper presents a significant advancement in the evaluation of large language models for code-related tasks. By providing a comprehensive, contamination-free benchmark, the authors have created a valuable tool for assessing the true capabilities of these AI systems in real-world coding scenarios.

The widespread adoption of LiveCodeBench has the potential to drive meaningful progress in the development of LLMs for coding applications, as it will enable more accurate and reliable assessment of their performance. This, in turn, could lead to the creation of more capable and trustworthy AI assistants for software development, cybersecurity, and other critical domains.

Overall, the LiveCodeBench framework represents an important contribution to the field of AI-powered coding, and its ongoing development and application will be an area to watch closely in the years to come.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4), outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

4/9/2024

cs.SE cs.AI cs.CL cs.LG

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu

Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scales. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvements, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs' capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performances in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues in assessing LLMs' diverse reasoning capabilities. The CS-Bench data and evaluation code are available at https://github.com/csbench/csbench.

6/14/2024

cs.CL cs.AI cs.LG

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe

Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject answering benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find many LLMs able to successfully comply with borderline benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task, that of exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs.

4/23/2024

cs.CR cs.LG

💬

DebugBench: Evaluating Debugging Capability of Large Language Models

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

6/7/2024

cs.SE cs.AI cs.CL