NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

2403.00862

Published 6/5/2024 by Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng and 1 other

cs.CL cs.AI

NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Abstract

We present NewsBench, a novel evaluation framework to systematically assess the capabilities of Large Language Models (LLMs) for editorial capabilities in Chinese journalism. Our constructed benchmark dataset is focused on four facets of writing proficiency and six facets of safety adherence, and it comprises manually and carefully designed 1,267 test samples in the types of multiple choice questions and short answer questions for five editorial tasks in 24 news domains. To measure performances, we propose different GPT-4 based automatic evaluation protocols to assess LLM generations for short answer questions in terms of writing proficiency and safety adherence, and both are validated by the high correlations with human evaluations. Based on the systematic evaluation framework, we conduct a comprehensive analysis of ten popular LLMs which can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.

Create account to get full access

Overview

Introduces a new benchmark dataset called "NewsBench" for evaluating the writing proficiency and safety adherence of large language models (LLMs) in Chinese journalistic editorial applications
Designed to systematically assess the ability of LLMs to generate high-quality, safe, and appropriate content for Chinese news articles
Covers technical details on the dataset construction and evaluation methodology

Plain English Explanation

The paper presents a new benchmark called "NewsBench" that is designed to evaluate how well large language models (LLMs) can write Chinese news articles. The goal is to systematically assess the models' ability to generate content that is of high quality, safe, and appropriate for a journalistic setting.

The authors built NewsBench by carefully curating a dataset of real news articles and annotations that can be used to assess different aspects of an LLM's writing proficiency and adherence to safety guidelines. This includes evaluating factors like grammar, style, factual accuracy, ethical considerations, and potential for generating harmful or biased content.

The idea is that by using this standardized benchmark, researchers and developers can better understand the strengths and limitations of LLMs when it comes to producing content for Chinese news publications. This information can then guide the development of more capable and responsible AI systems for journalism and other content creation tasks.

Technical Explanation

The NewsBench paper introduces a new dataset and evaluation framework for assessing the performance of large language models (LLMs) on Chinese journalistic editorial tasks. The dataset consists of a curated collection of real news articles and associated annotations that cover various aspects of writing proficiency and safety adherence.

The key components of the NewsBench benchmark include:

Writing Proficiency Evaluation: This evaluates an LLM's ability to generate news articles that exhibit strong grammar, coherent structure, appropriate style, and factual accuracy. The dataset includes human-written articles as well as annotations on these dimensions.
Safety Adherence Evaluation: This assesses an LLM's capacity to avoid generating content that is unethical, biased, or potentially harmful. The dataset includes annotations related to ethical considerations, political sensitivity, and other safety-critical factors.

The authors describe the systematic process they used to construct the NewsBench dataset, including article collection, annotation procedures, and quality control measures. They also outline the evaluation metrics and benchmarking protocols that can be used to compare the performance of different LLMs on this task.

Critical Analysis

The NewsBench benchmark represents a valuable contribution to the field of language model evaluation, particularly in the context of journalistic applications. By providing a standardized dataset and evaluation framework, the authors enable more rigorous and comparable assessments of LLM capabilities in this domain.

One strength of the NewsBench approach is its comprehensive coverage of both writing proficiency and safety adherence. This aligns with the growing recognition that responsible AI development must consider not just technical performance, but also the potential for models to generate harmful or biased content.

However, the authors acknowledge several limitations and areas for further research. For example, the dataset is focused on Chinese-language news, so its applicability to other languages and domains may be limited. Additionally, the safety annotations, while extensive, may not capture the full complexity of ethical considerations in journalism.

Further research could explore ways to expand the benchmark's scope, incorporate more diverse data sources, and develop refined evaluation metrics that better capture the nuances of journalistic writing and ethical decision-making. Ongoing collaboration between AI researchers, journalists, and other stakeholders will be crucial to ensure that language models are developed and deployed in ways that uphold the integrity and social responsibility of the media industry.

Conclusion

The NewsBench paper presents a comprehensive benchmark for evaluating the writing proficiency and safety adherence of large language models in the context of Chinese journalistic editorial applications. By providing a standardized dataset and evaluation framework, the authors enable more rigorous and comparable assessments of LLM capabilities in this important domain.

The NewsBench benchmark represents a valuable contribution to the field of responsible AI development, as it recognizes the need to consider not just technical performance, but also the potential for language models to generate harmful or biased content. While the current dataset and evaluation protocols have some limitations, this work lays the foundation for continued research and collaboration to ensure that AI systems can be safely and ethically deployed in the media industry and other content creation contexts.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

💬

SafetyBench: Evaluating the Safety of Large Language Models

Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang

With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at url{https://github.com/thu-coai/SafetyBench}{https://github.com/thu-coai/SafetyBench}. Submission entrance and leaderboard are available at url{https://llmbench.ai/safety}{https://llmbench.ai/safety}.

6/26/2024

cs.CL

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, Jinyang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choice questions across common sense and K-12 educational subjects, meticulously curated to reflect the breadth and depth of everyday and academic knowledge. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities. The insights gleaned from FoundaBench evaluations set a new standard for understanding the fundamental knowledge of LLMs, providing a robust framework for future advancements in the field.

4/30/2024

cs.CL cs.AI

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks. Evaluation of 19 LLMs reveals that closed-source models (particularly Gemini-Ultra and GPT-4), outperform open-source models in CodeEditorBench, highlighting differences in model performance based on problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities. We will release all prompts and datasets to enable the community to expand the dataset and benchmark emerging LLMs. By introducing CodeEditorBench, we contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.

4/9/2024

cs.SE cs.AI cs.CL cs.LG

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li

Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports both automatic results, accompanied by a detailed analysis.

5/17/2024

cs.CL