Using ChatGPT to Score Essays and Short-Form Constructed Responses

Read original: arXiv:2408.09540 - Published 8/20/2024 by Mark D. Shermis

Overview

This study examined whether ChatGPT's large language models could match the scoring accuracy of the human and machine scores from the Automated Student Assessment Prize (ASAP) competition.
The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost.
ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics.

Plain English Explanation

The study asked whether ChatGPT, a large language model, could score essays and short-form constructed responses as accurately as the human and machine scores from the ASAP competition. The researchers tested several prediction models, such as linear regression and random forest, and compared the resulting scores with those of human raters. Agreement was measured with quadratic weighted kappa (QWK), a statistic that penalizes large disagreements between two sets of scores more heavily than small ones.

The results showed that ChatGPT's gradient boost model (a type of machine learning model) reached QWK scores close to those of human raters on some data sets, but its overall performance was inconsistent and often below the human scores. ChatGPT therefore still has room for improvement, especially in handling biases and ensuring fair scoring.

Despite these challenges, the study found that ChatGPT has the potential to be used for efficient scoring, especially if it's fine-tuned for specific domains. However, the researchers concluded that ChatGPT would need more development before it could be relied upon for high-stakes assessments. Future research should focus on improving the model's accuracy, addressing ethical considerations, and exploring ways to combine ChatGPT with other methods to get the best results.

Technical Explanation

The study evaluated whether ChatGPT, a large language model, could match the scoring accuracy of the human and machine scores from the Automated Student Assessment Prize (ASAP) competition. The researchers tested various prediction models, including linear regression, random forest, gradient boost, and boost, to assess ChatGPT's capabilities.

Performance was measured with the quadratic weighted kappa (QWK) metric, which quantifies agreement between ChatGPT's scores and the human raters' scores. A value of 1 indicates perfect agreement, a value of 0 indicates agreement no better than chance, and each disagreement is penalized in proportion to the square of the distance between the two scores. The results showed that while ChatGPT's gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than the human scores.
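
To make the metric concrete, here is a minimal Python sketch of quadratic weighted kappa for two sets of integer scores. It is not taken from the study. The function name `quadratic_weighted_kappa`, its parameters, and the example scores are assumptions chosen for illustration, and returning 1.0 when the expected disagreement is zero is a convention of this sketch rather than part of the metric's definition.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two lists of integer scores, using quadratic weights.

    Hypothetical helper for illustration; not the study's code.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("the score scale needs at least two categories")
    n = len(rater_a)

    # Observed counts: observed[i][j] = items scored i by rater A and j by rater B.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal histograms of each rater's scores.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Quadratic weight: 0 on the diagonal, 1 at the maximum distance.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            # Count expected by chance if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * expected

    if expected_disagreement == 0:
        # Both raters used one identical score throughout (sketch convention).
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


# Made-up example scores on a 1-4 scale (not data from the study).
human_scores = [1, 2, 3, 4, 4, 2]
model_scores = [1, 2, 3, 4, 3, 2]
print(quadratic_weighted_kappa(human_scores, model_scores, 1, 4))
```

Because the weight grows with the square of the score difference, a model that is off by two points on a single essay is penalized four times as much as one that is off by one point. That property suits ordinal essay scores better than exact-match accuracy does.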

The study highlighted the need for further refinement of ChatGPT's scoring capabilities, particularly in handling biases and ensuring scoring fairness. Despite these challenges, the researchers found that ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning.

Critical Analysis

The study acknowledges the limitations of ChatGPT's current performance and the need for further development to make it a reliable tool for high-stakes assessments. The researchers raise important concerns about the model's inconsistency and potential biases, which must be addressed before ChatGPT can be widely adopted for scoring purposes.

While the study suggests that ChatGPT could complement human scoring, it is essential to consider the ethical implications of relying on an AI system for such critical tasks. The researchers emphasize the importance of addressing these ethical considerations in future research.

Additionally, the study highlights the need to explore hybrid models that combine ChatGPT with empirical methods to improve the overall accuracy and reliability of the scoring process. Such an approach could draw on the strengths of both to deliver more robust and equitable assessments.

Conclusion

The study's findings suggest that while ChatGPT has the potential to enhance scoring efficiency, it still requires further refinement and development to match the accuracy and consistency of human raters. The researchers recommend continued research to improve the model's performance, address ethical concerns, and explore hybrid approaches that integrate ChatGPT with other empirical methods. By addressing these challenges, the research community can work towards developing AI-powered scoring systems that are reliable, fair, and aligned with the needs of educational and assessment domains.

Related Papers

Using ChatGPT to Score Essays and Short-Form Constructed Responses

Mark D. Shermis

This study aimed to determine if ChatGPT's large language models could match the scoring accuracy of human and machine scores from the ASAP competition. The investigation focused on various prediction models, including linear regression, random forest, gradient boost, and boost. ChatGPT's performance was evaluated against human raters using quadratic weighted kappa (QWK) metrics. Results indicated that while ChatGPT's gradient boost model achieved QWKs close to human raters for some data sets, its overall performance was inconsistent and often lower than human scores. The study highlighted the need for further refinement, particularly in handling biases and ensuring scoring fairness. Despite these challenges, ChatGPT demonstrated potential for scoring efficiency, especially with domain-specific fine-tuning. The study concludes that ChatGPT can complement human scoring but requires additional development to be reliable for high-stakes assessments. Future research should improve model accuracy, address ethical considerations, and explore hybrid models combining ChatGPT with empirical methods.

8/20/2024

Can we trust the evaluation on ChatGPT?

Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study of the task of stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.

8/23/2024

Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs

Mike Thelwall

Evaluating the quality of academic journal articles is a time consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process. This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts. The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66), and 4o-mini (0.66). The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones. Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.

8/14/2024

Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version)

Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, Gareth Tyson

Harnessing the potential of large language models (LLMs) like ChatGPT can help address social challenges through inclusive, ethical, and sustainable means. In this paper, we investigate the extent to which ChatGPT can annotate data for social computing tasks, aiming to reduce the complexity and cost of undertaking web research. To evaluate ChatGPT's potential, we re-annotate seven datasets using ChatGPT, covering topics related to pressing social issues like COVID-19 misinformation, social bot deception, cyberbully, clickbait news, and the Russo-Ukrainian War. Our findings demonstrate that ChatGPT exhibits promise in handling these data annotation tasks, albeit with some challenges. Across the seven datasets, ChatGPT achieves an average annotation F1-score of 72.00%. Its performance excels in clickbait news annotation, correctly labeling 89.66% of the data. However, we also observe significant variations in performance across individual labels. Our study reveals predictable patterns in ChatGPT's annotation performance. Thus, we propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task. Researchers can use this to identify where ChatGPT might be suitable for their annotation requirements. We show that GPT-Rater effectively predicts ChatGPT's performance. It performs best on a clickbait headlines dataset by achieving an average F1-score of 95.00%. We believe that this research opens new avenues for analysis and can reduce barriers to engaging in social computing research.

7/10/2024