Comparative Analysis of ChatGPT, GPT-4, and Microsoft Bing Chatbots for GRE Test

2312.03719

Published 4/9/2024 by Mohammad Abu-Haifa, Bara'a Etawi, Huthaifa Alkhatatbeh, Ayman Ababneh

Abstract

This research paper presents an analysis of how well three artificial intelligence chatbots: Bing, ChatGPT, and GPT-4, perform when answering questions from standardized tests. The Graduate Record Examination is used in this paper as a case study. A total of 137 questions with different forms of quantitative reasoning and 157 questions with verbal categories were used to assess their capabilities. This paper presents the performance of each chatbot across various skills and styles tested in the exam. The proficiency of these chatbots in addressing image-based questions is also explored, and the uncertainty level of each chatbot is illustrated. The results show varying degrees of success across the chatbots, where GPT-4 served as the most proficient, especially in complex language understanding tasks and image-based questions. Results highlight the ability of these chatbots to pass the GRE with a high score, which encourages the use of these chatbots in test preparation. The results also show how important it is to ensure that, if the test is administered online, as it was during COVID, the test taker is segregated from these resources for a fair competition on higher education opportunities.

Overview

This paper examines the performance of three AI chatbots - Bing, ChatGPT, and GPT-4 - in answering questions from the Graduate Record Examination (GRE).
The study assessed the chatbots' capabilities across different question types, including quantitative reasoning and verbal categories, as well as their ability to handle image-based questions.
The results show that GPT-4 outperformed the other chatbots, particularly in complex language understanding tasks and image-based questions.
The paper highlights the ability of these chatbots to potentially score well on the GRE, which raises concerns about fairness in higher education admissions if the test is administered online.

Plain English Explanation

This research paper looked at how well three different AI chatbots - Bing, ChatGPT, and GPT-4 - performed when answering questions from a standardized test called the Graduate Record Examination (GRE). The GRE is a common exam that many graduate school applicants take.

The researchers gave the chatbots 137 math-related questions and 157 verbal questions from the GRE to see how they would do. They also tested the chatbots' abilities to answer questions that involved images.

The results showed that the most advanced chatbot, GPT-4, was the best at answering the GRE questions, especially the ones that required deeper understanding of language and analysis of images. The other chatbots, Bing and ChatGPT, also performed well but not as strongly as GPT-4.

These findings suggest that these AI chatbots are becoming so advanced that they could potentially score very high on the GRE. This raises concerns about fairness in the graduate school admissions process, especially if the GRE is taken online and students have access to these chatbots during the test. The researchers emphasize the importance of ensuring test-takers cannot use these chatbots when taking the GRE, to maintain a level playing field.

Technical Explanation

The paper used a total of 294 questions from the GRE, including 137 quantitative reasoning questions and 157 verbal questions, to assess the performance of Bing, ChatGPT, and GPT-4. The chatbots were tasked with answering these questions, and their responses were evaluated across different skills and question styles.
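The per-section scoring described above can be sketched as a simple tally. This is a minimal illustration rather than the authors' evaluation code: the record layout `(chatbot, section, is_correct)`, the function name `accuracy_by_section`, and the toy records are all assumptions, and the placeholder data does not reflect any result from the paper. Only the question counts (137 quantitative, 157 verbal) come from the source.

```python
from collections import defaultdict

# Question counts reported in the paper.
QUESTION_COUNTS = {"quantitative": 137, "verbal": 157}
TOTAL_QUESTIONS = sum(QUESTION_COUNTS.values())  # 137 + 157 = 294


def accuracy_by_section(records):
    """Return {chatbot: {section: fraction correct}} from graded records.

    Each record is a (chatbot, section, is_correct) tuple. This layout is
    an assumption made for illustration, not the paper's data format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for chatbot, section, is_correct in records:
        total[(chatbot, section)] += 1
        correct[(chatbot, section)] += int(is_correct)

    result = defaultdict(dict)
    for (chatbot, section), n in total.items():
        result[chatbot][section] = correct[(chatbot, section)] / n
    return {bot: dict(sections) for bot, sections in result.items()}


# Placeholder records with hypothetical chatbot names; NOT results from the paper.
toy_records = [
    ("bot_a", "quantitative", True),
    ("bot_a", "quantitative", False),
    ("bot_a", "verbal", True),
    ("bot_b", "quantitative", True),
    ("bot_b", "verbal", False),
]
toy_accuracy = accuracy_by_section(toy_records)
```

Keeping the tally keyed by both chatbot and section makes it straightforward to break results down further (for example by question style) by adding another field to each record.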

The researchers also explored the chatbots' proficiency in handling image-based questions. Additionally, the uncertainty level of each chatbot was assessed and illustrated.

The results showed that GPT-4 outperformed the other chatbots, particularly in complex language understanding tasks and image-based questions. This suggests that recent advances in large language models, such as the one behind GPT-4, have improved these systems' ability to comprehend and reason over visual and textual information.

The findings highlight the potential for these chatbots to achieve high scores on the GRE, which could have implications for the fairness and integrity of the graduate school admissions process, especially if the test is administered online. The paper emphasizes the need to ensure that test-takers are segregated from these AI resources during the examination to maintain a level playing field.

Critical Analysis

The paper provides valuable insights into the current capabilities of AI chatbots in addressing standardized test questions, particularly the GRE. However, it is important to consider several caveats and limitations:

The study focuses on a single standardized test, the GRE, and the findings may not necessarily generalize to other types of exams or assessment frameworks. Further research is needed to explore the chatbots' performance on a wider range of standardized tests.
The paper does not delve into the specific strategies or approaches used by the chatbots to solve the GRE questions. Understanding these underlying mechanisms could provide deeper insights into the strengths and limitations of each chatbot.
The study flags the risk that these chatbots could be used for academic dishonesty or unfair advantage in the admissions process, but it does not examine how such misuse could be detected or prevented. Further discussion and policy work may be necessary to address these ethical concerns.
The paper's findings highlight the rapid advancements in large language models, but it is essential to continue monitoring and evaluating these technologies as they evolve, to ensure their responsible and ethical deployment in educational contexts.

Conclusion

This research paper investigates the capabilities of three prominent AI chatbots - Bing, ChatGPT, and GPT-4 - in answering questions from the Graduate Record Examination (GRE). The results indicate that GPT-4 was the most proficient of the three, especially in complex language understanding tasks and image-based questions.

These findings raise important considerations about the potential implications of these chatbots on the fairness and integrity of the graduate school admissions process, especially if the GRE is administered online. The paper emphasizes the need to ensure that test-takers are segregated from these AI resources during the examination to maintain a level playing field and prevent unfair advantages.

As AI technologies continue to advance, it will be crucial for educational institutions, policymakers, and researchers to closely monitor and address the ethical considerations surrounding the use of these chatbots in academic settings. Ongoing collaboration and thoughtful policy decisions will be essential to ensure that the admissions process remains fair and equitable for all applicants.

Related Papers

Does GPT-4 pass the Turing test?

Cameron R. Jones, Benjamin K. Bergen

We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants' decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.

4/23/2024

cs.AI cs.CL

Google or ChatGPT: Who is the Better Helper for University Students

Mengmeng Zhang, Xiantong Yang

Using information technology tools for academic help-seeking among college students has become a popular trend. In the evolutionary process between Generation Artificial Intelligence (GenAI) and traditional search engines, when students face academic challenges, do they tend to prefer Google, or are they more inclined to utilize ChatGPT? And what are the key factors influencing learners' preference to use ChatGPT for academic help-seeking? These relevant questions merit attention. The study employed a mixed-methods research design to investigate Taiwanese university students' online academic help-seeking preferences. The results indicated that students tend to prefer using ChatGPT to seek academic assistance, reflecting the potential popularity of GenAI in the educational field. Additionally, in comparing seven machine learning algorithms, the Random Forest and LightGBM algorithms exhibited superior performance. These two algorithms were employed to evaluate the predictive capability of 18 potential factors. It was found that GenAI fluency, GenAI distortions, and age were the core factors influencing how university students seek academic help. Overall, this study underscores that educators should prioritize the cultivation of students' critical thinking skills, while technical personnel should enhance the fluency and reliability of ChatGPT and Google searches and explore the integration of chat and search functions to achieve optimal balance.

5/2/2024

cs.HC

ChatGPT Is Here to Help, Not to Replace Anybody -- An Evaluation of Students' Opinions On Integrating ChatGPT In CS Courses

Bruno Pereira Cipriano, Pedro Alves

Large Language Models (LLMs) like GPT and Bard are capable of producing code based on textual descriptions, with remarkable efficacy. Such technology will have profound implications for computing education, raising concerns about cheating, excessive dependence, and a decline in computational thinking skills, among others. There has been extensive research on how teachers should handle this challenge but it is also important to understand how students feel about this paradigm shift. In this research, 52 first-year CS students were surveyed in order to assess their views on technologies with code-generation capabilities, both from academic and professional perspectives. Our findings indicate that while students generally favor the academic use of GPT, they don't over rely on it, only mildly asking for its help. Although most students benefit from GPT, some struggle to use it effectively, urging the need for specific GPT training. Opinions on GPT's impact on their professional lives vary, but there is a consensus on its importance in academic practice.

4/29/2024

cs.ET cs.AI cs.HC

A Survey on the Real Power of ChatGPT

Ming Liu, Ran Liu, Hua Wang, Wray Buntine

ChatGPT has changed the AI community and an active research line is the performance evaluation of ChatGPT. A key challenge for the evaluation is that ChatGPT is still closed-source and traditional benchmark datasets may have been used by ChatGPT as the training data. In this paper, (i) we survey recent studies which uncover the real performance levels of ChatGPT in seven categories of NLP tasks, (ii) review the social implications and safety issues of ChatGPT, and (iii) emphasize key challenges and opportunities for its evaluation. We hope our survey can shed some light on its blackbox manner, so that researchers are not misled by its surface generation.

5/3/2024

cs.CL cs.AI