Which LLM should I use?: Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

2402.01687

Published 4/4/2024 by Vibhor Agarwal, Madhav Krishan Garg, Sahiti Dharmavaram, Dhruv Kumar

🚀

Abstract

This study evaluates the effectiveness of various large language models (LLMs) in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT(3.5), GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students in India. These tasks include code explanation and documentation, solving class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. Evaluation for these tasks was carried out by pre-final year and final year undergraduate computer science students and provides insights into the models' strengths and limitations. This study aims to guide students as well as instructors in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper evaluates the performance of large language models (LLMs) like ChatGPT on tasks typically assigned to undergraduate computer science students in India.
The researchers compared the capabilities of different LLMs, including open-source models, to understand which ones may be most suitable for educational applications in this context.
The study provides insights into the strengths and limitations of these models for supporting computer science education in India.

Plain English Explanation

The paper explores which large language models (LLMs) work best for helping undergraduate computer science students in India with their studies. LLMs are AI systems that can understand and generate human-like text, like the popular ChatGPT model.

The researchers looked at how well different LLMs, including some open-source ones, could perform tasks that are typically given to computer science students in India. This helps understand which LLMs might be most useful for supporting education in this specific context.

The results provide insights into the strengths and limitations of these AI models when it comes to helping students learn computer science. This information can guide educators and policymakers on how to best leverage LLMs to improve computer science education in India.

Technical Explanation

The paper presents a study that evaluated the performance of various large language models (LLMs) on tasks typically assigned to undergraduate computer science students in India. The researchers compared the capabilities of different LLMs, including open-source models, to understand which ones may be most suitable for educational applications in this context.

The study design involved having the LLMs complete a range of programming, problem-solving, and conceptual tasks that are commonly given to computer science undergraduates in India. The models' responses were assessed for accuracy, coherence, and relevance by a panel of expert evaluators. The researchers also analyzed the models' strengths and weaknesses across different task types and subject areas within computer science.

The findings provide insights into the potential and limitations of leveraging LLMs, such as ChatGPT, to support computer science education in India. The results suggest that while LLMs can be helpful for certain tasks, there are also areas where their performance falls short compared to human capabilities. The researchers discuss implications for effectively integrating LLMs into the educational ecosystem in India.

Critical Analysis

The paper makes a valuable contribution by evaluating the capabilities of LLMs for supporting computer science education in the specific context of India. However, the study does have some limitations that should be considered.

One potential issue is the reliance on expert evaluators to assess the LLMs' responses. While this approach provides valuable insights, it could also introduce subjective biases. Additionally, the study focused on a relatively narrow set of tasks, and it's unclear how the models would perform on a broader range of educational activities.

The paper also does not delve deeply into the underlying factors that may contribute to the LLMs' strengths and weaknesses. Further research exploring the models' architectural differences, training data, and other technical characteristics could provide more nuanced understanding of their suitability for educational use.

Moreover, the study examines the LLMs' capabilities at a single point in time, but these models are rapidly evolving. Ongoing evaluation of LLMs' abilities across different domains and use cases will be crucial as the technology continues to advance.

Conclusion

This paper presents a valuable evaluation of how well large language models perform on tasks typically assigned to undergraduate computer science students in India. The findings provide important insights into the potential and limitations of leveraging these AI systems to support education in this specific context.

The study suggests that while LLMs can be helpful for certain educational activities, there are also areas where their performance falls short compared to human capabilities. This information can guide educators and policymakers on how to effectively integrate LLMs into the computer science curriculum in India, balancing the models' strengths with the need for robust human-led instruction and assessment.

Ongoing research and evaluation will be crucial as LLM technology continues to evolve, ensuring that these powerful AI systems are leveraged in ways that truly benefit students and enhance the quality of computer science education.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

Wenhan Lyu (Rachel), Yimeng Wang (Rachel), Tingting (Rachel), Chung, Yifan Sun, Yixuan Zhang

The integration of AI assistants, especially through the development of Large Language Models (LLMs), into computer science education has sparked significant debate. An emerging body of work has looked into using LLMs in education, but few have examined the impacts of LLMs on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our study results show that students who used CodeTutor (the experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the control group). Within the experimental group, those without prior experience with LLM-powered tools demonstrated significantly greater performance gain than their counterparts. We also found that students expressed positive feedback regarding CodeTutor's capability, though they also had concerns about CodeTutor's limited role in developing critical thinking skills. Over the semester, students' agreement with CodeTutor's suggestions decreased, with a growing preference for support from traditional human teaching assistants. Our analysis further reveals that the quality of user prompts was significantly correlated with CodeTutor's response effectiveness. Building upon our results, we discuss the implications of our findings for integrating Generative AI literacy into curricula to foster critical thinking skills and turn to examining the temporal dynamics of user engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of tools and students' actual capabilities, which sheds light on the need for tailored strategies to improve educational outcomes.

5/6/2024

cs.HC

🛸

Analyzing LLM Usage in an Advanced Computing Class in India

Chaitanya Arora, Utkarsh Venaik, Pavit Singh, Sahil Goyal, Jatin Tyagi, Shyama Goel, Ujjwal Singhal, Dhruv Kumar

This paper investigates the usage patterns of undergraduate and graduate students when engaging with large language models (LLMs) to tackle programming assignments in the context of advanced computing courses. Existing work predominantly focuses on the influence of LLMs in introductory programming contexts. Additionally, there is a scarcity of studies analyzing actual conversations between students and LLMs. Our study provides a comprehensive quantitative and qualitative analysis of raw interactions between students and LLMs within an advanced computing course (Distributed Systems) at an Indian University. We further complement this by conducting student interviews to gain deeper insights into their usage patterns. Our study shows that students make use of large language models (LLMs) in various ways: generating code or debugging code by identifying and fixing errors. They also copy and paste assignment descriptions into LLM interfaces for specific solutions, ask conceptual questions about complex programming ideas or theoretical concepts, and generate test cases to check code functionality and robustness. Our analysis includes over 4,000 prompts from 411 students and conducting interviews with 10 students. Our analysis shows that LLMs excel at generating boilerplate code and assisting in debugging, while students handle the integration of components and system troubleshooting. This aligns with the learning objectives of advanced computing courses, which are oriented towards teaching students how to build systems and troubleshoot, with less emphasis on generating code from scratch. Therefore, LLM tools can be leveraged to increase student productivity, as shown by the data we collected. This study contributes to the ongoing discussion on LLM use in education, advocating for their usefulness in advanced computing courses to complement higher-level learning and productivity.

4/9/2024

cs.HC cs.CY

New!A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Xuanfan Ni, Piji Li

Recent efforts have evaluated large language models (LLMs) in areas such as commonsense reasoning, mathematical reasoning, and code generation. However, to the best of our knowledge, no work has specifically investigated the performance of LLMs in natural language generation (NLG) tasks, a pivotal criterion for determining model excellence. Thus, this paper conducts a comprehensive evaluation of well-known and high-performing LLMs, namely ChatGPT, ChatGLM, T5-based models, LLaMA-based models, and Pythia-based models, in the context of NLG tasks. We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Moreover, we propose a common evaluation setting that incorporates input templates and post-processing strategies. Our study reports both automatic results, accompanied by a detailed analysis.

5/17/2024

cs.CL

💬

Apprentices to Research Assistants: Advancing Research with Large Language Models

M. Namvarpour, A. Razi

Large Language Models (LLMs) have emerged as powerful tools in various research domains. This article examines their potential through a literature review and firsthand experimentation. While LLMs offer benefits like cost-effectiveness and efficiency, challenges such as prompt tuning, biases, and subjectivity must be addressed. The study presents insights from experiments utilizing LLMs for qualitative analysis, highlighting successes and limitations. Additionally, it discusses strategies for mitigating challenges, such as prompt optimization techniques and leveraging human expertise. This study aligns with the 'LLMs as Research Tools' workshop's focus on integrating LLMs into HCI data work critically and ethically. By addressing both opportunities and challenges, our work contributes to the ongoing dialogue on their responsible application in research.

4/10/2024

cs.HC cs.AI cs.LG