Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice

2404.14901

Published 4/24/2024 by Ranim Khojah, Mazen Mohamad, Philipp Leitner, Francisco Gomes de Oliveira Neto

Abstract

Large Language Models (LLMs) are frequently discussed in academia and the general public as support tools for virtually any use case that relies on the production of text, including software engineering. Currently there is much debate, but little empirical evidence, regarding the practical usefulness of LLM-based tools such as ChatGPT for engineers in industry. We conduct an observational study of 24 professional software engineers who have been using ChatGPT over a period of one week in their jobs, and qualitatively analyse their dialogues with the chatbot as well as their overall experience (as captured by an exit survey). We find that, rather than expecting ChatGPT to generate ready-to-use software artifacts (e.g., code), practitioners more often use ChatGPT to receive guidance on how to solve their tasks or learn about a topic in more abstract terms. We also propose a theoretical framework for how (i) purpose of the interaction, (ii) internal factors (e.g., the user's personality), and (iii) external factors (e.g., company policy) together shape the experience (in terms of perceived usefulness and trust). We envision that our framework can be used by future research to further the academic discussion on LLM usage by software engineering practitioners, and to serve as a reference point for the design of future empirical LLM research in this domain.

Overview

Researchers conducted an observational study of 24 software engineers using ChatGPT for one week in their jobs.
They analyzed the engineers' interactions with ChatGPT and their overall experience.
The study found that engineers used ChatGPT more for guidance and learning than for generating ready-to-use code.
The researchers propose a framework to understand how the purpose of the interaction, internal factors (e.g., the user's personality), and external factors (e.g., company policy) shape the user's experience with ChatGPT, in terms of perceived usefulness and trust.

Plain English Explanation

Researchers wanted to understand how software engineers in the real world are using large language models (LLMs) like ChatGPT to help with their work. They studied 24 professional engineers who used ChatGPT for one week on the job.

The researchers looked at the engineers' conversations with ChatGPT and asked them about their overall experience in an exit survey. They found that, rather than expecting ChatGPT to hand them finished code, the engineers more often used it to get guidance on how to solve their tasks or to learn about topics in a more general way.

The researchers also came up with a framework to explain what shapes an engineer's experience with ChatGPT. They think it depends on three main things: 1) the purpose of the interaction (e.g., getting guidance vs. generating code), 2) the engineer's own personality and preferences, and 3) external factors like company policies.

The researchers hope this framework can help guide future research on how software engineers use LLMs and serve as a reference point for designing future empirical studies in this area.

Technical Explanation

The researchers conducted an observational study of 24 professional software engineers who used ChatGPT over the course of one week in their regular jobs. They qualitatively analyzed the engineers' dialogues with the chatbot as well as their overall experience, which was captured through an exit survey.

The key finding was that the engineers tended to use ChatGPT more for guidance and learning than for generating ready-to-use software artifacts (like code). The researchers propose a theoretical framework to explain how the user's experience with ChatGPT is shaped by three main factors:

The purpose of the interaction (e.g., getting guidance vs. generating code)
Internal factors like the user's personality and preferences
External factors such as company policies around using AI tools

This framework is intended to further the academic discussion of how software engineering practitioners use LLMs and to serve as a reference point for the design of future empirical LLM research in this domain.
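To make the framework's three components concrete, the following is a minimal sketch of how coded dialogues might be represented and tallied by purpose. The class names, the purpose labels, and the sample records are all hypothetical illustrations; they are not the coding scheme or the data used in the study.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Purpose(Enum):
    # Hypothetical labels, loosely following the purposes named in the abstract.
    ARTIFACT_GENERATION = "artifact generation"
    GUIDANCE = "guidance"
    LEARNING = "learning"


@dataclass
class Interaction:
    """One coded dialogue, described by the framework's three components."""
    purpose: Purpose
    internal_factors: dict = field(default_factory=dict)  # e.g. personality
    external_factors: dict = field(default_factory=dict)  # e.g. company policy


def purpose_shares(interactions):
    """Return the fraction of interactions observed for each purpose."""
    counts = Counter(i.purpose for i in interactions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {purpose: n / total for purpose, n in counts.items()}


# Made-up records for illustration only; not data from the study.
sample = [
    Interaction(Purpose.GUIDANCE,
                {"personality": "cautious"},
                {"policy": "no proprietary code"}),
    Interaction(Purpose.LEARNING),
    Interaction(Purpose.GUIDANCE),
    Interaction(Purpose.ARTIFACT_GENERATION),
]
shares = purpose_shares(sample)
print(shares[Purpose.GUIDANCE])  # 0.5
```

A tally like this only summarizes how often each purpose occurs; the study itself relies on qualitative analysis of the dialogues and the exit survey.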

Critical Analysis

The researchers acknowledge several limitations of their study. First, the sample of 24 engineers is relatively small, which may limit the generalizability of the findings. Additionally, the one-week observation period may not have been long enough to capture the full range of how engineers use ChatGPT over time.

The researchers also note that their study was observational, so they cannot make strong claims about causation. It's possible that other factors beyond the three in their framework also shape the user experience.

Further research could explore these issues in more depth, such as by conducting a larger-scale study across multiple organizations or using experimental methods to test the proposed framework. It would also be valuable to gather more quantitative data on the specific ways engineers use ChatGPT and other LLMs to support their work.

Conclusion

This study provides an important first step in understanding how professional software engineers are actually using large language models like ChatGPT in their day-to-day work. The researchers' framework offers a useful theoretical lens for guiding future research and design efforts in this area.

As LLMs become more advanced and integrated into various industries, it will be crucial to continue studying their practical usefulness and impact, especially for knowledge-intensive fields like software engineering. This study suggests that engineers may leverage these tools more for guidance and learning than for fully automating their tasks, which has implications for how we think about the role of AI in the workplace.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Evaluation of ChatGPT Usability as A Code Generation Tool

Tanha Miah, Hong Zhu

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding on whether to use an LLM in software production. This paper proposes a user-centric method. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics the uses of LLMs, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates the performance based on user experiences in the uses of LLMs as a tool. The paper reports an application of the method in the evaluation of ChatGPT usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code although it may fail on hard programming tasks. The user experiences are good, with the overall average number of attempts being 1.61 and the average time of completion being 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5. Our experiment also shows that it is hard for human developers to learn from experiences to improve the skill of using ChatGPT to generate code.

4/10/2024

cs.SE cs.AI

ChatGPT as an inventor: Eliciting the strengths and weaknesses of current large language models against humans in engineering design

Daniel Nygård Ege, Henrik H. Øvrebø, Vegar Stubberud, Martin Francis Berg, Christer Elverum, Martin Steinert, Håvard Vestad

This study compares the design practices and performance of ChatGPT 4.0, a large language model (LLM), against graduate engineering students in a 48-hour prototyping hackathon, based on a dataset comprising more than 100 prototypes. The LLM participated by instructing two participants who executed its instructions and provided objective feedback, generated ideas autonomously and made all design decisions without human intervention. The LLM exhibited similar prototyping practices to human participants and finished second among six teams, successfully designing and providing building instructions for functional prototypes. The LLM's concept generation capabilities were particularly strong. However, the LLM prematurely abandoned promising concepts when facing minor difficulties, added unnecessary complexity to designs, and experienced design fixation. Communication between the LLM and participants was challenging due to vague or unclear descriptions, and the LLM had difficulty maintaining continuity and relevance in answers. Based on these findings, six recommendations for implementing an LLM like ChatGPT in the design process are proposed, including leveraging it for ideation, ensuring human oversight for key decisions, implementing iterative feedback loops, prompting it to consider alternatives, and assigning specific and manageable tasks at a subsystem level.

4/30/2024

cs.HC

ChatGPT Is Here to Help, Not to Replace Anybody -- An Evaluation of Students' Opinions On Integrating ChatGPT In CS Courses

Bruno Pereira Cipriano, Pedro Alves

Large Language Models (LLMs) like GPT and Bard are capable of producing code based on textual descriptions, with remarkable efficacy. Such technology will have profound implications for computing education, raising concerns about cheating, excessive dependence, and a decline in computational thinking skills, among others. There has been extensive research on how teachers should handle this challenge, but it is also important to understand how students feel about this paradigm shift. In this research, 52 first-year CS students were surveyed in order to assess their views on technologies with code-generation capabilities, both from academic and professional perspectives. Our findings indicate that while students generally favor the academic use of GPT, they don't over-rely on it, only mildly asking for its help. Although most students benefit from GPT, some struggle to use it effectively, underscoring the need for specific GPT training. Opinions on GPT's impact on their professional lives vary, but there is a consensus on its importance in academic practice.

4/29/2024

cs.ET cs.AI cs.HC

If the Machine Is As Good As Me, Then What Use Am I? -- How the Use of ChatGPT Changes Young Professionals' Perception of Productivity and Accomplishment

Charlotte Kobiella, Yarhy Said Flores López, Fiona Draxler, Albrecht Schmidt

Large language models (LLMs) like ChatGPT have been widely adopted in work contexts. We explore the impact of ChatGPT on young professionals' perception of productivity and sense of accomplishment. We collected LLMs' main use cases in knowledge work through a preliminary study, which served as the basis for a two-week diary study with 21 young professionals reflecting on their ChatGPT use. Findings indicate that ChatGPT enhanced some participants' perceptions of productivity and accomplishment by enabling greater creative output and satisfaction from efficient tool utilization. Others experienced decreased perceived productivity and accomplishment, driven by a diminished sense of ownership, perceived lack of challenge, and mediocre results. We found that the suitability of task delegation to ChatGPT varies strongly depending on the task nature. It's especially suitable for comprehending broad subject domains, generating creative solutions, and uncovering new information. It's less suitable for research tasks due to hallucinations, which necessitate extensive validation.

4/22/2024

cs.HC