Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs

2406.04482

Published 6/10/2024 by Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan

Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs

Abstract

Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.

Create account to get full access

Overview

This research paper explores the use of large language models (LLMs) to automatically detect bugs in text-based games powered by LLMs.
The authors present a system called "DejaBoom!" that uses LLMs to generate text-based game narratives and then leverages LLMs to identify potential bugs or inconsistencies in the generated content.
The paper demonstrates the feasibility of using LLMs for both game generation and bug detection, highlighting the potential of this approach to streamline the development of text-based games.

Plain English Explanation

The paper discusses a new system called "DejaBoom!" that uses large language models to create and test text-based games. Text-based games are interactive stories where players navigate through a virtual world by typing in commands, and the game responds with descriptions of the environment and events.

The researchers have developed a way to use LLMs to automatically generate the narratives and dialogue for these text-based games. This allows for the creation of complex, engaging game worlds without the need for extensive manual writing by game developers.

However, the researchers recognized that relying on LLMs for game generation could also introduce potential bugs or inconsistencies in the game content. To address this, they've developed a novel approach that uses LLMs to identify and flag potential issues within the generated game narratives.

By combining the game generation and bug detection capabilities of LLMs, the "DejaBoom!" system aims to streamline the development of text-based games, making it easier for developers to create high-quality interactive experiences. This research builds on previous work exploring the use of large language models for software vulnerability detection and player-driven game narrative generation.

Technical Explanation

The paper presents the "DejaBoom!" system, which leverages large language models to both generate text-based game narratives and detect potential bugs or inconsistencies in the generated content.

The game generation component of "DejaBoom!" uses LLMs to create interactive game worlds, character dialogues, and player interactions. The authors demonstrate the ability of their system to generate coherent and engaging text-based game narratives.

To address the potential for bugs and inconsistencies in the LLM-generated content, the researchers developed a novel bug detection approach. This component of "DejaBoom!" uses LLMs to analyze the generated game narratives and identify potential issues, such as contradictions, logical inconsistencies, or violations of game rules.

The authors evaluated the performance of their bug detection system on a range of test cases, demonstrating its effectiveness in identifying various types of bugs. They also conducted user studies to assess the quality and coherence of the LLM-generated game narratives.

The results of this research suggest that the integration of game generation and bug detection capabilities powered by LLMs can streamline the development of text-based games, reducing the manual effort required while maintaining high-quality interactive experiences for players.

Critical Analysis

The research presented in this paper offers a promising approach to leveraging the capabilities of large language models for game development. By combining LLM-powered game generation with automated bug detection, the "DejaBoom!" system demonstrates the potential to reduce development time and costs while maintaining high-quality text-based game experiences.

However, the paper does acknowledge some limitations of the current approach. For example, the bug detection system may not be able to identify all types of bugs, particularly those that require more complex reasoning or domain-specific knowledge. Additionally, the authors note that the performance of the system may be sensitive to the quality and diversity of the training data used for the LLMs.

Further research could explore ways to enhance the bug detection capabilities of the system, potentially by incorporating additional techniques or integrating the system with other software testing tools. Additionally, investigating the scalability of the approach to larger and more complex game scenarios would be valuable.

Overall, this paper highlights the exciting possibilities of leveraging large language models for user interfaces and interactive applications, and the "DejaBoom!" system represents a significant step forward in the integration of LLMs for game development and quality assurance.

Conclusion

This research paper presents a novel system called "DejaBoom!" that utilizes large language models (LLMs) for both the generation of text-based game narratives and the automatic detection of potential bugs or inconsistencies in the generated content.

By combining these two key capabilities, the "DejaBoom!" system demonstrates the potential to streamline the development of high-quality text-based games, reducing the manual effort required while maintaining engaging interactive experiences for players.

The results of this research highlight the growing importance of integrating large language models into automated software development and testing tools, as the field of AI-powered game generation and quality assurance continues to evolve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Leveraging Large Language Models for Efficient Failure Analysis in Game Development

Leonardo Marini, Linus Gissl'en, Alessandro Sestini

In games, and more generally in the field of software development, early detection of bugs is vital to maintain a high quality of the final product. Automated tests are a powerful tool that can catch a problem earlier in development by executing periodically. As an example, when new code is submitted to the code base, a new automated test verifies these changes. However, identifying the specific change responsible for a test failure becomes harder when dealing with batches of changes -- especially in the case of a large-scale project such as a AAA game, where thousands of people contribute to a single code base. This paper proposes a new approach to automatically identify which change in the code caused a test to fail. The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure. We investigate the effectiveness of our approach with quantitative and qualitative evaluations. Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year. We further evaluated our model through a user study to assess the utility and usability of the tool from a developer perspective, resulting in a significant reduction in time -- up to 60% -- spent investigating issues.

6/12/2024

cs.LG

Generating Games via LLMs: An Investigation with Video Game Description Language

Chengpeng Hu, Yunlong Zhao, Jialin Liu

Recently, the emergence of large language models (LLMs) has unlocked new opportunities for procedural content generation. However, recent attempts mainly focus on level generation for specific games with defined game rules such as Super Mario Bros. and Zelda. This paper investigates the game generation via LLMs. Based on video game description language, this paper proposes an LLM-based framework to generate game rules and levels simultaneously. Experiments demonstrate how the framework works with prompts considering different combinations of context. Our findings extend the current applications of LLMs and offer new insights for generating new games in the area of procedural content generation.

5/31/2024

cs.AI

Player-Driven Emergence in LLM-Driven Game Narrative

Xiangyu Peng, Jessica Quaye, Sudha Rao, Weijia Xu, Portia Botchway, Chris Brockett, Nebojsa Jojic, Gabriel DesGarennes, Ken Lobb, Michael Xu, Jorge Leandro, Claire Jin, Bill Dolan

We explore how interaction with large language models (LLMs) can give rise to emergent behaviors, empowering players to participate in the evolution of game narratives. Our testbed is a text-adventure game in which players attempt to solve a mystery under a fixed narrative premise, but can freely interact with non-player characters generated by GPT-4, a large language model. We recruit 28 gamers to play the game and use GPT-4 to automatically convert the game logs into a node-graph representing the narrative in the player's gameplay. We find that through their interactions with the non-deterministic behavior of the LLM, players are able to discover interesting new emergent nodes that were not a part of the original narrative but have potential for being fun and engaging. Players that created the most emergent nodes tended to be those that often enjoy games that facilitate discovery, exploration and experimentation.

6/5/2024

cs.CL cs.AI

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models

Jiale Cheng, Yida Lu, Xiaotao Gu, Pei Ke, Xiao Liu, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.

6/26/2024

cs.CL cs.AI cs.LG