An Investigation into Misuse of Java Security APIs by Large Language Models

2404.03823

Published 4/8/2024 by Zahra Mousavi, Chadni Islam, Kristen Moore, Alsharif Abuadbba, Muhammad Ali Babar


Abstract

The increasing trend of using Large Language Models (LLMs) for code generation raises the question of their capability to generate trustworthy code. While many researchers are exploring the utility of code generation for uncovering software vulnerabilities, one crucial but often overlooked aspect is the security Application Programming Interfaces (APIs). APIs play an integral role in upholding software security, yet effectively integrating security APIs presents substantial challenges. This leads to inadvertent misuse by developers, thereby exposing software to vulnerabilities. To overcome these challenges, developers may seek assistance from LLMs. In this paper, we systematically assess ChatGPT's trustworthiness in code generation for security API use cases in Java. To conduct a thorough evaluation, we compile an extensive collection of 48 programming tasks for 5 widely used security APIs. We employ both automated and manual approaches to effectively detect security API misuse in the code generated by ChatGPT for these tasks. Our findings are concerning: around 70% of the code instances across 30 attempts per task contain security API misuse, with 20 distinct misuse types identified. Moreover, for roughly half of the tasks, this rate reaches 100%, indicating that there is a long way to go before developers can rely on ChatGPT to securely implement security API code.


Overview

This paper investigates how large language models (LLMs) like ChatGPT misuse Java security APIs, introducing security vulnerabilities into the code they generate.
The researchers analyze code that ChatGPT produced for 48 programming tasks spanning 5 widely used security APIs and identify common types of API misuse, such as incorrect SSL/TLS configuration and insecure cryptographic operations.
Around 70% of the generated code instances contain security API misuse, which highlights the need for improved security practices and safeguards when using LLMs for software development.

Plain English Explanation

The paper examines how powerful AI language models, such as ChatGPT, can inadvertently introduce security vulnerabilities when generating code that uses Java security APIs. These APIs are designed to help developers build secure software, but the researchers found that LLMs don't always use them correctly.

For example, the paper discusses a case where ChatGPT generated code that incorrectly configured an SSL/TLS connection, potentially allowing attackers to access users' personal information. The researchers also identified instances where the LLM used insecure cryptographic operations, which could compromise the confidentiality and integrity of sensitive data.
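To make the certificate-validation problem concrete, here is a minimal Java sketch. It is not taken from the paper: the class and method names (`TlsConfigSketch`, `trustAllManager`, `acceptsEmptyChain`, and so on) are hypothetical, and it only contrasts a "trust everything" trust manager, a well-known misuse pattern, with one backed by the JVM's default trust store. Hostname verification and certificate pinning are out of scope here.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

class TlsConfigSketch {

    // MISUSE (shown only for contrast): accepts every certificate chain without any check.
    static X509TrustManager trustAllManager() {
        return new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
    }

    // Safer: obtain the trust manager backed by the JVM's default trust store.
    static X509TrustManager defaultTrustManager() throws GeneralSecurityException {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init((KeyStore) null); // null selects the default trust store
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                return (X509TrustManager) tm;
            }
        }
        throw new IllegalStateException("no X509TrustManager available");
    }

    // Builds a TLS context that validates server certificates against the default trust store.
    static SSLContext validatingContext() throws GeneralSecurityException {
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, new TrustManager[] { defaultTrustManager() }, null);
        return ctx;
    }

    // Probe: does this trust manager accept an obviously invalid (empty) chain?
    static boolean acceptsEmptyChain(X509TrustManager tm) {
        try {
            tm.checkServerTrusted(new X509Certificate[0], "RSA");
            return true;
        } catch (Exception e) {
            return false; // a validating trust manager is expected to reject the chain
        }
    }

    public static void main(String[] args) throws GeneralSecurityException {
        System.out.println("trust-all accepts empty chain: " + acceptsEmptyChain(trustAllManager()));
        System.out.println("default accepts empty chain:   " + acceptsEmptyChain(defaultTrustManager()));
        System.out.println("context protocol: " + validatingContext().getProtocol());
    }
}
```

The empty-chain probe is only a quick way to show that the first trust manager performs no validation at all; it is not a substitute for testing against real certificates.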

These findings are significant because LLMs are becoming increasingly popular for automating software development tasks. While these models can be powerful tools, the paper highlights the importance of understanding their limitations and potential pitfalls, especially when it comes to security-critical components of an application.

Technical Explanation

The researchers conducted an in-depth analysis of code samples generated by ChatGPT, focusing on its use of Java security APIs. They identified several common types of API misuse, including:

Incorrect SSL/TLS Configuration: The paper presents a case where ChatGPT generated code that failed to properly validate SSL/TLS certificates, potentially allowing attackers to intercept and decrypt sensitive user data. [See related paper: "How Effective Are Neural Networks at Fixing Security Vulnerabilities in Real-World Code?"]
Insecure Cryptographic Operations: The researchers found instances where the LLM used cryptographic algorithms or settings that were outdated or insecure, compromising the confidentiality and integrity of sensitive data. [See related paper: "Deciphering Textual Authenticity: A Generalized Strategy Through the Lens of AI-Generated Content"]
Weak Random Number Generation: ChatGPT sometimes generated code that used predictable random number generators, which could allow attackers to guess sensitive values like cryptographic keys or nonces.
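The cryptographic and random-number misuse types listed above can be illustrated with a short Java sketch. This is an assumption-laden example rather than code from the paper: the class name `CryptoMisuseSketch` and its helper methods are hypothetical, and AES-GCM with a nonce from `SecureRandom` is shown as one commonly recommended alternative, not as the paper's prescribed fix.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Random;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class CryptoMisuseSketch {
    static final int IV_BYTES = 12;  // 96-bit nonce, the usual size for GCM
    static final int TAG_BITS = 128; // authentication tag length

    // MISUSE (for contrast): ECB maps identical plaintext blocks to identical ciphertext blocks.
    static byte[] encryptEcb(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext);
    }

    // Safer: AES-GCM with a fresh random nonce, which is prepended to the ciphertext.
    static byte[] encryptGcm(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    // Reads the nonce back from the first IV_BYTES bytes and verifies the tag while decrypting.
    static byte[] decryptGcm(SecretKey key, byte[] ivAndCiphertext) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
        return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
    }

    // MISUSE (for contrast): java.util.Random is fully determined by its seed.
    static boolean seededRandomIsPredictable(long seed) {
        return new Random(seed).nextLong() == new Random(seed).nextLong();
    }

    static SecretKey newAesKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SecretKey key = newAesKey();
        // Two identical 16-byte blocks make the ECB pattern leak visible.
        byte[] message = "sixteen byte msgsixteen byte msg".getBytes(StandardCharsets.UTF_8);

        byte[] ecb = encryptEcb(key, message);
        boolean repeated = Arrays.equals(
                Arrays.copyOfRange(ecb, 0, 16), Arrays.copyOfRange(ecb, 16, 32));
        System.out.println("ECB repeats ciphertext blocks: " + repeated);

        byte[] sealed = encryptGcm(key, message);
        System.out.println("GCM round trip ok: " + Arrays.equals(message, decryptGcm(key, sealed)));
        System.out.println("seeded Random predictable: " + seededRandomIsPredictable(42L));
    }
}
```

The repeated ciphertext block under ECB is what leaks plaintext structure; with GCM, each call draws a new nonce, so encrypting the same message twice should yield different outputs.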

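To identify these issues, the researchers combined automated and manual analysis to detect security API misuse in LLM-generated code. They applied this approach to 30 ChatGPT attempts for each of 48 programming tasks covering 5 widely used security APIs. Around 70% of the resulting code instances contained misuse, spanning 20 distinct misuse types, and for roughly half of the tasks every attempt contained misuse.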
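This summary does not describe how the researchers' detection actually works, so the following is only a rough illustration of what automated misuse detection can look like. It is a naive, regex-based Java sketch with a hypothetical rule set (`RULES`) and hypothetical names (`MisuseScanSketch`, `scan`, `misuseRate`); real analyzers typically rely on program analysis rather than text matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class MisuseScanSketch {
    // Hypothetical rules: label -> pattern that suggests a misuse in Java source text.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("ECB or default cipher mode",
                Pattern.compile("Cipher\\.getInstance\\(\\s*\"(AES|DES)(/ECB[^\"]*)?\"\\s*\\)"));
        RULES.put("weak hash (MD5/SHA-1)",
                Pattern.compile("MessageDigest\\.getInstance\\(\\s*\"(MD5|SHA-?1)\"\\s*\\)"));
        RULES.put("non-cryptographic RNG",
                Pattern.compile("new\\s+(java\\.util\\.)?Random\\s*\\("));
    }

    // Returns the labels of all rules that match somewhere in the given source text.
    static List<String> scan(String source) {
        List<String> findings = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(source).find()) {
                findings.add(rule.getKey());
            }
        }
        return findings;
    }

    // Fraction of samples with at least one finding (0.0 for an empty list).
    static double misuseRate(List<String> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        int flagged = 0;
        for (String sample : samples) {
            if (!scan(sample).isEmpty()) {
                flagged++;
            }
        }
        return (double) flagged / samples.size();
    }

    public static void main(String[] args) {
        List<String> attempts = Arrays.asList(
                "Cipher c = Cipher.getInstance(\"AES\"); byte[] iv = new byte[16]; new Random().nextBytes(iv);",
                "Cipher c = Cipher.getInstance(\"AES/GCM/NoPadding\"); new SecureRandom().nextBytes(iv);");
        for (String attempt : attempts) {
            System.out.println(scan(attempt));
        }
        System.out.println("misuse rate: " + misuseRate(attempts));
    }
}
```

Computing the rate per task over repeated attempts mirrors how the paper reports its results (a misuse rate across 30 attempts per task), although text matching like this would miss many cases and flag some false positives, which is one reason manual review remains useful.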

Critical Analysis

The paper provides a valuable contribution to the growing body of research on the security implications of large language models. The researchers have identified real-world examples of API misuse that could lead to significant security vulnerabilities, highlighting the need for caution and additional safeguards when using LLMs for software development.

However, the paper also acknowledges several limitations and areas for further research. For example, the analysis was focused solely on ChatGPT and Java security APIs, and it's unclear how the findings might translate to other LLMs or programming languages. [See related paper: "How Trustworthy Are Open-Source Large Language Models? A Comprehensive Assessment"]

Additionally, the paper does not delve into the underlying reasons why LLMs might struggle with security-critical API usage. It would be useful to investigate whether these issues stem from limitations in the training data, architectural constraints, or other factors.

Finally, the paper does not provide specific recommendations for mitigating the identified risks, such as techniques for detecting and preventing API misuse in LLM-generated code. Further research in this direction could help developers and organizations better harness the benefits of LLMs while managing their security-related challenges. [See related paper: "AI-Tutoring for Software Engineering Education: Enhancing Security Practices"]

Conclusion

This paper highlights a concerning trend: large language models like ChatGPT can inadvertently introduce security vulnerabilities when generating code that uses Java security APIs. The researchers have identified real-world examples of API misuse, such as incorrect SSL/TLS configuration and insecure cryptographic operations, that could potentially lead to the leakage of sensitive user information.

These findings underscore the importance of understanding the limitations and security implications of LLMs, especially as they become more widely adopted in software development workflows. While these models can be powerful tools, developers and organizations must exercise caution and implement appropriate safeguards to ensure the security and integrity of the systems they build.

The paper's insights provide a valuable starting point for further research and development in this critical area, ultimately helping to guide the responsible and secure use of large language models in software engineering.

Related Papers


A Case Study of Large Language Models (ChatGPT and CodeBERT) for Security-Oriented Code Analysis

Zhilong Wang, Lan Zhang, Chen Cao, Nanqing Luo, Peng Liu

LLMs can be used for code analysis tasks such as code review and vulnerability analysis. However, the strengths and limitations of adopting these LLMs for code analysis are still unclear. In this paper, we delve into LLMs' capabilities in security-oriented program analysis, considering perspectives from both attackers and security analysts. We focus on two representative LLMs, ChatGPT and CodeBERT, and evaluate their performance in solving typical analytic tasks with varying levels of difficulty. Our study demonstrates the LLMs' efficiency in learning high-level semantics from code, positioning ChatGPT as a potential asset in security-oriented contexts. However, it is essential to acknowledge certain limitations: the performance of these LLMs relies heavily on well-defined variable and function names, so they are unable to learn from anonymized code. We believe that the concerns raised in this case study deserve in-depth investigation in the future.

5/3/2024

cs.CR cs.AI


LLMs in Web-Development: Evaluating LLM-Generated PHP code unveiling vulnerabilities and limitations

Rebeka Tóth, Tamas Bisztray, László Erdodi

This research carries out a comprehensive examination of the security of web application code generated by Large Language Models, analyzing a dataset of 2,500 small dynamic PHP websites. These AI-generated sites are scanned for security vulnerabilities after being deployed as standalone websites in Docker containers. The evaluation of the websites was conducted using a hybrid methodology, incorporating the Burp Suite active scanner, static analysis, and manual checks. Our investigation zeroes in on identifying and analyzing File Upload, SQL Injection, Stored XSS, and Reflected XSS. This approach not only underscores the potential security flaws within AI-generated PHP code but also provides a critical perspective on the reliability and security implications of deploying such code in real-world scenarios. Our evaluation confirms that 27% of the programs generated by GPT-4 verifiably contain vulnerabilities in the PHP code, and this number, based on static scanning and manual verification, is potentially much higher. This poses substantial risks to software safety and security. In an effort to contribute to the research community and foster further analysis, we have made the source code publicly available, alongside a record enumerating the detected vulnerabilities for each sample. This study not only sheds light on the security aspects of AI-generated code but also underscores the critical need for rigorous testing and evaluation of such technologies for software development.

4/24/2024

cs.SE cs.AI


Evaluation of ChatGPT Usability as A Code Generation Tool

Tanha Miah, Hong Zhu

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding whether to use an LLM in software production. This paper proposes a user-centric method. It includes metadata in the test cases of a benchmark to describe their usage, conducts testing in a multi-attempt process that mimics the use of LLMs, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates performance based on user experience of LLMs as a tool. The paper reports an application of the method to the evaluation of ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code, although it may fail on hard programming tasks. The user experience is good, with an overall average of 1.61 attempts and an average completion time of 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5. Our experiments also show that it is hard for human developers to learn from experience to improve their skill in using ChatGPT to generate code.

4/10/2024

cs.SE cs.AI


Attacks on Third-Party APIs of Large Language Models

Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane

Large language model (LLM) services have recently begun offering a plugin ecosystem to interact with third-party API services. This innovation enhances the capabilities of LLMs, but it also introduces risks, as these plugins developed by various third parties cannot be easily trusted. This paper proposes a new attacking framework to examine security and safety vulnerabilities within LLM platforms that incorporate third-party services. Applying our framework specifically to widely used LLMs, we identify real-world malicious attacks across various domains on third-party APIs that can imperceptibly modify LLM outputs. The paper discusses the unique challenges posed by third-party API integration and offers strategic possibilities to improve the security and safety of LLM ecosystems moving forward. Our code is released at https://github.com/vk0812/Third-Party-Attacks-on-LLMs.

4/29/2024

cs.CR cs.AI cs.CL cs.CY