Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants
Summary of seminar based on Sandoval et al. paper; CSCE 689 601 ML-Based Cyber Defenses
This paper describes an experimental study examining the impact of user-AI interactive systems on productivity and security risks. This blog was originally written for CSCE 689:601 and is the 20th post in the series "Machine Learning-Based Cyber Defenses".
Paper Highlights
The study investigated whether interactive systems involving users and AI pose increased security risks. It addressed several research questions: Does using an AI code assistant lead to more functional code? How does the presence of an AI assistant affect the occurrence rate of security bugs? Where do bugs originate in an LLM-assisted system?
Participants were 58 Computer Science students at a research university, divided into Assisted and Control groups. The task was to implement a singly linked list in C, chosen because such code readily exposes memory-related bugs. OpenAI's code-cushman-001 served as the AI assistant, selected for its fast response time and run with a temperature of 0.6 to encourage diverse suggestions.
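To make the memory-safety angle concrete, here is a minimal sketch (not taken from the study's materials, and the exact API used there may differ) of the kind of singly linked list code participants had to write, annotated with the kind of memory bug such an exercise tends to surface:

```c
#include <stdlib.h>

/* Minimal singly linked list node, similar in spirit to what the
 * participants implemented (the study's actual interface may differ). */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Remove the first node holding `value`.
 * A common memory-related mistake in code like this is to unlink the
 * node but forget to free it, leaking memory (CWE-401). */
void list_remove(node_t **head, int value)
{
    node_t **cur = head;
    while (*cur != NULL) {
        if ((*cur)->value == value) {
            node_t *victim = *cur;
            *cur = victim->next;   /* unlink the node from the list */
            free(victim);          /* forgetting this line is the classic leak */
            return;
        }
        cur = &(*cur)->next;
    }
}
```

Forgetting the `free`, dereferencing a NULL head, or freeing a node before reading its `next` pointer are typical examples of the memory-related weaknesses a study like this would count.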
In the Autopilot condition, the study simulated a user who blindly trusts the AI: the LLM was given the function templates and its generated code was accepted without modification. Up to 10 attempts were made per function to obtain code that compiled; if none succeeded, the function was left as its empty template, mimicking a user who relies entirely on the assistant.
The experiment used hypothesis testing, with a comparative (superiority) test and a non-inferiority test. The comparative test compared the productivity of the Assisted and Control groups, with the null hypothesis stating that the groups perform equally and the alternative that the Assisted group performs better. The non-inferiority test assessed whether the Assisted group poses more security risk than the Control group, with the null hypothesis that its risk is higher by at least a chosen margin and the alternative that it is not. Assessment involved two key metrics: productivity, measured by functions compiled, functions implemented, and test pass rates; and security risk, measured by CWEs per line of code, including severe CWEs identified by MITRE.
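As a rough sketch in my own notation (not lifted from the paper), the two tests can be written as follows, with $p$ the productivity measure (e.g., test pass rate), $r$ the security-risk measure (e.g., CWEs per line of code), subscripts $A$ and $C$ for the Assisted and Control groups, and $\delta$ the non-inferiority margin (the 10% threshold discussed below); an additive margin is assumed here for simplicity:

```latex
% Superiority test for productivity (higher is better)
H_0:\; p_A = p_C \qquad \text{vs.} \qquad H_1:\; p_A > p_C

% Non-inferiority test for security risk (lower is better), margin \delta
H_0:\; r_A \ge r_C + \delta \qquad \text{vs.} \qquad H_1:\; r_A < r_C + \delta
```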
The AI tool improved productivity: the Assisted group outperformed the Control group overall, with the Autopilot condition doing especially well. Security risks were similar between the Assisted and Control groups, with no significant difference in bug rates or CWEs per line of code. And although the LLM did suggest buggy code, users often introduced bugs themselves, especially those who accepted more of the LLM's suggestions.
Takeaways
The group used in this study consists of students who may not be skilled programmers. Comparing against relatively inexperienced programmers might show favorable results, but it may not accurately reflect real-world settings where professionals are involved. This mirrors psychology experiments, where recruiting participants from the researcher's own student population is common, yet whether that practice accurately represents society is questionable.
Extending the findings of this paper to real-world applications may be challenging. While the paper shows promise, its applicability in practical settings requires further examination. Evaluating LLMs by comparing them to human performance is logical, but given that so many people now use LLMs, everyone essentially becomes a researcher, and this shifting paradigm complicates evaluation methods.
In the paper, the authors adopt a 10% threshold for the non-inferiority margin, but no clear reason is given for that particular value. Determining an acceptable threshold for measuring software quality is a complex and open question. Metrics like bug counts may serve as proxy indicators, but they do not fully capture software quality; how to measure it accurately remains open for exploration.
Comparing software systems typically involves assessing factors like bug count, software complexity (measured by lines of code and functionality), and overall quality. Lines of code works reasonably well as a normalizer for human-written code, since humans tend to make a fairly consistent number of mistakes per line. Comparing LLM-generated code adds complexity, however: an LLM can produce significantly more output, potentially many lines for a single instruction, which inflates the denominator of any per-line metric. This suggests that lines of code may not be a suitable metric for LLM-generated code, and further research is needed in this area.
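A toy calculation (with numbers invented purely for illustration) makes the concern concrete: suppose the same functionality ships with the same five bugs in both a human-written and an LLM-generated version.

```latex
\text{Human-written: } \frac{5\ \text{bugs}}{100\ \text{LoC}} = 0.05\ \text{bugs/LoC}
\qquad
\text{LLM-generated: } \frac{5\ \text{bugs}}{200\ \text{LoC}} = 0.025\ \text{bugs/LoC}
```

Both versions expose users to the same five defects, yet the per-line metric makes the more verbose LLM output look twice as safe.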
The authors chose to use CWEs (Common Weakness Enumerations) rather than CVEs (Common Vulnerabilities and Exposures) because CWEs represent classes of weaknesses in code, such as buffer overflow, while CVEs denote specific instances or cases of vulnerabilities. This reminds us of the discussion about bugs and vulnerabilities in Dos and Don'ts of Machine Learning In Security. To recall, bugs are errors or flaws in the code, while vulnerabilities are weaknesses that can be exploited to compromise the security of a system.
The relationship between bugs and vulnerabilities mirrors the distinction between safety and security. Safety focuses on protecting one's own assets and ensuring functionality, while security involves defending against external threats. Safety aims to prevent accidents or harm within a system, as in the 'Miracle on the Hudson' emergency landing, while security safeguards against malicious attacks, such as the 9/11 hijackings. Recognizing this difference is crucial for maintaining system integrity and resilience across different domains.
| Safety | Security |
| --- | --- |
| Focuses on the functional safety of a system to protect the environment and people from system malfunctions | Aims to protect a system and its data from unauthorized access and damage from the environment |
| Characterized by a static nature: safety concepts are developed and implemented with infrequent adaptations | A fast-moving discipline that reacts flexibly to emerging weaknesses, especially as components become more networked |
| Standards like ISO 26262, IEC 61508, IEC 61511, IEC 62061, and ISO 10218-1/-2 provide frameworks and guidelines | Standards include BSI IT-Grundschutz, the ISO 27000 series, IEC 62443, and ISO/SAE 21434 |
| Ensures the integrity of the environment and people over a long period | Requires dynamic responses to evolving threats and weaknesses, with potential threats increasing steadily |
| Transitions between the safety and security domains can be fluid due to increasing digitalization and networked components | Clear differentiation between the safety and security domains can be challenging, as they overlap in certain areas |
The programming language used not only influences the vulnerabilities of the code but also affects the validity of any general claims made. Considerations go beyond explicit vulnerabilities in the code; there are implicit risks as well, such as depending on a poisoned library. Even a call as seemingly innocuous as printing "hello world" can become dangerous if the format string passed to printf is ever derived from untrusted input, opening the door to format string attacks.
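A minimal sketch (not taken from the paper) of this classic pitfall in C: passing data that an attacker might control directly as printf's format string, versus the safe form with a constant format string.

```c
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *user_input = (argc > 1) ? argv[1] : "hello world";

    /* Vulnerable: if user_input contains conversion specifiers such as
     * "%x" or "%n", printf will interpret them, leaking stack contents
     * or writing to memory (format string attack, CWE-134). */
    printf(user_input);

    /* Safe: the format string is a constant; user data is only an argument. */
    printf("%s\n", user_input);

    return 0;
}
```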
When using AI assistants to generate code, there is a need to ensure that the resulting code is not malicious. However, this raises concerns about the integrity of libraries and compilers. Compilers, for example, could be compromised to produce malicious binaries, highlighting the importance of trust in the entire toolchain. To establish trust, components like compilers, libraries, operating systems (OS), boot loaders, and hardware need to be included in the Trusted Computing Base (TCB). However, ensuring the security of each component becomes increasingly complex, leading to questions about who guarantees the security of each layer. For example, the OS could be compromised, leading to a cascade of security issues. Even hardware, which may have Trusted Platform Modules (TPM), is not immune to tampering. This chain of trust highlights the need for a robust threat model, especially as threats evolve, with instances like modified OS distributions or hacked BIOSes.
Efforts like Intel's secure boot technology, which establishes a chain of trust during system boot-up, aim to mitigate these risks. The ultimate question, however, remains: where does one draw the line in ensuring security against potential threats? This underscores the importance of continuous vigilance and adaptation in the face of evolving security challenges.
The paper Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions, written by the same authors as the current paper, initially concluded that using an AI assistant was inferior to relying on a human alone. That conclusion was later updated with the findings of the current paper, which suggest that AI assistance is actually quite effective. The shift can be attributed to both users and researchers learning over time how to work effectively with LLMs: through experimentation and experience, better interaction methods were identified, leading to improved prompts and more successful outcomes. Although the LLM itself remained the same, the users' and researchers' understanding of how to leverage its capabilities evolved.
Further research holds the potential to uncover even better ways to use LLMs. Discovering the appropriate threshold and metric to assess their effectiveness would be a significant milestone. Currently, metrics like bugs per line of code lack clarity, but exploring alternatives such as bugs per line of code per number of developers may provide better insights. In essence, while the initial findings may have suggested limitations in using AI assistants, subsequent research and experience have highlighted their potential effectiveness. Continued exploration and refinement of interaction methods with LLMs may unlock further benefits in the future.