Examining Zero-Shot Vulnerability Repair with Large Language Models
Summary of a seminar based on the paper by Pearce et al.; CSCE 689:601 ML-Based Cyber Defenses
This post summarizes the paper's investigation of zero-shot vulnerability repair with LLMs and the seminar discussion that followed, which covered testing challenges such as fuzzing and bug detection, proposed directions such as combining fuzzers with Large Language Models (LLMs) and using LLMs for test case generation, and the ethical concerns surrounding LLM usage. This blog was originally written for CSCE 689:601 and is the 19th blog of the series "Machine Learning-Based Cyber Defenses".
Paper Highlights
Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes that were not seen during training and must predict the classes they belong to. In this paper, the analogous setup is asking off-the-shelf LLMs to repair vulnerabilities without any repair-specific fine-tuning.
Security bug identification involves two main approaches: static analysis of the source code (e.g., CodeQL, with rules mapped to CWEs such as those on OWASP's list) and dynamic checks using sanitizers enabled at compile time.
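To make the static-analysis side concrete, here is a minimal Python sketch (not from the paper, which targets C and Verilog) of the kind of source-level flaw such tools flag and map to a CWE, in this case OS command injection (CWE-78):

```python
import subprocess

def ping_host(hostname: str) -> int:
    # BAD: user-controlled `hostname` flows into a shell command unsanitized,
    # so input like "example.com; rm -rf ~" is interpreted by the shell.
    return subprocess.call(f"ping -c 1 {hostname}", shell=True)

def ping_host_fixed(hostname: str) -> int:
    # BETTER: pass an argument list without a shell, so metacharacters
    # in `hostname` are never interpreted.
    return subprocess.call(["ping", "-c", "1", hostname])
```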
LLMs are capable of writing code: given a sequence of tokens, they predict the next token in the sequence. Code generated by LLMs tends to be syntactically correct and broadly functional, but it is sometimes insecure, and generation is non-deterministic.
This raises the question of whether an LLM can be used for repair without any additional training. Key generation parameters include temperature (higher values give more randomness), top-p (the cumulative-probability threshold used to select candidate next tokens), and length. The repair framework proposed in the paper takes code flagged as containing a security bug and prompts LLMs to generate a fixed version.
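As a rough illustration of how these parameters enter a zero-shot repair query, the sketch below assumes a hypothetical `complete()` wrapper around whichever code-completion API is available; the prompt, the buggy snippet, and the parameter values are illustrative, not the paper's exact configuration.

```python
def complete(model: str, prompt: str, temperature: float,
             top_p: float, max_tokens: int) -> str:
    """Hypothetical wrapper around a code-completion API; only the
    parameters matter for this sketch."""
    raise NotImplementedError

VULNERABLE_SNIPPET = """\
// BUG: possible out-of-bounds write flagged by the analyzer
char buf[10];
strcpy(buf, user_input);
"""

prompt = (
    "// The following C code contains a security bug.\n"
    + VULNERABLE_SNIPPET
    + "// Fixed version:\n"
)

candidate_fix = complete(
    model="code-cushman-001",   # one of the models used in the paper
    prompt=prompt,
    temperature=0.4,            # higher -> more random next-token sampling
    top_p=0.95,                 # nucleus-sampling probability threshold
    max_tokens=256,             # upper bound on the length of the repair
)
```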
Vulnerable program generation used two models (code-cushman-001 and code-davinci-001) to generate ten programs, and a sweep over model parameters was then conducted for vulnerable program repair.
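A parameter sweep in this setting simply enumerates sampling settings per model and keeps every candidate repair for later compilation and testing. A minimal sketch, reusing the hypothetical `complete()` wrapper above (the sweep values are illustrative, not the paper's grid):

```python
from itertools import product

models = ["code-cushman-001", "code-davinci-001"]  # models from the paper
temperatures = [0.0, 0.25, 0.50, 0.75, 1.00]       # illustrative sweep values
top_ps = [0.75, 1.00]

candidates = []
for model, temperature, top_p in product(models, temperatures, top_ps):
    # Each (model, temperature, top_p) point yields candidate repairs that are
    # later compiled and checked by the security and functional tests.
    fix = complete(model, prompt, temperature, top_p, max_tokens=256)
    candidates.append((model, temperature, top_p, fix))
```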
Repair prompt design combined prompt engineering with hand-crafted vulnerable code, using templates such as a no-help template, a commented-code template, a commented-code-with-message template, and a general template.
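The templates differ mainly in how much context precedes the point where the model is asked to regenerate the code. A rough Python sketch of the idea (template names follow the blog's wording; the comment markers and phrasing are paraphrased, not the paper's exact text, and the general template is omitted):

```python
def build_repair_prompt(template: str, code_before_bug: str,
                        buggy_lines: str, bug_message: str = "") -> str:
    """Assemble a repair prompt; the LLM completes from the end of the prompt."""
    commented = "\n".join("// " + line for line in buggy_lines.splitlines())
    if template == "no_help":
        # Only the code up to the flagged region; no hint that anything is wrong.
        return code_before_bug
    if template == "commented_code":
        # Keep the buggy lines, but commented out, as a hint for the model.
        return code_before_bug + "\n" + commented + "\n// Fixed version:\n"
    if template == "commented_code_with_message":
        # Additionally include the analyzer's message (e.g., the CWE or CodeQL alert).
        return (code_before_bug + "\n// BUG: " + bug_message + "\n"
                + commented + "\n// Fixed version:\n")
    raise ValueError(f"unknown template: {template}")
```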
Experiments were also conducted on real-world bugs. The results showed that 8 out of 12 projects had at least one successful repair from at least one LLM in the ensemble. However, the approach was deemed unreliable and quite inefficient.
A baseline comparison was made with ExtractFix, a state-of-the-art repair tool, which fixed 10 out of 12 projects. LLMs were found to be most reliable when tasked with producing short, local fixes, and some generated patches passed the tests without actually solving the underlying problem.
Limitations:
Repairs are restricted to a single location within a single function in a single file
Potentially inadequate functional tests
Reliance on a single security tool for vulnerability discovery (CodeQL)
Issues with prompt engineering
Functions that call code in other files posed a challenge, as prompts could not easily span multiple files.
Takeaways
Fuzz testing, also known as fuzzing, is an automated software testing method that involves injecting invalid, malformed, or unexpected inputs into a system to uncover software defects and vulnerabilities. Fuzzing tools generate these inputs and monitor the system for exceptions such as crashes or information leaks.
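A toy mutational fuzzer in Python captures the core loop: mutate an input, feed it to the target, and record anything that raises or crashes; real fuzzers such as AFL++ add coverage feedback, corpus management, and much more. The target function here is a stand-in, not a real program.

```python
import random

def target(data: bytes) -> None:
    # Stand-in for the system under test; raises on a simulated defect.
    if data.startswith(b"FUZZ") and len(data) > 8:
        raise ValueError("simulated crash")

def mutate(seed: bytes) -> bytes:
    data = bytearray(seed or b"A")
    for _ in range(random.randint(1, 4)):          # flip a few random bytes
        data[random.randrange(len(data))] = random.randrange(256)
    if random.random() < 0.3:                      # occasionally grow the input
        data += bytes(random.randrange(256) for _ in range(random.randint(1, 8)))
    return bytes(data)

crashes = []
seed = b"FUZZAAAAAA"
for _ in range(10_000):
    sample = mutate(seed)
    try:
        target(sample)
    except Exception as exc:                       # an exception/crash is a finding
        crashes.append((sample, exc))

print(f"{len(crashes)} crashing inputs found")
```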
The primary focus of the paper discussed is on bug repair - identifying and fixing bugs within software using LLMs.
LLMs understand how to write code => LLMs know how it functions => LLMs can identify what is wrong (bugs) in the code => LLMs can write code to fix bugs!
We discuss this chain of reasoning even though it is not entirely true (at the end of the day, LLMs are only predicting the next token) because it holds promise despite the lack of a complete reasoning capability, and it represents ongoing progress in the field. In the paper, the authors used seven LLMs to repair bugs, which is inefficient. However, using multiple LLMs instead of hiring multiple people can potentially speed up the process, make it more accessible, and unlock greater potential.
Before LLMs, bugs were found through code analysis and fuzzing. Context is crucial for bug discovery and is often gained only through trial and error, which makes fuzzing and LLMs, despite lacking context, the more accessible options. Other methods include regression testing and unit tests, while techniques like taint tracking and dynamic flow tracking are effective but require expertise. LLMs present a more user-friendly alternative for bug detection; a simplified taint-tracking sketch follows for contrast.
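Taint tracking, in its simplest form, marks untrusted data at a source and refuses to let it reach a sensitive sink unchecked. The sketch below is deliberately simplified: real taint tracking also propagates the label through string operations, which this toy version does not.

```python
class Tainted(str):
    """A string marked as untrusted, e.g. because it came from user input."""

def sanitize(value: str) -> str:
    # Sanitizer: return an untainted copy containing only safe characters.
    return str("".join(ch for ch in value if ch.isalnum()))

def run_query(sql: str) -> None:
    # Sink: must never receive tainted data.
    if isinstance(sql, Tainted):
        raise RuntimeError("tainted data reached a SQL sink")
    print("executing:", sql)

try:
    run_query(Tainted("Robert'); DROP TABLE users;--"))  # flagged at the sink
except RuntimeError as err:
    print("blocked:", err)

# Sanitized data is a plain str again, so the sink accepts it.
run_query("SELECT * FROM users WHERE name = '" + sanitize(Tainted("alice")) + "'")
```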
There is a gap in discussions around security regression tests, which should be a focus of research and discussion. Regression testing verifies security after patching by reproducing the original vulnerability to confirm the fix. Without it, security verification is incomplete; it is a crucial step that is often overlooked but essential for confidence after testing and patching. The cycle of fixing bugs, releasing a new version, and then encountering vulnerabilities similar to the previous ones highlights the importance of addressing security regression.
The efficiency of LLMs prompts the question of whether they should aim for perfection, but in reality they only need to surpass human bug-finding capabilities. Similar debates surround self-driving cars: they just need to cause fewer accidents than human drivers! In the event of a self-driving car causing harm, who should be blamed? Should it be the developer who coded it, the team that provided the data, or those who trained the model? The responsibility might not lie solely with one party and may ultimately involve political decisions. Similarly, when an LLM generates vulnerable code, there may not be a clear answer as to whom to blame. It is a complex issue without clear right or wrong answers.
The paper targets several languages, one of which is Verilog, a Hardware Description Language (HDL) commonly used in hardware design alongside languages like VHDL. Finding bugs in hardware is indeed possible, as the paper shows. However, the lack of reasoning in LLMs can pose challenges here: while LLMs can follow logic-level descriptions of functionality, they may miss specific classes of problems. For example, LLMs may not detect race conditions on the software side, and glitches in hardware can lead to errors that LLMs might not identify. For these reasons, additional steps are necessary to ensure hardware reliability and functionality. Hardware fuzzing is complex due to testing challenges. Semiconductor Intellectual Property (IP) cores, reusable logic units in electronic design, include Soft Intellectual Properties (SIPs), offered as synthesizable RTL modules in languages like Verilog. SIPs implement specific, testable logic suitable for fuzzing without the scalability issues encountered when testing a full System on a Chip (SoC).
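In practice, a security regression test is just an ordinary test that replays the old exploit input against the patched code. A minimal pytest-style sketch; the module, function, and exploit input are made up for illustration:

```python
# test_security_regressions.py -- illustrative; parse_record and the input are hypothetical.
import pytest
from mylib import parse_record  # hypothetical function patched in an earlier release

# Input that used to trigger the vulnerability before the patch.
EXPLOIT_INPUT = b"\x00" * 3 + b"\xff\xff\xff\xff" + b"A" * 64

def test_old_exploit_no_longer_crashes():
    # Before the patch this input crashed the parser; after it, parsing
    # must fail cleanly with a well-defined error instead.
    with pytest.raises(ValueError):
        parse_record(EXPLOIT_INPUT)
```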
Building processors is a complex process, and testing them is particularly challenging. Even well-known processors like Intel's i3, i5, and i7 can have imperfections. For example, a chip built to function as an i7 might exhibit glitches at higher frequencies, leading to it being sold as an i5 instead of an i7. This illustrates the complexities involved in hardware testing and quality assurance.
There is a relationship between LLMs and grammar-based fuzzers. While traditional fuzzers often use random data, a grammar-based approach generates structured inputs from predefined rules (e.g., an integer followed by a string). LLMs are capable of producing such grammars, which makes it possible to learn the structure of unknown file formats.
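A grammar-based fuzzer expands production rules instead of flipping random bytes; the toy grammar below generates "an integer followed by a string". In principle, an LLM could be asked to propose such a grammar for an undocumented file format, which would then drive a conventional fuzzer.

```python
import random

# Toy grammar: a record is an integer, a comma, then a quoted string.
GRAMMAR = {
    "<record>": [["<int>", ",", "<string>"]],
    "<int>":    [["<digit>"], ["<digit>", "<int>"]],
    "<digit>":  [[d] for d in "0123456789"],
    "<string>": [['"', "<chars>", '"']],
    "<chars>":  [["<char>"], ["<char>", "<chars>"]],
    "<char>":   [[c] for c in "abcxyzXYZ_"],
}

def generate(symbol: str = "<record>") -> str:
    if symbol not in GRAMMAR:                 # terminal symbol: emit as-is
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return "".join(generate(s) for s in production)

for _ in range(5):
    print(generate())                         # e.g. 42,"xZa"
```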
In the future, a potential direction is to combine fuzzers with LLMs in a pipeline, for example using a fuzzer guided by an LLM to improve testing efficiency. There is also discussion about potential protests or bans on the use of LLMs in certain contexts, pointing to the ethical considerations surrounding their deployment. One startup idea is to use LLMs within GitHub Actions to detect vulnerabilities in pull requests, enhancing the code review process; a rough sketch of such a check follows.
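As a sketch of the GitHub Actions idea, a CI step could collect the pull request's diff and ask an LLM to flag suspicious changes. Everything below is hypothetical wiring (the `complete()` wrapper from earlier, the base ref, the prompt), not an existing action:

```python
import os
import subprocess

def review_pull_request(base_ref: str = "origin/main") -> str:
    # Diff the checked-out branch against the base branch inside the CI runner.
    diff = subprocess.run(
        ["git", "diff", base_ref, "--", "*.c", "*.py"],
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = (
        "Review the following diff and list any potential security "
        "vulnerabilities (with a CWE if possible), or reply 'none found':\n"
        + diff
    )
    # complete() is the same hypothetical LLM wrapper sketched earlier.
    return complete("code-davinci-001", prompt,
                    temperature=0.0, top_p=1.0, max_tokens=512)

if __name__ == "__main__":
    print(review_pull_request(os.environ.get("BASE_REF", "origin/main")))
```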
LLMs could also be utilized for generating test cases rather than just generating code, offering a novel approach to software testing. One of the challenges in software testing is achieving sufficient coverage. LLMs can assist in identifying areas that need more testing, helping to improve coverage.
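Test-case generation looks much like the repair prompt, except the completion is asked for tests that exercise uncovered paths. A hedged sketch reusing the hypothetical `complete()` wrapper; the function under test is invented for illustration, and the generated tests still need review and a coverage run (e.g., with coverage.py) before being trusted.

```python
UNCOVERED_FUNCTION = '''
def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return port
'''

prompt = (
    "Write pytest unit tests for the following function, including edge cases "
    "that exercise the error paths:\n" + UNCOVERED_FUNCTION
)

# complete() is the hypothetical LLM wrapper sketched earlier.
generated_tests = complete("code-davinci-001", prompt,
                           temperature=0.2, top_p=0.95, max_tokens=300)
```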