Dos and Don'ts of Machine Learning in Computer Security - Part 2
Summary of the seminar presented by Sidharth Arivarasan and Sahil Salunkhe; based on Sections 3 and 4 of the Arp et al. paper; CSCE 689:601 ML-Based Cyber Defenses
The paper describes good and bad practices for using machine learning in computer security. The seminar discussed prevalent pitfalls and best practices, shedding light on both the challenges and the opportunities in this evolving domain. This blog was originally written for CSCE 689:601 and is the sixth blog in the series "Machine Learning-Based CyberDefenses".
Paper highlights
The talk began with how the 30 papers were selected, what the review focused on, and the range of topics covered, from malware detection to game-bot detection and ad blocking. The speakers made clear what the authors of the paper were trying to achieve and stressed the importance of carefully reviewing and assessing each work. They also highlighted the importance of fixing any identified pitfalls, as feedback from the original authors showed how these pitfalls could affect their results.
Assessment criteria: seven levels, depending on the extent to which a pitfall is acknowledged in a paper.
Takeaways: the pitfalls are pervasive; the goal is to raise awareness, not to blame anyone for their work.
To estimate the experimental impact of some pitfalls, the authors considered four popular research topics in computer security:
- Mobile malware detection: suffered from sampling bias, spurious correlations, and inappropriate performance measures.
- Vulnerability discovery: suffered from spurious correlations, an inappropriate baseline, and label inaccuracy.
  - Spurious correlation: e.g. buffer size.
  - Models compared: VulDeePecker against an SVM baseline.
  - Code was generalized before being fed to the model. The top 10 tokens turned out to include punctuation such as `(` and `)`, yet these were treated as important features. Another problem is that some models do not consider token sequences.
- Source code author attribution: suffered from sampling bias and spurious correlations.
  - Code templates are reused in competitive programming, which biases the data.
- Network intrusion detection: suffered from an inappropriate baseline, lab-only evaluations, and spurious correlations.
Takeaways
Cybersecurity is hard and sensitive, especially because nothing is constant and everything keeps changing; as a result, pitfalls happen even to the best researchers.
In statistics, a spurious correlation refers to a connection between two variables that appears to be causal but is not. With spurious correlation, any observed dependency between the variables is merely due to chance, or both variables are related to some unseen confounding factor.
The most common pitfall is sampling bias, because good training data is hard to obtain in cybersecurity: the data is private. On one hand, antivirus companies do not share the data used to train their models; on the other hand, companies that were attacked also disclose little about their data. This makes research difficult and can distort the results of machine learning models.
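To see why sampling bias distorts results, consider a minimal sketch with invented numbers: if the collected training data is 90% benign, even a degenerate model that predicts "benign" for everything looks 90% accurate, but that accuracy collapses under a different deployment distribution.

```python
# Minimal sketch of sampling bias; the class ratios are illustrative assumptions.
biased_train = [0] * 900 + [1] * 100  # 90% benign in the collected sample
deployment   = [0] * 500 + [1] * 500  # 50% benign in the wild

def predict(_sample):
    # Degenerate model "learned" from the skewed sample: always predict benign.
    return 0

def accuracy(labels):
    return sum(predict(y) == y for y in labels) / len(labels)

print(accuracy(biased_train), accuracy(deployment))  # 0.9 vs 0.5
```

The model did not get worse at deployment time; the evaluation on the biased sample was simply misleading, which is exactly what makes sampling bias so pervasive.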
Getting good data is a challenge for attackers and defenders alike!
Identifying which parts of computer code are critical for detecting security problems can be challenging. Security vulnerabilities often stem from complex interactions between different components of software systems, making it difficult to pinpoint specific areas of concern. Additionally, attackers continuously adapt their tactics to evade detection, making it essential for security professionals to stay vigilant and continually update their detection methods.
Shared templates are important for consistency, efficiency, best practices, collaboration, and risk management in cybersecurity efforts.
There are two main types of detection: detecting something we are familiar with and detecting the unknown. Each serves a different purpose, and context matters. Relying on only one type of detection is not enough, because different situations require different approaches. Outlier detection is generally quite reliable, but it can be costly because it is dynamic, and scaling it up can be difficult.
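The two detection modes can be contrasted with a minimal sketch. The "signatures", filenames, and traffic values below are all invented for illustration: matching the known is a lookup against past threats, while detecting the unknown flags anything far from an observed baseline.

```python
import statistics

# Hypothetical signature set of previously seen threats.
known_signatures = {"evil.exe", "dropper.bin"}

def signature_detect(name):
    # Detecting the known: exact match against past threats.
    return name in known_signatures

def outlier_detect(history, value, threshold=3.0):
    # Detecting the unknown: flag values far from the observed baseline.
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) > threshold * sigma

baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # e.g. requests per minute
print(signature_detect("evil.exe"))      # known threat is caught
print(signature_detect("new_worm.exe"))  # novel threat is missed
print(outlier_detect(baseline, 500))     # anomalous spike is flagged
```

The sketch also hints at the cost trade-off mentioned above: the signature check is a cheap set lookup, while the outlier check must maintain and recompute baseline statistics as behavior drifts.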
Differences between bug, vulnerability and malware:
| | Bug | Vulnerability | Malware |
|---|---|---|---|
| Definition | A mistake or flaw in software code or design. | A weakness or loophole (bug) in software that can be exploited. | Malicious software designed to disrupt, damage, or gain unauthorized access to a system or network. |
| Harmfulness | Typically not harmful on its own, but may affect performance or functionality. | Can be exploited by attackers to compromise security and access sensitive information. | Intentionally designed to cause harm or inconvenience to users or systems. |
| Examples | Logic errors, syntax mistakes, or unexpected behaviors in software. | Buffer overflows, SQL injection, or authentication bypasses. | Viruses, worms, trojans, ransomware, or spyware. |
| Impact | May result in crashes, glitches, or unexpected behavior. | Can lead to unauthorized access, data breaches, or system compromise. | Can cause data loss, financial damage, system downtime, or privacy violations. |
| Mitigation | Debugging, testing, and code review processes to identify and fix bugs. | Regular security assessments, patch management, and secure coding practices. | Antivirus software, firewalls, intrusion detection systems, and user education on safe browsing habits. |