Dos and Don'ts of Machine Learning in Computer Security - Part 2
Summary of the seminar presented by Sidharth Arivarasan and Sahil Salunkhe; based on Sections 3 and 4 of the Arp et al. paper; CSCE 689:601 ML-Based Cyber Defenses
The paper describes good and bad practices for using machine learning in computer security. The seminar discussed prevalent pitfalls and best practices, shedding light on both the challenges and the opportunities in this evolving domain. This blog was originally written for CSCE 689:601 and is the sixth blog in the series "Machine Learning-Based CyberDefenses".
Paper highlights
The talk began with how the 30 papers were selected, what the review focused on, and the range of topics covered, from malware detection to game-bot detection and ad blocking. The speakers made clear what the authors of the paper were trying to achieve and stressed the importance of carefully reviewing and assessing each work. They also highlighted the importance of fixing any identified pitfalls, as feedback from the original authors showed how these pitfalls could affect their results.
Assessment criteria: seven levels, depending on the extent to which a pitfall is acknowledged in a paper.
Takeaways: the pitfalls are pervasive; the goal is to raise awareness, not to blame anyone for their work.
To estimate the experimental impact of some pitfalls, the authors considered four popular research topics in computer security:
- Mobile malware detection: suffered from sampling bias, spurious correlations, and inappropriate performance measures.
- Vulnerability discovery: suffered from spurious correlations, an inappropriate baseline, and label inaccuracy.
  - Spurious correlation: e.g. buffer size.
  - Models compared: VulDeePecker against an SVM baseline.
  - Code was generalized before being fed to the model. The top 10 tokens turned out to include punctuation such as `(` and `)`, yet these were treated as important features. Another problem is that some models do not consider token sequences.
- Source code author attribution: suffered from sampling bias and spurious correlations.
  - Code templates are reused in competitive programming, which biases the data.
- Network intrusion detection: suffered from an inappropriate baseline, lab-only evaluations, and spurious correlations.
Takeaways
Cybersecurity is hard and sensitive, especially because nothing is constant and everything keeps changing; as a result, pitfalls happen even to the best researchers.
In statistics, a spurious correlation refers to a connection between two variables that appears to be causal but is not. With spurious correlation, any observed dependency between the variables is merely due to chance, or both variables are related to some unseen confounding factor.
The most common pitfall is sampling bias, because good training data is hard to obtain in cybersecurity: the data is private. On one hand, antivirus companies do not share the data used to train their models; on the other hand, companies that were attacked also disclose little about their data. This makes research difficult and can distort the results of machine learning models.
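To see why sampling bias distorts results, consider a minimal sketch with invented numbers: if the collected training data is 90% benign, even a degenerate model that predicts "benign" for everything looks 90% accurate, but that accuracy collapses under a different deployment distribution.

```python
# Minimal sketch of sampling bias; the class ratios are illustrative assumptions.
biased_train = [0] * 900 + [1] * 100  # 90% benign in the collected sample
deployment   = [0] * 500 + [1] * 500  # 50% benign in the wild

def predict(_sample):
    # Degenerate model "learned" from the skewed sample: always predict benign.
    return 0

def accuracy(labels):
    return sum(predict(y) == y for y in labels) / len(labels)

print(accuracy(biased_train), accuracy(deployment))  # 0.9 vs 0.5
```

The model did not get worse at deployment time; the evaluation on the biased sample was simply misleading, which is exactly what makes sampling bias so pervasive.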
Getting good data is a challenge for attackers and defenders alike!
Identifying which parts of computer code are critical for detecting security problems can be challenging. Security vulnerabilities often stem from complex interactions between different components of software systems, making it difficult to pinpoint specific areas of concern. Additionally, attackers continuously adapt their tactics to evade detection, making it essential for security professionals to stay vigilant and continually update their detection methods.
Shared templates are important for consistency, efficiency, best practices, collaboration, and risk management in cybersecurity efforts.
There are two main types of detection: detecting something we are familiar with and detecting the unknown. Each serves a different purpose, and context matters. Relying on only one type of detection is not enough, because different situations require different approaches. Outlier detection is generally quite reliable, but it can be costly because it is dynamic, and scaling it up can be difficult.
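The two detection modes can be contrasted with a minimal sketch. The "signatures", filenames, and traffic values below are all invented for illustration: matching the known is a lookup against past threats, while detecting the unknown flags anything far from an observed baseline.

```python
import statistics

# Hypothetical signature set of previously seen threats.
known_signatures = {"evil.exe", "dropper.bin"}

def signature_detect(name):
    # Detecting the known: exact match against past threats.
    return name in known_signatures

def outlier_detect(history, value, threshold=3.0):
    # Detecting the unknown: flag values far from the observed baseline.
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(value - mu) > threshold * sigma

baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # e.g. requests per minute
print(signature_detect("evil.exe"))      # known threat is caught
print(signature_detect("new_worm.exe"))  # novel threat is missed
print(outlier_detect(baseline, 500))     # anomalous spike is flagged
```

The sketch also hints at the cost trade-off mentioned above: the signature check is a cheap set lookup, while the outlier check must maintain and recompute baseline statistics as behavior drifts.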
Differences between bug, vulnerability and malware:
| | Bug | Vulnerability | Malware |
|---|---|---|---|
| Definition | A mistake or flaw in software code or design. | A weakness or loophole (bug) in software that can be exploited. | Malicious software designed to disrupt, damage, or gain unauthorized access to a system or network. |
| Harmfulness | Typically not harmful on its own, but may affect performance or functionality. | Can be exploited by attackers to compromise security and access sensitive information. | Intentionally designed to cause harm or inconvenience to users or systems. |
| Examples | Logic errors, syntax mistakes, or unexpected behaviors in software. | Buffer overflows, SQL injection, or authentication bypasses. | Viruses, worms, trojans, ransomware, or spyware. |
| Impact | May result in crashes, glitches, or unexpected behavior. | Can lead to unauthorized access, data breaches, or system compromise. | Can cause data loss, financial damage, system downtime, or privacy violations. |
| Mitigation | Debugging, testing, and code review processes to identify and fix bugs. | Regular security assessments, patch management, and secure coding practices. | Antivirus software, firewalls, intrusion detection systems, and user education on safe browsing habits. |