Transcending TRANSCEND: Revisiting Malware Classification in the Presence of Concept Drift

Summary of a seminar presented by Ali Ayati, based on the paper that extends Jordaney et al.'s TRANSCEND; CSCE 689:601, ML-Based Cyber Defenses

The paper is an extension of the authors' earlier work on TRANSCEND. This blog was originally written for CSCE 689:601 and is the tenth post in the series "Machine Learning-Based CyberDefenses".

Paper highlights

  • The paper provides a more formal treatment of Conformal Evaluation (CE) and demonstrates its effectiveness. It introduces novel conformal evaluators along with computational optimization techniques. The authors also conduct an extensive evaluation and release their implementation as open source on GitHub.

  • The non-conformity measure (NCM) quantifies how dissimilar a new sample is from a history of past examples.

  • Conformal evaluation builds on conformal prediction by using prediction regions to assess model performance. It evaluates the consistency between predicted confidence levels and actual outcomes, providing insights into model reliability, especially in uncertain environments like malware classification with concept drift.

  • The authors make two assumptions: exchangeability, meaning the order of examples is irrelevant, and the closed-world assumption, meaning that new examples belong to one of the classes observed during training.

  • Credibility quantifies how relevant the training set is to a prediction. Low credibility means the sample does not resemble the training history, which violates the closed-world assumption and signals drift in the data (a minimal sketch of this computation appears after this list).

  • The authors describe four novel conformal evaluators:

    • Transductive CE (TCE): Uses every training point as a calibration point. While effective, it does not scale well to larger datasets due to its computational complexity of O(n^2).

    • Approximate TCE: Uses calibration points in batches (k folds) with parallel processing to improve efficiency. Its computational complexity is O(n/(1-p)).

    • Inductive CE (ICE): Splits the dataset into two parts, a proper training set and a calibration set, which makes it fast; however, it uses the available data less efficiently. Its computational complexity is O(pn).

    • Cross CE (CCE): Combines approximate TCE and ICE. Its computational complexity is O(pn/(1-p)), offering a balance between speed and data efficiency (sketches of ICE- and CCE-style calibration appear after this list).

  • The authors also propose using random search instead of grid search to find the rejection thresholds, an improvement over the earlier TRANSCEND framework. This simplifies the process and saves time, making threshold selection more efficient (a sketch of the idea follows below).
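
To make the credibility computation concrete, here is a minimal Python sketch of ICE-style calibration. It assumes a simple distance-based non-conformity measure; the helper names and the particular NCM are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def ncm(x, proper_X, proper_y, label):
    """Illustrative non-conformity measure (NCM): mean Euclidean distance
    from x to the proper-training samples of the candidate label.
    Larger values mean x conforms less to that class."""
    same_class = proper_X[proper_y == label]
    return np.linalg.norm(same_class - x, axis=1).mean()

def credibility(x, label, proper_X, proper_y, calib_X, calib_y):
    """Credibility p-value: the fraction of calibration points (of the same
    label) whose non-conformity is at least as large as that of x.
    Values near 0 flag samples that do not fit the training history,
    i.e. candidates for concept drift."""
    alpha_test = ncm(x, proper_X, proper_y, label)
    calib_same = calib_X[calib_y == label]
    alpha_calib = np.array([ncm(c, proper_X, proper_y, label) for c in calib_same])
    return (np.sum(alpha_calib >= alpha_test) + 1) / (len(alpha_calib) + 1)

# Usage on synthetic data (70/30 proper-training/calibration split):
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)
p = credibility(rng.normal(size=5), 1, X[:70], y[:70], X[70:], y[70:])
```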
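
For the cross variant, the sketch below shows CCE-style k-fold calibration. It reuses the credibility() helper above; the fold count and the averaging of per-fold p-values are illustrative assumptions rather than the paper's exact aggregation rule.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cce_credibility(x, label, X, y, n_folds=5):
    """Each fold serves once as the calibration set while the remaining
    folds act as the proper training set; the per-fold credibility
    p-values are then averaged (an illustrative aggregation choice)."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    p_values = [
        credibility(x, label, X[proper_idx], y[proper_idx], X[calib_idx], y[calib_idx])
        for proper_idx, calib_idx in skf.split(X, y)
    ]
    return float(np.mean(p_values))

# Reusing the synthetic data from the previous sketch:
p_cce = cce_credibility(rng.normal(size=5), 1, X, y)
```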
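
Finally, a hedged sketch of the random threshold search idea: sample candidate (credibility, confidence) thresholds uniformly and keep the pair that maximizes accuracy on the non-rejected samples while respecting a rejection-rate budget. The objective and the budget constraint are assumptions for illustration, not the paper's exact criteria.

```python
import numpy as np

def random_threshold_search(cred, conf, correct, n_trials=1000, max_reject=0.2, seed=0):
    """cred, conf: per-sample credibility/confidence p-values (arrays in [0, 1]).
    correct: boolean array, whether the underlying classifier was right.
    Returns the best (credibility, confidence) threshold pair found."""
    rng = np.random.default_rng(seed)
    best, best_score = (0.0, 0.0), -1.0
    for _ in range(n_trials):
        t_cred, t_conf = rng.uniform(0.0, 1.0, size=2)
        keep = (cred >= t_cred) & (conf >= t_conf)
        if keep.sum() == 0 or 1.0 - keep.mean() > max_reject:
            continue  # rejects everything, or more than the budget allows
        score = correct[keep].mean()  # accuracy on the kept samples
        if score > best_score:
            best, best_score = (t_cred, t_conf), score
    return best
```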

Takeaways

  • The assumption of exchangeability implies that the order of data instances does not affect the outcome, but this assumption rarely holds in practice, since malware distributions evolve over time.

  • This paper stands out due to its combination of theory, mathematics, and new methods. It presents a refined approach compared to previous works, with the authors addressing sampling biases identified in their earlier research.

  • A key takeaway from this research is that no single solution is adequate to combat concept drift in malware classification; an effective defense combines multiple drift detectors, statistics, probabilities, and a pool of models.

  • PDF malware and Word malware are types of malicious software that can be embedded within PDF and Word documents, respectively. These files may contain harmful scripts or code that can be triggered when the document is opened.

    • Actions in a PDF file are events associated with certain triggers, such as clicking a button or selecting a link.

    • JavaScript embedded in the PDF document adds advanced functionality and automation, such as form validation, calculations, and dynamic content based on user input.

    • Word documents can contain scripts that run within the Word application itself.

  • Classifying malware across different file formats: a PDF file is first parsed to extract relevant features, such as embedded scripts or suspicious content. These features are then used to train a classifier that differentiates between benign and malicious PDF files. Word malware classification follows a similar process tailored to Word document characteristics. Since different file formats have distinct characteristics, separate detectors and classifiers are required for each type. For instance, executable (EXE) files are more commonly used to deliver malware, making them a prominent threat model. To address this, a pipeline of solutions should be implemented, incorporating detectors and classifiers specific to each file format to strengthen overall cyber defenses (a minimal sketch of such a per-format pipeline appears after this list).

  • In a phishing scheme identified in 2023, attackers used Microsoft Word documents to spread malware that can secretly monitor what a person types, steal cryptocurrency funds, and exfiltrate sensitive information. The attackers sent emails pretending to be from trustworthy sources, with the harmful document attached. When the victim opens the document, it activates a hidden link that delivers three types of malware: RedLine Clipper, Agent Tesla, and OriginBotnet. RedLine Clipper swaps cryptocurrency wallet addresses for the attacker's address, particularly targeting long and complex addresses. Agent Tesla records keystrokes and lists installed software, while OriginBotnet gathers sensitive data from the victim's computer and connects to the attackers' server for further instructions. The researchers noted that the attack was highly sophisticated and hard to detect.

  • In 2023, security experts from JPCERT/CC discovered a tricky new way that cyber attackers used to sneak harmful files into seemingly safe PDF documents, which they called MalDoc in PDF. This technique hides a malicious Word file inside a PDF, making it hard to detect. When someone opens the PDF, the Word file quietly opens in Microsoft Word, which then activates harmful scripts that can damage the computer. Even though the file looks like a PDF, it behaves like a Word file, so regular security tools might miss it. JPCERT/CC suggested using tools such as OLEVBA to find these hidden files and warned that the new trick makes it tough for cybersecurity teams to keep computers safe.
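
As a rough illustration of a per-format pipeline, the sketch below trains a dedicated PDF detector from simple keyword-count features. The keyword list, helper names, and classifier choice are illustrative assumptions, not any specific tool's approach; Word and EXE files would each get their own extractor and classifier.

```python
from sklearn.ensemble import RandomForestClassifier

# Keywords often associated with active content in PDFs (illustrative list).
FEATURE_KEYS = ["/JavaScript", "/JS", "/OpenAction", "/Launch", "/EmbeddedFile"]

def extract_pdf_features(raw_bytes: bytes) -> list:
    """Count occurrences of each suspicious keyword in the raw file bytes."""
    return [raw_bytes.count(key.encode()) for key in FEATURE_KEYS]

def train_pdf_detector(samples, labels):
    """Train one detector dedicated to the PDF format; other formats
    (Word, EXE, ...) would use their own features and models."""
    X = [extract_pdf_features(s) for s in samples]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf

def is_malicious(clf, raw_bytes: bytes) -> bool:
    """Classify a previously unseen PDF with the trained detector."""
    return bool(clf.predict([extract_pdf_features(raw_bytes)])[0])
```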

References