DeepSign: Deep learning for automatic malware signature generation and classification

This paper describes DeepSign, a new method for automatic generation and classification of malware signatures. This blog was originally written for CSCE 689:601 and is the 24th (last) blog of the series: "Machine Learning-Based CyberDefenses".

Paper Highlights

Conventional signature-based detection struggles with new malware variants because it checks only strings. The proposed method instead generates signatures from program behavior, represented as a binary vector. Earlier approaches such as Autograph and Honeycomb have also been suggested for improving automatic signature generation.
Unlike traditional signature and token methods, which often miss new variants of existing malware, the signatures generated in this paper by a deep belief network (DBN) accurately classify new malware variants.
Testing on a dataset with hundreds of variants across major malware families, the proposed method achieves a 98.6% classification accuracy using these signatures.
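To make the "binary vector" idea concrete, here is a minimal sketch of how a behavior log could be turned into such a vector. The vocabulary and the trace below are illustrative assumptions, not the feature set used in the paper.

```python
# Hypothetical vocabulary of monitored behaviors; not taken from the paper.
VOCABULARY = ["CreateFileW", "ReadFile", "WriteFile",
              "CryptEncrypt", "RegSetValueExW", "connect"]

def to_binary_vector(observed_calls, vocabulary=VOCABULARY):
    """Return one bit per vocabulary entry: 1 if the behavior was observed, else 0."""
    seen = set(observed_calls)
    return [1 if name in seen else 0 for name in vocabulary]

# Illustrative trace; repeated calls still map to a single bit.
trace = ["CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile", "ReadFile"]
vector = to_binary_vector(trace)
print(vector)  # [1, 1, 1, 1, 0, 0]
```

Note that this encoding discards call order and call counts; it records only whether a behavior occurred at all.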

Takeaways

In this paper, the classification task involves dynamic analysis, specifically classifying file executions rather than files themselves. The features used for classification are functions/API calls that are called during the execution, derived from log traces.
Applying a GAN to these features involves using the network to suggest additional API calls. However, a drawback arises in that the GAN could potentially recommend removing previously identified API calls. For example, the GAN might suggest removing a function associated with ransomware, which could compromise the accuracy and effectiveness of the malware detection system.
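One possible guard against the removal problem, sketched here as an assumption rather than something the paper proposes, is to let generator output only add calls. With binary feature vectors this is an elementwise OR between the observed vector and the suggestion. The `suggested` vector below is a hypothetical stand-in for generator output.

```python
def apply_additive_only(original, suggested):
    """Merge a suggested binary vector into the original without clearing any set bit."""
    if len(original) != len(suggested):
        raise ValueError("vectors must have the same length")
    return [o | s for o, s in zip(original, suggested)]

original = [1, 0, 1, 0]   # API calls actually observed
suggested = [0, 1, 1, 0]  # hypothetical suggestion that drops the first call
merged = apply_additive_only(original, suggested)
print(merged)  # [1, 1, 1, 0]
```

The first bit survives even though the suggestion cleared it, so an observed ransomware-related call could not be dropped this way.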
Training a classifier using deep learning or GANs for dynamic malware analysis:
- Model Development in Sandbox: Initially, develop and train the model in a controlled environment like a sandbox. This allows for experimentation and refinement without risking actual user devices.
- Real-time Anti-virus Implementation: While deploying a deep learning model directly onto users' devices might lead to slow execution due to the computational requirements, there are alternative approaches:
  - Edge Computing: Utilize edge computing to offload some of the computational burden from the device itself, enabling more efficient execution of the model.
  - Hybrid Approach: Implement a hybrid system where some aspects of the analysis are performed locally on the device, while more resource-intensive tasks are offloaded to a cloud-based service.
  - Pre-trained Models: Utilize pre-trained models or model compression techniques to reduce the computational overhead on user devices while maintaining effectiveness.
- Dynamic Analysis Integration: Incorporate dynamic analysis into the real-time antivirus solution. When static analysis isn't sufficient, dynamic analysis can provide valuable insights into the behavior of files during execution.
- Zero-trust Implementation: Aim for a zero-trust architecture where every file is treated as untrusted until proven otherwise. This involves continuous monitoring and analysis of file behavior rather than relying solely on static signatures.
- Efficiency Considerations: Recognize the need for efficiency in real-time malware detection. Running complex deep learning models on every file in real-time is impractical due to the computational overhead. Instead, focus on optimizing the analysis process to minimize latency while maintaining accuracy.
- Early Detection: Strive for early detection of malware by optimizing the analysis process to detect threats as soon as possible during execution. Discovering malware at the end of the execution trace indicates a significant delay in detection, which is undesirable.
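The early-detection point can be illustrated with a small streaming sketch: score API calls as they arrive and stop as soon as a threshold is reached, instead of waiting for the full trace. The weights, threshold, and call names are hypothetical; a real system would learn them from data.

```python
# Hypothetical integer suspicion weights per API call (illustrative only).
WEIGHTS = {"FindFirstFileW": 2, "CryptEncrypt": 5, "DeleteFileW": 3}

def detect_early(calls, threshold=9, weights=WEIGHTS):
    """Return the 0-based index at which the running score reaches the threshold, or None."""
    score = 0
    for index, name in enumerate(calls):
        score += weights.get(name, 0)
        if score >= threshold:
            return index  # stop here; later calls are never inspected
    return None

trace = ["CreateFileW", "FindFirstFileW", "CryptEncrypt", "DeleteFileW", "ExitProcess"]
print(detect_early(trace))  # 3
```

The smaller the returned index, the earlier in the execution the threat is caught; a detection at the last index would correspond to the undesirable end-of-trace case.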
The authors use rule-based methods for malware detection in this paper, which are advantageous due to their simplicity and efficiency in pattern recognition. By combining clustering techniques (covered in previous classes) with rules, the paper aims to group malware with similar API call sequences into the same families.

💡

Pitfall 1: Often, systems are categorized into two types: either machine learning (ML) or rule-based. However, a more effective approach could involve using ML models to generate rules, or using rules to classify ML-generated features. Combining both methods can lead to better results, as the intersection of these approaches can leverage the strengths of both.

💡

Pitfall 2: Static and dynamic analysis refer to the type of data being analyzed (i.e., before or during execution), while ML and rules refer to the method used for analysis. Therefore, it is incorrect to compare static vs. dynamic with ML vs. rules, as they operate in different dimensions. Instead, a combination of static and dynamic analysis with ML and rules can provide a comprehensive approach to malware detection.

The key concept of the paper revolves around denoising autoencoders (AEs) and their significance in the context of malware detection. Denoising AEs are crucial because they can help filter out noise from the data traces. In the context of malware detection, noise refers to benign API calls or irrelevant sequences of API calls that could confuse the classifier. By using the proposed approach, the authors aim to remove this noise and extract meaningful patterns indicative of malicious behavior.
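In the standard denoising-autoencoder setup, the input is deliberately corrupted and the network is trained to reconstruct the clean version. The sketch below shows only that data-preparation step, using masking noise on binary vectors. The noise rate, seed, and sample vectors are assumptions for illustration, and the network itself is omitted.

```python
import random

def corrupt(vector, noise_rate, rng):
    """Masking noise: set each element to 0 with probability noise_rate."""
    return [0 if rng.random() < noise_rate else bit for bit in vector]

def make_training_pairs(vectors, noise_rate=0.2, seed=0):
    """Pair each corrupted input with its clean original (the reconstruction target)."""
    rng = random.Random(seed)
    return [(corrupt(v, noise_rate, rng), v) for v in vectors]

# Illustrative clean behavior vectors.
clean = [[1, 0, 1, 1, 0, 1], [0, 1, 1, 0, 0, 1]]
for noisy, target in make_training_pairs(clean):
    print(noisy, "->", target)
```

Because the target is always the uncorrupted vector, the model is pushed to learn features that survive the loss of individual bits.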
When examining ransomware behavior with a moving window of fixed size, the window may be used up on legitimate API calls before any malicious ones are captured. Ransomware can exploit this by placing its malicious API calls after a long run of legitimate ones, so that encryption or other malicious activity still succeeds.
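One reading of the windowing issue can be sketched as follows: slide a fixed-size window over the trace and record which windows contain a suspicious call. The trace, window size, and the `SUSPICIOUS` set are hypothetical and only illustrate how benign padding pushes the malicious call into later windows.

```python
def sliding_windows(calls, size, step=1):
    """Yield (start_index, window) pairs of length `size` over the call trace."""
    for start in range(0, len(calls) - size + 1, step):
        yield start, calls[start:start + size]

# Hypothetical trace: benign padding followed by the malicious call.
trace = ["ReadFile"] * 6 + ["CryptEncrypt", "WriteFile"]
SUSPICIOUS = {"CryptEncrypt"}  # illustrative label, not from the paper

flagged = [start for start, window in sliding_windows(trace, size=4)
           if SUSPICIOUS & set(window)]
print(flagged)  # [3, 4]
```

Windows starting at indices 0 through 2 contain only benign calls, so an analyzer that inspects just the first few windows would miss the encryption call entirely.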
There is no discernible difference between legitimate file encryption followed by cloud transmission and ransomware-induced file encryption followed by the same action. Both actions involve encrypting files and sending them to the cloud, making it challenging to distinguish between benign and malicious activities solely based on this behavior. So, by introducing additional benign API calls within malicious sequences, attackers can confuse the classifier. These additional API calls act as noise, making it difficult for the classifier to distinguish between legitimate and malicious behavior.
Traditional antivirus (AV) solutions typically run a malware sample for 3-4 minutes at most. One strategy for evading detection involves delaying malicious activity until after the AV sandbox has completed its analysis, such as waiting for 10 minutes before executing malicious behavior. However, this strategy runs into trouble against AVs that run continuously on user devices, 24 hours a day. It is impractical to sustain legitimate API calls indefinitely; at some point, the malware must engage in malicious activity to fulfill its purpose. Delaying this behavior increases the risk of user detection over time.

💡

A future scenario could be that malware adopts a distributed approach by spawning multiple files, each performing a specific task. Unlike traditional single-file malware, this distributed malware strategy allows for parallel execution of malicious activities across multiple processes. For instance, one file may open and enumerate directories, while another encrypts files, and so on. This distributed approach complicates detection as it spreads the malicious behavior across multiple processes, making it harder for traditional AV solutions to detect and mitigate effectively. Defenders face a further challenge when the malware files are launched by different parent processes, which makes it difficult to track and analyze their behavior.

One proposed defense is to merge the logs of these processes and analyze their sequences. However, this approach presents challenges due to the large number of processes running on the system, resulting in an overwhelming volume of logs. Processing and analyzing this sheer volume of data in real-time is impractical from a performance perspective. Additionally, even if it were feasible, determining the correct order of events within these logs becomes exponentially complex, making it practically impossible to derive meaningful insights.
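For intuition, the log-merging idea can be sketched in its simplest form, assuming every event carries a timestamp that is comparable across processes and that each per-process log is already sorted. The event tuples and process IDs below are hypothetical.

```python
import heapq

def merge_process_logs(*logs):
    """Merge per-process logs, each already sorted by timestamp, into one ordered stream."""
    return list(heapq.merge(*logs, key=lambda event: event[0]))

# Hypothetical events: (timestamp, pid, api_call).
enumerator = [(1, 101, "FindFirstFileW"), (4, 101, "FindNextFileW")]
encryptor = [(2, 202, "ReadFile"), (3, 202, "CryptEncrypt"), (5, 202, "WriteFile")]

merged = merge_process_logs(enumerator, encryptor)
print([call for _, _, call in merged])
# ['FindFirstFileW', 'ReadFile', 'CryptEncrypt', 'FindNextFileW', 'WriteFile']
```

This only works when timestamps are reliable and the set of related processes is known in advance, which are exactly the conditions the paragraph above identifies as hard to meet on a busy system.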
An alternative solution proposed involves leveraging hardware capabilities to address these challenges: use hardware designed to match rules generated by machine learning models. This hardware-based approach offers the potential for efficient and effective analysis of process sequences, enabling defenders to better detect and respond to malware threats in real-time.

Why more Windows malware? Windows operating systems are predominantly used by domestic users, including individuals and small businesses. In contrast, Linux is more prevalent in corporate environments and servers. Although Linux-based malware exists, Windows tends to attract more attention from malware authors due to its larger user base and historical focus on desktop computing. However, Linux-based systems are not immune to exploits, as evidenced by the abundance of vulnerabilities listed on platforms like Exploit Database (ExploitDB).

Malware authors often tailor their attacks to target specific platforms based on their objectives. For example, attackers may prioritize exploiting vulnerabilities in cloud computing platforms like AWS rather than individual users' devices.

💡

Malware authors may adjust their strategies based on the demographics of the target population. For example, the type of malware used may differ depending on whether the target is a corporation, an individual, or a specific demographic group. The goals of malware, such as ransomware, are often driven by financial motives. Attackers may target platforms or user groups perceived to be wealthier, such as Apple users over Android users. The choice of platforms for DoS (Denial of Service) botnets may be influenced by the prevalence of the platform. Android, with its large user base, may be a prime target for such attacks due to the potential for a significant impact.

References

https://ieeexplore.ieee.org/document/7280815

Summary of seminar based on Omid et al. paper; CSCE 689 601 ML-Based Cyber Defenses