DeepSign: Deep learning for automatic malware signature generation and classification
Summary of seminar based on Omid et al. paper; CSCE 689 601 ML-Based Cyber Defenses
This paper describes DeepSign, a new method for automatic generation and classification of malware signatures. This blog was originally written for CSCE 689:601 and is the 24th (last) blog of the series: "Machine Learning-Based CyberDefenses".
Paper Highlights
Traditional signature-based detection struggles with new malware variants because it checks only for specific strings. The paper instead proposes generating signatures from program behavior, represented as a binary vector. Earlier approaches such as Autograph and Honeycomb have also been suggested for improving automatic signature generation.
Unlike traditional signature and token methods, which often miss new variants of existing malware, the DBN-generated signatures in this paper accurately classify new malware variants.
Tested on a dataset with hundreds of variants across major malware families, the proposed method achieves 98.6% classification accuracy using these signatures.
Takeaways
In this paper, the classification task involves dynamic analysis, specifically classifying file executions rather than files themselves. The features used for classification are functions/API calls that are called during the execution, derived from log traces.
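To make the feature representation concrete, here is a minimal sketch of turning API-call traces into fixed-length binary vectors. The API names, the tiny vocabulary size, and the helper names `build_vocabulary` and `to_binary_vector` are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def build_vocabulary(traces, size):
    """Keep the `size` most frequent API names (ties broken alphabetically)."""
    counts = Counter(call for trace in traces for call in trace)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:size]]

def to_binary_vector(trace, vocabulary):
    """Bit i is 1 if the i-th vocabulary entry appears anywhere in the trace."""
    seen = set(trace)
    return [1 if name in seen else 0 for name in vocabulary]

# Hypothetical execution traces, one list of API calls per run.
traces = [
    ["CreateFileW", "ReadFile", "CryptEncrypt", "WriteFile"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
vocab = build_vocabulary(traces, size=4)
vectors = [to_binary_vector(t, vocab) for t in traces]
print(vocab)    # ['CreateFileW', 'ReadFile', 'CloseHandle', 'CryptEncrypt']
print(vectors)  # [[1, 1, 0, 1], [1, 1, 1, 0]]
```

Note that a binary vector records only whether a call occurred, not how often or in what order, so two executions with the same set of calls map to the same vector.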
Applying a GAN to these features means using the network to suggest additional API calls. A drawback is that a GAN could also recommend removing previously identified API calls. For example, it might suggest removing a function associated with ransomware, which would compromise the accuracy and effectiveness of the malware detection system.
Training a classifier using deep learning or GANs for dynamic malware analysis:
Model Development in Sandbox: Initially, develop and train the model in a controlled environment like a sandbox. This allows for experimentation and refinement without risking actual user devices.
Real-time Anti-virus Implementation: While deploying a deep learning model directly onto users' devices might lead to slow execution due to the computational requirements, there are alternative approaches:
Edge Computing: Utilize edge computing to offload some of the computational burden from the device itself, enabling more efficient execution of the model.
Hybrid Approach: Implement a hybrid system where some aspects of the analysis are performed locally on the device, while more resource-intensive tasks are offloaded to a cloud-based service.
Pre-trained Models: Utilize pre-trained models or model compression techniques to reduce the computational overhead on user devices while maintaining effectiveness.
Dynamic Analysis Integration: Incorporate dynamic analysis into the real-time antivirus solution. When static analysis isn't sufficient, dynamic analysis can provide valuable insights into the behavior of files during execution.
Zero-trust Implementation: Aim for a zero-trust architecture where every file is treated as untrusted until proven otherwise. This involves continuous monitoring and analysis of file behavior rather than relying solely on static signatures.
Efficiency Considerations: Recognize the need for efficiency in real-time malware detection. Running complex deep learning models on every file in real-time is impractical due to the computational overhead. Instead, focus on optimizing the analysis process to minimize latency while maintaining accuracy.
Early Detection: Strive for early detection of malware by optimizing the analysis process to detect threats as soon as possible during execution. Discovering malware at the end of the execution trace indicates a significant delay in detection, which is undesirable.
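The early-detection point can be illustrated with a small streaming sketch: score API calls as they arrive and raise an alert at the first position where the running score crosses a threshold. The weights, the threshold, and the function name `first_alert_index` are hypothetical; a real system would use a trained model's output rather than hand-set weights.

```python
def first_alert_index(trace, suspicious_weights, threshold):
    """Return the index of the call at which the running score first
    reaches the threshold, or None if it never does."""
    score = 0.0
    for i, call in enumerate(trace):
        score += suspicious_weights.get(call, 0.0)
        if score >= threshold:
            return i
    return None

# Hypothetical weights for calls that look suspicious in combination.
weights = {"FindFirstFileW": 0.2, "CryptEncrypt": 0.5, "DeleteFileW": 0.4}

ransomware_like = ["CreateFileW", "FindFirstFileW", "CryptEncrypt",
                   "DeleteFileW", "WriteFile"]
benign_like = ["CreateFileW", "ReadFile", "CloseHandle"]

alert_at = first_alert_index(ransomware_like, weights, threshold=1.0)
no_alert = first_alert_index(benign_like, weights, threshold=1.0)
print(alert_at, no_alert)  # 3 None
```

The earlier the returned index, the less of the malicious behavior has already executed by the time the alert fires.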
The authors use rule-based methods for malware detection in this paper, which are advantageous due to their simplicity and efficiency in pattern recognition. By incorporating clustering techniques, specifically mixing clustering (from previous classes) with rules, the paper aims to group malware with similar API call sequences into the same families.
The key concept of the paper revolves around denoising autoencoders and their significance in the context of malware detection. Denoising AEs are crucial because they help filter out noise from the data traces. In the context of malware detection, noise refers to benign API calls or irrelevant sequences of API calls that could confuse the classifier. By using the proposed approach, the authors aim to remove this noise and extract meaningful patterns indicative of malicious behavior.
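As a rough illustration of the idea, below is a single-layer denoising autoencoder in NumPy. It is a sketch under several assumptions that do not come from the paper: masking noise (randomly zeroing input bits), tied weights, toy 12-bit inputs, and all hyperparameters. It does not reproduce the authors' deep network; it only shows the training principle of corrupting the input and reconstructing the clean version.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(X, params):
    W, b_h, b_o = params
    hidden = sigmoid(X @ W + b_h)          # compact code for each input
    return sigmoid(hidden @ W.T + b_o)     # tied weights: decoder uses W.T

def cross_entropy(X, X_hat, eps=1e-9):
    return float(-np.mean(X * np.log(X_hat + eps)
                          + (1 - X) * np.log(1 - X_hat + eps)))

def train_dae(X, hidden=6, noise=0.2, lr=0.1, epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, size=(d, hidden))
    b_h, b_o = np.zeros(hidden), np.zeros(d)
    for _ in range(epochs):
        keep = rng.random(X.shape) >= noise   # masking noise: drop ~20% of bits
        X_noisy = X * keep
        H = sigmoid(X_noisy @ W + b_h)
        X_hat = sigmoid(H @ W.T + b_o)
        # Gradient of per-sample summed cross-entropy, averaged over samples,
        # measured against the CLEAN input X.
        d_out = (X_hat - X) / n
        d_hid = (d_out @ W) * H * (1.0 - H)
        W -= lr * (X_noisy.T @ d_hid + d_out.T @ H)
        b_h -= lr * d_hid.sum(axis=0)
        b_o -= lr * d_out.sum(axis=0)
    return W, b_h, b_o

# Toy data: three hypothetical "families", 20 identical samples each.
prototypes = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
], dtype=float)
X = np.repeat(prototypes, 20, axis=0)

untrained = train_dae(X, epochs=0)   # same seed, no updates
trained = train_dae(X)
loss_before = cross_entropy(X, reconstruct(X, untrained))
loss_after = cross_entropy(X, reconstruct(X, trained))
print(loss_before, loss_after)
```

Because the loss compares the reconstruction with the clean input while the network only sees the corrupted one, the hidden layer is pushed to encode structure that survives the corruption.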
When examining ransomware behavior with a fixed-size moving window, the window may be filled with legitimate API calls before any malicious ones are captured. Ransomware can exploit this by issuing its malicious API calls only after a long run of legitimate ones, so that encryption or other malicious activity goes through undetected.
There is no discernible difference between legitimate file encryption followed by cloud transmission and ransomware-induced file encryption followed by the same action. Both actions involve encrypting files and sending them to the cloud, making it challenging to distinguish between benign and malicious activities solely based on this behavior. So, by introducing additional benign API calls within malicious sequences, attackers can confuse the classifier. These additional API calls act as noise, making it difficult for the classifier to distinguish between legitimate and malicious behavior.
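The effect of padding a malicious sequence with benign calls can be seen with a simple sliding-window measurement. In this sketch the API names, the window size, and the "density" score (the fraction of suspicious calls in the densest window) are illustrative assumptions, not the paper's method.

```python
def windows(trace, size):
    """All contiguous windows of `size` calls (the whole trace if shorter)."""
    return [trace[i:i + size] for i in range(max(1, len(trace) - size + 1))]

def max_suspicious_density(trace, suspicious, size):
    """Highest fraction of suspicious calls seen in any single window."""
    return max(sum(call in suspicious for call in w) / size
               for w in windows(trace, size))

def pad_with_benign(trace, filler, gap):
    """Insert `gap` copies of a benign call after every original call."""
    padded = []
    for call in trace:
        padded.append(call)
        padded.extend([filler] * gap)
    return padded

suspicious = {"FindFirstFileW", "CryptEncrypt", "DeleteFileW"}
malicious = ["FindFirstFileW", "CryptEncrypt", "DeleteFileW"]
padded = pad_with_benign(malicious, filler="GetTickCount", gap=2)

dense = max_suspicious_density(malicious, suspicious, size=3)
diluted = max_suspicious_density(padded, suspicious, size=3)
print(dense, diluted)  # 1.0 0.3333333333333333
```

The malicious calls are all still present in the padded trace, but no single window of three calls contains more than one of them.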
Traditional antivirus (AV) solutions typically run a malware sample for 3-4 minutes at most. One strategy for evading detection is to delay malicious activity until after the AV sandbox has completed its analysis, for example by waiting 10 minutes before executing malicious behavior. This strategy is less effective against AVs that run continuously on user devices, 24 hours a day. It is impractical to sustain legitimate API calls indefinitely; at some point, the malware must engage in malicious activity to fulfill its purpose. Delaying this behavior also increases the risk of being noticed by the user over time.
One proposed defense is to merge the logs of all running processes and analyze the combined sequence. However, this approach presents challenges due to the large number of processes running on the system, which results in an overwhelming volume of logs. Processing and analyzing this volume of data in real time is impractical from a performance perspective. Additionally, even if it were feasible, determining the correct order of events within these logs becomes exponentially complex, making it practically impossible to derive meaningful insights.
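For a sense of what merging per-process logs looks like, here is a minimal sketch using Python's `heapq.merge`. The log format (a dict mapping a process ID to a timestamp-sorted list of `(timestamp, api_call)` pairs) and the function name `merge_process_logs` are assumptions for illustration.

```python
import heapq

def merge_process_logs(logs):
    """Merge per-process logs into one timeline of (timestamp, pid, call).

    Each per-process list must already be sorted by timestamp.
    """
    streams = (
        [(ts, pid, call) for ts, call in events]
        for pid, events in logs.items()
    )
    return list(heapq.merge(*streams))

# Hypothetical logs from two processes.
logs = {
    101: [(1.0, "CreateFileW"), (4.0, "CryptEncrypt")],
    202: [(2.0, "ReadFile"), (3.0, "connect")],
}
merged = merge_process_logs(logs)
print([call for _, _, call in merged])
# ['CreateFileW', 'ReadFile', 'connect', 'CryptEncrypt']
```

The merge itself is cheap, but it relies entirely on the timestamps: when two processes log events with the same timestamp, the tuple comparison falls back to the process ID, which says nothing about the true order of events. That is one face of the ordering problem described above.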
An alternative solution proposed involves leveraging hardware capabilities to address these challenges: use hardware designed to match rules generated by machine learning models. This hardware-based approach offers the potential for efficient and effective analysis of process sequences, enabling defenders to better detect and respond to malware threats in real-time.
Why more Windows malware? Windows operating systems are predominantly used by domestic users, including individuals and small businesses. In contrast, Linux is more prevalent in corporate environments and servers. Although Linux-based malware exists, Windows tends to attract more attention from malware authors due to its larger user base and historical focus on desktop computing. However, Linux-based systems are not immune to exploits, as evidenced by the abundance of vulnerabilities listed on platforms like Exploit Database (ExploitDB).
Malware authors often tailor their attacks to target specific platforms based on their objectives. For example, attackers may prioritize exploiting vulnerabilities in cloud computing platforms like AWS rather than individual users' devices.