Malware Detection on Highly Imbalanced Data through Sequence Modeling
Summary of a seminar presented by Akshat Punjabi and Pandey, based on the paper by Oak et al.; CSCE 689:601, ML-Based Cyber Defenses
The paper discusses sequence modeling for malware detection on Android OS. The seminar delved into the challenges posed by skewed, imbalanced data, explored various approaches, and shed light on the limitations of conventional methods. This blog was originally written for CSCE 689:601 and is the second blog in the series "Machine Learning-Based Cyber Defenses".
Static vs. Dynamic Approaches
Static analysis relies on rule-based approaches applied to code without executing it, while dynamic analysis observes runtime behavior and uses heuristics built from experience. Traditional rule-based methods often struggle with imbalanced datasets, motivating a shift toward more advanced techniques.
Using LSTM and BERT
The seminar introduced machine learning models, including Long Short-Term Memory (LSTM) networks and Bidirectional Encoder Representations from Transformers (BERT). These models, trained on preprocessed sequences of descriptive API calls, exhibited promising results, outperforming conventional rule-based approaches.
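To make the sequence-modeling setup concrete, here is a minimal sketch of an LSTM classifier over integer-encoded API-call sequences. This is not the authors' exact architecture: the PyTorch framework, layer sizes, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ApiCallLSTM(nn.Module):
    """Minimal LSTM classifier over integer-encoded API-call sequences."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)   # one logit: malicious vs. benign

    def forward(self, x):                    # x: (batch, seq_len) token IDs
        emb = self.embed(x)                  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)         # last hidden state summarizes the sequence
        return self.fc(h_n[-1]).squeeze(-1)  # (batch,) malware logits

# Toy usage: a batch of 4 padded sequences of length 50 over a vocabulary of 1000 calls
model = ApiCallLSTM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (4, 50)))
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0, 0.0, 0.0, 1.0]))
```

A pre-trained BERT model replaces the embedding and LSTM layers with a Transformer encoder that is fine-tuned on the same sequences.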
Paper highlights
Input sequences are the activities recorded in Android OS logs collected using WildFire. Examples of suspicious activities in the activity logs:
An app trying to delete its own icon (trying to hide)
An app sending a message and deleting it
Activities (the input data) are preprocessed into integer sequences and fed to the model (see the sketch after this list).
Input sequences are padded to a fixed length.
Evaluation metrics used: accuracy, precision, recall, and F1 score (see the metrics example after this list).
Comparison done with Bag of Words and TF-IDF (disadvantage: they ignore the order of events, so only offline analysis is possible).
Baseline methods: clustering, autoencoder, DAGMM, DeepLog.
Pre-trained BERT achieved the highest performance.
Increasing the sequence length helps the deep learning models: longer sequences give them more correlations to exploit, which is particularly valuable on imbalanced datasets.
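Here is a minimal sketch of the preprocessing described above: activity names are mapped to integer IDs and every sequence is padded (or truncated) to a fixed length. The activity names, vocabulary handling, and maximum length are assumptions for illustration, not the authors' exact pipeline.

```python
# Hypothetical activity logs; real WildFire logs are far richer.
logs = [
    ["send_sms", "delete_sms", "http_post"],
    ["delete_own_icon", "read_contacts"],
]

# Build a vocabulary, reserving 0 for padding and 1 for unseen activities.
vocab = {"<pad>": 0, "<unk>": 1}
for seq in logs:
    for act in seq:
        vocab.setdefault(act, len(vocab))

def encode(seq, max_len=6):
    """Map activity names to integer IDs, then pad/truncate to a fixed length."""
    ids = [vocab.get(act, vocab["<unk>"]) for act in seq][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

encoded = [encode(seq) for seq in logs]
# -> [[2, 3, 4, 0, 0, 0], [5, 6, 0, 0, 0, 0]]
```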
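And a tiny example of why accuracy alone is misleading on imbalanced data, which is why precision, recall, and F1 matter here; the numbers are made up purely to illustrate the point.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 malicious sample (label 1) among 19 benign ones (label 0).
y_true = [1] + [0] * 19
y_pred = [0] * 20  # a lazy model that calls everything benign

print(accuracy_score(y_true, y_pred))                    # 0.95, looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0, misses the malware
print(f1_score(y_true, y_pred))                          # 0.0
```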
Takeaways
The "ideal" n-gram size would be to take the entire sequence, but that is useless in practice: by the time the antivirus has seen the whole sequence, the malware has already done its damage. Despite experiments showcasing the effectiveness of smaller n-grams (1 or 2), real-world scenarios often demand considering many dozens of events. So, I feel the authors should have experimented more with the value of n, or should have considered more datasets.
There are many versions of BERT available, and it was not quite clear which one the authors used. If they used only one, I would have liked to see their reasoning for that choice.
There is a need to detect rare events within a vast pool of benign files; e.g., a system may hold ~3 million files of which only 3-4 are malicious.
There are no distinct levels of harm (low, medium, high), contrary to what most people claim.
Detection rate is not the only thing that matters: if two models have the same accuracy, the one that detects malware earlier is better.
As discussed earlier, an antivirus has at least two layers: device and cloud. On the AV company's cloud the dataset might be balanced, but on individual devices it is imbalanced, and the experiments do not run in the cloud.
Deep learning cannot be used every time because of its high compute requirements.
Possible representations of the input data for malware detection: text, graphs, and images. Visual representations include heatmaps, textures, and direct mapping (read the binary as bytes and convert them to pixels: groups of 3 bytes form one RGB pixel, while single bytes form grayscale pixels). A sketch of this direct mapping follows below.
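Here is a minimal sketch of the direct byte-to-pixel mapping, assuming NumPy and Pillow are available; the image width and the placeholder file path are arbitrary choices, not part of the original discussion.

```python
import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    """Map each byte of a binary file to one grayscale pixel."""
    data = np.fromfile(path, dtype=np.uint8)
    height = len(data) // width
    return Image.fromarray(data[: height * width].reshape(height, width), mode="L")

def binary_to_rgb(path, width=256):
    """Group every 3 bytes into one RGB pixel."""
    data = np.fromfile(path, dtype=np.uint8)
    n = (len(data) // (width * 3)) * width * 3   # drop the ragged tail
    return Image.fromarray(data[:n].reshape(-1, width, 3), mode="RGB")

# binary_to_grayscale("sample.apk").save("sample_gray.png")  # "sample.apk" is a placeholder
```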
The activity sequence from static analysis contains every function present in the code, whereas the sequence from dynamic analysis contains only the functions that actually executed.
We can craft a binary to fool a classifier by inserting dead code (https://profsandhu.com/cs5323_s18/yk_2010.pdf). This was a new concept for me, and I would like to explore the "nop" instruction further for the course project.
In the end, deciding whether something is malware is a statistical exercise; there is no true solution to the problem. The course objective is malware recognition, not malware detection. The challenge is not building an ML model with 99% accuracy or using a specific input representation. The challenge is how we do this in practice, how we handle multiple devices, and how we justify the actions taken by our model.