Machine Learning (In) Security: A Stream of Problems - Part 2
Based on sections 6,7,8 of paper "Machine Learning (In) Security: A Stream of Problems" by Ceschin et al.; CSCE 689 601 ML-Based Cyber Defenses
The paper describes a variety of problems faced by ML models today. The seminar delved into the challenges and pitfalls of ML modelling and evaluation, and the future of ML in cybersecurity. This blog was originally written for CSCE 689:601 and is the fourth blog of the series: "Machine Learning-Based CyberDefenses".
ML Modelling Issues and Solutions
In our previous discussions, we learned that a machine learning model is essentially a mathematical tool that makes predictions by spotting patterns in data. To find the best fit for a job, we try different models and tune them. ML models for cybersecurity are a bit different: in real-world deployments they face challenges that conventional approaches do not handle well.
Issue 1: Concept [Drift + Evolution]
Concept Drift is the situation in which the relation between input data and target variable changes over time. Recall that in the arms race, attackers are constantly changing their attack vectors when trying to bypass defenders' solutions.
A simple example: a spam filter trained on last year's emails starts missing new spam because spammers changed their wording, so the relation between the word features and the spam label has drifted.
Concept Evolution is the process of defining and refining concepts, resulting in new labels according to the underlying concepts.
A simple example: a brand-new ransomware family appears; no existing class in the training data describes it, so a new label must be created for it.
In cybersecurity, both problems (concept drift and concept evolution) may be correlated, given that new concepts may result in new labels, such as new types of attacks produced by attackers.
| Type of Drift | When? | Example | Simple example |
| --- | --- | --- | --- |
| Sudden | A concept is suddenly replaced by a new one | Creating a totally new attack | e.g. Covid-19 |
| Recurring | A previously active concept reappears after some time | An old type of attack starts to appear again after a given time | e.g. sales during Christmas or Black Friday |
| Gradual | The probability of finding the previous concept decreases and the new one increases until it is completely replaced | New types of attacks are created and gradually replace previous ones | e.g. effects of climate change, minimal and undetectable over short periods |
| Incremental | The difference between the old and new concept is very small and is only noticed over a longer period | Attackers make small modifications to their attacks so that the concept changes over a long period | e.g. a stock that never goes down |
Solutions
Ensemble of classifiers
Similarity scores (cosine similarity or Jaccard index) between time-ordered pairs
Relative temporal similarity
Meta features
Online learning (e.g. DroidOL)
Framework using Venn-Abers predictors to identify when models tend to become obsolete.
Identify aging classification models during deployment (e.g. Transcend).
Use reinforcement learning to generate adversarial samples and retrain
Automatically update feature set and models without human involvement (e.g. DroidEvolver)
Concept drift detectors (a minimal detector sketch follows this list): data stream learning solutions not yet fully explored by cybersecurity researchers, such as:
DDM: Drift Detection Method
EDDM: Early Drift Detection Method
ADWIN: ADaptive WINdowing
Delayed labeling (unsupervised)
Learn and adapt by paying attention to what went wrong in the past (e.g. SyncStream)
Neighborhood graphs and fuzzy agglomerative clustering method
SPASC (Semi-supervised Pool and Accuracy-based Stream Classification)
Monitor the distribution of error
GraphPool
Adaptive Random Forest (ARF)
Multiple kernel learning
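To make the drift detector idea concrete, here is a minimal sketch using River's ADWIN detector to monitor the error stream of a deployed model. The error values are made up for illustration, and attribute names may differ across River versions.

```python
# Minimal sketch: monitoring a deployed model's error stream with ADWIN (River).
# The error values are hypothetical; attribute names may differ in older River releases.
from river import drift

detector = drift.ADWIN()

# 1 = the deployed model misclassified a sample, 0 = it classified it correctly
error_stream = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]

for i, err in enumerate(error_stream):
    detector.update(err)
    if detector.drift_detected:
        print(f"Drift detected at sample {i}: consider retraining or updating the model")
```

DDM and EDDM expose a similar update/check interface and can be swapped in the same loop.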
Issue 2: Adversarial Attacks
An adversarial attack is a malicious attempt to perturb a data point into another point so that it falls into a certain target (adversarial) class. Models are prone to adversarial attacks in which attackers modify their malicious vectors/features so that they are not detected.
Consequences of adversarial attacks:
1. Allowing the execution of malicious software
2. Poisoning an ML model or drift detector if they use new unknown samples to update their definitions (without a ground truth from other sources)
3. Producing, as a consequence of 2, concept drift and evolution
When developing cybersecurity solutions using ML, both features and models must be robust against adversaries.
Direct attacks: Instead of using adversarial features, attackers may also directly attack ML models.
White-box attacks: adversary has full access to the model
Gradient-based attacks use the model's gradients (computed through the network's weights) to perturb input data, creating deceptive variants that get misclassified (see the sketch below).
Analyzing models like decision trees or SVMs allows attackers to craft deceptive data by manually changing the features that most influence the model's output.
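As an illustration of the gradient-based idea, here is a minimal FGSM-style sketch assuming a differentiable PyTorch classifier; the model, inputs, and epsilon are placeholders. Note that in malware detection the perturbed feature vector must still map back to a valid, working sample, which this sketch does not address.

```python
# Hypothetical FGSM-style white-box perturbation (PyTorch); model and inputs are placeholders.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.01):
    """Perturb x in the direction that increases the loss, using the model's gradients."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step along the sign of the gradient to push the sample toward a misclassification
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```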
Black-box attacks: these tend to be more challenging, and more realistic, for adversaries, since they usually do not have access to the implementation of the cybersecurity solution or the ML model, i.e., they do not know which features and classifiers the solution uses and usually only observe the raw input and the output.
Create random perturbations and test them on the input data (a sketch follows this list)
Change characteristics from samples looking at instances from all classes
Try to mimic the original model by creating a local model
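A minimal sketch of the first strategy (random perturbations), assuming only query access to a prediction function; predict_fn, the feature vector, and the query budget are hypothetical, and a real attack would additionally need to keep the perturbed sample functional.

```python
# Hypothetical black-box attack: perturb the input at random and query the target model
# until its predicted label flips. No access to weights, gradients, or features is assumed.
import numpy as np

def random_perturbation_attack(predict_fn, x, n_trials=1000, scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    original_label = predict_fn(x)
    for _ in range(n_trials):
        candidate = x + rng.normal(0.0, scale, size=x.shape)
        if predict_fn(candidate) != original_label:
            return candidate   # evasive variant found
    return None                # no evasion found within the query budget
```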
Solutions
Generative Adversarial Networks (GANs)
Data augmentation or oversampling techniques (e.g., retraining on adversarial samples; see the sketch after this list)
Use probability labels instead of using the hard class labels
Have only non-negative weights (e.g. Non-Negative MalConv)
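One way to combine the data augmentation idea with the hypothetical fgsm_perturb sketch above is adversarial training: augment each training batch with adversarial variants before updating the model. This is only a sketch; the model, optimizer, and batches are placeholders.

```python
# Sketch of adversarial training (data augmentation with adversarial samples), reusing the
# hypothetical fgsm_perturb above; model, optimizer, and batches are placeholders.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.01):
    x_adv = fgsm_perturb(model, x, y, epsilon)   # adversarial variants of the clean batch
    x_all = torch.cat([x, x_adv])                # train on clean + adversarial data
    y_all = torch.cat([y, y])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_all), y_all)
    loss.backward()
    optimizer.step()
    return loss.item()
```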
Issue 3: Class imbalance
We have been discussing class imbalance in datasets from day one, so we are already aware of its effects on ML models.
Solutions
Cost-sensitive learning (not easy to implement, but faster; see the sketch after this list)
Ensemble learning:
Bagging (e.g. Random forest)
Boosting (e.g. AdaBoost)
Anomaly detection (e.g. Isolation forest, One-class SVM)
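A minimal sketch of two of these options with scikit-learn: cost-sensitive learning via class weights and anomaly detection with Isolation Forest. The tiny dataset is made up for illustration.

```python
# Minimal sketch: cost-sensitive learning and anomaly detection with scikit-learn.
from sklearn.ensemble import RandomForestClassifier, IsolationForest

# Tiny made-up dataset: 2 features, heavily imbalanced labels (1 = malicious)
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8], [0.1, 0.3], [0.2, 0.2]]
y = [0, 0, 0, 1, 0, 0]

# Cost-sensitive learning: weigh errors on the rare (malicious) class more heavily
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42).fit(X, y)

# Anomaly detection: learn the "normal" profile and flag outliers (-1 = anomaly, 1 = normal)
iso = IsolationForest(contamination=0.2, random_state=42).fit(X)

print(clf.predict([[0.85, 0.9]]), iso.predict([[0.85, 0.9]]))
```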
Issue 4: Transfer Learning Problem
Transfer learning may also be a problem: base models are usually publicly available, which means that potential attackers have access to them and can produce an adversarial vector that affects both models, the base one and the new one.
Solution
Use robust base models
Consider the robustness of the base model when using transfer learning, so as to produce a solution without inherited security weaknesses.
Issue 5: Implementation problem
| Problem | Solution |
| --- | --- |
| Popular frameworks like scikit-learn and Weka use batch learning algorithms | Use ML libraries for streaming data (e.g. Scikit-Multiflow, Massive Online Analysis (MOA), River, Spark Streaming; see the sketch below the table) and adversarial ML frameworks like CleverHans and SecML |
| Multi-language codebases may become incompatible with new releases or may be too slow, and optimizations are not always performed | Optimize ML implementations using C and C++ under the hood (faster than Python and Java) |
| If any component of a data stream pipeline fails, the whole system might fail | Fault tolerance, modular architecture, monitoring, alerting, automated testing, rollback mechanisms |
| Performance: slow models will slow down the whole system | Outsource the processing of ML algorithms to third-party components like hardware devices, or outsource scanning procedures to the cloud |
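Below is a minimal sketch of stream (online) learning with River, one of the libraries mentioned in the table; the feature dictionaries and labels are made up, and the "predict first, then learn" loop mirrors prequential evaluation.

```python
# Minimal online-learning sketch with River; features and labels are hypothetical.
from river import tree

model = tree.HoeffdingTreeClassifier()

stream = [({"api_calls": 12, "pkt_size": 300}, 0),
          ({"api_calls": 87, "pkt_size": 1400}, 1),
          ({"api_calls": 91, "pkt_size": 1500}, 1)]

for x, y in stream:
    y_pred = model.predict_one(x)   # may be None before the model has seen any data
    model.learn_one(x, y)           # incrementally update the model with this sample
```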
EVALUATION
Some evaluations may lead to wrong conclusions that backfire in security contexts.
Metrics
To correctly evaluate a solution, the right metrics need to be selected: they should provide significant insights and present different perspectives of the problem according to its real needs. A short scikit-learn example computing the core metrics appears after the table.
| Metric | Definition/Formula | Comments |
| --- | --- | --- |
| Accuracy | (TP+TN)/(TP+FP+TN+FN) | May lead to wrong conclusions; avoid it on imbalanced datasets |
| Confusion matrix | A matrix where each row represents the real label and each column a predicted class (or vice versa) | Shows which class is hardest to classify and which classes are confused the most |
| Recall | TP/(TP+FN) | Prioritize high recall when blocking all malicious actions matters, even at the cost of blocking some benign ones |
| Precision | TP/(TP+FP) | Prioritize high precision when it is better to miss some malware than to block benign software |
| F1 score | Harmonic mean of recall and precision | A balanced compromise between recall and precision |
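A quick sketch computing these metrics with scikit-learn on made-up labels, just to show where the formulas land in practice.

```python
# Computing the metrics above with scikit-learn; labels are made up (1 = malicious, 0 = benign).
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))            # rows: real labels, columns: predictions
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```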
| Evaluation framework | Metric | Comments |
| --- | --- | --- |
| Conformal Evaluator (CE) | Two metrics: algorithm confidence and credibility | Evaluates the robustness of the predictions made by the algorithm and their quality |
| Tesseract | Area Under Time (AUT) | Captures the impact of time decay on a classifier (a small sketch follows the table) |
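As I understand it from the Tesseract paper, AUT is the trapezoidal area under a performance metric (e.g. F1) measured per time window, normalized so that a constant, non-decaying classifier scores 1.0. A small sketch with made-up monthly F1 scores:

```python
# Sketch of Area Under Time (AUT): trapezoidal area under a per-window metric,
# normalized so a constant, non-decaying classifier scores 1.0. Scores are made up.
def aut(metric_per_window):
    n = len(metric_per_window)
    if n < 2:
        return float(metric_per_window[0]) if metric_per_window else 0.0
    area = sum((metric_per_window[k] + metric_per_window[k + 1]) / 2 for k in range(n - 1))
    return area / (n - 1)

print(aut([0.95, 0.90, 0.72, 0.60, 0.41]))   # time decay pulls the AUT well below 1.0
```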
Comparing Apples to Oranges
Use common datasets to compare results accurately.
Avoid claiming superior performance in literature without real-world evidence.
Don't compare different types of approaches as they have distinct challenges.
Share source code to enable reproducibility and fair comparisons across different studies.
Delayed Labels Evaluation
Another unintentional mistake we often make is assuming that data and labels are available simultaneously. This is not true in the real world: antiviruses take time to identify new threats, so there is always a time gap between a sample appearing and its label becoming available.
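A minimal sketch of how an evaluation loop can respect this gap, assuming a River-style model (predict_one/learn_one) and a fixed, hypothetical label delay; in practice the delay varies per sample.

```python
# Prequential evaluation with delayed labels: predict immediately, but only learn from a
# sample once its label "arrives" `delay` steps later. Model interface is River-style.
from collections import deque

def prequential_with_delay(model, stream, delay=30):
    pending = deque()                       # samples still waiting for their labels
    predictions = []
    for t, (x, y) in enumerate(stream):
        predictions.append((t, model.predict_one(x)))
        pending.append((t, x, y))
        while pending and pending[0][0] <= t - delay:
            _, x_old, y_old = pending.popleft()
            model.learn_one(x_old, y_old)   # label became available only now
    return predictions
```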
Online vs Offline Experiments
Real-world considerations are important in both development and evaluation.
| Type | Features | Example |
| --- | --- | --- |
| Offline (used by security companies in their internal analyses) | More flexible; can use complex models run on large clusters; can collect huge amounts of data during a sandboxed execution; will detect more samples (since there is more data to decide on), but detection is likely to happen later than with an online detector | An AV company investigating whether a given payload is malicious in order to develop a signature for it that will later be distributed to customers |
| Online (used by the endpoints installed on user machines) | Must be fast (real time, ~0 performance overhead); operates in memory-limited environments; aims to detect the threat as soon as possible; expected to present more false negatives (because decisions must be fast) | A real-time AV monitoring a process's behavior; a packet filter inspecting network traffic |
So, online and offline solutions must be evaluated using distinct criteria. For a more complete security treatment, modern security solutions should consider hybrid models that use multiple classifiers (e.g., CHAMELEON).
DISCUSSION: UNDERSTANDING ML
ML model is a type of signature
An ML model is essentially a weighted boolean signature: we can tweak the weights for some flexibility, but the model cannot automatically expand beyond the feature set it was initially built on without extra processing (re-training or feature re-extraction), a drawback shared with signature schemes.
There is no such thing as a 0-day detector
What is 0-day?
A model's ability to identify new threats depends on how "new" is defined. If it includes any sample an attacker can create, machine learning models may detect many 0-days as many of these new payloads will be variations of previously known threats. However, if "new" means entirely unfamiliar to the model, like a payload with unique features, ML models struggle to identify it as malicious. This highlights the challenge of concept drift and evolution. Thus, ML models are not more resistant detectors than typical signature schemes; they represent a type of signature that can be more generalized.
Security & Explainable Learning
Why required?
To allow security incident response teams (CSIRTs) to close the identified security breaches.
To improve security around the monitored object in future versions.
To apply countermeasures.
Issue: Most deep learning models have no clear explanation for their operations.
The arms race will always exist
Defense solutions should try to reduce the gap between the development of new attacks and the generation of solutions for them.
The future of ML for cybersecurity
It is hard to imagine a future in cybersecurity without ML. Researchers are now looking to combine ML and cryptography: ML classifiers based on homomorphic encryption are gaining popularity, enabling data classification without decryption in order to preserve user privacy.
Final recommendations to improve ML in security:
Stop looking only at metrics, and start looking at effects
Commit yourself to the real world.
Check your work: https://secret.inf.ufpr.br/machine-learning-in-security-checklist/
Personal Opinion
Handling the privacy concerns using cryptography is a really cool idea. I am pretty excited about its future.
Like cryptography, there are several other fields from which cybersecurity can learn. Researchers are already doing this in quantum computing and cybersecurity, cyberpsychology, cybergenomics, and blockchain, and the results are interesting.
How long should we wait to update our model after a concept drift is detected? This is a question worth thinking about, and I don't think there is a solid answer to it right now.
The authors don't mention any strong ideas for handling the problems of transfer learning. This is quite a challenge because training models from scratch can take a lot of resources and time, so we cannot simply stop using transfer learning.
References
arXiv:2010.16045v2
dl.acm.org/doi/pdf/10.1145/3375894.3375898
gist.github.com/StevenACoffman/a5f6f682d94e..
usenix.org/conference/usenixsecurity17/tech..
ink.library.smu.edu.sg/sis_research/4525
link.springer.com/chapter/10.1007/978-3-540..
riverml.xyz/0.11.1/api/drift/EDDM
epubs.siam.org/doi/epdf/10.1137/1.978161197..
ieeexplore.ieee.org/abstract/document/5693384
dl.acm.org/doi/10.1145/2623330.2623609
researchgate.net/publication/272374055_Nove..
link.springer.com/article/10.1007/s10115-01..
link.springer.com/article/10.1007/s10115-01..
sharif.edu/~beigy/docs/2016/dbz16.pdf
link.springer.com/article/10.1007/s10115-01..
link.springer.com/article/10.1007/s10994-01..
dl.acm.org/doi/10.1016/j.eswa.2017.08.033
ieeexplore.ieee.org/document/9099383
dl.acm.org/doi/abs/10.1145/2133360.2133363
analyticsvidhya.com/blog/2022/06/one-class-..
spark.apache.org/docs/latest/streaming-prog..
github.com/cleverhans-lab/cleverhans
usenix.org/conference/usenixsecurity17/tech..
usenix.org/conference/usenixsecurity19/pres..
research.aimultiple.com/homomorphic-encrypt..
secret.inf.ufpr.br/machine-learning-in-secu..
ibm.com/thought-leadership/institute-busine..
ncbi.nlm.nih.gov/pmc/articles/PMC8614761