Machine Learning (In) Security: A Stream of Problems - Part 2
Based on sections 6,7,8 of paper "Machine Learning (In) Security: A Stream of Problems" by Ceschin et al.; CSCE 689 601 ML-Based Cyber Defenses
The paper describes a variety of problems faced by ML models today. The seminar delved into the challenges and pitfalls of ML modelling and evaluation, and the future of ML in cybersecurity. This blog was originally written for CSCE 689:601 and is the fourth blog of the series: "Machine Learning-Based CyberDefenses".
ML Modelling Issues and Solutions
In our previous discussions, we learned that a machine learning model is essentially a mathematical tool that makes predictions by spotting patterns in data. To find the best fit for a job, we try different models and tune them. ML models for cybersecurity are a bit different: in real-world deployments they face challenges that conventional approaches do not handle well.
Issue 1: Concept [Drift + Evolution]
Concept Drift is the situation in which the relation between input data and target variable changes over time. Recall that in the arms race, attackers are constantly changing their attack vectors when trying to bypass defenders' solutions.
A simple example: a spam filter trained on last year's emails starts missing new spam because spammers changed their wording, so the relation between the word features and the spam label has drifted.
Concept Evolution is the process of defining and refining concepts, resulting in new labels according to the underlying concepts.
A simple example: a brand-new ransomware family appears; no existing class in the training data describes it, so a new label must be created for it.
In cybersecurity, both problems (concept drift and concept evolution) may be correlated, given that new concepts may result in new labels, such as new types of attacks produced by attackers.
| Type of Drift | When? | Example | Simple example |
| --- | --- | --- | --- |
| Sudden | A concept is suddenly replaced by a new one | Creating a totally new attack | e.g. Covid-19 |
| Recurring | A previously active concept reappears after some time | An old type of attack starts to appear again after a given time | e.g. sales during Christmas or Black Friday |
| Gradual | The probability of finding the previous concept decreases and the new one increases until it is completely replaced | New types of attacks are created and gradually replace previous ones | e.g. effects of climate change, minimal and undetectable over short periods |
| Incremental | The difference between the old and new concept is very small and is only noticed over a longer period | Attackers make small modifications to their attacks so that the concept changes over a long period | e.g. a stock that never goes down |
Solutions
Ensemble of classifiers
Similarity scores (cosine similarity or Jaccard index) between time-ordered pairs
Relative temporal similarity
Meta features
Online learning (e.g. DroidOL)
Framework using Venn-Abers predictors to identify when models tend to become obsolete.
Identify aging classification models during deployment (e.g. Transcend).
Use reinforcement learning to generate adversarial samples and retrain
Automatically update feature set and models without human involvement (e.g. DroidEvolver)
Concept drift detectors (a minimal detector sketch follows this list): data stream learning solutions not yet fully explored by cybersecurity researchers, such as:
DDM: Drift Detection Method
EDDM: Early Drift Detection Method
ADWIN: ADaptive WINdowing
Delayed labeling (unsupervised)
Learn and adapt by paying attention to what went wrong in the past (e.g. SyncStream)
Neighborhood graphs and fuzzy agglomerative clustering method
SPASC (Semi-supervised Pool and Accuracy-based Stream Classification)
Monitor the distribution of error
GraphPool
Adaptive Random Forest (ARF)
Multiple kernel learning
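To make the drift detector idea concrete, here is a minimal sketch using River's ADWIN detector to monitor the error stream of a deployed model. The error values are made up for illustration, and attribute names may differ across River versions.

```python
# Minimal sketch: monitoring a deployed model's error stream with ADWIN (River).
# The error values are hypothetical; attribute names may differ in older River releases.
from river import drift

detector = drift.ADWIN()

# 1 = the deployed model misclassified a sample, 0 = it classified it correctly
error_stream = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]

for i, err in enumerate(error_stream):
    detector.update(err)
    if detector.drift_detected:
        print(f"Drift detected at sample {i}: consider retraining or updating the model")
```

DDM and EDDM expose a similar update/check interface and can be swapped in the same loop.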
Issue 2: Adversarial Attacks
An adversarial attack is a malicious attempt to perturb a data point into another point so that it falls into a certain target (adversarial) class. Models are prone to adversarial attacks in which attackers modify their malicious vectors/features so that they are not detected.
Consequences of adversarial attacks:
1. Allowing the execution of malicious software
2. Poisoning an ML model or drift detector if they use new unknown samples to update their definitions (without a ground truth from other sources)
3. Producing, as a consequence of 2, concept drift and evolution
When developing cybersecurity solutions using ML, both features and models must be robust against adversaries.
Direct attacks: Instead of using adversarial features, attackers may also directly attack ML models.
White-box attacks: adversary has full access to the model
Gradient-based attacks use the model's gradients (computed through the network's weights) to perturb input data, creating deceptive variants that get misclassified (see the sketch below).
Analyzing models like decision trees or SVMs allows attackers to craft deceptive data by manually changing the features that most influence the model's output.
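As an illustration of the gradient-based idea, here is a minimal FGSM-style sketch assuming a differentiable PyTorch classifier; the model, inputs, and epsilon are placeholders. Note that in malware detection the perturbed feature vector must still map back to a valid, working sample, which this sketch does not address.

```python
# Hypothetical FGSM-style white-box perturbation (PyTorch); model and inputs are placeholders.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.01):
    """Perturb x in the direction that increases the loss, using the model's gradients."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step along the sign of the gradient to push the sample toward a misclassification
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```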
Black-box attacks: these tend to be more challenging, and more realistic, for adversaries, since they usually do not have access to the implementation of the cybersecurity solution or the ML model, i.e., they do not know which features and classifiers the solution uses and usually only observe the raw input and the output.
Create random perturbations and test them on the input data (a sketch follows this list)
Change characteristics from samples looking at instances from all classes
Try to mimic the original model by creating a local model
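A minimal sketch of the first strategy (random perturbations), assuming only query access to a prediction function; predict_fn, the feature vector, and the query budget are hypothetical, and a real attack would additionally need to keep the perturbed sample functional.

```python
# Hypothetical black-box attack: perturb the input at random and query the target model
# until its predicted label flips. No access to weights, gradients, or features is assumed.
import numpy as np

def random_perturbation_attack(predict_fn, x, n_trials=1000, scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    original_label = predict_fn(x)
    for _ in range(n_trials):
        candidate = x + rng.normal(0.0, scale, size=x.shape)
        if predict_fn(candidate) != original_label:
            return candidate   # evasive variant found
    return None                # no evasion found within the query budget
```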
Solutions
Generative Adversarial Networks (GANs)
Data augmentation or oversampling techniques (e.g., retraining on adversarial samples; see the sketch after this list)
Use probability labels instead of using the hard class labels
Have only non-negative weights (e.g. Non-Negative MalConv)
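One way to combine the data augmentation idea with the hypothetical fgsm_perturb sketch above is adversarial training: augment each training batch with adversarial variants before updating the model. This is only a sketch; the model, optimizer, and batches are placeholders.

```python
# Sketch of adversarial training (data augmentation with adversarial samples), reusing the
# hypothetical fgsm_perturb above; model, optimizer, and batches are placeholders.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.01):
    x_adv = fgsm_perturb(model, x, y, epsilon)   # adversarial variants of the clean batch
    x_all = torch.cat([x, x_adv])                # train on clean + adversarial data
    y_all = torch.cat([y, y])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_all), y_all)
    loss.backward()
    optimizer.step()
    return loss.item()
```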
Issue 3: Class imbalance
We have been discussing class imbalance in datasets from day one, so we are already aware of its effects on ML models.
Solutions
Cost-sensitive learning (not easy to implement, but faster; see the sketch after this list)
Ensemble learning:
Bagging (e.g. Random forest)
Boosting (e.g. AdaBoost)
Anomaly detection (e.g. Isolation forest, One-class SVM)
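A minimal sketch of two of these options with scikit-learn: cost-sensitive learning via class weights and anomaly detection with Isolation Forest. The tiny dataset is made up for illustration.

```python
# Minimal sketch: cost-sensitive learning and anomaly detection with scikit-learn.
from sklearn.ensemble import RandomForestClassifier, IsolationForest

# Tiny made-up dataset: 2 features, heavily imbalanced labels (1 = malicious)
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.9, 0.8], [0.1, 0.3], [0.2, 0.2]]
y = [0, 0, 0, 1, 0, 0]

# Cost-sensitive learning: weigh errors on the rare (malicious) class more heavily
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42).fit(X, y)

# Anomaly detection: learn the "normal" profile and flag outliers (-1 = anomaly, 1 = normal)
iso = IsolationForest(contamination=0.2, random_state=42).fit(X)

print(clf.predict([[0.85, 0.9]]), iso.predict([[0.85, 0.9]]))
```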
Issue 4: Transfer Learning Problem
Transfer learning may also be a problem: base models are usually publicly available, which means that potential attackers have access to them and can produce an adversarial vector that affects both models, the base one and the new one.
Solution
Use robust base models
Consider the robustness of the base model when using transfer learning, so as to produce a solution without inherited security weaknesses.
Issue 5: Implementation problem
| Problem | Solution |
| --- | --- |
| Popular frameworks like scikit-learn and Weka use batch learning algorithms | Use ML libraries for streaming data (e.g. Scikit-Multiflow, Massive Online Analysis (MOA), River, Spark Streaming; see the sketch below the table) and adversarial ML frameworks like CleverHans and SecML |
| Multi-language codebases may become incompatible with new releases or may be too slow, and optimizations are not always performed | Optimize ML implementations using C and C++ under the hood (faster than Python and Java) |
| If any component of a data stream pipeline fails, the whole system might fail | Fault tolerance, modular architecture, monitoring, alerting, automated testing, rollback mechanisms |
| Performance: slow models will slow down the whole system | Outsource the processing of ML algorithms to third-party components like hardware devices, or outsource scanning procedures to the cloud |
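Below is a minimal sketch of stream (online) learning with River, one of the libraries mentioned in the table; the feature dictionaries and labels are made up, and the "predict first, then learn" loop mirrors prequential evaluation.

```python
# Minimal online-learning sketch with River; features and labels are hypothetical.
from river import tree

model = tree.HoeffdingTreeClassifier()

stream = [({"api_calls": 12, "pkt_size": 300}, 0),
          ({"api_calls": 87, "pkt_size": 1400}, 1),
          ({"api_calls": 91, "pkt_size": 1500}, 1)]

for x, y in stream:
    y_pred = model.predict_one(x)   # may be None before the model has seen any data
    model.learn_one(x, y)           # incrementally update the model with this sample
```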
EVALUATION
Some evaluations may lead to wrong conclusions that backfire in security contexts.
Metrics
To correctly evaluate a solution, the right metrics need to be selected: they should provide significant insights and present different perspectives of the problem according to its real needs. A short scikit-learn example computing the core metrics appears after the table.
| Metric | Definition/Formula | Comments |
| --- | --- | --- |
| Accuracy | (TP+TN)/(TP+FP+TN+FN) | May lead to wrong conclusions; avoid it on imbalanced datasets |
| Confusion matrix | A matrix where each row represents the real label and each column a predicted class (or vice versa) | Shows which class is hardest to classify and which classes are confused the most |
| Recall | TP/(TP+FN) | Prioritize high recall when blocking all malicious actions matters, even at the cost of blocking some benign ones |
| Precision | TP/(TP+FP) | Prioritize high precision when it is better to miss some malware than to block benign software |
| F1 score | Harmonic mean of recall and precision | A balanced compromise between recall and precision |
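A quick sketch computing these metrics with scikit-learn on made-up labels, just to show where the formulas land in practice.

```python
# Computing the metrics above with scikit-learn; labels are made up (1 = malicious, 0 = benign).
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))            # rows: real labels, columns: predictions
print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```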
| Evaluation framework | Metric | Comments |
| --- | --- | --- |
| Conformal Evaluator (CE) | Two metrics: algorithm confidence and credibility | Evaluates the robustness of the predictions made by the algorithm and their quality |
| Tesseract | Area Under Time (AUT) | Captures the impact of time decay on a classifier (a small sketch follows the table) |
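As I understand it from the Tesseract paper, AUT is the trapezoidal area under a performance metric (e.g. F1) measured per time window, normalized so that a constant, non-decaying classifier scores 1.0. A small sketch with made-up monthly F1 scores:

```python
# Sketch of Area Under Time (AUT): trapezoidal area under a per-window metric,
# normalized so a constant, non-decaying classifier scores 1.0. Scores are made up.
def aut(metric_per_window):
    n = len(metric_per_window)
    if n < 2:
        return float(metric_per_window[0]) if metric_per_window else 0.0
    area = sum((metric_per_window[k] + metric_per_window[k + 1]) / 2 for k in range(n - 1))
    return area / (n - 1)

print(aut([0.95, 0.90, 0.72, 0.60, 0.41]))   # time decay pulls the AUT well below 1.0
```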
Comparing Apples to Oranges
Use common datasets to compare results accurately.
Avoid claiming superior performance in literature without real-world evidence.
Don't compare different types of approaches as they have distinct challenges.
Share source code to enable reproducibility and fair comparisons across different studies.
Delayed Labels Evaluation
Another unintentional mistake we often make is assuming that data and labels are available simultaneously. This is not true in the real world: antiviruses take time to identify new threats, so there is always a time gap between a sample appearing and its label becoming available.
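A minimal sketch of how an evaluation loop can respect this gap, assuming a River-style model (predict_one/learn_one) and a fixed, hypothetical label delay; in practice the delay varies per sample.

```python
# Prequential evaluation with delayed labels: predict immediately, but only learn from a
# sample once its label "arrives" `delay` steps later. Model interface is River-style.
from collections import deque

def prequential_with_delay(model, stream, delay=30):
    pending = deque()                       # samples still waiting for their labels
    predictions = []
    for t, (x, y) in enumerate(stream):
        predictions.append((t, model.predict_one(x)))
        pending.append((t, x, y))
        while pending and pending[0][0] <= t - delay:
            _, x_old, y_old = pending.popleft()
            model.learn_one(x_old, y_old)   # label became available only now
    return predictions
```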
Online vs Offline Experiments
Real-world considerations are important in both development and evaluation.
| Type | Features | Example |
| --- | --- | --- |
| Offline (used by security companies in their internal analyses) | More flexible; can use complex models run on large clusters; can collect huge amounts of data during a sandboxed execution; will detect more samples (since there is more data to decide on), but detection is likely to happen later than with an online detector | An AV company investigating whether a given payload is malicious in order to develop a signature for it that will later be distributed to customers |
| Online (used by the endpoints installed on user machines) | Must be fast (real time, ~0 performance overhead); operates in memory-limited environments; aims to detect the threat as soon as possible; expected to present more false negatives (because decisions must be fast) | A real-time AV monitoring a process's behavior; a packet filter inspecting network traffic |
So, online and offline solutions must be evaluated using distinct criteria. For a more complete security treatment, modern security solutions should consider hybrid models that use multiple classifiers (e.g., CHAMELEON).
DISCUSSION: UNDERSTANDING ML
ML model is a type of signature
An ML model is essentially a weighted boolean signature: we can tweak the weights for some flexibility, but the model cannot automatically expand beyond the feature set it was initially built on without extra processing (re-training or feature re-extraction), a drawback shared with signature schemes.
There is no such thing as a 0-day detector
What is 0-day?
A model's ability to identify new threats depends on how "new" is defined. If it includes any sample an attacker can create, machine learning models may detect many 0-days as many of these new payloads will be variations of previously known threats. However, if "new" means entirely unfamiliar to the model, like a payload with unique features, ML models struggle to identify it as malicious. This highlights the challenge of concept drift and evolution. Thus, ML models are not more resistant detectors than typical signature schemes; they represent a type of signature that can be more generalized.
Security & Explainable Learning
Why required?
To allow security incident response teams (CSIRTs) to close the identified security breaches.
To improve security around the monitored object in future versions.
To apply countermeasures.
Issue: Most deep learning models have no clear explanation for their operations.
The arms race will always exist
Defense solutions should try to reduce the gap between the development of new attacks and the generation of solutions for them.
The future of ML for cybersecurity
It is hard to imagine a future in cybersecurity without ML. Researchers are now looking to combine ML and cryptography: ML classifiers based on homomorphic encryption are gaining popularity, enabling data classification without decryption in order to preserve user privacy.
Final recommendations to improve ML in security:
Stop looking only at metrics, and start looking at effects
Commit yourself to the real world.
Check your work: https://secret.inf.ufpr.br/machine-learning-in-security-checklist/
Personal Opinion
Handling the privacy concerns using cryptography is a really cool idea. I am pretty excited about its future.
Like cryptography, there are several other fields from which cybersecurity can learn. Researchers are already doing this in quantum computing and cybersecurity, cyberpsychology, cybergenomics, and blockchain, and the results are interesting.
How long should we wait to update our model after a concept drift is detected? This is a question worth thinking about, and I don't think there is a solid answer to it right now.
The authors don't mention any strong ideas for handling the problems of transfer learning. This is quite a challenge because training models from scratch can take a lot of resources and time, so we cannot simply stop using transfer learning.
References
arXiv:2010.16045v2
dl.acm.org/doi/pdf/10.1145/3375894.3375898
gist.github.com/StevenACoffman/a5f6f682d94e..
usenix.org/conference/usenixsecurity17/tech..
ink.library.smu.edu.sg/sis_research/4525
link.springer.com/chapter/10.1007/978-3-540..
riverml.xyz/0.11.1/api/drift/EDDM
epubs.siam.org/doi/epdf/10.1137/1.978161197..
ieeexplore.ieee.org/abstract/document/5693384
dl.acm.org/doi/10.1145/2623330.2623609
researchgate.net/publication/272374055_Nove..
link.springer.com/article/10.1007/s10115-01..
link.springer.com/article/10.1007/s10115-01..
sharif.edu/~beigy/docs/2016/dbz16.pdf
link.springer.com/article/10.1007/s10115-01..
link.springer.com/article/10.1007/s10994-01..
dl.acm.org/doi/10.1016/j.eswa.2017.08.033
ieeexplore.ieee.org/document/9099383
dl.acm.org/doi/abs/10.1145/2133360.2133363
analyticsvidhya.com/blog/2022/06/one-class-..
spark.apache.org/docs/latest/streaming-prog..
github.com/cleverhans-lab/cleverhans
usenix.org/conference/usenixsecurity17/tech..
usenix.org/conference/usenixsecurity19/pres..
research.aimultiple.com/homomorphic-encrypt..
secret.inf.ufpr.br/machine-learning-in-secu..
ibm.com/thought-leadership/institute-busine..
ncbi.nlm.nih.gov/pmc/articles/PMC8614761