Automatic Yara Rule Generation Using Biclustering
Summary of seminar based on Raff et al. paper; CSCE 689 601 ML-Based Cyber Defenses
This paper describes AutoYara algorithm using biclustering and large n-grams to automate Yara rule generation for specific malware families, aimed at reducing analyst workload and enhancing rule efficiency. This blog is originally written for CSCE 689:601 and is the 23rd blog of the series: "Machine Learning-Based CyberDefenses".
Paper Highlights
YARA Rules (customized patterns for malware detection), use text-based signatures to pinpoint specific attributes of malicious software. They help in threat detection and incident response by enabling proactive scanning across files, memory, and network traffic. They have flexible syntax, so we can create precise detection rules tailored to varied malware scenarios.
Developing high-quality Yara rules for malware detection is labor intensive and time consuming. Existing solutions suffer from drawbacks; YarGen relies on a Naive Bayes model, VXsig utilizes the least-common-subsequence (LCS) algorithm, both requiring manual adjustment and human intervention, while the Greedy Approach is ineffective with limited samples, potentially lacking in rule generation.
The proposed solution, AutoYara, initially attempted with "Kilogram," encountered impracticalities leading to adoption of biclustering for feature selection and rule formulation. Overcoming challenges through seminal spectral co-clustering, AutoYara integrates large n-grams and the biclustering algorithm, ensuring comprehensive rule generation. Implemented in Java for cross-platform compatibility and swift execution, AutoYara's algorithm comprises two main components: "Filter Sample," identifying top-k most frequent n-grams, and "AutoYara," which encompasses building Bloom Filters, creating Yara rules, performing biclustering, and evaluating and selecting rules. Leveraging datasets from the Ember2017 corpus and VirusShare, AutoYara constructs Yara rules from raw bytes, enhancing its efficacy in malware detection and analysis.
Performance evaluation involved comparing AutoYara with YarGen and the Greedy approach using the Fβ metric with β set to 0.001, aiming for a false positive rate (FP) ≤ 0.1%, and large-scale testing demonstrated AutoYara's superiority.
Takeaways
Antivirus companies use both machine learning and signatures depending on the specific case. Deciding when to use a signature or a model depends on factors like time consumption and concept drift. For vulnerabilities that last only a day, a signature is created and then deployed in the machine learning model to teach it about the new malware. When AV claims using rules to detect malware, it does not necessarily mean they didn't use machine learning. In fact, machine learning might have been used to design the rule for detecting that specific malware.
For short-lived malware, signatures are advantageous due to their speed in deployment. However, signatures have limitations: they may not generalize well, are not storage-efficient, but are very fast. It is important to note that rules can be generated by models too, but this doesn't necessarily mean that humans develop the rules.
This paper aims to develop the AutoYara algorithm, which automates the generation of Yara rules for specific malware families by leveraging biclustering and large n-grams. The idea is to improve the efficiency and effectiveness of Yara rule construction while reducing analyst workload and enabling deployment on low-resource equipment. This approach is superior to previous methods because biclustering identifies relevant features and their combinations for Yara rules, allowing for the creation of precise rules with high true-positive rates and low false-positive rates. Biclustering helps because it selects features jointly and avoids the limitations of simple clustering, such as the need to specify the number of clusters in advance and the inability to handle overlapping clusters.
In contrast to traditional feature extraction methods used in machine learning, Yara rules involve human analysis as the classifier instead of throwing all features at the classifier. This approach makes it easier to explain the rules and ensures human interpretability. The authors suggest combining the strengths of both approaches to leverage the benefits of machine learning while retaining the human readability of rules. Ultimately, the output is not a model but a rule, emphasizing the importance of human-readable rules in cybersecurity analysis and decision-making processes.
Simple clustering won't work effectively because it overlooks the trade-off between the size of the n-gram sequence and false positives, similar to the trade-off encountered in adversarial generation where adding more bytes has its limitations. If the n-gram sequence is too generic, it can lead to an increase in false positives. Therefore, a more sophisticated approach like biclustering is necessary to identify relevant features and their combinations, striking a balance between specificity and false positives in Yara rule generation.
The rules generated by automatic methods are significantly more complex and less straightforward compared to those generated by humans, making them much harder to understand. Human-generated rules are typically more concise and easier to interpret. The model used to generate YARA rules must be trained on both goodware and malware files, as the rules should not match with goodware to avoid causing issues for users of the antivirus software. YARA rules are designed for specific targeting rather than generalization, focusing on particular files of interest. In the event of a malware attack targeting a company, analysts utilize YARA rules to search for other potential malware within the network. Automatic YARA rule generation can help analyze more files in short time.
These rules can be stored in hardware, which would result in significantly lower runtime compared to running them in software. However, there is a limitation in memory capacity in hardware, especially when storing in cache. Storing signatures in hardware allows for running them in a single cycle, enhancing speed. Nonetheless, there is a tradeoff between speed and coverage, suggesting the inclusion of the largest threats in the hardware to prioritize effectively.
In previous papers, genetic algorithms were proposed as a solution, but now biclustering is used because it addresses the two-dimensional nature of the problem, similar to genetic algorithms. The approach is notably similar in both contexts, demonstrating the adaptability and effectiveness of biclustering in addressing complex problems across different domains.
An efficient method to use storage is through the use of Bloom Filters. Bloom Filters are probabilistic data structures that reduce storage size but come with the trade-off of potential false positives. While not always accurate due to the possibility of collisions in hashing functions, Bloom Filters guarantee no false negatives but may produce false positives. Despite this drawback, Bloom Filters offer the advantage of being able to store numerous signatures. In hardware development, Bloom Filters are commonly used for their storage efficiency, although some data may be lost due to false positives.