EMBERSim: a large database for accelerated similarity search in malware analysis

Binary code similarity (BCS) is an important component of training machine learning (ML) models to analyse large volumes of cybersecurity telemetry effectively. Historically, however, BCS work has focused on similarities among malware samples and has largely ignored benign ones, which limits its effectiveness. The CrowdStrike research team released EMBERSim, a BCS dataset that extends the existing EMBER dataset with tags enriched through a co-occurrence algorithm and with a new leaf similarity method that takes both benign and malicious binaries into account. This approach to quantifying similarity improves BCS results in ML models, which suggests that EMBERSim can improve malware detection and support further research in this area.

Purpose of the study

The main goal of the CrowdStrike study is to overcome the limitations of existing BCS approaches in order to improve malware detection and support further research in this area. The study builds on the existing EMBER dataset, which provides features and tags for Portable Executable (PE) files for use in malware classification.

New EMBERSim dataset

EMBERSim is a BCS research dataset that extends the malware family (FAM) metadata of the original EMBER dataset with similarity information, class (CLASS) and behaviour (BEH) tags, and additional family (FAM) tags. The extended tag list is built with a tag co-occurrence algorithm.

Classification and assessment

According to CrowdStrike, it is the first to repurpose an XGBoost malware classifier to quantify pairwise similarity at the leaf level. The company also proposes a scheme for evaluating the leaf similarity technique based on Top-K Selection and Relevance@K. The method was compared with ssdeep, a fuzzy hashing tool widely used for similarity computation in cybersecurity, and leaf similarity proved to be the better alternative.

Description of metadata and tags

For each sample in EMBER, its SHA256 hash was used to query VirusTotal (VT), and AVClass v2 was run on the results to retrieve tags with their corresponding confidence values. AVClass also provides co-occurrence statistics for tag pairs, which makes it possible to enrich the tag set of each EMBER malware sample by adding tags that co-occur with its existing tags above a certain frequency threshold. The purpose of this enrichment is to make it possible to find samples with common characteristics even if they belong to different families.
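The enrichment step can be illustrated with a minimal sketch. It is not the AVClass implementation: the function names, the toy tags, and the criterion (add tag b when it appears in at least `threshold` of the samples that carry tag a) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tag_sets):
    """Count how often each tag and each ordered tag pair appears across samples."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tag_sets:
        unique = set(tags)
        tag_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return tag_counts, pair_counts


def enrich_tags(tags, tag_counts, pair_counts, threshold=0.9):
    """Add every tag b that accompanies an existing tag a in at least
    `threshold` of the samples carrying a (hypothetical criterion)."""
    enriched = set(tags)
    for a in tags:
        for (x, b), n in pair_counts.items():
            if x == a and n / tag_counts[a] >= threshold:
                enriched.add(b)
    return enriched


# Toy tag sets in a FAM/CLASS/BEH style; the values are made up.
samples = [
    {"FAM:emotet", "CLASS:downloader", "BEH:infosteal"},
    {"FAM:emotet", "CLASS:downloader"},
    {"FAM:emotet", "CLASS:downloader", "BEH:infosteal"},
    {"FAM:zbot", "BEH:infosteal"},
]
tag_counts, pair_counts = cooccurrence_counts(samples)
print(sorted(enrich_tags({"FAM:emotet"}, tag_counts, pair_counts)))
```

In this toy data the downloader class accompanies the family in every sample, so it is added, while the behaviour tag appears in only two of three samples and stays below the 0.9 threshold.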

Leaf Prediction Similarity method

Given a trained tree ensemble model, the similarity of two samples is defined as the similarity of the leaves they reach in that model. The method can be applied to any type of tree ensemble; XGBoost was used in the experiments. Similarity is calculated as the proportion of trees in which both samples fall into the same leaf node.
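The definition above can be sketched in a few lines. The sketch assumes that each sample is already represented by the list of leaf indices it reaches, one per tree; with XGBoost, such indices can be obtained with `Booster.predict(..., pred_leaf=True)`. The function names and the toy values are illustrative, not taken from the study.

```python
def leaf_similarity(leaves_a, leaves_b):
    """Fraction of trees in which two samples land in the same leaf."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("both samples need one leaf index per tree")
    matches = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return matches / len(leaves_a)


def pairwise_leaf_similarity(leaves):
    """Similarity matrix for a list of per-sample leaf index lists."""
    return [[leaf_similarity(a, b) for b in leaves] for a in leaves]


# Toy leaf indices for 3 samples across 4 trees (made-up values).
leaves = [
    [3, 7, 1, 4],
    [3, 7, 2, 4],
    [5, 6, 2, 0],
]
print(leaf_similarity(leaves[0], leaves[1]))  # 0.75: same leaf in 3 of 4 trees
print(pairwise_leaf_similarity(leaves))
```

Because the score depends only on the leaf indices, it works the same way for benign and malicious samples and needs no access to the raw binaries once the model has been applied.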

Relevance@K score

CrowdStrike conducted a second evaluation, which checks the relevance of the search results in a tag enrichment scenario. Relevance@K uses tag rankings to determine how relevant the retrieved samples are to the query. The relevance of each retrieved sample is scored with one of several functions: exact match (EM), intersection over union (IoU), and normalized edit similarity (NES).
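A simplified sketch of a Relevance@K computation follows. It makes several assumptions that are not confirmed by the source: tags are treated as unordered sets rather than ranked lists, the score is the plain average over the top K hits, and only EM and IoU are shown. All names and values are hypothetical.

```python
def top_k(similarities, k):
    """Indices of the k highest-scoring candidates, best first."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return ranked[:k]


def exact_match(query_tags, hit_tags):
    """1.0 when both tag sets are identical, otherwise 0.0."""
    return 1.0 if set(query_tags) == set(hit_tags) else 0.0


def iou(query_tags, hit_tags):
    """Intersection over union of two tag sets."""
    q, h = set(query_tags), set(hit_tags)
    union = q | h
    return len(q & h) / len(union) if union else 1.0


def relevance_at_k(query_tags, candidate_tags, similarities, k, relevance=iou):
    """Average relevance of the k candidates most similar to the query."""
    hits = top_k(similarities, k)
    if not hits:
        raise ValueError("k must be positive and candidates non-empty")
    return sum(relevance(query_tags, candidate_tags[i]) for i in hits) / len(hits)


# Made-up query, candidates, and similarity scores (e.g. leaf similarity).
query = {"FAM:emotet", "CLASS:downloader"}
candidates = [
    {"FAM:emotet", "CLASS:downloader"},
    {"FAM:emotet"},
    {"FAM:zbot"},
]
scores = [0.9, 0.8, 0.1]
print(relevance_at_k(query, candidates, scores, k=2))                         # 0.75
print(relevance_at_k(query, candidates, scores, k=2, relevance=exact_match))  # 0.5
```

Passing the relevance function as a parameter keeps the ranking step separate from the scoring step, so the same top-K hits can be scored with different functions.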

The analysis showed that the leaf similarity method outperforms ssdeep for both malicious and benign queries. The evaluation results confirmed that the approach is effective at identifying and distinguishing malicious and benign samples.

Summary

CrowdStrike’s research team continues its research to stay ahead in threat detection and response. EMBERSim is one example of this effort to improve malware analysis and detection methods and to encourage further research in this area.
