Detecting Malicious .NET Files Using CLR Header Features and Machine Learning

The .Net Framework has made writing Windows applications easier than ever. Several programming languages can target the .Net Framework, the most common being C#. Due to the abundance of modules and pre-built functionality that allow programmers to manipulate the Windows operating system at a high level of abstraction, without low-level coding, the .Net Framework has also become a desirable environment for malicious actors to write their malware. To the best of our knowledge, researchers have been treating .NET malware and other malware the same way, utilizing features from the PE header to classify the files. This is not feasible for .Net files because their PE headers are nearly identical. In this paper, we tackle the problem of detecting malicious .Net files by extracting features from the CLR header. As far as we know, we are the first to explore this approach. Furthermore, we create a new dataset comprised of .Net malware and benign files, which we freely distribute to the research community. Finally, we assess the performance of several machine learning algorithms in detecting malicious .NET files. The random forest model was the best among the algorithms tested, exhibiting an accuracy of 92% on this predictive task.


Introduction
On February 13, 2002 [1], Microsoft introduced the .Net Framework, a software development framework, and now a built-in component of Windows, with the goal of making programming easier. With .NET, users no longer need to create pointers to manipulate memory, nor do they need to free the objects they create, as these functionalities are provided by the .Net Framework's garbage collection. Various programming languages can target the .Net Framework, with C# being the most common. C# makes developing applications very easy. However, the ease of writing programs that manipulate the Windows operating system in .Net also made it a desirable language for malicious actors to write their malware in. Between 2009 and 2015, we witnessed a 1600% growth in the number of unique .NET malware samples [2]. According to the Cybersecurity and Infrastructure Security Agency (CISA), 3 of the top 10 malware strains in 2021 were written in C# [3], namely Agent Tesla, LokiBot and NanoCore. When a .Net file is compiled, the resulting file can be a DLL or an EXE file. However, these files are not like ordinary PE files, i.e., they are not native executables. Unlike an executable compiled from C/C++ code, a .NET file does not contain x86 assembly code. Instead, the C# code is compiled into an intermediate language called the Common Intermediate Language (CIL) [4]. Later, when the file is executed, the Common Language Runtime (CLR) converts the CIL code into native assembly code using Just-in-Time (JIT) compilation. This provides applications written in .Net languages with various benefits such as memory management, garbage collection, type safety and exception handling [5]. The CLR acts as a virtual machine that handles the execution of .Net files and the translation from managed code, i.e., the CIL code, into native code, i.e., x86 assembly code. Figure 1 illustrates the .Net compilation process.
In order for the Windows loader to execute a file, the file must have a PE header [6]. Even though .Net files have a PE header, it exists only for legacy purposes, to allow the files to be executed by the Windows loader as portable executables [7]. This header is largely redundant, as all .Net executables have nearly identical PE headers, with only minor differences. They import only one library (mscoree.dll) and only one function (_CorExeMain). Figure 2 displays a comparison between two .Net files, a benign one, the tool ILSpy [8], and a malicious one, the AsyncRAT malware [9], using the tool PeNet [10]. We observe that they have nearly identical PE headers, the only difference being the virtual size of the file sections, that is, the size of each section in memory. In Figure 3 we display a binary-level comparison of the two files using the tool PE-bear [11]. This figure confirms that the two files (one benign and one malicious) have nearly identical PE headers, with the only minor differences being the names of the files, their sizes, and the names of their sections. Thus, using only the PE header to classify .Net files as benign or malicious is not feasible, as it does not contain enough information.
Below the PE header lies the most important data structure in a .Net file: the CLR header, which contains the information the CLR virtual machine needs to load the file. The CLR header is not documented by Microsoft, which is one of the reasons for the lack of research in this area. Still, several unofficial resources exist online explaining it in detail [12]. When a .Net file is executed, the Windows loader transfers control to the Common Language Runtime, which uses the metadata in the CLR header to correctly execute the file. In this paper, we focus on the information present in the CLR header to detect whether a file is malicious. We present a static analysis method using information extracted from the CLR header to evaluate the effectiveness of different machine learning classifiers in the detection of .Net-based malware.
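For readers who want to locate the CLR header themselves: a PE file is a .NET file exactly when data directory entry 14 (IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR) of the optional header is non-empty, as that entry holds the RVA and size of the CLR header. The following stdlib-only sketch illustrates this check; it is an illustration of the file layout, not the parser used in this work:

```python
import struct

def is_dotnet_pe(data: bytes) -> bool:
    """Return True if the PE's CLR (COM descriptor) data directory is non-empty."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]   # offset of the PE signature
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        return False
    opt = e_lfanew + 4 + 20                              # optional header follows the 20-byte COFF header
    magic = struct.unpack_from("<H", data, opt)[0]
    dd_offset = 96 if magic == 0x10B else 112            # data directories: PE32 vs PE32+
    clr_entry = opt + dd_offset + 14 * 8                 # entry 14 = COM descriptor (CLR header)
    rva, size = struct.unpack_from("<II", data, clr_entry)
    return rva != 0 and size != 0
```

A native C/C++ executable leaves this entry zeroed, so the check cleanly separates .NET files, for which only the CLR header carries discriminative information, from ordinary PE files.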
The key contributions of this study are as follows.
• We present a new dataset for the detection of malicious .Net files. We collected, pre-processed and extracted relevant features from a large set of 216,006 malicious files and 14,194 benign files, which we provide to the research community at https://github.com/MagicianMido32/Detecting-Malicious-.NET-Files-Using-CLR-Header-Features-and-Machine-Learning;
• We created various scrapers to collect fresh benign and malicious .Net executables from the internet when building our dataset. We also provide this code to the research community;
• We carry out an analysis of the performance of seven different learning algorithms using the proposed dataset.
This paper is organized as follows. In Section 2 we present the most relevant related works. Section 3 provides an overview of the new dataset that we collected and preprocessed. We present our experimental evaluation in Section 4, and the results are displayed and discussed in Section 5. Finally, Section 6 concludes this paper and discusses future research avenues.

Related Work
While much research has studied the Portable Executable header, extracting features from it and developing machine learning models for malware detection, for instance using assembly opcode sequences [13], the imports table [14] or PE header characteristics [15], to the best of our knowledge, little work has been done to exploit the .NET CLR header for static malware classification. The closest research work we have found that extracts features related to the CLR is that of Tom Leemreize, who analyzes fileless malware for the .NET Framework through CLR profiling [16]. The author collected a dataset of five post-exploitation frameworks that maliciously utilized .Net Framework functionalities (both PowerShell and .Net PE). A .Net profiler was then used to build a .Net API call tree for each malware sample. YARA signatures were then created to detect malicious tactics used by the malware. To the best of the authors' knowledge, this is the only work able to analyze all .Net Framework application formats, including compiled .Net files, which are the focus of our work. The author also highlighted the research focus on PowerShell, confirming that malware detection in the .Net Framework in general is significantly under-researched [16], which resonates with our findings.
Current solutions, such as detection based on the Antimalware Scan Interface (AMSI) [17], have important drawbacks, as they can be bypassed by malware [16]. One such example is the technique named Bring Your Own Interpreter (BYOI), where the malware itself is written in another language, such as Python, and run through the .Net environment using the dynamic language runtime with the IronPython project [18]. Leemreize [16] collected a dataset of stagers 1 from 5 post-exploitation frameworks that maliciously utilized .Net Framework functionalities. A customized .Net profiler was implemented using the tool GroboTrace. The profiler builds a call tree of .Net API calls, regardless of their source, whether a .Net PE file or a PowerShell script. A set of YARA rules to detect the different malicious tactics used by these stagers was proposed.
Still, no machine learning or deep learning techniques are used in [16] to address the malware detection problem. Moreover, manually created signatures can be easily bypassed, and the population sample used is small, with not all samples being .Net PE files (some are PowerShell). In our paper, we explore the use of machine learning algorithms to tackle this predictive task and build a large dataset that we make available to the research community. In the following sections we provide more details about the dataset and the performance of the learning algorithms we evaluated.

The .Net Files Dataset
We propose a new dataset for tackling the issue of static detection of malicious .Net PE files. We are the first, to the best of our knowledge, to create such a dataset. To achieve this, we built custom web scrapers to collect fresh malicious .Net files from different sources, including: Any Run 2, MalwareBazaar 3, VirusShare 4, theZoo 5, VX Underground 6, Twitter, GitHub and others. We created a tool that filters large malware collections and removes any files that are not .Net files. We collected a total of 216,006 malicious .Net files. For the collection of benign examples, we created several scrapers to download executable files from different sources: CNET, Softonic, SourceForge, Net EXE, and Net Windows. We collected a total of 14,194 benign .Net files. More precisely, we obtained the following files: 7,865 from SourceForge.net, 3,603 from CNET, 151 from Softonic, 185 from Windows 10 executables, 170 from Windows Server 2019 executables, 400 from PortableApps.com, and 1,820 from various programs installed on our machines. All benign files collected were scanned with Windows Defender (up to date on 18 February 2023) and none was flagged as malicious or suspicious. Our goal was to extract the relevant information from the CLR header into a data frame file that can be used to train machine learning algorithms.
1 A stager is a small piece of malware that downloads and executes a larger stage payload representing the rest of the malware. This keeps the stager lightweight, allowing it to bypass antiviruses [19].
With regard to the parser, we used Palo Alto Networks' open-source Python library dotnetfile [20], which was developed and used internally by the company and published in 2022. The library parses a .Net file and extracts not only the .Net CLR header information but also various metadata and strings. We used this library to build our parsing code in Python and used it to parse our dataset of files, generating a CSV file that can be consumed by machine learning models. Figure 4 shows the overall process of the .Net parser.
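The parsing loop can be sketched as follows. The `DotNetPE` class and the `get_runtime_target_version` and `get_stream_names` methods are taken from the dotnetfile project's documentation, and the two extracted fields are an illustrative subset only; treat the exact names as assumptions rather than an excerpt of our parser:

```python
import csv
from pathlib import Path

def extract_row(dotnet_pe) -> dict:
    """Flatten a parsed .Net file into one CSV-ready record (illustrative subset)."""
    streams = dotnet_pe.get_stream_names()
    return {
        "runtime_version": dotnet_pe.get_runtime_target_version(),
        # Join the metadata stream names into one cell, decoding bytes if needed.
        "stream_names": "|".join(
            s.decode(errors="replace") if isinstance(s, bytes) else s for s in streams
        ),
    }

def build_csv(sample_dir: str, out_path: str) -> None:
    """Parse every sample in a directory and write one CSV row per file."""
    from dotnetfile import DotNetPE  # third-party: pip install dotnetfile
    rows = []
    for path in Path(sample_dir).glob("*"):
        try:
            rows.append(extract_row(DotNetPE(str(path))))
        except Exception:
            continue  # skip corrupt or non-.Net samples
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["runtime_version", "stream_names"])
        writer.writeheader()
        writer.writerows(rows)
```

The per-file try/except mirrors the practical need to tolerate corrupt downloads when processing hundreds of thousands of samples.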
Initially, the dataset included 66 features, the target variable, and a total of 222,204 records. These 66 extracted features were configured in the parser. Out of the 66 features, we dropped 14 because they were constant across all records, and kept the remaining 52. We observed that multiple features extracted for each file were lists, such as a list of imports, a list of functions, or a list of namespaces. We had 18 features of this type, exhibiting a very high number of categories. Still, these features are important, as they hold valuable information about the behavior of the program. For example, the list of unmanaged functions can be a strong indicator of potential process hollowing/RunPE [21] in the file. Table 1 displays the main characteristics of the features collected, including the number of features of each type. We distinguish here between categorical features with low and high cardinality, as they require different preprocessing to be used by standard machine learning algorithms. We assigned the type low cardinality to categorical features with fewer than 100 distinct values. The high cardinality features were further divided into text and list-of-text types. The first corresponds to features represented by a single text value per sample, while the second represents features that are lists of text of variable size per sample. We made this dataset available to the research community to allow further research in this area. The dataset can be accessed at https://github.com/MagicianMido32/Detecting-Malicious-.NET-Files-Using-CLR-Header-Features-and-Machine-Learning.
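The pruning and typing steps described above can be sketched with pandas. The column names and thresholds below are hypothetical stand-ins, not the actual feature names of our dataset:

```python
import pandas as pd

def prune_and_type(df: pd.DataFrame, target: str = "label", low_card_max: int = 100):
    """Drop constant columns, then bucket the survivors by the typing of Table 1."""
    features = df.drop(columns=[target])
    # Stringify values so unhashable list cells can still be compared for constancy.
    constant = [c for c in features.columns
                if features[c].astype(str).nunique() <= 1]
    kept = features.drop(columns=constant)

    types = {}
    for col in kept.columns:
        if pd.api.types.is_bool_dtype(kept[col]):
            types[col] = "boolean"
        elif pd.api.types.is_numeric_dtype(kept[col]):
            types[col] = "numerical"
        elif kept[col].map(lambda v: isinstance(v, list)).all():
            types[col] = "list_text"        # variable-size list of strings per sample
        elif kept[col].nunique() < low_card_max:
            types[col] = "categorical_low"  # fewer than 100 distinct values
        else:
            types[col] = "text"             # one high-cardinality string per sample
    return kept, constant, types
```

Checking the boolean dtype before the numeric one matters, since pandas treats boolean columns as numeric as well.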

Feature set encoding and analysis
After extracting and pre-processing the dataset, we converted the features with a large number of categories into a form suitable for machine learning algorithms. Namely, we experimented with one-hot encoding and with word embeddings followed by a TF-IDF vectorizer. However, in both cases, the resulting dataset was unfeasible to use due to its high dimensionality. We therefore selected the Hash Encoder from the library Category Encoders [22]. This is not the ideal solution, because of collisions and potential information loss. However, we opted for it as the fastest and most convenient solution, given the size of the dataset, and one that avoids increasing the dimensionality of the dataset to an unmanageable size. We applied the hash encoder to the categorical features with high cardinality (both the text and list-of-text types listed in Table 1). The encoder was fit on the training data and then applied to the test data to avoid data leakage. Overall, we followed the following encoding procedure. For the low cardinality categorical features, we used a label encoder to map them to numerical representations without adding additional features. For the high cardinality categorical features (text and lists of text), we used the hash encoder to encode these 18 features into 7 new features; thus, the number of features after encoding is 41 (3 low cardinality categorical features encoded using the label encoder, 9 numerical features, 22 Boolean features, and 7 numerical features resulting from encoding the high cardinality features). To carry out feature selection, we applied a forward selection method over a logistic regression model, measuring the accuracy obtained over different feature combinations. We used a stratified shuffled split to divide the dataset into a train and a test set.
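To make the hashing trick concrete, the sketch below uses scikit-learn's FeatureHasher rather than the Category Encoders implementation we used; both map an unbounded vocabulary onto a fixed number of output columns. The import lists are invented examples, not values from our dataset:

```python
from sklearn.feature_extraction import FeatureHasher

# One list-of-text feature (e.g., imported function names); values are made up.
train_imports = [
    ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    ["MessageBoxW"],
]
test_imports = [["VirtualAlloc", "MessageBoxW"]]

# Project the unbounded vocabulary onto a fixed 7 columns, mirroring how the
# 18 high-cardinality features were compressed into 7 hashed features.
hasher = FeatureHasher(n_features=7, input_type="string")
X_train = hasher.transform(train_imports).toarray()
X_test = hasher.transform(test_imports).toarray()
```

With only 7 output columns, distinct strings can hash to the same column, which is exactly the collision-driven information loss noted above as a limitation of this encoding.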
We used stratification to ensure that the class distribution is kept, i.e., that the proportions of positive (malicious) and negative (benign) cases are maintained. We also applied random oversampling on the training set so that the selected features would not be biased towards one of the classes. We found that using 40 features gave the highest accuracy. The results obtained by the logistic regression model trained using this method are: Accuracy: 0.75906, Precision: 0.72982, Recall: 0.78929, F1-score: 0.75839. Figure 5 shows the relationship between accuracy and the number of features for the logistic regression model.
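The forward selection wrapper can be sketched with scikit-learn's SequentialFeatureSelector on synthetic data. Our actual run scored every candidate subset size and applied oversampling; here the subset size is fixed and oversampling is omitted for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded 41-feature dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)  # stratified split, as in the paper

# Greedy forward selection wrapped around a logistic regression scorer.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5, direction="forward", cv=3, scoring="accuracy")
selector.fit(X_train, y_train)
chosen = selector.get_support()  # boolean mask over the input features
```

Repeating this for each subset size and plotting the resulting accuracies yields a curve of the kind shown in Figure 5.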
We also applied an alternative feature selection method using Lasso regression, which yielded a total of eleven features. The metrics for the logistic regression model trained on this subset are Accuracy: 0.72839, Precision: 0.68750, Recall: 0.79395, and F1-score: 0.73690. These results are not better than those of the previous method in terms of the F1 score, which is our metric of choice. They are worse in particular in terms of precision, which means the model does worse at returning relevant results relative to irrelevant ones.
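The Lasso-based alternative can likewise be sketched with an L1 penalty and SelectFromModel, again on synthetic data (the penalty strength `C=0.1` is an illustrative value, not the one used in our experiments):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# L1 regularization drives uninformative coefficients to exactly zero,
# so the features with surviving coefficients form the selected subset.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
X_reduced = selector.transform(X)
```

Unlike forward selection, which refits the model once per candidate subset, this approach selects all features in a single fit, trading search thoroughness for speed.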

Experimental setup
For training the models, we decided not to apply oversampling and to use all the features of the original dataset. The main motivation for this is that the feature selection phase proved to be of no significant value for the accuracy of the models.
We selected the F1 score as the main performance assessment metric for the models; it is the harmonic mean of precision (which reflects the rate of false positives) and recall (which reflects the rate of false negatives). We also observed the recall in our comparison, to be sure that a model does not overly misclassify malicious samples as benign. Besides these two metrics, we computed the precision and the accuracy of the models. We used a stratified 5-fold cross-validation procedure to preserve the sample distribution of the original dataset and to obtain a better estimate of the results. We report the average of the 5-fold CV process for each metric.
We selected seven well-known learning algorithms for testing: Decision Tree, Logistic Regression, MLP, Naïve Bayes, Random Forest, SVM, and XGBoost. We trained these learners with their default parameters.
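Putting the protocol together, a minimal version of this benchmark could look as follows. Only two of the seven learners are shown, on synthetic imbalanced data rather than our dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data, echoing the skew of the real dataset.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),  # default parameters
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Stratified 5-fold CV keeps the class proportions in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "precision", "recall", "f1"])
    # Average each metric over the 5 folds, as reported in Table 2.
    results[name] = {m: scores[f"test_{m}"].mean()
                     for m in ("accuracy", "precision", "recall", "f1")}
```

Extending the `models` dictionary with the remaining five default-parameter learners reproduces the full comparison.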

Results
In this section, we present and analyze the results obtained with the learning algorithms and discuss this work's limitations and strengths.

Results and discussion
The average results obtained by the selected learners over the 5 folds are presented in Table 2. We observe that random forest achieves the highest performance, with an F1-score of 0.911, followed by the XGBoost model with 0.90. All the remaining learners have a much lower F1-score. These two algorithms are also the top performers (random forest being the best) on the accuracy and precision metrics. We note that for recall the SVM is the best, followed by Naïve Bayes. However, these two models achieve high recall at the cost of extremely low precision. Overall, we select the random forest as the best performing model and XGBoost as the second best.
Overall, the ensemble models tested have a clear advantage in this task when compared with the other models. This may be because ensemble models: (i) are more robust, as they reduce variance; and (ii) improve performance over any single contributing model. We also find it interesting that the MLP model does not perform as well, exhibiting more difficulty in this task. This may be due to difficulty in dealing with the number of features extracted.

Limitations and strengths
This work has multiple strengths. Namely, we propose a novel feature extraction method based on the CLR header as an alternative to using the PE header. We also provide the research community with the first dataset of benign and malicious .NET files. We explore seven machine learning models for this predictive task and built scrapers to obtain benign and malicious .Net executables, which we also provide to the research community.
In terms of limitations, our work has two key aspects to consider: the data and the machine learning algorithms. Regarding the dataset, we consider the use of the hash encoder a limitation of this work that should be addressed in the future. The heavy data imbalance between malicious and benign files is another aspect that should be taken into account. Finally, this task could be explored with a larger volume of data, opening the possibility of using deep learning models. Although we tested seven machine learning algorithms, more could be added, and deep learning models could also be explored. Moreover, extensive hyperparameter tuning could be carried out.

Conclusion
This paper presents a novel dataset for detecting malicious .Net files. We proposed a novel method for feature extraction using the CLR header of .Net files, showing that it is possible to use the information extracted from the CLR header to classify malicious and benign .Net files. We provide this dataset to the research community, as well as the code of the scrapers we used to obtain benign and malicious .Net executables. We tested several feature selection methods and evaluated seven machine learning models on the newly developed dataset. An accuracy of 92% was obtained when using the Random Forest model with the full set of features. The overall results are promising and show that the information collected from the CLR header is useful and should be employed to address this detection problem.
An interesting avenue for future research concerns dealing with the high dimensionality of the dataset. We applied hash encoding, which comes at the expense of some information loss, and we hypothesize that this impacted the performance of the machine learning models. A possible solution to this problem would be to use word embeddings and stacking. Another important aspect concerns the imbalance in the dataset. The new dataset we present is largely imbalanced, with malicious records outnumbering benign records by a ratio of nearly 15 to 1. This is because collecting malicious files was easier and faster due to the availability of various malware-collection sites such as MalwareBazaar and VirusShare, whereas collecting benign files required manually scraping, downloading, and checking each file. This is also an opportunity for future improvements.