  • Evan John
  • 15 min read

Malware detection techniques, focusing on API call sequences and comparing various machine learning models

Abstract

This research compares the effectiveness of machine learning models that use API call sequence analysis to detect and categorize malware. The work employs an extensive set of API calls and applies feature engineering to uncover patterns in the data. The performances of two machine learning models, Logistic Regression and Random Forest, were evaluated for malware detection. The analysis shows that the Random Forest model achieves a higher accuracy of 94.7%, compared with 63.2% for Logistic Regression. The paper highlights the importance of API call correlations for capturing malware behavior and identifies ensemble methods as a promising way of increasing detection rates. These results contribute to the development of malware detection paradigms and point to avenues for future research in cybersecurity.

Introduction

As technology advances and threats become more diverse, safe and reliable malware detection has become a necessity. Malware, short for malicious software, comprises programs intended to exploit existing loopholes, spy on users and extract critical data, and hinder the proper functioning of systems. The ever-evolving and widespread nature of malware is a major threat to individuals, organizations, businesses, and countries, since it is nearly impossible to avoid entirely (Aboaoja et al., 2022). Given the dynamic and growing nature of cyber threats, where new forms of malware are increasingly sophisticated and complex, traditional detection strategies such as signature-based mechanisms are no longer sufficient, hence the need for newer approaches to this problem.

Effective malware detection is a critical issue in a world where computer systems and networks play a crucial role in business processes. With the rise of IoT devices, cloud computing, and mobile technologies, the attack surface has greatly increased, providing more opportunities for malware to spread. Furthermore, the financial motivation behind cybercrime has created a new generation of malware capable of evading traditional detection methods (Aslan & Samet, 2020). To address these impediments, the malware detection field has evolved considerably, employing machine learning, artificial intelligence, and data mining tools to improve detection rates and minimize false alarms.

The purpose of this research is to improve the understanding of detection approaches in the context of malware, focusing on API call sequences and machine learning models. This research aims to comparatively analyze typical supervised and unsupervised machine learning algorithms alongside two few-shot learning methods, namely Siamese networks and ProtoNet, in malware classification.

Related Work

Over the years, much progress has been made in malware detection, and researchers have developed many techniques in response to new tactics used by malware authors. Souri and Hosseini (2018) surveyed data mining techniques for malware detection and found that machine learning algorithms are efficient at detecting malicious programs. Their work emphasized feature selection and feature extraction methods to improve detection accuracy, which concurs with the API call sequences employed in the current work. Aslan and Samet (2020) categorized modern approaches to malicious program detection, including behavioral and heuristic ones. They pointed out that polymorphic and metamorphic malware are difficult to detect through static analysis and called for dynamic analysis, which reflects the behavior of a suspicious program at run time. This perspective supports the current study's approach of analyzing API call sequences to look for evidence of malicious activity.

Amro (2018) presented various techniques for malware detection tailored to mobile devices. The study pointed out the issues that arise in the mobile environment, for instance limited processing power and the real-time nature of the system. Although the current study targets malware detection more generally, Amro's work emphasizes the need to design efficient and flexible detection approaches for different platforms. Altogether, these studies depict the myriad directions under investigation in malware detection research. Although they cover a number of techniques, the current study aims to fill the remaining gap by comparing traditional machine learning models with few-shot learning in the analysis of API call sequences, which can provide fresh insight into the efficiency of these approaches in practical settings.

Dataset and Preprocessing

Dataset Description

The dataset used in this study is suited to malware analysis through the monitoring of API calls. It is structured around three main categories of information: API call counts (prefixed with ‘count_’), API call return values (prefixed with ‘return_’), and API call averages (prefixed with ‘avg_’). This multi-dimensional data is particularly useful in malware analysis, as it records not only how often APIs are called but also their outcomes and recurring patterns. Such comprehensive feature sets are important when building reliable malware detection models, as discussed by Ye et al. (2017) in their survey of data mining approaches to malware detection.

Preprocessing Steps

Data cleaning and preprocessing play a crucial role in ensuring the reliability and effectiveness of malware detection models (Souri & Hosseini, 2018). The preprocessing steps undertaken in this study include:

  • Step 1: Missing Value Handling: The dataset was examined for missing values using the colSums(is.na(data)) function. Rows containing missing data were subsequently removed using na.omit() to ensure data integrity.
  • Step 2: Data Type Conversion: The ‘status’ column, likely indicating whether a sample is benign or malicious, was converted to a factor type. This step is crucial for proper categorization in subsequent analysis stages.
  • Step 3: Numeric Conversion: All columns, except ‘status’, were converted to numeric format using mutate_if(is.character, as.numeric). This ensures consistency in data types across features, facilitating more accurate model training and analysis.
  • Step 4: Final Cleaning: A final pass of na.omit() was performed to remove any remaining NA values, ensuring a clean and consistent dataset for analysis.
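The study performs these steps in R (`colSums(is.na(data))`, `na.omit()`, `mutate_if`). As an illustrative sketch only, the same four-step pipeline can be mirrored in Python with pandas; the toy column names and values below are hypothetical, chosen to follow the dataset's `count_`/`return_`/`status` naming convention:

```python
import pandas as pd

# Toy stand-in for the dataset (hypothetical values, same naming convention)
data = pd.DataFrame({
    "count_NtClose":  [3, 5, None, 2],
    "return_NtClose": ["2", "4", "1", "2"],   # read in as strings
    "status": ["malware", "benign", "malware", "benign"],
})

# Step 1: inspect missing values (like colSums(is.na(data))), then drop them
print(data.isna().sum())
data = data.dropna()                           # like na.omit()

# Step 2: treat the class label as a categorical type (a factor in R)
data["status"] = data["status"].astype("category")

# Step 3: convert every non-label column to numeric
feature_cols = [c for c in data.columns if c != "status"]
data[feature_cols] = data[feature_cols].apply(pd.to_numeric, errors="coerce")

# Step 4: final pass to remove any rows that failed numeric conversion
data = data.dropna()
print(data.dtypes)
```

The `errors="coerce"` argument turns unparseable strings into NaN, so the final `dropna()` plays the same role as the second `na.omit()` pass in the R pipeline.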

Feature Engineering

API Success Ratio: A Novel Approach

In the domain of malware detection, feature engineering plays a large part in enhancing the performance of a machine learning model. Based on the research by Ye et al. (2017), filtering and choosing features from the API calls can boost the accuracy of a malware detection system. In this study, a novel feature was engineered: the API success ratio, which summarizes how successfully a program's API calls complete.

Significance of the API Success Ratio

This ratio is defined as the total of the return values (representing successful API calls) divided by the total number of API calls. The API success ratio is important in depicting the behavior of software and can point to security issues in a program or to virus-like activity. Malware cannot be expected to adhere to standard patterns of API utilization, failure incidence, or success rate (Aslan & Samet, 2020). By including this ratio, a broader picture of the effectiveness of the API calls is obtained, rather than looking only at the frequencies or return values of the calls.

Implementation and Implications

The feature was implemented using the data manipulation facilities of the R language, traversing the data frames with ‘dplyr’. This approach is in concordance with the suggestion by Souri and Hosseini (2018) to incorporate data mining techniques in malware detection. Including this engineered feature makes the dataset more descriptive and, consequently, supports the development of more powerful and precise malware detection models.
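While the study computes this feature with dplyr in R, the definition above can be sketched in Python as follows; the API names and counts here are hypothetical, used only to show the ratio of total successful returns to total calls per sample:

```python
import pandas as pd

# Toy frame in the dataset's convention: count_* = times an API was called,
# return_* = successful calls for that API (hypothetical values)
df = pd.DataFrame({
    "count_NtOpenProcess":  [10, 4],
    "return_NtOpenProcess": [ 8, 1],
    "count_NtCreateFile":   [ 5, 6],
    "return_NtCreateFile":  [ 5, 3],
})

count_cols  = [c for c in df.columns if c.startswith("count_")]
return_cols = [c for c in df.columns if c.startswith("return_")]

# API success ratio: total successful calls / total calls, per sample (row)
df["api_success_ratio"] = df[return_cols].sum(axis=1) / df[count_cols].sum(axis=1)
print(df["api_success_ratio"])  # row 0: 13/15 ≈ 0.867; row 1: 4/10 = 0.4
```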

Results and Discussion

Malware Detection Techniques: Analysis of API Call Frequencies

Frequency Analysis of API Calls

The frequency with which API calls occur allows the recognition and analysis of likely malware conduct within a device. As pointed out by Ye et al. (2017), the frequency and intensity of API calls are strong indicators of threats. In this study, the most frequently occurring API calls were identified in order to discover possible patterns related to malware behavior, as shown in Figure 1. The results reveal that the most used API is ‘GlobalMemoryStatusEx’, which was called 1,469,764 times. It is used far more frequently than other APIs to obtain information about the system's current memory utilization; this may indicate intensive memory operations, which could be linked to malware activity, for instance memory scanning (Aslan et al., 2019). The next most frequently called API after ‘GlobalMemoryStatusEx’ is ‘NtClose’, with 349,252 calls, followed by ‘NtOpenProcess’ with 198,553 calls. A high count of ‘NtOpenProcess’ means that the program tries to open other processes, which can be characteristic of malware that injects code or gathers information about the system.
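The frequency ranking behind Figure 1 amounts to summing each `count_` column over all samples and sorting. A minimal sketch with hypothetical counts (not the study's figures):

```python
import pandas as pd

# Hypothetical per-sample API call counts in the dataset's 'count_' convention
df = pd.DataFrame({
    "count_GlobalMemoryStatusEx": [900, 400],
    "count_NtClose":              [200, 150],
    "count_NtOpenProcess":        [100,  80],
})

# Sum each API's calls over all samples and rank in descending order
totals = df.filter(like="count_").sum().sort_values(ascending=False)
print(totals)  # GlobalMemoryStatusEx tops this toy ranking, as in Figure 1
```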

Figure 1: Frequency Analysis of Top API Calls (Source, Own Work)

API Call Sequence Analysis

Figure 2: Correlation Matrix Interpretation (Source, Own Work)

The heatmap of the correlation matrix in Figure 2 gives an appreciation of the level of dependency between different API calls, which, as noted by Souri and Hosseini (2018), is central when analyzing malware activity. The heatmap shows how pairs of API calls are related; the color intensity of a cell indicates the strength of the correlation between two calls. Notably, the heatmap contains some rather large groups of closely interconnected API calls. ‘CreateServiceA’, ‘FindResourceExA’, and ‘Process32NextW’ stand out as functions with high correlation coefficients. This grouping suggests malware that creates services, accesses resources, and enumerates processes in concert. Such parallel activities can occur, for example, when malware seeks to persist and gather information about the system, as noted by Aslan and Samet (2020). Another key relationship is observed between ‘NtCreateSection’, ‘Thread32Next’, and ‘Process32NextW’. This relationship suggests a sequence of memory-manipulation operations together with the enumeration of processes and threads, which is consistent with the behavior of code injection and system spying by malware (Ye et al., 2017).
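A correlation matrix like the one in Figure 2 is simply the pairwise Pearson correlation of the API-count columns. The sketch below uses synthetic data in which three APIs deliberately move together, mimicking the correlated cluster described above (all values are illustrative, not from the study's dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.poisson(5, 200)  # shared driver for the correlated cluster

# Synthetic counts: the service/resource/process APIs share 'base', so they
# correlate strongly; GetTickCount is independent noise for contrast
df = pd.DataFrame({
    "CreateServiceA":  base + rng.integers(0, 2, 200),
    "FindResourceExA": base + rng.integers(0, 2, 200),
    "Process32NextW":  base + rng.integers(0, 2, 200),
    "GetTickCount":    rng.poisson(5, 200),
})

corr = df.corr()  # Pearson correlation matrix behind a heatmap like Figure 2
print(corr.round(2))
```

Plotting `corr` with any heatmap routine reproduces the visual pattern: the three clustered APIs form a bright block, while the unrelated call stays near zero.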

Comparative Analysis of Machine Learning Models

Figure 3: Model Accuracy Comparison (Source, Own Work)

The bar plot of model accuracies gives a brief but adequate impression of the competence of the two machine learning models, Logistic Regression and Random Forest. Figure 3 shows that the Random Forest model is more accurate than the Logistic Regression model in detecting the presence of malware. The Random Forest model achieved the best test-set accuracy of 0.947 (94.7%), compared with 0.632 (63.2%) for the Logistic Regression model. This difference in accuracy is consistent with the work of Ye et al. (2017), who indicated that Random Forest ensembles are useful in malware detection.
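The comparison in Figure 3 can be sketched with scikit-learn on synthetic data; this is an illustrative stand-in for the study's API-call feature matrix, and the accuracies it produces are not the 94.7%/63.2% figures reported above, which come from the study's own dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the API-call features (not the study's data)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    model.fit(X_tr, y_tr)                        # train on the same split
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```

The two accuracies in `results` are what a bar plot like Figure 3 would display side by side.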

Figure 4: Confusion Matrix Analysis (Source, Own Work)

The confusion matrices of both models, shown in Figure 4, present the models' classification behavior in more detail than accuracy alone. The confusion matrix of the Logistic Regression model shows a relatively high misclassification rate and many false negatives, which is problematic when identifying malware. By contrast, the confusion matrix of the Random Forest model shows considerably fewer misclassifications overall. The high accuracy of the Random Forest classifier thus aligns with the conclusion of Aslan and Samet (2020) that ensemble methods are efficient at handling the complicated patterns intrinsic to malware.

Implications for Malware Detection

Specifically, the listed APIs, including ‘NtWriteFile’, ‘GetFileAttributesW’, and ‘NtCreateFile’, correlate with the study of Aboaoja et al. (2022), which noted the significance of monitoring methods that interact with the file system when detecting malware. The high frequency of these API calls potentially indicates file manipulation operations, which may include data leakage and other system changes. Hence, these studies show the need to incorporate behavioral analysis of the program into malware detection, with an emphasis on API calls. In particular, API calls concerning memory and processes should be given more focus, as their high frequency suggests potentially dangerous operations that can be identified through anomaly detection using API call frequency as a reference.

The patterns identified and analyzed in this study have a profound influence on approaches to malware detection. Because these API calls frequently interact with one another, they could serve as candidate predictor variables that signal an attack. For instance, the combined use of file manipulation APIs (for example, ‘SetFileTime’) and process manipulation APIs (for example, ‘NtOpenProcess’) may indicate malware. The findings therefore highlight the need to consider sequential data about API calls, not single calls in isolation, to identify malware. It is through such related patterns that highly advanced forms of malware, which usual signature-based systems struggle to detect, can be identified.

The difference in performance between the two models has a ripple effect on the techniques that can be applied in the fight against malware. Conventional recovery of features from API call data yields lower accuracy than the Random Forest approach, which is better placed to identify features that aid in the identification of malware. The significance of this finding lies in the fact that, according to Aboaoja et al. (2022), modern anti-malware applications should be able to classify samples accurately by behavioral characteristics. The results obtained here support the generalization that models with higher complexity, such as Random Forest, are more effective in practice than low-complexity models such as Logistic Regression when it comes to malware detection.

Challenges and Limitations

Dataset Relevance and Evolution

One of the primary challenges encountered in this study pertains to the relevance and timeliness of the dataset used for malware detection. Allix et al. (2015) have noted that, owing to the emergence of new malware types, training data can become obsolete in a short time. The constantly renewed mode of malware development also complicates the process of acquiring an up-to-date and relevant dataset for training machine learning algorithms.

Model Generalization and Overfitting

Another limitation relates to the generalizability of the developed models. The Random Forest model exhibited high accuracy during training; however, overfitting is quite probable given the range and size of the data featured in the current study. As pointed out by Aslan and Samet (2020), this is a typical issue in machine-learning-based malware detection, particularly when identifying new samples of malware.
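One practical way to probe this overfitting risk is to compare training accuracy with cross-validated accuracy: a large gap is a warning sign. A minimal sketch on synthetic data (an illustrative check, not the study's evaluation protocol):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features; only 6 of 30 features carry signal
X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           random_state=0)
rf = RandomForestClassifier(random_state=0)

# Held-out performance via 5-fold cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5)

# Training performance on the full dataset (forests tend to memorize it)
rf.fit(X, y)
train_acc = rf.score(X, y)

print(f"training accuracy:         {train_acc:.3f}")
print(f"5-fold CV accuracy (mean): {cv_scores.mean():.3f}")
```

The training score lands near 1.0 while the cross-validated score is lower; the size of that gap, rather than the training score alone, is what indicates how much the model has overfit.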

Scalability and Real-Time Detection

Even though the paper is informative in identifying API call sequences, several issues arise concerning scalability and real-time detection. Souri and Hosseini (2018) noted that in environments where data are processed in real time, it can be challenging to sustain a high rate of I/O operations. This reduces the feasibility of employing the suggested methods for real-time malware detection in high-traffic settings.

Conclusion

This study has demonstrated that analyzing API call sequences is an effective approach for detecting various types of malware. The most frequently observed API calls, especially those interfacing with memory and processes, point to behaviors worth monitoring closely. Furthermore, heuristic analysis and correlation computation over API sequences reveal more of the characteristics of malware than individual calls do. The results confirm the efficiency of the chosen Random Forest algorithm and the relevance of identifying API call sequences and investigating their relations for enhancing malware detection performance. Future work should extend these results to real-time detection and combine API call analysis with other behavioral measurements. Moreover, research into other advanced techniques, such as deep learning models, might lead to further improvements in malware detection.
References

Aboaoja, F. A., Zainal, A., Ghaleb, F. A., Al-Rimy, B. A. S., Eisa, T. A. E., & Elnour, A. A. H. (2022). Malware detection issues, challenges, and future directions: A survey. Applied Sciences, 12(17), 8482.

Allix, K., Bissyandé, T. F., Klein, J., & Le Traon, Y. (2015, March). Are your training datasets yet relevant? An investigation into the importance of timeline in machine learning-based malware detection. In International Symposium on Engineering Secure Software and Systems (pp. 51-67). Cham: Springer International Publishing.

Amro, B. (2018). Malware detection techniques for mobile devices. arXiv preprint arXiv:1801.02837.

Aslan, Ö. A., & Samet, R. (2020). A comprehensive review of malware detection approaches. IEEE Access, 8, 6249-6271.

He, K., & Kim, D. S. (2019, August). Malware detection with malware images using deep learning techniques. In 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE) (pp. 95-102). IEEE.

Mao, W., Cai, Z., Towsley, D., Feng, Q., & Guan, X. (2017). Security importance assessment for system objects and malware detection. Computers & Security, 68, 47-68.

Souri, A., & Hosseini, R. (2018). A state-of-the-art survey of malware detection approaches using data mining techniques. Human-centric Computing and Information Sciences, 8(1), 1-22.

Tahir, R. (2018). A study on malware and malware detection techniques. International Journal of Education and Management Engineering, 8(2), 20.

Ye, Y., Li, T., Adjeroh, D., & Iyengar, S. S. (2017). A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR), 50(3), 1-40.

 
