An Ensemble-Based Malware Detection Model Using Minimum Feature Set
Abstract
Current commercial antivirus detection engines still rely on signature-based methods. However, with the rapid growth in the number of new malware samples, these methods are no longer adequate. In this paper, we introduce a malware detection model based on ensemble learning. The model is trained on a minimal set of significant features extracted from the file header. Evaluations show that the ensemble models slightly outperform individual classification models. Experimental evaluations show that our model can detect unseen malware with an accuracy of 0.998 and a false positive rate of 0.002. The paper also compares the performance of the proposed model with that of different machine learning techniques. We emphasize the use of machine learning based approaches as a replacement for conventional signature-based methods.
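To make the two ideas in the abstract concrete, the following is a minimal sketch of hard (majority) voting over several base classifiers, together with the accuracy and false positive rate metrics quoted above. It is an illustration only, not the authors' model: the header field names (`number_of_sections`, `size_of_code`, `dll_characteristics`), the threshold rules, and the toy samples are all hypothetical and were chosen purely so the example runs.

```python
from collections import Counter

# Hypothetical base classifiers: each maps a dict of header fields to
# 1 (malware) or 0 (benign). The thresholds are illustrative assumptions.
def rule_sections(sample):
    return 1 if sample["number_of_sections"] > 6 else 0

def rule_code_size(sample):
    return 1 if sample["size_of_code"] < 4096 else 0

def rule_dll_flags(sample):
    return 1 if sample["dll_characteristics"] == 0 else 0

def ensemble_predict(models, sample):
    """Hard voting: return the label predicted by most base models."""
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]

def accuracy_and_fpr(y_true, y_pred):
    """Accuracy = (TP + TN) / N; false positive rate = FP / (FP + TN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Toy, made-up samples: (header fields, true label).
dataset = [
    ({"number_of_sections": 8, "size_of_code": 1024, "dll_characteristics": 0}, 1),
    ({"number_of_sections": 4, "size_of_code": 65536, "dll_characteristics": 0x8140}, 0),
    ({"number_of_sections": 7, "size_of_code": 2048, "dll_characteristics": 0x8140}, 1),
    ({"number_of_sections": 5, "size_of_code": 2048, "dll_characteristics": 0}, 0),
    ({"number_of_sections": 3, "size_of_code": 32768, "dll_characteristics": 0}, 0),
    ({"number_of_sections": 9, "size_of_code": 8192, "dll_characteristics": 0x8140}, 1),
]

models = [rule_sections, rule_code_size, rule_dll_flags]
y_true = [label for _, label in dataset]
y_pred = [ensemble_predict(models, fields) for fields, _ in dataset]
accuracy, fpr = accuracy_and_fpr(y_true, y_pred)
print(f"accuracy={accuracy:.3f} false_positive_rate={fpr:.3f}")
```

An odd number of base models is used so that a binary vote cannot tie. In practice the hand-written rules would be replaced by trained classifiers, and the metrics would be computed on held-out samples the models have not seen.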
Copyright (c) 2019 MENDEL
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
MENDEL open access articles are normally published under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0) license: https://creativecommons.org/licenses/by-nc-sa/4.0/ . Under the CC BY-NC-SA 4.0 license, third-party reuse is permitted for non-commercial purposes only. Articles published under this license allow users to share, copy, and redistribute the material in any medium or format, and to adapt, remix, transform, and build upon the material, provided the use is non-commercial and any adaptations are distributed under the same license. Reuse under the CC BY-NC-SA 4.0 license requires appropriate attribution to the source of the material, a link to the license, and an indication of any changes made to the original material.