Predicting the Spread of Malware Outbreaks Using Autoencoder Based Neural Networks

Bhardwaj Gopika; Yadav Rashi

doi:10.13164/mendel.2019.1.157

Bhardwaj Gopika, Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India
Yadav Rashi, Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India

Keywords: malware outbreaks, topic modeling, similarity analysis, auto encoders, prediction

Abstract

Malware outbreaks are pervasive in today's digital world. However, the general public lacks awareness of how to safeguard against such attacks, and national and international research and governmental organizations need to cooperate more closely to combat the threat. On the positive side, cyber security websites, blogs and newsletters post articles that describe how a malware outbreak works and spreads, along with steps to recover from it. This project presents an approach to predicting the spread of malware outbreaks. Its scope is 15 malware outbreaks. The approach collects these cyber-awareness articles from the web and assigns them to the 15 outbreaks using topic modeling and similarity analysis. The assigned articles, together with spread information for each outbreak, are input to an autoencoder neural network that learns latent space representations, which are then used to classify each outbreak as high spread or low spread, achieving a prediction accuracy of 75.56. This work can be used to process large amounts of cyber-awareness content for effective and accurate prediction at a time when cyber security is much needed.
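
The article-assignment step described above can be illustrated with a minimal sketch. The abstract does not state which similarity measure or topic model the authors used, so this example substitutes plain cosine similarity over word counts as a stand-in. The outbreak names, profile texts, and function names are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def assign_outbreak(article, outbreak_profiles):
    """Return (name, score) of the outbreak profile most similar to the article."""
    vec = Counter(tokenize(article))
    scores = {
        name: cosine(vec, Counter(tokenize(profile)))
        for name, profile in outbreak_profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical outbreak profiles, for illustration only (assumption:
# each outbreak is summarized by a short bag of characteristic terms).
OUTBREAK_PROFILES = {
    "OutbreakA": "ransomware worm encrypts files exploit ransom payment",
    "OutbreakB": "botnet iot devices default passwords ddos attack",
}

article = "The botnet hijacked IoT devices with default passwords to launch a DDoS attack."
label, score = assign_outbreak(article, OUTBREAK_PROFILES)
print(label, round(score, 3))
```

In a fuller pipeline, the per-outbreak article features and the spread information would then be passed to an autoencoder, and the learned latent representation would feed the high/low spread classifier. That stage is omitted here because the abstract gives no architectural details.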
