Detecting Outliers Using Modified Recursive PCA Algorithm For Dynamic Streaming Data

  • Yasi Dani Industrial and Financial Mathematics Research Group, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Jl. Ganesha 10 Bandung, Indonesia
  • Agus Yodi Gunawan Industrial and Financial Mathematics Research Group, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Jl. Ganesha 10 Bandung, Indonesia
  • Masayu Leylia Khodra School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Jl. Ganesha 10 Bandung, Indonesia
  • Sapto Wahyu Indratno Department Statistics Research Division, Faculty of Mathematics and Natural Sciences, Institut Teknologi Bandung, Jl. Ganesha 10 Bandung, Indonesia
Keywords: Outlier, Online learning, Recursive PCA, Eigendecomposition, Perturbation method

Abstract

Outlier analysis has been widely studied and has produced many methods. However, methods for detecting outliers in dynamically streaming batch data (online learning) remain rare. In the present research, a novel online algorithm to detect outliers in such datasets is proposed. Data points are processed by applying a modified recursive PCA to sequentially predict the parameters of the model; the eigenvalues and eigenvectors of the statistical detection model are recursively updated using approximate values obtained by perturbation methods. More specifically, the recursive eigenstructure is derived from the covariance matrix using the first-order perturbation technique. The Mahalanobis distance is then used as an outlier score. The algorithm's performance is evaluated using several metrics, namely accuracy, precision, recall, F1-score, AUC-PR, and execution time. Results show that the proposed online outlier detection is computationally efficient in time and that its effectiveness is comparable to that of the offline outlier detection algorithm via classical PCA.
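
The idea of updating an eigendecomposition by first-order perturbation and then scoring points with the Mahalanobis distance can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the textbook first-order perturbation formulas for a symmetric matrix, assumes distinct eigenvalues and a small covariance change, and the function names `perturb_eigs` and `mahalanobis_sq` are hypothetical.

```python
import numpy as np


def perturb_eigs(eigvals, eigvecs, delta_cov):
    """First-order perturbation update of a symmetric eigendecomposition.

    Assumes distinct eigenvalues and a small symmetric change delta_cov.
    Eigenvectors are stored as the columns of eigvecs.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    # Express the covariance change in the current eigenbasis:
    # m[j, i] = v_j^T (delta_cov) v_i
    m = eigvecs.T @ delta_cov @ eigvecs
    # Each eigenvalue shifts by the matching diagonal entry.
    new_vals = eigvals + np.diag(m)
    # gaps[j, i] = lambda_i - lambda_j; the diagonal is set to infinity
    # so that the j == i terms drop out of the eigenvector correction.
    gaps = eigvals[None, :] - eigvals[:, None]
    np.fill_diagonal(gaps, np.inf)
    new_vecs = eigvecs + eigvecs @ (m / gaps)
    # Renormalize each updated eigenvector (column) to unit length.
    new_vecs = new_vecs / np.linalg.norm(new_vecs, axis=0)
    return new_vals, new_vecs


def mahalanobis_sq(x, mean, eigvals, eigvecs):
    """Squared Mahalanobis distance computed from the eigenstructure."""
    # Coordinates of the centered point(s) along each eigenvector.
    z = (x - mean) @ eigvecs
    return np.sum(z**2 / eigvals, axis=-1)


# Toy example: a diagonal covariance and a small symmetric change.
cov = np.diag([1.0, 2.0, 3.0])
vals, vecs = np.linalg.eigh(cov)
delta = 1e-3 * np.array([[0.5, 0.2, -0.1],
                         [0.2, -0.3, 0.4],
                         [-0.1, 0.4, 0.1]])
new_vals, new_vecs = perturb_eigs(vals, vecs, delta)

# Outlier score of a single point under the updated model (zero mean assumed).
score = mahalanobis_sq(np.array([1.0, 2.0, 3.0]), np.zeros(3), new_vals, new_vecs)
```

In a streaming setting, `delta_cov` would be the difference between the covariance estimate after and before a new batch arrives. The approximation degrades when that change is large or when eigenvalues are nearly equal, in which case a full decomposition (for example `np.linalg.eigh`) is the safer fallback.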

Published
2023-12-20
How to Cite
[1]
Dani, Y., Gunawan, A., Khodra, M. and Indratno, S. 2023. Detecting Outliers Using Modified Recursive PCA Algorithm For Dynamic Streaming Data. MENDEL. 29, 2 (Dec. 2023), 237-244. DOI:https://doi.org/10.13164/mendel.2023.2.237.
Section
Research articles