The Classification of Documents in Malay and Indonesian Using the Naive Bayesian Method Uses Words and Phrases as a Training Set
Abstract
Malay and Indonesian are two closely related languages that share much of their word meanings and grammar. Because the two languages are so similar, classifying them automatically is a challenge. The Naive Bayesian method is widely used for classification today, but it needs to be implemented in a particular way to raise classification accuracy. In this study, a new approach was used: the training set consists of words and phrases rather than words alone. With this approach, the classification accuracy for the two languages increased.
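The idea of training on words and phrases can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes a multinomial Naive Bayes model with add-one (Laplace) smoothing, treats a "phrase" as a pair of adjacent words, and uses four invented toy sentences as training data. The names `features`, `NaiveBayes`, and `TRAIN` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter, defaultdict


def features(text, use_phrases=True):
    """Return word tokens, plus adjacent-word pairs when use_phrases is True."""
    words = text.lower().split()
    feats = list(words)
    if use_phrases:
        # Assumption: a "phrase" is modeled as two adjacent words.
        feats += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self, use_phrases=True):
        self.use_phrases = use_phrases
        self.class_docs = Counter()                 # documents seen per class
        self.feature_counts = defaultdict(Counter)  # feature counts per class
        self.vocab = set()                          # all features seen in training

    def fit(self, docs):
        for text, label in docs:
            self.class_docs[label] += 1
            for f in features(text, self.use_phrases):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in features(text, self.use_phrases):
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
TRAIN = [
    ("saya hendak pergi ke pejabat pos", "malay"),
    ("kereta itu rosak di lebuh raya", "malay"),
    ("saya mau pergi ke kantor pos", "indonesian"),
    ("mobil itu rusak di jalan tol", "indonesian"),
]

model = NaiveBayes(use_phrases=True).fit(TRAIN)
print(model.predict("dia bekerja di pejabat pos"))
```

In this sketch, a phrase such as "pejabat pos" contributes its own evidence in addition to the evidence from its individual words. Setting `use_phrases=False` gives a words-only baseline for comparison.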
Copyright (c) 2020 MENDEL
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
MENDEL open access articles are normally published under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 4.0) license, https://creativecommons.org/licenses/by-nc-sa/4.0/ . Under the CC BY-NC-SA 4.0 license, third-party reuse is permitted only for non-commercial purposes. Articles published under this license allow users to share, copy, and redistribute the material in any medium or format, and to adapt, remix, transform, and build upon the material, for non-commercial purposes. Reuse under the CC BY-NC-SA 4.0 license requires appropriate attribution to the source of the material, a link to the license, and an indication of any changes made to the original material.