Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

Otman Moussaoui; Yacine El Younnoussi

doi:10.13164/mendel.2023.1.055

Otman Moussaoui: SIGL Laboratory, ENSA Tetuan, Abdelmalek Essaadi University (UAE), Morocco
Yacine El Younnoussi: Information System and Software Engineering, National School of Applied Sciences, Abdelmalek Essaadi University, Morocco

Keywords: Moroccan Dialect, BERT, RoBERTa, Natural Language Processing, Pre-trained, Machine Learning

Abstract

This research article presents a comprehensive study on the pre-training of two language models, MorRoBERTa and MorrBERT, for the Moroccan Dialect, using the Masked Language Modeling (MLM) pre-training approach. The study details the various data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 billion tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. The pre-training process was carried out using the HuggingFace Transformers API, and the paper elaborates on the configurations and training methodologies of the models. The study concludes by demonstrating the high accuracy rates achieved by both MorRoBERTa and MorrBERT in multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect.
