A transformer-based approach for abuse detection in code-mixed indic languages.

Bansal, Vibhuti, Tyagi, Mirnal, Sharma, Rajesh, Gupta, Vedika and Xin, Qin (2022) A transformer-based approach for abuse detection in code-mixed indic languages. ACM Transactions on Asian and Low-Resource Language Information Processing. ISSN 2375-4702 (In Press)

Abstract

The growing number of online social media platforms has drawn active participation from web users globally. This has also led to a subsequent increase in cyberbullying cases online. Such incidents diminish an individual’s reputation or defame a community, and also pose a threat to the privacy of users in cyberspace. Traditionally, manual checks and handling mechanisms have been used to deal with such textual content. However, an automatic computer-based approach would provide far better solutions to this problem. Existing approaches to automating this task mostly involve classical machine learning models, which tend to perform poorly on low-resource languages. Owing to the varied backgrounds and languages of web users, cyberspace contains a great deal of multilingual text. An integrated approach that accommodates multilingual text could be an appropriate solution. This paper explores various methods to detect abusive content in 13 Indic code-mixed languages. Firstly, baseline classical machine learning models are compared with Transformer-based architectures. Secondly, the paper presents an experimental analysis of four state-of-the-art transformer-based models, namely XLM-RoBERTa, indic-BERT, MurilBert and mBERT, of which XLM-RoBERTa with BiGRU performs best. Thirdly, the best-performing model, XLM-RoBERTa, is fed with emoji embeddings, which further improves its overall performance. Finally, the model is trained on the combined dataset of 13 Indic languages to compare its performance with that of the individual language models. The combined model surpassed the individual models in terms of F1 score and accuracy, suggesting that the combined model fits the data better, possibly due to its code-mixed nature. This model reports an F1 score of 0.88 on test data, with a training loss of 0.28, a validation loss of 0.31 and an AUC score of 0.94 for both training and validation.

Item Type: Article
Keywords: Abuse Detection | Transformer Based Model | Online Social Media | Machine Learning
Subjects: Physical, Life and Health Sciences > Computer Science
JGU School/Centre: Jindal Global Business School
Depositing User: Amees Mohammad
Date Deposited: 11 Jan 2023 09:32
Last Modified: 11 Jan 2023 09:32
Official URL: https://doi.org/10.1145/3571818
URI: https://pure.jgu.edu.in/id/eprint/5423
