TABHATE: A Target-based Hate Speech Detection Dataset in Hindi

Sharma, Deepawali, Singh, Vivek Kumar and Gupta, Vedika (2023) TABHATE: A Target-based Hate Speech Detection Dataset in Hindi. [Working papers (or Preprints)] (Submitted)

Abstract

Over the years, social media has provided a medium for creating and disseminating opinions and thoughts through online platforms. While it allows users to express their views, sentiments and emotions, some people use it to generate and share unpleasant and hateful content. Such content is now referred to as hate speech, and it may target an individual, a group, a community, or a country. During the last few years, several techniques have been developed to automatically detect and identify hate speech and offensive or abusive content on social media platforms. However, the majority of studies have focused on hate speech detection in English-language texts. With social media achieving higher penetration across different geographies, a significant amount of content is now generated in various languages. Though there have been significant advancements in algorithmic approaches for the task, the non-availability of suitable datasets in other languages hinders research progress in those languages. Hindi is one such widely spoken language for which such datasets are not available. This work attempts to bridge this research gap by presenting a curated and annotated dataset for target-based hate speech (TABHATE) in the Hindi language. The dataset comprises 2,020 tweets and is annotated by three independent annotators. A multiclass labelling is used where each tweet is labelled as: (i) individual targeting, (ii) community targeting, or (iii) none. Inter-annotator agreement is computed. The suitability of the dataset is then further explored by applying some standard deep learning and transformer-based models to the task of hate speech detection. The experimental results show that the dataset can be used for experimental work on hate speech detection in Hindi-language texts.
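
The abstract reports that inter-annotator agreement is computed for three annotators and three labels, but it does not name the statistic. As an illustration only, the sketch below computes Fleiss' kappa, one common choice when a fixed number of annotators each assign one categorical label per item. The label strings, the function name `fleiss_kappa`, and the sample ratings are all hypothetical and are not taken from the paper or the dataset.

```python
from collections import Counter

# Hypothetical label names mirroring the three classes in the abstract.
LABELS = ("individual", "community", "none")


def fleiss_kappa(ratings, labels=LABELS):
    """Fleiss' kappa for items that each received the same number of ratings.

    `ratings` is a list of items; each item is a list holding one label per
    annotator (three annotators per tweet in the setting described above).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of annotators who gave item i the label labels[j]
    counts = [[Counter(item)[lab] for lab in labels] for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each label.
    total = n_items * n_raters
    proportions = [sum(row[j] for row in counts) / total
                   for j in range(len(labels))]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Every rating is the same label; kappa is undefined, report 1.0.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up ratings for four tweets, three annotators each (not real data).
sample_ratings = [
    ["individual", "individual", "community"],
    ["none", "none", "none"],
    ["community", "community", "community"],
    ["individual", "none", "none"],
]
sample_kappa = fleiss_kappa(sample_ratings)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement close to what chance alone would produce. Whether the authors used this statistic or another one (for example, pairwise Cohen's kappa or Krippendorff's alpha) should be checked against the full text.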

Item Type: Working papers (or Preprints)
Keywords: Hate Speech | Hate Speech Corpus | Hate Speech Dataset | Hindi language | Deep Learning
Subjects: Social Sciences and humanities > Arts and Humanities > Language and Linguistics
Physical, Life and Health Sciences > Computer Science
Social Sciences and humanities > Social Sciences > Social Sciences (General)
JGU School/Centre: Jindal Global Business School
Depositing User: Arjun Dinesh
Date Deposited: 25 Apr 2023 10:29
Last Modified: 25 Apr 2023 10:29
Official URL: https://doi.org/10.21203/rs.3.rs-2800717/v1
URI: https://pure.jgu.edu.in/id/eprint/5868
