arxiv:2208.10684

K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment

Published on Aug 23, 2022

Abstract

K-MHaS, a multi-label dataset for Korean hate speech detection, is evaluated using Korean-BERT-based models, with KR-BERT and sub-character tokenization showing superior performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Online hate speech detection has become an important issue as online content grows, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments, each annotated with one to four labels, and handles subjectivity and intersectionality. We run strong baseline experiments on K-MHaS using Korean-BERT-based language models and evaluate them with six different metrics. KR-BERT with a sub-character tokenizer outperforms the others, recognizing decomposed characters in each hate speech class.
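The two ideas in the abstract, sub-character input and multi-label targets, can be illustrated with a minimal sketch. It is not the KR-BERT tokenizer and not the authors' preprocessing: it only assumes that "sub-character" means splitting each Hangul syllable block into its component jamo, which Unicode NFD normalization does, and that an utterance's one to four labels are encoded as a multi-hot vector. The label names in `LABELS` and the function names are hypothetical placeholders, not the dataset's actual label set.

```python
import unicodedata

# Hypothetical label inventory for illustration only; the real K-MHaS
# label names are not given in this abstract.
LABELS = ["label_a", "label_b", "label_c", "label_d", "label_e"]


def to_subcharacters(text: str) -> list[str]:
    """Split Hangul syllable blocks into conjoining jamo using Unicode NFD.

    Illustrative stand-in for sub-character decomposition; a real
    tokenizer would then apply a learned subword vocabulary on top.
    """
    return list(unicodedata.normalize("NFD", text))


def multi_hot(active: set[str], labels: list[str] = LABELS) -> list[int]:
    """Encode a set of one to four active labels as a 0/1 vector."""
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not 1 <= len(active) <= 4:
        raise ValueError("expected between 1 and 4 labels per utterance")
    return [1 if name in active else 0 for name in labels]


if __name__ == "__main__":
    # Each syllable block below has three jamo, so two blocks give six.
    print(len(to_subcharacters("한글")))
    print(multi_hot({"label_a", "label_c"}))
```

A multi-hot target of this shape is typically paired with a per-label sigmoid output rather than a single softmax, since several labels can be active for the same comment.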
