Detection of Body Shaming in Social Media: A Comparative Study of Traditional Machine Learning and Transformer-based Models
Abstract
With social media becoming a significant part of daily life for people of all ages, users are increasingly exposed to harmful content that can negatively affect their well-being. Body shaming is one such harm, and younger people are particularly affected by it.
This study investigates and compares the performance of traditional machine learning models and state-of-the-art transformer-based pre-trained language models in identifying and classifying body shaming content in textual social media data. Two datasets were used: a 4k dataset consolidated from existing sources, and a larger 6k dataset extended with novel data collected from TikTok and X. The traditional models evaluated include Support Vector Machines (SVM), Logistic Regression, Naive Bayes, Random Forests, XGBoost, and AdaBoost, while the pre-trained language models explored were BERT, RoBERTa, and XLNet. Models were trained and evaluated on both datasets, with performance assessed using precision, recall, F1-score, and the Matthews Correlation Coefficient (MCC).
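The four evaluation metrics named above are standard in scikit-learn. As a minimal illustrative sketch (the label vectors below are toy values, not the study's data), they can be computed as follows:

```python
# Illustrative only: computing precision, recall, F1-score, and MCC
# with scikit-learn on toy binary labels (1 = body shaming, 0 = not).
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print(f"precision: {precision_score(y_true, y_pred):.4f}")  # 0.7500
print(f"recall:    {recall_score(y_true, y_pred):.4f}")     # 0.7500
print(f"f1:        {f1_score(y_true, y_pred):.4f}")         # 0.7500
print(f"mcc:       {matthews_corrcoef(y_true, y_pred):.4f}")  # 0.5000
```

MCC is reported alongside F1 because it accounts for all four confusion-matrix cells, making it more informative on imbalanced datasets such as the 6k collection described here.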
Results demonstrated the superior performance of pre-trained language models over traditional models, particularly on the larger 6k dataset. BERT achieved the highest F1-score (0.8062) and MCC (0.7631) on the 6k dataset, while RoBERTa performed best on the 4k dataset with an F1-score of 0.8494 and an MCC of 0.8048. Among traditional models, SVM and Random Forests performed well, with Random Forests achieving the highest F1-score (0.7095) and MCC (0.6644) on the 6k dataset. This study highlights the effectiveness of pre-trained language models in capturing complex linguistic patterns and semantics, enabling better generalization to larger and more imbalanced datasets. However, traditional models such as SVM and Random Forests remain viable alternatives, particularly in resource-constrained environments or when dealing with smaller datasets.