
Hate Speech Classifier

A deep learning hate speech detector built on RoBERTa — and a first-hand account of the biggest trap in machine learning: why a model with 95% accuracy can be completely, dangerously broken.

View Project
ML Engineer
March 2026
ML / NLP
Hate Speech Classifier overview
95%
Why Accuracy Lies
F1
Metric That Actually Works
RoBERTa
Social-Media-Native Model
Live
Published Write-up

The Challenge

Regex rules and banned word lists are trivially bypassed — trolls misspell deliberately, use sarcasm, replace letters with numbers. You need semantic understanding to catch this. But hate speech datasets are wildly imbalanced: typically 95% safe text, 5% actual hate speech. Feed that into a standard training loop and the model gets lazy — it learns that blindly predicting 'Safe' every single time scores 95% accuracy. The model looks great on paper while letting 100% of hate speech slip through.
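The trap is easy to demonstrate in a few lines of plain Python (the 95/5 split below is illustrative):

```python
# 1,000 comments with the typical 95/5 imbalance: 0 = safe, 1 = hate speech.
labels = [0] * 950 + [1] * 50

# The "lazy" model: predict Safe for every single input.
preds = [0] * 1000

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = true_pos / sum(labels)

print(f"Accuracy: {accuracy:.0%}")           # Accuracy: 95%
print(f"Hate speech recall: {recall:.0%}")   # Hate speech recall: 0%
```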

The Solution

Replaced standard accuracy with F1-score as the north star metric, and fixed the training loop with class weighting inside the PyTorch loss function — mathematically rigging the penalties so a missed hate speech comment costs the model far more than a misclassified safe one. Combined with RoBERTa (pre-trained on raw social media data, not formal text), a custom preprocessing pipeline that translates emojis into text tags, and a head-swapped classification layer attached to the [CLS] token, the result is a classifier that actually does its job.
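A minimal sketch of the class-weighting idea, using PyTorch's `BCEWithLogitsLoss` and its `pos_weight` argument (the 19x weight below is just the 95/5 ratio, not necessarily the project's exact value):

```python
import torch
import torch.nn as nn

# Weight the positive (hate speech) class by the negative/positive ratio,
# so missing a hate comment costs ~19x more than flagging a safe one.
pos_weight = torch.tensor([950 / 50])  # 19.0 for a 95/5 split
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

confident_safe = torch.tensor([-2.0])  # logit leaning hard toward "Safe"

# The same mistake, mirrored: missing hate speech vs. flagging safe text.
miss_loss = loss_fn(confident_safe, torch.tensor([1.0]))  # missed hate speech
fp_loss = loss_fn(-confident_safe, torch.tensor([0.0]))   # false positive

print(miss_loss.item() / fp_loss.item())  # the miss is ~19x costlier
```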

What I Built

01

Phase 1 — Selected RoBERTa over SciBERT after identifying domain mismatch: biomedical pre-training is useless for social media slang and emoji context

02

Phase 2 — Built a preprocessing pipeline for noisy social media text: normalizing misspelled slang and converting emojis into text tags so embeddings could read emotional signal

03

Phase 3 — Performed a PyTorch head swap: removed RoBERTa's default pre-training head and attached a custom Linear Classification Layer connected to the [CLS] token, outputting a Safe/Hate probability via Sigmoid

04

Phase 4 — Identified and fixed the imbalance trap: implemented class weighting inside the PyTorch loss function to apply massive penalties for missed hate speech vs. minor penalties for false positives

05

Phase 5 — Replaced accuracy with Precision, Recall, and F1-score as the primary evaluation metrics, demonstrating why a 95% accurate model can be completely broken in practice

06

Documented the full build and the accuracy trap in a technical write-up on Medium
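The Phase 3 head swap can be sketched roughly like this (class and variable names are my own; `RobertaModel` is Hugging Face's bare encoder, which ships without the pre-training head):

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class HateSpeechHead(nn.Module):
    """Bare RoBERTa encoder plus a fresh linear layer on the [CLS] position."""

    def __init__(self, encoder: RobertaModel):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0, :]  # [CLS]: summary of the whole sequence
        # Sigmoid squashes the single logit into P(hate speech).
        return torch.sigmoid(self.classifier(cls_vec)).squeeze(-1)

# model = HateSpeechHead(RobertaModel.from_pretrained("roberta-base"))
```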

The Story

The project started right after the clinical summarizer — and the first decision was which model to use. SciBERT was still fresh in my head, but I immediately ruled it out: a model pre-trained on PubMed papers has no idea what internet slang or emojis mean. You can't teach a medical student to moderate a Twitch chat. RoBERTa was the right call — pre-trained on massive corpora of raw, unfiltered social media. It already understood the cadence of online arguments.

Preprocessing was messier than expected. Medical reports have perfect grammar; social media is chaos. I had to normalize deliberately misspelled slang and, crucially, translate emojis. Emojis carry massive toxic context — a tokenizer script converted symbols like 🤬 into text tags (e.g., [angry_face]) so the model's embeddings could actually read the emotional signal.

The architecture required a head swap. Pre-trained transformers are designed to predict missing words — I wanted a bouncer, not a word guesser. I chopped off RoBERTa's default pre-training head and attached a new Linear Classification Layer connected specifically to the [CLS] token — the compressed mathematical summary of the entire input sequence. A Sigmoid function on the output gives a probability: Safe or Hate Speech.

Then came the accuracy trap. Training on the raw imbalanced dataset produced a model sitting at 95% accuracy that predicted 'Safe' for literally everything. The fix was class weighting in the PyTorch loss function: minor penalty for misclassifying a safe comment, massive penalty for letting hate speech through. I forced the model to mathematically care about the minority class.

Evaluation switched entirely to Recall, Precision, and F1-score — the only metrics that actually measure whether a content moderation model works.
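The emoji translation step can be sketched as a simple substitution pass (the mapping below is a tiny illustrative sample, not the project's full table; a library like `emoji` with its `demojize` function does this exhaustively):

```python
# Illustrative emoji-to-tag mapping so embeddings can read emotional signal.
EMOJI_TAGS = {
    "🤬": "[angry_face]",
    "😂": "[laughing_face]",
    "🙄": "[eye_roll]",
}

def tag_emojis(text: str) -> str:
    for symbol, tag in EMOJI_TAGS.items():
        text = text.replace(symbol, f" {tag} ")
    return " ".join(text.split())  # collapse extra whitespace

print(tag_emojis("nice take 🙄🤬"))
# -> "nice take [eye_roll] [angry_face]"
```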

What I Learned

Accuracy is a vanity metric on imbalanced datasets — it will lie to your face and look convincing. The real engineering is in the loss function: class weighting is how you encode your priorities into the math. Emojis are not decoration — they are semantic signal and need to be treated as first-class input. And model selection matters before anything else: domain mismatch between pre-training data and your task (medical vs. social media) will cap your ceiling no matter how well you fine-tune.
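To make the metric switch concrete, here is precision, recall, and F1 computed from hypothetical confusion-matrix counts:

```python
# Hypothetical counts for a model that catches some, but not all, hate speech.
tp, fp, fn = 30, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of everything flagged, how much was really hate?
recall = tp / (tp + fn)     # of all hate speech, how much was caught?
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.3f}")
# P=0.75  R=0.60  F1=0.667

# The always-'Safe' model has tp=0, so recall=0 and F1=0,
# despite its 95% accuracy.
```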

Technologies Used

Python · PyTorch · RoBERTa · BERTweet · Hugging Face Transformers · Davidson Hate Speech Dataset · Sigmoid · F1-Score
