Complaint Classifier

Priority triage

Classify support complaints into four priority levels (Critical, High, Medium, Low). The models are trained across 8 domains with sentiment-aware feature engineering.

Processing Pipeline

How it works

The v2 pipeline fuses 10,000-dim TF-IDF features with 20 hand-engineered signals — capturing slang, sentiment, and domain signals that bag-of-words alone misses.

01
Raw text input
Accept any plain-text complaint string. No encoding requirements, no preprocessing on the caller side.
predict_complaint_v2("Server is down, urgent!")
02
Slang normalization
50+ slang tokens are mapped to canonical phrases before any other processing, preserving semantic signal that TF-IDF would silently drop.
omg → "extremely urgent" · wtf → "very serious issue" · lol → "this is funny but"
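A minimal sketch of how such a mapping could be applied, using only the three mappings shown above. The names `SLANG_MAP` and `normalize_slang` are illustrative assumptions rather than the project's actual identifiers, and the real table is said to cover 50+ tokens.

```python
import re

# Only the three documented mappings; the full 50+ entry table is not shown in the source.
SLANG_MAP = {
    "omg": "extremely urgent",
    "wtf": "very serious issue",
    "lol": "this is funny but",
}

# Match whole words only, case-insensitively, so "lollipop" is left untouched.
_SLANG_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG_MAP)) + r")\b",
    re.IGNORECASE,
)

def normalize_slang(text: str) -> str:
    """Replace each known slang token with its canonical phrase."""
    return _SLANG_RE.sub(lambda m: SLANG_MAP[m.group(1).lower()], text)

normalized = normalize_slang("OMG the server is down")
```

Running the replacement before lowercasing and punctuation stripping keeps the expanded phrase available to every later step.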
03
Text cleaning
Lowercase, strip URLs and emails, remove punctuation and digits, drop tokens shorter than 2 characters. Stopwords removed using a built-in list — no NLTK required.
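The cleaning rules above could be sketched roughly as follows. The stopword set here is a tiny stand-in (the project's built-in list is not shown), the regular expressions are assumptions, and `clean_text` is a hypothetical name.

```python
import re

# Tiny illustrative stopword set; the real built-in list is an unknown here.
STOPWORDS = {"the", "is", "at", "to", "and", "or", "my", "of", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # after lowercasing: punctuation, digits, symbols

def clean_text(text: str) -> str:
    """Lowercase, strip URLs/emails, drop punctuation and digits, filter tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Drop tokens shorter than 2 characters and stopwords.
    tokens = [t for t in text.split() if len(t) >= 2 and t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text(
    "The server is DOWN!!! See https://status.example.com or mail ops@example.com, ticket #4521"
)
```

This simplified pattern also discards non-ASCII letters, which a production cleaner may want to keep.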
04
TF-IDF vectorization
Cleaned text is transformed into a 10,000-dim sparse vector using a pre-fitted vectorizer. Unigrams and bigrams capture phrases like "system down".
TfidfVectorizer(max_features=10000, ngram_range=(1,2))
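To illustrate the bigram behaviour, here is a small sketch that fits a vectorizer with the same two parameters on a toy corpus. The corpus is invented for illustration; the production vectorizer is pre-fitted on the training data and only `transform` would be called at inference time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, standing in for already-cleaned complaints.
corpus = [
    "system down urgent",
    "payment failed checkout",
    "system down again payment failed",
]

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# With ngram_range=(1, 2) the vocabulary holds both single words and word pairs.
has_bigram = "system down" in vectorizer.vocabulary_
n_docs, n_features = X.shape
```

`max_features` is an upper bound: on a small corpus like this one the matrix has far fewer columns than 10,000.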
05
Engineered feature extraction
20 hand-crafted features are extracted in parallel and scaled with StandardScaler — urgency signals, sentiment intensity, domain flags, caps ratio, negation density, and more.
06
Feature fusion
Sparse TF-IDF matrix and dense engineered features are stacked into a 15,020-dim fused vector.
scipy.sparse.hstack([tfidf_matrix, dense_features])
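A small sketch of the stacking step with toy shapes (4 TF-IDF columns and 2 engineered columns, both invented for illustration). The fused width is just the sum of the two input widths, so the exact figure depends on how many TF-IDF columns the fitted vectorizer actually produced.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: 2 documents, 4 TF-IDF columns, 2 scaled engineered features.
tfidf_matrix = csr_matrix(np.array([[0.0, 0.7, 0.0, 0.7],
                                    [0.5, 0.0, 0.5, 0.0]]))
dense_features = np.array([[1.2, -0.3],
                           [0.0, 0.8]])

# Convert the dense block to sparse so the result stays sparse, then stack column-wise.
fused = hstack([tfidf_matrix, csr_matrix(dense_features)]).tocsr()

fused_shape = fused.shape
```

Keeping the fused matrix in CSR form avoids materialising thousands of mostly-zero TF-IDF columns as a dense array.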
07
Classification + probability calibration
Fused features go into the chosen model (LR or SVM). Both are probability-calibrated with CalibratedClassifierCV, so the reported confidence scores are calibrated probabilities rather than raw decision scores.
model.predict_proba(X) → {Critical, High, Medium, Low}
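As a rough sketch of the calibration wrapper, the example below trains on synthetic, well-separated clusters instead of real fused features. The LR settings mirror the specification listed in the model reference (max_iter=1000, balanced class weights, cv=5); the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

labels = ["Critical", "High", "Medium", "Low"]
rng = np.random.default_rng(0)

# Synthetic data: one Gaussian cluster per priority label, 20 samples each.
X = np.vstack([rng.normal(loc=i * 3.0, scale=0.5, size=(20, 5)) for i in range(4)])
y = np.repeat(labels, 20)

base = LogisticRegression(max_iter=1000, class_weight="balanced")
model = CalibratedClassifierCV(base, cv=5)  # default method is sigmoid (Platt) scaling
model.fit(X, y)

# predict_proba columns follow model.classes_ (sorted alphabetically), not `labels`.
proba = model.predict_proba(X[:1])[0]
scores = dict(zip(model.classes_, proba))
```

Pairing each probability with `model.classes_` avoids silently mislabelling scores, because scikit-learn orders the classes alphabetically rather than by priority.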

Engineered features

The 20 engineered features capture signals that TF-IDF alone cannot see.

urgency
Count of urgent/emergency/asap/now keywords
sentiment
Emotional intensity from 50+ anger/frustration words
caps_ratio
Ratio of uppercase letters to total characters
allcaps_words
Count of fully capitalized words (WHY, NOW)
negation_density
Density of: not, never, can't, won't, didn't
positive_framing
Complaints hidden after a compliment ("love it but…")
scope_words
"all users", "entire system", "everyone affected"
domain flags ×8
One-hot: technical, payment, legal, healthcare, retail, HR, finance, auth
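A few of these features are simple enough to sketch directly. The word lists below contain only the keywords quoted above, the exact formulas (for example, dividing negations by word count) are assumptions, and `extract_features` is a hypothetical name.

```python
import re

URGENCY_WORDS = {"urgent", "emergency", "asap", "now"}
NEGATIONS = {"not", "never", "can't", "won't", "didn't"}

def extract_features(text: str) -> dict:
    """Compute four illustrative features from the raw (uncleaned) text."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    n_chars, n_words = len(text), len(words)
    return {
        # Count of urgency keywords.
        "urgency": sum(w in URGENCY_WORDS for w in lowered),
        # Uppercase letters divided by total characters.
        "caps_ratio": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
        # Fully capitalized words of at least two letters (so "I" is not counted).
        "allcaps_words": sum(len(w) > 1 and w.isupper() for w in words),
        # Negation words divided by total words.
        "negation_density": sum(w in NEGATIONS for w in lowered) / n_words if n_words else 0.0,
    }

features = extract_features("WHY can't I log in NOW")
```

These features have to be computed on the raw text, since lowercasing and punctuation removal would erase the capitalization and contraction signals.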

Priority labels

🔴
Critical — immediate action
System-wide outages, data breaches, GDPR violations, payroll crashes, security incidents.
🟠
High — resolve within hours
Payment failures, access blocked, appointment cancellations, crashes at checkout.
🔵
Medium — schedule for resolution
Service delays, rude staff, slow performance, delayed notifications, minor data issues.
🟢
Low — handle when convenient
Price feedback, cosmetic UI issues, feature requests, subjective preferences.
Model Reference

All models

Two production classifiers share a TF-IDF vectorizer and are swappable with zero pipeline changes. DistilBERT is available for advanced deployments.

Logistic Regression
v2 · recommended
Probability-calibrated linear classifier. Best for fast inference, interpretable confidence scores, and production APIs.
Technical accuracy: 92.5%
Stress-test accuracy: 79.2%
Specifications
Algorithm: sklearn LogisticRegression
Solver: lbfgs
Max iterations: 1000
Class weighting: balanced
Calibration: CalibratedClassifierCV (cv=5)
Feature dimensions: 15,020
Training samples: 736 (balanced, 8 domains)
Model file size: 206 KB
Inference latency: ~2 ms per sample
Per-class F1 score
Critical: 0.92
High: 0.90
Medium: 0.95
Low: 0.93
Best for
Production APIs · Fast inference · Probability scores · Interpretability · Recommended default
Usage
from predictor import predict_complaint
result = predict_complaint("Server is down", model_type="lr")
# → {"label": "Critical", "confidence": 0.87, ...}
Linear SVM
v2
Calibrated linear SVM. More robust on short or noisy texts. Good alternative when LR confidence scores seem overconfident.
Technical accuracy: 92.5%
Stress-test accuracy: 79.2%
Specifications
Algorithm: LinearSVC (CalibratedClassifierCV)
Kernel: Linear (implicit)
Regularization C: 1.0
Class weighting: balanced
Calibration: Platt scaling via CalibratedClassifierCV
Feature dimensions: 15,020
Model file size: 1.1 MB
Inference latency: ~3 ms per sample
Per-class F1 score
Critical: 0.93
High: 0.89
Medium: 0.94
Low: 0.92
Best for
Short texts · Noisy input · Outlier robustness · Slightly slower than LR
Usage
result = predict_complaint("Payment failed", model_type="svm")
# → {"label": "High", "confidence": 0.81, ...}
DistilBERT
advanced · optional
Transformer that can be fine-tuned on the complaint dataset. It offers a higher accuracy ceiling, but a GPU is recommended and the dependencies are heavy.
Expected accuracy: 95%+ (requires training)
Specifications
Base model: distilbert-base-uncased
Parameters: 66M
Fine-tune: Full fine-tune (classification head)
Max token length: 128
Framework: HuggingFace Transformers + PyTorch
GPU required: Recommended (CUDA)
CPU latency: 80–200 ms per sample
GPU latency: ~12 ms per sample
Setup
pip install transformers torch datasets
# Uncomment DistilBERT block in train.py
python train.py
Trade-offs
Highest accuracy ceiling · Context-aware · Needs GPU for production · Heavy dependencies · 80–200 ms CPU latency

v1 vs v2 comparison

v2 was rebuilt from scratch after stress-testing revealed v1 hit only 33% accuracy on diverse non-technical complaints.

Metric                        v1         v2
Stress-test accuracy          33.3%      79.2%
Technical test accuracy       92.5%      ~80% *
Dataset size                  400        736
Complaint domains             3          8+
Emotional vocabulary          0 words    50+ words
Feature dimensions            10,000     15,020
Slang normalization           no         yes
Sentiment features            no         yes
Domain-aware features         no         yes
Positive-framing detection    no         yes
Legal / GDPR domain           no         yes
Healthcare domain             no         yes

Cross-validation score: 77.3% ±1.5%

* v2 technical accuracy is lower because the test set now includes harder boundary cases absent from v1.