Processing Pipeline

How it works

The v2 pipeline fuses 10,000-dim TF-IDF features with 20 hand-engineered signals that capture slang, sentiment, and domain cues which bag-of-words alone misses.

Raw text input

Accept any plain-text complaint string. No encoding requirements, no preprocessing on the caller side.

predict_complaint_v2("Server is down, urgent!")

Slang normalization

50+ slang tokens are mapped to canonical phrases before any other processing, preserving semantic signal that TF-IDF would silently drop.

omg → "extremely urgent" · wtf → "very serious issue" · lol → "this is funny but"
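A minimal sketch of this step, assuming a whole-word, case-insensitive dictionary lookup. The names `SLANG_MAP` and `normalize_slang` are hypothetical, and only the three mappings shown above are included; the full table of 50+ tokens is not reproduced here.

```python
import re

# Hypothetical subset of the slang table: only the three documented mappings.
SLANG_MAP = {
    "omg": "extremely urgent",
    "wtf": "very serious issue",
    "lol": "this is funny but",
}

# Match whole words only, case-insensitively, so "lollipop" is left alone.
_SLANG_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG_MAP)) + r")\b", re.IGNORECASE
)

def normalize_slang(text: str) -> str:
    """Replace each known slang token with its canonical phrase."""
    return _SLANG_RE.sub(lambda m: SLANG_MAP[m.group(1).lower()], text)

print(normalize_slang("OMG the server is down"))
```

Running normalization before cleaning matters because the replacement phrases contain ordinary words that the later TF-IDF step can index.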

Text cleaning

Lowercase the text, strip URLs and emails, remove punctuation and digits, and drop tokens shorter than 2 characters. Stopwords are removed using a built-in list, so NLTK is not required.
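The cleaning steps above could be sketched as follows. This is an illustration under stated assumptions, not the project's code: `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins (the real built-in list is not shown), and the regular expressions are deliberately simple and ASCII-only.

```python
import re

# Hypothetical stand-in for the built-in stopword list (the real list is not shown).
STOPWORDS = {"the", "is", "and", "to", "of", "at", "in", "it", "me"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"\S+@\S+", " ", text)                # strip email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # drop punctuation and digits
    # Keep tokens of at least 2 characters that are not stopwords.
    tokens = [t for t in text.split() if len(t) >= 2 and t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("Server is DOWN at 3am!! Email me: ops@example.com or see https://status.example.com"))
```

Note that removing punctuation also splits contractions such as "can't", so any feature that depends on them would need to be computed on the raw text.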

TF-IDF vectorization

Cleaned text is transformed into a 10,000-dim sparse vector using a pre-fitted vectorizer. Unigrams and bigrams capture phrases like "system down".

TfidfVectorizer(max_features=10000, ngram_range=(1,2))
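As a rough illustration of how the vectorizer configuration above behaves, the sketch below fits it on a three-sentence toy corpus (invented for this example; the real vectorizer is pre-fitted on the training set). Note that `max_features` is an upper bound, so a small corpus yields far fewer columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only.
corpus = ["system down urgent", "payment failed checkout", "system slow today"]

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
vectorizer.fit(corpus)

# At inference time only transform() is called on the fitted vectorizer.
X = vectorizer.transform(["system down again"])
vocab = vectorizer.vocabulary_

print(X.shape)                 # one row, one column per learned n-gram
print("system down" in vocab)  # bigrams are part of the vocabulary
```

Words never seen during fitting ("again" here) are simply ignored at transform time.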

Engineered feature extraction

20 hand-crafted features are extracted in parallel and scaled with StandardScaler. They cover urgency signals, sentiment intensity, domain flags, caps ratio, negation density, and more.

Feature fusion

Sparse TF-IDF matrix and dense engineered features are stacked into a 15,020-dim fused vector.

scipy.sparse.hstack([tfidf_matrix, dense_features])
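A small sketch of the stacking call, using toy shapes rather than the production dimensions. The arrays are invented for illustration; the point is that the fused width equals the sum of the two block widths and the result stays sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: 2 samples, 4 TF-IDF columns, 2 engineered features.
tfidf_matrix = csr_matrix(np.array([[0.0, 0.5, 0.0, 0.7],
                                    [0.3, 0.0, 0.0, 0.0]]))
dense_features = np.array([[1.2, -0.4],
                           [0.0, 0.9]])

# Wrap the dense block as sparse so every block has the same type, then
# convert to CSR, which supports fast row slicing for prediction.
fused = hstack([tfidf_matrix, csr_matrix(dense_features)]).tocsr()

print(fused.shape)
```

Keeping the fused matrix sparse avoids materializing thousands of mostly-zero TF-IDF columns per sample.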

Classification + probability calibration

The fused features are passed to the chosen model (LR or SVM). Both are probability-calibrated with CalibratedClassifierCV, so the confidence scores are calibrated probabilities rather than raw decision scores.

model.predict_proba(X) → {Critical, High, Medium, Low}
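The sketch below shows one way such a calibrated classifier could be assembled, using the Logistic Regression settings listed in the model specifications (lbfgs, 1000 iterations, balanced class weights, cv=5). The synthetic data from `make_classification` and the `LABELS` ordering are assumptions made purely for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

LABELS = ["Critical", "High", "Medium", "Low"]  # assumed class order

# Synthetic 4-class data standing in for the fused feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

base = LogisticRegression(solver="lbfgs", max_iter=1000, class_weight="balanced")
model = CalibratedClassifierCV(base, cv=5)  # refits the base model on each fold
model.fit(X, y)

# One probability per class; the row sums to 1.
proba = model.predict_proba(X[:1])[0]
scores = dict(zip(LABELS, proba.round(3)))
print(scores)
```

The predicted label is the class with the highest calibrated probability, and that probability can be reported as the confidence.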

Engineered features

The 20 engineered features capture signals that TF-IDF alone cannot see.

urgency

Count of urgent/emergency/asap/now keywords

sentiment

Emotional intensity from 50+ anger/frustration words

caps_ratio

Ratio of uppercase letters to total characters

allcaps_words

Count of fully capitalized words (WHY, NOW)

negation_density

Density of: not, never, can't, won't, didn't

positive_framing

Complaints hidden after a compliment ("love it but…")

scope_words

"all users", "entire system", "everyone affected"

domain flags ×8

One-hot: technical, payment, legal, healthcare, retail, HR, finance, auth
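To make a few of these definitions concrete, here is a hedged sketch of four of the surface features. The function name `surface_features` and the keyword sets are hypothetical and contain only the examples named above; the actual feature extractor may tokenize and count differently.

```python
import re

# Hypothetical keyword lists built only from the examples named above.
URGENCY_WORDS = {"urgent", "emergency", "asap", "now"}
NEGATIONS = {"not", "never", "can't", "won't", "didn't"}

def surface_features(text: str) -> dict:
    """Compute a few engineered signals on the raw (uncleaned) text."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        "urgency": sum(w in URGENCY_WORDS for w in lowered),
        "caps_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "allcaps_words": sum(len(w) >= 2 and w.isupper() for w in words),
        "negation_density": sum(w in NEGATIONS for w in lowered) / n_words,
    }

feats = surface_features("WHY can't I log in? Fix it NOW")
print(feats)
```

These features have to be computed before lowercasing and punctuation removal, since both steps destroy the capitalization and contraction cues they rely on.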

Priority labels

🔴 Critical: immediate action

System-wide outages, data breaches, GDPR violations, payroll crashes, security incidents.

🟠 High: resolve within hours

Payment failures, access blocked, appointment cancellations, crashes at checkout.

🔵 Medium: schedule for resolution

Service delays, rude staff, slow performance, delayed notifications, minor data issues.

🟢 Low: handle when convenient

Price feedback, cosmetic UI issues, feature requests, subjective preferences.

Model Reference

All models

Two production classifiers share a TF-IDF vectorizer and are swappable with zero pipeline changes. DistilBERT is available for advanced deployments.

Logistic Regression

v2 · recommended

Probability-calibrated linear classifier. Best for fast inference, interpretable confidence scores, and production APIs.

Technical accuracy	92.5%
Stress-test accuracy	79.2%

Specifications

Algorithm	sklearn LogisticRegression
Solver	lbfgs
Max iterations	1000
Class weighting	balanced
Calibration	CalibratedClassifierCV (cv=5)
Feature dimensions	15,020
Training samples	736 (balanced, 8 domains)
Model file size	206 KB
Inference latency	~2 ms per sample

Per-class F1 score

Class	F1
Critical	0.92
High	0.90
Medium	0.95
Low	0.93

Best for

Production APIs · Fast inference · Probability scores · Interpretability · Recommended default

Usage

from predictor import predict_complaint
result = predict_complaint("Server is down", model_type="lr")
# → {"label": "Critical", "confidence": 0.87, ...}

Linear SVM

Calibrated linear SVM. More robust on short or noisy texts. Good alternative when LR confidence scores seem overconfident.

Technical accuracy	92.5%
Stress-test accuracy	79.2%

Specifications

Algorithm	LinearSVC (CalibratedClassifierCV)
Kernel	Linear (implicit)
Regularization C	1.0
Class weighting	balanced
Calibration	Platt scaling via CalibratedClassifierCV
Feature dimensions	15,020
Model file size	1.1 MB
Inference latency	~3 ms per sample

Per-class F1 score

Class	F1
Critical	0.93
High	0.89
Medium	0.94
Low	0.92

Best for

Short texts · Noisy input · Outlier robustness · Slightly slower than LR

Usage

from predictor import predict_complaint
result = predict_complaint("Payment failed", model_type="svm")
# → {"label": "High", "confidence": 0.81, ...}
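For context on why the SVM needs the calibration wrapper: `LinearSVC` exposes only a decision function, not `predict_proba`. The sketch below wraps it with Platt scaling (`method="sigmoid"`) using the C and class-weight values from the specifications. The synthetic data, `cv=5`, and `max_iter=5000` are assumptions for illustration and are not taken from the project.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-class data standing in for the fused feature matrix.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

svm = LinearSVC(C=1.0, class_weight="balanced", max_iter=5000)
has_proba = hasattr(svm, "predict_proba")  # False: LinearSVC only has decision_function

# Platt scaling fits a sigmoid on the SVM margins within each CV fold.
model = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
model.fit(X, y)

proba = model.predict_proba(X[:1])[0]
print(has_proba, proba.round(3))
```

Because the wrapper exposes the same `predict_proba` interface as the calibrated Logistic Regression, the two models can be swapped without changing downstream code.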

DistilBERT

advanced · optional

Transformer fine-tuned on the complaint dataset. It offers a higher accuracy ceiling, but a GPU is recommended and the dependencies are heavy.

Expected accuracy	95%+ (requires training)

Specifications

Base model	distilbert-base-uncased
Parameters	66M
Fine-tune	Full fine-tune (classification head)
Max token length	128
Framework	HuggingFace Transformers + PyTorch
GPU required	Recommended (CUDA)
CPU latency	80–200 ms per sample
GPU latency	~12 ms per sample

Setup

pip install transformers torch datasets
# Uncomment DistilBERT block in train.py
python train.py

Trade-offs

Highest accuracy ceiling · Context-aware · Needs GPU for production · Heavy dependencies · 80–200 ms CPU latency

v1 vs v2 comparison

v2 was rebuilt from scratch after stress-testing revealed that v1 reached only 33.3% accuracy on diverse non-technical complaints.

Metric	v1	v2
Stress-test accuracy	33.3%	79.2%
Technical test accuracy	92.5%	~80% *
Dataset size	400	736
Complaint domains	3	8+
Emotional vocabulary	0 words	50+ words
Feature dimensions	10,000	15,020
Slang normalization	✗	✓
Sentiment features	✗	✓
Domain-aware features	✗	✓
Positive-framing detection	✗	✓
Legal / GDPR domain	✗	✓
Healthcare domain	✗	✓
Cross-validation score	—	77.3% ±1.5%

* v2 technical accuracy is lower because the test set now includes harder boundary cases absent from v1.
