From 10,000 False Alerts to 200: Rebuilding Transaction Monitoring with ML
When 95% of your AML alerts are false positives, your compliance team is not investigating. They are drowning. This is the story of how we rebuilt a transaction monitoring system from scratch, cutting monthly alerts from 10,000 to 200 while actually catching more real suspicious activity.
The Starting Point
The institution we worked with is a mid-sized bank operating across three West African countries. Their transaction monitoring system was a vendor solution configured with 187 rules, accumulated over eight years. Nobody fully understood what all 187 rules did. Some had been written by compliance officers who had long since left. Others were added in response to specific regulatory findings and never revisited.
The system generated approximately 10,000 alerts per month. A team of twelve analysts reviewed them. Their disposition data told the story: 95.3% of alerts were closed as false positives. Of the remaining 4.7%, about half were “true but trivial” (technically reportable but clearly not criminal). Genuine suspicious activity accounted for roughly 2% of total alerts.
Analyst turnover was 40% annually. Exit interviews consistently cited the same reason: the work felt meaningless. Reviewing and closing false positives for eight hours a day is demoralizing for skilled compliance professionals. The bank was spending $2.1 million per year on a process that was failing at its core purpose.
Where analyst time actually went (monthly): 79% of analyst capacity was consumed by alerts that turned out to be nothing.
The ML Approach
We did not start by building models. We started by studying two years of investigation outcomes. Every closed case had a disposition (false positive, true positive, SAR filed) and a narrative written by the reviewing analyst. This dataset was the foundation for everything that followed.
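As a hedged sketch of the labeling step (the field names and disposition codes here are hypothetical, and in practice the "true but trivial" cases mentioned earlier need an explicit decision about which side of the label they land on):

```python
import pandas as pd

# Hypothetical export of two years of closed investigations.
cases = pd.DataFrame({
    "case_id": ["C-1001", "C-1002", "C-1003"],
    "disposition": ["false_positive", "true_positive", "sar_filed"],
    "narrative": ["...", "...", "..."],
})

# Binary training label: anything analysts confirmed as genuinely suspicious is a positive.
POSITIVE_DISPOSITIONS = {"true_positive", "sar_filed"}
cases["label"] = cases["disposition"].isin(POSITIVE_DISPOSITIONS).astype(int)

print(cases[["case_id", "disposition", "label"]])
```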
Feature Engineering: The 400+ Feature Set
The old rule-based system used approximately 15 variables: transaction amount, frequency, country, customer type, and a handful of derived metrics. Our ML pipeline expanded this to over 400 features organized into five categories. The breadth of the feature set is what gives the model its discriminative power. No small set of variables can distinguish genuine suspicion from legitimate activity. The pattern emerges from the interaction of hundreds of signals.
400+ Features in 5 Categories
Raw transaction statistics aggregated over multiple time windows. The foundation of the feature set.
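As a hedged illustration of what those windowed aggregates look like in code (the column names, windows, and statistics are stand-ins, not the bank's actual feature definitions), a pandas sketch:

```python
import pandas as pd

# Hypothetical input: one row per transaction with customer_id, timestamp, amount.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-01-20", "2024-01-03", "2024-01-04"]
    ),
    "amount": [120.0, 4_500.0, 80.0, 9_900.0, 9_800.0],
})

def window_features(df: pd.DataFrame, windows=("7D", "30D", "90D")) -> pd.DataFrame:
    """Per-customer rolling transaction aggregates over several time windows."""
    df = df.sort_values(["customer_id", "timestamp"]).set_index("timestamp")
    out = []
    for w in windows:
        rolled = (
            df.groupby("customer_id")["amount"]
            .rolling(w)                              # time-based window, e.g. trailing 7 days
            .agg(["count", "sum", "mean", "max"])
            .add_prefix(f"amt_{w}_")
        )
        out.append(rolled)
    return pd.concat(out, axis=1).reset_index()

features = window_features(txns)
print(features.head())
```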
Handling Class Imbalance
With only 2% true positives in the labeled data, naive model training produces a classifier that predicts “not suspicious” for everything and achieves 98% accuracy. Useless, but technically accurate. We addressed this with a combination of techniques; one common approach is sketched below.
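The exact combination isn't reproduced here, but one standard ingredient is reweighting the rare positive class during training. A minimal sketch using XGBoost's scale_pos_weight on synthetic data (parameters are illustrative):

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for the labeled dispositions: roughly 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 25))
y = (rng.random(20_000) < 0.02).astype(int)

# Weight positives by the negative/positive ratio so the training loss doesn't ignore them.
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    eval_metric="aucpr",  # precision-recall AUC is far more informative than accuracy here
)
model.fit(X, y)
```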
Model Selection: Why XGBoost Won
We evaluated five model architectures. XGBoost delivered the best precision, recall, and F2 of the group, at a latency comfortably within our real-time scoring budget.
| Model | Precision | Recall | F2 Score | Latency |
|---|---|---|---|---|
| XGBoost | 72% | 89% | 0.84 | 4ms |
| LightGBM | 70% | 87% | 0.82 | 3ms |
| Random Forest | 68% | 82% | 0.78 | 12ms |
| Neural Network | 71% | 85% | 0.81 | 28ms |
| Logistic Regression | 54% | 76% | 0.69 | 1ms |
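F2 weights recall twice as heavily as precision, which fits a setting where a missed suspicious pattern costs far more than one extra false positive to review. A hedged sketch of the kind of comparison harness behind a table like this (the candidate models, data, and thresholds here are illustrative, not the bank's benchmark):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit one candidate and report precision, recall, F2, and mean scoring latency."""
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    preds = model.predict(X_te)
    latency_ms = (time.perf_counter() - start) / len(X_te) * 1_000
    return {
        "precision": precision_score(y_te, preds, zero_division=0),
        "recall": recall_score(y_te, preds, zero_division=0),
        "f2": fbeta_score(y_te, preds, beta=2, zero_division=0),
        "ms_per_score": latency_ms,
    }

# Synthetic stand-in data; the real evaluation used the bank's labeled investigation outcomes.
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 25))
y = (rng.random(10_000) < 0.02).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "xgboost": XGBClassifier(scale_pos_weight=49, eval_metric="aucpr"),
    "random_forest": RandomForestClassifier(n_estimators=200, class_weight="balanced"),
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1_000),
}
for name, model in candidates.items():
    print(name, evaluate(model, X_tr, y_tr, X_te, y_te))
```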
The Deployment Strategy
We did not flip a switch from rules to ML. The transition took four months and followed a deliberate sequence designed to build confidence with both the compliance team and the regulator.
Four-Phase Rollout
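The individual phases aren't detailed in this text, but the confidence-building mechanism the article does describe is parallel running (see Regulatory Acceptance below): the ML scorer runs alongside the legacy rules without touching live dispositions, and disagreements are queued for review. A minimal shadow-mode sketch with hypothetical scorer stubs:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    txn_id: str
    features: dict

def legacy_rule_alerts(txn: Txn) -> bool:
    """Placeholder for the existing rule engine's alert decision."""
    return txn.features.get("amount", 0) > 9_000

def ml_alert_score(txn: Txn) -> float:
    """Placeholder for the trained model's alert probability."""
    return 0.12  # stub value for illustration

ML_THRESHOLD = 0.5

def shadow_compare(txn: Txn) -> dict:
    """Score both systems; only the legacy decision drives the live alert during shadow mode."""
    rule_alert = legacy_rule_alerts(txn)
    ml_alert = ml_alert_score(txn) >= ML_THRESHOLD
    return {
        "txn_id": txn.txn_id,
        "live_alert": rule_alert,                # legacy rules still decide in this phase
        "ml_alert": ml_alert,
        "disagreement": rule_alert != ml_alert,  # queued for analyst review and model tuning
    }

print(shadow_compare(Txn("T-001", {"amount": 12_500})))
```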
Results
After three months of full ML operation, the numbers told a clear story. Monthly alerts dropped from 10,000 to approximately 200. The false positive rate fell from 95% to 28%. But the metric that mattered most to the regulator was the true positive rate: it rose from 62% to 89%. The ML system was not just generating fewer alerts. It was generating better ones.
The compliance team went from twelve analysts to five (the others were redeployed, not laid off, to investigation and policy roles). Those five analysts now spend their time on genuinely suspicious cases rather than mechanical false positive disposition. Job satisfaction scores improved. Analyst turnover dropped to near zero.
Regulatory Acceptance
Convincing the regulator was not trivial. We prepared extensive documentation: model validation reports, backtesting results showing superior detection rates, explainability samples using SHAP values, and a formal model risk management framework covering governance, monitoring, and fallback procedures. The regulator approved the transition after a three-month parallel running period during which they could independently verify that the ML system matched or exceeded the rule-based system on every metric.
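The explainability samples themselves aren't reproduced here, but a hedged sketch of how per-alert SHAP attributions for a tree model can be generated (the model, data, and feature names below are placeholders):

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Hypothetical trained model; in practice the features come from the 400+ feature set.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(5_000, 10))
y_train = (rng.random(5_000) < 0.02).astype(int)
model = XGBClassifier(scale_pos_weight=49, eval_metric="logloss").fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
alert_features = X_train[:1]                      # one alerted transaction's feature vector
shap_values = explainer.shap_values(alert_features)

# Rank the features that pushed this alert's score, for the analyst's case file.
ranked = np.argsort(-np.abs(shap_values[0]))
for idx in ranked[:5]:
    print(f"feature_{idx}: contribution {shap_values[0][idx]:+.3f}")
```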
The key insight: regulators do not object to ML per se. They object to opacity. An ML system with comprehensive explainability, rigorous validation, and clear governance is easier to defend than a tangled web of 187 rules that nobody fully understands.