From 10,000 False Alerts to 200: Rebuilding Transaction Monitoring with ML
When 95% of your AML alerts are false positives, your compliance team is not investigating. They are drowning. This is the story of how we rebuilt a transaction monitoring system from scratch, cutting monthly alerts from 10,000 to 200 while actually catching more real suspicious activity.
The Starting Point
The institution we worked with is a mid-sized bank operating across three West African countries. Their transaction monitoring system was a vendor solution configured with 187 rules, accumulated over eight years. Nobody fully understood what all 187 rules did. Some had been written by compliance officers who had long since left. Others were added in response to specific regulatory findings and never revisited.
The system generated approximately 10,000 alerts per month. A team of twelve analysts reviewed them. Their disposition data told the story: 95.3% of alerts were closed as false positives. Of the remaining 4.7%, about half were “true but trivial” (technically reportable but clearly not criminal). Genuine suspicious activity accounted for roughly 2% of total alerts.
Analyst turnover was 40% annually. Exit interviews consistently cited the same reason: the work felt meaningless. Reviewing and closing false positives for eight hours a day is demoralizing for skilled compliance professionals. The bank was spending $2.1 million per year on a process that was failing at its core purpose.
Where analyst time actually went (monthly): 79% of analyst capacity was consumed by alerts that turned out to be nothing.
The ML Approach
We did not start by building models. We started by studying two years of investigation outcomes. Every closed case had a disposition (false positive, true positive, SAR filed) and a narrative written by the reviewing analyst. This dataset was the foundation for everything that followed.
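As a hedged sketch of the labeling step (the field names and disposition codes here are hypothetical, and in practice the "true but trivial" cases mentioned earlier need an explicit decision about which side of the label they land on):

```python
import pandas as pd

# Hypothetical export of two years of closed investigations.
cases = pd.DataFrame({
    "case_id": ["C-1001", "C-1002", "C-1003"],
    "disposition": ["false_positive", "true_positive", "sar_filed"],
    "narrative": ["...", "...", "..."],
})

# Binary training label: anything analysts confirmed as genuinely suspicious is a positive.
POSITIVE_DISPOSITIONS = {"true_positive", "sar_filed"}
cases["label"] = cases["disposition"].isin(POSITIVE_DISPOSITIONS).astype(int)

print(cases[["case_id", "disposition", "label"]])
```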
Feature Engineering: The 400+ Feature Set
The old rule-based system used approximately 15 variables: transaction amount, frequency, country, customer type, and a handful of derived metrics. Our ML pipeline expanded this to over 400 features organized into five categories. The breadth of the feature set is what gives the model its discriminative power. No small set of variables can distinguish genuine suspicion from legitimate activity. The pattern emerges from the interaction of hundreds of signals.
400+ Features in 5 Categories
Raw transaction statistics aggregated over multiple time windows. The foundation of the feature set.
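As a hedged illustration of what those windowed aggregates look like in code (the column names, windows, and statistics are stand-ins, not the bank's actual feature definitions), a pandas sketch:

```python
import pandas as pd

# Hypothetical input: one row per transaction with customer_id, timestamp, amount.
txns = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2024-01-02", "2024-01-05", "2024-01-20", "2024-01-03", "2024-01-04"]
    ),
    "amount": [120.0, 4_500.0, 80.0, 9_900.0, 9_800.0],
})

def window_features(df: pd.DataFrame, windows=("7D", "30D", "90D")) -> pd.DataFrame:
    """Per-customer rolling transaction aggregates over several time windows."""
    df = df.sort_values(["customer_id", "timestamp"]).set_index("timestamp")
    out = []
    for w in windows:
        rolled = (
            df.groupby("customer_id")["amount"]
            .rolling(w)                              # time-based window, e.g. trailing 7 days
            .agg(["count", "sum", "mean", "max"])
            .add_prefix(f"amt_{w}_")
        )
        out.append(rolled)
    return pd.concat(out, axis=1).reset_index()

features = window_features(txns)
print(features.head())
```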
Handling Class Imbalance
With only 2% true positives in the labeled data, naive model training produces a classifier that predicts “not suspicious” for everything and achieves 98% accuracy. Useless, but technically accurate. We addressed this with a combination of techniques; one common approach is sketched below.
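The exact combination isn't reproduced here, but one standard ingredient is reweighting the rare positive class during training. A minimal sketch using XGBoost's scale_pos_weight on synthetic data (parameters are illustrative):

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for the labeled dispositions: roughly 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 25))
y = (rng.random(20_000) < 0.02).astype(int)

# Weight positives by the negative/positive ratio so the training loss doesn't ignore them.
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=scale_pos_weight,
    eval_metric="aucpr",  # precision-recall AUC is far more informative than accuracy here
)
model.fit(X, y)
```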
Model Selection: Why XGBoost Won
We evaluated five model architectures. XGBoost delivered the best precision, recall, and F2 of the group, at a latency comfortably within our real-time scoring budget.
| Model | Precision | Recall | F2 Score | Latency |
|---|---|---|---|---|
| XGBoost | 72% | 89% | 0.84 | 4ms |
| LightGBM | 70% | 87% | 0.82 | 3ms |
| Random Forest | 68% | 82% | 0.78 | 12ms |
| Neural Network | 71% | 85% | 0.81 | 28ms |
| Logistic Regression | 54% | 76% | 0.69 | 1ms |
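F2 weights recall twice as heavily as precision, which fits a setting where a missed suspicious pattern costs far more than one extra false positive to review. A hedged sketch of the kind of comparison harness behind a table like this (the candidate models, data, and thresholds here are illustrative, not the bank's benchmark):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit one candidate and report precision, recall, F2, and mean scoring latency."""
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    preds = model.predict(X_te)
    latency_ms = (time.perf_counter() - start) / len(X_te) * 1_000
    return {
        "precision": precision_score(y_te, preds, zero_division=0),
        "recall": recall_score(y_te, preds, zero_division=0),
        "f2": fbeta_score(y_te, preds, beta=2, zero_division=0),
        "ms_per_score": latency_ms,
    }

# Synthetic stand-in data; the real evaluation used the bank's labeled investigation outcomes.
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 25))
y = (rng.random(10_000) < 0.02).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

candidates = {
    "xgboost": XGBClassifier(scale_pos_weight=49, eval_metric="aucpr"),
    "random_forest": RandomForestClassifier(n_estimators=200, class_weight="balanced"),
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1_000),
}
for name, model in candidates.items():
    print(name, evaluate(model, X_tr, y_tr, X_te, y_te))
```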
The Deployment Strategy
We did not flip a switch from rules to ML. The transition took four months and followed a deliberate sequence designed to build confidence with both the compliance team and the regulator.
Four-Phase Rollout
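The individual phases aren't detailed in this text, but the confidence-building mechanism the article does describe is parallel running (see Regulatory Acceptance below): the ML scorer runs alongside the legacy rules without touching live dispositions, and disagreements are queued for review. A minimal shadow-mode sketch with hypothetical scorer stubs:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    txn_id: str
    features: dict

def legacy_rule_alerts(txn: Txn) -> bool:
    """Placeholder for the existing rule engine's alert decision."""
    return txn.features.get("amount", 0) > 9_000

def ml_alert_score(txn: Txn) -> float:
    """Placeholder for the trained model's alert probability."""
    return 0.12  # stub value for illustration

ML_THRESHOLD = 0.5

def shadow_compare(txn: Txn) -> dict:
    """Score both systems; only the legacy decision drives the live alert during shadow mode."""
    rule_alert = legacy_rule_alerts(txn)
    ml_alert = ml_alert_score(txn) >= ML_THRESHOLD
    return {
        "txn_id": txn.txn_id,
        "live_alert": rule_alert,                # legacy rules still decide in this phase
        "ml_alert": ml_alert,
        "disagreement": rule_alert != ml_alert,  # queued for analyst review and model tuning
    }

print(shadow_compare(Txn("T-001", {"amount": 12_500})))
```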
Results
After three months of full ML operation, the numbers told a clear story. Monthly alerts dropped from 10,000 to approximately 200. The false positive rate fell from 95% to 28%. But the metric that mattered most to the regulator was the true positive rate: it rose from 62% to 89%. The ML system was not just generating fewer alerts. It was generating better ones.
The compliance team went from twelve analysts to five (the others were redeployed, not laid off, to investigation and policy roles). Those five analysts now spend their time on genuinely suspicious cases rather than mechanical false positive disposition. Job satisfaction scores improved. Analyst turnover dropped to near zero.
Regulatory Acceptance
Convincing the regulator was not trivial. We prepared extensive documentation: model validation reports, backtesting results showing superior detection rates, explainability samples using SHAP values, and a formal model risk management framework covering governance, monitoring, and fallback procedures. The regulator approved the transition after a three-month parallel running period during which they could independently verify that the ML system matched or exceeded the rule-based system on every metric.
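The explainability samples themselves aren't reproduced here, but a hedged sketch of how per-alert SHAP attributions for a tree model can be generated (the model, data, and feature names below are placeholders):

```python
import numpy as np
import shap
from xgboost import XGBClassifier

# Hypothetical trained model; in practice the features come from the 400+ feature set.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(5_000, 10))
y_train = (rng.random(5_000) < 0.02).astype(int)
model = XGBClassifier(scale_pos_weight=49, eval_metric="logloss").fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
alert_features = X_train[:1]                      # one alerted transaction's feature vector
shap_values = explainer.shap_values(alert_features)

# Rank the features that pushed this alert's score, for the analyst's case file.
ranked = np.argsort(-np.abs(shap_values[0]))
for idx in ranked[:5]:
    print(f"feature_{idx}: contribution {shap_values[0][idx]:+.3f}")
```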
The key insight: regulators do not object to ML per se. They object to opacity. An ML system with comprehensive explainability, rigorous validation, and clear governance is easier to defend than a tangled web of 187 rules that nobody fully understands.