Building Autonomous Compliance Agents with LangGraph
What happens when you give an LLM access to PEP databases, sanctions lists, transaction histories, and the ability to make compliance decisions? SENTINEL is a multi-agent system where five specialized AI agents collaborate through LangGraph to conduct full KYC/AML investigations autonomously.
Why a Multi-Agent Architecture?
A compliance investigation is not a single task. It is a workflow with distinct phases, each requiring different expertise and different tools. Asking a single LLM to handle everything, from parsing transaction data to querying PEP databases to drafting SAR narratives, produces mediocre results across the board.
The insight behind SENTINEL is specialization. Five agents, each with a focused role, constrained tool set, and tailored system prompt, outperform a single generalist agent by a wide margin. This mirrors how human compliance teams work: the analyst who triages alerts is not the same person who drafts SARs.
Five Specialized Agents
Alert Triage Agent (~12 sec): the first responder. It receives raw alerts and performs the initial risk assessment.
LangGraph: State Machines for Agent Coordination
The coordination layer is built on LangGraph, which models the investigation workflow as a state machine. Each state represents a phase of the investigation. Transitions between states are determined by the output of the previous agent, combined with confidence thresholds and business rules.
This is fundamentally different from chain-based approaches (LangChain's sequential chains) or simple prompt chaining. A state machine can branch, loop back, and route to human review when confidence is low. The investigation path adapts to what the agents find, rather than following a fixed sequence.
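The branching-and-looping idea can be shown without any framework at all. Below is a minimal plain-Python sketch of conditional routing between investigation phases; it deliberately avoids LangGraph's actual API, and the node names and 0.70 threshold are illustrative, not SENTINEL's real configuration.

```python
# Minimal state-machine sketch of conditional investigation routing.
# Node names and the 0.70 threshold are illustrative placeholders.

def triage(state):
    # In the real system this would be an LLM call; here we echo the alert score.
    state["confidence"] = state["alert_score"]
    return state

def route_after_triage(state):
    # Branching a linear chain cannot express: low confidence goes to a human.
    return "human_review" if state["confidence"] < 0.70 else "cdd"

def cdd(state):
    state["findings"] = ["customer profile reviewed"]
    return state

def human_review(state):
    state["escalated"] = True
    return state

NODES = {"triage": triage, "cdd": cdd, "human_review": human_review}
ROUTERS = {"triage": route_after_triage}  # nodes with conditional edges

def run(state, start="triage"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = ROUTERS[node](state) if node in ROUTERS else None
    return state
```

A low-score alert (`run({"alert_score": 0.55})`) escalates to `human_review`, while a high-score one proceeds to `cdd`; LangGraph provides the same conditional-edge pattern with persistence and checkpointing on top.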
LangGraph State Machine
The Tool Ecosystem
Each agent has access to a curated set of tools. This is not a free-for-all where every agent can call every API. Tool access is scoped by role, both for security and for quality. An agent performs better when its tool set is focused and relevant to its specific task. In the matrix below, ✓ marks a tool the agent can call and • marks one it cannot.
| Tool | Triage | CDD | TxnAnalysis | Scoring | SAR |
|---|---|---|---|---|---|
| Transaction history API | ✓ | • | ✓ | • | ✓ |
| Customer profile DB | ✓ | ✓ | • | • | ✓ |
| AfricaPEP API | • | ✓ | • | • | • |
| Sanctions screening | • | ✓ | • | • | • |
| Adverse media search | • | ✓ | • | • | • |
| Neo4j graph query | • | ✓ | ✓ | • | • |
| ML risk model | • | • | • | ✓ | • |
| SAR template engine | • | • | • | • | ✓ |
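The access matrix can be enforced with a simple role-scoped registry. The sketch below mirrors the table above; the registry mechanics and tool identifiers are illustrative, not SENTINEL's actual implementation.

```python
# Role-scoped tool registry mirroring the access matrix above.
# Agent roles and tool names follow the table; everything else is illustrative.

TOOL_ACCESS = {
    "triage": {"transaction_history", "customer_profile"},
    "cdd": {"customer_profile", "africapep", "sanctions_screening",
            "adverse_media", "graph_query"},
    "txn_analysis": {"transaction_history", "graph_query"},
    "scoring": {"ml_risk_model"},
    "sar": {"transaction_history", "customer_profile", "sar_template"},
}

def tools_for(agent, all_tools):
    """Return only the tools this agent role is permitted to call."""
    allowed = TOOL_ACCESS[agent]
    return {name: fn for name, fn in all_tools.items() if name in allowed}
```

Binding each agent to `tools_for(role, registry)` at construction time means an out-of-scope call fails at lookup rather than depending on the prompt to forbid it.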
Guardrails and Human-in-the-Loop
Autonomous does not mean unsupervised. We built three layers of guardrails:
Every agent outputs a confidence score alongside its findings. When confidence drops below 70% at any stage, the investigation routes to human review. The agent provides its partial findings and specific questions for the human analyst, so the handoff is efficient rather than starting from scratch.
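The handoff described above can be sketched as a small data structure: when confidence falls below the floor, the agent packages partial findings and targeted questions instead of a conclusion. Field names and the threshold constant are assumptions for illustration.

```python
# Sketch of the human-review handoff: below-threshold stages hand over partial
# findings plus specific questions for the analyst. Names are illustrative.
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.70  # threshold stated in the text

@dataclass
class Handoff:
    stage: str
    confidence: float
    partial_findings: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

def maybe_escalate(stage, confidence, findings, questions):
    """Return a Handoff for the analyst when confidence is below the floor."""
    if confidence < CONFIDENCE_FLOOR:
        return Handoff(stage, confidence, findings, questions)
    return None  # stay autonomous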
Agents can read data and generate recommendations, but they cannot take irreversible actions. Closing a case as non-suspicious requires human confirmation. Filing a SAR requires human approval and signature. The agents accelerate the investigation; humans make the final call.
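This guardrail amounts to an action gate: dispositions the text marks as irreversible require an explicit human approval token before execution. The action names and gate mechanics below are an illustrative sketch, not the production code.

```python
# Sketch of the "no irreversible actions" guardrail. Action names illustrative.

IRREVERSIBLE = {"close_case", "file_sar"}

class HumanApprovalRequired(Exception):
    pass

def execute(action, approved_by=None):
    """Run an action; irreversible ones need a named human approver."""
    if action in IRREVERSIBLE and approved_by is None:
        raise HumanApprovalRequired(f"{action} needs human sign-off")
    suffix = f" (approved by {approved_by})" if approved_by else ""
    return f"{action} executed{suffix}"
```

Raising instead of silently skipping keeps the refusal visible in logs, and the approver's identity travels with the action for the audit trail.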
Every factual claim in an agent's output is traced back to a specific tool call and response. If an agent asserts that a customer appeared on a sanctions list, the system verifies that the sanctions screening tool actually returned that result. Ungrounded claims are flagged and the finding is marked as unverified.
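The grounding check can be sketched as a join between asserted claims and the recorded tool-call log; anything without a matching response is flagged rather than accepted. The log and claim schemas here are assumptions for illustration.

```python
# Sketch of claim grounding: each factual assertion must point at a recorded
# tool call whose response actually supports it. Schemas are illustrative.

def ground_claims(claims, tool_log):
    """Split claims into (verified, unverified) against the tool-call log."""
    responses = {(c["tool"], c["query"]): c["response"] for c in tool_log}
    verified, unverified = [], []
    for claim in claims:
        key = (claim["source_tool"], claim["source_query"])
        if responses.get(key) == claim["evidence"]:
            verified.append(claim)
        else:
            unverified.append(claim)  # flagged, never silently accepted
    return verified, unverified
```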
The SAR Generation Pipeline
The SAR Drafting Agent is the most complex of the five. Writing a Suspicious Activity Report requires synthesizing findings from all previous agents into a coherent narrative that meets regulatory formatting requirements. The narrative must be factual, precise, and reference specific transactions and evidence.
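The synthesis step can be sketched as merging each upstream agent's findings into a structured skeleton that forces every statement to carry an evidence reference, before any templating. Section layout and field names are illustrative assumptions.

```python
# Sketch of SAR narrative synthesis: upstream findings are merged into a
# skeleton where every claim carries an evidence reference. Names illustrative.

def draft_sar_narrative(case_id, findings_by_agent):
    """Build a narrative skeleton from per-agent findings with evidence refs."""
    sections = [f"Case {case_id}"]
    for agent, findings in findings_by_agent.items():
        lines = "; ".join(
            f"{f['claim']} (ref: {f['evidence_ref']})" for f in findings
        )
        sections.append(f"{agent}: {lines}")
    return "\n".join(sections)
```

Keeping the evidence reference attached at this stage is what lets the grounding guardrail re-verify the finished narrative line by line.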
[Chart: SAR Draft Quality Metrics]
Results
After three months of parallel running, with agents and human analysts independently working the same cases, the results were clear. The multi-agent system completed investigations in an average of 20 minutes versus 4 hours for human analysts. Agreement with human decisions was 87%. In the 13% of cases where they disagreed, a review panel found the agents were correct 41% of the time, so the effective accuracy gap is smaller than the headline number suggests.
The more interesting metric is investigation quality. Human analysts, under pressure to clear alert queues, often took shortcuts: skipping peer group analysis, not checking beneficial ownership, or writing minimal SAR narratives. The agents never skip steps. They run the full investigation workflow every time, producing more thorough and consistent work product.
What We Learned About Agent Reliability
A single agent with all tools performed at 61% accuracy. Five specialized agents hit 87%. The improvement comes from focused context, constrained tool sets, and tailored prompts.
Linear chains cannot handle the branching logic of real investigations. State machines let us model conditional routing, human checkpoints, and retry logic cleanly.
The agents are only as good as the tools they call. When our PEP database had a 15-minute outage, the CDD Agent produced confidently wrong results. We added health checks and fallback behaviors for every tool.
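The fix after the PEP outage can be sketched as a wrapper that probes a tool's health and degrades loudly instead of returning a confident-looking empty result. The probe and status fields below are illustrative assumptions.

```python
# Sketch of the health-check-and-fallback wrapper added after the PEP outage.
# Probe mechanics and the status envelope are illustrative.

def with_fallback(tool_fn, health_check, fallback_result=None):
    """Wrap a tool so an unhealthy dependency degrades loudly, not silently."""
    def wrapped(*args, **kwargs):
        if not health_check():
            # Mark the result as degraded so downstream agents cannot treat
            # a missing answer as a confident "no match".
            return {"status": "degraded", "result": fallback_result}
        return {"status": "ok", "result": tool_fn(*args, **kwargs)}
    return wrapped
```

Agents that receive a `degraded` envelope can lower their confidence score, which in turn trips the human-review threshold instead of producing confidently wrong findings.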
Every agent decision must be traceable to specific data. This is not just a regulatory requirement; it is how you debug agent behavior when something goes wrong.