Building Autonomous Compliance Agents with LangGraph
What happens when you give an LLM access to PEP databases, sanctions lists, transaction histories, and the ability to make compliance decisions? SENTINEL is a multi-agent system where five specialized AI agents collaborate through LangGraph to conduct full KYC/AML investigations autonomously.
Why a Multi-Agent Architecture?
A compliance investigation is not a single task. It is a workflow with distinct phases, each requiring different expertise and different tools. Asking a single LLM to handle everything, from parsing transaction data to querying PEP databases to drafting SAR narratives, produces mediocre results across the board.
The insight behind SENTINEL is specialization. Five agents, each with a focused role, constrained tool set, and tailored system prompt, outperform a single generalist agent by a wide margin. This mirrors how human compliance teams work: the analyst who triages alerts is not the same person who drafts SARs.
Five Specialized Agents
Alert Triage Agent (~12 sec): the first responder. It receives raw alerts and performs the initial risk assessment.
LangGraph: State Machines for Agent Coordination
The coordination layer is built on LangGraph, which models the investigation workflow as a state machine. Each state represents a phase of the investigation. Transitions between states are determined by the output of the previous agent, combined with confidence thresholds and business rules.
This is fundamentally different from chain-based approaches (LangChain's sequential chains) or simple prompt chaining. A state machine can branch, loop back, and route to human review when confidence is low. The investigation path adapts to what the agents find, rather than following a fixed sequence.
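The branching-and-looping idea can be shown without any framework at all. Below is a minimal plain-Python sketch of conditional routing between investigation phases; it deliberately avoids LangGraph's actual API, and the node names and 0.70 threshold are illustrative, not SENTINEL's real configuration.

```python
# Minimal state-machine sketch of conditional investigation routing.
# Node names and the 0.70 threshold are illustrative placeholders.

def triage(state):
    # In the real system this would be an LLM call; here we echo the alert score.
    state["confidence"] = state["alert_score"]
    return state

def route_after_triage(state):
    # Branching a linear chain cannot express: low confidence goes to a human.
    return "human_review" if state["confidence"] < 0.70 else "cdd"

def cdd(state):
    state["findings"] = ["customer profile reviewed"]
    return state

def human_review(state):
    state["escalated"] = True
    return state

NODES = {"triage": triage, "cdd": cdd, "human_review": human_review}
ROUTERS = {"triage": route_after_triage}  # nodes with conditional edges

def run(state, start="triage"):
    node = start
    while node is not None:
        state = NODES[node](state)
        node = ROUTERS[node](state) if node in ROUTERS else None
    return state
```

A low-score alert (`run({"alert_score": 0.55})`) escalates to `human_review`, while a high-score one proceeds to `cdd`; LangGraph provides the same conditional-edge pattern with persistence and checkpointing on top.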
LangGraph State Machine
The Tool Ecosystem
Each agent has access to a curated set of tools. This is not a free-for-all where every agent can call every API. Tool access is scoped by role, both for security and for quality. An agent performs better when its tool set is focused and relevant to its specific task. In the matrix below, ✓ marks a tool the agent can call and • marks one it cannot.
| Tool | Triage | CDD | TxnAnalysis | Scoring | SAR |
|---|---|---|---|---|---|
| Transaction history API | ✓ | • | ✓ | • | ✓ |
| Customer profile DB | ✓ | ✓ | • | • | ✓ |
| AfricaPEP API | • | ✓ | • | • | • |
| Sanctions screening | • | ✓ | • | • | • |
| Adverse media search | • | ✓ | • | • | • |
| Neo4j graph query | • | ✓ | ✓ | • | • |
| ML risk model | • | • | • | ✓ | • |
| SAR template engine | • | • | • | • | ✓ |
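The access matrix can be enforced with a simple role-scoped registry. The sketch below mirrors the table above; the registry mechanics and tool identifiers are illustrative, not SENTINEL's actual implementation.

```python
# Role-scoped tool registry mirroring the access matrix above.
# Agent roles and tool names follow the table; everything else is illustrative.

TOOL_ACCESS = {
    "triage": {"transaction_history", "customer_profile"},
    "cdd": {"customer_profile", "africapep", "sanctions_screening",
            "adverse_media", "graph_query"},
    "txn_analysis": {"transaction_history", "graph_query"},
    "scoring": {"ml_risk_model"},
    "sar": {"transaction_history", "customer_profile", "sar_template"},
}

def tools_for(agent, all_tools):
    """Return only the tools this agent role is permitted to call."""
    allowed = TOOL_ACCESS[agent]
    return {name: fn for name, fn in all_tools.items() if name in allowed}
```

Binding each agent to `tools_for(role, registry)` at construction time means an out-of-scope call fails at lookup rather than depending on the prompt to forbid it.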
Guardrails and Human-in-the-Loop
Autonomous does not mean unsupervised. We built three layers of guardrails:
Every agent outputs a confidence score alongside its findings. When confidence drops below 70% at any stage, the investigation routes to human review. The agent provides its partial findings and specific questions for the human analyst, so the handoff is efficient rather than starting from scratch.
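The handoff described above can be sketched as a small data structure: when confidence falls below the floor, the agent packages partial findings and targeted questions instead of a conclusion. Field names and the threshold constant are assumptions for illustration.

```python
# Sketch of the human-review handoff: below-threshold stages hand over partial
# findings plus specific questions for the analyst. Names are illustrative.
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.70  # threshold stated in the text

@dataclass
class Handoff:
    stage: str
    confidence: float
    partial_findings: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

def maybe_escalate(stage, confidence, findings, questions):
    """Return a Handoff for the analyst when confidence is below the floor."""
    if confidence < CONFIDENCE_FLOOR:
        return Handoff(stage, confidence, findings, questions)
    return None  # stay autonomous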
Agents can read data and generate recommendations, but they cannot take irreversible actions. Closing a case as non-suspicious requires human confirmation. Filing a SAR requires human approval and signature. The agents accelerate the investigation; humans make the final call.
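This guardrail amounts to an action gate: dispositions the text marks as irreversible require an explicit human approval token before execution. The action names and gate mechanics below are an illustrative sketch, not the production code.

```python
# Sketch of the "no irreversible actions" guardrail. Action names illustrative.

IRREVERSIBLE = {"close_case", "file_sar"}

class HumanApprovalRequired(Exception):
    pass

def execute(action, approved_by=None):
    """Run an action; irreversible ones need a named human approver."""
    if action in IRREVERSIBLE and approved_by is None:
        raise HumanApprovalRequired(f"{action} needs human sign-off")
    suffix = f" (approved by {approved_by})" if approved_by else ""
    return f"{action} executed{suffix}"
```

Raising instead of silently skipping keeps the refusal visible in logs, and the approver's identity travels with the action for the audit trail.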
Every factual claim in an agent's output is traced back to a specific tool call and response. If an agent asserts that a customer appeared on a sanctions list, the system verifies that the sanctions screening tool actually returned that result. Ungrounded claims are flagged and the finding is marked as unverified.
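The grounding check can be sketched as a join between asserted claims and the recorded tool-call log; anything without a matching response is flagged rather than accepted. The log and claim schemas here are assumptions for illustration.

```python
# Sketch of claim grounding: each factual assertion must point at a recorded
# tool call whose response actually supports it. Schemas are illustrative.

def ground_claims(claims, tool_log):
    """Split claims into (verified, unverified) against the tool-call log."""
    responses = {(c["tool"], c["query"]): c["response"] for c in tool_log}
    verified, unverified = [], []
    for claim in claims:
        key = (claim["source_tool"], claim["source_query"])
        if responses.get(key) == claim["evidence"]:
            verified.append(claim)
        else:
            unverified.append(claim)  # flagged, never silently accepted
    return verified, unverified
```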
The SAR Generation Pipeline
The SAR Drafting Agent is the most complex of the five. Writing a Suspicious Activity Report requires synthesizing findings from all previous agents into a coherent narrative that meets regulatory formatting requirements. The narrative must be factual, precise, and reference specific transactions and evidence.
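The synthesis step can be sketched as merging each upstream agent's findings into a structured skeleton that forces every statement to carry an evidence reference, before any templating. Section layout and field names are illustrative assumptions.

```python
# Sketch of SAR narrative synthesis: upstream findings are merged into a
# skeleton where every claim carries an evidence reference. Names illustrative.

def draft_sar_narrative(case_id, findings_by_agent):
    """Build a narrative skeleton from per-agent findings with evidence refs."""
    sections = [f"Case {case_id}"]
    for agent, findings in findings_by_agent.items():
        lines = "; ".join(
            f"{f['claim']} (ref: {f['evidence_ref']})" for f in findings
        )
        sections.append(f"{agent}: {lines}")
    return "\n".join(sections)
```

Keeping the evidence reference attached at this stage is what lets the grounding guardrail re-verify the finished narrative line by line.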
[Chart: SAR Draft Quality Metrics]
Results
After three months of parallel running, with agents and human analysts independently working the same cases, the results were clear. The multi-agent system completed investigations in an average of 20 minutes versus 4 hours for human analysts. Agreement with human decisions was 87%. In the 13% of cases where they disagreed, a review panel found the agents were correct 41% of the time, so the effective accuracy gap is smaller than the headline number suggests.
The more interesting metric is investigation quality. Human analysts, under pressure to clear alert queues, often took shortcuts: skipping peer group analysis, not checking beneficial ownership, or writing minimal SAR narratives. The agents never skip steps. They run the full investigation workflow every time, producing more thorough and consistent work product.
What We Learned About Agent Reliability
A single agent with all tools performed at 61% accuracy. Five specialized agents hit 87%. The improvement comes from focused context, constrained tool sets, and tailored prompts.
Linear chains cannot handle the branching logic of real investigations. State machines let us model conditional routing, human checkpoints, and retry logic cleanly.
The agents are only as good as the tools they call. When our PEP database had a 15-minute outage, the CDD Agent produced confidently wrong results. We added health checks and fallback behaviors for every tool.
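The fix after the PEP outage can be sketched as a wrapper that probes a tool's health and degrades loudly instead of returning a confident-looking empty result. The probe and status fields below are illustrative assumptions.

```python
# Sketch of the health-check-and-fallback wrapper added after the PEP outage.
# Probe mechanics and the status envelope are illustrative.

def with_fallback(tool_fn, health_check, fallback_result=None):
    """Wrap a tool so an unhealthy dependency degrades loudly, not silently."""
    def wrapped(*args, **kwargs):
        if not health_check():
            # Mark the result as degraded so downstream agents cannot treat
            # a missing answer as a confident "no match".
            return {"status": "degraded", "result": fallback_result}
        return {"status": "ok", "result": tool_fn(*args, **kwargs)}
    return wrapped
```

Agents that receive a `degraded` envelope can lower their confidence score, which in turn trips the human-review threshold instead of producing confidently wrong findings.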
Every agent decision must be traceable to specific data. This is not just a regulatory requirement; it is how you debug agent behavior when something goes wrong.