Multi-Agent AI Is a Distributed Systems Problem
Give two AI agents access to the same git repo and you will rediscover every problem from distributed systems textbooks. Duplicate work, conflicting changes, tasks claimed but never completed, agents that crash silently while their locks persist. We solved these problems decades ago. The AI hype cycle just convinced everyone that throwing more compute at coordination would work. It does not.
Christian Bourlier at rezzed.ai puts it plainly: "The problems aren't AI problems. They're distributed systems problems. CAP theorem still applies. Agent coordination systems are distributed systems whether we acknowledge it or not." After building and operating a production multi-agent system across hundreds of agent sessions, he found that teams debugging agent failures by tweaking prompts are almost always solving the wrong problem.
The Failure Modes
The mapping from distributed systems to multi-agent AI is uncomfortably precise.
Stale locks. An agent claims a task, then crashes or exhausts its context window. The task remains marked as "in progress" forever. No other agent will touch it. In distributed databases, this is the exact same problem: a process acquires a lock, dies, and the lock never releases. The solution is the same too: time-limited leases rather than permanent locks. If an agent does not heartbeat within the lease window, the task returns to the queue.
Split brain. Two agents modify the same file simultaneously. One writes a refactor, the other writes a bugfix. Git merges them with conflicts. The codebase enters an inconsistent state that neither agent fully understands. This is the network partition scenario where two nodes both believe they are the leader. Without a single coordination point or transactional claiming, it is inevitable.
Cascade failure. One agent produces subtly incorrect output, like a function with the wrong signature. Downstream agents build on top of it and propagate the error. By the time a human spots the problem, five agents have built code on a broken foundation.
Byzantine fault. An agent hallucinates a solution that looks plausible but is wrong. It reports the task as complete with confidence. The orchestrator accepts the output because it has no way to verify it independently. This is the Byzantine Generals Problem exactly: a component that sends contradictory or false information while appearing to function correctly. Trust but verify is not optional here.
Thundering herd. A shared dependency fails, say an API rate limit is hit. All agents detect the failure and retry simultaneously, making the problem worse. This is the classic retry storm. Exponential backoff and jitter solve it, but only if you remember to implement them.
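The backoff-and-jitter fix takes only a few lines. This is a generic full-jitter sketch, not code from either author's system:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: wait a random amount between
    0 and min(cap, base * 2**attempt) seconds before retrying.

    Randomising the delay spreads retries out so agents that failed
    at the same instant do not retry at the same instant."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

The jitter matters as much as the exponent: pure exponential backoff without randomisation just synchronises the herd at longer intervals.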
The Solutions That Already Exist
None of these require novel AI research. The distributed systems community built the answers decades ago.
Lease-based task claiming. Tasks are claimed with a time limit, not locked permanently. Bourlier recommends a fifteen-minute claim timeout. If the agent does not complete or renew the lease, a reclamation process returns the task to the available pool. This prevents the stale lock problem entirely.
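A minimal in-memory sketch of lease-based claiming, using the fifteen-minute timeout Bourlier recommends. The task representation and function names here are illustrative assumptions, not his implementation:

```python
import time

LEASE_SECONDS = 15 * 60  # the fifteen-minute claim timeout from the article

def claim(task, agent_id, now=None):
    """Claim a task only if it is unclaimed or its lease has expired."""
    if now is None:
        now = time.time()
    if task.get("owner") and now < task["lease_expires"]:
        return False  # someone else holds a live lease
    task["owner"] = agent_id
    task["lease_expires"] = now + LEASE_SECONDS
    return True

def renew(task, agent_id, now=None):
    """Extend the lease while the agent is still alive and working."""
    if now is None:
        now = time.time()
    if task.get("owner") != agent_id:
        return False
    task["lease_expires"] = now + LEASE_SECONDS
    return True
```

Note that no explicit "release on crash" code exists: a dead agent simply stops renewing, and the expired lease makes the task claimable again.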
Heartbeat monitoring. Agents periodically signal that they are alive and working. When the heartbeat stops, the system knows the agent is dead and can release its resources. This turns silent agent crashes from invisible failures into detectable, recoverable events.
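A heartbeat registry can be as simple as a map from agent ID to last-seen timestamp. The 30-second interval and three-missed-beats threshold below are assumptions for the sketch, not figures from the article:

```python
import time

HEARTBEAT_INTERVAL = 30            # assumed beat frequency, seconds
DEAD_AFTER = 3 * HEARTBEAT_INTERVAL  # assumed: three missed beats = dead

def beat(registry, agent_id, now=None):
    """Record that an agent is alive and working."""
    registry[agent_id] = now if now is not None else time.time()

def dead_agents(registry, now=None):
    """Return agents whose heartbeats have gone silent for too long."""
    if now is None:
        now = time.time()
    return [a for a, last in registry.items() if now - last > DEAD_AFTER]
```

A reclamation loop would call dead_agents periodically and release the leases held by anything it returns.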
Message queues. Agents should not call each other directly. Bourlier found that direct agent-to-agent calls create a mesh topology with O(N²) connections. Replacing those with asynchronous message passing through a durable relay reduces complexity to O(N). It also provides natural backpressure, delivery guarantees, and dead-letter queues for messages that cannot be processed.
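An in-process sketch of the relay idea, with Python's standard queue standing in for a durable broker. A production system would use something with persistence and dead-letter support; the point here is the topology, where every agent talks to one relay instead of to every other agent:

```python
import queue

relay = queue.Queue()  # one shared relay replaces N*(N-1) direct links

def send(sender, recipient, body):
    """Publish a message to the relay instead of calling the recipient."""
    relay.put({"from": sender, "to": recipient, "body": body})

def drain(recipient):
    """Pull every message addressed to this agent; requeue the rest.
    Single-threaded sketch only: a real broker routes per-recipient."""
    mine, others = [], []
    while not relay.empty():
        msg = relay.get()
        (mine if msg["to"] == recipient else others).append(msg)
    for msg in others:
        relay.put(msg)
    return mine
```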
Circuit breakers. When an agent or external dependency fails repeatedly, stop sending it work temporarily. This prevents cascade failures from propagating and gives the failing component time to recover. Without circuit breakers, a single misbehaving agent can exhaust the entire system's capacity.
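A minimal counting circuit breaker, assuming a consecutive-failure threshold and a fixed cooldown; both values are illustrative, not from the article:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; refuse work for
    `cooldown` seconds, then let one probe request through."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        """Should we send this component more work right now?"""
        if self.opened_at is None:
            return True
        if now is None:
            now = time.time()
        if now - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: allow a probe
            self.failures = 0
            return True
        return False

    def record(self, success, now=None):
        """Report the outcome of a call through the breaker."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now if now is not None else time.time()
```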
Centralised problem and solution log. Naren Yellavula, who built a compiler using a multi-agent pipeline, found that the most expensive failure mode is agents independently discovering and solving the same problem. His solution: a shared PROBLEMS.md file that every agent scans before starting work and writes to after resolving a novel issue. Over time this becomes institutional knowledge that survives across agent sessions and model upgrades.
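The mechanics reduce to scan-before-work and append-after-fix. This sketch assumes a markdown log with one `##` heading per problem; the exact format Yellavula uses is not specified in the article:

```python
from pathlib import Path

def known_solution(log: Path, symptom: str):
    """Scan the shared problem log for a previously recorded fix.
    Agents call this before starting work on an error they hit."""
    if not log.exists():
        return None
    for entry in log.read_text().split("\n## ")[1:]:
        if symptom.lower() in entry.lower():
            return entry
    return None

def record_solution(log: Path, symptom: str, fix: str):
    """Append a newly solved problem so later agents skip rediscovery."""
    with log.open("a") as f:
        f.write(f"\n## {symptom}\n{fix}\n")
```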
Real Production Patterns
The teams running multi-agent systems in production have converged on a handful of patterns that work.
The supervisor pattern. A single coordinator agent decomposes work and assigns tasks. Worker agents pull tasks from a queue, execute them, and report results. The coordinator never writes production code directly. Yellavula enforces this constraint explicitly: an orchestrator that can also write code will always take the shortcut of writing code. The constraint is the feature.
Claim-based ownership with transactions. When an agent claims a task, the operation is atomic. Two agents cannot claim the same task because the transaction serialises access. Bourlier's implementation uses database transactions so that a race condition results in one agent getting the task and the other receiving null. The loser moves on to the next available task.
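With SQLite standing in for the database (Bourlier's actual store is not specified), the compare-and-set shape looks like this:

```python
import sqlite3

def claim_next_task(db, agent_id):
    """Atomically claim one unowned task; the loser of a race gets None."""
    with db:  # the transaction serialises concurrent claims
        row = db.execute(
            "SELECT id FROM tasks WHERE owner IS NULL LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing available: move on
        # the repeated `owner IS NULL` guard makes the UPDATE a compare-and-set
        updated = db.execute(
            "UPDATE tasks SET owner = ? WHERE id = ? AND owner = ''"
            " OR id = ? AND owner IS NULL",
            (agent_id, row[0], row[0]),
        ).rowcount
        return row[0] if updated else None
```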
Confidence scoring on outputs. Yellavula uses Greptile for AI-powered code review with a confidence score attached. A score of four or five out of five means the review found no significant issues and a human can merge quickly. Lower scores flag PRs that need genuine human attention. This turns code review from an all-or-nothing bottleneck into a triage system: most agent PRs get fast-tracked, the risky ones get eyes on them.
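The triage itself is a threshold check. This sketch assumes review results arrive as (PR, score) pairs and uses the four-out-of-five cutoff described above; the function shape is an assumption:

```python
def triage(pr_reviews, fast_track_threshold=4):
    """Split agent PRs by review confidence: high scores fast-track
    to merge, low scores queue for genuine human attention."""
    fast, manual = [], []
    for pr, score in pr_reviews:
        (fast if score >= fast_track_threshold else manual).append(pr)
    return fast, manual
```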
Contract-first handoffs. Yellavula defines interfaces between components before any agent starts implementing. Breaking interface changes mid-sprint create cascade failures that are hard to recover from. The same principle applies in human teams, but with AI agents the consequences are worse because agents will not raise their hand and say they are confused.
Framework Landscape
The current generation of multi-agent frameworks each takes a different approach. LangGraph models agent workflows as directed graphs, which suits complex branching logic and conditional routing. CrewAI uses a role-based metaphor where agents have defined personas and responsibilities, making it the lowest barrier to entry. OpenAI's Agents SDK, Google's ADK, and Anthropic's Agent SDK each provide lightweight orchestration layers with varying degrees of built-in state management.
The key insight from production teams is that the framework matters less than the coordination primitives. You can build a reliable multi-agent system on any of these frameworks, or on none of them. What determines success is whether you have implemented lease-based claiming, heartbeats, message queues, and idempotency. Bourlier estimates retrofitting a proper coordination layer onto an existing agent system takes about a month of focused engineering time. The payoff is scaling from five agents to fifty without rewriting the coordination layer.
The Mental Model Shift
Yellavula frames it as: "I run an engineering organisation where some of the engineers are AI agents." The practices that make human teams work apply directly. Sprint planning, code review, commit discipline, shared knowledge repositories, integration testing: these are not overhead to be eliminated. They are the coordination layer that prevents chaos.
The difference is that AI agents are cheaper to run in parallel, never get distracted, and have no ego about being corrected by a shared problem repository. But they also have no conscience about breaking rules, no intuition for when something feels wrong, and no ability to escalate ambiguity to a human without being explicitly programmed to do so.
Practical Starting Point
Do not start with twenty agents. Start with a task queue that is pull-based, transactional, and idempotent. This alone eliminates most coordination bugs. Then add heartbeat monitoring so crashed agents release their work. Then add the shared problem log so agents stop rediscovering the same solutions. Only then should you think about adding more agents.
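Idempotency is the piece most often skipped. The sketch below dedupes by task ID so a re-delivered task is acknowledged rather than re-executed; a production system would persist the processed set rather than hold it in memory:

```python
processed = set()  # durable storage in practice; in-memory for the sketch

def run_once(task_id, handler):
    """Idempotent execution: duplicates are acknowledged, not re-run.
    This makes at-least-once delivery from the queue safe."""
    if task_id in processed:
        return "duplicate"
    result = handler()
    processed.add(task_id)  # mark done only after the handler succeeds
    return result
```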
Bourlier's recommended order is: task queue first, message relay second, persistent agent identity third, instrumentation last. You cannot optimise what you cannot measure, but you also cannot measure a system that falls apart under basic coordination stress.
The bottleneck is not AI capability. It is discipline in the process. The patterns are known, the libraries exist, and the failure modes are predictable. The hard part is getting them right in your stack, with your constraints, at your scale. That has always been the hard part of distributed systems, and it is the hard part of multi-agent AI now.
Further reading: Christian Bourlier's full architecture review at rezzed.ai and Naren Yellavula's multi-agent project breakdown at yella.dev.