I’ve now built three different multi-agent systems in production. Each time, I’ve made roughly the same mistakes in a different order. This post is an attempt to write down the things I wish I’d known before starting the third one.
The loop that eats itself
The core of most agentic systems is a loop: give the model a context, let it call tools, observe the results, feed them back in. Simple enough on a whiteboard. In production, this loop has two failure modes that sound obvious in retrospect but cost me days each.
The first is unbounded recursion through tool calls. If your agent can call a tool that enqueues another agent task, and that task can call the same tool, you don’t have a DAG — you have a potential cycle. You need to enforce a max depth at the harness level, not trust the model to stop.
The model doesn’t know it’s in a loop. It only sees its current context window. Loop detection is your job.
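A minimal sketch of what harness-level enforcement can look like. The `Task` shape, `enqueue` helper, and `MAX_DEPTH` value are all illustrative assumptions, not from any particular framework:

```python
# Sketch: refuse to spawn child agent tasks past a fixed depth.
# Task, enqueue, and MAX_DEPTH are hypothetical names for illustration.
from dataclasses import dataclass

MAX_DEPTH = 3  # assumed limit; tune per workload


@dataclass
class Task:
    prompt: str
    depth: int = 0  # how many spawns deep this task is


def enqueue(queue: list, parent: Task, prompt: str) -> None:
    """Spawn a child task, refusing once the depth limit is hit."""
    if parent.depth + 1 > MAX_DEPTH:
        raise RuntimeError(
            f"max agent depth {MAX_DEPTH} exceeded; refusing to spawn"
        )
    queue.append(Task(prompt=prompt, depth=parent.depth + 1))
```

The point is that the counter lives in the task metadata the harness controls, not in anything the model can see or forget.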
The second is context bloat. Tool results get appended to the context on every iteration. By turn 8, you’re passing 40k tokens of intermediate state to a model that only needed the final value from step 3. Summarize aggressively, or better yet, structure your tools to return only what the next step needs.
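One cheap version of aggressive summarization is to stub out stale tool results before each model call. This is a sketch under an assumed message shape (`{"role", "content"}` dicts); adapt it to whatever your API actually passes:

```python
# Sketch: replace all but the most recent tool results with a stub,
# so old intermediate state stops riding along in every request.

def prune_context(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Keep the last `keep_last` tool results verbatim; stub the rest.

    Older tool outputs rarely matter verbatim; a one-line stub records
    that the call happened without carrying its full payload forward.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    return [
        {"role": "tool", "content": "[result elided]"} if i in stale else m
        for i, m in enumerate(messages)
    ]
```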
Retries are not free
When a tool call fails, the instinct is to retry. And often that’s right. But there’s a class of tool — anything with side effects — where retrying on failure is dangerous. If your agent called send_email() and got a timeout, you don’t know if the email sent. Retrying sends it twice.
The solution I’ve landed on: separate your tools into idempotent and effectful at registration time. The harness retries idempotent tools automatically. Effectful tools surface the error to the model and let it decide.
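A sketch of what that split looks like in practice, assuming a hand-rolled registry (the `register`/`call_tool` names and return shape are mine, not a library's):

```python
# Sketch: record idempotency at registration time; only idempotent
# tools are retried by the harness. Effectful failures go back to the
# model as data instead of being retried blindly.
import time

TOOLS: dict[str, dict] = {}


def register(name: str, fn, *, idempotent: bool) -> None:
    TOOLS[name] = {"fn": fn, "idempotent": idempotent}


def call_tool(name: str, *args, retries: int = 3, **kwargs) -> dict:
    tool = TOOLS[name]
    attempts = retries if tool["idempotent"] else 1  # effectful: one shot
    for attempt in range(attempts):
        try:
            return {"ok": True, "result": tool["fn"](*args, **kwargs)}
        except Exception as e:
            if attempt + 1 < attempts:
                time.sleep(0.01 * 2 ** attempt)  # brief backoff, then retry
                continue
            # Effectful tool, or retries exhausted: surface to the model.
            return {"ok": False, "error": str(e)}
```

Note that a timed-out `send_email` comes back as `{"ok": False, ...}` after exactly one attempt, and it is the model's job to decide whether to check delivery status before trying again.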
The orchestration layer is not the agent
The temptation is to put business logic in the orchestrator. But that logic belongs in the system prompt or in a tool, not in the harness code. A harness that’s easy to reason about does exactly three things: manages context, routes tool calls, and enforces limits.
# Too much logic in the harness
if "DONE" in last_message:
    return result
elif "NEEDS_HUMAN" in last_message:
    escalate()

# Better: let the model use a structured finish tool
# finish(status="done", result=...) | finish(status="escalate", reason=...)
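On the harness side, the finish tool is just another tool call; the loop stops when it sees one instead of pattern-matching on message text. A minimal sketch, with the handler name and payload shape assumed:

```python
# Sketch: the harness routes tool calls and stops on `finish`. It never
# inspects message text; the model signals completion through the tool.

def handle_tool_call(name: str, args: dict):
    """Return (done, payload). The loop only checks `done`."""
    if name == "finish":
        if args.get("status") == "escalate":
            return True, {"escalated": True, "reason": args.get("reason")}
        return True, {"escalated": False, "result": args.get("result")}
    # ...every other tool is routed to the registry as usual...
    return False, None
```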
None of this is revolutionary. It’s the kind of thing that seems obvious once you’ve debugged a 3am incident where your agent sent 200 emails because a retry loop hit a flaky SMTP server.