Stop Giving AI Agents Job Titles
What a 25,000-run study reveals about coordination, runtime specialisation, and why protocol design matters more than org charts.
Most multi-agent systems are designed the way bad enterprises are organised: with a neat org chart, a lot of assigned roles, and a deep belief that more coordination must mean better outcomes.
One agent is the planner. Another is the researcher. Another is the coder. Another is the reviewer. It feels rigorous. It feels scalable. It feels like engineering.
But according to a large new study of more than 25,000 multi-agent runs, that instinct is often wrong.
The paper’s core finding is simple and surprisingly sharp: the best multi-agent systems are not the ones with the strongest central boss, and they are not the ones with the most unconstrained autonomy either. They are the ones built around the right coordination protocol, with just enough structure to help agents work together and just enough freedom to let useful roles emerge on their own.
For software engineers, that is the headline. If you are building agentic systems, the most important design choice may not be which model you use. It may be how your agents are allowed to interact.
The paper in one sentence
The paper compares different ways of organising groups of AI agents across thousands of tasks, and finds that a constrained sequential coordination pattern consistently outperforms both top-down centralised control and fully shared autonomous collaboration.
That is a big deal, because most of the industry discussion around agents still focuses on model capability, prompt cleverness, and tool access. This paper argues that the execution model itself is a first-order variable.
Not an implementation detail. A first-order design choice.
What the researchers actually tested
This was not a toy comparison between two cute prompting styles.
The study ran more than 25,000 experiments across:
- 8 different language models
- teams ranging from 4 to 256 agents
- 8 coordination protocols
- multiple task domains
The goal was to test how performance changes when you vary not just the model, but the organisational structure governing the agents.
That distinction matters. In most engineering teams, we already accept that architecture shapes behaviour. A service mesh, an event bus, a leader-election algorithm, or a consensus protocol does not just move data around. It changes what the system can reliably do.
The paper makes the same point for agent systems. Coordination is not a wrapper around intelligence. Coordination is part of the intelligence.
The most important finding: neither hierarchy nor chaos wins
The standout result is what the paper describes as a kind of middle path.
A sequential coordination protocol beat centralised coordination by 14%.
It beat fully autonomous shared coordination by 44%.
That should make a lot of people uncomfortable, because it cuts against two popular intuitions at once.
The first intuition is managerial: if many agents are involved, surely a strong central planner should help. Give one agent the big picture, let it assign work, and keep the rest aligned.
The second intuition is romantic: if agents are smart enough, surely the best thing is to let them all collaborate freely in a shared workspace and let intelligence emerge.
The paper finds that both extremes underperform.
Why? Because each creates a different failure mode.
Centralised systems bottleneck. One agent becomes the coordination chokepoint. That can reduce the confusion of parallel work, but it also limits adaptation, runtime specialisation, and local initiative.
Fully shared autonomous systems, on the other hand, often create noise. Everyone sees everything. Everyone can act. Coordination overhead rises. Signal quality drops. You get something that looks collaborative but behaves like a distributed system without backpressure, ownership, or clean handoffs.
The sequential protocol appears to work because it imposes enough order to keep the system legible while still letting useful behaviour emerge inside that structure.
Software engineers should recognise this pattern immediately. Many real systems do not work best at either extreme. Pure centralisation creates bottlenecks. Pure decentralisation creates coordination costs. The best designs often introduce constrained interfaces, bounded handoffs, and disciplined state transitions.
This paper suggests agent systems are no different.
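To make the contrast concrete, here is a minimal sketch of a sequential protocol. This is not the paper's implementation; the agent functions are toy stand-ins for model calls, and the point is only the shape of the interaction: each agent receives exactly one completed artefact and hands off exactly one.

```python
from typing import Callable

# An agent is anything that maps a completed artefact to a new one.
Agent = Callable[[str], str]

def run_sequential(agents: list[Agent], task: str) -> str:
    """Sequential protocol: each agent sees only the prior completed output."""
    artefact = task
    for agent in agents:
        artefact = agent(artefact)  # bounded handoff: one input, one output
    return artefact

# Toy stand-ins for model invocations
draft = lambda t: f"draft({t})"
review = lambda t: f"review({t})"

print(run_sequential([draft, review], "spec"))  # review(draft(spec))
```

Notice what the structure forbids: no agent can see another agent's partial reasoning, only its committed output. That constraint is the whole protocol.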
Stop designing agent teams like company org charts
One of the most interesting findings is that useful roles often emerged endogenously rather than needing to be preassigned.
In plain English: the best systems did not always need you to tell one agent to be “the strategist” and another to be “the critic.” Given the right protocol, agents often specialised on their own.
This is a direct challenge to one of the most common patterns in agent demos: the fixed-role team.
You have seen this setup before:
- planner agent
- researcher agent
- coder agent
- reviewer agent
- judge agent
It is tidy, easy to explain, and often good for presentations. But the paper suggests it may be the wrong default for serious systems.
Why? Because fixed roles assume you already know the optimal decomposition of the task. In practice, that is often false. Real tasks are messy. The useful partitioning of work may depend on the input, the intermediate outputs, the model quality, and what the system learns mid-run.
For engineers, the implication is clear: prefer protocols that allow runtime specialisation over architectures that hard-code a static division of labour too early.
In distributed systems terms, this is the difference between assigning responsibilities at deploy time and allowing them to emerge from workload and state.
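One way to sketch that difference in code (a hypothetical interface, not the paper's mechanism): instead of preassigning roles, let each agent estimate its usefulness for the current artefact and have the highest scorer take the next step. Here the capability estimate is a crude keyword match; a real system would ask the model itself.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    keyword: str  # crude capability proxy; a real system would query the model

    def estimate_value(self, artefact: str) -> float:
        return 1.0 if self.keyword in artefact else 0.0

def pick_next_agent(agents: list[Agent], artefact: str):
    """Roles emerge from state: the agent claiming the most value acts next."""
    best = max(agents, key=lambda a: a.estimate_value(artefact))
    return best if best.estimate_value(artefact) > 0 else None  # None: nobody acts

agents = [Agent("tester", "bug"), Agent("writer", "docs")]
print(pick_next_agent(agents, "fix the login bug").name)  # tester
```

The assignment happens per artefact, at runtime, from workload and state, rather than once at design time.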
Completed work is a better coordination primitive than intentions
Another powerful idea in the paper is that coordination worked best when agents interacted through completed outputs rather than through unrestricted mutual awareness.
That may sound abstract, but it maps cleanly onto software practice.
A lot of failed multi-agent designs resemble a group chat. Every agent can see every other agent’s thoughts, plans, partial drafts, and intentions. That feels collaborative. But it is also a recipe for context pollution.
By contrast, a sequential or bounded protocol forces agents to coordinate via artefacts that are already shaped enough to be useful: drafts, decisions, summaries, transformations, critiques.
This is closer to how robust engineering systems operate. Services do not usually coordinate by sharing raw internal thought. They coordinate via stable interfaces, committed state, and explicit outputs.
The paper’s results suggest that agent systems also benefit from that discipline.
If you are building an agent workflow today, one practical takeaway is this: reduce free-form cross-agent chatter. Increase structured handoffs.
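A structured handoff can be as simple as a typed, immutable artefact. This is a sketch under assumed names (the `Handoff` type and the transformation are illustrative, not from the paper): the receiving agent is given the committed artefact and nothing else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """A completed artefact: the only thing the next agent may see."""
    producer: str
    kind: str       # e.g. "draft", "critique", "summary"
    content: str

def critique_step(agent_name: str, incoming: Handoff) -> Handoff:
    # The agent works only from the committed artefact, never from other
    # agents' raw intermediate thoughts (hypothetical transformation).
    return Handoff(producer=agent_name, kind="critique",
                   content=f"critique of: {incoming.content}")

h = Handoff(producer="writer", kind="draft", content="first draft")
print(critique_step("reviewer", h).content)  # critique of: first draft
```

Making the artefact frozen is deliberate: downstream agents consume it, they do not mutate it, which is exactly the "committed state, stable interface" discipline the analogy points at.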
More agents do not automatically mean better results
Another finding that engineers will appreciate: scale had limits.
Increasing the number of agents from 64 to 256 produced no significant quality gain.
This matters because a lot of agent design still treats agent count as a proxy for sophistication. If one agent is useful and eight agents are more impressive, then surely 128 agents must be even better.
Not necessarily.
Past a certain point, you are not adding intelligence. You are adding overhead.
Again, this should feel familiar. We already know this from distributed systems. More nodes do not always mean more throughput. More services do not always mean cleaner architecture. More concurrency does not always mean better latency. Once coordination costs dominate, extra components can make the system worse, not better.
The same appears to be true here. Agent count is not the metric that matters. Productive coordination is.
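A back-of-envelope way to see why count stops helping (a toy model, not the paper's analysis): in a fully shared workspace, potential communication channels grow quadratically with agent count, while useful work grows at best linearly.

```python
def channels(n: int) -> int:
    """Pairwise communication channels among n agents in a shared workspace."""
    return n * (n - 1) // 2

for n in (4, 64, 256):
    print(n, channels(n))
# 4 agents -> 6 channels; 64 -> 2016; 256 -> 32640
```

Going from 64 to 256 agents quadruples the workforce but multiplies potential coordination channels by about sixteen. At some point the channels, not the agents, dominate the system's behaviour.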
Model quality still matters, but not in the way people think
The paper does not say models are irrelevant. Far from it.
It finds that stronger models are better able to take advantage of looser coordination and emergent structure, while weaker models often benefit from more scaffolding.
This is a subtle but important point.
A powerful model can navigate ambiguity, infer latent roles, and decide when not to contribute. A weaker model may just produce more noise when given that same freedom.
So the right protocol is not universal in a simplistic sense. It interacts with model capability.
For engineering teams, that means architecture should be chosen jointly with model selection. You should not just ask, “What is the best model?” You should ask, “What coordination pattern does this model support reliably?”
That is a much more mature systems question.
The cost story is also changing
One of the paper’s most pragmatic findings is economic: open-source models reached roughly 95% of top closed-model performance at much lower cost.
That number will get attention, and rightly so.
If true for your use case, it changes the optimisation problem. You may not need the absolute best frontier model if protocol design can recover much of the remaining gap. For many production systems, a slightly weaker but dramatically cheaper model inside a better coordination structure may be the better engineering choice.
This is especially relevant for agent systems, where cost multiplies quickly. Once you move from one model invocation to dozens or hundreds, the economics of orchestration start to matter as much as the raw quality of any single response.
The strangest and maybe most human result
Perhaps the most interesting behavioural finding is that agents sometimes chose to abstain.
In some setups, agents effectively declined to contribute when they had little value to add.
That is fascinating, because it suggests a more mature form of coordination than simple participation. The system is not just dividing work. It is learning when extra input is redundant.
In software terms, this is the opposite of many badly designed systems, where every component insists on doing something just because it exists.
An agent that knows when not to act may be more valuable than one that is always active.
That should influence how we think about agent evaluation. The goal is not maximal participation. The goal is maximal useful contribution.
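Abstention is easy to express in protocol terms. The sketch below is a toy heuristic (the self-assessment interface and threshold are assumptions, not the paper's mechanism): an agent contributes only when its expected value clears a bar, and otherwise passes the artefact through unchanged.

```python
class ToyAgent:
    def estimate_value(self, artefact: str) -> float:
        # Hypothetical self-assessment; a real system would ask the model
        return 0.9 if "todo" in artefact else 0.0

    def act(self, artefact: str) -> str:
        return artefact.replace("todo", "done")

def maybe_contribute(agent, artefact: str, threshold: float = 0.5) -> str:
    """Abstention: contribute only when expected value clears a bar."""
    if agent.estimate_value(artefact) < threshold:
        return artefact  # decline; the artefact passes through untouched
    return agent.act(artefact)

agent = ToyAgent()
print(maybe_contribute(agent, "todo: tests"))  # agent acts
print(maybe_contribute(agent, "all good"))     # agent abstains
```

The evaluation consequence follows directly: a pass-through is not a failure to participate, it is a correct decision not to add noise.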
What software engineers should do with this
If you are building agentic systems, the lesson from this paper is not “use sequential coordination everywhere.” It is broader than that.
The real lesson is this: coordination is an architectural concern, not a prompting afterthought.
That means:
- Design protocols, not just personas
- Prefer bounded handoffs over shared cognitive soup
- Let useful roles emerge when possible
- Do not confuse more agents with more capability
- Match protocol complexity to model capability
- Optimise for useful contribution, not visible activity
In other words, treat agent systems less like org charts and more like distributed systems.
That framing is, I think, the paper’s biggest contribution. It pulls the conversation away from anthropomorphic fluff and back toward engineering reality.
The hard problem is not just making a model smart.
The hard problem is making many model instances work together without turning intelligence into interference.
Final thought
The most common mistake in multi-agent design is assuming that intelligence scales by multiplication.
It usually does not.
This paper suggests it scales by coordination.
And that is good news for engineers, because coordination is something we actually know how to reason about. We have spent decades learning that the structure of interaction matters as much as the power of the components. The same lesson now seems to apply to AI agents.
So if your current architecture starts with assigning job titles to a swarm of agents, stop.
Start with the protocol instead.