Beyond Planner, Generator, Reviewer: Why We Are Building User-Selectable Sub-Agent Teams in HeadGym Pablo
Most agentic products today converge on a familiar workflow: plan, generate, review. It is a sensible default. A planner decomposes the task, a generator produces the artefact, and a reviewer checks or improves the output. For many tasks, that pipeline is already a meaningful improvement over a single prompt.
But once you start building serious desktop workspaces for knowledge work, that generic flow begins to show its limits.
At HeadGym Pablo, we are exploring a different direction: specialised sub-agents and teams of sub-agents that users can explicitly select for the task at hand. Instead of always routing work through the same generic planner/generator/reviewer pipeline, the workspace can expose a set of purpose-built actors: a skeptic, a fact-checker, a debate partner, a style enforcer, a refiner, a judge panel, a brand critic, or a synthesis agent. In some cases, the right abstraction is not a single agent at all, but a small team arranged in an iterative loop.
This article explains why we think this matters, why software engineers should care, and why this design may be more important for real-world agentic systems than simply waiting for a smarter base model.
The problem with the default pipeline
The planner/generator/reviewer model is attractive because it is easy to understand and easy to implement. It also maps neatly onto a common mental model of work:
- first decide what to do,
- then do it,
- then check it.
The problem is that this model assumes most tasks share the same structure. In practice, they do not.
Writing a blog post, interrogating an argument, drafting sales copy, stress-testing a product pitch, checking legal tone, and refining a technical architecture memo are not the same class of work. They are not just different prompts. They require different kinds of pressure, different evaluation criteria, different tool access, and different conversational dynamics.
A generic reviewer often collapses into vague advice:
- “make it clearer”
- “improve the flow”
- “add more detail”
- “tighten the conclusion”
That may be acceptable for lightweight assistance, but it is not enough if the goal is to build a serious agentic workspace. In many knowledge tasks, the bottleneck is not generation. The bottleneck is pressure-testing, iteration, specialisation, and selection of the right cognitive mode.
That is the context for specialised sub-agents.
What a specialised sub-agent actually is
A specialised sub-agent is not just “a prompt with a different hat on.”
A prompt can ask a model to behave critically or creatively. But a sub-agent can do more than that because it has:
- a stable role,
- a narrower objective,
- persistent state over a session,
- potentially distinct tools,
- a clearer interface contract,
- and the ability to participate in a wider workflow.
That distinction matters.
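As a deliberately minimal sketch, assuming nothing about HeadGym Pablo's internals, such a contract could be expressed like this; the class and field names are illustrative, not a real API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SubAgent:
    role: str                                                 # stable role, e.g. "skeptic"
    objective: str                                            # narrower objective
    tools: dict[str, Callable] = field(default_factory=dict)  # tools this agent may own
    memory: list[str] = field(default_factory=list)           # persistent state over a session

    def observe(self, note: str) -> None:
        """Record something the agent should carry across turns."""
        self.memory.append(note)

# Example: a skeptic that accumulates unresolved weaknesses across a session.
skeptic = SubAgent(role="skeptic", objective="surface weak or unsupported claims")
skeptic.observe("claim 2 lacks evidence")
```

The point of the contract is that role, objective, tools, and memory travel together, rather than being re-stated in every prompt.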
If a user wants to be grilled on an article they just wrote, you can absolutely do that with a prompt:
“Critically analyse this article. Point out weak arguments and ask me hard questions.”
That works once. But it remains fundamentally single-turn and weakly structured.
A specialised Skeptic Agent can do something much more useful:
- read the article,
- identify the central claims,
- check which claims depend on evidence,
- inspect tone for overstatement,
- ask one challenge question at a time,
- adapt the next question based on the user’s answer,
- keep track of unresolved weaknesses,
- and optionally use tools such as search, fact-checking, or domain-specific references.
That turns a static prompt into an interactive process.
The advantage is not just better wording. The advantage is that the system now has an actor with a job.
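To make that concrete, here is a hedged sketch of a skeptic session loop. `extract_claims` and `is_defended` are stand-in heuristics; a real implementation would call a model or a fact-checking tool at those points:

```python
def extract_claims(article: str) -> list[str]:
    # Stand-in for a model call: treat each sentence as a claim.
    return [s.strip() for s in article.split(".") if s.strip()]

def is_defended(claim: str, answer: str) -> bool:
    # Stand-in heuristic: an answer that gives a reason counts as a defence.
    return "because" in answer.lower()

def run_skeptic_session(article, answer_fn, max_questions=5):
    unresolved = extract_claims(article)      # persistent critique state
    transcript = []
    for _ in range(max_questions):
        if not unresolved:
            break
        claim = unresolved[0]
        question = f"What evidence supports: {claim!r}?"  # one challenge at a time
        answer = answer_fn(question)
        transcript.append((question, answer))
        if is_defended(claim, answer):        # adapt: drop claims the user defends
            unresolved.pop(0)
    return transcript, unresolved             # open weaknesses survive the session
```

Note that the unresolved list outlives any single exchange, which is exactly what a one-shot prompt cannot give you.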
Why sub-agents are better than role prompts
There are five major advantages.
1. Greater reasoning depth
Prompts tend to produce one-shot behaviours. Sub-agents can sustain multi-turn reasoning.
A skeptic agent can ask a question, inspect the answer, detect evasion, drill deeper, and force the user to defend the underlying claim. That is much closer to how a real editor, reviewer, or debate partner behaves.
2. Persistent state
A specialised agent can remember what has already been challenged, what the user has defended well, what remains weak, and which lines of critique are still open.
This matters because good critique is cumulative. It depends on memory.
3. Tool specialisation
This is one of the most important architectural differences. A specialised agent can have access to tools that fit its function:
- a fact-checking tool,
- a logical fallacy detector,
- a style guide validator,
- a knowledge base lookup,
- a search interface,
- a citation verifier,
- a competitor comparison tool.
A prompt cannot really “own” tools. An agent can.
4. Modular product design
Once the workspace treats these roles as composable agents, the product becomes easier to grow. You do not have to endlessly expand a monolithic system prompt to cover every possible mode of work. You can add a new specialised actor with a clear scope and known behaviour.
5. Better user intent capture
A dropdown that says “Skeptic”, “Fact Checker”, “Debate Partner”, or “Brand Voice Enforcer” is a much better expression of user intent than expecting the user to discover the perfect prompt every time.
This lowers cognitive load and raises precision at the same time.
The deeper idea: not just sub-agents, but teams of sub-agents
Once you accept that some tasks benefit from specialisation, the next step becomes obvious: some tasks benefit from multiple specialised agents working together.
This matters most for subjective work.
AutoResearch-style loops work well when the system can optimise toward a measurable score. But many important forms of work do not reduce cleanly to metrics:
- Is this argument convincing?
- Is this article interesting?
- Does this copy create trust?
- Is this positioning sharp enough?
- Does this paragraph sound authoritative or merely inflated?
These are not problems with a single scalar objective. They are problems of judgment.
That is where teams of sub-agents become compelling.
A useful example is an iterative loop like this:
- one agent writes a draft,
- another critiques it without proposing fixes,
- a third rewrites it based on the critique,
- a fourth merges the strongest parts,
- a blind judge panel selects the winner,
- the loop continues until no variant clearly beats the current best version.
This is powerful because it does not pretend subjective work can be solved with one metric. Instead, it creates a structured adversarial process that approximates how humans actually improve ideas.
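The loop above can be sketched in a few lines. `draft`, `critique`, `rewrite`, and `judge` are placeholder stubs standing in for calls to separate specialised agents; in particular, a real judge would be a blind panel, not the length comparison used here:

```python
def draft(task: str) -> str:
    return f"Draft: {task}"                   # stand-in for a Drafting Agent

def critique(text: str) -> str:
    return "too vague"                        # critique without proposing fixes

def rewrite(text: str, notes: str) -> str:
    return f"{text} [revised: {notes}]"       # rewrite based on the critique

def judge(candidate: str, best: str) -> str:
    # Stand-in for a blind judge panel: here, longer wins.
    return "candidate" if len(candidate) > len(best) else "best"

def improve(task: str, rounds: int = 3) -> str:
    best = draft(task)
    for _ in range(rounds):
        notes = critique(best)
        candidate = rewrite(best, notes)
        if judge(candidate, best) == "candidate":
            best = candidate                  # the variant beat the current best
        else:
            break                             # no variant clearly wins: stop
    return best
```

The structure, not the stubs, is the point: generation, diagnosis, revision, and selection are separate calls that can be swapped out independently.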
Why teams work better for subjective tasks
There are several reasons this architecture is promising.
They mimic real collaborative cognition
Humans rarely produce strong writing, arguments, or positioning in one pass. We draft, critique, rewrite, compare, defend, merge, and revise. The best work often emerges from tension between roles rather than from a single linear process.
Sub-agent teams bring that tension into software.
They separate incompatible cognitive tasks
Writing, criticising, synthesising, and judging are not the same task.
A single model can do all of them in principle, but not always well at the same time. When one agent is asked to generate, attack, defend, and arbitrate in one pass, the result often becomes muddled. Role separation improves clarity.
For example:
- a Drafting Agent should optimise for production,
- a Critique Agent should optimise for diagnosis,
- a Rewriter Agent should optimise for improvement,
- a Judge Agent should optimise for evaluation,
- a Merger Agent should optimise for synthesis.
Each role becomes sharper because it is narrower.
They offer multiple lenses on quality
In subjective work, there is no universal score, but there are many partial judgments:
- clarity,
- coherence,
- novelty,
- persuasiveness,
- tone,
- factual rigor,
- brand fit,
- structural flow.
A team of sub-agents allows the system to inspect the same artefact through several lenses without collapsing them into one vague “review.”
They create productive internal disagreement
This may be the most important property.
A good system should not always agree with itself. If every component is optimised to be helpful in the same generic way, the system becomes bland. A well-designed multi-agent loop introduces purposeful disagreement:
- one agent pushes for stronger claims,
- another pushes for more evidence,
- another penalises verbosity,
- another protects tone consistency,
- another prefers caution and precision.
That disagreement can produce better outcomes than consensus-by-default.
What this could look like in HeadGym Pablo
In a desktop agentic workspace, the interface matters as much as the underlying models. We are interested in sub-agents not just as a backend architecture, but as a user-facing interaction model.
A user might start with a general drafting flow and then bring in specialised actors as needed.
For example:
After drafting an article
The user invokes:
- Skeptic Agent to challenge core arguments,
- Fact Checker Agent to inspect claims,
- Narrative Flow Agent to identify structural weaknesses,
- Audience Fit Agent to judge whether it lands with the intended reader.
During idea development
The user invokes a team:
- Idea Generator
- Challenger
- Expander
- Curator
This is useful for brainstorming product names, article angles, campaign themes, or thesis development.
For argumentation
The workspace could spin up:
- Pro Agent
- Con Agent
- Synthesizer
- Moderator/Judge
This is particularly useful for strategy memos, opinion pieces, policy analysis, or internal decision support.
For marketing or positioning
A team might include:
- Drafting Agent
- Objection Agent
- Brand Voice Agent
- Compression Agent
- Judge Panel
This could be far more useful than a generic review pass, because the task is not merely to improve prose. It is to improve persuasion under constraints.
Why this is especially relevant for desktop workspaces
The desktop is an important setting for this concept because it is where high-context, multi-artefact work happens.
A browser chat window is fine for isolated prompt-response interactions. But a desktop agentic workspace can maintain richer context:
- current document state,
- supporting notes,
- prior iterations,
- source files,
- user feedback,
- selected tools,
- active agent roster,
- and session-level memory.
That environment makes sub-agents much more useful. They are no longer abstract personas floating in a chat. They become working participants inside a live workspace.
This opens up several interesting UX possibilities:
- users can choose agents from a dropdown or palette,
- agents can be activated at the start of a session or injected mid-flow,
- different agents can annotate different parts of the same document,
- teams can run asynchronously and present competing outputs,
- the user can act as a final judge,
- or the workspace can run blind evaluation and present only winning variants.
In other words, the UI becomes a coordination layer for specialised cognition.
Architecture implications for engineers
For software engineers, the important point is that this is not just a prompt design idea. It is a systems design problem.
Once you move from a single pipeline to sub-agents and agent teams, you need clearer answers to several questions.
Role definition
Each sub-agent needs a tightly defined objective. “Be helpful” is not enough. Roles should be narrow enough that behaviour is legible and testable.
Tool boundaries
Which tools does each agent have access to? A skeptic may need search and claim extraction. A style enforcer may only need the document and the style guide. A judge may need both variants but not the authorship history.
Communication protocol
How do agents exchange outputs?
Do they pass full drafts, structured critiques, ranked observations, deltas, confidence scores, or decision records? Once you have teams, message format matters.
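One hypothetical shape for such a message, assuming teams exchange structured critiques rather than free text; the field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Critique:
    target_span: str   # the passage being criticised
    diagnosis: str     # what is wrong, deliberately without a proposed fix
    severity: int      # 1 (minor) .. 5 (blocking)
    confidence: float  # how sure the critic is, 0.0 .. 1.0

# Example record passed from a Critique Agent to a Rewriter Agent.
note = Critique(target_span="intro", diagnosis="overstated claim",
                severity=4, confidence=0.8)
```

A frozen record like this also doubles as a decision log: the loop can later explain which critiques drove which rewrites.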
Iteration control
When does the loop stop?
Possible stopping rules include:
- fixed number of rounds,
- no judged improvement across two iterations,
- cost threshold exceeded,
- user intervention,
- confidence plateau,
- or deadline reached.
Without explicit control, loops become expensive and noisy.
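A minimal controller combining several of these rules might look like this; the thresholds are illustrative defaults, not tuned values:

```python
def should_stop(round_no: int, scores: list[float], cost: float, *,
                max_rounds: int = 5, budget: float = 2.0, patience: int = 2) -> bool:
    if round_no >= max_rounds:        # fixed number of rounds
        return True
    if cost > budget:                 # cost threshold exceeded
        return True
    # No judged improvement across the last `patience` iterations.
    if len(scores) > patience and all(
        s <= scores[-patience - 1] for s in scores[-patience:]
    ):
        return True
    return False
```

User intervention and deadlines would sit outside this function, as events that interrupt the loop rather than conditions it polls.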
Determinism and reproducibility
If a user asks why the system selected one variant over another, can the workspace explain it? Can it replay the loop? Can it expose the critiques and judgments that led to the final version?
These are not academic concerns. They matter for trust.
Human-in-the-loop checkpoints
For highly subjective work, user judgment is often still the highest-quality evaluator. The system should not treat the human as an interrupt. It should treat the human as part of the loop.
A practical way to think about the design space
One useful framing is this:
- use a generic pipeline when the task is broad and the user is still orienting,
- use a specialised sub-agent when the user wants a specific cognitive function,
- use a team of sub-agents when the task is subjective, iterative, and benefits from internal disagreement.
That gives the product a natural progression.
The generic pipeline gets the user started.
Specialised agents sharpen the work.
Agent teams stress-test and evolve it.
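That progression can be stated as a simple routing rule; the predicates here are placeholders for real task-classification logic:

```python
def choose_mode(task: dict) -> str:
    if task.get("subjective") and task.get("iterative"):
        return "team"              # benefits from internal disagreement
    if task.get("function"):       # user asked for a specific cognitive function
        return "sub_agent"
    return "generic_pipeline"      # user is still orienting
```

In a real workspace the user's explicit selection would override this default routing.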
This is really about interface, not just intelligence
The most interesting part of this concept is that it shifts the conversation away from “how do we get one model to do everything?” toward “how do we give users the right cognitive instruments for the job?”
That is a fundamentally different product philosophy.
A desktop agentic workspace should not just be a single brilliant assistant sitting behind a text box. It should feel more like an environment in which different forms of reasoning can be invoked, coordinated, and compared.
In that sense, sub-agents are not just implementation details. They are part of the interface.
They make the system more explicit, more legible, and potentially more powerful because they allow the user to choose not just what they want done, but how they want the system to think.
Conclusion
We are exploring specialised sub-agents and teams of sub-agents in HeadGym Pablo because generic planner/generator/reviewer pipelines are useful but insufficient.
They are insufficient not because the pattern is wrong, but because real knowledge work is not uniform. Different tasks demand different roles, different tools, different evaluation criteria, and different interaction styles. In many cases, especially in writing, argumentation, and creative problem-solving, quality emerges from structured tension rather than from a single pass.
A specialised sub-agent gives the user a sharper instrument.
A team of sub-agents gives the system a way to reason through disagreement.
That is why this concept matters.
If agentic workspaces are going to become serious environments for software engineers, writers, researchers, and strategists, they will need more than generic pipelines. They will need modular actors, explicit roles, controllable loops, and interfaces that let users orchestrate them deliberately.
That is the direction we are working toward with HeadGym Pablo.