Tags: #ai_and_agents #software_engineering

Why Agent‑to‑Agent Communication Should Be Wrapped in CloudEvents

Or: how to stop your multi‑agent system from becoming distributed shell scripts.

Most agent‑to‑agent (A2A) communication today looks deceptively simple.

One agent emits some JSON.
Another agent reads it.
Maybe there’s an HTTP call. Maybe a message queue. Maybe a shared memory buffer.

It works in demos. It even works in early production.

And then one day you ask a very reasonable question:

“Why did this agent do that?”

At that moment, most agent systems collapse under inspection.

This article makes a straightforward claim:

If agents are autonomous actors, then their communication must be treated as events of record.
Wrapping A2A protocols in CloudEvents is how you get there.

This isn’t about adding ceremony.
It’s about making agentic systems operable.

A2A Is Not a Chat Problem. It’s a Distributed Systems Problem.

There’s a persistent category error in how we design agent systems.

We think of A2A communication as “chat between models.”
In reality, it has far more in common with microservices talking to each other under partial failure.

Agents are:

autonomous
stateful
non‑deterministic internally
long‑lived in intent, even if their processes are ephemeral

Once you have more than one agent, you inherit all the classic distributed systems problems:

causality
retries
duplication
partial failure
ordering
observability
auditability

Most A2A protocols today ignore this.

They optimize for immediacy, not history.

And history is where systems break.

The Failure Mode of “Raw” A2A Protocols

A typical A2A message might look like this:


{
  "from": "planner-agent",
  "to": "executor-agent",
  "task": "deploy_service",
  "params": {
    "service": "billing-api",
    "env": "prod"
  }
}

Seems fine. Until you ask:

Was this retried?
Was it acted on twice?
What caused it?
What version of the planner sent it?
What happened after it was sent?
Can I replay this interaction in a test environment?

There is no answer, because there is no event of record.
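A minimal sketch of the first two questions, using the message above. The `receive` helper and the in-memory `inbox` are hypothetical stand-ins for whatever transport the agents actually use. The only point is that a retried delivery and a second, deliberate request arrive as identical bytes, so the receiver has nothing to deduplicate on.

```python
import json

# The raw message from above, exactly as the planner would send it.
raw = {
    "from": "planner-agent",
    "to": "executor-agent",
    "task": "deploy_service",
    "params": {"service": "billing-api", "env": "prod"},
}

inbox = []  # hypothetical in-memory stand-in for the executor's queue


def receive(wire_bytes: bytes) -> None:
    """Decode one delivery and queue it for execution."""
    inbox.append(json.loads(wire_bytes))


wire = json.dumps(raw, sort_keys=True).encode("utf-8")
receive(wire)  # original delivery
receive(wire)  # transport-level retry, or a second deliberate request?

# The two deliveries are indistinguishable: no id, no timestamp, no cause.
indistinguishable = inbox[0] == inbox[1]
print(indistinguishable, len(inbox))  # True 2
```

An executor that works through this inbox deploys twice, and nothing in the data says whether that was intended.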

This is the same mistake we made in early microservices: RPC everywhere, state nowhere, history lost.

Agent systems are now repeating that mistake—faster.

Agents Aren’t Features. They’re Actors.

Once you accept that agents are actors, a few things become obvious:

Their actions have consequences
Those consequences must be observable
Their decisions must be inspectable after the fact
Their interactions must survive time

An agent that emits an instruction is not “sending a message.”
It is causing an event in the world.

And events deserve structure.

CloudEvents: The Boring Thing That Fixes This

CloudEvents is not a message bus.
It’s not Kafka. It’s not NATS. It’s not SQS.

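It’s a minimal, transport‑agnostic event envelope with a deliberately small contract. Four attributes are required:

id: unique identity within a source
source: where the event came from
type: what kind of event this is
specversion: which version of the CloudEvents spec the envelope follows

The rest are optional. These are the ones that matter most here:

subject: what the event is about
time: when it happened
datacontenttype: how to interpret the payload
data: the payload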

That’s it.

Which is exactly why it works.

CloudEvents doesn’t tell you how to move events.
It tells you what an event is.

And that distinction matters enormously for agents.
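To make the envelope concrete, here is a minimal sketch that builds one as a plain Python dict in the CloudEvents JSON format. It deliberately avoids any SDK. The helper names `make_event` and `missing_attributes` are hypothetical, and the `agent://` source and `agent.task.requested` type are borrowed from the example later in this article. The CloudEvents 1.0 spec requires `specversion` alongside `id`, `source`, and `type`, so the check covers all four.

```python
import json
import uuid
from datetime import datetime, timezone

# Context attributes that CloudEvents 1.0 marks as required.
REQUIRED = ("specversion", "id", "source", "type")


def make_event(source, event_type, data, subject=None):
    """Build a CloudEvents 1.0 envelope as a plain dict (JSON format)."""
    event = {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        # RFC 3339 timestamp in UTC.
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "datacontenttype": "application/json",
        "data": data,
    }
    if subject is not None:
        event["subject"] = subject
    return event


def missing_attributes(event):
    """Return the required context attributes that are absent or empty."""
    return [name for name in REQUIRED if not event.get(name)]


event = make_event(
    source="agent://research-agent",
    event_type="agent.task.requested",
    data={"task": "summarize_document"},
    subject="document/doc-417",
)
print(json.dumps(event, indent=2))
```

Nothing here knows about HTTP, Kafka, or a queue. The same dict can be serialized onto any of them, which is the separation the paragraph above describes.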

Why CloudEvents Fits A2A Communication Almost Too Well

Let’s map real A2A requirements to CloudEvents primitives:

| A2A Requirement | CloudEvents Provides |
|-----------------|----------------------|
| Agent identity | source, subject |
| Intent | type |
| Causality | id, correlation extensions |
| Replayability | Stable, immutable envelopes |
| Transport flexibility | HTTP, Kafka, NATS, SQS, files |
| Governance & audit | Metadata + schema |

A key insight is simple but powerful:

Agents reason over time. CloudEvents preserves time.

Without time, causality collapses.
Without causality, debugging becomes mythology.
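One way to carry causality is through extension attributes, which CloudEvents permits as long as the names are lowercase letters and digits. The sketch below assumes two such attributes, `correlationid` and `causationid`. Both names are hypothetical choices for this example, not part of the core spec, and so are the `summarizer-agent` source and the `accepted` and `completed` event types.

```python
import uuid


def new_event(source, event_type, data, cause=None):
    """Create an envelope; if `cause` is given, link the new event to it."""
    event_id = str(uuid.uuid4())
    event = {
        "specversion": "1.0",
        "id": event_id,
        "source": source,
        "type": event_type,
        # Hypothetical extension: shared by every event in one workflow.
        "correlationid": cause["correlationid"] if cause else event_id,
        "data": data,
    }
    if cause is not None:
        # Hypothetical extension: the event that directly triggered this one.
        event["causationid"] = cause["id"]
    return event


def causal_chain(event, by_id):
    """Walk causationid links back to the root, oldest event first."""
    chain = [event]
    while "causationid" in chain[-1]:
        chain.append(by_id[chain[-1]["causationid"]])
    return list(reversed(chain))


requested = new_event("agent://research-agent", "agent.task.requested",
                      {"task": "summarize_document"})
accepted = new_event("agent://summarizer-agent", "agent.task.accepted",
                     {}, cause=requested)
completed = new_event("agent://summarizer-agent", "agent.task.completed",
                      {"summary_chars": 512}, cause=accepted)

by_id = {e["id"]: e for e in (requested, accepted, completed)}
types = [e["type"] for e in causal_chain(completed, by_id)]
print(types)
```

Given any event, "what caused it?" becomes a lookup rather than an investigation, and every event in the workflow shares one `correlationid`.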

Raw A2A vs CloudEvents‑Wrapped A2A

Naive A2A Message


{
  "action": "summarize_document",
  "document_id": "doc-417",
  "requested_by": "research-agent"
}

CloudEvents‑Wrapped A2A Event


{
  "specversion": "1.0",
  "type": "agent.task.requested",
  "source": "agent://research-agent",
  "id": "9c2f1c4e-2c5d-4c4d-9c4a-1d92cbd3b8fa",
  "time": "2026-03-26T09:12:41Z",
  "subject": "document/doc-417",
  "datacontenttype": "application/json",
  "data": {
    "task": "summarize_document",
    "constraints": {
      "length": "short",
      "audience": "exec"
    }
  }
}

Nothing magical happened.

But suddenly you can:

replay this event
trace what caused it
audit who asked for it
correlate it with downstream actions
simulate the system offline
enforce policy at the boundary

That’s the difference between a demo and a system.
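As a rough illustration of replay, the sketch below appends events to a log as JSON lines and replays them through a handler, skipping duplicates by `(source, id)`, the pair CloudEvents expects to be unique per distinct event. The `append` and `replay` helpers are hypothetical, and the in-memory `io.StringIO` stands in for a file, topic, or table.

```python
import io
import json


def append(log, event):
    """Append one event as a JSON line; existing lines are never rewritten."""
    log.write(json.dumps(event, sort_keys=True) + "\n")


def replay(log, handler):
    """Feed every stored event to `handler` once, skipping duplicates."""
    seen = set()
    handled = 0
    log.seek(0)
    for line in log:
        event = json.loads(line)
        key = (event["source"], event["id"])
        if key in seen:
            continue  # a retried delivery of an event already processed
        seen.add(key)
        handler(event)
        handled += 1
    return handled


event = {
    "specversion": "1.0",
    "type": "agent.task.requested",
    "source": "agent://research-agent",
    "id": "9c2f1c4e-2c5d-4c4d-9c4a-1d92cbd3b8fa",
    "subject": "document/doc-417",
    "data": {"task": "summarize_document"},
}

log = io.StringIO()  # stand-in for durable storage
append(log, event)
append(log, event)   # the transport retried the delivery

tasks = []
handled = replay(log, lambda e: tasks.append(e["data"]["task"]))
print(handled, tasks)  # 1 ['summarize_document']
```

The same `replay` call can point at a test handler instead of the real executor, which is what makes offline simulation possible.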

“Isn’t This Overkill?”

This objection always appears. It appeared with:

structured logging
distributed tracing
event sourcing
schema registries

Let’s address it directly.

“It adds latency”

Your LLM call typically takes hundreds of milliseconds, often more.
Building and serializing a CloudEvents envelope typically adds microseconds.

Latency is not your bottleneck. Opacity is.
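The envelope cost is easy to measure rather than assume. The sketch below times building and serializing one envelope with the standard library. The `wrap` helper is hypothetical, the payload is made up, and the number you see depends on your machine and Python version, so treat it as a sanity check rather than a benchmark.

```python
import json
import timeit
import uuid
from datetime import datetime, timezone


def wrap(data):
    """Build and serialize one envelope around an existing payload."""
    return json.dumps({
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "agent://research-agent",
        "type": "agent.task.requested",
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    })


payload = {"task": "summarize_document", "constraints": {"length": "short"}}
runs = 10_000
# Average wall-clock cost of one wrap() call, in seconds.
seconds_per_event = timeit.timeit(lambda: wrap(payload), number=runs) / runs
print(f"{seconds_per_event * 1e6:.1f} microseconds per envelope")
```

This measures only the envelope itself. Persisting the event or sending it over a network costs more, but the raw message would pay those costs too.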

“Agents are ephemeral”

Exactly.
Which is why their effects must not be.

Processes die. Events persist.

“This is enterprise architecture creeping in”

No.
This is the minimum architecture required for systems that evolve.

The cost of not having an event of record is always paid later—
during incidents, audits, rewrites, and blame.

With interest.

A2A + CloudEvents Is How Agents Scale Beyond a Single Team

Once A2A communication is event‑native, new doors open:

Multi‑team agent ecosystems
Cross‑org agent collaboration
Safety and policy enforcement
Deterministic replay and simulation
Compliance without rewriting everything
Tooling reuse across agent frameworks
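Taking safety and policy enforcement from the list above as one example: once every message has a `source` and a `type`, a gate at the boundary can decide on metadata without understanding each agent's payload format. The rules below (`ALLOWED`, `PROD_SOURCES`) and the `check` function are entirely hypothetical, a sketch of the shape such a gate might take.

```python
# Hypothetical policy: which sources may emit which event types.
ALLOWED = {
    "agent://planner-agent": {"agent.task.requested"},
    "agent://research-agent": {"agent.task.requested"},
}
# Hypothetical rule: only these sources may target production.
PROD_SOURCES = {"agent://planner-agent"}


def check(event):
    """Return a list of policy violations; an empty list means admit."""
    violations = []
    source, event_type = event.get("source"), event.get("type")
    if event_type not in ALLOWED.get(source, set()):
        violations.append(f"{source} may not emit {event_type}")
    env = event.get("data", {}).get("env")
    if env == "prod" and source not in PROD_SOURCES:
        violations.append(f"{source} may not target prod")
    return violations


deploy = {
    "specversion": "1.0",
    "id": "evt-1",
    "source": "agent://research-agent",
    "type": "agent.task.requested",
    "data": {"task": "deploy_service", "service": "billing-api", "env": "prod"},
}
print(check(deploy))  # ['agent://research-agent may not target prod']
```

Because the gate reads only the envelope and one well-known field, it can sit in front of agents built by different teams or frameworks.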

This mirrors exactly what happened with microservices:

The systems that survived were the ones that treated events as first‑class.

Agents are not exempt from this law.
If anything, they need it more.

Don’t Let Agents Become the New Shell Scripts

Shell scripts worked—until they didn’t.
They were powerful, flexible, and completely ungovernable.

Agent systems built without event discipline are headed the same way.

If you expect your agents to:

operate autonomously
coordinate reliably
survive production
be understood six months later

Then agent‑to‑agent communication must be event‑native.

Wrapping A2A protocols in CloudEvents isn’t abstraction for its own sake.

It’s how agent systems grow up.

If you’re building agent infrastructure today, treat this as a design constraint, not an optimization. Your future self will thank you.