Claude Academy
Expert · 17 min

Designing Agentic Systems

Learning Objectives

  • Architect multi-agent systems with proper boundaries
  • Choose between agents, skills, and commands for different needs
  • Design error handling and fallback strategies
  • Implement reliable autonomous workflows

Thinking in Systems

Building one agent is straightforward. Building a system of agents that work together reliably is an engineering challenge. This lesson covers the architecture patterns, communication protocols, and error handling strategies that make multi-agent systems work.

The Decision Framework

Before reaching for agents, ask:

Do I need an agent, a skill, or a command?

Is it a recurring specialized ROLE with its own context needs?

→ Agent (security reviewer, test writer)

Is it specialized KNOWLEDGE that enhances the current session?

→ Skill (migration patterns, API conventions)

Is it a repeatable TASK sequence?

→ Command (deploy checklist, code review)

Is it a one-off request?

→ Just ask Claude directly

How many agents do I need?

Task touches 1-3 files → Single agent (or no agent — just ask Claude)

Task touches 5-15 files independently → /batch with parallel agents

Task has 2-3 distinct concerns → 2-3 specialist agents

Task needs planning + implementation → Supervisor + worker agents

Task is a full project → Agent team

The rule of thumb: use the minimum number of agents that achieves the goal. Each agent adds complexity and token cost.
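If it helps to see the heuristics in one place, they can be written as a tiny dispatch function. This is purely illustrative: the thresholds mirror the rules of thumb above, and the return strings are made up for this sketch.

```python
def choose_orchestration(files_touched: int,
                         distinct_concerns: int = 1,
                         needs_planning: bool = False) -> str:
    """Map the rules of thumb onto an orchestration choice (illustrative only)."""
    if needs_planning:
        return "supervisor + workers"
    if distinct_concerns >= 2:
        return f"{distinct_concerns} specialist agents"
    if files_touched >= 5:
        return "parallel batch agents"
    return "single agent (or just ask Claude)"
```

The ordering matters: coordination needs (planning, multiple concerns) outrank raw file count, which is why they are checked first.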

Architecture Patterns

Pattern 1: Specialist Pool

Multiple independent specialists, each handling their domain:

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Security   │  │ Performance  │  │   Testing    │
│   Reviewer   │  │   Analyst    │  │    Writer    │
│    (Opus)    │  │   (Sonnet)   │  │   (Sonnet)   │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
                   Your Session

Each specialist works independently on its concern. Results are aggregated in your session. No coordination between specialists needed.

Best for: Multi-dimensional code review, where each dimension is independent.

Pattern 2: Supervisor + Workers

A lead agent coordinates worker agents:

        ┌─────────────────────┐
        │  Supervisor (Opus)  │
        │  Plans, assigns,    │
        │  reviews, integrates│
        └──────────┬──────────┘
            ┌──────┼──────┐
            ▼      ▼      ▼
        ┌──────┐┌──────┐┌──────┐
        │Work 1││Work 2││Work 3│  (Sonnet)
        └──────┘└──────┘└──────┘

The supervisor decomposes the task, assigns subtasks, reviews results, and integrates. Workers implement their assigned subtask.

Best for: Large tasks with interdependencies that need coordinated planning.
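The decompose/assign/review/integrate loop can be sketched in a few lines; `worker` here is a hypothetical callable standing in for a worker agent, and the decomposition into three parts is arbitrary.

```python
def supervise(task, worker):
    """Supervisor loop: plan subtasks, assign to workers, review, integrate."""
    subtasks = [f"{task} (part {i})" for i in (1, 2, 3)]  # plan: decompose
    results = [worker(sub) for sub in subtasks]           # assign: run workers
    reviewed = [r for r in results if r]                  # review: drop empty results
    return "; ".join(reviewed)                            # integrate: merge
```

In a real system the review step would send unsatisfactory results back to a worker rather than silently dropping them.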

Pattern 3: Pipeline

Agents process work sequentially, each adding value:

Code → [Implementer] → [Reviewer] → [Tester] → [Documenter] → Done
        (Write code)   (Review it)  (Test it)  (Document it)

Each agent's output is the next agent's input. The pipeline ensures every change goes through a consistent quality process.

Best for: Workflows with sequential quality gates.

Pattern 4: Mapper + Reducer

Distribute work across many agents, then aggregate:

[Mapper] → Agent 1 → Result 1 ─┐
         → Agent 2 → Result 2 ─┤
         → Agent 3 → Result 3 ─┤→ [Reducer] → Final Report
         → Agent N → Result N ─┘

The mapper distributes (each agent analyzes one module). The reducer aggregates (combines all analyses into a single report).

Best for: Codebase-wide analysis where each module is independent.
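The two halves map directly onto a classic map-reduce sketch. Here `ThreadPoolExecutor` stands in for parallel agent sessions, and `analyze_module` is a hypothetical per-module agent:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_module(module):
    # Map step: in a real system, one agent analyzes one module.
    return f"{module}: 0 issues"

def reduce_report(results):
    # Reduce step: combine per-module results into a single report.
    return "Final report:\n" + "\n".join(sorted(results))

def map_reduce(modules):
    """Distribute analysis across modules, then aggregate into one report."""
    with ThreadPoolExecutor() as pool:
        return reduce_report(list(pool.map(analyze_module, modules)))
```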

Communication Protocols

Direct Messages

Agent-to-agent communication for specific coordination:

Security Agent → Lead: "Found a critical auth bypass in payment.ts.
                        This blocks the performance optimization
                        because the fix changes the API contract."

Broadcasts

Messages to all agents for shared context:

Lead → All: "The database schema has changed. All agents working
             on repository files need to regenerate their Prisma client."

Status Reports

Periodic updates from workers to supervisor:

Worker 1 → Lead: "Status: 3/5 files converted. On track."
Worker 2 → Lead: "Status: Blocked. Type error in shared utility."
Worker 3 → Lead: "Status: Complete. All tests passing."
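One way to model all three message kinds is a single message type plus a router. The `AgentMessage` fields and the `"all"` recipient convention here are illustrative, not part of any real API:

```python
from dataclasses import dataclass

@dataclass
class AgentMessage:
    sender: str
    recipient: str  # a specific agent, or "all" for broadcasts
    kind: str       # "direct", "broadcast", or "status"
    body: str

def route(message, agents):
    """Deliver a message to one agent, or to every agent for broadcasts."""
    targets = agents if message.recipient == "all" else [message.recipient]
    return {agent: message.body for agent in targets}
```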

Error Handling Strategies

Retry

For transient failures:

# Agent configuration
maxTurns: 25

# In the agent's instructions:
If a tool use fails, retry once. If it fails again, report the error
to the lead agent instead of retrying indefinitely.
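The retry-once instruction amounts to a small wrapper; `tool_call` and `escalate` are hypothetical stand-ins for a tool invocation and a report to the lead agent:

```python
def with_retry(tool_call, escalate):
    """Try once, retry once, then escalate instead of looping forever."""
    last_error = None
    for attempt in (1, 2):
        try:
            return tool_call()
        except Exception as err:
            last_error = err
    escalate(f"tool failed twice: {last_error}")
    return None
```

Capping retries at one keeps transient failures cheap while guaranteeing a persistent failure surfaces to the lead within two attempts.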

Escalation

When an agent can't solve the problem:

Worker: "I can't fix the type error in shared-utils.ts because
         it requires changes to the interface that other workers
         are using."

Lead: "Understood. I'll coordinate the interface change with
       all affected workers and provide the updated interface."

Graceful Degradation

When part of the system fails, the rest continues:

Batch of 10 agents:
  Agent 1: ✓ Success
  Agent 2: ✓ Success
  Agent 3: ✗ Failed (test timeout)
  Agent 4: ✓ Success
  ...
Result: 9 successful, 1 failed. Failed unit queued for retry.
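A minimal sketch of this batch behavior, with a plain function standing in for each agent's unit of work: failures are captured and queued rather than allowed to abort the batch.

```python
def run_batch(units, run):
    """Run each unit independently; one failure never stops the batch."""
    done, retry_queue = [], []
    for unit in units:
        try:
            done.append((unit, run(unit)))
        except Exception:
            retry_queue.append(unit)  # queue the failed unit for a later retry
    return done, retry_queue
```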

Circuit Breaker

If too many failures occur, stop and report:

If 3 or more out of 10 agents fail:
→ Stop remaining agents
→ Report to human: "30% failure rate. Likely a systemic issue
  (shared dependency, infrastructure problem). Human review needed."
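A sketch of the check, assuming the 30% threshold above; the return strings are illustrative:

```python
def circuit_breaker(results, threshold=0.3):
    """Stop the batch when the failure rate suggests a systemic issue."""
    failure_rate = sum(1 for ok in results if not ok) / len(results)
    if failure_rate >= threshold:
        return f"stop: {failure_rate:.0%} failure rate, human review needed"
    return "continue"
```

The point of the breaker is to distinguish isolated failures (retry them) from correlated ones (a shared dependency broke, so retrying every unit wastes tokens).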

Isolation Boundaries

Every agent needs clear boundaries:

File Boundaries

# Agent A: only touches auth files
permissionMode: acceptEdits

You may ONLY read and write files in src/services/auth/ and tests/auth/.
Do not touch any other directory.

Tool Boundaries

{
  "permissions": {
    "allow": ["Read(src/auth/)", "Write(src/auth/)", "Bash(pnpm test auth)"],
    "deny": ["Write(src/payments/*)", "Bash(pnpm deploy)"]
  }
}

Context Boundaries

Each agent gets only the context it needs:

  • Security agent: gets security-related files and instructions
  • Performance agent: gets performance-related files and metrics
  • Test agent: gets source files and test conventions

Don't give every agent the full codebase context.

Design Principles

1. Start simple: One agent first. Add more only when complexity demands it.

2. Clear boundaries: Each agent knows exactly what it can and can't do.

3. Minimal communication: Agents that need to communicate constantly should probably be one agent.

4. Fail gracefully: Plan for agent failures from the start.

5. Human in the loop: Keep a human checkpoint for critical decisions.

6. Match model to role: Opus for reasoning, Sonnet for implementation, Haiku for exploration.

7. Isolate experiments: Use worktrees for agents making changes.

Key Takeaway

Agentic systems succeed through clear isolation boundaries, appropriate pattern selection (specialist pool, supervisor+workers, pipeline, or mapper+reducer), and robust error handling (retry, escalate, degrade gracefully). Before adding agents, ask: do I need an agent, a skill, or a command? Use the minimum number of agents that achieves the goal. Each agent should have well-defined scope, restricted tools, and clear communication interfaces. Start simple, add complexity only when justified.