ADR-0014: Coordination-First Architecture

Status

Proposed (Supersedes ADR-0013)

Context

ADR-0013 proposed using PostgreSQL for state management, with event sourcing and snapshots. Closer analysis showed that this approach:

  • Violates the minimal core philosophy by adding heavyweight dependencies
  • Makes Caxton responsible for business domain concerns (agent state)
  • Creates operational complexity (backups, migrations, replication)
  • Introduces a shared state bottleneck that limits scalability

Further research revealed that what Caxton actually needs is coordination, not shared state:

  • Agent discovery and registry
  • Health monitoring and failure detection
  • Message routing information
  • Cluster membership

Decision

Caxton adopts a coordination-first architecture that eliminates shared state in favor of lightweight coordination protocols. Agent state management becomes a business domain responsibility through MCP tools.

Protocol Layering:

  • SWIM Protocol: Infrastructure layer for cluster coordination and membership
  • FIPA Protocol: Application layer for semantic agent-to-agent messaging
  • Clear Separation: These protocols complement rather than compete with each other (contrasted in the sketch below)
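
To make the layering concrete, the sketch below contrasts the two kinds of traffic. The type names are hypothetical illustrations, not Caxton's actual API: SWIM events describe node-level membership changes, while FIPA messages carry agent-level speech acts.

use serde_json::Value;

// Infrastructure layer: membership events surfaced by the SWIM runtime.
pub enum SwimEvent {
    MemberJoined { node_id: String, addr: std::net::SocketAddr },
    MemberSuspected { node_id: String },
    MemberLeft { node_id: String },
}

// Application layer: a FIPA-style speech act between agents.
pub struct FipaMessage {
    pub performative: Performative,
    pub sender: String,
    pub receiver: String,
    pub conversation_id: String,
    pub content: Value,
}

pub enum Performative {
    Request,
    Inform,
    Query,
    Failure,
}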

Core Principles

1. No Shared State

Each Caxton instance maintains only local state. No external database dependencies.

2. Coordination Through Gossip

Use the SWIM protocol for cluster coordination:

  • Scalable membership protocol
  • Built-in failure detection (see the sketch after this list)
  • Eventually consistent
  • No single point of failure
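
The failure-detection half of SWIM can be pictured as a small state machine. The enum below is an explanatory sketch, not the memberlist crate's API:

// Sketch of SWIM-style failure detection states. Each protocol period,
// a node probes one random peer; if the direct probe times out, it asks
// k other peers to probe indirectly before marking the target Suspect,
// and only later Dead.
pub enum MemberState {
    Alive,   // responded to a direct or indirect probe
    Suspect, // missed probes; given a grace period to refute the rumor
    Dead,    // suspicion timed out; removal is gossiped to the cluster
}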

3. Agent State via MCP Tools

Agents requiring persistence use business-provided MCP tools:

// Standard interface for state persistence
// (assumes the async_trait crate and serde_json's Value; Result is the
// crate's own error alias)
use async_trait::async_trait;
use serde_json::Value;

#[async_trait]
pub trait McpStateTool: Send + Sync {
    async fn store(&self, key: String, value: Value) -> Result<()>;
    async fn retrieve(&self, key: String) -> Result<Option<Value>>;
    async fn delete(&self, key: String) -> Result<()>;
    async fn list(&self, prefix: String) -> Result<Vec<String>>;
}
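
As a hedged reference sketch (not an official Caxton tool), an in-memory implementation might look like the following. It assumes the trait's Result is anyhow::Result; a production tool would put Postgres, Redis, or S3 behind the same four methods:

use std::collections::BTreeMap;
use std::sync::Arc;

use anyhow::Result; // assumed error alias
use async_trait::async_trait;
use serde_json::Value;
use tokio::sync::RwLock;

// In-memory state tool for tests and local development.
#[derive(Default)]
pub struct InMemoryStateTool {
    entries: Arc<RwLock<BTreeMap<String, Value>>>,
}

#[async_trait]
impl McpStateTool for InMemoryStateTool {
    async fn store(&self, key: String, value: Value) -> Result<()> {
        self.entries.write().await.insert(key, value);
        Ok(())
    }

    async fn retrieve(&self, key: String) -> Result<Option<Value>> {
        Ok(self.entries.read().await.get(&key).cloned())
    }

    async fn delete(&self, key: String) -> Result<()> {
        self.entries.write().await.remove(&key);
        Ok(())
    }

    async fn list(&self, prefix: String) -> Result<Vec<String>> {
        // Linear scan; a production tool would use an indexed lookup.
        Ok(self
            .entries
            .read()
            .await
            .keys()
            .filter(|k| k.starts_with(&prefix))
            .cloned()
            .collect())
    }
}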

Architecture Components

Local State Storage

Each instance uses embedded SQLite for local state:

  • Agent registry cache
  • Routing tables
  • Message queues during partitions
  • Conversation state tracking

use rusqlite::Connection;

pub struct LocalState {
    db: Connection,
}

impl LocalState {
    pub fn new(path: &str) -> Result<Self> {
        let db = Connection::open(path)?;
        db.execute_batch(
            "CREATE TABLE IF NOT EXISTS agents (
                id TEXT PRIMARY KEY,
                capabilities TEXT NOT NULL,
                metadata TEXT,
                last_seen INTEGER NOT NULL
            );
            CREATE TABLE IF NOT EXISTS routes (
                agent_id TEXT PRIMARY KEY,
                node_id TEXT NOT NULL,
                updated_at INTEGER NOT NULL
            );",
        )?;
        Ok(Self { db })
    }
}
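
A short usage sketch shows how the registry cache is refreshed when gossip delivers fresher agent metadata. The upsert_agent helper is illustrative rather than an existing Caxton API, and it assumes an anyhow-style Result that wraps rusqlite errors:

use rusqlite::params;

impl LocalState {
    // Illustrative helper: cache or refresh an agent record learned via
    // gossip. Later writes win on the primary key.
    pub fn upsert_agent(&self, id: &str, capabilities: &str, last_seen: i64) -> Result<()> {
        self.db.execute(
            "INSERT INTO agents (id, capabilities, last_seen)
             VALUES (?1, ?2, ?3)
             ON CONFLICT(id) DO UPDATE SET
                 capabilities = excluded.capabilities,
                 last_seen = excluded.last_seen",
            params![id, capabilities, last_seen],
        )?;
        Ok(())
    }
}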


Cluster Coordination

SWIM protocol for distributed coordination:

use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::RwLock;

// Note: the memberlist builder and join calls below are illustrative;
// consult the crate's documentation for exact signatures. AgentId,
// AgentInfo, and broadcast_agents are Caxton items elided here.
use memberlist::{Config, Memberlist};

pub struct ClusterCoordinator {
    memberlist: Arc<Memberlist>,
    local_agents: Arc<RwLock<HashMap<AgentId, AgentInfo>>>,
}

impl ClusterCoordinator {
    pub async fn start(bind_addr: &str, seeds: Vec<String>) -> Result<Self> {
        let config = Config::default()
            .with_bind_addr(bind_addr)
            .with_gossip_interval(Duration::from_millis(200));

        let memberlist = Arc::new(Memberlist::new(config)?);

        if !seeds.is_empty() {
            memberlist.join(seeds).await?;
        }

        let coordinator = Self {
            memberlist,
            local_agents: Arc::new(RwLock::new(HashMap::new())),
        };
        coordinator.start_gossip_loop();
        Ok(coordinator)
    }

    fn start_gossip_loop(&self) {
        // Periodically gossip the local agent registry. The spawned task
        // owns cloned Arc handles, so it never borrows `self` across the
        // 'static boundary that tokio::spawn requires.
        let memberlist = Arc::clone(&self.memberlist);
        let agents = Arc::clone(&self.local_agents);
        tokio::spawn(async move {
            loop {
                broadcast_agents(&memberlist, &agents).await;
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        });
    }
}

Returning `Self` from `start`, rather than mutating a half-initialized coordinator, avoids ever holding an uninitialized `memberlist` field, and the cloned `Arc` handles let the gossip task outlive the caller's borrow.

Message Routing

Routing without shared state:

use std::collections::HashMap;
use std::sync::Arc;

// Message, Agent, AgentId, NodeId, and forward_to_node are Caxton
// items elided here.
pub struct MessageRouter {
    // Agents hosted on this node
    local_agents: HashMap<AgentId, Arc<dyn Agent>>,
    // Routes learned via gossip: which node hosts which remote agent
    local_routes: HashMap<AgentId, NodeId>,
    gossip: Arc<ClusterCoordinator>,
}

impl MessageRouter {
    pub async fn route(&self, msg: Message) -> Result<()> {
        // Try local agents first
        if let Some(agent) = self.local_agents.get(&msg.receiver) {
            return agent.handle(msg).await;
        }

        // Check gossip-learned routes
        if let Some(node_id) = self.local_routes.get(&msg.receiver) {
            return self.forward_to_node(node_id, msg).await;
        }

        // Broadcast a query if the agent's location is unknown
        self.gossip.query_agent_location(&msg.receiver).await
    }
}
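
The query_agent_location call above is left undefined; one possible shape is sketched below. This is a hypothetical design, not an existing API: the helper methods it calls are illustrative, AgentId is assumed to implement Debug and Clone, and Result is assumed anyhow-compatible.

use std::time::Duration;

impl ClusterCoordinator {
    // Hypothetical sketch: gossip a "who hosts this agent?" query, then
    // wait briefly for the first reply. register_pending_query,
    // gossip_location_query, and record_route are illustrative helpers.
    pub async fn query_agent_location(&self, agent: &AgentId) -> Result<()> {
        let (reply_tx, reply_rx) = tokio::sync::oneshot::channel::<NodeId>();
        self.register_pending_query(agent.clone(), reply_tx).await;
        self.gossip_location_query(agent).await?;

        match tokio::time::timeout(Duration::from_secs(2), reply_rx).await {
            // Cache the learned route for future messages.
            Ok(Ok(node_id)) => self.record_route(agent.clone(), node_id).await,
            _ => Err(anyhow::anyhow!("no node claims agent {agent:?}")),
        }
    }
}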

State Categories

Caxton-Managed (Coordination)

  • Agent Registry: Which agents exist and their capabilities
  • Cluster Membership: Which Caxton instances are alive
  • Routing Table: Which node hosts which agents
  • Health Status: Liveness and readiness information

Business-Managed (State)

  • Agent Checkpoints: Persistent agent state
  • Conversation History: Message logs and context
  • Task State: Long-running operation status
  • Audit Logs: Compliance and debugging
  • Business Data: Domain-specific information

Implementation Example

Multi-Instance Deployment

// Instance 1 (Primary datacenter)
let instance1 = Caxton::builder()
    .with_local_db("instance1.db")
    .with_bind_addr("10.0.1.10:7946")
    .with_seeds(vec!["10.0.2.10:7946"])
    .build()?;

// Instance 2 (Secondary datacenter)
let instance2 = Caxton::builder()
    .with_local_db("instance2.db")
    .with_bind_addr("10.0.2.10:7946")
    .with_seeds(vec!["10.0.1.10:7946"])
    .build()?;

// They automatically:
// - Discover each other via SWIM
// - Share agent registries via gossip
// - Route messages without shared state

Agent with Business State

pub struct StatefulAgent {
    id: AgentId,
    state_tool: Box<dyn McpStateTool>,
}

impl StatefulAgent {
    // serialize_state / deserialize_state are elided; they convert the
    // agent's in-memory state to and from a serde_json::Value.
    pub async fn checkpoint(&self) -> Result<()> {
        let state = self.serialize_state()?;
        self.state_tool.store(
            format!("checkpoints/{}", self.id),
            state
        ).await
    }

    pub async fn restore(&mut self) -> Result<()> {
        if let Some(state) = self.state_tool.retrieve(
            format!("checkpoints/{}", self.id)
        ).await? {
            self.deserialize_state(state)?;
        }
        Ok(())
    }
}
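
A brief usage sketch, assuming an agent already wired to a state tool such as the in-memory implementation shown earlier (construction of the agent itself is elided):

// Hedged usage sketch: checkpoint before shutdown or migration, restore
// after the agent is rehosted on another node.
async fn checkpoint_roundtrip(agent: &mut StatefulAgent) -> Result<()> {
    agent.checkpoint().await?; // persist via the business-provided tool
    // ... the node restarts, or the agent migrates elsewhere ...
    agent.restore().await?;    // rebuild state from the last checkpoint
    Ok(())
}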

Consequences

Positive

  • No external dependencies: SQLite is embedded, SWIM is a library
  • Linear scalability: No shared state bottleneck
  • Operational simplicity: No database administration
  • Fault isolation: Node failures don’t affect others
  • Geographic distribution: Works naturally across regions
  • Business flexibility: Choose any state backend via MCP
  • Minimal core maintained: Caxton remains a message router
  • Partition tolerance: Graceful degradation during network splits
  • Cross-cluster communication: Agents can communicate across instance boundaries

Negative

  • Eventual consistency: Agent registry may be temporarily inconsistent
  • No strong consistency: Cannot guarantee global ordering
  • Learning curve: SWIM protocol less familiar than databases
  • Network partitions: Require careful handling and degraded modes
  • Gossip overhead: Background network traffic for coordination

Neutral

  • Different mental model: Think coordination, not shared state
  • MCP tool requirement: Businesses must provide state tools if needed
  • Migration complexity: Existing systems expecting shared state need updates

Migration Path

Phase 1: Local State (Week 1)

  • Introduce embedded SQLite for local storage
  • No breaking changes to external API

Phase 2: SWIM Protocol (Weeks 2-3)

  • Add memberlist dependency
  • Implement gossip for agent registry
  • Maintain backward compatibility

Phase 3: Remove Shared State (Week 4)

  • Deprecate PostgreSQL backend
  • Provide migration tools
  • Document MCP state tool interface

Phase 4: MCP Tools (Weeks 5-6)

  • Publish MCP StateTool trait
  • Provide reference implementations
  • Create migration guides

Alternatives Considered

Keep PostgreSQL (ADR-0013)

  • Pros: Strong consistency, familiar tooling
  • Cons: Heavy dependency, operational complexity, scalability limits
  • Decision: Rejected due to minimal core violation

Embedded etcd

  • Pros: Strong consistency, proven in Kubernetes
  • Cons: Still requires consensus, complex for our needs
  • Decision: Overkill for coordination-only needs

Redis with Clustering

  • Pros: Fast, supports pub/sub
  • Cons: External dependency, complex cluster setup
  • Decision: Still violates zero-dependency goal

Comparison with Industry Systems

HashiCorp Consul

  • Uses SWIM for membership (like our proposal)
  • Raft reserved for critical configuration (which we avoid entirely)
  • Proves gossip scales to thousands of nodes

Apache Cassandra

  • Gossip protocol for cluster state
  • No central coordinator
  • Validates our approach at scale

Kubernetes

  • etcd for config, local state in kubelet
  • Similar hybrid model
  • Shows pattern works in production

Guidelines

  1. Think coordination, not consistency: Design for eventual consistency
  2. Local first: Prefer local state over distributed state
  3. Gossip sparingly: Only share essential information (see the sketch after this list)
  4. Business owns state: Let MCP tools handle persistence
  5. Fail independently: Design for partition tolerance
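
To make "gossip sparingly" concrete, a registry entry can be kept to a few small fields. The record below is a hypothetical sketch, not Caxton's wire format; it assumes serde with the derive feature.

// Hypothetical gossip payload: just enough to route, nothing more.
// Full capability documents and agent state stay local or in MCP tools.
#[derive(Clone, serde::Serialize, serde::Deserialize)]
pub struct GossipedAgentEntry {
    pub agent_id: String,
    pub node_id: String,
    pub capability_tags: Vec<String>, // short tags, not full documents
    pub last_seen_unix: i64,          // for pruning stale entries
}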

Notes

This architecture makes Caxton truly lightweight and cloud-native. By eliminating shared state, we remove the primary scaling bottleneck and operational burden. The coordination-first approach aligns perfectly with the minimal core philosophy while providing all necessary functionality through intelligent architectural choices.