Clustering and Distributed Operations

This guide covers running Caxton in a distributed cluster configuration for high availability and scalability.

Overview

Caxton uses a coordination-first architecture with no external dependencies such as databases or message queues. Each Caxton instance:

  • Maintains its own local state using embedded SQLite
  • Coordinates with other instances via the SWIM gossip protocol
  • Automatically discovers and routes messages to agents across the cluster
  • Handles network partitions gracefully with degraded mode operation

For architectural details, see ADR-0016 (Security Architecture), ADR-0017 (Performance Requirements), and ADR-0018 (Operational Procedures), which are referenced from the relevant sections below.

Starting a Cluster

Bootstrap First Node

The first node acts as the seed for cluster formation:

# Start the seed node
caxton server start \
  --node-id node-1 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --bootstrap
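
In production you will typically run the seed node under a process supervisor rather than in a foreground shell. Below is a minimal systemd unit sketch; the unit name, service user, and binary path are assumptions to adapt to your installation.

# /etc/systemd/system/caxton.service (sketch; paths and user are assumptions)
[Unit]
Description=Caxton server (seed node)
After=network-online.target
Wants=network-online.target

[Service]
User=caxton
ExecStart=/usr/local/bin/caxton server start \
  --node-id node-1 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --bootstrap
Restart=on-failure

[Install]
WantedBy=multi-user.target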

Join Additional Nodes

Other nodes join by connecting to the seed:

# On node 2
caxton server start \
  --node-id node-2 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946

# On node 3
caxton server start \
  --node-id node-3 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946,node-2.example.com:7946
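
Joining many nodes by hand gets tedious. One way to script it is a small shell loop from an admin host; this sketch assumes SSH access and hostnames of the form node-N.example.com.

# Sketch: join nodes 2 through 5 to the seed (hostnames are assumptions)
for i in 2 3 4 5; do
  ssh "node-$i.example.com" caxton server start \
    --node-id "node-$i" \
    --bind-addr 0.0.0.0:7946 \
    --api-addr 0.0.0.0:8080 \
    --join node-1.example.com:7946
done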

Verify Cluster Status

# Check cluster membership
caxton cluster members

# Example output:
NODE-ID    STATUS    ADDRESS           AGENTS    CPU    MEMORY
node-1     alive     10.0.1.10:7946    42        15%    2.1GB
node-2     alive     10.0.1.11:7946    38        12%    1.8GB
node-3     alive     10.0.1.12:7946    40        18%    2.3GB
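
During maintenance windows it can help to poll membership continuously; standard watch works with the command above.

# Refresh the membership table every 5 seconds
watch -n 5 caxton cluster members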

Configuration

Cluster Configuration File

Create /etc/caxton/cluster.yaml:

coordination:
  cluster:
    # SWIM protocol settings
    bind_addr: 0.0.0.0:7946
    advertise_addr: ${HOSTNAME}:7946

    # Seed nodes for joining
    seeds:
      - caxton-1.example.com:7946
      - caxton-2.example.com:7946
      - caxton-3.example.com:7946

    # Gossip parameters
    gossip_interval: 200ms
    gossip_fanout: 3
    probe_interval: 1s
    probe_timeout: 500ms

  # Partition handling
  partition:
    detection_timeout: 5s
    quorum_size: 2
    degraded_mode: true
    queue_writes: true
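
To try this configuration locally before provisioning real hosts, a throwaway three-node cluster can be stood up with Docker Compose. This is only a sketch: the image name is an assumption, and in Compose the service names double as hostnames for --join.

# docker-compose.yaml (sketch; the image name is an assumption)
services:
  node-1:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-1 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --bootstrap
  node-2:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-2 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --join node-1:7946
    depends_on: [node-1]
  node-3:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-3 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --join node-1:7946
    depends_on: [node-1]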

Security Configuration

Enable mTLS for secure inter-node communication:

security:
  cluster:
    mtls:
      enabled: true
      ca_cert: /etc/caxton/ca.crt
      node_cert: /etc/caxton/certs/node.crt
      node_key: /etc/caxton/certs/node.key
      verify_peer: true

See ADR-0016: Security Architecture for details.
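
The certificates referenced above can be issued with standard openssl. The commands below are a minimal sketch (subject names and validity periods are illustrative) that writes to the paths used in the configuration.

# Create a cluster CA (paths match the configuration above)
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout /etc/caxton/ca.key -out /etc/caxton/ca.crt \
  -subj "/CN=caxton-cluster-ca"

# Issue a node certificate signed by that CA
openssl req -newkey rsa:4096 -nodes \
  -keyout /etc/caxton/certs/node.key -out /etc/caxton/certs/node.csr \
  -subj "/CN=node-1.example.com"
openssl x509 -req -days 365 -in /etc/caxton/certs/node.csr \
  -CA /etc/caxton/ca.crt -CAkey /etc/caxton/ca.key -CAcreateserial \
  -out /etc/caxton/certs/node.crt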

Agent Distribution

Agents are automatically distributed across the cluster:

# Deploy an agent (automatically placed on optimal node)
caxton deploy agent.wasm --name my-agent

# Deploy with placement preferences
caxton deploy agent.wasm \
  --name my-agent \
  --placement-strategy least-loaded \
  --prefer-nodes node-1,node-2

# Force deployment to specific node
caxton deploy agent.wasm \
  --name my-agent \
  --target-node node-3

Agent Discovery

Agents can communicate regardless of which node they’re on:

# Send message to agent (routing handled automatically)
caxton message send \
  --to remote-agent \
  --content "Hello from anywhere in the cluster!"

# The cluster automatically:
# 1. Discovers which node hosts 'remote-agent'
# 2. Routes the message through the cluster
# 3. Delivers to the target agent

High Availability

Automatic Failover

When a node fails, its agents are automatically redistributed:

# Monitor failover behavior
caxton cluster watch

# Example during node failure:
[INFO] Node node-2 detected as failed
[INFO] Redistributing 38 agents from node-2
[INFO] Agent 'processor-1' migrated to node-1
[INFO] Agent 'worker-5' migrated to node-3
[INFO] All agents successfully redistributed (2.3s)

Network Partition Handling

Caxton handles network partitions gracefully:

Majority Partition

Nodes in the majority partition continue normal operations:

# On majority side (2 of 3 nodes)
caxton cluster status
# Status: HEALTHY (majority partition)
# Operations: READ-WRITE
# Nodes: 2/3 active

Minority Partition

Nodes in the minority enter degraded mode:

# On minority side (1 of 3 nodes)
caxton cluster status
# Status: DEGRADED (minority partition)
# Operations: READ-ONLY
# Nodes: 1/3 active
# Queued writes: 42

When the partition heals, queued operations are replayed automatically.

Monitoring

Cluster Metrics

Key metrics to monitor:

# Cluster health metrics
curl http://localhost:9090/metrics | grep caxton_cluster

# Key metrics:
caxton_cluster_nodes_total          3
caxton_cluster_nodes_alive          3
caxton_cluster_agents_total         120
caxton_cluster_gossip_latency_ms    0.8
caxton_cluster_convergence_time_ms  423
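
If you scrape these metrics with Prometheus, a basic alert on lost nodes can be built directly from the gauges above. A minimal alerting-rule sketch (the wait time and severity label are illustrative):

# caxton-alerts.yaml (sketch; thresholds and labels are illustrative)
groups:
  - name: caxton-cluster
    rules:
      - alert: CaxtonClusterNodeDown
        expr: caxton_cluster_nodes_alive < caxton_cluster_nodes_total
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Caxton cluster is missing one or more nodes"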

Performance Monitoring

Monitor cluster performance against targets:

# Check performance against requirements
caxton cluster performance

# Output:
METRIC                    TARGET      ACTUAL    STATUS
Message routing P50       100μs       87μs      ✓
Message routing P99       1ms         0.9ms     ✓
Agent startup P50         10ms        8.2ms     ✓
Gossip convergence        <5s         2.1s      ✓

See ADR-0017: Performance Requirements for targets.

Operations

Rolling Upgrades

Perform zero-downtime upgrades:

# Start upgrade process
caxton cluster upgrade --version v1.2.0

# The cluster will:
# 1. Select a canary node
# 2. Drain traffic from canary
# 3. Upgrade canary node
# 4. Monitor for 24 hours
# 5. Roll out to remaining nodes

See ADR-0018: Operational Procedures for details.

Backup and Recovery

Each node maintains its own state, but cluster-wide backups are coordinated:

# Create cluster-wide backup
caxton cluster backup --dest s3://backups/caxton/

# Restore from backup
caxton cluster restore --from s3://backups/caxton/2024-01-15/
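
To automate backups, a plain cron entry over the backup command works. A sketch (schedule and destination bucket are illustrative):

# /etc/cron.d/caxton-backup — nightly cluster backup at 02:00
0 2 * * * root caxton cluster backup --dest s3://backups/caxton/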

Scaling

Adding Nodes

# Add new node to running cluster
caxton server start \
  --node-id node-4 \
  --join <any-existing-node>:7946

# Agents automatically rebalance
caxton cluster rebalance --strategy even-distribution

Removing Nodes

# Gracefully remove a node
caxton cluster leave --node node-2 --drain-timeout 60s

# Force remove failed node
caxton cluster remove --node node-2 --force

Troubleshooting

Common Issues

Nodes Not Joining

# Check network connectivity
caxton cluster ping node-2

# Verify gossip encryption keys match
caxton cluster verify-auth

# Check firewall rules (port 7946 must be open)
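
Standard tooling covers the network checks: probe the gossip port from the joining node, and open it in the host firewall (firewalld shown below; SWIM-style gossip typically uses both UDP and TCP).

# Probe the gossip port from the joining node
nc -zv node-1.example.com 7946

# Open the port with firewalld (adapt to your firewall)
firewall-cmd --permanent --add-port=7946/tcp
firewall-cmd --permanent --add-port=7946/udp
firewall-cmd --reload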

Split Brain Detection

# Check for split brain
caxton cluster detect-partition

# If split brain detected:
WARNING: Potential split brain detected
Partition 1: [node-1, node-2] (majority)
Partition 2: [node-3] (minority)
Action: Node-3 entering degraded mode

Performance Issues

# Analyze cluster performance
caxton cluster analyze

# Suggestions:
- High gossip latency: Reduce gossip_fanout
- Slow convergence: Decrease gossip_interval
- Message delays: Check network latency between nodes

Best Practices

  1. Odd Number of Nodes: Deploy 3, 5, or 7 nodes so one side of any partition always holds a majority; an even split can leave no quorum on either side
  2. Geographic Distribution: Spread nodes across availability zones
  3. Resource Monitoring: Monitor CPU, memory, and network usage
  4. Regular Backups: Schedule automated backups
  5. Security: Always enable mTLS in production
  6. Capacity Planning: Plan for 2x peak load for headroom

Advanced Topics

Multi-Region Deployment

For global deployments:

coordination:
  cluster:
    regions:
      - name: us-east
        nodes: [node-1, node-2, node-3]
      - name: eu-west
        nodes: [node-4, node-5, node-6]

    # Cross-region settings
    cross_region:
      latency_aware_routing: true
      prefer_local_region: true
      max_cross_region_latency: 100ms

Custom Partition Strategies

Implement custom partition handling:

partition:
  strategy: custom
  custom_handler: /usr/local/bin/partition-handler
  decisions:
    - condition: "nodes < quorum"
      action: "read-only"
    - condition: "nodes == 1"
      action: "local-only"
    - condition: "critical_agents_present"
      action: "continue-critical"

Performance Tuning

SWIM Protocol Tuning

# For small clusters (< 10 nodes)
gossip_interval: 100ms
gossip_fanout: 3

# For medium clusters (10-50 nodes)
gossip_interval: 200ms
gossip_fanout: 4

# For large clusters (> 50 nodes)
gossip_interval: 500ms
gossip_fanout: 5

Network Optimization

# Use QUIC for better performance
transport:
  type: quic
  congestion_control: bbr
  max_streams: 100

Next Steps