Clustering and Distributed Operations

This guide covers running Caxton in a distributed cluster configuration for high availability and scalability.

Overview

Caxton uses a coordination-first architecture with no external dependencies such as databases or message queues. Each Caxton instance:

  • Maintains its own local state using embedded SQLite
  • Coordinates with other instances via the SWIM gossip protocol
  • Automatically discovers and routes messages to agents across the cluster
  • Handles network partitions gracefully with degraded mode operation

For architectural details, see ADR-0016 (Security Architecture), ADR-0017 (Performance Requirements), and ADR-0018 (Operational Procedures), which are referenced from the relevant sections below.

Starting a Cluster

Bootstrap First Node

The first node acts as the seed for cluster formation:

# Start the seed node
caxton server start \
  --node-id node-1 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --bootstrap
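
In production you will typically run the seed node under a process supervisor rather than in a foreground shell. Below is a minimal systemd unit sketch; the unit name, service user, and binary path are assumptions to adapt to your installation.

# /etc/systemd/system/caxton.service (sketch; paths and user are assumptions)
[Unit]
Description=Caxton server (seed node)
After=network-online.target
Wants=network-online.target

[Service]
User=caxton
ExecStart=/usr/local/bin/caxton server start \
  --node-id node-1 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --bootstrap
Restart=on-failure

[Install]
WantedBy=multi-user.target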

Join Additional Nodes

Other nodes join by connecting to the seed:

# On node 2
caxton server start \
  --node-id node-2 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946

# On node 3
caxton server start \
  --node-id node-3 \
  --bind-addr 0.0.0.0:7946 \
  --api-addr 0.0.0.0:8080 \
  --join node-1.example.com:7946,node-2.example.com:7946
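
Joining many nodes by hand gets tedious. One way to script it is a small shell loop from an admin host; this sketch assumes SSH access and hostnames of the form node-N.example.com.

# Sketch: join nodes 2 through 5 to the seed (hostnames are assumptions)
for i in 2 3 4 5; do
  ssh "node-$i.example.com" caxton server start \
    --node-id "node-$i" \
    --bind-addr 0.0.0.0:7946 \
    --api-addr 0.0.0.0:8080 \
    --join node-1.example.com:7946
done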

Verify Cluster Status

# Check cluster membership
caxton cluster members

# Example output:
NODE-ID    STATUS    ADDRESS           AGENTS    CPU    MEMORY
node-1     alive     10.0.1.10:7946    42        15%    2.1GB
node-2     alive     10.0.1.11:7946    38        12%    1.8GB
node-3     alive     10.0.1.12:7946    40        18%    2.3GB
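
During maintenance windows it can help to poll membership continuously; standard watch works with the command above.

# Refresh the membership table every 5 seconds
watch -n 5 caxton cluster members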

Configuration

Cluster Configuration File

Create /etc/caxton/cluster.yaml:

coordination:
  cluster:
    # SWIM protocol settings
    bind_addr: 0.0.0.0:7946
    advertise_addr: ${HOSTNAME}:7946

    # Seed nodes for joining
    seeds:
      - caxton-1.example.com:7946
      - caxton-2.example.com:7946
      - caxton-3.example.com:7946

    # Gossip parameters
    gossip_interval: 200ms
    gossip_fanout: 3
    probe_interval: 1s
    probe_timeout: 500ms

  # Partition handling
  partition:
    detection_timeout: 5s
    quorum_size: 2
    degraded_mode: true
    queue_writes: true
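
To try this configuration locally before provisioning real hosts, a throwaway three-node cluster can be stood up with Docker Compose. This is only a sketch: the image name is an assumption, and in Compose the service names double as hostnames for --join.

# docker-compose.yaml (sketch; the image name is an assumption)
services:
  node-1:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-1 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --bootstrap
  node-2:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-2 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --join node-1:7946
    depends_on: [node-1]
  node-3:
    image: caxton/caxton:latest
    command: caxton server start --node-id node-3 --bind-addr 0.0.0.0:7946 --api-addr 0.0.0.0:8080 --join node-1:7946
    depends_on: [node-1]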

Security Configuration

Enable mTLS for secure inter-node communication:

security:
  cluster:
    mtls:
      enabled: true
      ca_cert: /etc/caxton/ca.crt
      node_cert: /etc/caxton/certs/node.crt
      node_key: /etc/caxton/certs/node.key
      verify_peer: true

See ADR-0016: Security Architecture for details.
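
The certificates referenced above can be issued with standard openssl. The commands below are a minimal sketch (subject names and validity periods are illustrative) that writes to the paths used in the configuration.

# Create a cluster CA (paths match the configuration above)
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout /etc/caxton/ca.key -out /etc/caxton/ca.crt \
  -subj "/CN=caxton-cluster-ca"

# Issue a node certificate signed by that CA
openssl req -newkey rsa:4096 -nodes \
  -keyout /etc/caxton/certs/node.key -out /etc/caxton/certs/node.csr \
  -subj "/CN=node-1.example.com"
openssl x509 -req -days 365 -in /etc/caxton/certs/node.csr \
  -CA /etc/caxton/ca.crt -CAkey /etc/caxton/ca.key -CAcreateserial \
  -out /etc/caxton/certs/node.crt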

Agent Distribution

Agents are automatically distributed across the cluster:

# Deploy an agent (automatically placed on optimal node)
caxton deploy agent.wasm --name my-agent

# Deploy with placement preferences
caxton deploy agent.wasm \
  --name my-agent \
  --placement-strategy least-loaded \
  --prefer-nodes node-1,node-2

# Force deployment to specific node
caxton deploy agent.wasm \
  --name my-agent \
  --target-node node-3

Agent Discovery

Agents can communicate regardless of which node they’re on:

# Send message to agent (routing handled automatically)
caxton message send \
  --to remote-agent \
  --content "Hello from anywhere in the cluster!"

# The cluster automatically:
# 1. Discovers which node hosts 'remote-agent'
# 2. Routes the message through the cluster
# 3. Delivers to the target agent

High Availability

Automatic Failover

When a node fails, its agents are automatically redistributed:

# Monitor failover behavior
caxton cluster watch

# Example during node failure:
[INFO] Node node-2 detected as failed
[INFO] Redistributing 38 agents from node-2
[INFO] Agent 'processor-1' migrated to node-1
[INFO] Agent 'worker-5' migrated to node-3
[INFO] All agents successfully redistributed (2.3s)

Network Partition Handling

Caxton handles network partitions gracefully:

Majority Partition

Nodes in the majority partition continue normal operations:

# On majority side (2 of 3 nodes)
caxton cluster status
# Status: HEALTHY (majority partition)
# Operations: READ-WRITE
# Nodes: 2/3 active

Minority Partition

Nodes in the minority enter degraded mode:

# On minority side (1 of 3 nodes)
caxton cluster status
# Status: DEGRADED (minority partition)
# Operations: READ-ONLY
# Nodes: 1/3 active
# Queued writes: 42

When the partition heals, queued operations are replayed automatically.

Monitoring

Cluster Metrics

Key metrics to monitor:

# Cluster health metrics
curl http://localhost:9090/metrics | grep caxton_cluster

# Key metrics:
caxton_cluster_nodes_total          3
caxton_cluster_nodes_alive          3
caxton_cluster_agents_total         120
caxton_cluster_gossip_latency_ms    0.8
caxton_cluster_convergence_time_ms  423
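
If you scrape these metrics with Prometheus, a basic alert on lost nodes can be built directly from the gauges above. A minimal alerting-rule sketch (the wait time and severity label are illustrative):

# caxton-alerts.yaml (sketch; thresholds and labels are illustrative)
groups:
  - name: caxton-cluster
    rules:
      - alert: CaxtonClusterNodeDown
        expr: caxton_cluster_nodes_alive < caxton_cluster_nodes_total
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Caxton cluster is missing one or more nodes"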

Performance Monitoring

Monitor cluster performance against targets:

# Check performance against requirements
caxton cluster performance

# Output:
METRIC                    TARGET      ACTUAL    STATUS
Message routing P50       100μs       87μs      ✓
Message routing P99       1ms         0.9ms     ✓
Agent startup P50         10ms        8.2ms     ✓
Gossip convergence        <5s         2.1s      ✓

See ADR-0017: Performance Requirements for targets.

Operations

Rolling Upgrades

Perform zero-downtime upgrades:

# Start upgrade process
caxton cluster upgrade --version v1.2.0

# The cluster will:
# 1. Select a canary node
# 2. Drain traffic from canary
# 3. Upgrade canary node
# 4. Monitor for 24 hours
# 5. Roll out to remaining nodes

See ADR-0018: Operational Procedures for details.

Backup and Recovery

Each node maintains its own state, but cluster-wide backups are coordinated:

# Create cluster-wide backup
caxton cluster backup --dest s3://backups/caxton/

# Restore from backup
caxton cluster restore --from s3://backups/caxton/2024-01-15/
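
To automate backups, a plain cron entry over the backup command works. A sketch (schedule and destination bucket are illustrative):

# /etc/cron.d/caxton-backup — nightly cluster backup at 02:00
0 2 * * * root caxton cluster backup --dest s3://backups/caxton/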

Scaling

Adding Nodes

# Add new node to running cluster
caxton server start \
  --node-id node-4 \
  --join <any-existing-node>:7946

# Agents automatically rebalance
caxton cluster rebalance --strategy even-distribution

Removing Nodes

# Gracefully remove a node
caxton cluster leave --node node-2 --drain-timeout 60s

# Force remove failed node
caxton cluster remove --node node-2 --force

Troubleshooting

Common Issues

Nodes Not Joining

# Check network connectivity
caxton cluster ping node-2

# Verify gossip encryption keys match
caxton cluster verify-auth

# Check firewall rules (port 7946 must be open)
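
Standard tooling covers the network checks: probe the gossip port from the joining node, and open it in the host firewall (firewalld shown below; SWIM-style gossip typically uses both UDP and TCP).

# Probe the gossip port from the joining node
nc -zv node-1.example.com 7946

# Open the port with firewalld (adapt to your firewall)
firewall-cmd --permanent --add-port=7946/tcp
firewall-cmd --permanent --add-port=7946/udp
firewall-cmd --reload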

Split Brain Detection

# Check for split brain
caxton cluster detect-partition

# If split brain detected:
WARNING: Potential split brain detected
Partition 1: [node-1, node-2] (majority)
Partition 2: [node-3] (minority)
Action: Node-3 entering degraded mode

Performance Issues

# Analyze cluster performance
caxton cluster analyze

# Suggestions:
- High gossip latency: Reduce gossip_fanout
- Slow convergence: Decrease gossip_interval
- Message delays: Check network latency between nodes

Best Practices

  1. Odd Number of Nodes: Deploy 3, 5, or 7 nodes so one side of any partition always holds a majority; an even split can leave no quorum on either side
  2. Geographic Distribution: Spread nodes across availability zones
  3. Resource Monitoring: Monitor CPU, memory, and network usage
  4. Regular Backups: Schedule automated backups
  5. Security: Always enable mTLS in production
  6. Capacity Planning: Plan for 2x peak load for headroom

Advanced Topics

Multi-Region Deployment

For global deployments:

coordination:
  cluster:
    regions:
      - name: us-east
        nodes: [node-1, node-2, node-3]
      - name: eu-west
        nodes: [node-4, node-5, node-6]

    # Cross-region settings
    cross_region:
      latency_aware_routing: true
      prefer_local_region: true
      max_cross_region_latency: 100ms

Custom Partition Strategies

Implement custom partition handling:

partition:
  strategy: custom
  custom_handler: /usr/local/bin/partition-handler
  decisions:
    - condition: "nodes < quorum"
      action: "read-only"
    - condition: "nodes == 1"
      action: "local-only"
    - condition: "critical_agents_present"
      action: "continue-critical"

Performance Tuning

SWIM Protocol Tuning

# For small clusters (< 10 nodes)
gossip_interval: 100ms
gossip_fanout: 3

# For medium clusters (10-50 nodes)
gossip_interval: 200ms
gossip_fanout: 4

# For large clusters (> 50 nodes)
gossip_interval: 500ms
gossip_fanout: 5

Network Optimization

# Use QUIC for better performance
transport:
  type: quic
  congestion_control: bbr
  max_streams: 100

Next Steps