Production Deployment Guide

Comprehensive guide for deploying Caxton multi-agent systems in production environments with high availability, load balancing, and backup strategies.

This guide covers deploying Caxton multi-agent systems in production environments, including system requirements, installation methods, configuration best practices, and operational considerations.

System Requirements

Minimum Requirements

  • CPU: 4 cores (x86_64 or ARM64)
  • Memory: 8 GB RAM
  • Storage: 50 GB SSD
  • Network: 1 Gbps connection
  • OS: Linux (Ubuntu 22.04+, RHEL 8+, CentOS 8+)

Recommended Requirements

  • CPU: 16+ cores with AVX2 support
  • Memory: 32+ GB RAM
  • Storage: 200+ GB NVMe SSD
  • Network: 10 Gbps connection with low latency
  • OS: Ubuntu 22.04 LTS or RHEL 9
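
Before installing, it can help to confirm that a host meets the minimum figures listed above. The following is a minimal sketch, assuming a Linux host with nproc and /proc/meminfo; the function and variable names are illustrative and are not part of Caxton.

```shell
#!/usr/bin/env bash
# Preflight sketch: compare host resources against the documented minimums.
# meets_minimum, detect_* and MIN_* are hypothetical helpers, not Caxton tooling.

MIN_CORES=4     # minimum CPU cores from the list above
MIN_MEM_GB=8    # minimum RAM in GB from the list above

# meets_minimum CORES MEM_GB -> exit 0 if both values satisfy the minimums
meets_minimum() {
  local cores=$1 mem_gb=$2
  [ "$cores" -ge "$MIN_CORES" ] && [ "$mem_gb" -ge "$MIN_MEM_GB" ]
}

# Number of CPU cores available to this process (Linux, coreutils).
detect_cores() { nproc; }

# Installed memory in whole GB. MemTotal is reported in kB and is slightly
# below the installed amount, so the value is rounded up.
detect_mem_gb() {
  awk '/^MemTotal:/ { printf "%d\n", ($2 + 1048575) / 1048576 }' /proc/meminfo
}
```

A typical call would be: meets_minimum "$(detect_cores)" "$(detect_mem_gb)" || echo "host is below the minimum requirements" >&2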

WebAssembly Runtime Requirements

  • WASI Support: Full WASI preview 1 compatibility
  • Memory Management: Support for linear memory up to 4 GB per instance
  • Multi-threading: WASM threads support for concurrent agent execution
  • Security: Sandboxing with capability-based security model

Installation Methods

Docker Deployment

Single Node Setup

# Pull the official Caxton image
docker pull caxton/caxton:latest

# Create data directory (place config.toml here before starting the container)
sudo mkdir -p /opt/caxton/data

# Run Caxton container
docker run -d \
  --name caxton-runtime \
  --restart unless-stopped \
  -p 8080:8080 \
  -p 9090:9090 \
  -v /opt/caxton/data:/data \
  -e CAXTON_CONFIG_PATH=/data/config.toml \
  caxton/caxton:latest
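
Once the container is running, a short polling loop can confirm that the runtime is answering before traffic is sent to it. This is a sketch under the assumption that the HTTP API exposes /health on port 8080, as the Compose health check in this guide does; the helper names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: poll the health endpoint until the runtime answers or attempts run out.

# probe URL -> exit 0 if the endpoint answers with a status below 400
probe() { curl -fsS -o /dev/null --max-time 5 "$1"; }

# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  local url=$1 attempts=${2:-10} delay=${3:-3} i
  for ((i = 1; i <= attempts; i++)); do
    if probe "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}
```

For example: wait_for_health http://localhost:8080/health 20 3 || docker logs --tail 50 caxton-runtime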

Docker Compose Configuration

# docker-compose.yml
version: '3.8'
services:
  caxton-runtime:
    image: caxton/caxton:latest
    restart: unless-stopped
    ports:
      - "8080:8080"  # HTTP API
      - "9090:9090"  # Metrics
      - "4317:4317"  # OTLP gRPC
    volumes:
      - ./config:/config:ro
      - caxton-data:/data
    environment:
      - CAXTON_CONFIG_PATH=/config/production.toml
      - CAXTON_LOG_LEVEL=info
      - CAXTON_METRICS_ENABLED=true
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  caxton-data:

Kubernetes Deployment

Namespace and ConfigMap

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: caxton-system

---
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: caxton-config
  namespace: caxton-system
data:
  production.toml: |
    [runtime]
    max_agents = 1000
    wasm_memory_limit = "512MB"
    execution_timeout = "30s"

    [networking]
    bind_address = "0.0.0.0:8080"
    metrics_address = "0.0.0.0:9090"

    [observability]
    enable_tracing = true
    # Jaeger accepts OTLP on 4317 (gRPC) and 4318 (HTTP); 14268 is its Thrift endpoint
    otlp_endpoint = "http://jaeger-collector:4317"

    [coordination]
    # Local state storage
    local_state_path = "/data/local.db"

    # Cluster coordination
    cluster_enabled = true
    bind_addr = "0.0.0.0:7946"
    gossip_interval = "200ms"

Deployment Configuration

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: caxton-runtime
  namespace: caxton-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: caxton-runtime
  template:
    metadata:
      labels:
        app: caxton-runtime
    spec:
      containers:
      - name: caxton
        image: caxton/caxton:latest
        ports:
        - containerPort: 8080
        - containerPort: 9090
        env:
        - name: CAXTON_CONFIG_PATH
          value: "/config/production.toml"
        volumeMounts:
        - name: config-volume
          mountPath: /config
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: config-volume
        configMap:
          name: caxton-config

---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: caxton-service
  namespace: caxton-system
spec:
  selector:
    app: caxton-runtime
  ports:
  - name: http
    port: 8080
    targetPort: 8080
  - name: metrics
    port: 9090
    targetPort: 9090
  type: ClusterIP
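
The manifests above can be applied in order and the rollout watched from a small wrapper. This is a sketch that assumes each manifest is saved under the file name given in its leading comment, inside a directory called kubernetes/; both the layout and the function name are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: apply the manifests and wait for the Deployment to finish rolling out.
# KUBECTL can be overridden (for example KUBECTL="echo kubectl") for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# deploy_caxton [MANIFEST_DIR]
deploy_caxton() {
  local dir=${1:-kubernetes}
  # The namespace goes first so the namespaced objects have somewhere to land.
  $KUBECTL apply -f "$dir/namespace.yaml" &&
  $KUBECTL apply -f "$dir/configmap.yaml" &&
  $KUBECTL apply -f "$dir/deployment.yaml" &&
  $KUBECTL apply -f "$dir/service.yaml" &&
  $KUBECTL rollout status deployment/caxton-runtime \
    --namespace caxton-system --timeout=180s
}
```

Running deploy_caxton with KUBECTL="echo kubectl" prints the commands without touching a cluster, which is a cheap way to review the order first.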

Bare Metal Installation

System Preparation

# Install dependencies
sudo apt update && sudo apt install -y \
  curl wget \
  build-essential \
  pkg-config \
  libssl-dev

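# Install Rust toolchain (-y skips the interactive prompt)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

# Install WebAssembly targets
# (wasm32-wasi was renamed to wasm32-wasip1 and removed in Rust 1.84)
rustup target add wasm32-wasip1
rustup target add wasm32-unknown-unknown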

Binary Installation

# Download a pinned release (adjust CAXTON_VERSION as needed)
CAXTON_VERSION="v0.2.0"
curl -fL "https://github.com/caxton/caxton/releases/download/${CAXTON_VERSION}/caxton-linux-amd64.tar.gz" \
  | sudo tar xz -C /usr/local/bin/

# Create system user
sudo useradd --system --shell /bin/false --home-dir /opt/caxton caxton

# Create directories
sudo mkdir -p /opt/caxton/{bin,config,data,logs}
sudo chown -R caxton:caxton /opt/caxton

# Create systemd service
sudo tee /etc/systemd/system/caxton.service > /dev/null << EOF
[Unit]
Description=Caxton Multi-Agent Runtime
After=network.target

[Service]
Type=exec
User=caxton
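Group=caxton
# Load JWT_SECRET and related variables if the env file exists
EnvironmentFile=-/opt/caxton/config/caxton.env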
ExecStart=/usr/local/bin/caxton --config /opt/caxton/config/production.toml
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
SyslogIdentifier=caxton

# Security settings
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/opt/caxton/data /opt/caxton/logs

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable caxton
sudo systemctl start caxton

Configuration Best Practices

Production Configuration

# /opt/caxton/config/production.toml

[runtime]
# Agent execution limits
max_agents = 1000
max_concurrent_executions = 100
wasm_memory_limit = "512MB"
wasm_stack_limit = "1MB"
execution_timeout = "30s"
agent_idle_timeout = "300s"

# Resource management
cpu_quota = "8.0"  # CPU cores
memory_limit = "16GB"
temp_storage_limit = "10GB"

[networking]
# Bind addresses
bind_address = "0.0.0.0:8080"
metrics_address = "0.0.0.0:9090"
admin_address = "127.0.0.1:8081"

# Connection limits
max_connections = 10000
connection_timeout = "30s"
request_timeout = "60s"
keepalive_timeout = "60s"

# TLS configuration
tls_enabled = true
tls_cert_path = "/opt/caxton/config/server.crt"
tls_key_path = "/opt/caxton/config/server.key"
tls_ca_path = "/opt/caxton/config/ca.crt"

[coordination]
# Local state storage (per instance)
local_state_path = "/opt/caxton/data/local.db"
journal_mode = "WAL"

# Cluster coordination
cluster_enabled = true
bind_addr = "0.0.0.0:7946"
advertise_addr = "auto"
seeds = [
  "caxton-node-1:7946",
  "caxton-node-2:7946",
  "caxton-node-3:7946"
]
gossip_interval = "200ms"
probe_interval = "1s"

[observability]
# Logging
log_level = "info"
log_format = "json"
log_file = "/opt/caxton/logs/caxton.log"
log_rotation = "daily"
log_retention_days = 30

# Metrics
enable_metrics = true
metrics_prefix = "caxton"
histogram_buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

# Tracing
enable_tracing = true
trace_sample_rate = 0.1
# Jaeger accepts OTLP on 4317 (gRPC) and 4318 (HTTP); 14268 is its Thrift endpoint
otlp_endpoint = "http://jaeger-collector:4317"
otlp_timeout = "10s"

[security]
# Authentication
auth_enabled = true
jwt_secret = "${JWT_SECRET}"
jwt_expiry = "1h"
api_key_header = "X-API-Key"

# Rate limiting
rate_limit_enabled = true
rate_limit_per_second = 100
rate_limit_burst = 200

# CORS
cors_enabled = true
cors_origins = ["https://dashboard.example.com"]
cors_methods = ["GET", "POST", "PUT", "DELETE"]

Environment Variables

# /opt/caxton/config/caxton.env
CAXTON_CONFIG_PATH=/opt/caxton/config/production.toml
CAXTON_LOG_LEVEL=info
JWT_SECRET=your-jwt-secret
OTLP_ENDPOINT=http://jaeger:4317
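
The JWT_SECRET value above is a placeholder and should be replaced with a random value that only the service account can read. A minimal sketch, assuming only coreutils is available; the helper names are illustrative, and the file written here contains a trimmed set of the variables shown above.

```shell
#!/usr/bin/env bash
# Sketch: generate a random hex secret and write an env file with tight permissions.

# generate_secret [BYTES] -> prints BYTES random bytes as lowercase hex (default 32)
generate_secret() {
  local bytes=${1:-32}
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# write_env_file PATH -> writes the variables to PATH.
# umask 077 makes a newly created file readable by its owner only;
# the mode of a file that already exists is left unchanged.
write_env_file() {
  local path=$1
  (
    umask 077
    {
      echo "CAXTON_CONFIG_PATH=/opt/caxton/config/production.toml"
      echo "CAXTON_LOG_LEVEL=info"
      echo "JWT_SECRET=$(generate_secret)"
    } > "$path"
  )
}
```

For example, run write_env_file /opt/caxton/config/caxton.env as a user with write access to that directory, then hand ownership to the caxton account.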

High Availability Setup

Multi-Node Cluster

Load Balancer Configuration (HAProxy)

# /etc/haproxy/haproxy.cfg
global
    daemon
    log stdout local0 info

defaults
    mode http
    log global
    option httplog
    option dontlognull
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend caxton_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/caxton.pem
    redirect scheme https if !{ ssl_fc }
    default_backend caxton_backend

backend caxton_backend
    balance roundrobin
    option httpchk GET /health
    server caxton1 10.0.1.10:8080 check
    server caxton2 10.0.1.11:8080 check
    server caxton3 10.0.1.12:8080 check

frontend caxton_metrics
    bind *:9090
    default_backend caxton_metrics_backend

backend caxton_metrics_backend
    balance roundrobin
    server caxton1 10.0.1.10:9090 check
    server caxton2 10.0.1.11:9090 check
    server caxton3 10.0.1.12:9090 check

Cluster Coordination Setup

# Each Caxton instance automatically discovers others via SWIM protocol
# No external coordination service required
# Instances share agent registry through gossip
# Message routing works without shared state
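
Automatic discovery only works if the nodes can reach each other on the gossip port. The sketch below checks TCP reachability of each seed from the current node. It assumes the seeds listen on TCP at the configured port; gossip protocols commonly use UDP as well, which this check does not cover, and the helper names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: check TCP reachability of each seed's gossip port from this node.

# check_port HOST PORT [TIMEOUT_SECONDS] -> exit 0 if a TCP connection succeeds
check_port() {
  local host=$1 port=$2 wait=${3:-2}
  # bash opens a TCP socket when a redirection targets /dev/tcp/HOST/PORT
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_seeds HOST:PORT... -> one status line per seed; fails if any is unreachable
check_seeds() {
  local seed rc=0
  for seed in "$@"; do
    if check_port "${seed%:*}" "${seed##*:}"; then
      echo "ok   $seed"
    else
      echo "FAIL $seed"
      rc=1
    fi
  done
  return "$rc"
}
```

For example: check_seeds caxton-node-1:7946 caxton-node-2:7946 caxton-node-3:7946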

Health Checks and Failover

# kubernetes/healthcheck-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: caxton-healthcheck
  namespace: caxton-system
spec:
  selector:
    app: caxton-runtime
  ports:
  - port: 8080
    targetPort: 8080
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserve source IP
  healthCheckNodePort: 32000   # Custom health check port

Load Balancing

Nginx Configuration

# /etc/nginx/sites-available/caxton
upstream caxton_backend {
    least_conn;
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    listen 443 ssl http2;
    server_name caxton.example.com;

    ssl_certificate /etc/ssl/certs/caxton.crt;
    ssl_certificate_key /etc/ssl/private/caxton.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384;

    location / {
        proxy_pass http://caxton_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
    }

    location /health {
        access_log off;
        proxy_pass http://caxton_backend/health;
    }
}

Session Affinity

For stateful agent sessions, configure session affinity:

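# Option 1: route by client IP (use in place of least_conn)
upstream caxton_backend {
    ip_hash;
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}

# Option 2: consistent hashing (define only one caxton_backend upstream).
# Hashing on $request_uri pins each URI, not each client, to a backend.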
upstream caxton_backend {
    hash $request_uri consistent;
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}
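
Both directives above rest on the same idea: hash a key and use the result to choose a backend, so the same key keeps reaching the same server. The sketch below illustrates that idea only. It is not nginx's algorithm (ip_hash hashes the first three octets of an IPv4 address, and the consistent parameter uses ketama hashing); cksum is used here simply because it ships with coreutils.

```shell
#!/usr/bin/env bash
# Sketch: hash-based backend selection, the idea behind ip_hash and hash.
# This is an illustration, not nginx's implementation.

BACKENDS=(10.0.1.10:8080 10.0.1.11:8080 10.0.1.12:8080)

# pick_backend KEY -> prints the backend that KEY maps to
pick_backend() {
  local key=$1 sum
  # cksum prints "<crc> <byte count>"; keep only the CRC
  sum=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo "${BACKENDS[sum % ${#BACKENDS[@]}]}"
}
```

Plain modulo hashing like this remaps most keys when a server is added or removed, which is the problem the consistent parameter is designed to limit.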

Backup Strategies

Data Backup

Local State Backup Script

#!/bin/bash
# /opt/caxton/scripts/backup-state.sh
set -euo pipefail  # stop at the first failed step so a bad backup is not reported as complete

BACKUP_DIR="/opt/caxton/backups"
DATE=$(date +%Y%m%d_%H%M%S)
STATE_PATH="/opt/caxton/data/local.db"

mkdir -p "$BACKUP_DIR"

# Create SQLite backup
sqlite3 "$STATE_PATH" ".backup '$BACKUP_DIR/state_$DATE.db'"

# Compress backup
gzip "$BACKUP_DIR/state_$DATE.db"

# Clean old backups (keep last 7 days)
find "$BACKUP_DIR" -name "state_*.db.gz" -mtime +7 -delete

echo "State backup completed: $BACKUP_DIR/state_$DATE.db.gz"
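
A backup is only useful if it can be read back. The following sketch checks a compressed state backup in two stages: that the gzip stream is intact, and that the decompressed database passes SQLite's integrity check. The helper names are illustrative; the second check needs the sqlite3 CLI.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a compressed state backup before relying on it.

# check_archive FILE.gz -> exit 0 if the gzip stream is intact
check_archive() { gzip -t "$1" 2>/dev/null; }

# check_sqlite FILE.gz -> exit 0 if the decompressed database passes integrity_check
check_sqlite() {
  local tmp result
  tmp=$(mktemp) || return 1
  gunzip -c "$1" > "$tmp" || { rm -f "$tmp"; return 1; }
  # PRAGMA integrity_check prints the single line "ok" for a healthy database
  result=$(sqlite3 "$tmp" "PRAGMA integrity_check;" 2>/dev/null)
  rm -f "$tmp"
  [ "$result" = "ok" ]
}
```

For example: f=/opt/caxton/backups/state_20240301_120000.db.gz; check_archive "$f" && check_sqlite "$f" || echo "backup $f failed verification" >&2 (the file name is a made-up example).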

Configuration Backup

#!/bin/bash
# /opt/caxton/scripts/backup-config.sh
set -euo pipefail  # stop at the first failed step so a bad backup is not reported as complete

BACKUP_DIR="/opt/caxton/backups"
DATE=$(date +%Y%m%d_%H%M%S)
CONFIG_DIR="/opt/caxton/config"

mkdir -p "$BACKUP_DIR"

# Backup configuration files
tar -czf "$BACKUP_DIR/config_$DATE.tar.gz" -C "$(dirname "$CONFIG_DIR")" "$(basename "$CONFIG_DIR")"

# Clean old backups
find "$BACKUP_DIR" -name "config_*.tar.gz" -mtime +30 -delete

echo "Configuration backup completed: $BACKUP_DIR/config_$DATE.tar.gz"

Automated Backup with Systemd

# /etc/systemd/system/caxton-backup.service
[Unit]
Description=Caxton Backup Service
After=caxton.service

[Service]
Type=oneshot
User=caxton
ExecStart=/opt/caxton/scripts/backup-state.sh
ExecStartPost=/opt/caxton/scripts/backup-config.sh

# /etc/systemd/system/caxton-backup.timer
[Unit]
Description=Run Caxton backup daily
# No Requires= here: it would start the backup whenever the timer itself starts

[Timer]
OnCalendar=daily
Persistent=true
Unit=caxton-backup.service

[Install]
WantedBy=timers.target
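
The timer does nothing until it is enabled. A minimal sketch of the activation steps, wrapped so they can be previewed first; the function name and the SYSTEMCTL override are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: activate the backup timer and confirm that it is scheduled.
# SYSTEMCTL can be overridden (for example SYSTEMCTL="echo systemctl") for a dry run.
SYSTEMCTL=${SYSTEMCTL:-sudo systemctl}

enable_backup_timer() {
  # Pick up the new unit files
  $SYSTEMCTL daemon-reload &&
  # Start the timer now and on every boot
  $SYSTEMCTL enable --now caxton-backup.timer &&
  # Show the next scheduled run
  $SYSTEMCTL list-timers caxton-backup.timer
}
```

The backup scripts must be executable by the caxton user for the service to run them.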

Disaster Recovery

Recovery Procedures

  1. Local State Recovery:
    # Stop Caxton service
    sudo systemctl stop caxton
    
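    # Remove stale WAL files, then restore the SQLite database
    sudo rm -f /opt/caxton/data/local.db-wal /opt/caxton/data/local.db-shm
    gunzip -c state_backup.db.gz | sudo tee /opt/caxton/data/local.db > /dev/null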
    
    # Set permissions
    sudo chown caxton:caxton /opt/caxton/data/local.db
    
    # Start service
    sudo systemctl start caxton
    
  2. Configuration Recovery:
    # Extract configuration backup
    sudo tar -xzf config_backup.tar.gz -C /opt/caxton/
    
    # Set permissions
    sudo chown -R caxton:caxton /opt/caxton/config
    
    # Restart service
    sudo systemctl restart caxton
    
  3. Full System Recovery:
    # Deploy infrastructure
    kubectl apply -f kubernetes/
    
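    # Wait for pods to be ready
    kubectl wait --for=condition=ready pod -l app=caxton-runtime -n caxton-system --timeout=300s
    
    # Restore data (a Deployment's pods have generated names, so target the Deployment)
    kubectl exec -it -n caxton-system deploy/caxton-runtime -- /scripts/restore-data.sh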
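
The local state recovery in step 1 uses a placeholder file name. A small helper can pick the newest backup written by the state backup script instead. This is a sketch that relies on the state_YYYYMMDD_HHMMSS.db.gz naming used by that script; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: locate the newest state backup so a restore does not depend on a hand-typed name.

# latest_backup DIR -> prints the newest state_*.db.gz in DIR, fails if none exist
latest_backup() {
  local dir=$1 newest
  # The timestamps in the file names sort lexicographically,
  # so the last name in sorted order is the most recent backup.
  newest=$(find "$dir" -maxdepth 1 -name 'state_*.db.gz' | sort | tail -n 1)
  [ -n "$newest" ] && echo "$newest"
}
```

For example: backup=$(latest_backup /opt/caxton/backups) || { echo "no state backup found" >&2; exit 1; }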
    

Performance Optimization

Resource Limits

# Fine-tuned resource limits
[runtime]
max_agents = 2000
max_concurrent_executions = 200
wasm_memory_limit = "1GB"
agent_startup_timeout = "10s"
agent_shutdown_timeout = "5s"

# Memory management
gc_interval = "60s"
memory_pressure_threshold = 0.8
agent_memory_reclaim = true

Monitoring and Alerts

Set up monitoring for:

  • CPU and memory usage
  • Agent execution metrics
  • Network I/O
  • Storage I/O
  • Error rates
  • Response times

Example Prometheus alert rules are provided in the Monitoring Guide.
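
For a quick check without a full monitoring stack, a single value can be read straight from the metrics endpoint. The sketch below parses Prometheus text-format output; the metric name in the usage example is hypothetical, so check the /metrics output for the names Caxton actually exports.

```shell
#!/usr/bin/env bash
# Sketch: read a single value from Prometheus text-format output.

# metric_value NAME -> reads exposition text on stdin and prints the value of the
# unlabelled sample called NAME; fails if no such sample is present
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { exit !found }
  '
}
```

For example: curl -fsS http://localhost:9090/metrics | metric_value caxton_active_agents (caxton_active_agents is a made-up name used for illustration).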

Troubleshooting

Common Issues

  1. High Memory Usage:
    • Check agent memory limits
    • Monitor for memory leaks
    • Adjust garbage collection settings
  2. Agent Startup Failures:
    • Verify WASM module validity
    • Check resource limits
    • Review error logs
  3. Network Connectivity Issues:
    • Verify firewall rules
    • Check DNS resolution
    • Test load balancer health

Debugging Tools

# Check service status
systemctl status caxton

# View logs
journalctl -u caxton -f

# Monitor metrics
curl http://localhost:9090/metrics

# Agent debugging
caxton debug --agent-id <agent-id>

Security Considerations

  • Enable TLS encryption for all communications
  • Use strong authentication mechanisms
  • Implement proper network segmentation
  • Apply security updates and patches regularly
  • Monitor for suspicious activity

For detailed security guidelines, see the Security Guide.