Monitoring and Observability Guide

Comprehensive monitoring and observability setup for Caxton multi-agent systems using OpenTelemetry, Prometheus, Jaeger, and custom dashboards.

This guide covers setting up comprehensive monitoring and observability for Caxton multi-agent systems, including metrics collection, distributed tracing, log aggregation, alerting, and performance monitoring.

Observability Architecture

Caxton’s observability stack is built on industry-standard tools (a minimal wiring sketch follows the list):

  • OpenTelemetry: Unified telemetry collection and export
  • Prometheus: Metrics storage and alerting
  • Jaeger: Distributed tracing
  • Grafana: Visualization and dashboards
  • Fluentd/Loki: Log aggregation
  • AlertManager: Alert routing and management
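
A minimal Docker Compose sketch of how these components fit together is shown below. It only wires the pieces to each other using the configuration files developed later in this guide; the image tags, port mappings, and file paths are assumptions, not a supported deployment.

# docker-compose.observability.yml (illustrative wiring only)
version: "3.8"
services:
  otel-collector:
    # contrib distribution is needed for the loki exporter and tail_sampling processor
    image: otel/opentelemetry-collector-contrib:0.88.0
    command: ["--config=/etc/otelcol/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from Caxton
      - "8889:8889"   # Prometheus scrape endpoint

  prometheus:
    image: prom/prometheus:v2.47.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./caxton_rules.yml:/etc/prometheus/caxton_rules.yml
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  jaeger:
    image: jaegertracing/all-in-one:1.45
    ports:
      - "16686:16686"   # UI
      - "14250:14250"   # gRPC collector

  loki:
    image: grafana/loki:2.9.0
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:10.1.2
    ports:
      - "3000:3000"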

OpenTelemetry Integration

Configuration

Caxton has built-in OpenTelemetry support that can be configured through the main configuration file:

# caxton.toml
[observability]
# Enable OpenTelemetry instrumentation
enable_tracing = true
enable_metrics = true
enable_logging = true

# OTLP export configuration
otlp_endpoint = "http://otel-collector:4317"
otlp_timeout = "10s"
otlp_headers = { "api-key" = "your-api-key" }

# Sampling configuration
trace_sample_rate = 0.1  # Sample 10% of traces
metrics_export_interval = "10s"

# Resource attributes
service_name = "caxton-runtime"
service_version = "0.2.0"
deployment_environment = "production"

OpenTelemetry Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: error_sampling
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: random_sampling
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: caxton
    const_labels:
      environment: production

  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: caxton

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [jaeger]

    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]

Metrics Collection

Core Metrics

Caxton automatically exposes the following metric categories; example PromQL queries that build on them appear after the lists:

Runtime Metrics

  • caxton_agents_total: Total number of agents
  • caxton_agents_active: Currently active agents
  • caxton_agent_executions_total: Total agent executions
  • caxton_agent_execution_duration_seconds: Agent execution time
  • caxton_wasm_memory_usage_bytes: WASM memory usage per agent
  • caxton_runtime_memory_usage_bytes: Runtime memory usage
  • caxton_runtime_cpu_usage_ratio: CPU utilization

Message Protocol Metrics

  • caxton_messages_sent_total: FIPA messages sent
  • caxton_messages_received_total: FIPA messages received
  • caxton_message_processing_duration_seconds: Message processing time
  • caxton_message_queue_size: Current message queue size
  • caxton_message_errors_total: Message processing errors

HTTP API Metrics

  • caxton_http_requests_total: HTTP requests by method/status
  • caxton_http_request_duration_seconds: HTTP request duration
  • caxton_http_active_connections: Active HTTP connections
  • caxton_websocket_connections: Active WebSocket connections

System Resource Metrics

  • caxton_disk_usage_bytes: Disk space usage
  • caxton_network_bytes_total: Network I/O
  • caxton_file_descriptors: Open file descriptors
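
As a starting point, the PromQL queries below combine these metrics into common health signals. They assume the `status` and `agent_type` labels referenced elsewhere in this guide and the standard Prometheus histogram `_bucket` convention.

# Agent execution failure ratio over the last 5 minutes
sum(rate(caxton_agent_executions_total{status="error"}[5m]))
  / sum(rate(caxton_agent_executions_total[5m]))

# 95th percentile agent execution time
histogram_quantile(0.95, sum by (le) (rate(caxton_agent_execution_duration_seconds_bucket[5m])))

# Message throughput by agent type
sum by (agent_type) (rate(caxton_messages_received_total[5m]))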

Custom Metrics Configuration

[observability.metrics]
# Custom histogram buckets for latency metrics
http_duration_buckets = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
agent_execution_buckets = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]

# Metric labels to include
include_labels = ["agent_type", "agent_id", "message_type"]
exclude_labels = ["sensitive_data"]

# Export configuration
export_interval = "15s"
export_timeout = "5s"

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "caxton_rules.yml"

scrape_configs:
  - job_name: 'caxton-runtime'
    static_configs:
      - targets: ['caxton-runtime:9090']
    scrape_interval: 10s
    metrics_path: /metrics

  - job_name: 'caxton-kubernetes'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Distributed Tracing

Trace Context Propagation

Caxton automatically propagates trace context through the following channels (a propagation sketch follows the list):

  • HTTP headers (W3C Trace Context)
  • FIPA message metadata
  • Internal agent communications
  • Database operations
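
For custom transports, the same context can be injected and extracted explicitly. The sketch below is illustrative rather than the runtime's internal implementation: `FipaMessage` and its string metadata map are hypothetical, and a W3C `TraceContextPropagator` is assumed to be registered as the global propagator.

// Sketch: carrying W3C trace context in message metadata (hypothetical types).
use std::collections::HashMap;

use opentelemetry::global;
use opentelemetry::Context;

// Hypothetical message shape with a string metadata map.
struct FipaMessage {
    metadata: HashMap<String, String>,
    // ... performative, content, receiver, etc.
}

// Before sending: write `traceparent`/`tracestate` into the metadata.
fn inject_trace_context(cx: &Context, msg: &mut FipaMessage) {
    global::get_text_map_propagator(|propagator| {
        propagator.inject_context(cx, &mut msg.metadata);
    });
}

// On receipt: rebuild the remote context so new spans link to the sender.
fn extract_trace_context(msg: &FipaMessage) -> Context {
    global::get_text_map_propagator(|propagator| propagator.extract(&msg.metadata))
}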

Custom Spans

// Example: Adding custom spans in agent code
use opentelemetry::global;
use opentelemetry::trace::{Span, Status, Tracer};
use opentelemetry::KeyValue;

#[tracing::instrument(name = "agent.execute_task", skip(self, task))]
async fn execute_task(&self, task: Task) -> Result<TaskResult> {
    // Create a child span covering the task-processing phase
    let tracer = global::tracer("caxton-agent");
    let mut span = tracer.start("task.processing");
    span.set_attribute(KeyValue::new("task.type", task.task_type.clone()));
    span.set_attribute(KeyValue::new("task.priority", task.priority as i64));

    // Process task
    let result = self.process_task_internal(task).await;

    // Record the outcome on the span before ending it
    match &result {
        Ok(_) => span.set_status(Status::Ok),
        Err(e) => span.set_status(Status::error(e.to_string())),
    }
    span.end();

    result
}

Jaeger Configuration

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.45
        ports:
        - containerPort: 16686  # UI
        - containerPort: 14268  # HTTP collector
        - containerPort: 14250  # gRPC collector
        - containerPort: 6831   # UDP agent
        env:
        - name: COLLECTOR_OTLP_ENABLED
          value: "true"
        - name: SPAN_STORAGE_TYPE
          value: elasticsearch
        - name: ES_SERVER_URLS
          value: http://elasticsearch:9200
        resources:
          requests:
            memory: 512Mi
            cpu: 250m
          limits:
            memory: 1Gi
            cpu: 500m

Trace Sampling Strategies

{
  "service_strategies": [
    {
      "service": "caxton-runtime",
      "type": "probabilistic",
      "param": 0.1,
      "max_traces_per_second": 100,
      "operation_strategies": [
        {
          "operation": "agent.execute",
          "type": "probabilistic",
          "param": 0.2
        },
        {
          "operation": "message.process",
          "type": "ratelimiting",
          "param": 50
        }
      ]
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.05
  }
}
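
To apply these strategies, point the Jaeger collector at the file, for example via the `SAMPLING_STRATEGIES_FILE` environment variable. The excerpt below extends the deployment shown earlier; the ConfigMap name and mount path are assumptions.

# jaeger-deployment.yaml (excerpt): reference the strategies file
        env:
        - name: SAMPLING_STRATEGIES_FILE
          value: /etc/jaeger/sampling_strategies.json
        volumeMounts:
        - name: sampling-config
          mountPath: /etc/jaeger
      volumes:
      - name: sampling-config
        configMap:
          name: jaeger-sampling-strategies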

Log Aggregation

Structured Logging Configuration

[observability.logging]
level = "info"
format = "json"
output = "stdout"

# Log correlation
include_trace_id = true
include_span_id = true

# Field configuration
timestamp_format = "rfc3339"
level_key = "level"
message_key = "message"
trace_id_key = "trace_id"
span_id_key = "span_id"

# Sampling for high-volume logs
debug_sample_rate = 0.01
trace_sample_rate = 0.1
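
With this configuration, an emitted log line looks roughly like the following (field values are illustrative):

{"timestamp":"2024-01-15T10:23:45.123Z","level":"info","message":"agent execution completed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7","agent_id":"agent-42","agent_type":"task-worker"}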

Fluentd Configuration

# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      @id caxton_logs
      path /var/log/containers/caxton-*.log
      pos_file /var/log/fluentd-caxton.log.pos
      tag caxton.*
      format json
      read_from_head true
    </source>

    <filter caxton.**>
      @type parser
      key_name message
      reserve_data true
      <parse>
        @type json
        json_parser_error_class JSONParserError
      </parse>
    </filter>

    <filter caxton.**>
      @type record_transformer
      <record>
        service "caxton-runtime"
        environment "production"
        cluster "#{ENV['CLUSTER_NAME']}"
      </record>
    </filter>

    <match caxton.**>
      @type copy
      <store>
        @type loki
        url http://loki:3100
        tenant ""
        extra_labels {"service":"caxton"}
        line_format json
        <label>
          level
          service
          environment
          agent_id
        </label>
      </store>
      <store>
        @type elasticsearch
        host elasticsearch
        port 9200
        index_name caxton-logs
        type_name _doc
        include_timestamp true
      </store>
    </match>

Loki Configuration

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

limits_config:
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_streams_per_user: 10000
  max_line_size: 256KB

Alerting Rules

Prometheus Alert Rules

# caxton_rules.yml
groups:
- name: caxton.rules
  rules:
  # Agent health alerts
  - alert: CaxtonAgentHighFailureRate
    expr: sum by (instance) (rate(caxton_agent_executions_total{status="error"}[5m])) / sum by (instance) (rate(caxton_agent_executions_total[5m])) > 0.05
    for: 2m
    labels:
      severity: warning
      service: caxton
    annotations:
      summary: "High agent execution failure rate"
      description: "Agent execution failure rate is  for "

  - alert: CaxtonAgentMemoryLeak
    expr: delta(caxton_wasm_memory_usage_bytes[30m]) > 100*1024*1024
    for: 5m
    labels:
      severity: critical
      service: caxton
    annotations:
      summary: "Potential memory leak in agent"
      description: "Agent  memory usage increased by  in 30 minutes"

  # System resource alerts
  - alert: CaxtonHighCPUUsage
    expr: caxton_runtime_cpu_usage_ratio > 0.8
    for: 5m
    labels:
      severity: warning
      service: caxton
    annotations:
      summary: "High CPU usage"
      description: "CPU usage is "

  - alert: CaxtonHighMemoryUsage
    expr: caxton_runtime_memory_usage_bytes / 1024/1024/1024 > 16
    for: 5m
    labels:
      severity: critical
      service: caxton
    annotations:
      summary: "High memory usage"
      description: "Memory usage is "

  # API performance alerts
  - alert: CaxtonHighResponseTime
    expr: histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m])) > 2.0
    for: 2m
    labels:
      severity: warning
      service: caxton
    annotations:
      summary: "High HTTP response time"
      description: "95th percentile response time is s"

  - alert: CaxtonServiceDown
    expr: up{job="caxton-runtime"} == 0
    for: 1m
    labels:
      severity: critical
      service: caxton
    annotations:
      summary: "Caxton service is down"
      description: "Caxton runtime service is not responding"

  # Message processing alerts
  - alert: CaxtonMessageQueueBacklog
    expr: caxton_message_queue_size > 1000
    for: 5m
    labels:
      severity: warning
      service: caxton
    annotations:
      summary: "Large message queue backlog"
      description: "Message queue size is  messages"

  - alert: CaxtonMessageProcessingErrors
    expr: rate(caxton_message_errors_total[5m]) > 10
    for: 2m
    labels:
      severity: critical
      service: caxton
    annotations:
      summary: "High message processing error rate"
      description: "Message processing error rate is /second"

AlertManager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
  - match:
      service: caxton
    receiver: 'caxton-team'

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@example.com'
    subject: '[CRITICAL] Caxton Alert'
    body: |
      Alert: {{ .GroupLabels.alertname }}
      Summary: {{ .CommonAnnotations.summary }}
      Description: {{ .CommonAnnotations.description }}
  slack_configs:
  - channel: '#alerts-critical'
    title: 'Critical Alert: {{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'

- name: 'caxton-team'
  slack_configs:
  - channel: '#caxton-alerts'
    title: 'Caxton Alert: {{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.summary }}'

Dashboard Setup

Grafana Dashboards

Runtime Overview Dashboard

{
  "dashboard": {
    "title": "Caxton Runtime Overview",
    "panels": [
      {
        "title": "Active Agents",
        "type": "stat",
        "targets": [
          {
            "expr": "caxton_agents_active",
            "legendFormat": "Active Agents"
          }
        ]
      },
      {
        "title": "Agent Execution Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(caxton_agent_executions_total[5m])",
            "legendFormat": "Executions/sec"
          }
        ]
      },
      {
        "title": "Memory Usage by Agent Type",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (agent_type) (caxton_wasm_memory_usage_bytes)",
            "legendFormat": ""
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(caxton_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      }
    ]
  }
}
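
To load this dashboard automatically, one option is Grafana's file-based provisioning, with the JSON above saved under the provisioned path; the folder name and path below are assumptions.

# grafana dashboard provisioning (e.g. /etc/grafana/provisioning/dashboards/caxton.yaml)
apiVersion: 1
providers:
  - name: caxton
    folder: Caxton
    type: file
    options:
      path: /var/lib/grafana/dashboards/caxton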

Performance Monitoring Dashboard

Key performance indicators to monitor (example recording rules follow the list):

  1. Throughput Metrics:
    • Requests per second
    • Agent executions per second
    • Message processing rate
  2. Latency Metrics:
    • Request response time
    • Agent execution time
    • Message processing delay
  3. Resource Utilization:
    • CPU usage
    • Memory consumption
    • Disk I/O
    • Network I/O
  4. Error Rates:
    • HTTP error responses
    • Agent execution failures
    • Message processing errors
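
These KPIs can be pre-aggregated with Prometheus recording rules so dashboard queries stay cheap. The sketch below uses only metrics documented earlier; the rule names and evaluation interval are suggestions.

# caxton_recording_rules.yml (example; rule names are suggestions)
groups:
- name: caxton.kpis
  interval: 30s
  rules:
  - record: caxton:http_requests:rate5m
    expr: sum(rate(caxton_http_requests_total[5m]))
  - record: caxton:agent_executions:rate5m
    expr: sum(rate(caxton_agent_executions_total[5m]))
  - record: caxton:http_request_duration_seconds:p95
    expr: histogram_quantile(0.95, sum by (le) (rate(caxton_http_request_duration_seconds_bucket[5m])))
  - record: caxton:message_errors:rate5m
    expr: sum(rate(caxton_message_errors_total[5m]))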

Custom Grafana Plugins

Install useful plugins for enhanced monitoring:

# Install Grafana plugins
grafana-cli plugins install grafana-polystat-panel
grafana-cli plugins install grafana-worldmap-panel
grafana-cli plugins install grafana-piechart-panel

Performance Monitoring

Benchmarking

Set up automated performance benchmarks:

# benchmark-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: caxton-benchmark
spec:
  schedule: "0 2 * * *"  # Run daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: benchmark
            image: caxton/benchmark:latest
            command:
            - /bin/sh
            - -c
            - |
              /benchmark --target http://caxton-service:8080 \
                        --duration 300s \
                        --concurrent-users 100 \
                        --report-to-prometheus http://prometheus:9090
          restartPolicy: OnFailure

Load Testing

Example load test configuration:

// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up to 100 users
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 200 },  // Ramp up to 200 users
    { duration: '5m', target: 200 },  // Stay at 200 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.1'],     // Error rate under 10%
  },
};

export default function() {
  const payload = JSON.stringify({
    agent_type: 'test-agent',
    config: { test: true },
  });

  // Send the body as JSON; a bare object would be form-encoded by k6
  const response = http.post('http://caxton.local:8080/api/agents', payload, {
    headers: { 'Content-Type': 'application/json' },
  });

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 1000ms': (r) => r.timings.duration < 1000,
  });

  sleep(1);
}

Continuous Profiling

Enable continuous profiling in production:

[profiling]
enabled = true
endpoint = "http://pyroscope:4040"
application_name = "caxton-runtime"
server_address = "0.0.0.0:6060"
sample_rate = 100  # Hz
profile_types = ["cpu", "alloc_objects", "alloc_space", "inuse_objects", "inuse_space"]

Troubleshooting Monitoring

Common Issues

  1. Missing Metrics:
    • Check OpenTelemetry collector configuration
    • Verify network connectivity
    • Review Prometheus scrape configuration
  2. High Cardinality:
    • Limit label values (see the relabeling sketch after this list)
    • Use recording rules for pre-aggregation
    • Implement metric sampling
  3. Trace Sampling Issues:
    • Adjust sampling rates
    • Check trace context propagation
    • Verify Jaeger storage capacity
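
For the high-cardinality case, per-agent labels can also be dropped at scrape time. The excerpt below extends the `caxton-runtime` scrape job and assumes `agent_id` is the offending label.

# prometheus.yml (excerpt): drop a high-cardinality label at scrape time
  - job_name: 'caxton-runtime'
    static_configs:
      - targets: ['caxton-runtime:9090']
    metric_relabel_configs:
      - regex: agent_id
        action: labeldrop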

Debug Commands

# Check metrics endpoint
curl http://caxton:9090/metrics | grep caxton_

# Validate OpenTelemetry export
curl -X POST http://otel-collector:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Query Prometheus
curl 'http://prometheus:9090/api/v1/query?query=up'

# Check Jaeger traces
curl 'http://jaeger:16686/api/traces?service=caxton-runtime'

For more operational guidance, see the Deployment Guide and Security Guide.