# Monitoring and Observability Guide

This guide covers setting up comprehensive monitoring and observability for Caxton multi-agent systems using OpenTelemetry, Prometheus, Jaeger, Grafana, and custom dashboards, including metrics collection, distributed tracing, log aggregation, alerting, and performance monitoring.
## Observability Architecture

Caxton's observability stack is built on industry-standard tools:

- OpenTelemetry: Unified telemetry collection and export
- Prometheus: Metrics storage and alerting
- Jaeger: Distributed tracing
- Grafana: Visualization and dashboards
- Fluentd/Loki: Log aggregation
- AlertManager: Alert routing and management
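For a local evaluation of this stack, the components can be run side by side with Docker Compose; a minimal sketch (image tags, exposed ports, and mounted config file names are assumptions to adapt to your environment):

```yaml
# docker-compose.observability.yml (illustrative sketch)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
  prometheus:
    image: prom/prometheus:v2.48.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  jaeger:
    image: jaegertracing/all-in-one:1.45
    ports:
      - "16686:16686"  # UI
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
```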
## OpenTelemetry Integration

### Configuration

Caxton has built-in OpenTelemetry support that can be configured through the main configuration file:

```toml
# caxton.toml
[observability]
# Enable OpenTelemetry instrumentation
enable_tracing = true
enable_metrics = true
enable_logging = true

# OTLP export configuration
otlp_endpoint = "http://otel-collector:4317"
otlp_timeout = "10s"
otlp_headers = { "api-key" = "your-api-key" }

# Sampling configuration
trace_sample_rate = 0.1  # Sample 10% of traces
metrics_export_interval = "10s"

# Resource attributes
service_name = "caxton-runtime"
service_version = "0.2.0"
deployment_environment = "production"
```
### OpenTelemetry Collector Configuration

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: error_sampling
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: random_sampling
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: caxton
    const_labels:
      environment: production
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: caxton

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [loki]
```
## Metrics Collection

### Core Metrics

Caxton automatically exposes the following metrics categories:

#### Runtime Metrics

- `caxton_agents_total`: Total number of agents
- `caxton_agents_active`: Currently active agents
- `caxton_agent_executions_total`: Total agent executions
- `caxton_agent_execution_duration_seconds`: Agent execution time
- `caxton_wasm_memory_usage_bytes`: WASM memory usage per agent
- `caxton_runtime_memory_usage_bytes`: Runtime memory usage
- `caxton_runtime_cpu_usage_ratio`: CPU utilization

#### Message Protocol Metrics

- `caxton_messages_sent_total`: FIPA messages sent
- `caxton_messages_received_total`: FIPA messages received
- `caxton_message_processing_duration_seconds`: Message processing time
- `caxton_message_queue_size`: Current message queue size
- `caxton_message_errors_total`: Message processing errors

#### HTTP API Metrics

- `caxton_http_requests_total`: HTTP requests by method/status
- `caxton_http_request_duration_seconds`: HTTP request duration
- `caxton_http_active_connections`: Active HTTP connections
- `caxton_websocket_connections`: Active WebSocket connections

#### System Resource Metrics

- `caxton_disk_usage_bytes`: Disk space usage
- `caxton_network_bytes_total`: Network I/O
- `caxton_file_descriptors`: Open file descriptors
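These metrics follow standard Prometheus conventions (counters, gauges, histograms), so they compose naturally in PromQL. Illustrative queries, assuming the `agent_type` and `status` labels used elsewhere in this guide:

```promql
# Agent execution failure rate over the last 5 minutes
rate(caxton_agent_executions_total{status="error"}[5m])
  / rate(caxton_agent_executions_total[5m])

# 95th percentile agent execution time per agent type
histogram_quantile(0.95,
  sum by (agent_type, le) (rate(caxton_agent_execution_duration_seconds_bucket[5m])))

# WASM memory usage summed per agent type
sum by (agent_type) (caxton_wasm_memory_usage_bytes)
```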
### Custom Metrics Configuration

```toml
[observability.metrics]
# Custom histogram buckets for latency metrics
http_duration_buckets = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
agent_execution_buckets = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0]

# Metric labels to include
include_labels = ["agent_type", "agent_id", "message_type"]
exclude_labels = ["sensitive_data"]

# Export configuration
export_interval = "15s"
export_timeout = "5s"
```
### Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "caxton_rules.yml"

scrape_configs:
  - job_name: 'caxton-runtime'
    static_configs:
      - targets: ['caxton-runtime:9090']
    scrape_interval: 10s
    metrics_path: /metrics

  - job_name: 'caxton-kubernetes'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
```
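For the `caxton-kubernetes` job to discover pods, the Caxton pod template must carry the matching annotations (a sketch; the port and path must line up with the runtime's metrics endpoint):

```yaml
# Pod template metadata on the Caxton Deployment
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9090"
```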
## Distributed Tracing

### Trace Context Propagation

Caxton automatically propagates trace context through:

- HTTP headers (W3C Trace Context)
- FIPA message metadata
- Internal agent communications
- Database operations
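For example, a message crossing agent boundaries carries the W3C `traceparent` value alongside its FIPA fields so downstream spans join the same trace (the envelope below is illustrative, not Caxton's exact wire format):

```json
{
  "performative": "request",
  "sender": "data-collector-agent",
  "receiver": "analyzer-agent",
  "content": "...",
  "metadata": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
  }
}
```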
### Custom Spans

```rust
// Example: Adding custom spans in agent code
use opentelemetry::{global, KeyValue};
use opentelemetry::trace::{Span, Status, Tracer};

#[tracing::instrument(name = "agent.execute_task", skip(self, task))]
async fn execute_task(&self, task: Task) -> Result<TaskResult> {
    // Start a child span for the processing phase
    let tracer = global::tracer("caxton-agent");
    let mut span = tracer.start("task.processing");
    span.set_attribute(KeyValue::new("task.type", task.task_type.clone()));
    span.set_attribute(KeyValue::new("task.priority", task.priority as i64));

    // Process the task and record the outcome on the span
    let result = self.process_task_internal(task).await;
    match &result {
        Ok(_) => span.set_status(Status::Ok),
        Err(e) => span.set_status(Status::error(e.to_string())),
    }
    span.end();
    result
}
```
### Jaeger Configuration

```yaml
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.45
          ports:
            - containerPort: 16686  # UI
            - containerPort: 14268  # HTTP collector
            - containerPort: 14250  # gRPC collector
            - containerPort: 6831   # UDP agent
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: elasticsearch
            - name: ES_SERVER_URLS
              value: http://elasticsearch:9200
          resources:
            requests:
              memory: 512Mi
              cpu: 250m
            limits:
              memory: 1Gi
              cpu: 500m
```
### Trace Sampling Strategies

Jaeger's remote sampling configuration lets you tune per-service and per-operation sampling independently of the collector's tail-based sampling:

```json
{
  "service_strategies": [
    {
      "service": "caxton-runtime",
      "type": "probabilistic",
      "param": 0.1,
      "max_traces_per_second": 100,
      "operation_strategies": [
        {
          "operation": "agent.execute",
          "type": "probabilistic",
          "param": 0.2
        },
        {
          "operation": "message.process",
          "type": "ratelimiting",
          "param": 50
        }
      ]
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.05
  }
}
```
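To make Jaeger serve these strategies, ship the JSON in a ConfigMap and point the collector at it with the `--sampling.strategies-file` flag; a sketch against the Deployment above (the ConfigMap name `jaeger-sampling` and mount path are assumptions):

```yaml
# Additions to the Jaeger Deployment above
spec:
  template:
    spec:
      containers:
        - name: jaeger
          args:
            - "--sampling.strategies-file=/etc/jaeger/sampling_strategies.json"
          volumeMounts:
            - name: sampling-config
              mountPath: /etc/jaeger
      volumes:
        - name: sampling-config
          configMap:
            name: jaeger-sampling
```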
## Log Aggregation

### Structured Logging Configuration

```toml
[observability.logging]
level = "info"
format = "json"
output = "stdout"

# Log correlation
include_trace_id = true
include_span_id = true

# Field configuration
timestamp_format = "rfc3339"
level_key = "level"
message_key = "message"
trace_id_key = "trace_id"
span_id_key = "span_id"

# Sampling for high-volume logs
debug_sample_rate = 0.01
trace_sample_rate = 0.1
```
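With this configuration each log line is a single JSON object whose correlation fields use the key names configured above; the timestamp key, `agent_id` field, and all values below are illustrative:

```json
{
  "timestamp": "2024-05-14T10:32:07Z",
  "level": "info",
  "message": "task completed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "agent_id": "data-collector-01"
}
```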
### Fluentd Configuration

```yaml
# fluentd-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      @id caxton_logs
      path /var/log/containers/caxton-*.log
      pos_file /var/log/fluentd-caxton.log.pos
      tag caxton.*
      format json
      read_from_head true
    </source>

    <filter caxton.**>
      @type parser
      key_name message
      reserve_data true
      <parse>
        @type json
        json_parser_error_class JSONParserError
      </parse>
    </filter>

    <filter caxton.**>
      @type record_transformer
      <record>
        service "caxton-runtime"
        environment "production"
        cluster "#{ENV['CLUSTER_NAME']}"
      </record>
    </filter>

    <match caxton.**>
      @type copy
      <store>
        @type loki
        url http://loki:3100
        tenant ""
        extra_labels {"service":"caxton"}
        line_format json
        <label>
          level
          service
          environment
          agent_id
        </label>
      </store>
      <store>
        @type elasticsearch
        host elasticsearch
        port 9200
        index_name caxton-logs
        type_name _doc
        include_timestamp true
      </store>
    </match>
```
### Loki Configuration

```yaml
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

limits_config:
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_streams_per_user: 10000
  max_line_size: 256KB
```
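With trace IDs embedded in each line, logs for a specific trace can be pulled directly from Loki; an illustrative LogQL query using the stream labels set by the Fluentd pipeline above:

```logql
# Error-level Caxton logs belonging to one trace
{service="caxton", level="error"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
```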
## Alerting Rules

### Prometheus Alert Rules

```yaml
# caxton_rules.yml
groups:
  - name: caxton.rules
    rules:
      # Agent health alerts
      - alert: CaxtonAgentHighFailureRate
        expr: rate(caxton_agent_executions_total{status="error"}[5m]) / rate(caxton_agent_executions_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
          service: caxton
        annotations:
          summary: "High agent execution failure rate"
          description: "Agent execution failure rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"

      - alert: CaxtonAgentMemoryLeak
        expr: increase(caxton_wasm_memory_usage_bytes[30m]) > 100*1024*1024
        for: 5m
        labels:
          severity: critical
          service: caxton
        annotations:
          summary: "Potential memory leak in agent"
          description: "Agent memory usage increased by {{ $value | humanize1024 }}B in 30 minutes"

      # System resource alerts
      - alert: CaxtonHighCPUUsage
        expr: caxton_runtime_cpu_usage_ratio > 0.8
        for: 5m
        labels:
          severity: warning
          service: caxton
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value | humanizePercentage }}"

      - alert: CaxtonHighMemoryUsage
        expr: caxton_runtime_memory_usage_bytes / 1024 / 1024 / 1024 > 16
        for: 5m
        labels:
          severity: critical
          service: caxton
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanize }}GB"

      # API performance alerts
      - alert: CaxtonHighResponseTime
        expr: histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m])) > 2.0
        for: 2m
        labels:
          severity: warning
          service: caxton
        annotations:
          summary: "High HTTP response time"
          description: "95th percentile response time is {{ $value }}s"

      - alert: CaxtonServiceDown
        expr: up{job="caxton-runtime"} == 0
        for: 1m
        labels:
          severity: critical
          service: caxton
        annotations:
          summary: "Caxton service is down"
          description: "Caxton runtime service is not responding"

      # Message processing alerts
      - alert: CaxtonMessageQueueBacklog
        expr: caxton_message_queue_size > 1000
        for: 5m
        labels:
          severity: warning
          service: caxton
        annotations:
          summary: "Large message queue backlog"
          description: "Message queue size is {{ $value }} messages"

      - alert: CaxtonMessageProcessingErrors
        expr: rate(caxton_message_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: critical
          service: caxton
        annotations:
          summary: "High message processing error rate"
          description: "Message processing error rate is {{ $value }}/second"
```
### AlertManager Configuration

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@example.com'
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        service: caxton
      receiver: 'caxton-team'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@example.com'
        subject: '[CRITICAL] Caxton Alert'
        body: |
          Alert: {{ .GroupLabels.alertname }}
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}
    slack_configs:
      - channel: '#alerts-critical'
        title: 'Critical Alert: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'

  - name: 'caxton-team'
    slack_configs:
      - channel: '#caxton-alerts'
        title: 'Caxton Alert: {{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
```
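Both files can be validated before they are loaded, using the standard Prometheus tooling:

```bash
# Validate alert rules and AlertManager configuration
promtool check rules caxton_rules.yml
amtool check-config alertmanager.yml
```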
## Dashboard Setup

### Grafana Dashboards

#### Runtime Overview Dashboard

```json
{
  "dashboard": {
    "title": "Caxton Runtime Overview",
    "panels": [
      {
        "title": "Active Agents",
        "type": "stat",
        "targets": [
          {
            "expr": "caxton_agents_active",
            "legendFormat": "Active Agents"
          }
        ]
      },
      {
        "title": "Agent Execution Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(caxton_agent_executions_total[5m])",
            "legendFormat": "Executions/sec"
          }
        ]
      },
      {
        "title": "Memory Usage by Agent Type",
        "type": "graph",
        "targets": [
          {
            "expr": "sum by (agent_type) (caxton_wasm_memory_usage_bytes)",
            "legendFormat": "{{agent_type}}"
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(caxton_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(caxton_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      }
    ]
  }
}
```
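Rather than importing the JSON by hand, dashboards can be provisioned from disk; a minimal provisioning sketch, assuming the dashboard JSON files are mounted under `/var/lib/grafana/dashboards`:

```yaml
# /etc/grafana/provisioning/dashboards/caxton.yaml
apiVersion: 1
providers:
  - name: caxton
    folder: Caxton
    type: file
    options:
      path: /var/lib/grafana/dashboards
```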
#### Performance Monitoring Dashboard

Key performance indicators to monitor:

- Throughput Metrics:
  - Requests per second
  - Agent executions per second
  - Message processing rate
- Latency Metrics:
  - Request response time
  - Agent execution time
  - Message processing delay
- Resource Utilization:
  - CPU usage
  - Memory consumption
  - Disk I/O
  - Network I/O
- Error Rates:
  - HTTP error responses
  - Agent execution failures
  - Message processing errors
### Custom Grafana Plugins

Install useful plugins for enhanced monitoring:

```bash
# Install Grafana plugins
grafana-cli plugins install grafana-polystat-panel
grafana-cli plugins install grafana-worldmap-panel
grafana-cli plugins install grafana-piechart-panel
```
## Performance Monitoring

### Benchmarking

Set up automated performance benchmarks:

```yaml
# benchmark-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: caxton-benchmark
spec:
  schedule: "0 2 * * *"  # Run daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: benchmark
              image: caxton/benchmark:latest
              command:
                - /bin/sh
                - -c
                - |
                  /benchmark --target http://caxton-service:8080 \
                    --duration 300s \
                    --concurrent-users 100 \
                    --report-to-prometheus http://prometheus:9090
          restartPolicy: OnFailure
```
### Load Testing

Example load test configuration:

```javascript
// k6-load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up to 100 users
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 200 }, // Ramp up to 200 users
    { duration: '5m', target: 200 }, // Stay at 200 users
    { duration: '2m', target: 0 },   // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests under 2s
    http_req_failed: ['rate<0.1'],     // Error rate under 10%
  },
};

export default function () {
  // Send the agent definition as a JSON body
  const payload = JSON.stringify({
    agent_type: 'test-agent',
    config: { test: true },
  });
  const response = http.post('http://caxton.local:8080/api/agents', payload, {
    headers: { 'Content-Type': 'application/json' },
  });

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 1000ms': (r) => r.timings.duration < 1000,
  });
  sleep(1);
}
```
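Run the script with `k6 run k6-load-test.js`. Because thresholds are defined, k6 exits with a non-zero status when the p95 latency or error-rate targets are missed, which makes the test easy to gate in CI.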
### Continuous Profiling

Enable continuous profiling in production:

```toml
[profiling]
enabled = true
endpoint = "http://pyroscope:4040"
application_name = "caxton-runtime"
server_address = "0.0.0.0:6060"
sample_rate = 100  # Hz
profile_types = ["cpu", "alloc_objects", "alloc_space", "inuse_objects", "inuse_space"]
```
## Troubleshooting Monitoring

### Common Issues

- Missing Metrics:
  - Check OpenTelemetry collector configuration
  - Verify network connectivity
  - Review Prometheus scrape configuration
- High Cardinality:
  - Limit label values
  - Use recording rules for pre-aggregation (see the example after this list)
  - Implement metric sampling
- Trace Sampling Issues:
  - Adjust sampling rates
  - Check trace context propagation
  - Verify Jaeger storage capacity
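For the high-cardinality case, recording rules let Prometheus pre-aggregate expensive queries once and serve dashboards from the cheaper result; an illustrative example (rule names are arbitrary):

```yaml
# caxton_recording_rules.yml
groups:
  - name: caxton.recording
    rules:
      - record: caxton:agent_executions:rate5m
        expr: sum by (agent_type) (rate(caxton_agent_executions_total[5m]))
      - record: caxton:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (le) (rate(caxton_http_request_duration_seconds_bucket[5m])))
```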
### Debug Commands

```bash
# Check the metrics endpoint
curl http://caxton:9090/metrics | grep caxton_

# Confirm the OTLP HTTP endpoint is reachable (the dummy payload is rejected,
# but getting a response proves connectivity)
curl -X POST http://otel-collector:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"test": "data"}'

# Query Prometheus
curl 'http://prometheus:9090/api/v1/query?query=up'

# Check Jaeger traces
curl 'http://jaeger:16686/api/traces?service=caxton-runtime'
```
For more operational guidance, see the Deployment Guide and Security Guide.