# Metrics Integration Guide
## Overview
This guide documents Caxton’s metrics aggregation and monitoring strategy using Prometheus and OpenTelemetry, ensuring comprehensive observability across all components.
## Architecture

### Metrics Pipeline

```text
Agents → OpenTelemetry Collector → Prometheus → Grafana
                    ↓
          Alternative Backends
        (Datadog, New Relic, etc.)
```
## Prometheus Integration

### Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'caxton-orchestrator'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

  - job_name: 'caxton-agents'
    static_configs:
      # List each agent metrics port explicitly (9091-9099); Prometheus does
      # not expand port ranges in static targets.
      - targets: ['localhost:9091', 'localhost:9092', 'localhost:9093']
    metrics_path: '/metrics'

  - job_name: 'opentelemetry-collector'
    static_configs:
      - targets: ['localhost:8888']
```
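The scrape configuration can be validated before reloading Prometheus:

```bash
# Validate prometheus.yml before applying it
promtool check config prometheus.yml
```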
### Key Metrics

#### Orchestrator Metrics

```rust
use lazy_static::lazy_static;
use prometheus::{register_counter, register_gauge, register_histogram, Counter, Gauge, Histogram};

lazy_static! {
    // Core orchestrator metrics
    pub static ref MESSAGES_PROCESSED: Counter = register_counter!(
        "caxton_messages_processed_total",
        "Total number of messages processed"
    ).unwrap();

    pub static ref MESSAGE_LATENCY: Histogram = register_histogram!(
        "caxton_message_latency_seconds",
        "Message processing latency in seconds"
    ).unwrap();

    pub static ref ACTIVE_AGENTS: Gauge = register_gauge!(
        "caxton_active_agents",
        "Number of currently active agents"
    ).unwrap();

    pub static ref AGENT_MEMORY_USAGE: Gauge = register_gauge!(
        "caxton_agent_memory_bytes",
        "Memory usage per agent in bytes"
    ).unwrap();
}
```
#### Agent Metrics

```rust
lazy_static! {
    // Per-agent metrics
    pub static ref TASK_DURATION: Histogram = register_histogram!(
        "caxton_task_duration_seconds",
        "Task execution duration in seconds"
    ).unwrap();

    pub static ref TASK_SUCCESS_RATE: Gauge = register_gauge!(
        "caxton_task_success_rate",
        "Task success rate (0-1)"
    ).unwrap();

    pub static ref AGENT_CPU_USAGE: Gauge = register_gauge!(
        "caxton_agent_cpu_usage_percent",
        "CPU usage percentage per agent"
    ).unwrap();
}
```
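These statics can then be updated from the relevant code paths. A minimal sketch, where the function names and arguments are illustrative rather than part of the Caxton API:

```rust
// Illustrative update sites for the metrics defined above.
fn on_message_processed(latency_seconds: f64) {
    MESSAGES_PROCESSED.inc();
    MESSAGE_LATENCY.observe(latency_seconds);
}

fn on_agent_pool_changed(active_count: usize) {
    ACTIVE_AGENTS.set(active_count as f64);
}
```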
## Metric Labels and Cardinality

### Best Practices

- Keep cardinality under control (< 10 label values per metric)
- Use consistent label names across metrics
- Avoid high-cardinality labels (user IDs, request IDs)

### Standard Labels

```rust
pub struct StandardLabels {
    pub agent_id: String,        // Agent identifier
    pub agent_type: String,      // Agent type/capability
    pub conversation_id: String, // Conversation correlation
    pub environment: String,     // dev/staging/prod
    pub version: String,         // Software version
}
```
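In practice these labels are attached to metrics through labeled vectors. A minimal sketch, assuming the `prometheus` crate and a hypothetical `caxton_tasks_total` metric:

```rust
use lazy_static::lazy_static;
use prometheus::{register_counter_vec, CounterVec};

lazy_static! {
    // Hypothetical labeled counter using the standard label names above.
    pub static ref TASKS_BY_AGENT: CounterVec = register_counter_vec!(
        "caxton_tasks_total",
        "Tasks executed, partitioned by standard labels",
        &["agent_id", "agent_type", "environment", "version"]
    ).unwrap();
}

fn record_task(agent_id: &str, agent_type: &str) {
    // Increments the series for this specific label combination.
    TASKS_BY_AGENT
        .with_label_values(&[agent_id, agent_type, "prod", "1.0.0"])
        .inc();
}
```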
## OpenTelemetry Collector Configuration

### Collector Setup

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'caxton-metrics'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:9090']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: service.name
        value: "caxton"
        action: upsert
      - key: service.version
        from_attribute: version
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: debug
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, logging]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [jaeger, logging]
```
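One way to run the collector with this configuration is the contrib Docker image; the image tag and config mount path below are assumptions and may differ in your deployment:

```bash
# Run the collector with the config above (adjust image tag and mount path as needed)
docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 -p 8888:8888 -p 8889:8889 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
```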
## Grafana Dashboard Configuration

### Core Dashboards

#### System Overview Dashboard

```json
{
  "dashboard": {
    "title": "Caxton System Overview",
    "panels": [
      {
        "title": "Message Throughput",
        "targets": [
          { "expr": "rate(caxton_messages_processed_total[5m])" }
        ]
      },
      {
        "title": "Active Agents",
        "targets": [
          { "expr": "caxton_active_agents" }
        ]
      },
      {
        "title": "Message Latency (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(caxton_errors_total[5m])" }
        ]
      }
    ]
  }
}
```
#### Agent Performance Dashboard

```json
{
  "dashboard": {
    "title": "Agent Performance",
    "panels": [
      {
        "title": "Task Success Rate by Agent",
        "targets": [
          { "expr": "caxton_task_success_rate" }
        ]
      },
      {
        "title": "Agent Memory Usage",
        "targets": [
          { "expr": "caxton_agent_memory_bytes" }
        ]
      },
      {
        "title": "Task Duration Distribution",
        "targets": [
          { "expr": "histogram_quantile(0.5, rate(caxton_task_duration_seconds_bucket[5m]))" }
        ]
      }
    ]
  }
}
```
## Alert Rules

### Critical Alerts

```yaml
groups:
  - name: caxton_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(caxton_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: OrchestratorDown
        expr: up{job="caxton-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator is down"

      - alert: HighMemoryUsage
        expr: caxton_agent_memory_bytes > 1073741824 # 1GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent high memory usage"
```
### Performance Alerts

```yaml
groups:
  - name: caxton_performance
    interval: 1m
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High message processing latency"
          description: "95th percentile latency is {{ $value }}s"

      - alert: LowThroughput
        expr: rate(caxton_messages_processed_total[5m]) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low message throughput"
          description: "Processing only {{ $value }} messages/sec"
```
## Custom Metrics Implementation

### Adding New Metrics

```rust
use lazy_static::lazy_static;
use prometheus::{register_counter, Counter};

// Register custom metrics
lazy_static! {
    static ref CUSTOM_METRIC: Counter = register_counter!(
        "caxton_custom_metric_total",
        "Description of custom metric"
    ).unwrap();
}

fn record_custom_event() {
    // Use in code
    CUSTOM_METRIC.inc();
}
```
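Registered metrics still need to be exposed on the `/metrics` endpoint that Prometheus scrapes. A minimal sketch using the `prometheus` crate's text encoder; wiring the returned string into an HTTP route is left to whatever server the component already uses:

```rust
use prometheus::{Encoder, TextEncoder};

/// Render all registered metrics in the Prometheus text exposition format.
/// Serve the returned string from the component's `/metrics` HTTP route.
fn render_metrics() -> String {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    let mut buffer = Vec::new();
    encoder
        .encode(&metric_families, &mut buffer)
        .expect("failed to encode metrics");
    String::from_utf8(buffer).expect("metrics are valid UTF-8")
}
```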
### Metric Types Guide

- **Counter**: For monotonically increasing values (requests, errors)
- **Gauge**: For values that go up and down (memory, connections)
- **Histogram**: For distributions such as latency and sizes (see the example below)
- **Summary**: For pre-calculated quantiles (not recommended)
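For example, a histogram is the natural choice for timing work. A sketch using the `MESSAGE_LATENCY` histogram defined earlier; the handler and `process` function are illustrative:

```rust
fn handle_message(msg: &Message) {
    // Starts a timer that records elapsed seconds into the histogram buckets.
    let timer = MESSAGE_LATENCY.start_timer();
    process(msg); // illustrative processing step
    timer.observe_duration();
}
```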
## Backend Alternatives

### Datadog Integration

```yaml
# For Datadog backend
exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
    hostname: caxton-orchestrator
```

### New Relic Integration

```yaml
# For New Relic backend
exporters:
  newrelic:
    apikey: ${NEW_RELIC_API_KEY}
    timeout: 30s
```

### CloudWatch Integration

```yaml
# For AWS CloudWatch
exporters:
  awscloudwatchmetrics:
    namespace: Caxton
    region: us-west-2
```
## Performance Considerations

### Metric Collection Overhead

- Keep scrape intervals reasonable (15-30s for most metrics)
- Use histograms sparingly (higher storage cost)
- Batch metric updates where possible (see the sketch below)
- Consider sampling for high-volume metrics
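As an illustration of batching, a hot loop can accumulate locally and flush once per batch rather than performing one atomic increment per event. The `Event` type and `handle` function below are hypothetical:

```rust
// Accumulate locally and flush once per batch instead of incrementing
// the shared counter for every event.
fn process_batch(events: &[Event]) {
    let mut processed = 0u64;
    for event in events {
        handle(event);
        processed += 1;
    }
    MESSAGES_PROCESSED.inc_by(processed as f64);
}
```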
### Storage and Retention

Retention and TSDB storage are configured via Prometheus launch flags rather than `prometheus.yml`:

```bash
prometheus \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --storage.tsdb.wal-compression
```
### Query Optimization

- Use recording rules for expensive queries (example below)
- Implement query result caching
- Optimize label cardinality
- Use downsampling for long-term storage
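For instance, the p95 latency expression used on the dashboards above could be precomputed with a recording rule; the file and rule names here are illustrative:

```yaml
# recording_rules.yml
groups:
  - name: caxton_recording
    interval: 30s
    rules:
      - record: job:caxton_message_latency_seconds:p95_5m
        expr: histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m]))
```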
## Debugging Metrics Issues

### Common Problems and Solutions

#### Missing Metrics

```bash
# Check if the metrics endpoint is accessible
curl http://localhost:9090/metrics

# Verify Prometheus scrape targets
curl http://localhost:9090/api/v1/targets

# Check collector logs
docker logs otel-collector
```
#### High Cardinality

```promql
# Count series per metric name to find high-cardinality metrics
count by (__name__)({__name__=~".+"})

# Identify problematic labels (substitute the metric and label of interest)
count by (label_name) (metric_name)
```
#### Performance Issues

```bash
# Profile Prometheus
curl http://localhost:9090/debug/pprof/profile?seconds=30 > profile.pb.gz

# Check TSDB stats
curl http://localhost:9090/api/v1/status/tsdb
```
## Best Practices Summary

- **Use standard metrics libraries** - OpenTelemetry SDK preferred
- **Keep cardinality low** - < 100k unique series
- **Document all metrics** - Include unit and meaning
- **Version metric names** - Include v1, v2 when making breaking changes
- **Test alerts locally** - Use Prometheus unit tests (see the example below)
- **Monitor the monitoring** - Meta-metrics for the observability stack
- **Regular cleanup** - Remove unused metrics and dashboards
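As a sketch of such a unit test, the file exercises the `OrchestratorDown` alert defined above and is run with `promtool test rules caxton_alerts_test.yml`; the file names are hypothetical:

```yaml
# caxton_alerts_test.yml
rule_files:
  - caxton_critical.yml   # contains the caxton_critical alert group above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The orchestrator target is down for five consecutive scrapes.
      - series: 'up{job="caxton-orchestrator"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: OrchestratorDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: caxton-orchestrator
            exp_annotations:
              summary: "Orchestrator is down"
```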