Metrics Integration Guide

Overview

This guide documents Caxton’s metrics aggregation and monitoring strategy using Prometheus and OpenTelemetry, ensuring comprehensive observability across all components.

Architecture

Metrics Pipeline

Agents → OpenTelemetry Collector → Prometheus → Grafana
                ↓
          Alternative Backends
         (Datadog, New Relic, etc.)

Prometheus Integration

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'caxton-orchestrator'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

  - job_name: 'caxton-agents'
    static_configs:
      # static_configs does not expand port ranges; list each agent endpoint
      # (agents expose metrics on ports 9091-9099) explicitly, or use
      # file-based/service discovery.
      - targets: ['localhost:9091', 'localhost:9092', 'localhost:9093']
    metrics_path: '/metrics'

  - job_name: 'opentelemetry-collector'
    static_configs:
      - targets: ['localhost:8888']

Key Metrics

Orchestrator Metrics

// Core orchestrator metrics. The prometheus constructors are not const, so
// the statics are initialized lazily and registered with the default
// registry via the register_* macros.
use lazy_static::lazy_static;
use prometheus::{
    register_counter, register_gauge, register_gauge_vec, register_histogram,
    Counter, Gauge, GaugeVec, Histogram,
};

lazy_static! {
    pub static ref MESSAGES_PROCESSED: Counter = register_counter!(
        "caxton_messages_processed_total",
        "Total number of messages processed"
    ).unwrap();

    pub static ref MESSAGE_LATENCY: Histogram = register_histogram!(
        "caxton_message_latency_seconds",
        "Message processing latency in seconds"
    ).unwrap();

    pub static ref ACTIVE_AGENTS: Gauge = register_gauge!(
        "caxton_active_agents",
        "Number of currently active agents"
    ).unwrap();

    // Labeled by agent_id so each agent's memory usage is its own series
    pub static ref AGENT_MEMORY_USAGE: GaugeVec = register_gauge_vec!(
        "caxton_agent_memory_bytes",
        "Memory usage per agent in bytes",
        &["agent_id"]
    ).unwrap();
}
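
The scrape configuration above expects each process to serve these metrics as
Prometheus text at /metrics. A minimal sketch of doing that with the prometheus
crate's TextEncoder follows; in Caxton this would hang off the existing HTTP
server, the port simply mirrors the scrape config, and serve_metrics is an
illustrative name rather than an existing Caxton function.

use std::io::Write;
use std::net::TcpListener;

use prometheus::{Encoder, TextEncoder};

// Serve everything in the default registry as Prometheus text format.
// This toy server answers every request with the metrics payload; a real
// server would route only GET /metrics here.
fn serve_metrics() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:9090")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let metric_families = prometheus::gather();
        let mut body = Vec::new();
        TextEncoder::new()
            .encode(&metric_families, &mut body)
            .expect("failed to encode metrics");
        let header = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n",
            body.len()
        );
        stream.write_all(header.as_bytes())?;
        stream.write_all(&body)?;
    }
    Ok(())
}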

Agent Metrics

// Per-agent metrics (each agent process exposes its own /metrics endpoint,
// so Prometheus's instance label already distinguishes agents)
lazy_static! {
    pub static ref TASK_DURATION: Histogram = register_histogram!(
        "caxton_task_duration_seconds",
        "Task execution duration in seconds"
    ).unwrap();

    pub static ref TASK_SUCCESS_RATE: Gauge = register_gauge!(
        "caxton_task_success_rate",
        "Task success rate (0-1)"
    ).unwrap();

    pub static ref AGENT_CPU_USAGE: Gauge = register_gauge!(
        "caxton_agent_cpu_usage_percent",
        "CPU usage percentage per agent"
    ).unwrap();
}
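
As a hedged sketch of how these metrics are updated in practice, the helper
below wraps a single task execution: the histogram timer records the duration,
and the success-rate gauge is recomputed from simple local counters. The
record_task helper and the atomics are illustrative, not existing Caxton code.

use std::sync::atomic::{AtomicU64, Ordering};

static TASKS_TOTAL: AtomicU64 = AtomicU64::new(0);
static TASKS_SUCCEEDED: AtomicU64 = AtomicU64::new(0);

fn record_task<T, E>(run_task: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
    // Observes elapsed seconds into caxton_task_duration_seconds
    let timer = TASK_DURATION.start_timer();
    let result = run_task();
    timer.observe_duration();

    // Recompute caxton_task_success_rate from local success/total counts
    let total = TASKS_TOTAL.fetch_add(1, Ordering::Relaxed) + 1;
    if result.is_ok() {
        TASKS_SUCCEEDED.fetch_add(1, Ordering::Relaxed);
    }
    TASK_SUCCESS_RATE.set(TASKS_SUCCEEDED.load(Ordering::Relaxed) as f64 / total as f64);

    result
}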

Metric Labels and Cardinality

Best Practices

  • Keep label cardinality bounded (prefer fewer than ~10 values per label)
  • Use consistent label names across metrics
  • Avoid high-cardinality labels such as user IDs or request IDs; the sketch
    below shows one way to keep label values bounded
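
A minimal sketch of keeping label values bounded, assuming the prometheus
crate; the metric name and message types here are hypothetical and only
illustrate normalizing free-form input before using it as a label value.

use lazy_static::lazy_static;
use prometheus::{register_counter_vec, CounterVec};

lazy_static! {
    // One label with a small, fixed set of values keeps cardinality bounded
    static ref MESSAGES_BY_TYPE: CounterVec = register_counter_vec!(
        "caxton_messages_by_type_total",
        "Messages processed, partitioned by message type",
        &["message_type"]
    ).unwrap();
}

fn record_message(raw_type: &str) {
    // Map arbitrary input onto a bounded set instead of recording raw strings
    let message_type = match raw_type {
        "request" | "inform" | "propose" => raw_type,
        _ => "other",
    };
    MESSAGES_BY_TYPE.with_label_values(&[message_type]).inc();
}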

Standard Labels

pub struct StandardLabels {
    pub agent_id: String,        // Agent identifier
    pub agent_type: String,      // Agent type/capability
    pub conversation_id: String, // Conversation correlation (potentially high
                                 // cardinality: prefer traces over metric labels)
    pub environment: String,     // dev/staging/prod
    pub version: String,         // Software version
}

OpenTelemetry Collector Configuration

Collector Setup

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  prometheus:
    config:
      scrape_configs:
        - job_name: 'caxton-metrics'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:9090']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  resource:
    attributes:
      - key: service.name
        value: "caxton"
        action: upsert
      - key: service.version
        from_attribute: version
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

  # Note: newer collector releases rename the `logging` exporter to `debug`
  # and drop the dedicated `jaeger` exporter (Jaeger now ingests OTLP directly).
  logging:
    loglevel: debug

  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, logging]

    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [jaeger, logging]

Grafana Dashboard Configuration

Core Dashboards

System Overview Dashboard

{
  "dashboard": {
    "title": "Caxton System Overview",
    "panels": [
      {
        "title": "Message Throughput",
        "targets": [
          {
            "expr": "rate(caxton_messages_processed_total[5m])"
          }
        ]
      },
      {
        "title": "Active Agents",
        "targets": [
          {
            "expr": "caxton_active_agents"
          }
        ]
      },
      {
        "title": "Message Latency (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(caxton_errors_total[5m])"
          }
        ]
      }
    ]
  }
}

Agent Performance Dashboard

{
  "dashboard": {
    "title": "Agent Performance",
    "panels": [
      {
        "title": "Task Success Rate by Agent",
        "targets": [
          {
            "expr": "caxton_task_success_rate{}"
          }
        ]
      },
      {
        "title": "Agent Memory Usage",
        "targets": [
          {
            "expr": "caxton_agent_memory_bytes{}"
          }
        ]
      },
      {
        "title": "Task Duration Distribution",
        "targets": [
          {
            "expr": "histogram_quantile(0.5, rate(caxton_task_duration_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}

Alert Rules

Critical Alerts

groups:
  - name: caxton_critical
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(caxton_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is  errors/sec"

      - alert: OrchestratorDown
        expr: up{job="caxton-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator is down"

      - alert: HighMemoryUsage
        expr: caxton_agent_memory_bytes > 1073741824  # 1GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent  high memory usage"

Performance Alerts

groups:
  - name: caxton_performance
    interval: 1m
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(caxton_message_latency_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High message processing latency"
          description: "95th percentile latency is s"

      - alert: LowThroughput
        expr: rate(caxton_messages_processed_total[5m]) < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low message throughput"
          description: "Processing only  messages/sec"

Custom Metrics Implementation

Adding New Metrics

use lazy_static::lazy_static;
use prometheus::{register_counter, Counter};

// Register custom metrics against the default registry
lazy_static! {
    static ref CUSTOM_METRIC: Counter = register_counter!(
        "caxton_custom_metric_total",
        "Description of custom metric"
    ).unwrap();
}

// Use in code
fn handle_custom_event() {
    CUSTOM_METRIC.inc();
}

Metric Types Guide

  • Counter: For monotonically increasing values (requests, errors)
  • Gauge: For values that go up and down (memory, connections)
  • Histogram: For distributions such as latency or payload sizes; pick
    explicit buckets when the defaults do not fit (see the sketch below)
  • Summary: For client-side quantiles, which cannot be aggregated across
    instances (not recommended)
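
For histograms, bucket choice drives both accuracy and storage cost. A minimal
sketch with explicit buckets follows; the metric name and bucket boundaries are
hypothetical and should be tuned to the distribution actually observed.

use lazy_static::lazy_static;
use prometheus::{register_histogram, Histogram};

lazy_static! {
    // Every bucket becomes its own time series, so fewer, well-chosen
    // boundaries keep storage manageable
    static ref MESSAGE_PAYLOAD_BYTES: Histogram = register_histogram!(
        "caxton_message_payload_bytes",
        "Message payload size in bytes",
        vec![256.0, 1024.0, 4096.0, 16384.0, 65536.0]
    ).unwrap();
}

fn record_payload(payload: &[u8]) {
    MESSAGE_PAYLOAD_BYTES.observe(payload.len() as f64);
}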

Backend Alternatives

Datadog Integration

# For Datadog backend
exporters:
  datadog:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
    hostname: caxton-orchestrator

New Relic Integration

# For New Relic backend
# Note: recent collector releases drop the dedicated `newrelic` exporter in
# favor of sending OTLP directly to New Relic's OTLP endpoint.
exporters:
  newrelic:
    apikey: ${NEW_RELIC_API_KEY}
    timeout: 30s

CloudWatch Integration

# For AWS CloudWatch (metrics are delivered via the Embedded Metric Format exporter)
exporters:
  awsemf:
    namespace: Caxton
    region: us-west-2

Performance Considerations

Metric Collection Overhead

  • Keep scrape intervals reasonable (15-30s for most metrics)
  • Use histograms sparingly (every bucket is a separate series, so storage cost is higher)
  • Batch metric updates on hot paths where possible (see the sketch below)
  • Consider sampling for high-volume metrics
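
A minimal sketch of the batching idea, reusing the MESSAGES_PROCESSED counter
defined earlier; the BatchedCounter type is illustrative. Counts accumulate
locally on the hot path and are flushed to the shared counter at most once per
interval, trading a little freshness for less contention.

use std::time::{Duration, Instant};

struct BatchedCounter {
    pending: u64,
    last_flush: Instant,
    flush_every: Duration,
}

impl BatchedCounter {
    fn new(flush_every: Duration) -> Self {
        Self { pending: 0, last_flush: Instant::now(), flush_every }
    }

    fn inc(&mut self) {
        self.pending += 1;
        if self.last_flush.elapsed() >= self.flush_every {
            // Single atomic update instead of one per message
            MESSAGES_PROCESSED.inc_by(self.pending as f64);
            self.pending = 0;
            self.last_flush = Instant::now();
        }
    }
}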

Storage and Retention

# Prometheus storage is configured via command-line flags, not prometheus.yml
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=10GB \
  --storage.tsdb.wal-compression

Query Optimization

  • Use recording rules for expensive queries
  • Implement query result caching
  • Optimize label cardinality
  • Use downsampling for long-term storage

Debugging Metrics Issues

Common Problems and Solutions

Missing Metrics

# Check that the service's metrics endpoint is reachable (orchestrator shown)
curl http://localhost:9090/metrics

# Verify scrape targets via the Prometheus API (use your Prometheus server's host/port)
curl http://localhost:9090/api/v1/targets

# Check collector logs
docker logs otel-collector

High Cardinality

# Count series per metric name to find high-cardinality metrics
count by (__name__)({__name__=~".+"})

# Count distinct values of a suspect label on a metric
count(count by (label_name) (metric_name))

Performance Issues

# Profile Prometheus
curl 'http://localhost:9090/debug/pprof/profile?seconds=30' > profile.pb.gz

# Check TSDB stats
curl http://localhost:9090/api/v1/status/tsdb

Best Practices Summary

  1. Use standard metrics libraries - OpenTelemetry SDK preferred
  2. Keep cardinality low - < 100k unique series
  3. Document all metrics - Include unit and meaning
  4. Version metric names - Add a version suffix (v1, v2) when making breaking changes
  5. Test alerts locally - Use Prometheus unit tests
  6. Monitor the monitoring - Meta-metrics for observability stack
  7. Regular cleanup - Remove unused metrics and dashboards

References