0007. Management API Design

Status: Proposed

Date: August 03, 2025

Categories: Architecture, Technology

Status

Proposed

Context

With Caxton’s pivot to an application server architecture (ADR-0006), we need a well-designed management API that enables programmatic control of the multi-agent system. This API must be accessible to developers regardless of their language choice while maintaining the performance and type safety that Rust provides internally.

The API design must balance several concerns:

Language agnostic: Accessible from any programming language
Performance: Minimal overhead for high-frequency operations
Type safety: Preserve Rust’s guarantees across the API boundary
Observability: Built-in instrumentation for debugging distributed systems
Evolution: Ability to extend without breaking existing clients

Decision Drivers

Industry standards: gRPC is widely adopted for high-performance service-to-service APIs
REST familiarity: Many developers expect REST APIs for tooling integration
Type safety: Need to preserve Rust’s type guarantees across API boundaries
Performance requirements: Less than 1 ms of overhead for local API calls
Debugging needs: Must support distributed tracing and structured logging

Decision

We will implement a dual-protocol API architecture:

1. gRPC as Primary Protocol

service CaxtonManagement {
  // Agent lifecycle management
  rpc DeployAgent(DeployAgentRequest) returns (DeployAgentResponse);
  rpc UndeployAgent(UndeployAgentRequest) returns (UndeployAgentResponse);
  rpc ListAgents(ListAgentsRequest) returns (ListAgentsResponse);

  // Message operations
  rpc SendMessage(SendMessageRequest) returns (SendMessageResponse);
  rpc SubscribeMessages(SubscribeRequest) returns (stream Message);

  // Health and monitoring
  rpc Health(HealthRequest) returns (HealthResponse);
  rpc Metrics(MetricsRequest) returns (MetricsResponse);
}
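The lifecycle RPCs above can be pictured as operations on an agent registry. The following is a minimal in-memory sketch in plain Rust, not the generated gRPC service: `AgentRegistry`, its method signatures, and the `agent-N` ID scheme are hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the state behind the lifecycle RPCs.
#[derive(Debug, Default)]
pub struct AgentRegistry {
    next_id: u64,
    agents: BTreeMap<String, String>, // agent_id -> agent name
}

impl AgentRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Mirrors DeployAgent: registers an agent and returns its generated ID.
    pub fn deploy_agent(&mut self, name: &str) -> String {
        self.next_id += 1;
        let id = format!("agent-{}", self.next_id);
        self.agents.insert(id.clone(), name.to_string());
        id
    }

    /// Mirrors UndeployAgent: returns true if the agent existed.
    pub fn undeploy_agent(&mut self, agent_id: &str) -> bool {
        self.agents.remove(agent_id).is_some()
    }

    /// Mirrors ListAgents: returns (agent_id, name) pairs in lexicographic ID order.
    pub fn list_agents(&self) -> Vec<(String, String)> {
        self.agents
            .iter()
            .map(|(id, name)| (id.clone(), name.clone()))
            .collect()
    }
}
```

A real implementation would sit behind the generated service trait and return the response messages defined in the protobuf schema; the sketch only shows the intended request/effect pairing.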

2. REST Gateway via gRPC-Gateway

Auto-generated from gRPC definitions
OpenAPI/Swagger documentation
JSON request/response format
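WebSocket support for streaming operations (via an additional proxy layer; gRPC-Gateway itself exposes server streams as chunked JSON responses)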

3. API Design Principles

Resource-Oriented Design:

/api/v1/agents                    # Agent collection
/api/v1/agents/{id}              # Individual agent
/api/v1/agents/{id}/messages     # Agent's messages
/api/v1/messages                 # System-wide message stream
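To illustrate how the four resource paths above map onto distinct operations, here is a small routing sketch in plain Rust. The `Route` enum and `parse_route` function are hypothetical helpers for illustration; in the proposed design the gateway would generate this mapping from the protobuf definitions.

```rust
/// Resources addressable under the /api/v1 prefix (names are illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    Agents,
    Agent { id: String },
    AgentMessages { id: String },
    Messages,
}

/// Parses a request path into a `Route`, or `None` if it matches no resource.
pub fn parse_route(path: &str) -> Option<Route> {
    let rest = path.strip_prefix("/api/v1/")?;
    let segments: Vec<&str> = rest.trim_end_matches('/').split('/').collect();
    match segments.as_slice() {
        ["agents"] => Some(Route::Agents),
        ["agents", id] if !id.is_empty() => Some(Route::Agent { id: id.to_string() }),
        ["agents", id, "messages"] if !id.is_empty() => {
            Some(Route::AgentMessages { id: id.to_string() })
        }
        ["messages"] => Some(Route::Messages),
        _ => None,
    }
}
```

Modeling routes as an enum keeps the set of addressable resources closed, so adding a new resource forces every `match` on `Route` to be revisited.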

Structured Error Handling:

message Error {
  string code = 1;        // Machine-readable error code
  string message = 2;     // Human-readable description
  string trace_id = 3;    // Correlation ID for debugging
  map<string, string> metadata = 4;  // Additional context
}
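On the Rust side, the error message above could be mirrored by a plain struct that also implements `std::error::Error`. This is a sketch under assumptions: `ApiError`, its builder method, and the `AGENT_NOT_FOUND` code used in the usage note are hypothetical; only the four field names come from the schema.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Plain-Rust mirror of the `Error` message; field names follow the schema above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ApiError {
    pub code: String,
    pub message: String,
    pub trace_id: String,
    pub metadata: BTreeMap<String, String>,
}

impl ApiError {
    pub fn new(code: &str, message: &str, trace_id: &str) -> Self {
        ApiError {
            code: code.to_string(),
            message: message.to_string(),
            trace_id: trace_id.to_string(),
            metadata: BTreeMap::new(),
        }
    }

    /// Attaches one key/value pair of additional context.
    pub fn with_metadata(mut self, key: &str, value: &str) -> Self {
        self.metadata.insert(key.to_string(), value.to_string());
        self
    }
}

impl fmt::Display for ApiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "[{}] {} (trace_id={})", self.code, self.message, self.trace_id)
    }
}

impl std::error::Error for ApiError {}
```

For example, `ApiError::new("AGENT_NOT_FOUND", "no agent with that id", "abc123").with_metadata("agent_id", "processor")` carries both the machine-readable code and the correlation ID a client needs to look up the failing request in the trace backend.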

OpenTelemetry Integration:

Every API call creates a trace span
Propagate trace context via headers
Structured logging with trace correlation
Prometheus metrics for all operations
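The document does not name the header format used for trace propagation. Assuming the W3C Trace Context `traceparent` header, which is commonly used with OpenTelemetry, extracting the context from an incoming request might look like the sketch below. `TraceParent` and `parse_traceparent` are hypothetical names, and only version `00` of the header is handled.

```rust
/// Fields of a W3C `traceparent` header (version 00 only in this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,  // 32 lowercase hex characters
    pub parent_id: String, // 16 lowercase hex characters
    pub sampled: bool,     // bit 0 of the trace-flags field
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses `00-<trace-id>-<parent-id>-<flags>`, returning `None` if malformed.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None; // this sketch only accepts version 00
    }
    let (trace_id, parent_id, flags) = (parts[1], parts[2], parts[3]);
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero IDs are defined as invalid by the specification.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flag_bits & 0x01) == 1,
    })
}
```

In practice an OpenTelemetry propagator would perform this extraction; the sketch only shows what information travels in the header.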

Consequences

Positive

Language agnostic: Any language with gRPC support can use Caxton
Type safety: Protocol buffers provide schema validation
Performance: Binary protocol with streaming support
REST compatibility: Gateway provides familiar HTTP/JSON interface
Future proof: Protocol Buffers field numbering allows backward- and forward-compatible schema changes
Generated SDKs: Automatic client generation for every language with protoc/gRPC code generation support
Built-in observability: Tracing and metrics from day one

Negative

Complexity: Two protocols to maintain
Learning curve: gRPC is less familiar to many developers than REST
Tooling requirements: Development requires the protoc compiler
Debugging: Binary protocol harder to inspect than JSON

Mitigation Strategies

Complexity:

Single source of truth (protobuf definitions)
Automated gateway generation
Comprehensive testing of both protocols

Learning Curve:

Documentation with worked examples for both protocols
Pre-built SDKs for popular languages
REST gateway for initial exploration

Debugging:

gRPC reflection for runtime introspection
Request/response logging in development
Trace-based debugging tools

API Examples

Deploy Agent (gRPC)

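// Assumes a tonic-generated client and prost-generated message types.
let request = DeployAgentRequest {
    name: "processor".to_string(),
    wasm_module: module_bytes,
    // Generated repeated string fields are Vec<String>, not Vec<&str>.
    capabilities: vec!["messaging".to_string(), "mcp-tools".to_string()],
    resources: Some(Resources {
        memory_limit: 100 * 1024 * 1024, // 100 MiB
        cpu_shares: 1024,
    }),
};

// tonic wraps the reply in tonic::Response; into_inner() yields the message.
let response = client.deploy_agent(request).await?.into_inner();
println!("Agent deployed: {}", response.agent_id);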

Deploy Agent (REST)

curl -X POST https://localhost:8080/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "processor",
    "wasm_module": "base64...",
    "capabilities": ["messaging", "mcp-tools"],
    "resources": {
      "memory_limit": 104857600,
      "cpu_shares": 1024
    }
  }'
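The `wasm_module` value in the REST request is base64 because the protobuf JSON mapping represents `bytes` fields as base64 strings. As a rough illustration of what a client has to do before sending the request, here is a std-only encoder sketch for standard, padded base64 (RFC 4648). The function name `base64_encode` is a hypothetical helper; production code would more likely use an established crate.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as standard, padded base64 (RFC 4648).
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into one 24-bit group, zero-filling missing bytes.
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit symbols, replacing unused ones with '=' padding.
        out.push(ALPHABET[((n >> 18) & 0x3f) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 0x3f) as usize] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 0x3f) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 0x3f) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}
```

Base64 inflates the payload by roughly one third, which is one reason the binary gRPC path is better suited to uploading large WebAssembly modules.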

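Subscribe to Messages (gRPC Streaming)

// Assumes a tonic-generated client and prost-generated message types.
let request = SubscribeRequest {
    filter: Some(MessageFilter {
        agent_id: Some("processor".to_string()),
        message_types: vec!["task".to_string(), "result".to_string()],
    }),
};

// into_inner() unwraps tonic::Response to get the tonic::Streaming<Message>.
let mut stream = client.subscribe_messages(request).await?.into_inner();
while let Some(message) = stream.message().await? {
    println!("Received: {:?}", message);
}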
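The document does not specify how the server applies the subscription filter. One plausible reading is sketched below in plain Rust. The `MessageFilter` fields follow the example above, while the `Message` shape and the rule that an unset `agent_id` or an empty type list matches everything are assumptions made for illustration.

```rust
/// Plain-Rust mirror of the filter used in the subscription example.
#[derive(Debug, Default)]
pub struct MessageFilter {
    pub agent_id: Option<String>,
    pub message_types: Vec<String>,
}

/// Minimal, hypothetical message shape for this sketch.
#[derive(Debug)]
pub struct Message {
    pub agent_id: String,
    pub message_type: String,
}

impl MessageFilter {
    /// Assumed semantics: an unset agent_id or an empty type list matches everything.
    pub fn matches(&self, msg: &Message) -> bool {
        let agent_ok = self
            .agent_id
            .as_deref()
            .map_or(true, |id| id == msg.agent_id);
        let type_ok = self.message_types.is_empty()
            || self.message_types.iter().any(|t| *t == msg.message_type);
        agent_ok && type_ok
    }
}
```

Under these semantics, the filter from the example would deliver only `task` and `result` messages belonging to the `processor` agent.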

Health Check with Tracing

curl -X GET https://localhost:8080/api/v1/health \
  -H "X-Trace-Id: 550e8400-e29b-41d4-a716-446655440000"

{
  "status": "healthy",
  "version": "1.0.0",
  "agents": {
    "running": 42,
    "capacity": 100
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440000"
}

Observability Integration

Every API operation:

Creates a trace span with operation details
Logs structured data with trace correlation
Updates Prometheus metrics
Propagates context to downstream operations

Example trace structure:

caxton.api.deploy_agent (1.2ms)
├── caxton.wasm.validate (0.3ms)
├── caxton.runtime.create (0.5ms)
├── caxton.registry.register (0.2ms)
└── caxton.events.emit (0.1ms)
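The example trace can be read as a tree in which the parent span's duration covers its children plus some time of its own. The sketch below models that in plain Rust with durations in microseconds; `Span`, `self_micros`, and `example_trace` are hypothetical names, and the self-time calculation assumes the child spans run sequentially.

```rust
/// A span with a name, a duration in microseconds, and child spans.
#[derive(Debug)]
pub struct Span {
    pub name: &'static str,
    pub micros: u64,
    pub children: Vec<Span>,
}

impl Span {
    pub fn leaf(name: &'static str, micros: u64) -> Self {
        Span { name, micros, children: Vec::new() }
    }

    /// Time spent in this span that is not covered by its direct children,
    /// assuming the children do not overlap.
    pub fn self_micros(&self) -> u64 {
        let child_total: u64 = self.children.iter().map(|c| c.micros).sum();
        self.micros.saturating_sub(child_total)
    }
}

/// Builds the example trace shown above.
pub fn example_trace() -> Span {
    Span {
        name: "caxton.api.deploy_agent",
        micros: 1200,
        children: vec![
            Span::leaf("caxton.wasm.validate", 300),
            Span::leaf("caxton.runtime.create", 500),
            Span::leaf("caxton.registry.register", 200),
            Span::leaf("caxton.events.emit", 100),
        ],
    }
}
```

For the example trace, the four children account for 1.1 ms of the 1.2 ms total, leaving 0.1 ms attributable to the API layer itself.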

Related Decisions

ADR-0001: Observability-First Architecture - Defines the tracing and metrics strategy
ADR-0006: Application Server Architecture - Establishes the need for a management API
ADR-0008: Agent Deployment Model - Uses this API for deployment operations
ADR-0009: CLI Tool Design - Uses the gRPC API

References

gRPC Best Practices
Google API Design Guide
OpenTelemetry Specification
Bryan Cantrill’s talks on API observability
