ADR-0013: State Management Architecture
Status
Superseded by ADR-0014: Coordination-First Architecture
Supersession Notice
This ADR has been superseded because:
- The PostgreSQL dependency violates the minimal core philosophy
- Shared state creates unnecessary operational complexity
- Agent state management should be a business domain concern
- Coordination protocols provide sufficient functionality without shared state
See ADR-0014 for the current approach.
Context
Caxton needs a robust strategy for managing agent state, orchestrator state, and conversation history. While the architecture emphasizes stateless WebAssembly agents, practical production systems require state persistence for:
- Agent crash recovery
- Conversation history and context
- Long-running task coordination
- Audit trails and compliance
- Debugging and observability
The state management system must balance consistency, performance, and operational simplicity while maintaining the minimal core philosophy.
Decision
Caxton implements a hybrid state management architecture using event sourcing for critical state transitions and snapshot strategies for performance optimization.
Core State Management Principles
1. Event Sourcing for Audit and Recovery
All state changes are captured as immutable events:
// Wall-clock SystemTime is used rather than Instant so events can be serialized and persisted.
pub enum StateEvent {
    AgentRegistered { id: AgentId, capabilities: Vec<Capability>, timestamp: SystemTime },
    MessageSent { from: AgentId, to: AgentId, message: Message, timestamp: SystemTime },
    TaskAssigned { task_id: TaskId, agent_id: AgentId, timestamp: SystemTime },
    TaskCompleted { task_id: TaskId, result: TaskResult, timestamp: SystemTime },
    AgentFailed { id: AgentId, reason: String, timestamp: SystemTime },
}
2. Snapshot Strategies
To prevent unbounded event log growth, snapshots are triggered by several thresholds (combined into a single policy check in the sketch after this list):
- Time-based snapshots: Every hour for active agents
- Event-count snapshots: Every 1000 events per conversation
- Size-based snapshots: When event log exceeds 10MB
- On-demand snapshots: Before maintenance operations
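A minimal sketch of combining these thresholds into one trigger check that runs after each appended event; SnapshotPolicy and should_snapshot are illustrative names, not part of the Caxton API.
use std::time::{Duration, Instant};

pub struct SnapshotPolicy {
    pub max_age: Duration,     // time-based threshold (1 hour)
    pub max_events: u64,       // event-count threshold (1000 events)
    pub max_log_bytes: u64,    // size-based threshold (10MB)
}

impl SnapshotPolicy {
    pub fn should_snapshot(&self, last_snapshot: Instant, events_since: u64, log_bytes: u64) -> bool {
        // Crossing any single threshold is enough to trigger a snapshot.
        last_snapshot.elapsed() >= self.max_age
            || events_since >= self.max_events
            || log_bytes >= self.max_log_bytes
    }
}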
3. State Partitioning
State is partitioned by concern (illustrative struct shapes follow this list):
- Orchestrator State: Agent registry, routing tables, health metrics
- Conversation State: Message history, correlation contexts
- Agent State: Minimal checkpoint data for recovery
- Task State: Assignment, progress, results
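The shapes below sketch what each partition might hold; the field names and the CorrelationId and HealthMetrics types are assumptions for illustration, not actual Caxton types.
use std::collections::HashMap;

pub struct OrchestratorState {
    pub registry: HashMap<AgentId, Vec<Capability>>,   // agent registry
    pub routing: HashMap<Capability, Vec<AgentId>>,    // routing tables
    pub health: HashMap<AgentId, HealthMetrics>,       // health metrics
}

pub struct ConversationState {
    pub messages: Vec<Message>,                        // message history
    pub correlations: HashMap<CorrelationId, AgentId>, // correlation contexts
}

pub struct AgentCheckpoint {
    pub agent_id: AgentId,
    pub checkpoint: Vec<u8>,                           // minimal opaque recovery data
}

pub struct TaskState {
    pub task_id: TaskId,
    pub assigned_to: Option<AgentId>,
    pub progress: f32,
    pub result: Option<TaskResult>,
}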
Implementation Architecture
Storage Backend
Primary storage uses PostgreSQL with JSONB for flexibility:
CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    aggregate_type VARCHAR(50) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    event_data JSONB NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- PostgreSQL does not support inline INDEX clauses; indexes are created separately.
CREATE INDEX idx_aggregate ON events (aggregate_id, created_at);
CREATE INDEX idx_type_time ON events (aggregate_type, created_at);
CREATE TABLE snapshots (
    aggregate_id UUID PRIMARY KEY,
    aggregate_type VARCHAR(50) NOT NULL,
    snapshot_data JSONB NOT NULL,
    event_version BIGINT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
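A minimal sketch of writing an event against this schema, assuming the sqlx crate with its uuid and json features; the append_event name and error handling are illustrative, not an existing Caxton function.
use serde_json::Value;
use sqlx::PgPool;
use uuid::Uuid;

pub async fn append_event(
    pool: &PgPool,
    aggregate_id: Uuid,
    aggregate_type: &str,
    event_type: &str,
    event_data: Value,
) -> Result<(), sqlx::Error> {
    // The event log is append-only: inserts only, never updates or deletes.
    sqlx::query(
        "INSERT INTO events (aggregate_id, aggregate_type, event_type, event_data)
         VALUES ($1, $2, $3, $4)",
    )
    .bind(aggregate_id)
    .bind(aggregate_type)
    .bind(event_type)
    .bind(event_data)
    .execute(pool)
    .await?;
    Ok(())
}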
Recovery Procedures
Agent Recovery
- Load latest snapshot for agent
- Replay events since snapshot
- Reconstruct agent state
- Resume operation from last checkpoint
pub async fn recover_agent(agent_id: AgentId) -> Result<AgentState> {
    // Start from the most recent snapshot to bound replay time.
    let snapshot = load_snapshot(agent_id).await?;
    // Replay only the events recorded after that snapshot's version.
    let events = load_events_since(agent_id, snapshot.version).await?;
    let mut state = snapshot.state;
    for event in events {
        state = apply_event(state, event)?;
    }
    Ok(state)
}
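The apply_event step referenced above is not shown in this ADR; the sketch below illustrates one possible shape, where the match arms and the active_tasks/healthy fields on AgentState are assumptions for illustration.
fn apply_event(mut state: AgentState, event: StateEvent) -> Result<AgentState> {
    match event {
        StateEvent::TaskAssigned { task_id, .. } => state.active_tasks.push(task_id),
        StateEvent::TaskCompleted { task_id, .. } => state.active_tasks.retain(|t| *t != task_id),
        StateEvent::AgentFailed { .. } => state.healthy = false,
        // Events that do not affect this agent's checkpoint are ignored.
        _ => {}
    }
    Ok(state)
}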
Conversation Recovery
- Load conversation snapshot
- Replay message events
- Restore correlation contexts
- Resume message processing
Orchestrator Recovery
- Load orchestrator snapshot
- Replay registration and routing events
- Rebuild agent registry
- Restore health metrics
- Resume normal operation
State Consistency Guarantees
Eventually Consistent Views
- Agent registry updates propagate within 100ms
- Message history available within 500ms
- Task status updates within 1 second
Strong Consistency for Critical Operations
- Task assignment uses distributed locks
- Agent registration uses compare-and-swap (sketched after this list)
- Message ordering preserved per conversation
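One way to express the registration compare-and-swap is an optimistic update guarded by the version the caller last observed; the sketch below assumes sqlx as above, and the agent_registry table and register_agent_cas name are illustrative.
pub async fn register_agent_cas(
    pool: &sqlx::PgPool,
    agent_id: uuid::Uuid,
    expected_version: i64,
) -> Result<bool, sqlx::Error> {
    let result = sqlx::query(
        "UPDATE agent_registry
         SET version = version + 1, registered = TRUE
         WHERE agent_id = $1 AND version = $2",
    )
    .bind(agent_id)
    .bind(expected_version)
    .execute(pool)
    .await?;
    // Zero rows affected means another writer won the race; the caller re-reads and retries.
    Ok(result.rows_affected() == 1)
}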
Performance Optimizations
Write-Through Cache
- Redis for hot state (active conversations)
- 5-minute TTL with refresh on access
- Write-through to PostgreSQL (see the sketch after this list)
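A minimal write-through sketch: durable writes go to PostgreSQL first, then populate the cache, and reads refresh the TTL on a hit. The Cache and Store traits are illustrative abstractions (assuming the async-trait and anyhow crates), not the redis-rs or Caxton APIs.
use std::time::Duration;

#[async_trait::async_trait]
pub trait Cache {
    async fn get(&self, key: &str) -> Option<Vec<u8>>;
    async fn set(&self, key: &str, value: &[u8], ttl: Duration);
}

#[async_trait::async_trait]
pub trait Store {
    async fn write(&self, key: &str, value: &[u8]) -> anyhow::Result<()>;
    async fn read(&self, key: &str) -> anyhow::Result<Option<Vec<u8>>>;
}

pub struct WriteThrough<C: Cache, S: Store> {
    cache: C,
    store: S,
    ttl: Duration, // 5 minutes for active conversations
}

impl<C: Cache, S: Store> WriteThrough<C, S> {
    pub async fn put(&self, key: &str, value: &[u8]) -> anyhow::Result<()> {
        self.store.write(key, value).await?;        // durable write first
        self.cache.set(key, value, self.ttl).await; // then populate the cache
        Ok(())
    }

    pub async fn get(&self, key: &str) -> anyhow::Result<Option<Vec<u8>>> {
        if let Some(hit) = self.cache.get(key).await {
            self.cache.set(key, &hit, self.ttl).await; // refresh TTL on access
            return Ok(Some(hit));
        }
        self.store.read(key).await // cache miss falls back to PostgreSQL
    }
}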
Read Replicas
- Separate read replicas for queries
- Lag monitoring with alerts at >1 second
- Reads automatically fall back to the primary if lag exceeds the threshold
Batch Processing
- Event writes batched per 100ms window (sketched after this list)
- Snapshot generation in background workers
- Vacuum operations during low-traffic periods
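A sketch of the 100ms batching window using a tokio channel and timer; flush_batch stands in for whatever bulk insert the event writer performs and is not an existing Caxton function.
use std::time::Duration;
use tokio::sync::mpsc::Receiver;

pub async fn batch_writer(mut rx: Receiver<StateEvent>) {
    let mut ticker = tokio::time::interval(Duration::from_millis(100));
    let mut batch: Vec<StateEvent> = Vec::new();
    loop {
        tokio::select! {
            Some(event) = rx.recv() => batch.push(event),
            _ = ticker.tick() => {
                if !batch.is_empty() {
                    // One bulk INSERT per window instead of one write per event.
                    flush_batch(std::mem::take(&mut batch)).await;
                }
            }
        }
    }
}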
Consequences
Positive
- Complete audit trail - Every state change is recorded
- Point-in-time recovery - Can restore to any previous state
- Debugging capability - Can replay scenarios exactly
- Horizontal scalability - Event log can be partitioned
- Compliance ready - Immutable audit log for regulations
Negative
- Storage overhead - Events and snapshots require significant space
- Complexity - Event sourcing adds conceptual overhead
- Eventual consistency - Some operations see stale data
- Operational burden - Requires snapshot management and cleanup
Neutral
- Standard PostgreSQL operations knowledge required
- Event sourcing patterns well-understood in industry
- Existing tooling (Kafka, EventStore) could replace the PostgreSQL backend later if needed
Migration Path
Phase 1: Basic Event Logging (Week 1-2)
- Implement event schema
- Add event logging to critical paths
- Deploy PostgreSQL infrastructure
Phase 2: Snapshot Implementation (Week 3-4)
- Implement snapshot generation
- Add snapshot-based recovery
- Test recovery procedures
Phase 3: Performance Optimization (Week 5-6)
- Add Redis caching layer
- Implement read replicas
- Optimize query patterns
Phase 4: Production Hardening (Week 7-8)
- Add monitoring and alerts
- Implement backup strategies
- Document operational procedures
Alternatives Considered
Pure Event Streaming (Kafka)
- Pros: Proven scale, existing ecosystem
- Cons: Operational complexity, requires Kafka expertise
- Decision: PostgreSQL simpler for initial implementation
Document Store (MongoDB)
- Pros: Flexible schema, good developer experience
- Cons: Weaker consistency guarantees, less operational maturity
- Decision: PostgreSQL JSONB provides similar flexibility
Key-Value Store (DynamoDB/Cassandra)
- Pros: Massive scale, predictable performance
- Cons: Complex data modeling, expensive at small scale
- Decision: Overkill for initial requirements
Guidelines for State Management
- Minimize State: Agents should be as stateless as possible
- Immutable Events: Never modify past events
- Idempotent Operations: Handle duplicate events gracefully (see the sketch after this list)
- Bounded Contexts: Don’t share state across boundaries
- Explicit Schemas: Version all event and snapshot formats
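One common way to make event application idempotent is to remember which event identifiers have already been applied and skip duplicates; EventId and the IdempotentApplier type below are illustrative, not part of the Caxton codebase.
use std::collections::HashSet;

pub struct IdempotentApplier {
    applied: HashSet<EventId>,
}

impl IdempotentApplier {
    pub fn apply(&mut self, state: AgentState, id: EventId, event: StateEvent) -> Result<AgentState> {
        // insert() returns false when the ID was already present, i.e. a duplicate delivery.
        if !self.applied.insert(id) {
            return Ok(state);
        }
        apply_event(state, event)
    }
}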
Monitoring and Alerts
Key metrics to track:
- Event write latency (target: <10ms p99)
- Snapshot generation time (target: <1 second)
- Recovery time objective (target: <30 seconds)
- Storage growth rate (alert at >1GB/day)
- Replication lag (alert at >1 second)
Security Considerations
- Encrypt events at rest using PostgreSQL TDE
- Audit log access with row-level security
- Separate encryption keys for PII data
- Regular backup encryption and testing
- GDPR compliance via event anonymization
References
- Event Sourcing Pattern
- CQRS and Event Sourcing
- PostgreSQL JSONB Performance
- Observability-First Architecture (ADR-0001)
- Minimal Core Philosophy (ADR-0004)
Notes
This state management architecture provides the foundation for reliable agent coordination while maintaining operational simplicity. The event sourcing approach ensures we never lose critical data, while snapshots keep performance acceptable. As the system grows, we can migrate to specialized event stores without changing the conceptual model.