# Lightweight State Management Alternatives for Caxton

## Executive Summary
Based on extensive research, we recommend that Caxton adopt a coordination-only approach rather than shared state, with agent state management delegated to business domains via MCP tools. This aligns with the minimal core philosophy and significantly reduces operational complexity.
## Key Recommendation: Coordination Over Shared State

### Why Caxton Doesn’t Need Shared State
After analyzing Caxton’s actual requirements, it becomes clear that most “state” needs are really coordination concerns:
- Agent Registry: Which agents are available and their capabilities
- Routing Information: How to reach specific agents
- Health Status: Liveness and readiness of agents
- Message Correlation: Tracking conversation contexts
These can be managed through gossip protocols and eventual consistency rather than strongly consistent shared state.
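To make the eventual-consistency claim concrete, here is a minimal sketch of how two instances could merge gossiped agent registries. The `AgentEntry` type, its version counter, and the last-write-wins merge policy are illustrative assumptions for this sketch, not Caxton’s actual implementation.

```rust
use std::collections::HashMap;

// Hypothetical gossiped registry entry: a monotonically increasing
// version lets peers merge concurrent updates deterministically.
#[derive(Clone, Debug, PartialEq)]
struct AgentEntry {
    address: String,
    version: u64, // e.g. a per-agent update counter or hybrid logical clock
}

// Merge a peer's registry into ours: the higher version wins
// (last-write-wins). Repeated pairwise merges converge every node
// to the same registry without any strongly consistent store.
fn merge_registry(
    local: &mut HashMap<String, AgentEntry>,
    remote: &HashMap<String, AgentEntry>,
) {
    for (id, entry) in remote {
        match local.get(id) {
            Some(existing) if existing.version >= entry.version => {} // keep ours
            _ => {
                local.insert(id.clone(), entry.clone());
            }
        }
    }
}

fn main() {
    let mut local = HashMap::new();
    local.insert(
        "agent-a".to_string(),
        AgentEntry { address: "node1:7946".into(), version: 1 },
    );

    let mut remote = HashMap::new();
    remote.insert(
        "agent-a".to_string(),
        AgentEntry { address: "node2:7946".into(), version: 2 },
    );
    remote.insert(
        "agent-b".to_string(),
        AgentEntry { address: "node2:7946".into(), version: 1 },
    );

    merge_registry(&mut local, &remote);
    // agent-a now points at node2 (newer version); agent-b was learned via gossip
    assert_eq!(local["agent-a"].address, "node2:7946");
    assert_eq!(local.len(), 2);
}
```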
## Proposed Architecture: Hybrid Coordination Model

### 1. Embedded SQLite for Local State
Each Caxton instance maintains its own local state using embedded SQLite:
- Zero external dependencies
- Excellent performance for local queries
- Mature, battle-tested technology
- Small footprint (~500KB)
```rust
// Local state storage per instance
use rusqlite::Connection;

pub struct LocalState {
    db: Connection,
}

impl LocalState {
    pub fn new() -> rusqlite::Result<Self> {
        let db = Connection::open("caxton_local.db")?;
        // Store instance-specific data
        Ok(Self { db })
    }
}
```
### 2. SWIM Protocol for Cluster Coordination

Use the SWIM (Scalable Weakly-consistent Infection-style process group Membership) protocol for lightweight cluster coordination:
- No shared state required
- Scales to thousands of nodes
- Failure detection built-in
- Eventually consistent membership
```rust
// Using a Rust SWIM implementation such as the memberlist crate.
// The API shown here is illustrative, not the crate's exact surface.
use std::collections::HashMap;
use memberlist::Memberlist;

pub struct ClusterCoordinator {
    memberlist: Memberlist,
    local_registry: HashMap<AgentId, AgentInfo>,
}

impl ClusterCoordinator {
    pub async fn join_cluster(&mut self, seeds: Vec<String>) -> Result<()> {
        self.memberlist.join(seeds).await?;
        // Gossip the local agent registry to peers
        self.broadcast_local_agents().await
    }
}
```
### 3. Agent State as Business Domain Concern

**Critical insight**: Agent state should NOT be Caxton’s responsibility.
#### Current Problem with ADR-0013
The proposed PostgreSQL-based state management violates the minimal core philosophy by making Caxton responsible for:
- Agent checkpointing
- State recovery
- Event sourcing
- Snapshot management
#### Proposed Solution: MCP State Tools
Agents requiring state persistence should use MCP tools provided by the business domain:
```rust
// Example: an agent delegates persistence to an MCP tool
pub struct StatefulAgent {
    state_tool: Box<dyn McpStateTool>,
}

impl StatefulAgent {
    pub async fn save_state(&self, key: &str, value: Value) -> Result<()> {
        // Delegate to the business-provided MCP tool
        self.state_tool.store(key.to_string(), value).await
    }

    pub async fn load_state(&self, key: &str) -> Result<Option<Value>> {
        // The business decides the storage backend
        self.state_tool.retrieve(key.to_string()).await
    }
}
```
This allows businesses to choose their own state backends:
- Redis for caching
- PostgreSQL for transactions
- S3 for blob storage
- DynamoDB for serverless
## Lightweight Storage Options Comparison

### For Caxton’s Internal Needs Only
| Solution | Pros | Cons | Use Case |
|----------|------|------|----------|
| SQLite | Zero deps, mature, SQL support | Single-writer limitation | ✅ Local instance state |
| sled | Pure Rust, lock-free | Unstable, space inefficient | ❌ Too immature |
| RocksDB | High performance, LSM-tree | C++ dependency, complex | ⚠️ If performance critical |
| LMDB | Memory-mapped, multi-process | Read-optimized | ❌ Wrong access pattern |
### Recommendation: SQLite for Local State
- Each Caxton instance has its own SQLite database
- No coordination needed for local operations
- Gossip protocol shares necessary information
## Implementation Strategy

### Phase 1: Remove Shared State Requirements
```rust
// Before: shared state in PostgreSQL
pub struct SharedOrchestrator {
    postgres: PostgresPool,
    // Complex event sourcing...
}

// After: coordination-only
pub struct CoordinatedOrchestrator {
    local_db: Sqlite,
    gossip: SwimProtocol,
    // No shared state!
}
```
### Phase 2: Implement SWIM Protocol
```rust
use std::collections::HashMap;
use async_std::sync::RwLock;

pub struct SwimCluster {
    members: RwLock<HashMap<NodeId, NodeInfo>>,
    failure_detector: FailureDetector,
}

impl SwimCluster {
    pub async fn detect_failures(&self) {
        // SWIM's scalable failure detection: probe one random member
        let target = self.select_random_member().await;
        if !self.ping(target).await {
            // Ask other members to probe indirectly before suspecting the node
            self.request_ping_from_others(target).await;
        }
    }
}
```
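The indirect-probe step above feeds SWIM’s suspicion mechanism: a member that fails both direct and indirect probes is marked suspect, not dead, and gets a grace period to refute the suspicion. A simplified, self-contained sketch of that state machine; the `MemberState` enum and 5-second timeout are illustrative assumptions:

```rust
use std::time::Duration;

// Simplified sketch of SWIM's Alive -> Suspect -> Dead transition.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MemberState {
    Alive,
    Suspect { since_ms: u64 },
    Dead,
}

// Illustrative suspicion window; real implementations tune this.
const SUSPICION_TIMEOUT: Duration = Duration::from_millis(5_000);

// Advance a member's state given the latest probe result and the clock.
fn next_state(state: MemberState, probe_ok: bool, now_ms: u64) -> MemberState {
    match (state, probe_ok) {
        (_, true) => MemberState::Alive, // any ack refutes suspicion
        (MemberState::Alive, false) => MemberState::Suspect { since_ms: now_ms },
        (MemberState::Suspect { since_ms }, false)
            if now_ms - since_ms >= SUSPICION_TIMEOUT.as_millis() as u64 =>
        {
            MemberState::Dead
        }
        (s, false) => s, // still within the suspicion window
    }
}

fn main() {
    let s = next_state(MemberState::Alive, false, 1_000);
    assert_eq!(s, MemberState::Suspect { since_ms: 1_000 });
    // Within the window the member stays Suspect...
    assert_eq!(next_state(s, false, 3_000), s);
    // ...after the timeout it is declared Dead...
    assert_eq!(next_state(s, false, 6_000), MemberState::Dead);
    // ...but a single successful probe refutes the suspicion.
    assert_eq!(next_state(s, true, 3_000), MemberState::Alive);
}
```

The grace period is what makes SWIM robust to transient packet loss: a single dropped ping never evicts a healthy member.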
### Phase 3: MCP State Tool Specification
```rust
// Standard interface for state persistence
#[async_trait]
pub trait McpStateTool: Send + Sync {
    async fn store(&self, key: String, value: Value) -> Result<()>;
    async fn retrieve(&self, key: String) -> Result<Option<Value>>;
    async fn delete(&self, key: String) -> Result<()>;
    async fn list(&self, prefix: String) -> Result<Vec<String>>;
}

// Businesses implement their preferred backend
pub struct RedisStateTool { /* ... */ }
pub struct S3StateTool { /* ... */ }
pub struct PostgresStateTool { /* ... */ }
```
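For a runnable illustration of the contract, here is a synchronous, in-memory analogue of the trait above. The names `StateTool` and `InMemoryStateTool`, and the simplification of `Value` to `String`, are assumptions made for this sketch; a real backend would implement the async trait against its own store.

```rust
use std::collections::BTreeMap;

// Synchronous, in-memory analogue of the McpStateTool contract,
// shown only to illustrate what a business backend must satisfy.
trait StateTool {
    fn store(&mut self, key: &str, value: &str);
    fn retrieve(&self, key: &str) -> Option<String>;
    fn delete(&mut self, key: &str);
    fn list(&self, prefix: &str) -> Vec<String>;
}

struct InMemoryStateTool {
    data: BTreeMap<String, String>, // BTreeMap gives ordered prefix scans
}

impl StateTool for InMemoryStateTool {
    fn store(&mut self, key: &str, value: &str) {
        self.data.insert(key.to_string(), value.to_string());
    }
    fn retrieve(&self, key: &str) -> Option<String> {
        self.data.get(key).cloned()
    }
    fn delete(&mut self, key: &str) {
        self.data.remove(key);
    }
    fn list(&self, prefix: &str) -> Vec<String> {
        self.data
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect()
    }
}

fn main() {
    let mut tool = InMemoryStateTool { data: BTreeMap::new() };
    tool.store("checkpoint/agent-1", "step=42");
    tool.store("checkpoint/agent-2", "step=7");
    assert_eq!(tool.retrieve("checkpoint/agent-1").as_deref(), Some("step=42"));
    assert_eq!(tool.list("checkpoint/").len(), 2);
    tool.delete("checkpoint/agent-2");
    assert_eq!(tool.list("checkpoint/"), vec!["checkpoint/agent-1".to_string()]);
}
```

Swapping this for a Redis- or S3-backed implementation changes nothing for the agent, which is the point of keeping the interface in MCP rather than in Caxton’s core.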
## Benefits of This Approach

### 1. Operational Simplicity
- No PostgreSQL required: Eliminates heavy dependency
- No backup management: Each instance is disposable
- No migration complexity: Schema-less coordination
### 2. Better Scalability
- Linear scaling: Add nodes without shared state bottleneck
- Geographic distribution: Works across regions
- Fault isolation: Node failures don’t affect others
### 3. Alignment with Minimal Core
- Core remains simple: Just message routing
- Flexibility for users: Choose their own state backend
- Clear boundaries: Caxton handles coordination, not business state
### 4. Reduced Complexity
- No event sourcing: Eliminates complex replay logic
- No snapshots: No snapshot management overhead
- No consensus: SWIM provides eventual consistency
## Migration Path from ADR-0013

### Step 1: Redefine State Categories
```yaml
# What Caxton manages (coordination)
coordination:
  - agent_registry        # Via gossip
  - health_status         # Via SWIM
  - routing_info          # Via gossip

# What businesses manage (state)
business_state:
  - agent_checkpoints     # Via MCP tools
  - conversation_history  # Via MCP tools
  - task_state            # Via MCP tools
  - audit_logs            # Via MCP tools
```
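One way the boundary above could be enforced is a simple classifier at the API layer. This sketch, the `StateOwner` enum, and the key names are purely illustrative:

```rust
// Hypothetical routing of state operations based on the category split:
// coordination keys stay inside Caxton; everything else must go through
// a business-provided MCP tool.
#[derive(Debug, PartialEq)]
enum StateOwner {
    CaxtonCoordination,
    BusinessMcpTool,
}

fn classify(key: &str) -> StateOwner {
    match key {
        "agent_registry" | "health_status" | "routing_info" => {
            StateOwner::CaxtonCoordination
        }
        _ => StateOwner::BusinessMcpTool,
    }
}

fn main() {
    assert_eq!(classify("health_status"), StateOwner::CaxtonCoordination);
    // Checkpoints, history, tasks, and audit logs all fall to MCP tools.
    assert_eq!(classify("agent_checkpoints"), StateOwner::BusinessMcpTool);
}
```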
### Step 2: Update ADR-0013
Create ADR-0014 that supersedes ADR-0013:
- Title: “Coordination-First Architecture”
- Explicitly reject shared state
- Define MCP state tool interface
- Document SWIM protocol usage
### Step 3: Implement Gradually
- Start with SQLite for local state
- Add SWIM for cluster membership
- Define MCP state tool interface
- Migrate shared state to coordination
## Example: Multi-Instance Deployment
```rust
// Instance 1 (primary DC)
let instance1 = Caxton::new()
    .with_local_db("instance1.db")
    .with_swim_seeds(vec!["instance2:7946"]);

// Instance 2 (secondary DC)
let instance2 = Caxton::new()
    .with_local_db("instance2.db")
    .with_swim_seeds(vec!["instance1:7946"]);

// The instances discover each other via SWIM,
// share their agent registries via gossip,
// and need no shared database.
```
## Comparison with Other Systems

### HashiCorp Consul
- Uses SWIM for membership
- Raft only for critical configuration
- Proves gossip scales to thousands of nodes
### Apache Cassandra
- Uses gossip for cluster state
- No central coordinator
- Scales to hundreds of nodes
### Kubernetes
- etcd only for critical config
- Kubelet has local state
- Proves hybrid model works
## Risks and Mitigations

### Risk: Eventual Consistency

**Mitigation**: Use eventual consistency only for non-critical data such as agent discovery; critical operations rely on local state.
### Risk: Network Partitions

**Mitigation**: SWIM handles partitions gracefully. Each partition continues operating independently.
### Risk: Missing Features

**Mitigation**: MCP tools provide flexibility. Businesses can add any state management they need.
## Conclusion
Caxton should:
- Abandon shared state in favor of coordination protocols
- Use SQLite for local instance state
- Implement SWIM for cluster coordination
- Delegate agent state to MCP tools
This approach:
- Eliminates PostgreSQL dependency
- Reduces operational complexity
- Improves scalability
- Aligns with minimal core philosophy
- Provides maximum flexibility
**The key insight**: Caxton is a message router, not a database. Let it excel at routing while businesses handle their own state requirements through MCP tools.
## Recommended Next Steps
- Revise ADR-0013 to remove PostgreSQL dependency
- Create new ADR for coordination-first architecture
- Define MCP StateTool interface specification
- Prototype SWIM integration using memberlist-rs
- Update architecture docs to reflect this approach
This lightweight approach will make Caxton easier to deploy, operate, and scale while maintaining all necessary functionality through intelligent architectural choices.