Lightweight State Management Alternatives for Caxton

Executive Summary

This research concludes that Caxton should adopt a coordination-only approach rather than shared state, with agent state management delegated to business domains via MCP tools. This aligns with the minimal core philosophy and significantly reduces operational complexity.

Key Recommendation: Coordination Over Shared State

Why Caxton Doesn’t Need Shared State

An analysis of Caxton’s actual requirements shows that most of its “state” needs are really coordination concerns:

  1. Agent Registry: Which agents are available and their capabilities
  2. Routing Information: How to reach specific agents
  3. Health Status: Liveness and readiness of agents
  4. Message Correlation: Tracking conversation contexts

These can be managed through gossip protocols and eventual consistency rather than strongly consistent shared state.
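
To make the distinction concrete, the records that travel over gossip can stay small and self-describing. A sketch follows; the type and field names are illustrative assumptions, not an existing Caxton API:

use serde::{Deserialize, Serialize};

// Hypothetical gossip payload: everything peers need to route to an agent.
// Type and field names are assumptions, not an existing Caxton type.
#[derive(Serialize, Deserialize, Clone)]
pub struct AgentAnnouncement {
    pub agent_id: String,
    pub capabilities: Vec<String>, // advertised capabilities
    pub node_addr: String,         // where the hosting instance listens
    pub incarnation: u64,          // monotonic counter to resolve stale entries
}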

Proposed Architecture: Hybrid Coordination Model

1. Embedded SQLite for Local State

Each Caxton instance maintains its own local state using embedded SQLite:

  • Zero external dependencies
  • Excellent performance for local queries
  • Mature, battle-tested technology
  • Small footprint (~500KB)

// Local state storage per instance (sketch using the rusqlite crate)
use rusqlite::{Connection, Result};

pub struct LocalState {
    db: Connection,
}

impl LocalState {
    pub fn new() -> Result<Self> {
        // Each instance opens its own database file; nothing is shared
        let db = Connection::open("caxton_local.db")?;
        Ok(Self { db })
    }
}
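
On first start, each instance can create whatever local tables it needs. A minimal sketch, assuming a single registry table (the table name and columns are illustrative, not a fixed Caxton schema):

impl LocalState {
    // Hypothetical helper: create instance-local tables on first start.
    // Table name and columns are illustrative, not a fixed Caxton schema.
    pub fn init_schema(&self) -> Result<()> {
        self.db.execute_batch(
            "CREATE TABLE IF NOT EXISTS local_agents (
                 id           TEXT PRIMARY KEY,
                 capabilities TEXT NOT NULL,
                 updated_at   INTEGER NOT NULL
             );",
        )
    }
}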

2. SWIM Protocol for Cluster Coordination

Use the SWIM protocol for lightweight cluster coordination:

  • No shared state required
  • Scales to thousands of nodes
  • Failure detection built-in
  • Eventually consistent membership

// Sketch using a Rust SWIM implementation; the memberlist crate is one
// option, though its exact API may differ from what is shown here.
use std::collections::HashMap;

use memberlist::Memberlist;

pub struct ClusterCoordinator {
    memberlist: Memberlist,
    local_registry: HashMap<AgentId, AgentInfo>,
}

impl ClusterCoordinator {
    pub async fn join_cluster(&mut self, seeds: Vec<String>) -> Result<()> {
        // Joining via any reachable seed is enough; membership then
        // propagates to the rest of the cluster by gossip
        self.memberlist.join(seeds).await?;
        // Gossip this instance's agent registry to peers
        self.broadcast_local_agents().await
    }
}
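
The broadcast step itself stays small: serialize the local registry and hand it to the gossip layer, which piggybacks it on SWIM’s existing messages. In the sketch below, broadcast is an assumed method on the gossip layer, not a confirmed memberlist API:

impl ClusterCoordinator {
    // Hypothetical: serialize the registry and piggyback it on gossip.
    // `broadcast` is an assumed method, not a confirmed memberlist API.
    async fn broadcast_local_agents(&self) -> Result<()> {
        let payload = serde_json::to_vec(&self.local_registry)?;
        self.memberlist.broadcast(payload).await
    }
}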

3. Agent State as Business Domain Concern

Critical Insight: Agent state should NOT be Caxton’s responsibility.

Current Problem with ADR-0013

The proposed PostgreSQL-based state management violates the minimal core philosophy by making Caxton responsible for:

  • Agent checkpointing
  • State recovery
  • Event sourcing
  • Snapshot management

Proposed Solution: MCP State Tools

Agents requiring state persistence should use MCP tools provided by the business domain:

// Example: an agent persists state through a business-provided MCP tool
use serde_json::Value;

pub struct StatefulAgent {
    state_tool: Box<dyn McpStateTool>,
}

impl StatefulAgent {
    pub async fn save_state(&self, key: &str, value: Value) -> Result<()> {
        // Delegate persistence to the business-provided MCP tool
        self.state_tool.store(key, value).await
    }

    pub async fn load_state(&self, key: &str) -> Result<Option<Value>> {
        // The business decides which storage backend sits behind the tool
        self.state_tool.retrieve(key).await
    }
}
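
Wiring an agent to a concrete backend is then a one-line decision at construction time. A hypothetical example, using the PostgresStateTool stub defined later in this document:

// Hypothetical wiring: the business picks the backend at construction time
let agent = StatefulAgent {
    state_tool: Box::new(PostgresStateTool::new(pg_pool)),
};
agent.save_state("checkpoint:last_offset", serde_json::json!(42)).await?;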

This allows businesses to choose their own state backends:

  • Redis for caching
  • PostgreSQL for transactions
  • S3 for blob storage
  • DynamoDB for serverless

Lightweight Storage Options Comparison

For Caxton’s Internal Needs Only

| Solution | Pros                           | Cons                        | Use Case                   |
| -------- | ------------------------------ | --------------------------- | -------------------------- |
| SQLite   | Zero deps, mature, SQL support | Single-writer limitation    | ✅ Local instance state     |
| sled     | Pure Rust, lock-free           | Unstable, space inefficient | ❌ Too immature             |
| RocksDB  | High performance, LSM-tree     | C++ dependency, complex     | ⚠️ If performance critical |
| LMDB     | Memory-mapped, multi-process   | Read-optimized              | ❌ Wrong access pattern     |

Recommendation: SQLite for Local State

  • Each Caxton instance has its own SQLite database
  • No coordination needed for local operations
  • Gossip protocol shares necessary information

Implementation Strategy

Phase 1: Remove Shared State Requirements

// Before: shared state in PostgreSQL
pub struct SharedOrchestrator {
    postgres: PostgresPool,
    // Complex event sourcing, snapshots, migrations...
}

// After: coordination-only (types shown are illustrative)
pub struct CoordinatedOrchestrator {
    local_db: LocalState,   // embedded SQLite, per instance
    gossip: SwimProtocol,   // cluster membership and discovery
    // No shared state!
}
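
To make the contrast concrete, a routing lookup in the coordination-only model touches only data the instance already holds. A sketch, assuming hypothetical local_routes and gossip lookup helpers:

impl CoordinatedOrchestrator {
    // Hypothetical lookup: consult the instance's own registry first,
    // then fall back to gossip-learned peers. No remote database
    // round-trip is involved.
    pub fn resolve_route(&self, agent: &AgentId) -> Option<NodeId> {
        self.local_routes(agent)
            .or_else(|| self.gossip.lookup(agent))
    }
}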

Phase 2: Implement SWIM Protocol

use std::collections::HashMap;

use async_std::sync::RwLock;

pub struct SwimCluster {
    members: RwLock<HashMap<NodeId, NodeInfo>>,
    failure_detector: FailureDetector,
}

impl SwimCluster {
    pub async fn detect_failures(&self) {
        // SWIM probes one random member per protocol period, so the
        // per-node workload stays constant as the cluster grows
        let target = self.select_random_member().await;
        if !self.ping(target).await {
            // Direct ping failed: ask other members to probe the target
            // (SWIM's indirect ping) before declaring it suspect
            self.request_ping_from_others(target).await;
        }
    }
}

Phase 3: MCP State Tool Specification

// Standard interface for state persistence
use async_trait::async_trait;
use serde_json::Value;

#[async_trait]
pub trait McpStateTool: Send + Sync {
    async fn store(&self, key: &str, value: Value) -> Result<()>;
    async fn retrieve(&self, key: &str) -> Result<Option<Value>>;
    async fn delete(&self, key: &str) -> Result<()>;
    async fn list(&self, prefix: &str) -> Result<Vec<String>>;
}

// Businesses implement their preferred backend
pub struct RedisStateTool { /* ... */ }
pub struct S3StateTool { /* ... */ }
pub struct PostgresStateTool { /* ... */ }
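
As a reference point, the trait can be exercised without any external service. The following in-memory implementation is a sketch for tests and local development, not a proposed Caxton component:

use std::collections::HashMap;
use std::sync::Arc;

use async_std::sync::RwLock;
use async_trait::async_trait;
use serde_json::Value;

// In-memory backend: useful for tests; data is lost on restart.
pub struct InMemoryStateTool {
    map: Arc<RwLock<HashMap<String, Value>>>,
}

#[async_trait]
impl McpStateTool for InMemoryStateTool {
    async fn store(&self, key: &str, value: Value) -> Result<()> {
        self.map.write().await.insert(key.to_string(), value);
        Ok(())
    }

    async fn retrieve(&self, key: &str) -> Result<Option<Value>> {
        Ok(self.map.read().await.get(key).cloned())
    }

    async fn delete(&self, key: &str) -> Result<()> {
        self.map.write().await.remove(key);
        Ok(())
    }

    async fn list(&self, prefix: &str) -> Result<Vec<String>> {
        Ok(self.map.read().await.keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect())
    }
}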

Benefits of This Approach

1. Operational Simplicity

  • No PostgreSQL required: Eliminates heavy dependency
  • No backup management: Each instance is disposable
  • No migration complexity: Schema-less coordination

2. Better Scalability

  • Linear scaling: Add nodes without shared state bottleneck
  • Geographic distribution: Works across regions
  • Fault isolation: Node failures don’t affect others

3. Alignment with Minimal Core

  • Core remains simple: Just message routing
  • Flexibility for users: Choose their own state backend
  • Clear boundaries: Caxton handles coordination, not business state

4. Reduced Complexity

  • No event sourcing: Eliminates complex replay logic
  • No snapshots: No snapshot management overhead
  • No consensus: SWIM provides eventual consistency

Migration Path from ADR-0013

Step 1: Redefine State Categories

# What Caxton manages (coordination)
coordination:
  - agent_registry        # via gossip
  - health_status         # via SWIM
  - routing_info          # via gossip

# What businesses manage (state)
business_state:
  - agent_checkpoints     # via MCP tools
  - conversation_history  # via MCP tools
  - task_state            # via MCP tools
  - audit_logs            # via MCP tools

Step 2: Update ADR-0013

Create ADR-0014 that supersedes ADR-0013:

  • Title: “Coordination-First Architecture”
  • Explicitly reject shared state
  • Define MCP state tool interface
  • Document SWIM protocol usage

Step 3: Implement Gradually

  1. Start with SQLite for local state
  2. Add SWIM for cluster membership
  3. Define MCP state tool interface
  4. Migrate shared state to coordination

Example: Multi-Instance Deployment

// Instance 1 (Primary DC)
let instance1 = Caxton::new()
    .with_local_db("instance1.db")
    .with_swim_seeds(vec!["instance2:7946"]);

// Instance 2 (Secondary DC)
let instance2 = Caxton::new()
    .with_local_db("instance2.db")
    .with_swim_seeds(vec!["instance1:7946"]);

// They discover each other via SWIM
// Share agent registry via gossip
// No shared database needed!

Comparison with Other Systems

HashiCorp Consul

  • Uses SWIM for membership
  • Raft only for critical configuration
  • Proves gossip scales to thousands of nodes

Apache Cassandra

  • Uses gossip for cluster state
  • No central coordinator
  • Scales to hundreds of nodes

Kubernetes

  • etcd only for critical config
  • Kubelet has local state
  • Proves hybrid model works

Risks and Mitigations

Risk: Eventual Consistency

Mitigation: Use eventual consistency only for non-critical data such as agent discovery; critical operations rely on local state.

Risk: Network Partitions

Mitigation: SWIM handles partitions gracefully. Each partition continues operating independently.

Risk: Missing Features

Mitigation: The features dropped from the core (checkpointing, recovery, event sourcing) remain available through MCP tools; businesses can plug in whatever state management they need.

Conclusion

Caxton should:

  1. Abandon shared state in favor of coordination protocols
  2. Use SQLite for local instance state
  3. Implement SWIM for cluster coordination
  4. Delegate agent state to MCP tools

This approach:

  • Eliminates PostgreSQL dependency
  • Reduces operational complexity
  • Improves scalability
  • Aligns with minimal core philosophy
  • Provides maximum flexibility

The key insight: Caxton is a message router, not a database. Let it excel at routing while businesses handle their own state requirements through MCP tools.

Next Steps

  1. Revise ADR-0013 to remove PostgreSQL dependency
  2. Create new ADR for coordination-first architecture
  3. Define MCP StateTool interface specification
  4. Prototype SWIM integration using memberlist-rs
  5. Update architecture docs to reflect this approach

This lightweight approach will make Caxton easier to deploy, operate, and scale while maintaining all necessary functionality through intelligent architectural choices.