Chapter 6.5: Production Checklist
This chapter provides a comprehensive checklist for deploying EventCore applications to production. Use this as a final validation before going live and as a periodic review for existing production systems.
Pre-Deployment Checklist
Security
Authentication and Authorization
- JWT secret key configured and secured
- Token expiration properly configured
- Role-based access control implemented and tested
- API rate limiting configured
- CORS origins restricted to known domains
- HTTPS enforced for all endpoints
- Security headers configured (HSTS, CSP, etc.)
// Security configuration validation
#[derive(Debug)]
pub struct SecurityAudit {
    pub findings: Vec<SecurityFinding>,
}

#[derive(Debug)]
pub struct SecurityFinding {
    pub category: SecurityCategory,
    pub severity: SecuritySeverity,
    pub description: String,
    pub recommendation: String,
}

#[derive(Debug)]
pub enum SecurityCategory {
    Authentication,
    Authorization,
    Encryption,
    NetworkSecurity,
    DataProtection,
}

#[derive(Debug)]
pub enum SecuritySeverity {
    Critical,
    High,
    Medium,
    Low,
}

pub struct SecurityAuditor;

impl SecurityAuditor {
    pub fn audit_configuration(config: &AppConfig) -> SecurityAudit {
        let mut findings = Vec::new();

        // Check JWT configuration
        if config.jwt.secret_key.len() < 32 {
            findings.push(SecurityFinding {
                category: SecurityCategory::Authentication,
                severity: SecuritySeverity::Critical,
                description: "JWT secret key is too short".to_string(),
                recommendation: "Use a secret key of at least 256 bits (32 bytes)".to_string(),
            });
        }

        // Check CORS configuration
        if config.cors.allowed_origins.contains(&"*".to_string()) {
            findings.push(SecurityFinding {
                category: SecurityCategory::NetworkSecurity,
                severity: SecuritySeverity::High,
                description: "CORS allows all origins".to_string(),
                recommendation: "Restrict CORS to specific trusted domains".to_string(),
            });
        }

        // Check HTTPS enforcement
        if !config.server.force_https {
            findings.push(SecurityFinding {
                category: SecurityCategory::NetworkSecurity,
                severity: SecuritySeverity::High,
                description: "HTTPS not enforced".to_string(),
                recommendation: "Enable HTTPS enforcement for all endpoints".to_string(),
            });
        }

        // Check rate limiting
        if config.rate_limiting.requests_per_minute == 0 {
            findings.push(SecurityFinding {
                category: SecurityCategory::NetworkSecurity,
                severity: SecuritySeverity::Medium,
                description: "Rate limiting not configured".to_string(),
                recommendation: "Configure appropriate rate limits for API endpoints".to_string(),
            });
        }

        SecurityAudit { findings }
    }
}
Database Security
- Database credentials stored in secrets management
- Connection encryption (SSL/TLS) enabled
- Database user permissions follow principle of least privilege
- Database firewall rules restrict access
- Connection pooling properly configured
- Query parameterization used (prevent SQL injection)
-- PostgreSQL security checklist queries
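-- Check that SSL is enabled on the server
-- (enforcing it for clients also requires hostssl rules in pg_hba.conf)
SHOW ssl;
-- List roles and their attributes (psql meta-command)
\du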
-- Check database-level permissions
SELECT datname, datacl FROM pg_database;
-- Check table ownership (pg_tables lists owners and properties, not grants)
SELECT schemaname, tablename, tableowner, tablespace, hasindexes, hasrules, hastriggers
FROM pg_tables
WHERE schemaname = 'public';
-- Verify that no table privileges are granted to PUBLIC (that is, to every role)
SELECT * FROM information_schema.table_privileges
WHERE grantee = 'PUBLIC';
Performance
Resource Limits
- CPU limits set appropriately
- Memory limits configured with buffer
- Database connection pool sized correctly
- Request timeouts configured
- Circuit breakers implemented
- Resource quotas set at namespace level
# Kubernetes resource configuration checklist
apiVersion: v1
kind: LimitRange
metadata:
  name: eventcore-limits
  namespace: eventcore
spec:
  limits:
    - type: Container
      default:
        memory: "512Mi"
        cpu: "500m"
      defaultRequest:
        memory: "256Mi"
        cpu: "250m"
      max:
        memory: "2Gi"
        cpu: "2000m"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: eventcore-quota
  namespace: eventcore
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "4"
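The "circuit breakers implemented" item in the list above can be illustrated with a small sketch. This is not an EventCore API: the `CircuitBreaker` type, its fields, and its thresholds are hypothetical, and it is deliberately simplified (once the reset interval has passed, calls are allowed until the next recorded success or failure).

```rust
use std::time::{Duration, Instant};

/// Hypothetical circuit breaker: opens after `failure_threshold` consecutive
/// failures and allows trial calls again once `reset_after` has elapsed.
#[derive(Debug)]
pub struct CircuitBreaker {
    failure_threshold: u32,
    reset_after: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, reset_after: Duration) -> Self {
        Self {
            failure_threshold,
            reset_after,
            consecutive_failures: 0,
            opened_at: None,
        }
    }

    /// Returns true if a call may proceed at time `now`.
    pub fn allow_request(&self, now: Instant) -> bool {
        match self.opened_at {
            // Closed: calls flow normally.
            None => true,
            // Open: only allow calls once the reset interval has passed.
            Some(opened) => now.saturating_duration_since(opened) >= self.reset_after,
        }
    }

    /// A success closes the breaker and clears the failure count.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.opened_at = None;
    }

    /// A failure at or beyond the threshold (re)opens the breaker at `now`.
    pub fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.opened_at = Some(now);
        }
    }
}
```

A caller would check `allow_request` before contacting a dependency such as the database, then report the outcome with `record_success` or `record_failure`. Passing `now` explicitly keeps the logic testable without sleeping.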
Performance Benchmarks
- Load testing completed with realistic scenarios
- Performance baselines established
- Scalability limits identified
- Database query performance optimized
- Index usage analyzed and optimized
// Performance validation
pub struct PerformanceValidator {
    target_metrics: PerformanceTargets,
}

#[derive(Debug, Clone)]
pub struct PerformanceTargets {
    pub max_p95_latency_ms: u64,
    pub min_throughput_rps: f64,
    pub max_error_rate: f64,
    pub max_memory_usage_mb: f64,
}

impl PerformanceValidator {
    pub async fn validate_performance(&self) -> Result<PerformanceValidationResult, ValidationError> {
        let mut results = PerformanceValidationResult::default();

        // Test command latency
        let latency_test = self.test_command_latency().await?;
        results.latency_passed = latency_test.p95_latency_ms <= self.target_metrics.max_p95_latency_ms;

        // Test throughput
        let throughput_test = self.test_throughput().await?;
        results.throughput_passed = throughput_test.requests_per_second >= self.target_metrics.min_throughput_rps;

        // Test error rate (test_error_rate is not shown here)
        let error_test = self.test_error_rate().await?;
        results.error_rate_passed = error_test.error_rate <= self.target_metrics.max_error_rate;

        // Test memory usage (test_memory_usage is not shown here)
        let memory_test = self.test_memory_usage().await?;
        results.memory_passed = memory_test.peak_memory_mb <= self.target_metrics.max_memory_usage_mb;

        results.overall_passed = results.latency_passed
            && results.throughput_passed
            && results.error_rate_passed
            && results.memory_passed;

        Ok(results)
    }

    async fn test_command_latency(&self) -> Result<LatencyTestResult, ValidationError> {
        // Implement latency testing
        // Execute sample commands and measure response times
        Ok(LatencyTestResult {
            p95_latency_ms: 50, // Example result
            avg_latency_ms: 25,
        })
    }

    async fn test_throughput(&self) -> Result<ThroughputTestResult, ValidationError> {
        // Implement throughput testing
        // Execute concurrent commands and measure RPS
        Ok(ThroughputTestResult {
            requests_per_second: 150.0, // Example result
            peak_concurrent_requests: 50,
        })
    }
}

#[derive(Debug, Default)]
pub struct PerformanceValidationResult {
    pub latency_passed: bool,
    pub throughput_passed: bool,
    pub error_rate_passed: bool,
    pub memory_passed: bool,
    pub overall_passed: bool,
}
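The validator above compares a p95 latency against a target but leaves the measurement itself as a placeholder. One way to derive a percentile from recorded samples is the nearest-rank method, sketched below. The function name `percentile_ms` is an assumption for illustration, and a real load test would more likely read this value from a histogram in the metrics system.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
/// Returns `None` for an empty sample set or a `pct` above 100.
pub fn percentile_ms(samples: &[u64], pct: u32) -> Option<u64> {
    if samples.is_empty() || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(pct / 100 * n), computed in integers to avoid float rounding.
    let rank = (pct as usize * sorted.len() + 99) / 100;
    // Ranks are 1-based; pct = 0 maps to the smallest sample.
    Some(sorted[rank.max(1) - 1])
}
```

For example, with latencies of 1 through 100 ms recorded once each, `percentile_ms(&samples, 95)` yields 95, which would then be compared against `max_p95_latency_ms`.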
Reliability
High Availability
- Multiple replicas deployed
- Pod disruption budgets configured
- Health checks implemented and tested
- Readiness probes properly configured
- Liveness probes tuned appropriately
- Rolling update strategy configured
# High availability configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: eventcore-pdb
  namespace: eventcore
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: eventcore
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventcore-app
  namespace: eventcore
spec:
  replicas: 3
  # The pod labels must match the PodDisruptionBudget selector above
  selector:
    matchLabels:
      app: eventcore
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: eventcore
    spec:
      containers:
        - name: eventcore-app
          # image, ports, and resources omitted
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
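The numbers in this configuration interact, and it can help to check the arithmetic before deploying. The sketch below assumes absolute (not percentage) values for `maxUnavailable` and `maxSurge`. The function and type names are made up for illustration.

```rust
/// Pod-count bounds during a RollingUpdate with absolute settings.
#[derive(Debug, PartialEq)]
pub struct RolloutBounds {
    /// Fewest pods that stay available while the rollout is in progress.
    pub min_available: u32,
    /// Most pods (old plus new) that may exist at once.
    pub max_total: u32,
}

pub fn rollout_bounds(replicas: u32, max_unavailable: u32, max_surge: u32) -> RolloutBounds {
    RolloutBounds {
        min_available: replicas.saturating_sub(max_unavailable),
        max_total: replicas + max_surge,
    }
}

/// Voluntary disruptions (for example, evictions during a node drain) that a
/// PodDisruptionBudget with `minAvailable` permits for the current healthy count.
pub fn allowed_disruptions(healthy_pods: u32, min_available: u32) -> u32 {
    healthy_pods.saturating_sub(min_available)
}
```

With `replicas: 3`, `maxUnavailable: 1`, and `maxSurge: 1`, a rollout keeps at least 2 pods available and runs at most 4 at once, so the resource quota needs headroom for the extra pod. With 3 healthy pods and `minAvailable: 2`, one pod can be evicted at a time. If only 2 are healthy, evictions are blocked.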
Backup and Recovery
- Automated backups configured and tested
- Backup verification automated
- Recovery procedures documented and tested
- Point-in-time recovery capability verified
- Cross-region backup replication configured
- Backup retention policies implemented
// Backup validation
pub struct BackupValidator;

impl BackupValidator {
    pub async fn validate_backup_system(&self) -> Result<BackupValidationResult, ValidationError> {
        let mut result = BackupValidationResult::default();

        // Test backup creation
        result.backup_creation = self.test_backup_creation().await?;

        // Test backup verification
        result.backup_verification = self.test_backup_verification().await?;

        // Test restore functionality
        result.restore_capability = self.test_restore_capability().await?;

        // Test backup schedule
        result.backup_schedule = self.verify_backup_schedule().await?;

        // Test retention policy
        result.retention_policy = self.verify_retention_policy().await?;

        result.overall_passed = result.backup_creation
            && result.backup_verification
            && result.restore_capability
            && result.backup_schedule
            && result.retention_policy;

        Ok(result)
    }
}

#[derive(Debug, Default)]
pub struct BackupValidationResult {
    pub backup_creation: bool,
    pub backup_verification: bool,
    pub restore_capability: bool,
    pub backup_schedule: bool,
    pub retention_policy: bool,
    pub overall_passed: bool,
}
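The retention check in the validator is left abstract. As one possible shape for it, the sketch below selects backups that are older than the retention period while always keeping a minimum number of the most recent ones. The `Backup` struct and `backups_to_prune` function are hypothetical and stand in for whatever the backup tooling actually exposes.

```rust
use std::time::Duration;

/// Hypothetical backup record: a name and how long ago it was taken.
pub struct Backup {
    pub name: String,
    pub age: Duration,
}

/// Names of backups that fall outside the retention policy, newest first.
/// The `keep_min` most recent backups are never pruned, regardless of age.
pub fn backups_to_prune(backups: &[Backup], retention: Duration, keep_min: usize) -> Vec<String> {
    let mut newest_first: Vec<&Backup> = backups.iter().collect();
    newest_first.sort_by_key(|b| b.age);
    newest_first
        .into_iter()
        .skip(keep_min)
        .filter(|b| b.age > retention)
        .map(|b| b.name.clone())
        .collect()
}
```

Keeping a floor of recent backups guards against a stalled backup job: if no new backups are being produced, an age-only policy would eventually delete every remaining copy.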
Monitoring and Observability
Metrics Collection
- Application metrics exported to Prometheus
- Business metrics tracked
- Infrastructure metrics monitored
- Custom dashboards created for key metrics
- SLI/SLO defined and monitored
// Metrics validation
pub struct MetricsValidator {
    prometheus_client: PrometheusClient,
}

impl MetricsValidator {
    pub async fn validate_metrics(&self) -> Result<MetricsValidationResult, ValidationError> {
        let mut result = MetricsValidationResult::default();

        // Check core application metrics
        result.core_metrics = self.check_core_metrics().await?;

        // Check business metrics
        result.business_metrics = self.check_business_metrics().await?;

        // Check infrastructure metrics
        result.infrastructure_metrics = self.check_infrastructure_metrics().await?;

        // Verify metric freshness
        result.metrics_current = self.check_metrics_freshness().await?;

        result.overall_passed = result.core_metrics
            && result.business_metrics
            && result.infrastructure_metrics
            && result.metrics_current;

        Ok(result)
    }

    async fn check_core_metrics(&self) -> Result<bool, ValidationError> {
        let required_metrics = vec![
            "eventcore_commands_total",
            "eventcore_command_duration_seconds",
            "eventcore_events_written_total",
            "eventcore_active_streams",
            "eventcore_projection_lag_seconds",
        ];

        for metric in required_metrics {
            if !self.prometheus_client.metric_exists(metric).await? {
                return Ok(false);
            }
        }

        Ok(true)
    }
}
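For the "SLI/SLO defined and monitored" item, it can help to turn an SLO target into a concrete error budget. The sketch below shows the arithmetic for an availability SLO. The function names are illustrative assumptions, and the request counts would come from metrics such as `eventcore_commands_total`.

```rust
/// Allowed downtime in minutes for an availability SLO over a window.
/// `slo_target` is a fraction, e.g. 0.999 for "three nines".
pub fn error_budget_minutes(slo_target: f64, window_days: u32) -> f64 {
    let window_minutes = f64::from(window_days) * 24.0 * 60.0;
    (1.0 - slo_target) * window_minutes
}

/// Fraction of the error budget consumed so far, based on request counts.
/// Returns `None` when there is no traffic or the target leaves no budget.
pub fn budget_consumed(total_requests: u64, failed_requests: u64, slo_target: f64) -> Option<f64> {
    if total_requests == 0 || slo_target >= 1.0 {
        return None;
    }
    let allowed_failures = (1.0 - slo_target) * total_requests as f64;
    Some(failed_requests as f64 / allowed_failures)
}
```

A 99.9% target over 30 days leaves about 43.2 minutes of budget. Out of 10,000 requests, 5 failures would consume roughly half of the budget for that target.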
Logging
- Structured logging implemented
- Log aggregation configured
- Log retention policies set
- Correlation IDs used throughout
- Log levels appropriately configured
- Sensitive data excluded from logs
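Two of the items above, correlation IDs and excluding sensitive data, can be enforced at the point where log lines are built. The following is a minimal sketch using only the standard library. The key names in `SENSITIVE_KEYS` and the `format_log_line` function are assumptions for illustration, and a production system would typically do this through its logging framework.

```rust
/// Hypothetical list of key fragments whose values must never be logged.
const SENSITIVE_KEYS: [&str; 4] = ["password", "token", "secret", "authorization"];

/// Builds a key=value log line that always carries the correlation ID
/// and masks the value of any field whose key looks sensitive.
pub fn format_log_line(correlation_id: &str, fields: &[(&str, &str)]) -> String {
    let mut line = format!("correlation_id={}", correlation_id);
    for &(key, value) in fields {
        let lowered = key.to_ascii_lowercase();
        let is_sensitive = SENSITIVE_KEYS.iter().any(|s| lowered.contains(*s));
        let shown = if is_sensitive { "[REDACTED]" } else { value };
        line.push_str(&format!(" {}={}", key, shown));
    }
    line
}
```

Matching on key fragments rather than exact names means a field such as `auth_token` is masked as well, at the cost of occasionally redacting a harmless field.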
Alerting
- Critical alerts configured
- Warning alerts tuned to reduce noise
- Alert routing configured for different severities
- Escalation policies defined
- Alert fatigue minimized through proper thresholds
# Alerting validation checklist
groups:
  - name: eventcore-critical
    rules:
      - alert: EventCoreDown
        expr: up{job="eventcore"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "EventCore service is down"
      - alert: HighErrorRate
        # Aggregate before dividing so the ratio covers all instances and label sets
        expr: sum(rate(eventcore_command_errors_total[5m])) / sum(rate(eventcore_commands_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: DatabaseConnectionFailure
        # The metric is a counter, so alert on recent growth rather than on its lifetime total
        expr: increase(eventcore_connection_pool_errors_total[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"
Deployment Checklist
Environment Configuration
- Environment variables properly set
- Secrets configured and mounted
- Config maps updated
- Feature flags configured appropriately
- Resource limits applied
- Network policies configured
Database Setup
- Database migrations applied and verified
- Database indexes created and optimized
- Database monitoring configured
- Connection pooling tuned
- Backup strategy implemented
- Read replicas configured if needed
Infrastructure
- DNS records configured
- Load balancer configured
- SSL certificates installed and valid
- CDN configured if applicable
- Firewall rules applied
- Network segmentation implemented
Post-Deployment Verification
Functional Testing
- Smoke tests pass
- Critical user journeys work
- API endpoints respond correctly
- Authentication works
- Authorization enforced
- Error handling works properly
// Post-deployment validation suite
pub struct PostDeploymentValidator {
    base_url: String,
    auth_token: String,
}

impl PostDeploymentValidator {
    // test_basic_performance and test_error_handling are not shown here
    pub async fn run_validation_suite(&self) -> Result<ValidationSuite, ValidationError> {
        let mut suite = ValidationSuite::default();

        // Test 1: Health check
        suite.health_check = self.test_health_endpoint().await?;

        // Test 2: Authentication
        suite.authentication = self.test_authentication().await?;

        // Test 3: Core functionality
        suite.core_functionality = self.test_core_functionality().await?;

        // Test 4: Performance
        suite.performance = self.test_basic_performance().await?;

        // Test 5: Error handling
        suite.error_handling = self.test_error_handling().await?;

        suite.overall_passed = suite.health_check
            && suite.authentication
            && suite.core_functionality
            && suite.performance
            && suite.error_handling;

        Ok(suite)
    }

    async fn test_health_endpoint(&self) -> Result<bool, ValidationError> {
        let response = reqwest::get(&format!("{}/health", self.base_url)).await?;
        Ok(response.status().is_success())
    }

    async fn test_authentication(&self) -> Result<bool, ValidationError> {
        // Test with valid token
        let client = reqwest::Client::new();
        let response = client
            .get(&format!("{}/api/v1/test", self.base_url))
            .header("Authorization", format!("Bearer {}", self.auth_token))
            .send()
            .await?;

        if !response.status().is_success() {
            return Ok(false);
        }

        // Test without token (should fail with 401 Unauthorized)
        let response = client
            .get(&format!("{}/api/v1/test", self.base_url))
            .send()
            .await?;

        Ok(response.status() == reqwest::StatusCode::UNAUTHORIZED)
    }

    async fn test_core_functionality(&self) -> Result<bool, ValidationError> {
        // Test a simple command execution
        let client = reqwest::Client::new();
        let create_user_payload = serde_json::json!({
            "email": "test@example.com",
            "first_name": "Test",
            "last_name": "User"
        });

        let response = client
            .post(&format!("{}/api/v1/users", self.base_url))
            .header("Authorization", format!("Bearer {}", self.auth_token))
            .json(&create_user_payload)
            .send()
            .await?;

        Ok(response.status().is_success())
    }
}

#[derive(Debug, Default)]
pub struct ValidationSuite {
    pub health_check: bool,
    pub authentication: bool,
    pub core_functionality: bool,
    pub performance: bool,
    pub error_handling: bool,
    pub overall_passed: bool,
}
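Right after a rollout, an endpoint may fail briefly before it becomes healthy, so a single failed probe is not always conclusive. The sketch below shows a generic retry helper with exponential backoff that a check like the health probe could be wrapped in. It is synchronous and uses `std::thread::sleep` to stay self-contained. In the async validator above, the runtime's own sleep would be used instead. The name `retry_with_backoff` is an assumption, not part of EventCore.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` until it succeeds or `max_attempts` is reached, doubling the
/// delay after each failure. At least one attempt is always made.
/// `op` receives the 1-based attempt number.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

Bounding the number of attempts matters here: a validation suite that retries forever would hide a genuinely broken deployment instead of failing the pipeline.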
Performance Validation
- Response times within acceptable limits
- Throughput meets requirements
- Resource usage within limits
- Memory leaks not detected
- CPU usage stable
- Database performance optimal
Monitoring Validation
- Metrics flowing to monitoring system
- Logs being collected and indexed
- Traces visible in tracing system
- Alerts triggering appropriately
- Dashboards showing correct data
- SLI/SLO monitoring active
Ongoing Operations Checklist
Daily Checks
- System health green across all services
- Error rates within acceptable thresholds
- Performance metrics meeting SLOs
- Resource utilization not approaching limits
- Log analysis for new error patterns
- Security alerts reviewed
Weekly Checks
- Backup verification completed successfully
- Performance trends analyzed
- Capacity planning reviewed
- Security patches evaluated and applied
- Dependency updates reviewed
- Documentation updated as needed
Monthly Checks
- Disaster recovery procedures tested
- Security audit completed
- Performance benchmarks updated
- Cost optimization opportunities identified
- Capacity forecasting updated
- Runbook accuracy verified
Automation Scripts
Deployment Validation Script
#!/bin/bash
# deployment-validation.sh
set -e
NAMESPACE="eventcore"
APP_NAME="eventcore-app"
BASE_URL="https://api.eventcore.example.com"
echo "🚀 Starting deployment validation..."
# Check deployment status
echo "📋 Checking deployment status..."
kubectl rollout status "deployment/$APP_NAME" -n "$NAMESPACE" --timeout=300s
# Check pod health
echo "🏥 Checking pod health..."
# readyReplicas counts pods that pass their readiness probe;
# a pod in the Running phase is not necessarily ready
READY_PODS=$(kubectl get deployment "$APP_NAME" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}')
READY_PODS=${READY_PODS:-0}
DESIRED_PODS=$(kubectl get deployment "$APP_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
if [ "$READY_PODS" -ne "$DESIRED_PODS" ]; then
  echo "❌ Not all pods are ready: $READY_PODS/$DESIRED_PODS"
  exit 1
fi
echo "✅ All pods are ready: $READY_PODS/$DESIRED_PODS"
# Check health endpoint
echo "🔍 Testing health endpoint..."
# "|| true" keeps set -e from aborting on a connection error; curl then reports status 000
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/health" || true)
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "❌ Health check failed with status: $HTTP_STATUS"
  exit 1
fi
echo "✅ Health check passed"
# Check metrics endpoint
echo "📊 Testing metrics endpoint..."
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_URL/metrics" || true)
if [ "$HTTP_STATUS" -ne 200 ]; then
  echo "❌ Metrics endpoint failed with status: $HTTP_STATUS"
  exit 1
fi
echo "✅ Metrics endpoint responding"
# Check database connectivity
# (run the command inside "if" so set -e does not exit before the error message is printed)
echo "🗄️ Testing database connectivity..."
if ! kubectl exec -n "$NAMESPACE" "deployment/$APP_NAME" -- eventcore-cli health-check database; then
  echo "❌ Database connectivity check failed"
  exit 1
fi
echo "✅ Database connectivity verified"
# Run smoke tests
echo "💨 Running smoke tests..."
if ! kubectl exec -n "$NAMESPACE" "deployment/$APP_NAME" -- eventcore-cli test smoke; then
  echo "❌ Smoke tests failed"
  exit 1
fi
echo "✅ Smoke tests passed"
echo "🎉 Deployment validation completed successfully!"
Health Check Script
#!/bin/bash
# health-check.sh
set -e
NAMESPACE="eventcore"
PROMETHEUS_URL="http://prometheus.monitoring.svc.cluster.local:9090"
echo "🔍 Running comprehensive health check..."
# Run an instant query against Prometheus and print the first result's value.
# -G with --data-urlencode sends the PromQL expression as an encoded query parameter,
# so braces, brackets, quotes, and spaces are not mangled by curl or the shell.
prom_query() {
  curl -sG "$PROMETHEUS_URL/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1]'
}
# Check application health
echo "📱 Checking application health..."
APP_UP=$(prom_query 'up{job="eventcore"}')
if [ "$APP_UP" != "1" ]; then
  echo "❌ Application is down"
  exit 1
fi
# Check error rate
echo "🚨 Checking error rate..."
ERROR_RATE=$(prom_query 'sum(rate(eventcore_command_errors_total[5m])) / sum(rate(eventcore_commands_total[5m]))')
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
  echo "❌ High error rate detected: $ERROR_RATE"
  exit 1
fi
# Check response time
echo "⏱️ Checking response time..."
P95_LATENCY=$(prom_query 'histogram_quantile(0.95, sum by (le) (rate(eventcore_command_duration_seconds_bucket[5m])))')
if (( $(echo "$P95_LATENCY > 1.0" | bc -l) )); then
  echo "❌ High latency detected: ${P95_LATENCY}s"
  exit 1
fi
# Check database connection pool utilization
echo "🗄️ Checking database health..."
DB_CONNECTIONS=$(prom_query 'eventcore_connection_pool_size')
MAX_CONNECTIONS=$(prom_query 'eventcore_connection_pool_max_size')
UTILIZATION=$(echo "scale=2; $DB_CONNECTIONS / $MAX_CONNECTIONS" | bc)
if (( $(echo "$UTILIZATION > 0.8" | bc -l) )); then
  echo "⚠️ High database connection utilization: $UTILIZATION"
fi
echo "✅ All health checks passed!"
Emergency Procedures
Incident Response
- Assess severity using incident severity matrix
- Activate incident response team if critical
- Create incident tracking (ticket/channel)
- Implement immediate mitigation if possible
- Communicate status to stakeholders
- Investigate root cause after mitigation
- Document lessons learned and improvements
Rollback Procedures
#!/bin/bash
# emergency-rollback.sh
set -e
NAMESPACE="eventcore"
APP_NAME="eventcore-app"
echo "🚨 Emergency rollback initiated..."
# Record the current revision from the Deployment's revision annotation
CURRENT_REVISION=$(kubectl get deployment "$APP_NAME" -n "$NAMESPACE" -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
echo "Rolling back from revision $CURRENT_REVISION to the previous revision"
# Perform rollback. Without --to-revision, kubectl rolls back to the previous revision,
# which avoids assuming that revision numbers are contiguous.
kubectl rollout undo "deployment/$APP_NAME" -n "$NAMESPACE"
# Wait for rollback to complete
kubectl rollout status deployment/$APP_NAME -n $NAMESPACE --timeout=300s
# Verify health
sleep 30
./health-check.sh
echo "✅ Emergency rollback completed"
Summary
Production readiness checklist for EventCore:
- ✅ Security - Authentication, authorization, encryption
- ✅ Performance - Resource limits, optimization, benchmarks
- ✅ Reliability - High availability, backup and recovery
- ✅ Monitoring - Metrics, logging, alerting, dashboards
- ✅ Operations - Deployment validation, health checks, incident response
Key principles:
- Validate everything - Don’t assume anything works in production
- Automate checks - Use scripts and tools for consistent validation
- Monitor continuously - Track all critical metrics and logs
- Plan for failure - Have rollback and recovery procedures ready
- Document procedures - Maintain up-to-date runbooks and checklists
This completes the EventCore Operations guide. You now have comprehensive documentation for deploying, monitoring, and maintaining EventCore applications in production environments.
Next, proceed to Part 7: Reference →