Critical Infrastructure

Disaster Recovery & Redundancy Plan

Infrastructure resilience strategy for mission-critical agentic AI systems. Transform a single point of failure into enterprise-grade reliability.

Version 1.0, March 2026 | RTO: < 15 min | ~$35/month | Tailscale VPN

Executive Summary

Current infrastructure relies on a single Mac mini with residential internet, creating unacceptable risk for mission-critical operations. This plan implements active-passive redundancy with automatic failover, geographic distribution, and encrypted traffic routing.

Current Risk: HIGH (single point of failure)
Recovery Time: < 15 min (automatic failover)
Data Loss: < 5 min (real-time sync)
Monthly Cost: ~$35 (VPS + storage)

1. Architecture Overview

Primary-secondary topology with automatic DNS failover. All traffic encrypted via VPN mesh.

┌───────────────────────────────────────────────────────────┐
│                    DNS / LOAD BALANCER                    │
│                  (Cloudflare / Route 53)                  │
└───────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
    ┌─────────▼──────────┐          ┌─────────▼──────────┐
    │      PRIMARY       │          │     SECONDARY      │
    │     (Mac mini)     │◄────────►│    (VPS/Cloud)     │
    │   Home / Office    │   Sync   │  AWS/DigitalOcean  │
    │                    │          │                    │
    │ - OpenClaw         │          │ - OpenClaw         │
    │ - Local files      │          │ - Replicated data  │
    │ - Telegram/Slack   │          │ - VPN endpoint     │
    └────────────────────┘          └────────────────────┘
              │                               │
              │       ┌───────────────┐       │
              └──────►│    MONITOR    │◄──────┘
                      │   (Uptime/    │
                      │   Heartbeat)  │
                      └───────────────┘
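
The DNS failover step in the diagram could be driven by a small script on the monitor. The following is a minimal sketch, assuming Cloudflare's v4 API is used to repoint an A record at the secondary. The environment variable names (CF_API_TOKEN, CF_ZONE_ID, CF_RECORD_ID), the hostname, and the function names are hypothetical and not part of the original plan.

```shell
#!/bin/sh
# Sketch only: repoint an A record at the secondary node via the Cloudflare API.
# CF_API_TOKEN, CF_ZONE_ID and CF_RECORD_ID are assumed environment variables.

# Build the JSON body for the record update (pure function, no network access).
dns_payload() (
    name=$1; ip=$2
    printf '{"type":"A","name":"%s","content":"%s","ttl":60,"proxied":false}\n' "$name" "$ip"
)

# Send the update. Not invoked here; call it from the failover handler.
update_dns() (
    name=$1; ip=$2
    curl -fsS -X PUT \
        "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${CF_RECORD_ID}" \
        -H "Authorization: Bearer ${CF_API_TOKEN}" \
        -H "Content-Type: application/json" \
        --data "$(dns_payload "$name" "$ip")"
)
```

A short TTL (60 seconds here) keeps the DNS propagation delay small relative to the 15-minute recovery target. Route 53 would need a different call.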

1.1 Cloud Provider Comparison

Provider        Specs             Cost/mo   Best For
DigitalOcean    4GB RAM / 2vCPU   $24       Balance of cost/performance
AWS Lightsail   4GB RAM / 2vCPU   $20       Enterprise integration
Linode          4GB Linode        $24       Simple, reliable
Hetzner         CPX21             €8.20     Cost-conscious, EU privacy

Recommendation: DigitalOcean (NYC3), the best balance for US-based operations.

2. VPN & Traffic Routing

A VPN is critical for protecting API keys in transit, preventing ISP snooping, and securing communication with the cloud node.

2.1 VPN Options

Option A: Tailscale (Recommended)

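# Install on both systems
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate with a pre-generated auth key (needs root on Linux)
tailscale up --authkey tskey-auth-...

# Tailscale assigns each node a stable address from 100.64.0.0/10;
# the addresses below are the examples used in this plan
# Primary: 100.64.1.1
# Secondary: 100.64.1.2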

Pros: Zero-config, NAT traversal, free personal use
Cons: Dependency on Tailscale infrastructure
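
Before pinning node addresses in configs, it can help to confirm that an address really is a Tailscale address. The sketch below checks membership in 100.64.0.0/10 (100.64.0.0 through 100.127.255.255), the range Tailscale allocates from. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: check that an IPv4 address lies in 100.64.0.0/10, the CGNAT range
# Tailscale draws node addresses from (100.64.0.0 through 100.127.255.255).
is_tailscale_ip() (
    ip=$1
    # Reject anything that is not dot-separated decimal fields.
    case $ip in
        *[!0-9.]*|*..*|.*|*.) exit 1 ;;
    esac
    old_ifs=$IFS; IFS=.
    set -- $ip
    IFS=$old_ifs
    [ $# -eq 4 ] || exit 1
    for octet in "$@"; do
        [ "$octet" -le 255 ] || exit 1
    done
    # First octet must be 100, second octet between 64 and 127.
    [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]
)
```

On a node with Tailscale running, a typical call would be `is_tailscale_ip "$(tailscale ip -4)"`.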

Option B: WireGuard (Self-hosted)

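# Generate a key pair (umask keeps the private key unreadable by others)
umask 077
wg genkey | tee privatekey | wg pubkey > publickey

# /etc/wireguard/wg0.conf (primary)
[Interface]
PrivateKey = <primary-private-key>
Address = 10.200.200.1/24
ListenPort = 51820

[Peer]
# Secondary node; the address below is an assumed example
PublicKey = <secondary-public-key>
AllowedIPs = 10.200.200.2/32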

Pros: Fully self-hosted
Cons: Requires static IP or DDNS

Option C: Headscale (Self-hosted Tailscale)

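Headscale is an open-source, self-hosted implementation of the Tailscale coordination server. It keeps the Tailscale client experience while removing the dependency on Tailscale's hosted control plane, at the cost of running that server yourself.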

3. Risk Profiles for Expansion

Integration       Risk Level   Mitigation Required                Recommendation
Social Accounts   HIGH         Vault, MFA, IP whitelist           Proceed with caution
Claude Code       MEDIUM       Containerized, limited scope       Acceptable with controls
Banking/Finance   CRITICAL     Dedicated VM, hardware token       Isolate completely
Email (Gmail)     HIGH         OAuth, app-specific passwords      Use gog with restrictions
Git Repos         MEDIUM       Deploy keys, not personal tokens   Acceptable

3.1 Banking/Finance: Complete Isolation

Never grant OpenClaw any of the following: trading access, bank transfers, crypto wallet keys, or payment processor APIs.

┌─────────────────────────────────────────┐
│  FINANCE VM (Isolated)                  │
│  - No internet except bank APIs         │
│  - No OpenClaw integration              │
│  - Hardware token required              │
│  - Read-only reporting only             │
└─────────────────────────────────────────┘
                     │
                     │ (Monthly manual sync)
                     ▼
┌─────────────────────────────────────────┐
│  OpenClaw (Standard operations)         │
│  - Can READ finance reports             │
│  - Cannot initiate transactions         │
│  - No access to finance VM credentials  │
└─────────────────────────────────────────┘

4. Implementation Roadmap

  1. Foundation (Week 1): Provision VPS, install OpenClaw, set up Tailscale VPN, configure basic file sync
  2. Automation (Week 2): Implement health checks, configure DNS failover, set up monitoring, test failover scenario
  3. Security Hardening (Week 3): Enable VPN-only API access, rotate all keys, configure social account vault, implement approval workflows
  4. Documentation (Week 4): Document failover procedures, create runbooks, train on manual recovery, schedule quarterly DR drills
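
The Week 1 file sync and the data-loss target of under 5 minutes can be tied together with a one-way sync plus a freshness check. This is a minimal sketch assuming rsync over the VPN. The data directory name and the function names are hypothetical, and 100.64.1.2 is the example secondary address from the Tailscale section.

```shell
#!/bin/sh
# Sketch: one-way sync from primary to secondary plus an RPO check.
# The data directory and secondary address are hypothetical.
RPO_SECONDS=300   # the plan's data-loss target of under 5 minutes

# Push local state to the secondary over the VPN (not invoked here).
sync_to_secondary() {
    rsync -az --delete "$HOME/openclaw-data/" "100.64.1.2:openclaw-data/"
}

# Succeed if the last successful sync is recent enough to meet the RPO.
# Arguments: last sync time and current time, both as Unix timestamps.
sync_within_rpo() (
    last=$1; now=$2
    [ $((now - last)) -lt "$RPO_SECONDS" ]
)
```

Running the sync from cron every few minutes and alerting when `sync_within_rpo` fails is one way to notice replication lag before a failover is needed.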

5. Cost Analysis

Monthly Operating Costs

Component         Cost
VPS (4GB)         $24
Tailscale         Free
Cloudflare        Free
Storage (100GB)   $6
Monitoring        $5
Total             ~$35/mo

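ROI: At ~$35/month, the plan breaks even if it prevents one day of downtime every 14 to 57 months, which implies that a day of downtime costs roughly $490 to $2,000. Under that assumption, the first prevented outage pays for the plan.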
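
The totals and break-even figures can be checked with shell arithmetic. This is a minimal sketch using the component costs from the table above; the variable and function names are illustrative.

```shell
#!/bin/sh
# Sketch: reproduce the monthly total and the implied break-even figures.
# Component costs in whole dollars, taken from the cost table.
vps=24; tailscale=0; cloudflare=0; storage=6; monitoring=5

monthly_total() {
    echo $((vps + tailscale + cloudflare + storage + monitoring))
}

# Spend accumulated over N months; one prevented outage must be worth
# at least this much for the plan to break even over that period.
break_even_value() {
    echo $(( $1 * $(monthly_total) ))
}

monthly_total          # prints 35
break_even_value 14    # prints 490
break_even_value 57    # prints 1995
```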

6. Emergency Procedures

Scenario A: Primary Hardware Failure

  1. Detection: Health check fails 3x (3 minutes)
  2. Automatic: DNS switches to secondary
  3. Manual: Verify secondary handling traffic
  4. Recovery: Repair/replace primary
  5. Restore: Sync data back to new primary
  6. Failback: Update DNS, verify

Time to recovery: under 15 minutes for automatic failover, plus about 2 hours for full restoration of the primary
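
The detection step above (three failed checks before failover) can be sketched as a consecutive-failure counter. This is an illustration under assumptions: the `/health` path and the function names are hypothetical, and 100.64.1.1 is the example primary address from the Tailscale section.

```shell
#!/bin/sh
# Sketch: decide when to fail over based on consecutive failed health checks.
FAIL_THRESHOLD=3   # from the plan: three failed checks trigger failover

# Print the updated consecutive-failure count given the previous count
# and the exit status of the latest check (0 = healthy).
next_fail_count() (
    prev=$1; status=$2
    if [ "$status" -eq 0 ]; then
        echo 0
    else
        echo $((prev + 1))
    fi
)

# Succeed (exit 0) once the threshold is reached.
should_failover() {
    [ "$1" -ge "$FAIL_THRESHOLD" ]
}

# Hypothetical probe of the primary; the /health path is an assumption.
check_primary() {
    curl -fsS --max-time 10 "http://100.64.1.1/health" >/dev/null 2>&1
}
```

Resetting the counter on any healthy check means a single dropped probe on a residential connection does not trigger a failover. Run once a minute, three consecutive failures match the 3-minute detection window.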

Scenario B: Internet Outage (Primary Location)

  1. Failover to secondary (already on cloud)
  2. Access via mobile hotspot for urgent tasks
  3. Wait for ISP restoration
  4. Reconcile divergent changes

Time to recovery: 5 minutes

Scenario C: Complete Data Loss

  1. Restore from backup (S3/Backblaze)
  2. Decrypt using offline recovery key
  3. Checksum validation
  4. Restart OpenClaw services
  5. Full functionality verification

Time to recovery: 2-4 hours
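
The checksum validation step in Scenario C can be sketched as a comparison against a SHA-256 digest recorded when the backup was taken. The plan does not name a hash algorithm, so SHA-256 is an assumption here, and the function names are illustrative.

```shell
#!/bin/sh
# Sketch: verify a restored backup against a recorded SHA-256 digest.

# Print the SHA-256 digest of a file, using whichever tool is installed
# (sha256sum on most Linux systems, shasum on macOS).
sha256_of() (
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
)

# Succeed only if the file exists and its digest matches the expected value.
verify_backup() (
    file=$1; expected=$2
    [ -f "$file" ] || exit 1
    [ "$(sha256_of "$file")" = "$expected" ]
)
```

Storing the expected digest separately from the backup itself (for example alongside the offline recovery key) keeps the check meaningful if the backup store is tampered with.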

7. Security Checklist

Pre-Deployment Verification

All secrets in vault (no plaintext in config)
VPN mandatory for all inter-node traffic
MFA enabled on all cloud accounts
Regular key rotation (quarterly)
Automated security scanning
Incident response plan documented
Offsite backups (3-2-1 rule)
Encryption at rest and in transit
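
The first checklist item (no plaintext secrets in config) can be partly automated with a pattern scan. This is a rough sketch: the two patterns (a Tailscale auth key prefix and a 44-character WireGuard private key) are illustrative assumptions rather than an exhaustive list, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: flag config files that appear to contain plaintext secrets.
# The patterns are illustrative assumptions, not an exhaustive list.
scan_plaintext_secrets() (
    dir=$1
    # -r: recurse, -l: list matching files only, -E: extended regex
    grep -rlE 'tskey-auth-[A-Za-z0-9]+|PrivateKey *= *[A-Za-z0-9+/]{43}=' "$dir"
)
```

The function prints the paths of suspicious files and exits non-zero when nothing matches, so it can gate a deployment script. Placeholder values such as template markers do not match.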

FAQ

Q: Should I use an EU or domestic VPS provider?

Short answer: Geographic diversity is the main advantage. Everything else is trade-offs.

When EU Hosting Wins

When EU Hosting Loses

Recommendation

Scenario                     Choice
You and primary both in US   DigitalOcean NYC → DigitalOcean SFO
US + EU customers            DigitalOcean NYC → Hetzner (Germany)
Privacy paranoid             Hetzner (German privacy laws)
Cost optimization            Hetzner (65% cheaper)

Bottom line: For US-based operations, same-provider coast-to-coast redundancy beats cross-continent complexity unless you specifically need EU presence.

Q: What happens if both primary and secondary fail?

This is the "what if the internet dies" scenario. Recovery depends on backup storage: restore from the offsite backup (S3/Backblaze) as described in Scenario C.

Q: Do I really need a VPN if both nodes are mine?

Yes. Three reasons:

  1. API keys in transit: Without VPN, credentials travel over public internet
  2. ISP snooping: Traffic patterns reveal operational intel
  3. Automatic encryption: Tailscale/WireGuard encrypts everything by default

Cost: $0 (Tailscale free tier). Risk without: Unknown but non-zero.

Q: Can I use this for client work or is it personal only?

This scales to client work with modifications:

Next Steps

  1. Review this plan: schedule a 30-minute discussion
  2. Select VPS provider: DigitalOcean recommended
  3. Provision secondary node: Week 1 goal
  4. Test failover: verify RTO/RPO targets
  5. Document lessons learned: update runbooks