🛡️ High Availability
High Availability & Cluster Deployment Guide
Production-grade multi-node deployments with automatic failover, load balancing, and zero-downtime maintenance. Covers bare-metal, Docker Compose, and Kubernetes topologies.
Version 26.0 · Last updated March 2026 · Estimated setup time: 2–4 hours
🔍 1. Overview & Architecture Options
MinusNow supports three progressive deployment topologies. This guide covers Topology A and Topology B, multi-node architectures in which the failure of any single app server causes no application downtime; Topology B extends the same guarantee to the database tier.
| Topology | Nodes | DB Failover | App Failover | Target Uptime | Use Case |
| Single-Node | 1 | ❌ None | ❌ None | ~99.5% | Dev / POC / Small teams |
| Two-Node (existing guide) | 2 | ⚠️ Manual | ❌ None | ~99.9% | Small production |
| Topology A — This Guide | 3 (2 App + 1 DB) | ❌ Single DB | ✅ Automatic | ~99.95% | App-tier HA; DB handled by backups |
| Topology B — This Guide | 4 (2 App + 2 DB) | ✅ Automatic | ✅ Automatic | ~99.99% | Full HA; enterprise production |
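The uptime targets above follow from standard series/parallel availability arithmetic: two load-balanced app nodes fail together only if both fail, while a single DB sits in series with the whole stack. A quick sketch (the per-node figures are illustrative assumptions, not measured values):

```shell
# two app nodes in parallel; one DB server in series (Topology A shape)
awk 'BEGIN {
  a_app = 0.995                   # assumed availability of one app node
  a_db  = 0.999                   # assumed availability of the single DB
  tier  = 1 - (1 - a_app) ^ 2     # parallel pair -> 0.999975
  total = tier * a_db             # series with the DB
  printf "app tier: %.6f  topology A: %.6f\n", tier, total
}'
```

Note how the single DB dominates the result, which is exactly why Topology B clusters the database tier as well.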
🏗️ 2. Topology A — 2 App Servers + 1 DB Server
The application tier is load-balanced across two servers. If one app server fails, the load balancer routes all traffic to the surviving node. The database is a single server (protected by regular backups and optional read replica).
┌─────────────────────┐
│ Load Balancer │
│ (Nginx / HAProxy) │
│ :80 / :443 (VIP) │
└─────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌───────▼─────┐
│ App Node 1 │ │ App Node 2 │
│ 10.0.1.11 │ │ 10.0.1.12 │
│ Node.js │ │ Node.js │
│ :5000 │ │ :5000 │
└──────┬──────┘ └───────┬─────┘
│ │
└────────────┬────────────┘
│
┌─────────▼──────────┐
│ DB Server │
│ 10.0.2.10 │
│ PostgreSQL :5432 │
│ (Primary) │
└────────────────────┘
✅ Advantages
- Zero downtime for app failures
- Rolling updates (drain → update → enable)
- Horizontal scale by adding more app nodes
- Simpler than full DB clustering
⚠️ Limitations
- DB is single point of failure
- DB failure requires restore from backup
- RPO depends on backup frequency
🎯 Best For
- Teams of 50–500 users
- Environments with managed DB (RDS/Cloud SQL)
- Budget-conscious HA requirements
🏗️ 3. Topology B — 2 App Servers + 2 DB Nodes
Full high availability across both tiers. The database uses PostgreSQL streaming replication with automatic failover via Patroni or manual promotion. No single point of failure.
┌─────────────────────┐
│ Load Balancer │
│ (Nginx / HAProxy) │
│ :80 / :443 (VIP) │
└─────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌───────▼─────┐
│ App Node 1 │ │ App Node 2 │
│ 10.0.1.11 │ │ 10.0.1.12 │
│ Node.js │ │ Node.js │
│ :5000 │ │ :5000 │
└──────┬──────┘ └───────┬─────┘
│ │
└────────────┬────────────┘
│
┌──────────────┴──────────────┐
│ │
┌───────▼────────┐ ┌──────────▼───────┐
│ DB Primary │ │ DB Replica │
│ 10.0.2.10 │ ──WAL──▶│ 10.0.2.11 │
│ PostgreSQL │ stream │ PostgreSQL │
│ :5432 (R/W) │ │ :5432 (Read-Only)│
└────────────────┘ └──────────────────┘
│ │
└──────── Patroni / ──────────┘
etcd (automatic
failover)
✅ Advantages
- No single point of failure
- Automatic DB failover (RPO ≈ 0)
- Read-replica offloads reporting queries
- Zero-downtime for any single node failure
⚠️ Considerations
- More complex to set up and maintain
- Requires Patroni + etcd (or managed DB HA)
- Network split-brain risk needs quorum
- 4 servers minimum
🎯 Best For
- Enterprises with 500+ users
- Regulated environments (SOC 2, ISO 27001)
- Zero-downtime SLA requirements
- Mission-critical ITSM workflows
💻 4. Hardware & Network Requirements
Server Specifications
| Role | CPU | RAM | Storage | OS | Qty (Topo A) | Qty (Topo B) |
| Load Balancer | 2 vCPU | 4 GB | 20 GB SSD | Ubuntu 24.04 / RHEL 9 | 1 | 1 |
| App Server | 8 vCPU | 32 GB | 100 GB SSD | Ubuntu 24.04 / RHEL 9 | 2 | 2 |
| DB Primary | 8 vCPU | 64 GB | 500 GB NVMe | Ubuntu 24.04 / RHEL 9 | 1 | 1 |
| DB Replica | 8 vCPU | 64 GB | 500 GB NVMe | Ubuntu 24.04 / RHEL 9 | — | 1 |
Tip: The load balancer can run on one of the app servers to reduce node count. To keep the LB itself from becoming a single point of failure, use two LB nodes with a keepalived VIP or a cloud-managed LB (ALB, Azure App Gateway).
Network Requirements
| Source | Destination | Port | Protocol | Purpose |
| Internet / VIP | Load Balancer | 80, 443 | TCP | HTTP/HTTPS ingress |
| Load Balancer | App Node 1 & 2 | 5000 | TCP | Upstream app traffic |
| App Node 1 & 2 | DB Primary | 5432 | TCP | Database connections |
| DB Primary | DB Replica | 5432 | TCP | WAL streaming replication |
| All nodes | All nodes | 2379–2380 | TCP | etcd cluster (Patroni, Topo B only) |
| All nodes | All nodes | 8008 | TCP | Patroni REST API (Topo B only) |
Software Prerequisites (all nodes)
| Component | Version | App Nodes | DB Nodes | LB Node |
| Node.js | 22.x LTS | ✅ | — | — |
| PostgreSQL | 16+ | — | ✅ | — |
| Nginx or HAProxy | 1.24+ / 2.8+ | — | — | ✅ |
| PM2 | Latest | ✅ | — | — |
| Patroni | 3.x | — | ✅ (Topo B) | — |
| etcd | 3.5+ | — | ✅ (Topo B) | — |
⚖️ 5. Load Balancer Setup
The load balancer distributes traffic across app nodes and performs health checks to detect failures. Choose Nginx (simpler) or HAProxy (more HA features).
Option A: Nginx Load Balancer
# /etc/nginx/conf.d/minusnow-ha.conf
upstream minusnow_app {
# Least-connections balancing (best for long WebSocket sessions)
least_conn;
server 10.0.1.11:5000 max_fails=3 fail_timeout=30s;
server 10.0.1.12:5000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name minusnow.yourdomain.com;
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name minusnow.yourdomain.com;
ssl_certificate /etc/ssl/certs/minusnow.crt;
ssl_certificate_key /etc/ssl/private/minusnow.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Health check endpoint
location /api/health {
proxy_pass http://minusnow_app;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
}
# WebSocket support (for live updates)
location /ws {
proxy_pass http://minusnow_app;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 86400;
}
# Main application
location / {
proxy_pass http://minusnow_app;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 10s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
}
}
Option B: HAProxy Load Balancer
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5s
timeout client 60s
timeout server 60s
timeout tunnel 86400s # WebSocket keepalive
frontend https_front
bind *:443 ssl crt /etc/ssl/certs/minusnow.pem
default_backend minusnow_app
# Stats page (restrict to internal)
acl internal src 10.0.0.0/8
use_backend stats_backend if internal { path_beg /haproxy-stats }
backend minusnow_app
balance leastconn
option httpchk GET /api/health
http-check expect status 200
server app1 10.0.1.11:5000 check inter 10s fall 3 rise 2
server app2 10.0.1.12:5000 check inter 10s fall 3 rise 2
backend stats_backend
stats enable
stats uri /haproxy-stats
stats refresh 10s
Enable & Test
# Nginx
sudo nginx -t
sudo systemctl enable --now nginx
# HAProxy
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl enable --now haproxy
# Verify from the LB server
curl -k https://localhost/api/health
# Expected: {"status":"ok","timestamp":"..."}
⚙️ 6. Application Cluster Setup
Repeat these steps on both app servers (10.0.1.11 and 10.0.1.12).
Step 6a: Install Node.js & Deploy Application
# Install Node.js 22.x LTS
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install PM2 globally (process manager used in Step 6c)
sudo npm install -g pm2
# Create application user
sudo useradd -r -m -s /bin/bash minusnow
# Deploy application
sudo mkdir -p /opt/minusnow
sudo chown minusnow:minusnow /opt/minusnow
cd /opt/minusnow
# Copy application package (from your build server or artifact store)
sudo -u minusnow tar xzf /tmp/minusnow-latest.tar.gz -C /opt/minusnow/
# Install dependencies & build
sudo -u minusnow npm ci --production
sudo -u minusnow npm run build
Step 6b: Configure Environment Variables
# /opt/minusnow/.env (same on both app nodes)
NODE_ENV=production
PORT=5000
# Point to DB Primary (Topology A) or Patroni VIP (Topology B)
DATABASE_URL=postgresql://minusnow:<password>@10.0.2.10:5432/minusnow_prod
# Session secret (MUST be identical on all app nodes for sticky sessions)
SESSION_SECRET=<same-secret-on-all-app-nodes>
# Application URL (the LB address)
APP_BASE_URL=https://minusnow.yourdomain.com
# Agent API key
AGENT_API_KEY=<from-secrets-manager>
Critical: The SESSION_SECRET must be identical on all app nodes. If different secrets are used, users will be logged out when their request hits the other node.
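One way to satisfy this is to generate the secret once and distribute the identical value to every node. A sketch (the distribution loop is commented out and uses the node IPs from this guide; adapt to your own provisioning tooling):

```shell
# generate one 64-hex-char secret on a single machine
SECRET=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "generated ${#SECRET} hex chars"
# ...then push the SAME value to every app node, e.g.:
# for h in 10.0.1.11 10.0.1.12; do
#   ssh "$h" "echo SESSION_SECRET=${SECRET} | sudo tee -a /opt/minusnow/.env"
# done
```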
Step 6c: PM2 Cluster Mode
# /opt/minusnow/ecosystem.config.cjs
module.exports = {
apps: [{
name: "minusnow",
script: "dist/index.js",
instances: "max", // Use all available CPU cores
exec_mode: "cluster", // PM2 cluster mode for multi-core
env: {
NODE_ENV: "production",
PORT: 5000,
},
// Graceful restart
kill_timeout: 5000,
listen_timeout: 10000,
// Auto-restart on crash
max_restarts: 10,
restart_delay: 4000,
// Log rotation
log_date_format: "YYYY-MM-DD HH:mm:ss",
error_file: "/var/log/minusnow/error.log",
out_file: "/var/log/minusnow/output.log",
}]
};
# Start application with PM2
sudo -u minusnow pm2 start ecosystem.config.cjs
sudo -u minusnow pm2 save
# Enable PM2 startup on boot
sudo env PATH=$PATH:/usr/bin pm2 startup systemd -u minusnow --hp /home/minusnow
Step 6d: Verify Both App Nodes
# From the LB server, test each node directly:
curl http://10.0.1.11:5000/api/health
# {"status":"ok","node":"app1","uptime":...}
curl http://10.0.1.12:5000/api/health
# {"status":"ok","node":"app2","uptime":...}
# Test through the load balancer:
curl -k https://minusnow.yourdomain.com/api/health
🗄️ 7. PostgreSQL Streaming Replication (Topology B)
Set up WAL-based streaming replication between the Primary (10.0.2.10) and Replica (10.0.2.11). This enables near-zero data loss failover.
Step 7a: Configure Primary (10.0.2.10)
# /etc/postgresql/16/main/postgresql.conf (Ubuntu)
# or /var/lib/pgsql/16/data/postgresql.conf (RHEL)
# --- Replication Settings ---
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
max_replication_slots = 5
synchronous_commit = on # or 'remote_apply' for strong consistency
synchronous_standby_names = 'replica1' # name of the standby
# --- Connection Settings ---
listen_addresses = '*'
Step 7b: Create Replication User on Primary
# On the Primary DB server
sudo -u postgres psql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '<strong-password>';
# Add to pg_hba.conf:
# host replication replicator 10.0.2.11/32 scram-sha-256
sudo systemctl restart postgresql
Step 7c: Initialize Replica (10.0.2.11)
# Stop PostgreSQL on Replica
sudo systemctl stop postgresql
# Remove existing data directory
sudo rm -rf /var/lib/postgresql/16/main/*
# Create base backup from Primary
sudo -u postgres pg_basebackup \
-h 10.0.2.10 \
-U replicator \
-D /var/lib/postgresql/16/main \
  -Fp -Xs -P -R
# The -R flag auto-creates standby.signal and sets primary_conninfo.
# pg_basebackup prompts for the replicator password; a ~postgres/.pgpass
# entry on the replica avoids the interactive prompt.
Step 7d: Configure Replica
# /var/lib/postgresql/16/main/postgresql.auto.conf (auto-created by -R)
# Verify it contains:
primary_conninfo = 'host=10.0.2.10 port=5432 user=replicator password=<password> application_name=replica1'
# Start PostgreSQL on Replica
sudo systemctl start postgresql
Step 7e: Verify Replication
# On Primary — check replication status:
sudo -u postgres psql -c "SELECT client_addr, state, sync_state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# client_addr | state | sync_state | sent_lsn | replay_lsn
# --------------+-----------+------------+--------------+------------
# 10.0.2.11 | streaming | sync | 0/3000060 | 0/3000060
# On Replica — confirm it's in recovery mode:
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# pg_is_in_recovery
# -------------------
# t
Replication Active: When state = streaming and sent_lsn ≈ replay_lsn, the replica is fully synchronized with the primary.
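The lag comparison is plain byte arithmetic: an LSN like 0/3000060 is `<hi>/<lo>` in hex, where the high half counts 4 GiB segments, and `pg_wal_lsn_diff` is the difference of the two absolute positions. A sketch of the math:

```shell
# convert an LSN like 0/3000060 to an absolute byte offset
lsn_to_bytes() {
  hi=${1%/*}; lo=${1#*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))   # hi * 2^32 + lo
}
sent=$(lsn_to_bytes 0/3000060)
replay=$(lsn_to_bytes 0/3000000)
echo "lag: $((sent - replay)) bytes"       # 0x60 = 96 bytes behind
```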
🔄 8. Automatic Failover with Patroni
Patroni automates PostgreSQL HA — it monitors the primary, promotes the replica, and updates connection routing. It uses etcd for consensus.
Step 8a: Install etcd (on both DB nodes)
sudo apt install -y etcd-server etcd-client   # packaged as plain "etcd" on older Ubuntu/Debian releases
# /etc/default/etcd (Node 10.0.2.10)
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://10.0.2.10:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.0.2.10:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.2.10:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.0.2.10:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.0.2.10:2380,etcd2=http://10.0.2.11:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
# Similar on 10.0.2.11 with ETCD_NAME="etcd2" and appropriate IPs.
# Note: a 2-member etcd cluster has no failure tolerance (quorum is still 2),
# so add a third member (e.g. on the LB node) for real automatic failover.
sudo systemctl enable --now etcd
Step 8b: Install Patroni (on both DB nodes)
sudo apt install -y python3-pip python3-psycopg2
sudo pip3 install 'patroni[etcd3]'   # quoted so the shell doesn't glob; etcd3 matches the DCS section below
Step 8c: Patroni Configuration
# /etc/patroni/patroni.yml (Primary: 10.0.2.10)
scope: minusnow-cluster
name: pg-node1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.2.10:8008
etcd3:
hosts: 10.0.2.10:2379,10.0.2.11:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1 MB max lag for failover
synchronous_mode: true
postgresql:
use_pg_rewind: true
parameters:
max_connections: 200
wal_level: replica
max_wal_senders: 5
max_replication_slots: 5
synchronous_commit: "on"
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.2.10:5432
data_dir: /var/lib/postgresql/16/main
bin_dir: /usr/lib/postgresql/16/bin
authentication:
superuser:
username: postgres
password: <postgres-password>
replication:
username: replicator
password: <replicator-password>
parameters:
unix_socket_directories: '/var/run/postgresql'
# /etc/patroni/patroni.yml (Replica: 10.0.2.11) — same but:
name: pg-node2
restapi:
connect_address: 10.0.2.11:8008
postgresql:
connect_address: 10.0.2.11:5432
Step 8d: Start Patroni
# Create systemd service
sudo tee /etc/systemd/system/patroni.service <<'EOF'
[Unit]
Description=Patroni PostgreSQL HA
After=network.target etcd.service
[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now patroni
# Check cluster status
patronictl -c /etc/patroni/patroni.yml list
# +----------+-----------+---------+---------+----+-----------+
# | Member   | Host      | Role    | State   | TL | Lag in MB |
# +----------+-----------+---------+---------+----+-----------+
# | pg-node1 | 10.0.2.10 | Leader  | running |  1 |         0 |
# | pg-node2 | 10.0.2.11 | Replica | running |  1 |         0 |
# +----------+-----------+---------+---------+----+-----------+
Step 8e: HAProxy for DB Failover (on LB or App nodes)
Route application DB connections through HAProxy so the app always hits the current primary, regardless of which node Patroni promoted.
# Append to /etc/haproxy/haproxy.cfg (or separate file)
listen postgres_rw
bind *:6432
mode tcp
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-node1 10.0.2.10:5432 check port 8008
server pg-node2 10.0.2.11:5432 check port 8008
listen postgres_ro
bind *:6433
mode tcp
balance roundrobin
option httpchk GET /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2
server pg-node1 10.0.2.10:5432 check port 8008
server pg-node2 10.0.2.11:5432 check port 8008
# Update app .env to use HAProxy for DB:
DATABASE_URL=postgresql://minusnow:<password>@10.0.1.10:6432/minusnow_prod
How it works: Patroni exposes /primary and /replica health endpoints on port 8008. HAProxy checks these to route R/W traffic to the current leader and read traffic to replicas. On failover, Patroni promotes the replica — HAProxy detects the role change within seconds and reroutes traffic automatically.
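The routing decision reduces to a check of the two `/primary` status codes. A sketch with hard-coded codes standing in for what `curl -s -o /dev/null -w '%{http_code}' http://<node>:8008/primary` would return:

```shell
# pick the R/W backend the way HAProxy's health checks do:
# whichever node answers 200 on Patroni's /primary endpoint gets the traffic
pick_primary() {
  if   [ "$1" -eq 200 ]; then echo pg-node1
  elif [ "$2" -eq 200 ]; then echo pg-node2
  else echo none
  fi
}
pick_primary 200 503   # normal operation -> pg-node1
pick_primary 503 200   # after Patroni promotes the replica -> pg-node2
```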
🐳 9. Docker Compose HA Deployment
For teams preferring containerized deployments, here's a Docker Compose configuration that replicates the HA topology with multiple app containers and PostgreSQL.
Topology A: 2 App Containers + 1 DB
# docker-compose.ha.yml — Topology A (2 App + 1 DB)
version: '3.8'
services:
# ── Load Balancer ──────────────────────
nginx:
image: nginx:1.25-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/minusnow-ha.conf:/etc/nginx/conf.d/default.conf:ro
- ./nginx/certs:/etc/ssl/certs:ro
depends_on:
app1:
condition: service_healthy
app2:
condition: service_healthy
restart: always
# ── Application Node 1 ────────────────
app1:
build: .
hostname: app1
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
- audit-logs:/app/audit-logs
# ── Application Node 2 ────────────────
app2:
build: .
hostname: app2
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
- audit-logs:/app/audit-logs
# ── Database ───────────────────────────
db:
image: postgres:16-alpine
environment:
POSTGRES_DB: minusnow_prod
POSTGRES_USER: minusnow
POSTGRES_PASSWORD: ${DB_PASSWORD:-changeme}
volumes:
- pgdata:/var/lib/postgresql/data
ports:
- "127.0.0.1:5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U minusnow -d minusnow_prod"]
interval: 10s
timeout: 5s
retries: 5
restart: always
shm_size: '256mb'
volumes:
pgdata:
app-data:
audit-logs:
Nginx Config for Docker
# nginx/minusnow-ha.conf
upstream minusnow {
least_conn;
server app1:5000 max_fails=3 fail_timeout=30s;
server app2:5000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name _;
location /ws {
proxy_pass http://minusnow;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 86400;
}
location / {
proxy_pass http://minusnow;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Topology B: 2 App + 2 DB (with Replication)
# docker-compose.ha-full.yml — Topology B (2 App + Primary DB + Replica DB)
version: '3.8'
services:
nginx:
image: nginx:1.25-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/minusnow-ha.conf:/etc/nginx/conf.d/default.conf:ro
depends_on:
app1:
condition: service_healthy
app2:
condition: service_healthy
restart: always
app1:
build: .
hostname: app1
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db-primary:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db-primary:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
app2:
build: .
hostname: app2
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db-primary:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db-primary:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
# ── Primary Database ───────────────────
db-primary:
image: postgres:16-alpine
hostname: db-primary
environment:
POSTGRES_DB: minusnow_prod
POSTGRES_USER: minusnow
POSTGRES_PASSWORD: ${DB_PASSWORD:-changeme}
volumes:
- pgdata-primary:/var/lib/postgresql/data
- ./db/init-primary.sh:/docker-entrypoint-initdb.d/init-replication.sh:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U minusnow -d minusnow_prod"]
interval: 10s
timeout: 5s
retries: 5
restart: always
shm_size: '256mb'
command: >
postgres
-c wal_level=replica
-c max_wal_senders=5
-c max_replication_slots=5
-c synchronous_commit=on
-c synchronous_standby_names='replica1'
-c wal_keep_size=1GB
# ── Replica Database ───────────────────
db-replica:
image: postgres:16-alpine
hostname: db-replica
environment:
PGUSER: replicator
PGPASSWORD: ${REPL_PASSWORD:-replpass}
depends_on:
db-primary:
condition: service_healthy
volumes:
- pgdata-replica:/var/lib/postgresql/data
- ./db/init-replica.sh:/docker-entrypoint-initdb.d/init-replica.sh:ro
restart: always
shm_size: '256mb'
volumes:
pgdata-primary:
pgdata-replica:
app-data:
Launch Commands
# Topology A (2 App + 1 DB):
docker compose -f docker-compose.ha.yml up -d --build
# Topology B (2 App + 2 DB):
docker compose -f docker-compose.ha-full.yml up -d --build
# Check status:
docker compose ps
docker compose logs -f app1 app2
# Scaling beyond 2 app nodes: --scale adjusts the replica count of a single
# service, so it cannot add a 3rd node to the named app1/app2 services above.
# Either define one "app" service and run:
#   docker compose -f docker-compose.ha.yml up -d --scale app=3
# (with an LB that resolves replicas via the service name), or use Kubernetes.
☸️ 10. Kubernetes Deployment
For production Kubernetes clusters. Provides auto-healing, rolling updates, and horizontal pod autoscaling.
Deployment Manifest
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: minusnow-app
labels:
app: minusnow
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime deploys
selector:
matchLabels:
app: minusnow
template:
metadata:
labels:
app: minusnow
spec:
containers:
- name: minusnow
image: minusnow/itsm:26.0
ports:
- containerPort: 5000
env:
- name: NODE_ENV
value: "production"
- name: PORT
value: "5000"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: minusnow-secrets
key: database-url
- name: SESSION_SECRET
valueFrom:
secretKeyRef:
name: minusnow-secrets
key: session-secret
livenessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "2Gi"
Service & Ingress
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
name: minusnow-svc
spec:
selector:
app: minusnow
ports:
- port: 80
targetPort: 5000
type: ClusterIP
---
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minusnow-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "86400"
nginx.ingress.kubernetes.io/proxy-send-timeout: "86400"
nginx.ingress.kubernetes.io/websocket-services: minusnow-svc
spec:
ingressClassName: nginx
tls:
- hosts:
- minusnow.yourdomain.com
secretName: minusnow-tls
rules:
- host: minusnow.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: minusnow-svc
port:
number: 80
Horizontal Pod Autoscaler
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: minusnow-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: minusnow-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Database: Use CloudNativePG or Managed Service
# For PostgreSQL on Kubernetes, recommended options:
#
# Option 1: Managed Database (recommended for production)
# - AWS RDS Multi-AZ
# - Azure Database for PostgreSQL Flexible (HA enabled)
# - Google Cloud SQL with HA
#
# Option 2: CloudNativePG Operator
# kubectl apply -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.22/releases/cnpg-1.22.0.yaml
#
# Then create a PostgreSQL cluster:
# k8s/cnpg-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: minusnow-db
spec:
instances: 2 # 1 primary + 1 replica
storage:
size: 50Gi
storageClass: gp3
postgresql:
parameters:
max_connections: "200"
synchronous_commit: "on"
bootstrap:
initdb:
database: minusnow_prod
owner: minusnow
📊 11. Health Checks & Monitoring
Application Health Endpoint
# MinusNow exposes /api/health by default. Example response:
{
"status": "ok",
"uptime": 86400,
"database": "connected",
"version": "26.0.1",
"node": "app1"
}
# The LB should poll this endpoint every 10-15 seconds.
# If it returns non-200 or times out, mark the node as down.
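The `max_fails=3` / `fall 3` settings from Section 5 amount to a consecutive-failure counter. A sketch with hard-coded probe results (real `rise 2` needs two consecutive successes; a single success resets here for brevity):

```shell
# mark a node down after 3 consecutive failed probes; a success resets
fails=0; state=up
for code in 200 200 503 503 503 200; do
  if [ "$code" -eq 200 ]; then
    fails=0; state=up
  else
    fails=$((fails + 1))
    if [ "$fails" -ge 3 ]; then state=down; fi
  fi
  echo "probe=$code state=$state"
done
```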
Key Metrics to Monitor
| Metric | Threshold | Alert Level | Where to Check |
| App response time (p95) | > 2s | Warning | LB logs / APM |
| App node count (healthy) | < 2 | Critical | LB health check |
| DB replication lag | > 10 MB | Warning | pg_stat_replication |
| DB replication lag | > 100 MB | Critical | pg_stat_replication |
| DB connections used | > 80% | Warning | pg_stat_activity |
| Disk usage (DB) | > 85% | Critical | df -h |
| Patroni leader status | No leader | Critical | patronictl list |
| CPU usage (any node) | > 90% sustained | Warning | Node exporter / htop |
Replication Lag Monitor Script
#!/bin/bash
# /opt/minusnow/scripts/check-replication.sh
# Run via cron every 1 minute
LAG=$(sudo -u postgres psql -t -c \
"SELECT COALESCE(pg_wal_lsn_diff(sent_lsn, replay_lsn), 0) FROM pg_stat_replication LIMIT 1;" \
2>/dev/null | tr -d ' ')
if [ -z "$LAG" ]; then
echo "CRITICAL: No replication connection"
exit 2
elif [ "$LAG" -gt 104857600 ]; then # 100 MB
echo "CRITICAL: Replication lag ${LAG} bytes"
exit 2
elif [ "$LAG" -gt 10485760 ]; then # 10 MB
echo "WARNING: Replication lag ${LAG} bytes"
exit 1
else
echo "OK: Replication lag ${LAG} bytes"
exit 0
fi
🧪 12. Failover Testing Runbook
Run these tests quarterly to verify your HA setup works. Always test during a maintenance window.
Test 1: App Node Failure
# Simulate App Node 1 crash
ssh app1 "sudo -u minusnow pm2 stop all" # or: docker stop app1 (plain sudo would target root's empty PM2 daemon)
# Verify:
# 1. LB detects failure within 30s
# 2. All traffic routes to App Node 2
# 3. Users experience no errors (maybe brief WebSocket reconnect)
curl -k https://minusnow.yourdomain.com/api/health
# Should still return 200
# Recover:
ssh app1 "sudo -u minusnow pm2 start all"
Test 2: DB Primary Failure (Topology B)
# Simulate DB Primary crash
ssh db-primary "sudo systemctl stop patroni"
# Verify:
# 1. Patroni promotes Replica within 10-30s
# 2. HAProxy reroutes DB connections to new primary
# 3. App nodes reconnect automatically
# 4. Check data integrity
patronictl -c /etc/patroni/patroni.yml list
# pg-node2 should now show as "Leader"
# Test application:
curl -k https://minusnow.yourdomain.com/api/health
# Recover original primary:
ssh db-primary "sudo systemctl start patroni"
# Patroni will reinitialize it as a replica
Test 3: Complete LB Failure
# If using a single LB, this is your remaining SPOF.
# Options to eliminate:
# 1. Use keepalived with a floating VIP between 2 LB nodes
# 2. Use a cloud-managed LB (ALB, Azure App Gateway)
# 3. Use DNS-based failover (Route 53, Cloudflare)
# Test: Stop nginx/haproxy on the LB
ssh lb "sudo systemctl stop nginx"
# Result: Application becomes unreachable
# Mitigation: keepalived floats VIP to backup LB within 3s
Expected Failover Times
| Failure Scenario | Detection | Failover | Total Downtime | Data Loss |
| App node crash | ~10–30s | Instant (LB reroutes) | < 30s | None |
| DB primary crash (Patroni) | ~10s | ~10–20s (promotion) | < 30s | Near-zero (sync replication) |
| DB primary crash (manual) | Monitoring alert | 5–15 min (manual) | 5–15 min | Near-zero |
| LB crash (with keepalived) | ~3s | ~3s (VIP float) | < 5s | None |
| Full data center loss | Immediate | DNS failover | 5–30 min | Depends on replication |
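To put those uptime percentages in perspective, the downtime budget each tier allows in a 30-day month:

```shell
# minutes of allowed downtime per 30-day month for each SLA tier
awk 'BEGIN {
  mins = 30 * 24 * 60                           # 43200 minutes per month
  n = split("99.5 99.9 99.95 99.99", sla, " ")
  for (i = 1; i <= n; i++)
    printf "%s%%  -> %.1f min/month\n", sla[i], mins * (1 - sla[i] / 100)
}'
```

A single failover event should fit comfortably inside the budget of the topology you chose; if it does not, revisit the detection and promotion timeouts above.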
💾 13. Backup Strategy
Backup Matrix
| What | How | Frequency | Retention | Where |
| PostgreSQL full dump | pg_dump / pg_basebackup | Daily (2 AM) | 30 days | Remote NFS / S3 |
| WAL archiving | Continuous (streaming) | Continuous | 7 days | Remote storage |
| Application config | .env, ecosystem.config.cjs | On change | Versioned | Git / Vault |
| Application data (uploads) | rsync / S3 sync | Hourly | 90 days | Remote NFS / S3 |
| Audit logs | rsync / S3 sync | Daily | 7 years | Compliance storage |
Automated Backup Script
#!/bin/bash
# /opt/minusnow/scripts/backup.sh
# Run daily via cron: 0 2 * * * /opt/minusnow/scripts/backup.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/minusnow"
REMOTE_DIR="s3://minusnow-backups/daily"
mkdir -p "${BACKUP_DIR}"
# Database backup
pg_dump -h 10.0.2.10 -U mnow_backup -Fc minusnow_prod \
> "${BACKUP_DIR}/db_${TIMESTAMP}.dump"
# Application data
tar czf "${BACKUP_DIR}/data_${TIMESTAMP}.tar.gz" \
/opt/minusnow/data /opt/minusnow/audit-logs
# Upload to remote
aws s3 cp "${BACKUP_DIR}/db_${TIMESTAMP}.dump" "${REMOTE_DIR}/"
aws s3 cp "${BACKUP_DIR}/data_${TIMESTAMP}.tar.gz" "${REMOTE_DIR}/"
# Cleanup local (keep 7 days)
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +7 -delete
echo "[$(date)] Backup complete: db_${TIMESTAMP}.dump"
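The `-mtime +7` retention rule can be sanity-checked against a scratch directory before you trust it with real backups (GNU `touch -d` assumed, as on the Ubuntu/RHEL hosts this guide targets):

```shell
# simulate the cleanup step on throwaway files
dir=$(mktemp -d)
touch "$dir/db_new.dump"
touch -d '8 days ago' "$dir/db_old.dump"   # simulate a stale backup
find "$dir" -name "*.dump" -mtime +7 -delete
ls "$dir"                                  # only db_new.dump survives
rm -rf "$dir"
```

`-mtime +7` matches files strictly older than 7 whole days, so a 7-day-old backup is kept and an 8-day-old one is deleted.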
🔧 14. Troubleshooting
14.1 App Nodes Not Joining LB Pool
# Check if the health endpoint responds:
curl http://10.0.1.11:5000/api/health
# Check Nginx upstream status:
sudo nginx -T | grep -A5 "upstream minusnow"
# Check HAProxy stats (requires a "stats socket /var/run/haproxy/admin.sock"
# line in the global section of haproxy.cfg):
echo "show stat" | socat stdio /var/run/haproxy/admin.sock
# Common fix: ensure PORT=5000, firewall allows :5000
14.2 Session Lost When Switching Nodes
# Cause: Different SESSION_SECRET on each app node
# Fix: Ensure SESSION_SECRET is identical across all app nodes
# Verify:
ssh app1 "grep SESSION_SECRET /opt/minusnow/.env"
ssh app2 "grep SESSION_SECRET /opt/minusnow/.env"
# Both must output the same value
14.3 Replication Lag Increasing
# Check current lag:
sudo -u postgres psql -c "SELECT client_addr, state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;"
# Common causes:
# 1. Slow disk I/O on replica → upgrade to NVMe
# 2. Long-running queries on replica → cancel them
# 3. Network saturation → check bandwidth between DB nodes
# 4. max_wal_senders too low → increase in postgresql.conf
14.4 Patroni Won't Start
# Check logs:
sudo journalctl -u patroni -n 50
# Common issues:
# 1. etcd not reachable → verify etcd cluster health:
etcdctl endpoint health
# 2. Data directory permissions:
ls -la /var/lib/postgresql/16/main/
sudo chown -R postgres:postgres /var/lib/postgresql/16/main/
# 3. Port conflict with standalone PostgreSQL:
sudo systemctl stop postgresql
sudo systemctl start patroni
14.5 Split-Brain Scenario
Split-brain occurs when both DB nodes think they are primary. This can cause data corruption.
# Prevention:
# 1. Use synchronous_commit = on (data safety)
# 2. maximum_lag_on_failover in Patroni (prevents stale promotion)
# 3. etcd quorum (prevents isolated node from promoting)
# Detection:
patronictl -c /etc/patroni/patroni.yml list
# If 2 "Leader" entries appear → immediate action required
# Resolution:
# 1. Stop one of the two primaries immediately
# 2. Identify which has the latest data (check pg_current_wal_lsn())
# 3. Demote the stale one: patronictl reinit minusnow-cluster <stale-node>
📋 15. Quick Reference
Architecture Decision Matrix
| Factor | Topology A (2+1) | Topology B (2+2) | Kubernetes |
| Setup complexity | Low | Medium | High |
| App-tier HA | ✅ | ✅ | ✅ |
| DB failover | ❌ Manual/Backup | ✅ Automatic | ✅ (managed DB or CNP) |
| Zero-downtime deploy | ✅ Rolling | ✅ Rolling | ✅ Rolling |
| Auto-scaling | ❌ | ❌ | ✅ HPA |
| Min servers | 3 | 4 | 3+ node K8s cluster |
| Best for | 50–500 users | 500+ users | Cloud-native teams |
Essential Commands
| Action | Command |
| Check app health | curl http://<app-ip>:5000/api/health |
| Check cluster through LB | curl -k https://minusnow.yourdomain.com/api/health |
| View replication status | sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;" |
| Patroni cluster status | patronictl -c /etc/patroni/patroni.yml list |
| Manual DB failover | patronictl -c /etc/patroni/patroni.yml switchover |
| Rolling restart app | pm2 reload minusnow (on each node sequentially) |
| Docker HA start | docker compose -f docker-compose.ha.yml up -d |
| K8s scale app | kubectl scale deployment minusnow-app --replicas=4 |