🛡️ High Availability
High Availability & Cluster Deployment Guide
Production-grade multi-node deployments with automatic failover, load balancing, and zero-downtime maintenance. Covers bare-metal, Docker Compose, and Kubernetes topologies.
Version 26.0 · Last updated March 2026 · Estimated setup time: 2–4 hours
🔍 1. Overview & Architecture Options
MinusNow supports three progressive deployment topologies. This guide covers Topology A and Topology B, multi-node architectures in which the failure of any single app server causes no application downtime; Topology B extends the same guarantee to the database tier.
| Topology | Nodes | DB Failover | App Failover | Target Uptime | Use Case |
| Single-Node | 1 | ❌ None | ❌ None | ~99.5% | Dev / POC / Small teams |
| Two-Node (existing guide) | 2 | ⚠️ Manual | ❌ None | ~99.9% | Small production |
| Topology A — This Guide | 3 (2 App + 1 DB) | ❌ Single DB | ✅ Automatic | ~99.95% | App-tier HA; DB handled by backups |
| Topology B — This Guide | 4 (2 App + 2 DB) | ✅ Automatic | ✅ Automatic | ~99.99% | Full HA; enterprise production |
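The uptime targets above follow from standard series/parallel availability arithmetic: two load-balanced app nodes fail together only if both fail, while a single DB sits in series with the whole stack. A quick sketch (the per-node figures are illustrative assumptions, not measured values):

```shell
# two app nodes in parallel; one DB server in series (Topology A shape)
awk 'BEGIN {
  a_app = 0.995                   # assumed availability of one app node
  a_db  = 0.999                   # assumed availability of the single DB
  tier  = 1 - (1 - a_app) ^ 2     # parallel pair -> 0.999975
  total = tier * a_db             # series with the DB
  printf "app tier: %.6f  topology A: %.6f\n", tier, total
}'
```

Note how the single DB dominates the result, which is exactly why Topology B clusters the database tier as well.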
🏗️ 2. Topology A — 2 App Servers + 1 DB Server
The application tier is load-balanced across two servers. If one app server fails, the load balancer routes all traffic to the surviving node. The database is a single server (protected by regular backups and optional read replica).
┌─────────────────────┐
│ Load Balancer │
│ (Nginx / HAProxy) │
│ :80 / :443 (VIP) │
└─────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌───────▼─────┐
│ App Node 1 │ │ App Node 2 │
│ 10.0.1.11 │ │ 10.0.1.12 │
│ Node.js │ │ Node.js │
│ :5000 │ │ :5000 │
└──────┬──────┘ └───────┬─────┘
│ │
└────────────┬────────────┘
│
┌─────────▼──────────┐
│ DB Server │
│ 10.0.2.10 │
│ PostgreSQL :5432 │
│ (Primary) │
└────────────────────┘
✅ Advantages
- Zero downtime for app failures
- Rolling updates (drain → update → enable)
- Horizontal scale by adding more app nodes
- Simpler than full DB clustering
⚠️ Limitations
- DB is single point of failure
- DB failure requires restore from backup
- RPO depends on backup frequency
🎯 Best For
- Teams of 50–500 users
- Environments with managed DB (RDS/Cloud SQL)
- Budget-conscious HA requirements
🏗️ 3. Topology B — 2 App Servers + 2 DB Nodes
Full high availability across both tiers. The database uses PostgreSQL streaming replication with automatic failover via Patroni or manual promotion. No single point of failure.
┌─────────────────────┐
│ Load Balancer │
│ (Nginx / HAProxy) │
│ :80 / :443 (VIP) │
└─────────┬────────────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌───────▼─────┐
│ App Node 1 │ │ App Node 2 │
│ 10.0.1.11 │ │ 10.0.1.12 │
│ Node.js │ │ Node.js │
│ :5000 │ │ :5000 │
└──────┬──────┘ └───────┬─────┘
│ │
└────────────┬────────────┘
│
┌──────────────┴──────────────┐
│ │
┌───────▼────────┐ ┌──────────▼───────┐
│ DB Primary │ │ DB Replica │
│ 10.0.2.10 │ ──WAL──▶│ 10.0.2.11 │
│ PostgreSQL │ stream │ PostgreSQL │
│ :5432 (R/W) │ │ :5432 (Read-Only)│
└────────────────┘ └──────────────────┘
│ │
└──────── Patroni / ──────────┘
etcd (automatic
failover)
✅ Advantages
- No single point of failure
- Automatic DB failover (RPO ≈ 0)
- Read-replica offloads reporting queries
- Zero-downtime for any single node failure
⚠️ Considerations
- More complex to set up and maintain
- Requires Patroni + etcd (or managed DB HA)
- Network split-brain risk needs quorum
- 4 servers minimum
🎯 Best For
- Enterprises with 500+ users
- Regulated environments (SOC 2, ISO 27001)
- Zero-downtime SLA requirements
- Mission-critical ITSM workflows
💻 4. Hardware & Network Requirements
Server Specifications
| Role | CPU | RAM | Storage | OS | Qty (Topo A) | Qty (Topo B) |
| Load Balancer | 2 vCPU | 4 GB | 20 GB SSD | Ubuntu 24.04 / RHEL 9 | 1 | 1 |
| App Server | 8 vCPU | 32 GB | 100 GB SSD | Ubuntu 24.04 / RHEL 9 | 2 | 2 |
| DB Primary | 8 vCPU | 64 GB | 500 GB NVMe | Ubuntu 24.04 / RHEL 9 | 1 | 1 |
| DB Replica | 8 vCPU | 64 GB | 500 GB NVMe | Ubuntu 24.04 / RHEL 9 | — | 1 |
Tip: The load balancer can run on one of the app servers to reduce node count. To keep the LB itself from becoming a single point of failure, use two LB nodes with a keepalived VIP or a cloud-managed LB (ALB, Azure App Gateway).
Network Requirements
| Source | Destination | Port | Protocol | Purpose |
| Internet / VIP | Load Balancer | 80, 443 | TCP | HTTP/HTTPS ingress |
| Load Balancer | App Node 1 & 2 | 5000 | TCP | Upstream app traffic |
| App Node 1 & 2 | DB Primary | 5432 | TCP | Database connections |
| DB Primary | DB Replica | 5432 | TCP | WAL streaming replication |
| All nodes | All nodes | 2379–2380 | TCP | etcd cluster (Patroni, Topo B only) |
| All nodes | All nodes | 8008 | TCP | Patroni REST API (Topo B only) |
Software Prerequisites (all nodes)
| Component | Version | App Nodes | DB Nodes | LB Node |
| Node.js | 22.x LTS | ✅ | — | — |
| PostgreSQL | 16+ | — | ✅ | — |
| Nginx or HAProxy | 1.24+ / 2.8+ | — | — | ✅ |
| PM2 | Latest | ✅ | — | — |
| Patroni | 3.x | — | ✅ (Topo B) | — |
| etcd | 3.5+ | — | ✅ (Topo B) | — |
⚖️ 5. Load Balancer Setup
The load balancer distributes traffic across app nodes and performs health checks to detect failures. Choose Nginx (simpler) or HAProxy (more HA features).
Option A: Nginx Load Balancer
# /etc/nginx/conf.d/minusnow-ha.conf
upstream minusnow_app {
# Least-connections balancing (best for long WebSocket sessions)
least_conn;
server 10.0.1.11:5000 max_fails=3 fail_timeout=30s;
server 10.0.1.12:5000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name minusnow.yourdomain.com;
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name minusnow.yourdomain.com;
ssl_certificate /etc/ssl/certs/minusnow.crt;
ssl_certificate_key /etc/ssl/private/minusnow.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
# Health check endpoint
location /api/health {
proxy_pass http://minusnow_app;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
}
# WebSocket support (for live updates)
location /ws {
proxy_pass http://minusnow_app;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 86400;
}
# Main application
location / {
proxy_pass http://minusnow_app;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 10s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
}
}
Option B: HAProxy Load Balancer
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
maxconn 4096
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5s
timeout client 60s
timeout server 60s
timeout tunnel 86400s # WebSocket keepalive
frontend https_front
bind *:443 ssl crt /etc/ssl/certs/minusnow.pem
default_backend minusnow_app
# Stats page (restrict to internal)
acl internal src 10.0.0.0/8
use_backend stats_backend if internal { path_beg /haproxy-stats }
backend minusnow_app
balance leastconn
option httpchk GET /api/health
http-check expect status 200
server app1 10.0.1.11:5000 check inter 10s fall 3 rise 2
server app2 10.0.1.12:5000 check inter 10s fall 3 rise 2
backend stats_backend
stats enable
stats uri /haproxy-stats
stats refresh 10s
Enable & Test
# Nginx
sudo nginx -t
sudo systemctl enable --now nginx
# HAProxy
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
sudo systemctl enable --now haproxy
# Verify from the LB server
curl -k https://localhost/api/health
# Expected: {"status":"ok","timestamp":"..."}
⚙️ 6. Application Cluster Setup
Repeat these steps on both app servers (10.0.1.11 and 10.0.1.12).
Step 6a: Install Node.js & Deploy Application
# Install Node.js 22.x LTS
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install PM2 globally (process manager used in Step 6c)
sudo npm install -g pm2
# Create application user
sudo useradd -r -m -s /bin/bash minusnow
# Deploy application
sudo mkdir -p /opt/minusnow
sudo chown minusnow:minusnow /opt/minusnow
cd /opt/minusnow
# Copy application package (from your build server or artifact store)
sudo -u minusnow tar xzf /tmp/minusnow-latest.tar.gz -C /opt/minusnow/
# Install dependencies & build
sudo -u minusnow npm ci --production
sudo -u minusnow npm run build
Step 6b: Configure Environment Variables
# /opt/minusnow/.env (same on both app nodes)
NODE_ENV=production
PORT=5000
# Point to DB Primary (Topology A) or Patroni VIP (Topology B)
DATABASE_URL=postgresql://minusnow:<password>@10.0.2.10:5432/minusnow_prod
# Session secret (MUST be identical on all app nodes for sticky sessions)
SESSION_SECRET=<same-secret-on-all-app-nodes>
# Application URL (the LB address)
APP_BASE_URL=https://minusnow.yourdomain.com
# Agent API key
AGENT_API_KEY=<from-secrets-manager>
Critical: The SESSION_SECRET must be identical on all app nodes. If different secrets are used, users will be logged out when their request hits the other node.
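One way to satisfy this is to generate the secret once and distribute the identical value to every node. A sketch (the distribution loop is commented out and uses the node IPs from this guide; adapt to your own provisioning tooling):

```shell
# generate one 64-hex-char secret on a single machine
SECRET=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "generated ${#SECRET} hex chars"
# ...then push the SAME value to every app node, e.g.:
# for h in 10.0.1.11 10.0.1.12; do
#   ssh "$h" "echo SESSION_SECRET=${SECRET} | sudo tee -a /opt/minusnow/.env"
# done
```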
Step 6c: PM2 Cluster Mode
# /opt/minusnow/ecosystem.config.cjs
module.exports = {
apps: [{
name: "minusnow",
script: "dist/index.js",
instances: "max", // Use all available CPU cores
exec_mode: "cluster", // PM2 cluster mode for multi-core
env: {
NODE_ENV: "production",
PORT: 5000,
},
// Graceful restart
kill_timeout: 5000,
listen_timeout: 10000,
// Auto-restart on crash
max_restarts: 10,
restart_delay: 4000,
// Log rotation
log_date_format: "YYYY-MM-DD HH:mm:ss",
error_file: "/var/log/minusnow/error.log",
out_file: "/var/log/minusnow/output.log",
}]
};
# Start application with PM2
sudo -u minusnow pm2 start ecosystem.config.cjs
sudo -u minusnow pm2 save
# Enable PM2 startup on boot
sudo env PATH=$PATH:/usr/bin pm2 startup systemd -u minusnow --hp /home/minusnow
Step 6d: Verify Both App Nodes
# From the LB server, test each node directly:
curl http://10.0.1.11:5000/api/health
# {"status":"ok","node":"app1","uptime":...}
curl http://10.0.1.12:5000/api/health
# {"status":"ok","node":"app2","uptime":...}
# Test through the load balancer:
curl -k https://minusnow.yourdomain.com/api/health
🗄️ 7. PostgreSQL Streaming Replication (Topology B)
Set up WAL-based streaming replication between the Primary (10.0.2.10) and Replica (10.0.2.11). This enables near-zero data loss failover.
Step 7a: Configure Primary (10.0.2.10)
# /etc/postgresql/16/main/postgresql.conf (Ubuntu)
# or /var/lib/pgsql/16/data/postgresql.conf (RHEL)
# --- Replication Settings ---
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
max_replication_slots = 5
synchronous_commit = on # or 'remote_apply' for strong consistency
synchronous_standby_names = 'replica1' # name of the standby
# --- Connection Settings ---
listen_addresses = '*'
Step 7b: Create Replication User on Primary
# On the Primary DB server
sudo -u postgres psql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '<strong-password>';
# Add to pg_hba.conf:
# host replication replicator 10.0.2.11/32 scram-sha-256
sudo systemctl restart postgresql
Step 7c: Initialize Replica (10.0.2.11)
# Stop PostgreSQL on Replica
sudo systemctl stop postgresql
# Remove existing data directory
sudo rm -rf /var/lib/postgresql/16/main/*
# Create base backup from Primary
sudo -u postgres pg_basebackup \
-h 10.0.2.10 \
-U replicator \
-D /var/lib/postgresql/16/main \
  -Fp -Xs -P -R
# The -R flag auto-creates standby.signal and sets primary_conninfo.
# pg_basebackup prompts for the replicator password; a ~postgres/.pgpass
# entry on the replica avoids the interactive prompt.
Step 7d: Configure Replica
# /var/lib/postgresql/16/main/postgresql.auto.conf (auto-created by -R)
# Verify it contains:
primary_conninfo = 'host=10.0.2.10 port=5432 user=replicator password=<password> application_name=replica1'
# Start PostgreSQL on Replica
sudo systemctl start postgresql
Step 7e: Verify Replication
# On Primary — check replication status:
sudo -u postgres psql -c "SELECT client_addr, state, sync_state, sent_lsn, replay_lsn FROM pg_stat_replication;"
# client_addr | state | sync_state | sent_lsn | replay_lsn
# --------------+-----------+------------+--------------+------------
# 10.0.2.11 | streaming | sync | 0/3000060 | 0/3000060
# On Replica — confirm it's in recovery mode:
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# pg_is_in_recovery
# -------------------
# t
Replication Active: When state = streaming and sent_lsn ≈ replay_lsn, the replica is fully synchronized with the primary.
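The lag comparison is plain byte arithmetic: an LSN like 0/3000060 is `<hi>/<lo>` in hex, where the high half counts 4 GiB segments, and `pg_wal_lsn_diff` is the difference of the two absolute positions. A sketch of the math:

```shell
# convert an LSN like 0/3000060 to an absolute byte offset
lsn_to_bytes() {
  hi=${1%/*}; lo=${1#*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))   # hi * 2^32 + lo
}
sent=$(lsn_to_bytes 0/3000060)
replay=$(lsn_to_bytes 0/3000000)
echo "lag: $((sent - replay)) bytes"       # 0x60 = 96 bytes behind
```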
🔄 8. Automatic Failover with Patroni
Patroni automates PostgreSQL HA — it monitors the primary, promotes the replica, and updates connection routing. It uses etcd for consensus.
Step 8a: Install etcd (on both DB nodes)
sudo apt install -y etcd-server etcd-client   # packaged as plain "etcd" on older Ubuntu/Debian releases
# /etc/default/etcd (Node 10.0.2.10)
ETCD_NAME="etcd1"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://10.0.2.10:2380"
ETCD_LISTEN_CLIENT_URLS="http://10.0.2.10:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.2.10:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://10.0.2.10:2379"
ETCD_INITIAL_CLUSTER="etcd1=http://10.0.2.10:2380,etcd2=http://10.0.2.11:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
# Similar on 10.0.2.11 with ETCD_NAME="etcd2" and appropriate IPs.
# Note: a 2-member etcd cluster has no failure tolerance (quorum is still 2),
# so add a third member (e.g. on the LB node) for real automatic failover.
sudo systemctl enable --now etcd
Step 8b: Install Patroni (on both DB nodes)
sudo apt install -y python3-pip python3-psycopg2
sudo pip3 install 'patroni[etcd3]'   # quoted so the shell doesn't glob; etcd3 matches the DCS section below
Step 8c: Patroni Configuration
# /etc/patroni/patroni.yml (Primary: 10.0.2.10)
scope: minusnow-cluster
name: pg-node1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.2.10:8008
etcd3:
hosts: 10.0.2.10:2379,10.0.2.11:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1 MB max lag for failover
synchronous_mode: true
postgresql:
use_pg_rewind: true
parameters:
max_connections: 200
wal_level: replica
max_wal_senders: 5
max_replication_slots: 5
synchronous_commit: "on"
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.2.10:5432
data_dir: /var/lib/postgresql/16/main
bin_dir: /usr/lib/postgresql/16/bin
authentication:
superuser:
username: postgres
password: <postgres-password>
replication:
username: replicator
password: <replicator-password>
parameters:
unix_socket_directories: '/var/run/postgresql'
# /etc/patroni/patroni.yml (Replica: 10.0.2.11) — same but:
name: pg-node2
restapi:
connect_address: 10.0.2.11:8008
postgresql:
connect_address: 10.0.2.11:5432
Step 8d: Start Patroni
# Create systemd service
sudo tee /etc/systemd/system/patroni.service <<'EOF'
[Unit]
Description=Patroni PostgreSQL HA
After=network.target etcd.service
[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now patroni
# Check cluster status
patronictl -c /etc/patroni/patroni.yml list
# +----------+-----------+---------+---------+----+-----------+
# | Member   | Host      | Role    | State   | TL | Lag in MB |
# +----------+-----------+---------+---------+----+-----------+
# | pg-node1 | 10.0.2.10 | Leader  | running |  1 |         0 |
# | pg-node2 | 10.0.2.11 | Replica | running |  1 |         0 |
# +----------+-----------+---------+---------+----+-----------+
Step 8e: HAProxy for DB Failover (on LB or App nodes)
Route application DB connections through HAProxy so the app always hits the current primary, regardless of which node Patroni promoted.
# Append to /etc/haproxy/haproxy.cfg (or separate file)
listen postgres_rw
bind *:6432
mode tcp
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-node1 10.0.2.10:5432 check port 8008
server pg-node2 10.0.2.11:5432 check port 8008
listen postgres_ro
bind *:6433
mode tcp
balance roundrobin
option httpchk GET /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2
server pg-node1 10.0.2.10:5432 check port 8008
server pg-node2 10.0.2.11:5432 check port 8008
# Update app .env to use HAProxy for DB:
DATABASE_URL=postgresql://minusnow:<password>@10.0.1.10:6432/minusnow_prod
How it works: Patroni exposes /primary and /replica health endpoints on port 8008. HAProxy checks these to route R/W traffic to the current leader and read traffic to replicas. On failover, Patroni promotes the replica — HAProxy detects the role change within seconds and reroutes traffic automatically.
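The routing decision reduces to a check of the two `/primary` status codes. A sketch with hard-coded codes standing in for what `curl -s -o /dev/null -w '%{http_code}' http://<node>:8008/primary` would return:

```shell
# pick the R/W backend the way HAProxy's health checks do:
# whichever node answers 200 on Patroni's /primary endpoint gets the traffic
pick_primary() {
  if   [ "$1" -eq 200 ]; then echo pg-node1
  elif [ "$2" -eq 200 ]; then echo pg-node2
  else echo none
  fi
}
pick_primary 200 503   # normal operation -> pg-node1
pick_primary 503 200   # after Patroni promotes the replica -> pg-node2
```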
🐳 9. Docker Compose HA Deployment
For teams preferring containerized deployments, here's a Docker Compose configuration that replicates the HA topology with multiple app containers and PostgreSQL.
Topology A: 2 App Containers + 1 DB
# docker-compose.ha.yml — Topology A (2 App + 1 DB)
version: '3.8'
services:
# ── Load Balancer ──────────────────────
nginx:
image: nginx:1.25-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/minusnow-ha.conf:/etc/nginx/conf.d/default.conf:ro
- ./nginx/certs:/etc/ssl/certs:ro
depends_on:
app1:
condition: service_healthy
app2:
condition: service_healthy
restart: always
# ── Application Node 1 ────────────────
app1:
build: .
hostname: app1
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
- audit-logs:/app/audit-logs
# ── Application Node 2 ────────────────
app2:
build: .
hostname: app2
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
- audit-logs:/app/audit-logs
# ── Database ───────────────────────────
db:
image: postgres:16-alpine
environment:
POSTGRES_DB: minusnow_prod
POSTGRES_USER: minusnow
POSTGRES_PASSWORD: ${DB_PASSWORD:-changeme}
volumes:
- pgdata:/var/lib/postgresql/data
ports:
- "127.0.0.1:5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U minusnow -d minusnow_prod"]
interval: 10s
timeout: 5s
retries: 5
restart: always
shm_size: '256mb'
volumes:
pgdata:
app-data:
audit-logs:
Nginx Config for Docker
# nginx/minusnow-ha.conf
upstream minusnow {
least_conn;
server app1:5000 max_fails=3 fail_timeout=30s;
server app2:5000 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name _;
location /ws {
proxy_pass http://minusnow;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_read_timeout 86400;
}
location / {
proxy_pass http://minusnow;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
Topology B: 2 App + 2 DB (with Replication)
# docker-compose.ha-full.yml — Topology B (2 App + Primary DB + Replica DB)
version: '3.8'
services:
nginx:
image: nginx:1.25-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx/minusnow-ha.conf:/etc/nginx/conf.d/default.conf:ro
depends_on:
app1:
condition: service_healthy
app2:
condition: service_healthy
restart: always
app1:
build: .
hostname: app1
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db-primary:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db-primary:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
app2:
build: .
hostname: app2
environment:
- NODE_ENV=production
- PORT=5000
- DATABASE_URL=postgresql://minusnow:${DB_PASSWORD:-changeme}@db-primary:5432/minusnow_prod
- SESSION_SECRET=${SESSION_SECRET}
env_file: .env
depends_on:
db-primary:
condition: service_healthy
healthcheck:
test: ["CMD", "wget", "-q", "--spider", "http://localhost:5000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
restart: always
volumes:
- app-data:/app/data
# ── Primary Database ───────────────────
db-primary:
image: postgres:16-alpine
hostname: db-primary
environment:
POSTGRES_DB: minusnow_prod
POSTGRES_USER: minusnow
POSTGRES_PASSWORD: ${DB_PASSWORD:-changeme}
volumes:
- pgdata-primary:/var/lib/postgresql/data
- ./db/init-primary.sh:/docker-entrypoint-initdb.d/init-replication.sh:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U minusnow -d minusnow_prod"]
interval: 10s
timeout: 5s
retries: 5
restart: always
shm_size: '256mb'
command: >
postgres
-c wal_level=replica
-c max_wal_senders=5
-c max_replication_slots=5
-c synchronous_commit=on
-c synchronous_standby_names='replica1'
-c wal_keep_size=1GB
# ── Replica Database ───────────────────
db-replica:
image: postgres:16-alpine
hostname: db-replica
environment:
PGUSER: replicator
PGPASSWORD: ${REPL_PASSWORD:-replpass}
depends_on:
db-primary:
condition: service_healthy
volumes:
- pgdata-replica:/var/lib/postgresql/data
- ./db/init-replica.sh:/docker-entrypoint-initdb.d/init-replica.sh:ro
restart: always
shm_size: '256mb'
volumes:
pgdata-primary:
pgdata-replica:
app-data:
Launch Commands
# Topology A (2 App + 1 DB):
docker compose -f docker-compose.ha.yml up -d --build
# Topology B (2 App + 2 DB):
docker compose -f docker-compose.ha-full.yml up -d --build
# Check status:
docker compose ps
docker compose logs -f app1 app2
# Scaling beyond 2 app nodes: --scale adjusts the replica count of a single
# service, so it cannot add a 3rd node to the named app1/app2 services above.
# Either define one "app" service and run:
#   docker compose -f docker-compose.ha.yml up -d --scale app=3
# (with an LB that resolves replicas via the service name), or use Kubernetes.
☸️ 10. Kubernetes Deployment
For production Kubernetes clusters. Provides auto-healing, rolling updates, and horizontal pod autoscaling.
Deployment Manifest
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: minusnow-app
labels:
app: minusnow
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero-downtime deploys
selector:
matchLabels:
app: minusnow
template:
metadata:
labels:
app: minusnow
spec:
containers:
- name: minusnow
image: minusnow/itsm:26.0
ports:
- containerPort: 5000
env:
- name: NODE_ENV
value: "production"
- name: PORT
value: "5000"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: minusnow-secrets
key: database-url
- name: SESSION_SECRET
valueFrom:
secretKeyRef:
name: minusnow-secrets
key: session-secret
livenessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /api/health
port: 5000
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "2Gi"
Service & Ingress
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
name: minusnow-svc
spec:
selector:
app: minusnow
ports:
- port: 80
targetPort: 5000
type: ClusterIP
---
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: minusnow-ingress
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-read-timeout: "86400"
nginx.ingress.kubernetes.io/proxy-send-timeout: "86400"
nginx.ingress.kubernetes.io/websocket-services: minusnow-svc
spec:
ingressClassName: nginx
tls:
- hosts:
- minusnow.yourdomain.com
secretName: minusnow-tls
rules:
- host: minusnow.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: minusnow-svc
port:
number: 80
Horizontal Pod Autoscaler
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: minusnow-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: minusnow-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Database: Use CloudNativePG or Managed Service
# For PostgreSQL on Kubernetes, recommended options:
#
# Option 1: Managed Database (recommended for production)
# - AWS RDS Multi-AZ
# - Azure Database for PostgreSQL Flexible (HA enabled)
# - Google Cloud SQL with HA
#
# Option 2: CloudNativePG Operator
# kubectl apply -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.22/releases/cnpg-1.22.0.yaml
#
# Then create a PostgreSQL cluster:
# k8s/cnpg-cluster.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: minusnow-db
spec:
instances: 2 # 1 primary + 1 replica
storage:
size: 50Gi
storageClass: gp3
postgresql:
parameters:
max_connections: "200"
synchronous_commit: "on"
bootstrap:
initdb:
database: minusnow_prod
owner: minusnow
📊 11. Health Checks & Monitoring
Application Health Endpoint
# MinusNow exposes /api/health by default. Example response:
{
"status": "ok",
"uptime": 86400,
"database": "connected",
"version": "26.0.1",
"node": "app1"
}
# The LB should poll this endpoint every 10-15 seconds.
# If it returns non-200 or times out, mark the node as down.
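The `max_fails=3` / `fall 3` settings from Section 5 amount to a consecutive-failure counter. A sketch with hard-coded probe results (real `rise 2` needs two consecutive successes; a single success resets here for brevity):

```shell
# mark a node down after 3 consecutive failed probes; a success resets
fails=0; state=up
for code in 200 200 503 503 503 200; do
  if [ "$code" -eq 200 ]; then
    fails=0; state=up
  else
    fails=$((fails + 1))
    if [ "$fails" -ge 3 ]; then state=down; fi
  fi
  echo "probe=$code state=$state"
done
```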
Key Metrics to Monitor
| Metric | Threshold | Alert Level | Where to Check |
| App response time (p95) | > 2s | Warning | LB logs / APM |
| App node count (healthy) | < 2 | Critical | LB health check |
| DB replication lag | > 10 MB | Warning | pg_stat_replication |
| DB replication lag | > 100 MB | Critical | pg_stat_replication |
| DB connections used | > 80% | Warning | pg_stat_activity |
| Disk usage (DB) | > 85% | Critical | df -h |
| Patroni leader status | No leader | Critical | patronictl list |
| CPU usage (any node) | > 90% sustained | Warning | Node exporter / htop |
Replication Lag Monitor Script
#!/bin/bash
# /opt/minusnow/scripts/check-replication.sh
# Run via cron every 1 minute
LAG=$(sudo -u postgres psql -t -c \
"SELECT COALESCE(pg_wal_lsn_diff(sent_lsn, replay_lsn), 0) FROM pg_stat_replication LIMIT 1;" \
2>/dev/null | tr -d ' ')
if [ -z "$LAG" ]; then
echo "CRITICAL: No replication connection"
exit 2
elif [ "$LAG" -gt 104857600 ]; then # 100 MB
echo "CRITICAL: Replication lag ${LAG} bytes"
exit 2
elif [ "$LAG" -gt 10485760 ]; then # 10 MB
echo "WARNING: Replication lag ${LAG} bytes"
exit 1
else
echo "OK: Replication lag ${LAG} bytes"
exit 0
fi
🧪 12. Failover Testing Runbook
Run these tests quarterly to verify your HA setup works. Always test during a maintenance window.
Test 1: App Node Failure
# Simulate App Node 1 crash
ssh app1 "sudo -u minusnow pm2 stop all" # or: docker stop app1 (plain sudo would target root's empty PM2 daemon)
# Verify:
# 1. LB detects failure within 30s
# 2. All traffic routes to App Node 2
# 3. Users experience no errors (maybe brief WebSocket reconnect)
curl -k https://minusnow.yourdomain.com/api/health
# Should still return 200
# Recover:
ssh app1 "sudo -u minusnow pm2 start all"
Test 2: DB Primary Failure (Topology B)
# Simulate DB Primary crash
ssh db-primary "sudo systemctl stop patroni"
# Verify:
# 1. Patroni promotes Replica within 10-30s
# 2. HAProxy reroutes DB connections to new primary
# 3. App nodes reconnect automatically
# 4. Check data integrity
patronictl -c /etc/patroni/patroni.yml list
# pg-node2 should now show as "Leader"
# Test application:
curl -k https://minusnow.yourdomain.com/api/health
# Recover original primary:
ssh db-primary "sudo systemctl start patroni"
# Patroni will reinitialize it as a replica
Test 3: Complete LB Failure
# If using a single LB, this is your remaining SPOF.
# Options to eliminate:
# 1. Use keepalived with a floating VIP between 2 LB nodes
# 2. Use a cloud-managed LB (ALB, Azure App Gateway)
# 3. Use DNS-based failover (Route 53, Cloudflare)
# Test: Stop nginx/haproxy on the LB
ssh lb "sudo systemctl stop nginx"
# Result: Application becomes unreachable
# Mitigation: keepalived floats VIP to backup LB within 3s
Expected Failover Times
| Failure Scenario | Detection | Failover | Total Downtime | Data Loss |
| App node crash | ~10–30s | Instant (LB reroutes) | < 30s | None |
| DB primary crash (Patroni) | ~10s | ~10–20s (promotion) | < 30s | Near-zero (sync replication) |
| DB primary crash (manual) | Monitoring alert | 5–15 min (manual) | 5–15 min | Near-zero |
| LB crash (with keepalived) | ~3s | ~3s (VIP float) | < 5s | None |
| Full data center loss | Immediate | DNS failover | 5–30 min | Depends on replication |
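To put those uptime percentages in perspective, the downtime budget each tier allows in a 30-day month:

```shell
# minutes of allowed downtime per 30-day month for each SLA tier
awk 'BEGIN {
  mins = 30 * 24 * 60                           # 43200 minutes per month
  n = split("99.5 99.9 99.95 99.99", sla, " ")
  for (i = 1; i <= n; i++)
    printf "%s%%  -> %.1f min/month\n", sla[i], mins * (1 - sla[i] / 100)
}'
```

A single failover event should fit comfortably inside the budget of the topology you chose; if it does not, revisit the detection and promotion timeouts above.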
💾 13. Backup Strategy
Backup Matrix
| What | How | Frequency | Retention | Where |
| PostgreSQL full dump | pg_dump / pg_basebackup | Daily (2 AM) | 30 days | Remote NFS / S3 |
| WAL archiving | Continuous (streaming) | Continuous | 7 days | Remote storage |
| Application config | .env, ecosystem.config.cjs | On change | Versioned | Git / Vault |
| Application data (uploads) | rsync / S3 sync | Hourly | 90 days | Remote NFS / S3 |
| Audit logs | rsync / S3 sync | Daily | 7 years | Compliance storage |
Automated Backup Script
#!/bin/bash
# /opt/minusnow/scripts/backup.sh
# Run daily via cron: 0 2 * * * /opt/minusnow/scripts/backup.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/minusnow"
REMOTE_DIR="s3://minusnow-backups/daily"
mkdir -p "${BACKUP_DIR}"
# Database backup
pg_dump -h 10.0.2.10 -U mnow_backup -Fc minusnow_prod \
> "${BACKUP_DIR}/db_${TIMESTAMP}.dump"
# Application data
tar czf "${BACKUP_DIR}/data_${TIMESTAMP}.tar.gz" \
/opt/minusnow/data /opt/minusnow/audit-logs
# Upload to remote
aws s3 cp "${BACKUP_DIR}/db_${TIMESTAMP}.dump" "${REMOTE_DIR}/"
aws s3 cp "${BACKUP_DIR}/data_${TIMESTAMP}.tar.gz" "${REMOTE_DIR}/"
# Cleanup local (keep 7 days)
find "${BACKUP_DIR}" -name "*.dump" -mtime +7 -delete
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +7 -delete
echo "[$(date)] Backup complete: db_${TIMESTAMP}.dump"
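The `-mtime +7` retention rule can be sanity-checked against a scratch directory before you trust it with real backups (GNU `touch -d` assumed, as on the Ubuntu/RHEL hosts this guide targets):

```shell
# simulate the cleanup step on throwaway files
dir=$(mktemp -d)
touch "$dir/db_new.dump"
touch -d '8 days ago' "$dir/db_old.dump"   # simulate a stale backup
find "$dir" -name "*.dump" -mtime +7 -delete
ls "$dir"                                  # only db_new.dump survives
rm -rf "$dir"
```

`-mtime +7` matches files strictly older than 7 whole days, so a 7-day-old backup is kept and an 8-day-old one is deleted.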
🔧 14. Troubleshooting
14.1 App Nodes Not Joining LB Pool
# Check if the health endpoint responds:
curl http://10.0.1.11:5000/api/health
# Check Nginx upstream status:
sudo nginx -T | grep -A5 "upstream minusnow"
# Check HAProxy stats (requires a "stats socket /var/run/haproxy/admin.sock"
# line in the global section of haproxy.cfg):
echo "show stat" | socat stdio /var/run/haproxy/admin.sock
# Common fix: ensure PORT=5000, firewall allows :5000
14.2 Session Lost When Switching Nodes
# Cause: Different SESSION_SECRET on each app node
# Fix: Ensure SESSION_SECRET is identical across all app nodes
# Verify:
ssh app1 "grep SESSION_SECRET /opt/minusnow/.env"
ssh app2 "grep SESSION_SECRET /opt/minusnow/.env"
# Both must output the same value
14.3 Replication Lag Increasing
# Check current lag:
sudo -u postgres psql -c "SELECT client_addr, state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;"
# Common causes:
# 1. Slow disk I/O on replica → upgrade to NVMe
# 2. Long-running queries on replica → cancel them
# 3. Network saturation → check bandwidth between DB nodes
# 4. max_wal_senders too low → increase in postgresql.conf
14.4 Patroni Won't Start
# Check logs:
sudo journalctl -u patroni -n 50
# Common issues:
# 1. etcd not reachable → verify etcd cluster health:
etcdctl endpoint health
# 2. Data directory permissions:
ls -la /var/lib/postgresql/16/main/
sudo chown -R postgres:postgres /var/lib/postgresql/16/main/
# 3. Port conflict with standalone PostgreSQL:
sudo systemctl stop postgresql
sudo systemctl start patroni
14.5 Split-Brain Scenario
Split-brain occurs when both DB nodes think they are primary. This can cause data corruption.
# Prevention:
# 1. Use synchronous_commit = on (data safety)
# 2. maximum_lag_on_failover in Patroni (prevents stale promotion)
# 3. etcd quorum (prevents isolated node from promoting)
# Detection:
patronictl -c /etc/patroni/patroni.yml list
# If 2 "Leader" entries appear → immediate action required
# Resolution:
# 1. Stop one of the two primaries immediately
# 2. Identify which has the latest data (check pg_current_wal_lsn())
# 3. Demote the stale one: patronictl reinit minusnow-cluster <stale-node>
📋 15. Quick Reference
Architecture Decision Matrix
| Factor | Topology A (2+1) | Topology B (2+2) | Kubernetes |
| Setup complexity | Low | Medium | High |
| App-tier HA | ✅ | ✅ | ✅ |
| DB failover | ❌ Manual/Backup | ✅ Automatic | ✅ (managed DB or CNP) |
| Zero-downtime deploy | ✅ Rolling | ✅ Rolling | ✅ Rolling |
| Auto-scaling | ❌ | ❌ | ✅ HPA |
| Min servers | 3 | 4 | 3+ node K8s cluster |
| Best for | 50–500 users | 500+ users | Cloud-native teams |
Essential Commands
| Action | Command |
| Check app health | curl http://<app-ip>:5000/api/health |
| Check cluster through LB | curl -k https://minusnow.yourdomain.com/api/health |
| View replication status | sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;" |
| Patroni cluster status | patronictl -c /etc/patroni/patroni.yml list |
| Manual DB failover | patronictl -c /etc/patroni/patroni.yml switchover |
| Rolling restart app | pm2 reload minusnow (on each node sequentially) |
| Docker HA start | docker compose -f docker-compose.ha.yml up -d |
| K8s scale app | kubectl scale deployment minusnow-app --replicas=4 |