Deployments break. Your site goes down at 2 AM. Users see generic errors while you scramble to roll back manually. Here's how I built a deployment system that heals itself automatically.
The Problem
Traditional deployments can fail quietly. Code deploys, then breaks production hours later. Users land on bare Nginx error pages. You manually SSH in to fix it. There's a better way.
The Solution
1. Branded Error Pages
Dual-layer error handling:
- Django templates when Django works
- Static Nginx fallbacks when Django crashes
- Always branded, never generic
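The static layer has to exist on disk before Django is ever involved, so one option is to pre-render the fallback pages at build time. Here's a minimal sketch of that idea. The brand name, messages, output directory, and `render_fallback_pages` helper are all hypothetical, and it assumes Nginx is pointed at the generated files through its `error_page` directive.

```python
from pathlib import Path
from string import Template

# Hypothetical branded layout; real pages would carry your own markup and inline CSS.
PAGE = Template("""<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>$code | $brand</title></head>
<body>
  <h1>$brand</h1>
  <p>$message</p>
</body>
</html>
""")

# Status codes Nginx answers with on its own when the app server is unreachable.
MESSAGES = {
    502: "We're restarting. Please try again in a moment.",
    503: "We're down for maintenance and will be back shortly.",
    504: "That took too long. Please try again.",
}


def render_fallback_pages(out_dir, brand):
    """Write one self-contained HTML file per status code and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for code, message in MESSAGES.items():
        path = out / f"{code}.html"
        html = PAGE.substitute(code=code, brand=brand, message=message)
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written
```

Keeping styles inline in these pages means they still render when the static file pipeline is the thing that broke.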
2. Health Checks with Timeouts
Four checks after each deployment:
- HTTP Response (30s max): Site responding?
- Gunicorn Service (10s): App server running?
- Django Check (20s): Configuration valid?
- Static Files (30s): Assets loading?
Timeouts prevent hanging. Fail fast, rollback immediately.
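If you'd rather drive the checks from Python than from shell, the same fail-fast idea might look roughly like this. It's a sketch, not the workflow I run: the helper names are made up, and the static asset path in `deployment_checks` is a placeholder you'd swap for a file your site really serves.

```python
import http.client
import subprocess
import urllib.request


def command_ok(cmd, timeout):
    """Return True if cmd exits 0 within timeout seconds; a hang counts as failure."""
    try:
        result = subprocess.run(cmd, timeout=timeout, capture_output=True)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # TimeoutExpired: the child was killed; OSError: the binary is missing.
        return False


def http_ok(url, timeout):
    """Return True if url answers with a 2xx status.

    Note: urlopen's timeout bounds each blocking socket operation,
    not the total request time.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # HTTP 4xx/5xx raise HTTPError, which is an OSError subclass.
        return False


def run_checks(checks):
    """Run (name, callable) pairs in order; return the first failing name, or None."""
    for name, check in checks:
        if not check():
            return name
    return None


def deployment_checks(base_url):
    """Build the four checks with the timeouts listed above (asset path is hypothetical)."""
    return [
        ("http", lambda: http_ok(base_url, 30)),
        ("gunicorn", lambda: command_ok(["systemctl", "is-active", "gunicorn"], 10)),
        ("django", lambda: command_ok(["python", "manage.py", "check"], 20)),
        ("static", lambda: http_ok(base_url + "/static/css/site.css", 30)),
    ]
```

Calling `run_checks(deployment_checks("https://site.com"))` stops at the first failure, so a dead HTTP endpoint triggers the rollback without waiting on the remaining checks.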
3. Last-Known-Good Rollback
The clever part: we don't roll back to the previous commit; we roll back to the last working commit.
Every successful deployment saves the commit SHA. On failure, restore that exact version. Two broken commits in a row? Still safe.
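The bookkeeping behind this is small. Below is a sketch of how the state file could be managed, assuming a one-line file holding the SHA (the workflow in this post uses `.last_good_commit`). The function names are illustrative, not part of my actual setup.

```python
import os
import tempfile
from pathlib import Path


def record_good_commit(state_file, sha):
    """Atomically store sha as the last verified-good commit."""
    state = Path(state_file)
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write never leaves a truncated SHA behind.
    fd, tmp = tempfile.mkstemp(dir=str(state.parent), prefix=state.name)
    with os.fdopen(fd, "w") as handle:
        handle.write(sha + "\n")
    os.replace(tmp, state)


def last_good_commit(state_file):
    """Return the stored SHA, or None if no deployment has ever passed."""
    try:
        return Path(state_file).read_text().strip() or None
    except FileNotFoundError:
        return None


def finish_deploy(state_file, sha, healthy):
    """Record sha if healthy; otherwise return the commit to roll back to."""
    if healthy:
        record_good_commit(state_file, sha)
        return None
    return last_good_commit(state_file)
```

Because the file only changes after a deployment passes its checks, two failed deploys in a row both resolve to the same SHA, which `HEAD~1` would not give you.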
Visual: The Deployment Flow
Impact
Before: 5-10 minute manual rollbacks. Users saw errors.
After: 30-second automatic rollback. Branded error pages during recovery.
The Code
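# GitHub Actions workflow
- name: Health Check
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx, not only on connection errors
    curl --fail --silent --show-error --max-time 30 https://site.com
    timeout 10 systemctl is-active gunicorn
    timeout 20 python manage.py check
- name: Rollback
  if: failure()
  run: |
    # Restore the last commit that passed its health checks
    GOOD=$(cat .last_good_commit)
    git reset --hard "$GOOD"
    systemctl restart gunicorn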
Key Lessons
Fail fast: Time out hung services immediately.
Track working state: Not just HEAD~1, but the last verified working commit.
Brand errors: Users see your brand even during failures.
Test rollbacks: Your rollback is production code too.
Built with Django, GitHub Actions, and defensive programming. Deployments should heal themselves.