Building a Self-Healing Deployment Pipeline

Deployments break. Your site goes down at 2 AM. Users see generic errors while you scramble to roll back manually. Here's how I built a deployment system that heals itself automatically.

The Problem

Traditional deployments can fail quietly. Code deploys, then breaks production hours later. Users land on bare Nginx error pages. You manually SSH in to fix it. There's a better way.

The Solution

1. Branded Error Pages

Dual-layer error handling:
- Django templates when Django works
- Static Nginx fallbacks when Django crashes
- Always branded, never generic

2. Health Checks with Timeouts

Four checks after each deployment:
- HTTP Response (30s max): Site responding?
- Gunicorn Service (10s): App server running?
- Django Check (20s): Configuration valid?
- Static Files (30s): Assets loading?

Timeouts prevent hanging. Fail fast, roll back immediately.
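To make the fail-fast idea concrete, here is a minimal Python sketch of a check runner that gives every check its own timeout and stops at the first failure. It is an illustration under assumptions, not the pipeline's actual code: the `Check` class, `run_checks`, `DEPLOY_CHECKS`, and the static asset URL are hypothetical names chosen for this example.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Check:
    name: str
    command: List[str]
    timeout_s: float


def run_checks(checks: List[Check]) -> Tuple[bool, str]:
    """Run checks in order and stop at the first failure or timeout."""
    for check in checks:
        try:
            result = subprocess.run(
                check.command,
                timeout=check.timeout_s,  # kills the child if it hangs
                capture_output=True,
            )
        except subprocess.TimeoutExpired:
            return False, f"{check.name}: timed out after {check.timeout_s}s"
        except OSError as exc:
            # The command itself could not be started (e.g. binary missing)
            return False, f"{check.name}: could not run ({exc})"
        if result.returncode != 0:
            return False, f"{check.name}: exit code {result.returncode}"
    return True, "all checks passed"


# Hypothetical commands mirroring the four checks; the static asset
# path is an assumption, so substitute a file your site really serves.
DEPLOY_CHECKS = [
    Check("http", ["curl", "--fail", "--silent", "--output", "/dev/null",
                   "https://site.com"], 30),
    Check("gunicorn", ["systemctl", "is-active", "--quiet", "gunicorn"], 10),
    Check("django", [sys.executable, "manage.py", "check"], 20),
    Check("static", ["curl", "--fail", "--silent", "--output", "/dev/null",
                     "https://site.com/static/css/site.css"], 30),
]
```

Calling `run_checks(DEPLOY_CHECKS)` returns a pass/fail flag plus a short reason, which a deploy script could use to decide whether to trigger the rollback.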

3. Last-Known-Good Rollback

The clever part: we don't roll back to the previous commit, we roll back to the last working commit.

Every successful deployment saves the commit SHA. On failure, restore that exact version. Two broken commits in a row? Still safe.
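The bookkeeping behind this can be sketched in a few lines of Python. This is a simplified model under assumptions, not the real pipeline: the function names and the example SHAs are hypothetical, and the state file stands in for the `.last_good_commit` file used by the workflow.

```python
import tempfile
from pathlib import Path
from typing import Optional


def record_good_commit(state_file: Path, sha: str) -> None:
    """Persist the SHA of a deployment that passed all health checks."""
    state_file.write_text(sha + "\n")


def rollback_target(state_file: Path) -> Optional[str]:
    """Return the last verified SHA, or None if none was ever recorded."""
    if not state_file.exists():
        return None
    return state_file.read_text().strip() or None


def finish_deploy(state_file: Path, sha: str, healthy: bool) -> Optional[str]:
    """Return the SHA that should be live after this deploy attempt."""
    if healthy:
        record_good_commit(state_file, sha)
        return sha
    # Unhealthy: ignore the commit history and use the recorded state
    return rollback_target(state_file)


# Usage: one good deploy followed by two broken ones (made-up SHAs)
state = Path(tempfile.mkdtemp()) / ".last_good_commit"
live_1 = finish_deploy(state, "a1b2c3", healthy=True)   # "a1b2c3"
live_2 = finish_deploy(state, "d4e5f6", healthy=False)  # "a1b2c3"
live_3 = finish_deploy(state, "0718ab", healthy=False)  # "a1b2c3"
```

Because the state file is only written after a healthy deploy, the second broken commit still resolves to `a1b2c3`. A `HEAD~1` rollback would have landed on the first broken commit instead.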

Visual: The Deployment Flow

Impact

Before: 5-10 minute manual rollbacks. Users saw errors.

After: 30-second automatic rollback. Branded error pages during recovery.

The Code

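# GitHub Actions workflow
- name: Health Check
  run: |
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
    curl --fail --silent --show-error --max-time 30 --output /dev/null https://site.com
    timeout 10 systemctl is-active --quiet gunicorn
    timeout 20 python manage.py check

- name: Record Last-Known-Good Commit
  # Runs only if every earlier step, including the health check, passed
  run: git rev-parse HEAD > .last_good_commit

- name: Rollback
  if: failure()
  run: |
    # Restore the last verified commit, not simply the previous one
    GOOD=$(cat .last_good_commit)
    git reset --hard "$GOOD"
    systemctl restart gunicorn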

Key Lessons

- Fail fast: Time out hung services immediately.
- Track working state: Not just HEAD~1, but the last verified working commit.
- Brand errors: Users see your brand even during failures.
- Test rollbacks: Your rollback is production code too.

Built with Django, GitHub Actions, and defensive programming. Deployments should heal themselves.