
# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Authentication |
|---|---|---|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint:** `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | All checks passed or only non-critical failures |
| 503 | Critical service failure (database, etc.) |
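The 200/503 decision hinges on whether any *critical* check failed. A minimal, framework-free sketch of that logic (the function name and input shape are illustrative, not ThrillWiki's actual implementation):

```python
def overall_status_code(checks: dict) -> int:
    """Return 503 only when a critical check reports errors;
    non-critical failures still return 200."""
    for result in checks.values():
        if result["critical"] and result["errors"]:
            return 503
    return 200

# A degraded-but-available service: cache down, database up
checks = {
    "DatabaseBackend": {"critical": True, "errors": []},
    "CacheBackend": {"critical": False, "errors": ["timeout"]},
}
print(overall_status_code(checks))  # -> 200
```
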

### Simple Health Check

**Endpoint:** `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | Service healthy |
| 503 | Service unhealthy |
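The two payloads above reduce to a pure function of the database status. A framework-free illustration (the function name is an assumption, not the actual view):

```python
from datetime import datetime, timezone

def simple_health_response(db_ok: bool, error: str = "unhealthy"):
    """Build the (status_code, body) pair for the simple endpoint."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if db_ok:
        return 200, {"status": "ok", "timestamp": now}
    return 503, {"status": "error", "error": error, "timestamp": now}
```
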

### Performance Metrics

**Endpoint:** `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | Metrics returned |
| 403 | Not available (`DEBUG=False`) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
# Test query executed by the database check
cursor.execute("SELECT 1")
```

**Critical:** Yes (a 503 is returned on failure)

### Cache Check

Verifies Redis connectivity and operation.

**Critical:** No (a 200 with a warning is returned on failure)
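A cache check is essentially a set/get round-trip on the backend. A framework-free sketch against any dict-like backend (Django's cache proxy uses `cache.set`/`cache.get` instead; names here are illustrative):

```python
import uuid

def check_cache(cache) -> list:
    """Round-trip a sentinel through a dict-like cache backend.
    Returns a list of error strings; an empty list means healthy."""
    errors = []
    key = f"health-check-{uuid.uuid4()}"
    try:
        cache[key] = "ok"
        if cache.get(key) != "ok":
            errors.append("Cache set/get round-trip failed")
    except Exception as exc:
        errors.append(f"Cache unavailable: {exc}")
    return errors

print(check_cache({}))  # healthy in-memory stand-in -> []
```
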

### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold:** Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)

### Memory Usage Check

Monitors available memory.

**Threshold:** Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100 MB)
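The disk threshold logic can be sketched with the standard library alone (the memory side would use `psutil.virtual_memory()`; the function name below is illustrative, not ThrillWiki's implementation):

```python
import shutil

def disk_usage_ok(path: str = "/", max_percent: float = 90.0) -> bool:
    """True when used space on the filesystem containing `path`
    is at or below `max_percent` (cf. HEALTH_CHECK_DISK_USAGE_MAX)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 <= max_percent
```
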

## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Nginx Upstream Health Check

```nginx
upstream django {
    server web:8000;
    # The active health_check directive requires NGINX Plus
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}
```

## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics using `django-prometheus`:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|---|---|---|
| Response time | > 1 s | > 5 s |
| Database query time | > 100 ms | > 500 ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |
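These thresholds are straightforward to encode in an alerting rule. A sketch for the higher-is-worse metrics (cache hit rate inverts the comparison and is omitted; the names and table shape are illustrative):

```python
# (warning, critical) thresholds where higher values are worse
THRESHOLDS = {
    "response_time_ms": (1000, 5000),
    "db_query_time_ms": (100, 500),
    "disk_usage_pct": (80, 90),
    "memory_usage_pct": (80, 90),
}

def severity(metric: str, value: float) -> str:
    """Classify a metric sample as ok / warning / critical."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"

print(severity("disk_usage_pct", 85))  # -> warning
```
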

### Grafana Dashboard

Import the health check dashboard:

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# Assumed application service wrapping the Cloudflare Images API
from apps.media.services import cloudflare_service


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API connectivity
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

Configure check thresholds via the `HEALTH_CHECK` setting in Django settings:

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,   # percent
    'MEMORY_MIN': 100,      # MB
}
```
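With `django-health-check`, a custom backend also needs to be registered with the plugin registry, typically in an `AppConfig.ready()` hook (module paths below are assumptions):

```python
# backend/apps/core/apps.py (illustrative path)
from django.apps import AppConfig

class CoreConfig(AppConfig):
    name = "apps.core"

    def ready(self):
        # Register the custom backend with django-health-check's plugin registry
        from health_check.plugins import plugin_dir
        from .health_checks.custom_checks import CloudflareImagesHealthCheck
        plugin_dir.register(CloudflareImagesHealthCheck)
```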

## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:

   ```shell
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:

   ```shell
   redis-cli ping
   ```

3. Review application logs:

   ```shell
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance:

   ```shell
   uv run manage.py shell -c "from django.db import connection; print(connection.ensure_connection())"
   ```

2. Check cache response time:

   ```shell
   redis-cli --latency
   ```
    

### Missing Metrics

Ensure `psutil` is installed for system metrics:

```shell
uv add psutil
```

## Best Practices

1. **Use the simple endpoint for load balancers:** the `/simple/` endpoint is lightweight and fast.
2. **Monitor the comprehensive endpoint:** use `/health/` for detailed monitoring dashboards.
3. **Set appropriate timeouts:** health check timeouts should be shorter than check intervals.
4. **Alert on degraded state:** don't wait for complete failure.
5. **Log health check failures:** include health status in application logs.