# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Authentication |
|----------|---------|----------------|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint**: `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | All checks passed, or only non-critical checks failed |
| 503 | Critical service failure (database, etc.) |

### Simple Health Check

**Endpoint**: `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Service healthy |
| 503 | Service unhealthy |

### Performance Metrics

**Endpoint**: `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Metrics returned |
| 403 | Not available (`DEBUG=False`) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
# Test query executed
cursor.execute("SELECT 1")
```

**Critical**: Yes (503 is returned if it fails)

### Cache Check

Verifies Redis connectivity and operation.

**Critical**: No (200 is returned with a warning if it fails)

### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)

### Memory Usage Check

Monitors available memory.

**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)

## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Nginx Upstream Health Check

```nginx
upstream django {
    zone django 64k;  # shared memory zone, required for active health checks
    server web:8000;
}

# NGINX Plus only: health_check goes in the location that proxies to the upstream
location / {
    proxy_pass http://django;
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}
```

## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics using django-prometheus:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|--------|---------|----------|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |

### Grafana Dashboard

Import the health check dashboard:

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API.
            # `cloudflare_service` is assumed to be the project's Cloudflare
            # client; its import is not shown in the original example.
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

Make sure the app that defines the check is listed in `INSTALLED_APPS`, and set the built-in thresholds in Django settings:

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,
    'MEMORY_MIN': 100,
}
```

## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:
   ```bash
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:
   ```bash
   redis-cli ping
   ```

3. Review application logs:
   ```bash
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance by timing a trivial query:
   ```bash
   uv run manage.py shell -c "import time; from django.db import connection; start = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'{(time.perf_counter() - start) * 1000:.1f} ms')"
   ```

2. Check cache response time:
   ```bash
   redis-cli --latency
   ```

### Missing Metrics

Ensure `psutil` is installed for system metrics:

```bash
uv add psutil
```

## Best Practices

1. **Use the simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast.
2. **Monitor the comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards.
3. **Set appropriate timeouts**: Health check timeouts should be shorter than the probe intervals.
4. **Alert on degraded state**: Don't wait for complete failure.
5. **Log health check failures**: Include health status in application logs.
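
Practices 2 and 4 can be combined by polling `/api/v1/health/` and treating a failed non-critical check as a degraded state instead of waiting for a 503. The following is a minimal sketch, not ThrillWiki's own implementation. The `classify_health` function and the `ok`/`degraded`/`critical` labels are hypothetical names introduced here, and the code assumes the payload shape from the comprehensive response example (a `checks` mapping whose entries carry `status`, `critical`, and `errors`). The failing cache entry and its error message are made up for illustration.

```python
import json
from typing import Any


def classify_health(payload: dict[str, Any]) -> str:
    """Return 'ok', 'degraded', or 'critical' for a comprehensive health payload.

    Hypothetical helper: assumes each entry in payload["checks"] has a
    "status" string, a "critical" boolean, and an "errors" list.
    """
    # A check counts as failed if its status is not "healthy" or it reports errors.
    failed = [
        check
        for check in payload.get("checks", {}).values()
        if check.get("status") != "healthy" or check.get("errors")
    ]
    if any(check.get("critical", False) for check in failed):
        return "critical"
    if failed:
        return "degraded"
    return "ok"


# Payload trimmed from the documented response format, with the
# non-critical cache check failing (illustrative values only).
sample = json.loads("""
{
  "status": "healthy",
  "checks": {
    "DatabaseBackend": {"status": "healthy", "critical": true, "errors": []},
    "CacheBackend": {"status": "unhealthy", "critical": false,
                     "errors": ["Redis connection refused"]}
  }
}
""")

print(classify_health(sample))  # degraded
```

In a real poller the payload would come from an HTTP GET against the endpoint. A `degraded` result is a natural trigger for the warning-level alerts listed under Alerting Thresholds, while `critical` corresponds to the cases where the endpoint itself returns 503.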