# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Access |
|----------|---------|----------------|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint**: `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | All checks passed, or only non-critical checks failed |
| 503 | At least one critical check failed (for example, the database) |

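The status code rule above can be sketched from the client side as follows. This is a minimal illustration, not ThrillWiki's implementation: `summarize_health` is a hypothetical helper, and the `"degraded"` and `"unhealthy"` status values in the sample payload are assumptions, since only `"healthy"` appears in the documented response.

```python
import json


def summarize_health(payload: dict) -> tuple[int, list[str]]:
    """Return the status code implied by the payload and the failing check names."""
    failing = []
    critical_failure = False
    for name, check in payload.get("checks", {}).items():
        # Treat a check as failing if it is not "healthy" or reports any errors.
        if check.get("status") != "healthy" or check.get("errors"):
            failing.append(name)
            if check.get("critical"):
                critical_failure = True
    return (503 if critical_failure else 200), failing


# Sample payload; the non-"healthy" status strings are assumed for illustration.
sample = json.loads("""
{
  "status": "degraded",
  "checks": {
    "DatabaseBackend": {"status": "healthy", "critical": true, "errors": []},
    "CacheBackend": {"status": "unhealthy", "critical": false,
                     "errors": ["Connection refused"]}
  }
}
""")

code, failing = summarize_health(sample)
print(code, failing)
```

A failing cache alone keeps the implied code at 200, while any failing check marked `critical` moves it to 503.
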
### Simple Health Check

**Endpoint**: `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Service healthy |
| 503 | Service unhealthy |

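For scripts that poll this endpoint outside a load balancer, a probe along the following lines may be enough. It is a sketch using only the standard library; `interpret` and `probe` are hypothetical helper names, and the URL in the usage comment is a placeholder.

```python
import json
import urllib.error
import urllib.request


def _parse(raw: bytes) -> dict:
    """Decode a JSON object, returning {} for anything unparseable."""
    try:
        data = json.loads(raw.decode("utf-8"))
    except ValueError:
        return {}
    return data if isinstance(data, dict) else {}


def interpret(status_code: int, raw_body: bytes) -> tuple[bool, str]:
    """Map a status code and body from the simple endpoint to (healthy, detail)."""
    body = _parse(raw_body)
    if status_code == 200 and body.get("status") == "ok":
        return True, "ok"
    # Prefer the documented "error" field; fall back to the status code.
    return False, str(body.get("error", f"HTTP {status_code}"))


def probe(url: str, timeout: float = 3.0) -> tuple[bool, str]:
    """Call the endpoint once and report (healthy, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret(resp.status, resp.read())
    except urllib.error.HTTPError as exc:
        # urllib raises for 503, but the response body is still readable.
        return interpret(exc.code, exc.read())
    except OSError as exc:
        # Connection refused, DNS failure, timeout, and similar.
        return False, f"unreachable: {exc}"


# Usage (placeholder host):
# healthy, detail = probe("http://localhost:8000/api/v1/health/simple/")
```

Keeping the interpretation separate from the network call makes the decision logic easy to unit test without a running server.
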
### Performance Metrics

**Endpoint**: `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Metrics returned |
| 403 | Not available (DEBUG=False) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
from django.db import connection
# Equivalent of the test query the check runs
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
```

**Critical**: Yes (a failure returns 503)

### Cache Check

Verifies Redis connectivity and operation.

**Critical**: No (a failure is reported in the response, but the endpoint still returns 200)

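This document does not specify how the cache check verifies "operation". One common approach, sketched here as an assumption, is a set/get round trip on a throwaway key. The function accepts any object with Django-style `set`, `get`, and `delete` methods; `cache_round_trip_ok` and `DictCache` are hypothetical names used only for illustration.

```python
import uuid


def cache_round_trip_ok(cache) -> bool:
    """Write a throwaway key, read it back, and clean up."""
    key = f"health-check:{uuid.uuid4().hex}"
    try:
        cache.set(key, "ok", 10)  # 10-second expiry in case cleanup fails
        return cache.get(key) == "ok"
    except Exception:
        # Any backend error (connection refused, timeout) counts as a failure.
        return False
    finally:
        try:
            cache.delete(key)
        except Exception:
            pass


class DictCache:
    """In-memory stand-in for a cache backend, for demonstration only."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, timeout=None):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


print(cache_round_trip_ok(DictCache()))
```

In a Django project the same function could be called with `django.core.cache.cache`, which exposes these three methods.
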
### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)

### Memory Usage Check

Monitors available memory.

**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)

## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Nginx Upstream Health Check

```nginx
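upstream django {
    zone django 64k;  # shared memory zone, required for active health checks
    server web:8000;
}

# Inside a server block (NGINX Plus only)
location / {
    proxy_pass http://django;
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}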
```

## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics with `prometheus_client`, the library that django-prometheus builds on:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|--------|---------|----------|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |

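The table can be encoded directly for use in an alerting script. The sketch below is one possible encoding; the metric keys and the `severity` helper are hypothetical names, and the numbers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True


# Values taken from the recommended thresholds table.
THRESHOLDS = {
    "response_time_s": Threshold(1.0, 5.0),
    "database_query_time_ms": Threshold(100.0, 500.0),
    "cache_hit_rate_pct": Threshold(80.0, 50.0, higher_is_worse=False),
    "disk_usage_pct": Threshold(80.0, 90.0),
    "memory_usage_pct": Threshold(80.0, 90.0),
}


def severity(metric: str, value: float) -> str:
    """Classify a metric value as "ok", "warning", or "critical"."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        # For hit rates, lower values are worse.
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"


print(severity("cache_hit_rate_pct", 94.5))
print(severity("disk_usage_pct", 85))
```

Cache hit rate is the only metric in the table where lower values are worse, which is why the threshold carries a direction flag.
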
### Grafana Dashboard

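A minimal dashboard definition to import. It assumes `probe_success` is provided by the Prometheus blackbox exporter and that a `thrillwiki_health_response_time_ms` metric is exported: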

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # `cloudflare_service` is a project-specific client; import it
            # from wherever your Cloudflare Images integration lives.
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

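With django-health-check, custom backends are typically registered by calling `plugin_dir.register()` from an app's `AppConfig.ready()` method. Thresholds for the built-in disk and memory checks are set through the `HEALTH_CHECK` setting: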

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,
    'MEMORY_MIN': 100,
}
```

## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:
   ```bash
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:
   ```bash
   redis-cli ping
   ```

3. Review application logs:
   ```bash
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance:
   ```bash
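   # Times a trivial query, including connection setup
   uv run manage.py shell -c "import time; from django.db import connection; start = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'SELECT 1 took {(time.perf_counter() - start) * 1000:.1f} ms')"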
   ```

2. Check cache response time:
   ```bash
   redis-cli --latency
   ```

### Missing Metrics

Ensure `psutil` is installed for system metrics:

```bash
uv add psutil
```

## Best Practices

1. **Use the simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast
2. **Monitor the comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards
3. **Set appropriate timeouts**: Health check timeouts should be shorter than intervals
4. **Alert on degraded state**: Don't wait for complete failure
5. **Log health check failures**: Include health status in application logs