Add secret management guide, client-side performance monitoring, and search accessibility enhancements

- Introduced a comprehensive Secret Management Guide detailing best practices, secret classification, development setup, production management, rotation procedures, and emergency protocols.
- Implemented a client-side performance monitoring script to track various metrics including page load performance, paint metrics, layout shifts, and memory usage.
- Enhanced search accessibility with keyboard navigation support for search results, ensuring compliance with WCAG standards and improving user experience.
2025-12-24 17:31:09 -05:00 · 2025-12-23 16:41:42 -05:00
parent ae31e889d7
commit edcd8f2076
155 changed files with 22046 additions and 4645 deletions
--- a/docs/HEALTH_CHECKS.md
+++ b/docs/HEALTH_CHECKS.md
@@ -0,0 +1,414 @@
+# ThrillWiki Health Check Documentation
+
+This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.
+
+## Overview
+
+ThrillWiki provides three health check endpoints with different levels of detail:
+
+| Endpoint | Purpose | Authentication |
+|----------|---------|----------------|
+| `/api/v1/health/` | Comprehensive health check | Public |
+| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
+| `/api/v1/health/performance/` | Performance metrics | Debug mode only |
+
+## Endpoint Details
+
+### Comprehensive Health Check
+
+**Endpoint**: `GET /api/v1/health/`
+
+Returns detailed health information including system metrics, database status, cache status, and individual health checks.
+
+#### Response Format
+
+```json
+{
+  "status": "healthy",
+  "timestamp": "2025-12-23T10:30:00Z",
+  "version": "1.0.0",
+  "environment": "production",
+  "response_time_ms": 45.23,
+  "checks": {
+    "DatabaseBackend": {
+      "status": "healthy",
+      "critical": true,
+      "errors": [],
+      "response_time_ms": 12.5
+    },
+    "CacheBackend": {
+      "status": "healthy",
+      "critical": false,
+      "errors": [],
+      "response_time_ms": 2.1
+    },
+    "DiskUsage": {
+      "status": "healthy",
+      "critical": false,
+      "errors": []
+    },
+    "MemoryUsage": {
+      "status": "healthy",
+      "critical": false,
+      "errors": []
+    }
+  },
+  "metrics": {
+    "cache": {
+      "redis": {
+        "used_memory": "45.2MB",
+        "connected_clients": 15,
+        "hit_rate": 94.5
+      }
+    },
+    "database": {
+      "vendor": "postgresql",
+      "connection_status": "connected",
+      "test_query_time_ms": 1.2,
+      "active_connections": 8,
+      "cache_hit_ratio": 98.3
+    },
+    "system": {
+      "debug_mode": false,
+      "memory": {
+        "total_mb": 8192,
+        "available_mb": 4096,
+        "percent_used": 50
+      },
+      "cpu": {
+        "percent_used": 25.3,
+        "core_count": 4
+      },
+      "disk": {
+        "total_gb": 100,
+        "free_gb": 45,
+        "percent_used": 55
+      }
+    }
+  }
+}
+```
+
+#### Status Codes
+
+| Code | Meaning |
+|------|---------|
+| 200 | All checks passed, or only non-critical checks failed |
+| 503 | Critical service failure (database, etc.) |
+
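The status-code rule in the table above can be sketched in Python. This is a minimal illustration, not ThrillWiki's actual view code: the helper name `http_status_for` is hypothetical, and it assumes each check looks like the entries in the `checks` object of the sample response (a `critical` flag plus an `errors` list).

```python
from typing import Any, Mapping


def http_status_for(checks: Mapping[str, Mapping[str, Any]]) -> int:
    """Return 503 if any critical check reports errors, otherwise 200.

    Hypothetical helper: `checks` mirrors the "checks" object in the
    sample response (each entry has "critical" and "errors" keys).
    """
    for result in checks.values():
        if result.get("critical") and result.get("errors"):
            return 503
    return 200


if __name__ == "__main__":
    sample = {
        "DatabaseBackend": {"critical": True, "errors": []},
        "CacheBackend": {"critical": False, "errors": ["Redis unreachable"]},
    }
    # A failing non-critical check does not change the HTTP status.
    print(http_status_for(sample))
```

Keeping the decision in one small function makes it easy to unit test the "non-critical failures still return 200" behavior separately from the checks themselves.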
+### Simple Health Check
+
+**Endpoint**: `GET /api/v1/health/simple/`
+
+Lightweight health check designed for load balancer health probes.
+
+#### Response Format (Healthy)
+
+```json
+{
+  "status": "ok",
+  "timestamp": "2025-12-23T10:30:00Z"
+}
+```
+
+#### Response Format (Unhealthy)
+
+```json
+{
+  "status": "error",
+  "error": "Database connection failed",
+  "timestamp": "2025-12-23T10:30:00Z"
+}
+```
+
+#### Status Codes
+
+| Code | Meaning |
+|------|---------|
+| 200 | Service healthy |
+| 503 | Service unhealthy |
+
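A client that polls the simple endpoint might interpret the two response shapes above as follows. This is a sketch only: `interpret_simple_health` is a hypothetical helper, and it assumes the response fields are exactly those shown in the sample payloads.

```python
import json
from typing import Tuple


def interpret_simple_health(status_code: int, body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a response from the simple endpoint.

    Hypothetical client-side helper; field names follow the sample
    responses in this document.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "response body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "unexpected response shape"
    if status_code == 200 and payload.get("status") == "ok":
        return True, "ok"
    # Fall back to the HTTP status when no "error" field is present.
    return False, str(payload.get("error", f"HTTP {status_code}"))
```

Treating an unparseable body as unhealthy is a deliberate choice here: a proxy error page returned in place of the JSON payload should not count as a passing probe.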
+### Performance Metrics
+
+**Endpoint**: `GET /api/v1/health/performance/`
+
+Detailed performance metrics for debugging (only available when `DEBUG=True`).
+
+#### Response Format
+
+```json
+{
+  "timestamp": "2025-12-23T10:30:00Z",
+  "database_analysis": {
+    "total_queries": 0,
+    "query_analysis": {}
+  },
+  "cache_performance": {
+    "redis": {
+      "used_memory": "45.2MB",
+      "hit_rate": 94.5
+    }
+  },
+  "recent_slow_queries": []
+}
+```
+
+#### Status Codes
+
+| Code | Meaning |
+|------|---------|
+| 200 | Metrics returned |
+| 403 | Not available (DEBUG=False) |
+
+## Health Checks Included
+
+### Database Check
+
+Verifies PostgreSQL connectivity by executing a simple query.
+
+```python
+# Test query executed
+cursor.execute("SELECT 1")
+```
+
+**Critical**: Yes (returns 503 if the check fails)
+
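The database check described above amounts to timing a trivial query and recording any driver error. A minimal sketch, assuming a DB-API style connection; the function name `check_database` and the `"unhealthy"` status value are illustrative assumptions, and `sqlite3` stands in for PostgreSQL only so the example runs without a server.

```python
import sqlite3
import time
from typing import Any, Dict, List


def check_database(connection: Any) -> Dict[str, Any]:
    """Run `SELECT 1` on a DB-API connection and time it.

    Hypothetical helper; the result keys mirror the per-check entries
    in the sample response.
    """
    errors: List[str] = []
    started = time.perf_counter()
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT 1")
        if cursor.fetchone() != (1,):
            errors.append("unexpected result from SELECT 1")
    except Exception as exc:  # any driver error marks the check as failed
        errors.append(f"database error: {exc}")
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {
        "status": "healthy" if not errors else "unhealthy",
        "critical": True,
        "errors": errors,
        "response_time_ms": round(elapsed_ms, 2),
    }


if __name__ == "__main__":
    # sqlite3 is a stand-in so the sketch runs anywhere.
    print(check_database(sqlite3.connect(":memory:")))
```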
+### Cache Check
+
+Verifies Redis connectivity and operation.
+
+**Critical**: No (returns 200 with a warning if the check fails)
+
+### Disk Usage Check
+
+Monitors disk space to prevent storage exhaustion.
+
+**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)
+
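The disk threshold can be illustrated with the standard library alone. This is a sketch under assumptions: `evaluate_disk` and `check_disk` are hypothetical names, and treating usage at or above the limit as a failure is an assumed comparison, not something the documentation specifies.

```python
import shutil
from typing import List


def evaluate_disk(total: int, used: int, max_percent: float = 90.0) -> List[str]:
    """Return a list of error strings (empty when usage is below the limit)."""
    if total <= 0:
        # Nothing sensible to report for an empty or virtual filesystem.
        return []
    percent_used = used / total * 100
    if percent_used >= max_percent:
        return [f"disk usage {percent_used:.1f}% is at or above {max_percent:.0f}%"]
    return []


def check_disk(path: str = "/", max_percent: float = 90.0) -> List[str]:
    """Measure the filesystem containing `path` and apply the threshold."""
    usage = shutil.disk_usage(path)
    return evaluate_disk(usage.total, usage.used, max_percent)
```

Splitting the measurement (`check_disk`) from the comparison (`evaluate_disk`) keeps the threshold logic testable without depending on the state of the machine running the tests.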
+### Memory Usage Check
+
+Monitors available memory.
+
+**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)
+
+## Integration Examples
+
+### Kubernetes Liveness Probe
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /api/v1/health/simple/
+    port: 8000
+  initialDelaySeconds: 30
+  periodSeconds: 10
+  timeoutSeconds: 5
+  failureThreshold: 3
+```
+
+### Kubernetes Readiness Probe
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /api/v1/health/simple/
+    port: 8000
+  initialDelaySeconds: 5
+  periodSeconds: 5
+  timeoutSeconds: 3
+  failureThreshold: 2
+```
+
+### AWS Application Load Balancer
+
+```json
+{
+  "HealthCheckPath": "/api/v1/health/simple/",
+  "HealthCheckIntervalSeconds": 30,
+  "HealthyThresholdCount": 2,
+  "UnhealthyThresholdCount": 3,
+  "HealthCheckTimeoutSeconds": 5,
+  "Matcher": {
+    "HttpCode": "200"
+  }
+}
+```
+
+### Docker Compose Health Check
+
+```yaml
+services:
+  web:
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 40s
+```
+
+### Nginx Upstream Health Check
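+
+Active health checks require NGINX Plus. The `health_check` directive belongs in the `location` block that proxies to the upstream, and the upstream group needs a shared-memory `zone`.
+
+```nginx
+upstream django {
+    zone django 64k;
+    server web:8000;
+}
+server {
+    location / {
+        proxy_pass http://django;
+        health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
+    }
+}
+```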
+
+## Monitoring Integration
+
+### Prometheus Metrics
+
+Health check data can be exposed as Prometheus metrics, for example with `django-prometheus` (which builds on `prometheus_client`):
+
+```python
+# Example custom metrics
+from prometheus_client import Gauge
+
+database_response_time = Gauge(
+    'thrillwiki_database_response_time_seconds',
+    'Database query response time'
+)
+
+cache_hit_rate = Gauge(
+    'thrillwiki_cache_hit_rate',
+    'Cache hit rate percentage'
+)
+```
+
+### Alerting Thresholds
+
+Recommended alerting thresholds:
+
+| Metric | Warning | Critical |
+|--------|---------|----------|
+| Response time | > 1s | > 5s |
+| Database query time | > 100ms | > 500ms |
+| Cache hit rate | < 80% | < 50% |
+| Disk usage | > 80% | > 90% |
+| Memory usage | > 80% | > 90% |
+
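The thresholds in the table above can be encoded as data and evaluated with one function. A minimal sketch: the metric keys and the `classify` helper are hypothetical names chosen for this example, not identifiers from ThrillWiki.

```python
from typing import Dict, Tuple

# (warning, critical, higher_is_worse), taken from the table above.
# The metric keys are illustrative names, not ThrillWiki identifiers.
THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "response_time_s": (1.0, 5.0, True),
    "db_query_time_ms": (100.0, 500.0, True),
    "cache_hit_rate_pct": (80.0, 50.0, False),
    "disk_usage_pct": (80.0, 90.0, True),
    "memory_usage_pct": (80.0, 90.0, True),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric value."""
    warning, critical, higher_is_worse = THRESHOLDS[metric]
    if not higher_is_worse:
        # Flip signs so that "greater than" always means "worse".
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

Cache hit rate is the only metric where a lower value is worse, which is why the table uses `<` for that row and the sketch carries a direction flag per metric.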
+### Grafana Dashboard
+
+Import the health check dashboard:
+
+```json
+{
+  "dashboard": {
+    "title": "ThrillWiki Health",
+    "panels": [
+      {
+        "title": "Health Status",
+        "type": "stat",
+        "targets": [
+          {
+            "expr": "probe_success{job='thrillwiki'}"
+          }
+        ]
+      },
+      {
+        "title": "Response Time",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "thrillwiki_health_response_time_ms"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
+| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |
+
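One way these variables might be read into settings is sketched below. The helper `load_health_thresholds` is hypothetical, and mapping the variables onto `DISK_USAGE_MAX` and `MEMORY_MIN` keys is an assumption about how the project wires them up; the defaults come from the table above.

```python
import os
from typing import Dict, Mapping


def load_health_thresholds(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read health check thresholds from the environment.

    Hypothetical helper: falls back to the documented defaults
    (90 percent disk usage, 100 MB available memory).
    """
    return {
        "DISK_USAGE_MAX": int(env.get("HEALTH_CHECK_DISK_USAGE_MAX", "90")),
        "MEMORY_MIN": int(env.get("HEALTH_CHECK_MEMORY_MIN", "100")),
    }


if __name__ == "__main__":
    print(load_health_thresholds())
```

Passing the environment as a parameter keeps the function easy to test; a non-numeric value raises `ValueError` at startup rather than failing later inside a health check.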
+### Custom Health Checks
+
+Add custom health checks by extending the health check system:
+
+```python
+# backend/apps/core/health_checks/custom_checks.py
+from health_check.backends import BaseHealthCheckBackend
+
+class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
+    """Check Cloudflare Images API connectivity."""
+
+    critical_service = False
+
+    def check_status(self):
+        try:
+            # Test the Cloudflare Images API. `cloudflare_service` is a project-level
+            # client and must be imported in this module.
+            response = cloudflare_service.test_connection()
+            if not response.ok:
+                self.add_error("Cloudflare Images API unavailable")
+        except Exception as e:
+            self.add_error(f"Cloudflare Images error: {e}")
+
+    def identifier(self):
+        return "CloudflareImages"
+```
+
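+Register the check with `plugin_dir.register(CloudflareImagesHealthCheck)` (from `health_check.plugins`) in an `AppConfig.ready()` method, make sure the app is listed in `INSTALLED_APPS`, and set the built-in thresholds in Django settings: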
+
+```python
+HEALTH_CHECK = {
+    'DISK_USAGE_MAX': 90,
+    'MEMORY_MIN': 100,
+}
+```
+
+## Troubleshooting
+
+### Health Check Returns 503
+
+1. Check database connectivity:
+   ```bash
+   uv run manage.py dbshell
+   ```
+
+2. Check Redis connectivity:
+   ```bash
+   redis-cli ping
+   ```
+
+3. Review application logs:
+   ```bash
+   tail -f logs/django.log
+   ```
+
+### Slow Health Check Response
+
+1. Check database query performance:
+   ```bash
+   uv run manage.py shell -c "import time; from django.db import connection; t = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'{(time.perf_counter() - t) * 1000:.1f} ms')"
+   ```
+
+2. Check cache response time:
+   ```bash
+   redis-cli --latency
+   ```
+
+### Missing Metrics
+
+Ensure `psutil` is installed for system metrics:
+
+```bash
+uv add psutil
+```
+
+## Best Practices
+
+1. **Use simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast
+2. **Monitor comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards
+3. **Set appropriate timeouts**: Health check timeouts should be shorter than intervals
+4. **Alert on degraded state**: Don't wait for complete failure
+5. **Log health check failures**: Include health status in application logs