
# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Authentication |
|---|---|---|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint:** `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | All checks passed or only non-critical failures |
| 503 | Critical service failure (database, etc.) |
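The 200/503 decision hinges on whether any *critical* check failed. A minimal, framework-free sketch of that logic (the function name and input shape are illustrative, not ThrillWiki's actual implementation):

```python
def overall_status_code(checks: dict) -> int:
    """Return 503 only when a critical check reports errors;
    non-critical failures still return 200."""
    for result in checks.values():
        if result["critical"] and result["errors"]:
            return 503
    return 200

# A degraded-but-available service: cache down, database up
checks = {
    "DatabaseBackend": {"critical": True, "errors": []},
    "CacheBackend": {"critical": False, "errors": ["timeout"]},
}
print(overall_status_code(checks))  # -> 200
```
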

### Simple Health Check

**Endpoint:** `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | Service healthy |
| 503 | Service unhealthy |
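The two payloads above reduce to a pure function of the database status. A framework-free illustration (the function name is an assumption, not the actual view):

```python
from datetime import datetime, timezone

def simple_health_response(db_ok: bool, error: str = "unhealthy"):
    """Build the (status_code, body) pair for the simple endpoint."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if db_ok:
        return 200, {"status": "ok", "timestamp": now}
    return 503, {"status": "error", "error": error, "timestamp": now}
```
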

### Performance Metrics

**Endpoint:** `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|---|---|
| 200 | Metrics returned |
| 403 | Not available (`DEBUG=False`) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
# Test query executed by the database check
cursor.execute("SELECT 1")
```

**Critical:** Yes (a 503 is returned on failure)

### Cache Check

Verifies Redis connectivity and operation.

**Critical:** No (a 200 with a warning is returned on failure)
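A cache check is essentially a set/get round-trip on the backend. A framework-free sketch against any dict-like backend (Django's cache proxy uses `cache.set`/`cache.get` instead; names here are illustrative):

```python
import uuid

def check_cache(cache) -> list:
    """Round-trip a sentinel through a dict-like cache backend.
    Returns a list of error strings; an empty list means healthy."""
    errors = []
    key = f"health-check-{uuid.uuid4()}"
    try:
        cache[key] = "ok"
        if cache.get(key) != "ok":
            errors.append("Cache set/get round-trip failed")
    except Exception as exc:
        errors.append(f"Cache unavailable: {exc}")
    return errors

print(check_cache({}))  # healthy in-memory stand-in -> []
```
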

### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold:** Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)

### Memory Usage Check

Monitors available memory.

**Threshold:** Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100 MB)
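The disk threshold logic can be sketched with the standard library alone (the memory side would use `psutil.virtual_memory()`; the function name below is illustrative, not ThrillWiki's implementation):

```python
import shutil

def disk_usage_ok(path: str = "/", max_percent: float = 90.0) -> bool:
    """True when used space on the filesystem containing `path`
    is at or below `max_percent` (cf. HEALTH_CHECK_DISK_USAGE_MAX)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 <= max_percent
```
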

## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

### Nginx Upstream Health Check

```nginx
upstream django {
    server web:8000;
    # The active health_check directive requires NGINX Plus
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}
```

## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics using `django-prometheus`:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|---|---|---|
| Response time | > 1 s | > 5 s |
| Database query time | > 100 ms | > 500 ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |
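These thresholds are straightforward to encode in an alerting rule. A sketch for the higher-is-worse metrics (cache hit rate inverts the comparison and is omitted; the names and table shape are illustrative):

```python
# (warning, critical) thresholds where higher values are worse
THRESHOLDS = {
    "response_time_ms": (1000, 5000),
    "db_query_time_ms": (100, 500),
    "disk_usage_pct": (80, 90),
    "memory_usage_pct": (80, 90),
}

def severity(metric: str, value: float) -> str:
    """Classify a metric sample as ok / warning / critical."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "critical"
    if value > warn:
        return "warning"
    return "ok"

print(severity("disk_usage_pct", 85))  # -> warning
```
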

### Grafana Dashboard

Import the health check dashboard:

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# Assumed application service wrapping the Cloudflare Images API
from apps.media.services import cloudflare_service


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API connectivity
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

Configure check thresholds via the `HEALTH_CHECK` setting in Django settings:

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,   # percent
    'MEMORY_MIN': 100,      # MB
}
```
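With `django-health-check`, a custom backend also needs to be registered with the plugin registry, typically in an `AppConfig.ready()` hook (module paths below are assumptions):

```python
# backend/apps/core/apps.py (illustrative path)
from django.apps import AppConfig

class CoreConfig(AppConfig):
    name = "apps.core"

    def ready(self):
        # Register the custom backend with django-health-check's plugin registry
        from health_check.plugins import plugin_dir
        from .health_checks.custom_checks import CloudflareImagesHealthCheck
        plugin_dir.register(CloudflareImagesHealthCheck)
```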

## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:

   ```shell
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:

   ```shell
   redis-cli ping
   ```

3. Review application logs:

   ```shell
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance:

   ```shell
   uv run manage.py shell -c "from django.db import connection; print(connection.ensure_connection())"
   ```

2. Check cache response time:

   ```shell
   redis-cli --latency
   ```
    

### Missing Metrics

Ensure `psutil` is installed for system metrics:

```shell
uv add psutil
```

## Best Practices

1. **Use the simple endpoint for load balancers:** the `/simple/` endpoint is lightweight and fast.
2. **Monitor the comprehensive endpoint:** use `/health/` for detailed monitoring dashboards.
3. **Set appropriate timeouts:** health check timeouts should be shorter than check intervals.
4. **Alert on degraded state:** don't wait for complete failure.
5. **Log health check failures:** include health status in application logs.