ThrillWiki Health Check Documentation
This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.
Overview
ThrillWiki provides three health check endpoints with different levels of detail:
| Endpoint | Purpose | Authentication |
|---|---|---|
| /api/v1/health/ | Comprehensive health check | Public |
| /api/v1/health/simple/ | Simple OK/ERROR for load balancers | Public |
| /api/v1/health/performance/ | Performance metrics | Debug mode only |
Endpoint Details
Comprehensive Health Check
Endpoint: GET /api/v1/health/
Returns detailed health information including system metrics, database status, cache status, and individual health checks.
Response Format
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
Status Codes
| Code | Meaning |
|---|---|
| 200 | All checks passed, or only non-critical checks failed |
| 503 | Critical service failure (database, etc.) |
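The mapping from individual checks to the HTTP status code can be sketched as follows. This is a minimal illustration, not ThrillWiki's actual implementation: the function name `summarize`, the "degraded" and "unhealthy" labels, and the rule that any per-check status other than "healthy" counts as a failure are all assumptions.

```python
def summarize(checks):
    """Map per-check results to (http_status, overall_status, failed_names).

    `checks` has the same shape as the "checks" object in the response above.
    """
    # Assumption: any status other than "healthy" counts as a failed check.
    failed = [name for name, result in checks.items()
              if result.get("status") != "healthy"]
    critical_failed = [name for name in failed if checks[name].get("critical")]

    if critical_failed:
        # A failed critical check (for example the database) yields 503.
        return 503, "unhealthy", failed
    if failed:
        # Only non-critical checks failed: still 200, flagged as degraded
        # (the "degraded" label is hypothetical).
        return 200, "degraded", failed
    return 200, "healthy", failed


example = {
    "DatabaseBackend": {"status": "healthy", "critical": True, "errors": []},
    "CacheBackend": {"status": "unhealthy", "critical": False,
                     "errors": ["Redis unreachable"]},
}
print(summarize(example))
```

A monitoring client can use the same rule to decide whether a 200 response still deserves a warning.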
Simple Health Check
Endpoint: GET /api/v1/health/simple/
Lightweight health check designed for load balancer health probes.
Response Format (Healthy)
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
Response Format (Unhealthy)
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
Status Codes
| Code | Meaning |
|---|---|
| 200 | Service healthy |
| 503 | Service unhealthy |
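For scripts that poll this endpoint directly, a small client might look like the sketch below. It uses only the Python standard library; the function names `interpret_simple_health` and `probe`, and the `base_url` parameter, are illustrative and not part of ThrillWiki.

```python
import json
import urllib.error
import urllib.request


def interpret_simple_health(status_code, body):
    """Return (healthy, detail) for a /api/v1/health/simple/ response."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False, "unparseable response body"
    if not isinstance(payload, dict):
        return False, "unexpected response shape"
    if status_code == 200 and payload.get("status") == "ok":
        return True, "ok"
    # Fall back to the HTTP code when the body carries no "error" field.
    return False, payload.get("error", f"HTTP {status_code}")


def probe(base_url, timeout=3.0):
    """Fetch the simple health endpoint and interpret the result."""
    url = base_url.rstrip("/") + "/api/v1/health/simple/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_simple_health(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # urlopen raises on 503; the JSON error body is still readable.
        return interpret_simple_health(exc.code, exc.read().decode("utf-8"))
    except OSError as exc:
        # Covers connection errors and timeouts.
        return False, f"request failed: {exc}"
```

Calling `probe("http://localhost:8000")` would return a `(healthy, detail)` tuple; the host and port here are placeholders.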
Performance Metrics
Endpoint: GET /api/v1/health/performance/
Detailed performance metrics for debugging (only available when DEBUG=True).
Response Format
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
Status Codes
| Code | Meaning |
|---|---|
| 200 | Metrics returned |
| 403 | Not available (DEBUG=False) |
Health Checks Included
Database Check
Verifies PostgreSQL connectivity by executing a simple query.
# Test query executed
cursor.execute("SELECT 1")
Critical: Yes (the endpoint returns 503 if this check fails)
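As a rough illustration of what such a check does, the sketch below times a `SELECT 1` against any DB-API connection and reports the result in the same shape as the "checks" entries shown earlier. The function name `check_database` is hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3
import time


def check_database(connection):
    """Run SELECT 1 on a DB-API connection and report status and timing."""
    errors = []
    started = time.perf_counter()
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT 1")
        row = cursor.fetchone()
        if row is None or row[0] != 1:
            errors.append("unexpected result from SELECT 1")
    except Exception as exc:
        # Any driver error counts as a failed check.
        errors.append(f"database error: {exc}")
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {
        "status": "healthy" if not errors else "unhealthy",
        "critical": True,
        "errors": errors,
        "response_time_ms": round(elapsed_ms, 2),
    }


# SQLite stands in for PostgreSQL here (assumption for the demo only).
result = check_database(sqlite3.connect(":memory:"))
print(result["status"], result["response_time_ms"])
```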
Cache Check
Verifies Redis connectivity and operation.
Critical: No (the endpoint still returns 200, with a warning, if this check fails)
Disk Usage Check
Monitors disk space to prevent storage exhaustion.
Threshold: Configurable via HEALTH_CHECK_DISK_USAGE_MAX (default: 90%)
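A threshold check of this kind can be approximated with the standard library alone. The sketch below is an assumption about the general approach, not the code ThrillWiki runs: `check_disk_usage` is a hypothetical name, and treating usage equal to the threshold as a failure is a choice made for the example.

```python
import shutil


def check_disk_usage(path="/", max_percent=90):
    """Compare disk usage at `path` against a maximum percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = usage.used / usage.total * 100 if usage.total else 0.0
    errors = []
    if percent_used >= max_percent:
        errors.append(
            f"disk usage {percent_used:.1f}% is at or above {max_percent}%"
        )
    return {
        "status": "healthy" if not errors else "unhealthy",
        "critical": False,
        "errors": errors,
        "percent_used": round(percent_used, 1),
    }


print(check_disk_usage("/", max_percent=90))
```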
Memory Usage Check
Monitors available memory.
Threshold: Configurable via HEALTH_CHECK_MEMORY_MIN (default: 100MB)
Integration Examples
Kubernetes Liveness Probe
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Kubernetes Readiness Probe
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
AWS Application Load Balancer
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
Docker Compose Health Check
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
Nginx Upstream Health Check
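upstream django {
    zone django 64k;
    server web:8000;
}
location / {
    proxy_pass http://django;
    # NGINX Plus only; health_check belongs in a location block, not in upstream
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}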
Monitoring Integration
Prometheus Metrics
Health check data can be exposed as Prometheus metrics using django-prometheus:
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
Alerting Thresholds
Recommended alerting thresholds:
| Metric | Warning | Critical |
|---|---|---|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |
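The table above can be encoded as a small lookup for use in an alerting script. This is a sketch only: the metric keys and the `classify` function are hypothetical names, and the comparisons are strict, matching the `>` and `<` signs in the table.

```python
# (warning, critical, direction) per metric, taken from the table above.
# "above" means higher values are worse; "below" means lower values are worse.
THRESHOLDS = {
    "response_time_s": (1.0, 5.0, "above"),
    "db_query_time_ms": (100.0, 500.0, "above"),
    "cache_hit_rate_pct": (80.0, 50.0, "below"),
    "disk_usage_pct": (80.0, 90.0, "above"),
    "memory_usage_pct": (80.0, 90.0, "above"),
}


def classify(metric, value):
    """Return "ok", "warning", or "critical" for a metric reading."""
    warning, critical, direction = THRESHOLDS[metric]
    if direction == "above":
        if value > critical:
            return "critical"
        if value > warning:
            return "warning"
    else:
        if value < critical:
            return "critical"
        if value < warning:
            return "warning"
    return "ok"


print(classify("cache_hit_rate_pct", 94.5))
print(classify("disk_usage_pct", 95))
```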
Grafana Dashboard
Import the health check dashboard:
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| HEALTH_CHECK_DISK_USAGE_MAX | 90 | Max disk usage percentage |
| HEALTH_CHECK_MEMORY_MIN | 100 | Min available memory (MB) |
Custom Health Checks
Add custom health checks by extending the health check system:
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# NOTE: cloudflare_service is not defined in this snippet; import it from
# the project's own Cloudflare integration module.


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    # Non-critical: a failure is reported but does not produce a 503
    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
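Configure the built-in thresholds in the Django settings (this dictionary sets thresholds only; the custom check itself is registered separately with the health check system):
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,
    'MEMORY_MIN': 100,
}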
Troubleshooting
Health Check Returns 503
- Check database connectivity:
  uv run manage.py dbshell
- Check Redis connectivity:
  redis-cli ping
- Review application logs:
  tail -f logs/django.log
Slow Health Check Response
- Check database query performance:
  uv run manage.py shell -c "import time; from django.db import connection; t = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'{(time.perf_counter() - t) * 1000:.1f} ms')"
- Check cache response time:
  redis-cli --latency
Missing Metrics
Ensure psutil is installed for system metrics:
uv add psutil
Best Practices
- Use simple endpoint for load balancers: The /simple/ endpoint is lightweight and fast
- Monitor comprehensive endpoint: Use /health/ for detailed monitoring dashboards
- Set appropriate timeouts: Health check timeouts should be shorter than intervals
- Alert on degraded state: Don't wait for complete failure
- Log health check failures: Include health status in application logs