# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Authentication |
|----------|---------|----------------|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint**: `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | All checks passed or non-critical failures |
| 503 | Critical service failure (database, etc.) |

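A monitoring script might consume the comprehensive payload roughly as follows. This is a minimal sketch that assumes the response shape documented above; `summarize_health` is a hypothetical helper, and the `"unhealthy"` status string used in the sample is an assumption, since only `"healthy"` appears in the documented response.

```python
import json


def summarize_health(payload: dict) -> dict:
    """Split failing checks into critical and non-critical groups.

    Assumes a top-level "checks" mapping whose entries carry
    "status", "critical", and "errors", as in the documented response.
    """
    critical, degraded = [], []
    for name, check in payload.get("checks", {}).items():
        if check.get("status") == "healthy":
            continue
        # A failing critical check is what the endpoint reports as HTTP 503.
        (critical if check.get("critical") else degraded).append(name)
    return {
        "overall": payload.get("status", "unknown"),
        "critical_failures": sorted(critical),
        "non_critical_failures": sorted(degraded),
    }


# Illustrative payload: the cache is failing, the database is fine.
# The "unhealthy" string is an assumed value, not taken from the docs.
sample = json.loads("""
{
  "status": "healthy",
  "checks": {
    "DatabaseBackend": {"status": "healthy", "critical": true, "errors": []},
    "CacheBackend": {"status": "unhealthy", "critical": false,
                     "errors": ["Connection refused"]}
  }
}
""")
print(summarize_health(sample))
```

Separating critical from non-critical failures mirrors the status code table: only the first group should page someone immediately, while the second is a candidate for a warning-level alert.
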
### Simple Health Check

**Endpoint**: `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Service healthy |
| 503 | Service unhealthy |

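For environments without a built-in prober, the simple endpoint can be polled from a short script. The sketch below uses only the Python standard library; `interpret_simple_response` and `probe` are hypothetical helper names, and the logic assumes the healthy and unhealthy response shapes shown above.

```python
import json
import urllib.error
import urllib.request


def interpret_simple_response(status_code: int, body: str) -> tuple[bool, str]:
    """Map a simple health check response to (healthy, detail)."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "response body is not valid JSON"
    if not isinstance(data, dict):
        return False, "response body is not a JSON object"
    if status_code == 200 and data.get("status") == "ok":
        return True, "ok"
    return False, data.get("error", f"unexpected status {status_code}")


def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Fetch the endpoint once and interpret the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_simple_response(resp.status, resp.read().decode())
    except urllib.error.HTTPError as exc:
        # urllib raises on 503, but the JSON error body is still readable.
        return interpret_simple_response(exc.code, exc.read().decode())
    except OSError as exc:
        # Covers URLError, refused connections, and timeouts.
        return False, f"request failed: {exc}"
```

Calling `probe("http://localhost:8000/api/v1/health/simple/")` against a local instance would return `(True, "ok")` when the service is healthy, assuming the host and port match your deployment.
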
### Performance Metrics

**Endpoint**: `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Metrics returned |
| 403 | Not available (DEBUG=False) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
# Test query executed
cursor.execute("SELECT 1")
```

**Critical**: Yes (503 returned if fails)

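The database check can be pictured as a timed `SELECT 1` wrapped in error handling. The following is a minimal sketch, not ThrillWiki's actual implementation: `check_database` is a hypothetical function, the result keys mirror the `DatabaseBackend` entry in the comprehensive response, and `sqlite3` stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3
import time


def check_database(connection) -> dict:
    """Run the `SELECT 1` test query and report status plus elapsed time."""
    started = time.perf_counter()
    errors = []
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT 1")
        if cursor.fetchone() != (1,):
            errors.append("unexpected result from test query")
    except Exception as exc:  # any driver error marks the check as failed
        errors.append(str(exc))
    return {
        # "unhealthy" is an assumed label; the docs only show "healthy".
        "status": "healthy" if not errors else "unhealthy",
        "critical": True,
        "errors": errors,
        "response_time_ms": round((time.perf_counter() - started) * 1000, 2),
    }


# sqlite3 stands in for PostgreSQL here (assumption for runnability).
print(check_database(sqlite3.connect(":memory:")))
```

Any DB-API connection with a `cursor()` method should work in place of the in-memory database.
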
### Cache Check

Verifies Redis connectivity and operation.

**Critical**: No (200 returned with warning if fails)

### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)

### Memory Usage Check

Monitors available memory.

**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)

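The two threshold checks reduce to simple comparisons against the configured limits. The sketch below illustrates that logic with the documented defaults (90% and 100 MB); the function names are hypothetical, and treating a value exactly at the disk limit as a failure is an assumption rather than confirmed behavior.

```python
import shutil


def disk_check(percent_used: float, max_percent: float = 90) -> list[str]:
    """Return an error list when disk usage reaches the configured maximum."""
    if percent_used >= max_percent:
        return [f"disk usage {percent_used:.1f}% is at or above the {max_percent}% limit"]
    return []


def memory_check(available_mb: float, min_mb: float = 100) -> list[str]:
    """Return an error list when available memory drops below the minimum."""
    if available_mb < min_mb:
        return [f"available memory {available_mb:.0f} MB is below the {min_mb} MB minimum"]
    return []


def current_disk_percent(path: str = "/") -> float:
    """Read actual disk usage for `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


print(disk_check(current_disk_percent()))
print(memory_check(4096))
```

Reading available memory itself requires a library such as `psutil`, so the memory function takes the value as an argument.
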
## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

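### Nginx Upstream Health Check

Active health checks require NGINX Plus. The `health_check` directive belongs in the `location` that proxies to the upstream, and the upstream needs a shared memory `zone`.

```nginx
upstream django {
    zone django 64k;  # shared memory zone, required by health_check
    server web:8000;
}
server {
    location / {
        proxy_pass http://django;
        health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
    }
}
```
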
## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics using django-prometheus:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|--------|---------|----------|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |

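The table above can be encoded as data and applied to sampled values. This is a sketch under stated assumptions: the metric keys and the `classify` function are hypothetical names chosen for illustration, and only the numeric thresholds come from the table.

```python
# (warning, critical, direction) per metric, copied from the table above.
# The metric key names are illustrative, not ThrillWiki identifiers.
THRESHOLDS = {
    "response_time_s": (1.0, 5.0, "above"),
    "database_query_time_ms": (100.0, 500.0, "above"),
    "cache_hit_rate_percent": (80.0, 50.0, "below"),
    "disk_usage_percent": (80.0, 90.0, "above"),
    "memory_usage_percent": (80.0, 90.0, "above"),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a sampled metric value."""
    warning, critical, direction = THRESHOLDS[metric]
    if direction == "below":
        # Lower is worse: negate so one comparison handles both directions.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


for metric, value in [("cache_hit_rate_percent", 94.5), ("disk_usage_percent", 85)]:
    print(metric, value, classify(metric, value))
```

Keeping the thresholds in one mapping makes it straightforward to keep alert rules and this documentation in sync.
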
### Grafana Dashboard

Import the health check dashboard:

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# cloudflare_service is the project's Cloudflare client; its import is omitted here.


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

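Register the check with `plugin_dir.register()` (from `health_check.plugins`), typically in an `AppConfig.ready()` method. Thresholds for the built-in disk and memory checks are set in the `HEALTH_CHECK` setting:

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,
    'MEMORY_MIN': 100,
}
```
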
## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:
   ```bash
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:
   ```bash
   redis-cli ping
   ```

3. Review application logs:
   ```bash
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance (the measured time includes opening the connection):
   ```bash
   uv run manage.py shell -c "import time; from django.db import connection; t = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'{(time.perf_counter() - t) * 1000:.1f} ms')"
   ```

2. Check cache response time:
   ```bash
   redis-cli --latency
   ```

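When a single measurement is too noisy to act on, repeating the call and summarizing the samples gives a steadier picture. The helper below is a minimal sketch using only the standard library; `measure_latency_ms` is a hypothetical name, and the `time.sleep` call is a stand-in for whatever function requests the health endpoint or runs the test query.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(call: Callable[[], object], runs: int = 5) -> dict:
    """Time `call` several times and summarize the samples in milliseconds."""
    samples = []
    for _ in range(runs):
        started = time.perf_counter()
        call()
        samples.append((time.perf_counter() - started) * 1000)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Stand-in workload; replace with a function that requests the endpoint.
print(measure_latency_ms(lambda: time.sleep(0.01)))
```

The median is less sensitive to one slow outlier than the mean, while the maximum shows whether occasional spikes approach the probe timeout.
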
### Missing Metrics

Ensure `psutil` is installed for system metrics:

```bash
uv add psutil
```

## Best Practices

1. **Use simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast
2. **Monitor comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards
3. **Set appropriate timeouts**: Health check timeouts should be shorter than intervals
4. **Alert on degraded state**: Don't wait for complete failure
5. **Log health check failures**: Include health status in application logs