mirror of
https://github.com/pacnpal/thrillwiki_django_no_react.git
synced 2025-12-24 17:31:09 -05:00
Add secret management guide, client-side performance monitoring, and search accessibility enhancements
- Introduced a comprehensive Secret Management Guide detailing best practices, secret classification, development setup, production management, rotation procedures, and emergency protocols.
- Implemented a client-side performance monitoring script to track various metrics including page load performance, paint metrics, layout shifts, and memory usage.
- Enhanced search accessibility with keyboard navigation support for search results, ensuring compliance with WCAG standards and improving user experience.
414
docs/HEALTH_CHECKS.md
Normal file
@@ -0,0 +1,414 @@
# ThrillWiki Health Check Documentation

This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.

## Overview

ThrillWiki provides three health check endpoints with different levels of detail:

| Endpoint | Purpose | Authentication |
|----------|---------|----------------|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |

## Endpoint Details

### Comprehensive Health Check

**Endpoint**: `GET /api/v1/health/`

Returns detailed health information including system metrics, database status, cache status, and individual health checks.

#### Response Format

```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | All checks passed or non-critical failures |
| 503 | Critical service failure (database, etc.) |
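
As a rough illustration of the rule in the table, the sketch below derives the HTTP status from a `checks` mapping shaped like the sample response. The function name `http_status_for` is hypothetical, and this is not ThrillWiki's actual implementation; it only assumes that a check counts as failed when its `status` is not `"healthy"` or its `errors` list is non-empty.

```python
def http_status_for(checks: dict) -> int:
    """Return 503 if any critical check failed, otherwise 200 (hypothetical helper)."""
    for result in checks.values():
        failed = result.get("status") != "healthy" or bool(result.get("errors"))
        if failed and result.get("critical", False):
            return 503
    return 200


# Sample data modelled on the response format above.
checks = {
    "DatabaseBackend": {"status": "healthy", "critical": True, "errors": []},
    "CacheBackend": {"status": "unhealthy", "critical": False, "errors": ["timeout"]},
}
print(http_status_for(checks))  # 200: only a non-critical check failed
```

Marking the database check as failed in the same mapping would flip the result to 503, which matches the "critical service failure" row.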

### Simple Health Check

**Endpoint**: `GET /api/v1/health/simple/`

Lightweight health check designed for load balancer health probes.

#### Response Format (Healthy)

```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Response Format (Unhealthy)

```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Service healthy |
| 503 | Service unhealthy |
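
A monitoring script might interpret the simple endpoint's reply roughly as follows. This is a minimal sketch: `is_healthy` is a hypothetical helper, and it assumes the caller has already fetched the HTTP status code and the raw body.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Treat the probe as healthy only for HTTP 200 with a JSON status of "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


# Bodies copied from the two response formats above.
print(is_healthy(200, '{"status": "ok", "timestamp": "2025-12-23T10:30:00Z"}'))
print(is_healthy(503, '{"status": "error", "error": "Database connection failed"}'))
```

Load balancers normally look at the status code alone; checking the body as well is only useful for scripts that want to log the `error` field.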

### Performance Metrics

**Endpoint**: `GET /api/v1/health/performance/`

Detailed performance metrics for debugging (only available when `DEBUG=True`).

#### Response Format

```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```

#### Status Codes

| Code | Meaning |
|------|---------|
| 200 | Metrics returned |
| 403 | Not available (DEBUG=False) |

## Health Checks Included

### Database Check

Verifies PostgreSQL connectivity by executing a simple query.

```python
from django.db import connection

# Test query executed by the database check
with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
```

**Critical**: Yes (503 returned if it fails)

### Cache Check

Verifies Redis connectivity and operation.

**Critical**: No (200 returned with a warning if it fails)

### Disk Usage Check

Monitors disk space to prevent storage exhaustion.

**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)
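
To make the threshold concrete, here is a standard-library approximation of a disk usage check. It is a sketch rather than the check ThrillWiki ships: `disk_check` is a hypothetical name, the result shape mirrors the `checks` entries in the comprehensive response, and reading the threshold straight from the environment is an assumption.

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_check(path: str = "/") -> dict:
    """Compare current disk usage with HEALTH_CHECK_DISK_USAGE_MAX (default 90)."""
    max_percent = float(os.environ.get("HEALTH_CHECK_DISK_USAGE_MAX", "90"))
    percent = disk_usage_percent(path)
    errors = []
    if percent > max_percent:
        errors.append(f"Disk usage {percent:.1f}% exceeds {max_percent:.0f}%")
    return {
        "status": "unhealthy" if errors else "healthy",
        "critical": False,  # disk usage is non-critical in the sample response
        "errors": errors,
    }


print(disk_check())
```

Libraries such as `psutil` compute the percentage slightly differently (they account for space reserved for root), so the numbers may not match the endpoint exactly.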

### Memory Usage Check

Monitors available memory.

**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)

## Integration Examples

### Kubernetes Liveness Probe

```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### Kubernetes Readiness Probe

```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```

### AWS Application Load Balancer

```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```

### Docker Compose Health Check

```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
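
### Nginx Upstream Health Check

Active health checks require NGINX Plus. The `health_check` directive goes in the `location` that proxies to the upstream, and the upstream needs a shared memory `zone`:

```nginx
upstream django {
    zone django 64k;  # shared memory zone, required for active health checks
    server web:8000;
}

# NGINX Plus only: place this location inside the relevant server block
location / {
    proxy_pass http://django;
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}
```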

## Monitoring Integration

### Prometheus Metrics

Health check data can be exposed as Prometheus metrics using django-prometheus:

```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```

### Alerting Thresholds

Recommended alerting thresholds:

| Metric | Warning | Critical |
|--------|---------|----------|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |
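
The table above can be encoded as data and evaluated with a small helper. The sketch below is illustrative only: the metric keys and the `severity` function are hypothetical names, not part of ThrillWiki or any monitoring tool.

```python
# (warning, critical, higher_is_worse), taken from the table above.
THRESHOLDS = {
    "response_time_s": (1.0, 5.0, True),
    "database_query_ms": (100.0, 500.0, True),
    "cache_hit_rate_pct": (80.0, 50.0, False),
    "disk_usage_pct": (80.0, 90.0, True),
    "memory_usage_pct": (80.0, 90.0, True),
}


def severity(metric: str, value: float) -> str:
    """Classify a metric value as "ok", "warning", or "critical"."""
    warning, critical, higher_is_worse = THRESHOLDS[metric]
    if not higher_is_worse:
        # Flip signs so that "larger means worse" holds for every metric.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(severity("cache_hit_rate_pct", 94.5))  # ok
print(severity("disk_usage_pct", 85))        # warning
print(severity("database_query_ms", 750))    # critical
```

Comparisons are strict, so a value exactly at a threshold (for example a 1.0 s response time) is still classified as `ok`, matching the `>` and `<` signs in the table.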

### Grafana Dashboard

Import the health check dashboard:

```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |

### Custom Health Checks

Add custom health checks by extending the health check system:

```python
# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# `cloudflare_service` is assumed to be a project-level client exposing
# test_connection(); import it from wherever it is defined in the project.


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        return "CloudflareImages"
```

Make sure the app that defines the check is listed in `INSTALLED_APPS`, then set the thresholds for the built-in disk and memory checks through the `HEALTH_CHECK` setting:

```python
HEALTH_CHECK = {
    'DISK_USAGE_MAX': 90,  # percent
    'MEMORY_MIN': 100,     # MB
}
```

## Troubleshooting

### Health Check Returns 503

1. Check database connectivity:
   ```bash
   uv run manage.py dbshell
   ```

2. Check Redis connectivity:
   ```bash
   redis-cli ping
   ```

3. Review application logs:
   ```bash
   tail -f logs/django.log
   ```

### Slow Health Check Response

1. Check database query performance by timing a `SELECT 1` round trip (the figure includes connection setup):
   ```bash
   uv run manage.py shell -c "import time; from django.db import connection; start = time.perf_counter(); cursor = connection.cursor(); cursor.execute('SELECT 1'); print(f'{(time.perf_counter() - start) * 1000:.1f} ms')"
   ```

2. Check cache response time:
   ```bash
   redis-cli --latency
   ```

### Missing Metrics

Ensure `psutil` is installed for system metrics:

```bash
uv add psutil
```

## Best Practices

1. **Use simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast
2. **Monitor comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards
3. **Set appropriate timeouts**: Health check timeouts should be shorter than intervals
4. **Alert on degraded state**: Don't wait for complete failure
5. **Log health check failures**: Include health status in application logs
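
The timeout rule in item 3 is easy to verify mechanically. The sketch below checks the interval/timeout pairs used in this document's integration examples; `probe_timing_problems` is a hypothetical helper, not part of any of those tools.

```python
def probe_timing_problems(interval_s: float, timeout_s: float) -> list[str]:
    """Return a list of problems with a probe's interval/timeout pair."""
    problems = []
    if interval_s <= 0 or timeout_s <= 0:
        problems.append("interval and timeout must be positive")
    elif timeout_s >= interval_s:
        problems.append(
            f"timeout ({timeout_s}s) should be shorter than interval ({interval_s}s)"
        )
    return problems


# (interval, timeout) pairs taken from the integration examples in this document.
PROBES = {
    "kubernetes-liveness": (10, 5),
    "kubernetes-readiness": (5, 3),
    "aws-alb": (30, 5),
    "docker-compose": (30, 10),
}

for name, (interval_s, timeout_s) in PROBES.items():
    print(name, probe_timing_problems(interval_s, timeout_s) or "ok")
```

All four example configurations pass; a pair such as a 5 s interval with a 10 s timeout would be reported as a problem.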