Add secret management guide, client-side performance monitoring, and search accessibility enhancements

- Introduced a comprehensive Secret Management Guide detailing best practices, secret classification, development setup, production management, rotation procedures, and emergency protocols.
- Implemented a client-side performance monitoring script to track various metrics including page load performance, paint metrics, layout shifts, and memory usage.
- Enhanced search accessibility with keyboard navigation support for search results, ensuring compliance with WCAG standards and improving user experience.
pacnpal
2025-12-23 16:41:42 -05:00
parent ae31e889d7
commit edcd8f2076
155 changed files with 22046 additions and 4645 deletions

docs/HEALTH_CHECKS.md Normal file

@@ -0,0 +1,414 @@
# ThrillWiki Health Check Documentation
This document describes the health check endpoints available in ThrillWiki for monitoring and operational purposes.
## Overview
ThrillWiki provides three health check endpoints with different levels of detail:
| Endpoint | Purpose | Authentication |
|----------|---------|----------------|
| `/api/v1/health/` | Comprehensive health check | Public |
| `/api/v1/health/simple/` | Simple OK/ERROR for load balancers | Public |
| `/api/v1/health/performance/` | Performance metrics | Debug mode only |
## Endpoint Details
### Comprehensive Health Check
**Endpoint**: `GET /api/v1/health/`
Returns detailed health information including system metrics, database status, cache status, and individual health checks.
#### Response Format
```json
{
  "status": "healthy",
  "timestamp": "2025-12-23T10:30:00Z",
  "version": "1.0.0",
  "environment": "production",
  "response_time_ms": 45.23,
  "checks": {
    "DatabaseBackend": {
      "status": "healthy",
      "critical": true,
      "errors": [],
      "response_time_ms": 12.5
    },
    "CacheBackend": {
      "status": "healthy",
      "critical": false,
      "errors": [],
      "response_time_ms": 2.1
    },
    "DiskUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    },
    "MemoryUsage": {
      "status": "healthy",
      "critical": false,
      "errors": []
    }
  },
  "metrics": {
    "cache": {
      "redis": {
        "used_memory": "45.2MB",
        "connected_clients": 15,
        "hit_rate": 94.5
      }
    },
    "database": {
      "vendor": "postgresql",
      "connection_status": "connected",
      "test_query_time_ms": 1.2,
      "active_connections": 8,
      "cache_hit_ratio": 98.3
    },
    "system": {
      "debug_mode": false,
      "memory": {
        "total_mb": 8192,
        "available_mb": 4096,
        "percent_used": 50
      },
      "cpu": {
        "percent_used": 25.3,
        "core_count": 4
      },
      "disk": {
        "total_gb": 100,
        "free_gb": 45,
        "percent_used": 55
      }
    }
  }
}
```
#### Status Codes
| Code | Meaning |
|------|---------|
| 200 | All checks passed, or only non-critical checks failed |
| 503 | Critical service failure (database, etc.) |
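
The rule in the table can be sketched as a small pure function. This is only an illustration, assuming the `checks` mapping has the shape shown in the response format above; `http_status_for` is a hypothetical name and not necessarily how ThrillWiki implements the decision.

```python
from typing import Any, Dict


def http_status_for(checks: Dict[str, Dict[str, Any]]) -> int:
    """Return 503 if any critical check failed, otherwise 200."""
    for result in checks.values():
        # A check counts as failed if it lists errors or is not "healthy".
        failed = bool(result.get("errors")) or result.get("status") != "healthy"
        if failed and result.get("critical", False):
            return 503
    # Either everything passed or only non-critical checks failed.
    return 200
```

With this rule, a failing `CacheBackend` (non-critical) still yields 200, while a failing `DatabaseBackend` (critical) yields 503.
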
### Simple Health Check
**Endpoint**: `GET /api/v1/health/simple/`
Lightweight health check designed for load balancer health probes.
#### Response Format (Healthy)
```json
{
  "status": "ok",
  "timestamp": "2025-12-23T10:30:00Z"
}
```
#### Response Format (Unhealthy)
```json
{
  "status": "error",
  "error": "Database connection failed",
  "timestamp": "2025-12-23T10:30:00Z"
}
```
#### Status Codes
| Code | Meaning |
|------|---------|
| 200 | Service healthy |
| 503 | Service unhealthy |
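
For scripts that run outside a load balancer (cron jobs, deploy hooks), the same endpoint can be probed with the standard library. The sketch below assumes the service listens on `http://localhost:8000`; `interpret` and `probe` are hypothetical helper names, and treating every network error as unhealthy is a design assumption.

```python
import json
import urllib.error
import urllib.request


def interpret(status_code: int, body: bytes) -> bool:
    """Healthy only when the HTTP code is 200 and the JSON status is "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Call the simple endpoint; any network error counts as unhealthy."""
    url = f"{base_url}/api/v1/health/simple/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return interpret(response.status, response.read())
    except urllib.error.HTTPError as exc:  # e.g. 503 with the error body
        return interpret(exc.code, exc.read())
    except (urllib.error.URLError, OSError):  # refused, DNS failure, timeout
        return False
```

Checking both the status code and the body guards against a proxy that answers 200 with an unrelated page.
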
### Performance Metrics
**Endpoint**: `GET /api/v1/health/performance/`
Detailed performance metrics for debugging (only available when `DEBUG=True`).
#### Response Format
```json
{
  "timestamp": "2025-12-23T10:30:00Z",
  "database_analysis": {
    "total_queries": 0,
    "query_analysis": {}
  },
  "cache_performance": {
    "redis": {
      "used_memory": "45.2MB",
      "hit_rate": 94.5
    }
  },
  "recent_slow_queries": []
}
```
#### Status Codes
| Code | Meaning |
|------|---------|
| 200 | Metrics returned |
| 403 | Not available (DEBUG=False) |
## Health Checks Included
### Database Check
Verifies PostgreSQL connectivity by executing a simple query.
```python
# Test query executed
cursor.execute("SELECT 1")
```
**Critical**: Yes (503 is returned if it fails)
### Cache Check
Verifies Redis connectivity and operation.
**Critical**: No (200 is returned with a warning if it fails)
### Disk Usage Check
Monitors disk space to prevent storage exhaustion.
**Threshold**: Configurable via `HEALTH_CHECK_DISK_USAGE_MAX` (default: 90%)
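
A threshold check of this kind can be sketched with the standard library alone. The example assumes the threshold is read from the `HEALTH_CHECK_DISK_USAGE_MAX` environment variable with the documented default of 90; the function name `disk_usage_errors` and the `>=` comparison are assumptions, not the project's actual code.

```python
import os
import shutil
from typing import List, Optional


def disk_usage_errors(path: str = "/", max_percent: Optional[float] = None) -> List[str]:
    """Return a list of error strings; an empty list means the check passed."""
    if max_percent is None:
        # Read the threshold from the environment, defaulting to 90 percent.
        max_percent = float(os.environ.get("HEALTH_CHECK_DISK_USAGE_MAX", "90"))
    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100
    if percent_used >= max_percent:
        return [f"Disk usage {percent_used:.1f}% exceeds {max_percent:.0f}%"]
    return []
```

Returning a list of errors mirrors the `errors` array in the comprehensive response, so the result can be attached to a check entry directly.
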
### Memory Usage Check
Monitors available memory.
**Threshold**: Configurable via `HEALTH_CHECK_MEMORY_MIN` (default: 100MB)
## Integration Examples
### Kubernetes Liveness Probe
```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
### Kubernetes Readiness Probe
```yaml
readinessProbe:
  httpGet:
    path: /api/v1/health/simple/
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```
### AWS Application Load Balancer
```json
{
  "HealthCheckPath": "/api/v1/health/simple/",
  "HealthCheckIntervalSeconds": 30,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "HealthCheckTimeoutSeconds": 5,
  "Matcher": {
    "HttpCode": "200"
  }
}
```
### Docker Compose Health Check
```yaml
services:
  web:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/v1/health/simple/"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
### Nginx Upstream Health Check
```nginx
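upstream django {
    zone django 64k;  # shared memory zone, required for active health checks
    server web:8000;
}
# NGINX Plus only: health_check belongs in the proxying location, not in upstream
location / {
    proxy_pass http://django;
    health_check uri=/api/v1/health/simple/ interval=10s fails=3 passes=2;
}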
```
## Monitoring Integration
### Prometheus Metrics
Health check data can be exposed as Prometheus metrics using django-prometheus, which builds on `prometheus_client`:
```python
# Example custom metrics
from prometheus_client import Gauge

database_response_time = Gauge(
    'thrillwiki_database_response_time_seconds',
    'Database query response time'
)

cache_hit_rate = Gauge(
    'thrillwiki_cache_hit_rate',
    'Cache hit rate percentage'
)
```
### Alerting Thresholds
Recommended alerting thresholds:
| Metric | Warning | Critical |
|--------|---------|----------|
| Response time | > 1s | > 5s |
| Database query time | > 100ms | > 500ms |
| Cache hit rate | < 80% | < 50% |
| Disk usage | > 80% | > 90% |
| Memory usage | > 80% | > 90% |
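
The table can be turned into a small classifier that an alerting script might call. This is a sketch only: the threshold values are copied from the table, while the metric key names, the `Threshold` class, and the `classify` function are illustrative and not part of ThrillWiki.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True


# Values copied from the table above; the key names are illustrative.
THRESHOLDS = {
    "response_time_s": Threshold(1.0, 5.0),
    "database_query_ms": Threshold(100.0, 500.0),
    "cache_hit_rate_pct": Threshold(80.0, 50.0, higher_is_worse=False),
    "disk_usage_pct": Threshold(80.0, 90.0),
    "memory_usage_pct": Threshold(80.0, 90.0),
}


def classify(metric: str, value: float) -> str:
    """Return "ok", "warning", or "critical" for a metric reading."""
    t = THRESHOLDS[metric]
    # Flip the sign for metrics where low values are bad, so one comparison works.
    sign = 1 if t.higher_is_worse else -1
    if sign * value > sign * t.critical:
        return "critical"
    if sign * value > sign * t.warning:
        return "warning"
    return "ok"
```

The comparisons are strict, matching the `>` and `<` in the table, so a reading exactly at a threshold does not trigger that level.
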
### Grafana Dashboard
Import the health check dashboard:
```json
{
  "dashboard": {
    "title": "ThrillWiki Health",
    "panels": [
      {
        "title": "Health Status",
        "type": "stat",
        "targets": [
          {
            "expr": "probe_success{job='thrillwiki'}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "thrillwiki_health_response_time_ms"
          }
        ]
      }
    ]
  }
}
```
## Configuration
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `HEALTH_CHECK_DISK_USAGE_MAX` | 90 | Max disk usage percentage |
| `HEALTH_CHECK_MEMORY_MIN` | 100 | Min available memory (MB) |
### Custom Health Checks
Add custom health checks by extending the health check system:
```python
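# backend/apps/core/health_checks/custom_checks.py
from health_check.backends import BaseHealthCheckBackend

# NOTE: `cloudflare_service` is assumed to be an existing project helper that
# exposes `test_connection()`; import it from wherever it lives in your codebase.


class CloudflareImagesHealthCheck(BaseHealthCheckBackend):
    """Check Cloudflare Images API connectivity."""

    # Non-critical: a failure is reported but does not trigger a 503.
    critical_service = False

    def check_status(self):
        try:
            # Test Cloudflare Images API
            response = cloudflare_service.test_connection()
            if not response.ok:
                self.add_error("Cloudflare Images API unavailable")
        except Exception as e:
            self.add_error(f"Cloudflare Images error: {e}")

    def identifier(self):
        # Name shown as the key under "checks" in the health response.
        return "CloudflareImages"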
```
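Register the check with `plugin_dir.register()` (from `health_check.plugins`) in the app's `AppConfig.ready()`, and make sure that app is listed in `INSTALLED_APPS`. Thresholds for the built-in disk and memory checks are set in Django settings: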
```python
HEALTH_CHECK = {
'DISK_USAGE_MAX': 90,
'MEMORY_MIN': 100,
}
```
## Troubleshooting
### Health Check Returns 503
1. Check database connectivity:
```bash
uv run manage.py dbshell
```
2. Check Redis connectivity:
```bash
redis-cli ping
```
3. Review application logs:
```bash
tail -f logs/django.log
```
### Slow Health Check Response
1. Check database query performance:
```bash
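uv run manage.py shell -c "import time; from django.db import connection; start = time.perf_counter(); connection.cursor().execute('SELECT 1'); print(f'{(time.perf_counter() - start) * 1000:.1f} ms')"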
```
2. Check cache response time:
```bash
redis-cli --latency
```
### Missing Metrics
Ensure `psutil` is installed for system metrics:
```bash
uv add psutil
```
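
One way a metrics collector can degrade gracefully when `psutil` is absent is to treat it as an optional import. The sketch below is an assumption about how such a collector could look, not ThrillWiki's actual code; `system_metrics` is a hypothetical name, and the output keys follow the `system` section of the comprehensive response.

```python
from typing import Any, Dict

try:
    import psutil  # optional dependency
except ImportError:
    psutil = None


def system_metrics(disk_path: str = "/") -> Dict[str, Any]:
    """Collect memory, CPU, and disk figures, or {} when psutil is unavailable."""
    if psutil is None:
        return {}
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage(disk_path)
    mb, gb = 1024 ** 2, 1024 ** 3
    return {
        "memory": {
            "total_mb": memory.total // mb,
            "available_mb": memory.available // mb,
            "percent_used": memory.percent,
        },
        "cpu": {
            # interval=None is non-blocking; it compares against the previous call.
            "percent_used": psutil.cpu_percent(interval=None),
            "core_count": psutil.cpu_count(),
        },
        "disk": {
            "total_gb": disk.total // gb,
            "free_gb": disk.free // gb,
            "percent_used": disk.percent,
        },
    }
```

An empty result from a collector like this would show up as a missing `system` section, which is the symptom described above.
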
## Best Practices
1. **Use simple endpoint for load balancers**: The `/simple/` endpoint is lightweight and fast
2. **Monitor comprehensive endpoint**: Use `/health/` for detailed monitoring dashboards
3. **Set appropriate timeouts**: Health check timeouts should be shorter than intervals
4. **Alert on degraded state**: Don't wait for complete failure
5. **Log health check failures**: Include health status in application logs