# ThrillWiki Monitoring Setup
## Overview
This document describes the automatic metric collection system for anomaly detection and system monitoring.
## Architecture
The system collects metrics from two sources:
1. **Django Backend (Celery Tasks)**: Collects Django-specific metrics like error rates, response times, queue sizes
2. **Supabase Edge Function**: Collects Supabase-specific metrics like API errors, rate limits, submission queues
## Components
### Django Components
#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)
- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates
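The recording path can be sketched as a small helper that shapes one sample as a row for the `metric_time_series` table. Note the column names used here (`metric_name`, `value`, `category`, `recorded_at`) are illustrative assumptions, not the confirmed schema:

```python
from datetime import datetime, timezone

def build_metric_row(name: str, value: float, category: str = "performance") -> dict:
    """Shape one metric sample as a row for the metric_time_series table.

    Column names are assumptions; match them to the real table schema.
    """
    return {
        "metric_name": name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# The collector would then insert the row via the Supabase client, e.g.:
# supabase.table("metric_time_series").insert(build_metric_row("error_rate", 2.5)).execute()
```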
#### 2. Celery Tasks (`apps/monitoring/tasks.py`)
Periodic background tasks:
- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health
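The core calculation behind a task like `collect_error_metrics` reduces to a percentage over counters; a minimal sketch of that computation (the real task also reads log counters and records the result):

```python
def error_rate(error_count: int, total_count: int) -> float:
    """Percentage of requests (or log lines) that were errors."""
    if total_count == 0:
        return 0.0  # avoid division by zero when no traffic was seen
    return round(100.0 * error_count / total_count, 2)

# A task such as collect_error_metrics could compute this from counters
# and record the result as the `error_rate` metric.
```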
#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)
- Tracks API response times for every request
- Records errors and exceptions
- Updates cache with performance data
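A minimal sketch of such timing middleware, assuming the standard Django middleware call protocol; the real implementation in `apps/monitoring/middleware.py` may differ, and `record_metric` here is a stand-in for the actual recorder:

```python
import time

def record_metric(name, value):
    """Stand-in recorder; the real middleware writes to the cache/Supabase."""
    print(f"{name}={value:.1f}")

class MetricsMiddleware:
    """Times every request and records the elapsed milliseconds."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        record_metric("api_response_time", elapsed_ms)
        return response
```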
### Supabase Components
#### Edge Function (`supabase/functions/collect-metrics`)
Collects Supabase-specific metrics:
- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times
## Setup Instructions
### 1. Django Setup
Add the monitoring app to your Django `INSTALLED_APPS`:
```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```
Add the metrics middleware to `MIDDLEWARE`:
```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```
Import and use the Celery Beat schedule in your Django settings:
```python
# Importing the schedule into settings is all Celery needs.
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE  # noqa: F401
```
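For reference, the schedule is a plain dict mapping entry names to task paths and intervals. The 60-second interval matches the per-minute collection described above, and the task path follows `apps/monitoring/tasks.py`; the `monitoring` queue name matches the worker command below, but the exact entry shape is an assumption:

```python
# Sketch of a beat schedule entry; intervals are in seconds.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": 60.0,  # collect every minute
        "options": {"queue": "monitoring"},
    },
}
```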
Configure environment variables:
```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```
### 2. Start Celery Workers
Start Celery worker for processing tasks:
```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```
Start Celery Beat for periodic task scheduling:
```bash
celery -A config beat -l info
```
### 3. Supabase Edge Function Setup
The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase (this requires the `pg_cron` and `pg_net` extensions):
```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```
### 4. Anomaly Detection Setup
The `detect-anomalies` edge function should also run periodically:
```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```
### 5. Data Retention Cleanup Setup
The `data-retention-cleanup` edge function should run daily:
```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```
## Metrics Collected
### Django Metrics
- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)
### Supabase Metrics
- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)
## Data Retention Policies
The system automatically cleans up old data to manage database size:
### Retention Periods
- **Metrics** (`metric_time_series`): 30 days
- **Anomaly Detections**: 30 days (resolved anomalies archived after 7 days)
- **Resolved Alerts**: 90 days
- **Resolved Incidents**: 90 days
### Cleanup Functions
The following database functions manage data retention:
1. **`cleanup_old_metrics(retention_days)`**: Deletes metrics older than specified days (default: 30)
2. **`cleanup_old_anomalies(retention_days)`**: Archives resolved anomalies and deletes old unresolved ones (default: 30)
3. **`cleanup_old_alerts(retention_days)`**: Deletes old resolved alerts (default: 90)
4. **`cleanup_old_incidents(retention_days)`**: Deletes old resolved incidents (default: 90)
5. **`run_data_retention_cleanup()`**: Master function that runs all cleanup operations
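From Python, the master function can be triggered as an RPC. This sketch deliberately accepts any client that exposes a supabase-py-style `rpc(name, params).execute()` interface rather than hard-coding one:

```python
def run_retention_cleanup(client):
    """Invoke the master cleanup database function via RPC.

    `client` is any object exposing rpc(name, params).execute(),
    e.g. a supabase-py Client built with the service role key.
    """
    return client.rpc("run_data_retention_cleanup", {}).execute()
```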
### Automated Cleanup Schedule
Django Celery tasks run retention cleanup automatically:
- Full cleanup: Daily at 3:00 AM
- Metrics cleanup: Daily at 3:30 AM
- Anomaly cleanup: Daily at 4:00 AM
View retention statistics in the Admin Dashboard's Data Retention panel.
## Monitoring
View collected metrics in the Admin Monitoring Dashboard:
- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health
## Troubleshooting
### No metrics being collected
1. Check that Celery workers are running:
   ```bash
   celery -A config inspect active
   ```
2. Check that Celery Beat is running:
   ```bash
   celery -A config inspect scheduled
   ```
3. Verify that the required environment variables are set
4. Check the logs for errors:
   ```bash
   tail -f logs/celery.log
   ```
### Edge function not collecting metrics
1. Verify cron job is scheduled in Supabase
2. Check edge function logs in Supabase dashboard
3. Verify service role key is correct
4. Test edge function manually
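Step 4 can be done with a short script. This sketch builds the request with the standard library's `urllib`; the URL follows the cron examples above, and the anon key placeholder must be replaced with a real key:

```python
import json
import urllib.request

def build_collect_metrics_request(anon_key: str) -> urllib.request.Request:
    """Build a manual test request for the collect-metrics edge function."""
    return urllib.request.Request(
        "https://api.thrillwiki.com/functions/v1/collect-metrics",
        data=json.dumps({"manual": True}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {anon_key}",
        },
        method="POST",
    )

# To actually send it:
# with urllib.request.urlopen(build_collect_metrics_request("YOUR_ANON_KEY")) as resp:
#     print(resp.status, resp.read())
```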
## Production Considerations
1. **Resource Usage**: Collecting metrics every minute generates a significant volume of database writes. Consider reducing the collection frequency in production.
2. **Data Retention**: The automated retention cleanup described above caps metrics at 30 days; verify it is actually running so database size stays bounded.
3. **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.
4. **Scaling**: As traffic grows, consider moving to a time-series database like TimescaleDB or InfluxDB.
5. **Monitoring the Monitors**: Set up external health checks to ensure metric collection is working.