mirror of
https://github.com/pacnpal/thrilltrack-explorer.git
synced 2025-12-20 10:31:13 -05:00
# ThrillWiki Monitoring Setup
## Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.

## Architecture

The system collects metrics from two sources:

1. **Django Backend (Celery Tasks)**: Collects Django-specific metrics like error rates, response times, queue sizes
2. **Supabase Edge Function**: Collects Supabase-specific metrics like API errors, rate limits, submission queues

## Components

### Django Components

#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)

- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates

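Recording a metric ultimately means inserting a row into `metric_time_series`. A minimal sketch, assuming hypothetical column names (`metric_name`, `value`, `category`, `recorded_at`) and the `supabase-py` client; the actual collector may differ:

```python
from datetime import datetime, timezone

def build_metric_row(name: str, value: float, category: str) -> dict:
    """Build a row for metric_time_series (column names are assumptions)."""
    return {
        "metric_name": name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# With a supabase-py client (not created here):
# supabase.table("metric_time_series").insert(
#     build_metric_row("cache_hit_rate", 97.5, "performance")
# ).execute()
```
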
#### 2. Celery Tasks (`apps/monitoring/tasks.py`)

Periodic background tasks:

- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health

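The Beat entries behind these tasks can be sketched as a schedule dict. The task paths follow from `apps/monitoring/tasks.py`, but the intervals and queue routing shown here are assumptions; the real values live in `config/celery_beat_schedule.py`:

```python
from datetime import timedelta

# Hypothetical sketch of the beat schedule; actual entries live in
# config/celery_beat_schedule.py.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": timedelta(minutes=1),  # "collects all metrics every minute"
        "options": {"queue": "monitoring"},
    },
    "collect-queue-metrics": {
        "task": "apps.monitoring.tasks.collect_queue_metrics",
        "schedule": timedelta(minutes=5),  # assumed interval
        "options": {"queue": "monitoring"},
    },
}
```
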
#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)

- Tracks API response times for every request
- Records errors and exceptions
- Updates cache with performance data

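A Django middleware of this kind is just a callable that wraps `get_response`. A minimal timing sketch, where the `record_response_time` hook is hypothetical (the real `MetricsMiddleware` also records errors and writes to the cache):

```python
import time

class MetricsMiddleware:
    """Sketch of a request-timing middleware; illustrative only."""

    def __init__(self, get_response, record_response_time=None):
        self.get_response = get_response
        # Hypothetical hook; the real middleware updates the cache/Supabase.
        self.record = record_response_time or (lambda path, ms: None)

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        self.record(getattr(request, "path", "?"), elapsed_ms)
        return response
```
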
### Supabase Components

#### Edge Function (`supabase/functions/collect-metrics`)

Collects Supabase-specific metrics:

- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times

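As an illustration, the approval-rate metric can be derived from submission statuses. This Python sketch mirrors what the edge function computes; the status values `approved`/`rejected` and the exclusion of pending submissions from the denominator are assumptions:

```python
def submission_approval_rate(statuses: list[str]) -> float:
    """Percentage of moderated submissions that were approved.

    Status values 'approved'/'rejected' are assumed; pending
    submissions are excluded from the denominator.
    """
    moderated = [s for s in statuses if s in ("approved", "rejected")]
    if not moderated:
        return 0.0
    approved = sum(1 for s in moderated if s == "approved")
    return 100.0 * approved / len(moderated)
```
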
## Setup Instructions

### 1. Django Setup

Add the monitoring app to your Django `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```

Add the metrics middleware to `MIDDLEWARE`:

```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```

Import the Celery Beat schedule in your Django settings:

```python
# The import alone binds CELERY_BEAT_SCHEDULE in settings.
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE
```

Configure environment variables:

```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```

### 2. Start Celery Workers

Start a Celery worker to process tasks:

```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```

Start Celery Beat for periodic task scheduling:

```bash
celery -A config beat -l info
```

### 3. Supabase Edge Function Setup

The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase:

```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

### 4. Anomaly Detection Setup

The `detect-anomalies` edge function should also run periodically:

```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

### 5. Data Retention Cleanup Setup

The `data-retention-cleanup` edge function should run daily:

```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

## Metrics Collected

### Django Metrics

- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)

### Supabase Metrics

- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)

## Data Retention Policies

The system automatically cleans up old data to manage database size.

### Retention Periods

- **Metrics** (`metric_time_series`): 30 days
- **Anomaly Detections**: 30 days (resolved anomalies archived after 7 days)
- **Resolved Alerts**: 90 days
- **Resolved Incidents**: 90 days

### Cleanup Functions

The following database functions manage data retention:

1. **`cleanup_old_metrics(retention_days)`**: Deletes metrics older than the specified number of days (default: 30)
2. **`cleanup_old_anomalies(retention_days)`**: Archives resolved anomalies and deletes old unresolved ones (default: 30)
3. **`cleanup_old_alerts(retention_days)`**: Deletes old resolved alerts (default: 90)
4. **`cleanup_old_incidents(retention_days)`**: Deletes old resolved incidents (default: 90)
5. **`run_data_retention_cleanup()`**: Master function that runs all cleanup operations

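From Python these functions can be invoked through PostgREST RPC. A hedged sketch using `supabase-py` (client setup omitted; the `retention_days` payload key mirrors the function argument, but check the actual function signatures):

```python
def cleanup_calls(metric_days: int = 30, alert_days: int = 90) -> list[tuple[str, dict]]:
    """Return the (function, params) pairs for a full retention pass."""
    return [
        ("cleanup_old_metrics", {"retention_days": metric_days}),
        ("cleanup_old_anomalies", {"retention_days": metric_days}),
        ("cleanup_old_alerts", {"retention_days": alert_days}),
        ("cleanup_old_incidents", {"retention_days": alert_days}),
    ]

# With a supabase-py client (not created here), run each step:
# for fn, params in cleanup_calls():
#     supabase.rpc(fn, params).execute()
# or call the master function in one shot:
# supabase.rpc("run_data_retention_cleanup").execute()
```
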
### Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

- Full cleanup: daily at 3:00 AM
- Metrics cleanup: daily at 3:30 AM
- Anomaly cleanup: daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.

## Monitoring

View collected metrics in the Admin Monitoring Dashboard:

- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health

## Troubleshooting

### No metrics being collected

1. Check that Celery workers are running:

   ```bash
   celery -A config inspect active
   ```

2. Check that Celery Beat is running:

   ```bash
   celery -A config inspect scheduled
   ```

3. Verify that environment variables are set

4. Check logs for errors:

   ```bash
   tail -f logs/celery.log
   ```

### Edge function not collecting metrics

1. Verify the cron job is scheduled in Supabase
2. Check the edge function logs in the Supabase dashboard
3. Verify the service role key is correct
4. Test the edge function manually

## Production Considerations

1. **Resource Usage**: Collecting metrics every minute generates significant database writes. Consider reducing the collection frequency in production.

2. **Data Retention**: Keep the periodic cleanup of old metrics (older than 30 days) enabled to manage database size.

3. **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.

4. **Scaling**: As traffic grows, consider moving to a time-series database such as TimescaleDB or InfluxDB.

5. **Monitoring the Monitors**: Set up external health checks to ensure metric collection keeps working.