# ThrillWiki Monitoring Setup

## Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.

## Architecture

The system collects metrics from two sources:

1. **Django Backend (Celery Tasks)**: Collects Django-specific metrics such as error rates, response times, and queue sizes
2. **Supabase Edge Function**: Collects Supabase-specific metrics such as API errors, rate limits, and submission queues

## Components

### Django Components

#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)

- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates

#### 2. Celery Tasks (`apps/monitoring/tasks.py`)

Periodic background tasks:

- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health

#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)

- Tracks API response times for every request
- Records errors and exceptions
- Updates the cache with performance data

### Supabase Components

#### Edge Function (`supabase/functions/collect-metrics`)

Collects Supabase-specific metrics:

- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times

## Setup Instructions

### 1. Django Setup

Add the monitoring app to your Django `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```

Add the metrics middleware to `MIDDLEWARE`:

```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```

Import the Celery Beat schedule in your Django settings:

```python
# The import alone binds CELERY_BEAT_SCHEDULE as a setting;
# no reassignment is needed.
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE
```

Configure environment variables:

```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```

### 2. Start Celery Workers

Start a Celery worker to process tasks:

```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```

Start Celery Beat for periodic task scheduling:

```bash
celery -A config beat -l info
```

### 3. Supabase Edge Function Setup

The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase:

```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```

### 4. Anomaly Detection Setup

The `detect-anomalies` edge function should also run periodically:

```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```

### 5. Data Retention Cleanup Setup

The `data-retention-cleanup` edge function should run daily:

```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);
```

## Metrics Collected

### Django Metrics

- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)

### Supabase Metrics

- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)

## Data Retention Policies

The system automatically cleans up old data to manage database size.

### Retention Periods

- **Metrics** (`metric_time_series`): 30 days
- **Anomaly Detections**: 30 days (resolved anomalies archived after 7 days)
- **Resolved Alerts**: 90 days
- **Resolved Incidents**: 90 days

### Cleanup Functions

The following database functions manage data retention:

1. **`cleanup_old_metrics(retention_days)`**: Deletes metrics older than the specified number of days (default: 30)
2. **`cleanup_old_anomalies(retention_days)`**: Archives resolved anomalies and deletes old unresolved ones (default: 30)
3. **`cleanup_old_alerts(retention_days)`**: Deletes old resolved alerts (default: 90)
4. **`cleanup_old_incidents(retention_days)`**: Deletes old resolved incidents (default: 90)
5. **`run_data_retention_cleanup()`**: Master function that runs all cleanup operations

### Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

- Full cleanup: Daily at 3:00 AM
- Metrics cleanup: Daily at 3:30 AM
- Anomaly cleanup: Daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.

## Monitoring

View collected metrics in the Admin Monitoring Dashboard:

- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health

## Troubleshooting

### No metrics being collected

1. Check that Celery workers are running:

   ```bash
   celery -A config inspect active
   ```

2. Check that Celery Beat is running:

   ```bash
   celery -A config inspect scheduled
   ```

3. Verify that the environment variables are set
4. Check the logs for errors:

   ```bash
   tail -f logs/celery.log
   ```

### Edge function not collecting metrics

1. Verify the cron job is scheduled in Supabase
2. Check the edge function logs in the Supabase dashboard
3. Verify the service role key is correct
4. Test the edge function manually

## Production Considerations

1. **Resource Usage**: Collecting metrics every minute generates significant database writes. Consider reducing the collection frequency in production.
2. **Data Retention**: Set up periodic cleanup of old metrics (older than 30 days) to manage database size.
3. **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.
4. **Scaling**: As traffic grows, consider moving to a time-series database such as TimescaleDB or InfluxDB.
5. **Monitoring the Monitors**: Set up external health checks to ensure metric collection is working.
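The "monitoring the monitors" check can be a small external probe that asks whether any fresh rows are landing in `metric_time_series`. The sketch below is a minimal example, not part of the shipped codebase: it assumes the table exposes a `recorded_at` timestamp column through the Supabase REST API (`/rest/v1/`) and reuses the `SUPABASE_URL` / `SUPABASE_SERVICE_ROLE_KEY` environment variables configured above; adjust the column name and query to the actual schema.

```python
# External staleness probe for the metric pipeline (hypothetical sketch).
# Assumes metric_time_series has a `recorded_at` timestamp column exposed
# via the Supabase REST API; adapt to the real schema before use.
import json
import os
import urllib.request
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # alert if no metric is newer than this


def is_stale(last_recorded_at: datetime, now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """Return True when the newest metric is older than max_age."""
    return (now - last_recorded_at) > max_age


def check_metrics(supabase_url: str, service_key: str) -> bool:
    """Fetch the newest metric_time_series row and report staleness."""
    req = urllib.request.Request(
        f"{supabase_url}/rest/v1/metric_time_series"
        "?select=recorded_at&order=recorded_at.desc&limit=1",
        headers={"apikey": service_key,
                 "Authorization": f"Bearer {service_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        rows = json.load(resp)
    if not rows:
        return True  # no metrics at all also counts as stale
    last = datetime.fromisoformat(rows[0]["recorded_at"].replace("Z", "+00:00"))
    return is_stale(last, datetime.now(timezone.utc))


if __name__ == "__main__":
    stale = check_metrics(os.environ["SUPABASE_URL"],
                          os.environ["SUPABASE_SERVICE_ROLE_KEY"])
    print("STALE" if stale else "OK")
```

Run it from a host outside the main deployment (a cron job or uptime service) so a failure of the Celery workers or edge functions is caught by something they cannot take down with them.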