# ThrillWiki Monitoring Setup

## Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.
## Architecture

The system collects metrics from two sources:

- **Django Backend (Celery Tasks)**: Collects Django-specific metrics such as error rates, response times, and queue sizes
- **Supabase Edge Function**: Collects Supabase-specific metrics such as API errors, rate limits, and submission queues
## Components

### Django Components

#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)

- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates
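As an illustration of what the collector records, a metric row can be sketched like this (the helper name and column names are assumptions, not the actual `metric_time_series` schema):

```python
from datetime import datetime, timezone

def build_metric_row(metric_name: str, value: float, category: str) -> dict:
    """Build one row destined for the metric_time_series table (columns illustrative)."""
    return {
        "metric_name": metric_name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# The collector would insert rows like this via the Supabase client.
row = build_metric_row("cache_hit_rate", 92.5, "performance")
```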
#### 2. Celery Tasks (`apps/monitoring/tasks.py`)

Periodic background tasks:

- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health
#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)

- Tracks API response times for every request
- Records errors and exceptions
- Updates the cache with performance data
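The per-request timing pattern can be sketched as a minimal Django-style middleware; this is an illustration, not the actual `MetricsMiddleware` implementation:

```python
import time

class TimingMiddlewareSketch:
    """Minimal Django-style middleware that times each request (illustrative only)."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # The real middleware records this value to the cache and the
        # metric_time_series table; here we just expose it on the response.
        response["X-Response-Time-Ms"] = f"{elapsed_ms:.1f}"
        return response
```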
### Supabase Components

#### Edge Function (`supabase/functions/collect-metrics`)

Collects Supabase-specific metrics:

- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times
## Setup Instructions

### 1. Django Setup

Add the monitoring app to your Django `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```
Add the metrics middleware to `MIDDLEWARE`:

```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```
Import the Celery Beat schedule in your Django settings (the import itself binds the `CELERY_BEAT_SCHEDULE` setting; no further assignment is needed):

```python
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE
```
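If `config/celery_beat_schedule.py` is not yet present, a schedule of the same shape can be defined directly; the task paths below follow `apps/monitoring/tasks.py`, while the entry names, intervals, and queue option are illustrative:

```python
# Illustrative Celery Beat schedule; intervals are in seconds.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": 60.0,  # every minute
        "options": {"queue": "monitoring"},
    },
    "collect-queue-metrics": {
        "task": "apps.monitoring.tasks.collect_queue_metrics",
        "schedule": 60.0,
        "options": {"queue": "monitoring"},
    },
}
```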
Configure environment variables:

```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```
### 2. Start Celery Workers

Start a Celery worker for processing tasks:

```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```

Start Celery Beat for periodic task scheduling:

```bash
celery -A config beat -l info
```
### 3. Supabase Edge Function Setup

The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase:

```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
### 4. Anomaly Detection Setup

The `detect-anomalies` edge function should also run periodically:

```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
### 5. Data Retention Cleanup Setup

The `data-retention-cleanup` edge function should run daily:

```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
## Metrics Collected

### Django Metrics

- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)
### Supabase Metrics

- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)
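For example, the submission approval rate reduces to a simple percentage; this helper illustrates the calculation but is not the edge function's actual code:

```python
def submission_approval_rate(approved: int, total: int) -> float:
    """Percentage of approved submissions; 0.0 when there are no submissions."""
    return (approved / total) * 100.0 if total else 0.0
```

The guard for `total == 0` matters in practice: a quiet window with no moderated submissions should report 0.0 rather than raise a division error.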
## Data Retention Policies

The system automatically cleans up old data to manage database size.

### Retention Periods

- Metrics (`metric_time_series`): 30 days
- Anomaly Detections: 30 days (resolved anomalies archived after 7 days)
- Resolved Alerts: 90 days
- Resolved Incidents: 90 days
### Cleanup Functions

The following database functions manage data retention:

- `cleanup_old_metrics(retention_days)`: Deletes metrics older than the specified number of days (default: 30)
- `cleanup_old_anomalies(retention_days)`: Archives resolved anomalies and deletes old unresolved ones (default: 30)
- `cleanup_old_alerts(retention_days)`: Deletes old resolved alerts (default: 90)
- `cleanup_old_incidents(retention_days)`: Deletes old resolved incidents (default: 90)
- `run_data_retention_cleanup()`: Master function that runs all cleanup operations
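Each cleanup function deletes or archives rows older than a cutoff derived from `retention_days`; the cutoff computation amounts to the following sketch (illustrative Python, not the SQL functions themselves):

```python
from datetime import datetime, timedelta, timezone

def retention_cutoff(retention_days: int) -> datetime:
    """Rows recorded before this timestamp are eligible for deletion or archival."""
    return datetime.now(timezone.utc) - timedelta(days=retention_days)

metrics_cutoff = retention_cutoff(30)  # default for metric_time_series
alerts_cutoff = retention_cutoff(90)   # default for resolved alerts/incidents
```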
### Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

- Full cleanup: daily at 3:00 AM
- Metrics cleanup: daily at 3:30 AM
- Anomaly cleanup: daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.
## Monitoring

View collected metrics in the Admin Monitoring Dashboard:

- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health
## Troubleshooting

### No metrics being collected

- Check Celery workers are running: `celery -A config inspect active`
- Check Celery Beat is running: `celery -A config inspect scheduled`
- Verify environment variables are set
- Check logs for errors: `tail -f logs/celery.log`
### Edge function not collecting metrics

- Verify the cron job is scheduled in Supabase
- Check the edge function logs in the Supabase dashboard
- Verify the service role key is correct
- Test the edge function manually
## Production Considerations

- **Resource Usage**: Collecting metrics every minute generates significant database writes. Consider reducing the collection frequency in production.
- **Data Retention**: Set up periodic cleanup of old metrics (older than 30 days) to manage database size.
- **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.
- **Scaling**: As traffic grows, consider moving to a time-series database such as TimescaleDB or InfluxDB.
- **Monitoring the Monitors**: Set up external health checks to ensure metric collection itself keeps working.