ThrillWiki Monitoring Setup

Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.

Architecture

The system collects metrics from two sources:

  1. Django Backend (Celery Tasks): Collects Django-specific metrics like error rates, response times, queue sizes
  2. Supabase Edge Function: Collects Supabase-specific metrics like API errors, rate limits, submission queues

Components

Django Components

1. Metrics Collector (apps/monitoring/metrics_collector.py)

  • Collects system metrics from various sources
  • Records metrics to Supabase metric_time_series table
  • Provides utilities for tracking:
    • Error rates
    • API response times
    • Celery queue sizes
    • Database connection counts
    • Cache hit rates
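
In outline, the collector's write path builds a row for the metric_time_series table and hands it to the Supabase client. The helper below is a sketch, not the actual implementation; the column names (metric_name, value, category, recorded_at) are assumptions, so check them against the real table schema.

```python
from datetime import datetime, timezone

def build_metric_row(metric_name: str, value: float, category: str) -> dict:
    """Build a row for the metric_time_series table.

    Column names here are illustrative; verify them against the schema.
    """
    return {
        "metric_name": metric_name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: an error-rate sample, ready to insert via the Supabase client
row = build_metric_row("error_rate", 1.8, "performance")
```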

2. Celery Tasks (apps/monitoring/tasks.py)

Periodic background tasks:

  • collect_system_metrics: Collects all metrics every minute
  • collect_error_metrics: Tracks error rates
  • collect_performance_metrics: Tracks response times and cache performance
  • collect_queue_metrics: Monitors Celery queue health
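
A minimal sketch of how tasks like these might be wired into config/celery_beat_schedule.py; the task paths, one-minute intervals, and queue name mirror the descriptions above, but the actual file may differ.

```python
from datetime import timedelta

# Celery Beat accepts timedelta (or crontab) objects as schedules.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": timedelta(minutes=1),      # every minute
        "options": {"queue": "monitoring"},
    },
    "collect-queue-metrics": {
        "task": "apps.monitoring.tasks.collect_queue_metrics",
        "schedule": timedelta(minutes=1),
        "options": {"queue": "monitoring"},
    },
}
```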

3. Metrics Middleware (apps/monitoring/middleware.py)

  • Tracks API response times for every request
  • Records errors and exceptions
  • Updates cache with performance data
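
In outline, the middleware wraps each request with a wall-clock timer, roughly as below. This is a sketch of the pattern, not the actual MetricsMiddleware; real code would push the sample into the cache or metrics collector rather than a plain list.

```python
import time

class TimingMiddleware:
    """Sketch of a Django-style middleware that records response times."""

    def __init__(self, get_response, sink=None):
        self.get_response = get_response
        self.sink = sink if sink is not None else []  # stand-in for the cache

    def __call__(self, request):
        start = time.perf_counter()
        response = self.get_response(request)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.sink.append(elapsed_ms)  # record the sample in milliseconds
        return response
```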

Supabase Components

Edge Function (supabase/functions/collect-metrics)

Collects Supabase-specific metrics:

  • API error counts
  • Rate limit violations
  • Pending submissions
  • Active incidents
  • Unresolved alerts
  • Submission approval rates
  • Average moderation times

Setup Instructions

1. Django Setup

Add the monitoring app to your Django INSTALLED_APPS:

INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]

Add the metrics middleware to MIDDLEWARE:

MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]

Import the Celery Beat schedule in your Django settings (the import alone defines the CELERY_BEAT_SCHEDULE setting; no reassignment is needed):

from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE

Configure environment variables:

SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
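
In Django settings these can be read with os.environ; failing fast on a missing key surfaces misconfiguration at startup. This is a common pattern, not necessarily what this project's settings module does.

```python
import os

def require_env(name: str) -> str:
    """Return the environment variable's value or fail fast at startup."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# In settings.py:
# SUPABASE_URL = require_env("SUPABASE_URL")
# SUPABASE_SERVICE_ROLE_KEY = require_env("SUPABASE_SERVICE_ROLE_KEY")
```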

2. Start Celery Workers

Start Celery worker for processing tasks:

celery -A config worker -l info -Q monitoring,maintenance,analytics

Start Celery Beat for periodic task scheduling:

celery -A config beat -l info

3. Supabase Edge Function Setup

The collect-metrics edge function should be called periodically. Set up a cron job in Supabase:

SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);

4. Anomaly Detection Setup

The detect-anomalies edge function should also run periodically:

SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);

5. Data Retention Cleanup Setup

The data-retention-cleanup edge function should run daily:

SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) as request_id;
  $$
);

Metrics Collected

Django Metrics

  • error_rate: Percentage of error logs (performance)
  • api_response_time: Average API response time in ms (performance)
  • celery_queue_size: Number of queued Celery tasks (system)
  • database_connections: Active database connections (system)
  • cache_hit_rate: Cache hit percentage (performance)

Supabase Metrics

  • api_error_count: Recent API errors (performance)
  • rate_limit_violations: Rate limit blocks (security)
  • pending_submissions: Submissions awaiting moderation (workflow)
  • active_incidents: Open/investigating incidents (monitoring)
  • unresolved_alerts: Unresolved system alerts (monitoring)
  • submission_approval_rate: Percentage of approved submissions (workflow)
  • avg_moderation_time: Average time to moderate in minutes (workflow)
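
The two derived workflow metrics are straightforward to compute from submission rows. A sketch, assuming each submission carries a status and a (submitted, moderated) timestamp pair; the field names and status values are illustrative, not taken from the actual schema.

```python
from datetime import datetime

def approval_rate(statuses):
    """Percentage of decided submissions that were approved."""
    decided = [s for s in statuses if s in ("approved", "rejected")]
    if not decided:
        return 0.0
    return 100.0 * decided.count("approved") / len(decided)

def avg_moderation_minutes(pairs):
    """Average gap between (submitted_at, moderated_at) pairs, in minutes."""
    if not pairs:
        return 0.0
    total = sum((done - sub).total_seconds() for sub, done in pairs)
    return total / len(pairs) / 60.0
```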

Data Retention Policies

The system automatically cleans up old data to manage database size:

Retention Periods

  • Metrics (metric_time_series): 30 days
  • Anomaly Detections: 30 days (resolved anomalies archived after 7 days)
  • Resolved Alerts: 90 days
  • Resolved Incidents: 90 days

Cleanup Functions

The following database functions manage data retention:

  1. cleanup_old_metrics(retention_days): Deletes metrics older than specified days (default: 30)
  2. cleanup_old_anomalies(retention_days): Archives resolved anomalies and deletes old unresolved ones (default: 30)
  3. cleanup_old_alerts(retention_days): Deletes old resolved alerts (default: 90)
  4. cleanup_old_incidents(retention_days): Deletes old resolved incidents (default: 90)
  5. run_data_retention_cleanup(): Master function that runs all cleanup operations
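
The actual cleanup runs as SQL functions, but the cutoff logic they share is simple and can be illustrated in Python. The 'created_at' field name is an assumption for the sketch.

```python
from datetime import datetime, timedelta, timezone

def retention_cutoff(retention_days, now=None):
    """Rows with a timestamp older than this cutoff are eligible for deletion."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)

def partition_rows(rows, retention_days, now=None):
    """Split rows (dicts with a 'created_at' datetime) into (keep, delete)."""
    cutoff = retention_cutoff(retention_days, now)
    keep = [r for r in rows if r["created_at"] >= cutoff]
    delete = [r for r in rows if r["created_at"] < cutoff]
    return keep, delete
```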

Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

  • Full cleanup: Daily at 3:00 AM
  • Metrics cleanup: Daily at 3:30 AM
  • Anomaly cleanup: Daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.

Monitoring

View collected metrics in the Admin Monitoring Dashboard:

  • Navigate to /admin/monitoring
  • View anomaly detections, alerts, and incidents
  • Manually trigger metric collection or anomaly detection
  • View real-time system health

Troubleshooting

No metrics being collected

  1. Check that Celery workers are running:

    celery -A config inspect active
    
  2. Check that Celery Beat is running:

    celery -A config inspect scheduled
    
  3. Verify that the required environment variables (SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY) are set

  4. Check logs for errors:

    tail -f logs/celery.log
    

Edge function not collecting metrics

  1. Verify cron job is scheduled in Supabase
  2. Check edge function logs in Supabase dashboard
  3. Verify service role key is correct
  4. Test edge function manually

Production Considerations

  1. Resource Usage: Collecting metrics every minute generates significant database writes. Consider adjusting frequency for production.

  2. Data Retention: The automated cleanup above already removes metrics older than 30 days; verify that the daily cron job and Celery cleanup tasks are actually running in production.

  3. Alert Fatigue: Fine-tune anomaly detection sensitivity to reduce false positives.

  4. Scaling: As traffic grows, consider moving to a time-series database like TimescaleDB or InfluxDB.

  5. Monitoring the Monitors: Set up external health checks to ensure metric collection is working.