# ThrillWiki Monitoring Setup

## Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.
## Architecture

The system collects metrics from two sources:

- **Django Backend (Celery Tasks)**: Collects Django-specific metrics such as error rates, response times, and queue sizes
- **Supabase Edge Function**: Collects Supabase-specific metrics such as API errors, rate limits, and submission queues
## Components

### Django Components

#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)

- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates
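As an illustration of what the collector records, a metric row can be sketched like this (the helper name and column names are assumptions, not the actual `metric_time_series` schema):

```python
from datetime import datetime, timezone

def build_metric_row(metric_name: str, value: float, category: str) -> dict:
    """Build one row destined for the metric_time_series table (columns illustrative)."""
    return {
        "metric_name": metric_name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# The collector would insert rows like this via the Supabase client.
row = build_metric_row("cache_hit_rate", 92.5, "performance")
```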
#### 2. Celery Tasks (`apps/monitoring/tasks.py`)

Periodic background tasks:

- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health
#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)

- Tracks API response times for every request
- Records errors and exceptions
- Updates the cache with performance data
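The per-request timing pattern can be sketched as a minimal Django-style middleware; this is an illustration, not the actual `MetricsMiddleware` implementation:

```python
import time

class TimingMiddlewareSketch:
    """Minimal Django-style middleware that times each request (illustrative only)."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # The real middleware records this value to the cache and the
        # metric_time_series table; here we just expose it on the response.
        response["X-Response-Time-Ms"] = f"{elapsed_ms:.1f}"
        return response
```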
### Supabase Components

#### Edge Function (`supabase/functions/collect-metrics`)

Collects Supabase-specific metrics:

- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times
## Setup Instructions

### 1. Django Setup

Add the monitoring app to your Django `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```
Add the metrics middleware to `MIDDLEWARE`:

```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```
Import the Celery Beat schedule in your Django settings (the import itself binds the `CELERY_BEAT_SCHEDULE` setting; no further assignment is needed):

```python
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE
```
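If `config/celery_beat_schedule.py` is not yet present, a schedule of the same shape can be defined directly; the task paths below follow `apps/monitoring/tasks.py`, while the entry names, intervals, and queue option are illustrative:

```python
# Illustrative Celery Beat schedule; intervals are in seconds.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": 60.0,  # every minute
        "options": {"queue": "monitoring"},
    },
    "collect-queue-metrics": {
        "task": "apps.monitoring.tasks.collect_queue_metrics",
        "schedule": 60.0,
        "options": {"queue": "monitoring"},
    },
}
```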
Configure environment variables:

```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```
### 2. Start Celery Workers

Start a Celery worker for processing tasks:

```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```

Start Celery Beat for periodic task scheduling:

```bash
celery -A config beat -l info
```
### 3. Supabase Edge Function Setup

The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase:

```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
### 4. Anomaly Detection Setup

The `detect-anomalies` edge function should also run periodically:

```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
### 5. Data Retention Cleanup Setup

The `data-retention-cleanup` edge function should run daily:

```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```
## Metrics Collected

### Django Metrics

- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)
### Supabase Metrics

- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)
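For example, the submission approval rate reduces to a simple percentage; this helper illustrates the calculation but is not the edge function's actual code:

```python
def submission_approval_rate(approved: int, total: int) -> float:
    """Percentage of approved submissions; 0.0 when there are no submissions."""
    return (approved / total) * 100.0 if total else 0.0
```

The guard for `total == 0` matters in practice: a quiet window with no moderated submissions should report 0.0 rather than raise a division error.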
## Data Retention Policies

The system automatically cleans up old data to manage database size.

### Retention Periods

- Metrics (`metric_time_series`): 30 days
- Anomaly Detections: 30 days (resolved anomalies archived after 7 days)
- Resolved Alerts: 90 days
- Resolved Incidents: 90 days
### Cleanup Functions

The following database functions manage data retention:

- `cleanup_old_metrics(retention_days)`: Deletes metrics older than the specified number of days (default: 30)
- `cleanup_old_anomalies(retention_days)`: Archives resolved anomalies and deletes old unresolved ones (default: 30)
- `cleanup_old_alerts(retention_days)`: Deletes old resolved alerts (default: 90)
- `cleanup_old_incidents(retention_days)`: Deletes old resolved incidents (default: 90)
- `run_data_retention_cleanup()`: Master function that runs all cleanup operations
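Each cleanup function deletes or archives rows older than a cutoff derived from `retention_days`; the cutoff computation amounts to the following sketch (illustrative Python, not the SQL functions themselves):

```python
from datetime import datetime, timedelta, timezone

def retention_cutoff(retention_days: int) -> datetime:
    """Rows recorded before this timestamp are eligible for deletion or archival."""
    return datetime.now(timezone.utc) - timedelta(days=retention_days)

metrics_cutoff = retention_cutoff(30)  # default for metric_time_series
alerts_cutoff = retention_cutoff(90)   # default for resolved alerts/incidents
```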
### Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

- Full cleanup: daily at 3:00 AM
- Metrics cleanup: daily at 3:30 AM
- Anomaly cleanup: daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.
## Monitoring

View collected metrics in the Admin Monitoring Dashboard:

- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health
## Troubleshooting

### No metrics being collected

- Check Celery workers are running: `celery -A config inspect active`
- Check Celery Beat is running: `celery -A config inspect scheduled`
- Verify environment variables are set
- Check logs for errors: `tail -f logs/celery.log`
### Edge function not collecting metrics

- Verify the cron job is scheduled in Supabase
- Check the edge function logs in the Supabase dashboard
- Verify the service role key is correct
- Test the edge function manually
## Production Considerations

- **Resource Usage**: Collecting metrics every minute generates significant database writes. Consider reducing the collection frequency in production.
- **Data Retention**: Set up periodic cleanup of old metrics (older than 30 days) to manage database size.
- **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.
- **Scaling**: As traffic grows, consider moving to a time-series database such as TimescaleDB or InfluxDB.
- **Monitoring the Monitors**: Set up external health checks to ensure metric collection itself keeps working.