mirror of
https://github.com/pacnpal/thrilltrack-explorer.git
synced 2025-12-20 10:31:13 -05:00
# ThrillWiki Monitoring Setup
## Overview

This document describes the automatic metric collection system for anomaly detection and system monitoring.

## Architecture

The system collects metrics from two sources:

1. **Django Backend (Celery Tasks)**: Collects Django-specific metrics like error rates, response times, queue sizes
2. **Supabase Edge Function**: Collects Supabase-specific metrics like API errors, rate limits, submission queues

## Components

### Django Components

#### 1. Metrics Collector (`apps/monitoring/metrics_collector.py`)

- Collects system metrics from various sources
- Records metrics to the Supabase `metric_time_series` table
- Provides utilities for tracking:
  - Error rates
  - API response times
  - Celery queue sizes
  - Database connection counts
  - Cache hit rates

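Recording a metric ultimately means inserting a row into `metric_time_series`. A minimal sketch, assuming hypothetical column names (`metric_name`, `value`, `category`, `recorded_at`) and the `supabase-py` client; the actual collector may differ:

```python
from datetime import datetime, timezone

def build_metric_row(name: str, value: float, category: str) -> dict:
    """Build a row for metric_time_series (column names are assumptions)."""
    return {
        "metric_name": name,
        "value": value,
        "category": category,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# With a supabase-py client (not created here):
# supabase.table("metric_time_series").insert(
#     build_metric_row("cache_hit_rate", 97.5, "performance")
# ).execute()
```
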
#### 2. Celery Tasks (`apps/monitoring/tasks.py`)

Periodic background tasks:

- `collect_system_metrics`: Collects all metrics every minute
- `collect_error_metrics`: Tracks error rates
- `collect_performance_metrics`: Tracks response times and cache performance
- `collect_queue_metrics`: Monitors Celery queue health

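The Beat entries behind these tasks can be sketched as a schedule dict. The task paths follow from `apps/monitoring/tasks.py`, but the intervals and queue routing shown here are assumptions; the real values live in `config/celery_beat_schedule.py`:

```python
from datetime import timedelta

# Hypothetical sketch of the beat schedule; actual entries live in
# config/celery_beat_schedule.py.
CELERY_BEAT_SCHEDULE = {
    "collect-system-metrics": {
        "task": "apps.monitoring.tasks.collect_system_metrics",
        "schedule": timedelta(minutes=1),  # "collects all metrics every minute"
        "options": {"queue": "monitoring"},
    },
    "collect-queue-metrics": {
        "task": "apps.monitoring.tasks.collect_queue_metrics",
        "schedule": timedelta(minutes=5),  # assumed interval
        "options": {"queue": "monitoring"},
    },
}
```
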
#### 3. Metrics Middleware (`apps/monitoring/middleware.py`)

- Tracks API response times for every request
- Records errors and exceptions
- Updates cache with performance data

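A Django middleware of this kind is just a callable that wraps `get_response`. A minimal timing sketch, where the `record_response_time` hook is hypothetical (the real `MetricsMiddleware` also records errors and writes to the cache):

```python
import time

class MetricsMiddleware:
    """Sketch of a request-timing middleware; illustrative only."""

    def __init__(self, get_response, record_response_time=None):
        self.get_response = get_response
        # Hypothetical hook; the real middleware updates the cache/Supabase.
        self.record = record_response_time or (lambda path, ms: None)

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed_ms = (time.monotonic() - start) * 1000
        self.record(getattr(request, "path", "?"), elapsed_ms)
        return response
```
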
### Supabase Components

#### Edge Function (`supabase/functions/collect-metrics`)

Collects Supabase-specific metrics:

- API error counts
- Rate limit violations
- Pending submissions
- Active incidents
- Unresolved alerts
- Submission approval rates
- Average moderation times

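As an illustration, the approval-rate metric can be derived from submission statuses. This Python sketch mirrors what the edge function computes; the status values `approved`/`rejected` and the exclusion of pending submissions from the denominator are assumptions:

```python
def submission_approval_rate(statuses: list[str]) -> float:
    """Percentage of moderated submissions that were approved.

    Status values 'approved'/'rejected' are assumed; pending
    submissions are excluded from the denominator.
    """
    moderated = [s for s in statuses if s in ("approved", "rejected")]
    if not moderated:
        return 0.0
    approved = sum(1 for s in moderated if s == "approved")
    return 100.0 * approved / len(moderated)
```
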
## Setup Instructions

### 1. Django Setup

Add the monitoring app to your Django `INSTALLED_APPS`:

```python
INSTALLED_APPS = [
    # ... other apps
    'apps.monitoring',
]
```

Add the metrics middleware to `MIDDLEWARE`:

```python
MIDDLEWARE = [
    # ... other middleware
    'apps.monitoring.middleware.MetricsMiddleware',
]
```

Import the Celery Beat schedule in your Django settings:

```python
# The import alone binds CELERY_BEAT_SCHEDULE in settings.
from config.celery_beat_schedule import CELERY_BEAT_SCHEDULE
```

Configure environment variables:

```bash
SUPABASE_URL=https://api.thrillwiki.com
SUPABASE_SERVICE_ROLE_KEY=your_service_role_key
```

### 2. Start Celery Workers

Start a Celery worker to process tasks:

```bash
celery -A config worker -l info -Q monitoring,maintenance,analytics
```

Start Celery Beat for periodic task scheduling:

```bash
celery -A config beat -l info
```

### 3. Supabase Edge Function Setup

The `collect-metrics` edge function should be called periodically. Set up a cron job in Supabase:

```sql
SELECT cron.schedule(
  'collect-metrics-every-minute',
  '* * * * *', -- Every minute
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/collect-metrics',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

### 4. Anomaly Detection Setup

The `detect-anomalies` edge function should also run periodically:

```sql
SELECT cron.schedule(
  'detect-anomalies-every-5-minutes',
  '*/5 * * * *', -- Every 5 minutes
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/detect-anomalies',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

### 5. Data Retention Cleanup Setup

The `data-retention-cleanup` edge function should run daily:

```sql
SELECT cron.schedule(
  'data-retention-cleanup-daily',
  '0 3 * * *', -- Daily at 3:00 AM
  $$
  SELECT net.http_post(
    url:='https://api.thrillwiki.com/functions/v1/data-retention-cleanup',
    headers:='{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
    body:=concat('{"time": "', now(), '"}')::jsonb
  ) AS request_id;
  $$
);
```

## Metrics Collected

### Django Metrics

- `error_rate`: Percentage of error logs (performance)
- `api_response_time`: Average API response time in ms (performance)
- `celery_queue_size`: Number of queued Celery tasks (system)
- `database_connections`: Active database connections (system)
- `cache_hit_rate`: Cache hit percentage (performance)

### Supabase Metrics

- `api_error_count`: Recent API errors (performance)
- `rate_limit_violations`: Rate limit blocks (security)
- `pending_submissions`: Submissions awaiting moderation (workflow)
- `active_incidents`: Open/investigating incidents (monitoring)
- `unresolved_alerts`: Unresolved system alerts (monitoring)
- `submission_approval_rate`: Percentage of approved submissions (workflow)
- `avg_moderation_time`: Average time to moderate in minutes (workflow)

## Data Retention Policies

The system automatically cleans up old data to manage database size.

### Retention Periods

- **Metrics** (`metric_time_series`): 30 days
- **Anomaly Detections**: 30 days (resolved anomalies archived after 7 days)
- **Resolved Alerts**: 90 days
- **Resolved Incidents**: 90 days

### Cleanup Functions

The following database functions manage data retention:

1. **`cleanup_old_metrics(retention_days)`**: Deletes metrics older than the specified number of days (default: 30)
2. **`cleanup_old_anomalies(retention_days)`**: Archives resolved anomalies and deletes old unresolved ones (default: 30)
3. **`cleanup_old_alerts(retention_days)`**: Deletes old resolved alerts (default: 90)
4. **`cleanup_old_incidents(retention_days)`**: Deletes old resolved incidents (default: 90)
5. **`run_data_retention_cleanup()`**: Master function that runs all cleanup operations

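From Python these functions can be invoked through PostgREST RPC. A hedged sketch using `supabase-py` (client setup omitted; the `retention_days` payload key mirrors the function argument, but check the actual function signatures):

```python
def cleanup_calls(metric_days: int = 30, alert_days: int = 90) -> list[tuple[str, dict]]:
    """Return the (function, params) pairs for a full retention pass."""
    return [
        ("cleanup_old_metrics", {"retention_days": metric_days}),
        ("cleanup_old_anomalies", {"retention_days": metric_days}),
        ("cleanup_old_alerts", {"retention_days": alert_days}),
        ("cleanup_old_incidents", {"retention_days": alert_days}),
    ]

# With a supabase-py client (not created here), run each step:
# for fn, params in cleanup_calls():
#     supabase.rpc(fn, params).execute()
# or call the master function in one shot:
# supabase.rpc("run_data_retention_cleanup").execute()
```
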
### Automated Cleanup Schedule

Django Celery tasks run retention cleanup automatically:

- Full cleanup: daily at 3:00 AM
- Metrics cleanup: daily at 3:30 AM
- Anomaly cleanup: daily at 4:00 AM

View retention statistics in the Admin Dashboard's Data Retention panel.

## Monitoring

View collected metrics in the Admin Monitoring Dashboard:

- Navigate to `/admin/monitoring`
- View anomaly detections, alerts, and incidents
- Manually trigger metric collection or anomaly detection
- View real-time system health

## Troubleshooting

### No metrics being collected

1. Check that Celery workers are running:

   ```bash
   celery -A config inspect active
   ```

2. Check that Celery Beat is running:

   ```bash
   celery -A config inspect scheduled
   ```

3. Verify that environment variables are set

4. Check logs for errors:

   ```bash
   tail -f logs/celery.log
   ```

### Edge function not collecting metrics

1. Verify the cron job is scheduled in Supabase
2. Check the edge function logs in the Supabase dashboard
3. Verify the service role key is correct
4. Test the edge function manually

## Production Considerations

1. **Resource Usage**: Collecting metrics every minute generates significant database writes. Consider reducing the collection frequency in production.

2. **Data Retention**: Keep the periodic cleanup of old metrics (older than 30 days) enabled to manage database size.

3. **Alert Fatigue**: Fine-tune anomaly detection sensitivity to reduce false positives.

4. **Scaling**: As traffic grows, consider moving to a time-series database such as TimescaleDB or InfluxDB.

5. **Monitoring the Monitors**: Set up external health checks to ensure metric collection keeps working.