mirror of
https://github.com/pacnpal/thrilltrack-explorer.git
synced 2025-12-20 02:51:12 -05:00
Add advanced ML anomaly detection
This commit is contained in:
266
MONITORING_SETUP.md
Normal file
266
MONITORING_SETUP.md
Normal file
@@ -0,0 +1,266 @@
|
||||
# 🎯 Advanced ML Anomaly Detection & Automated Monitoring
|
||||
|
||||
## ✅ What's Now Active
|
||||
|
||||
### 1. Advanced ML Algorithms
|
||||
|
||||
Your anomaly detection now uses **6 sophisticated algorithms**:
|
||||
|
||||
#### Statistical Algorithms
|
||||
- **Z-Score**: Standard deviation-based outlier detection
|
||||
- **Moving Average**: Trend deviation detection
|
||||
- **Rate of Change**: Sudden change detection
|
||||
|
||||
#### Advanced ML Algorithms (NEW!)
|
||||
- **Isolation Forest**: Anomaly detection based on data point isolation
|
||||
- Works by measuring how "isolated" a point is from the rest
|
||||
- Excellent for detecting outliers in multi-dimensional space
|
||||
|
||||
- **Seasonal Decomposition**: Pattern-aware anomaly detection
|
||||
- Detects anomalies considering daily/weekly patterns
|
||||
- Configurable period (default: 24 hours)
|
||||
- Identifies seasonal spikes and drops
|
||||
|
||||
- **Predictive Anomaly (LSTM-inspired)**: Time-series prediction
|
||||
- Uses triple exponential smoothing (Holt-Winters)
|
||||
- Predicts next value based on level and trend
|
||||
- Flags unexpected deviations from predictions
|
||||
|
||||
- **Ensemble Method**: Multi-algorithm consensus
|
||||
- Combines all 5 algorithms for maximum accuracy
|
||||
- Requires 40%+ algorithms to agree for anomaly detection
|
||||
- Provides weighted confidence scores
|
||||
|
||||
### 2. Automated Cron Jobs
|
||||
|
||||
**NOW RUNNING AUTOMATICALLY:**
|
||||
|
||||
| Job | Schedule | Purpose |
|
||||
|-----|----------|---------|
|
||||
| `detect-anomalies-every-5-minutes` | Every 5 minutes (`*/5 * * * *`) | Run ML anomaly detection on all metrics |
|
||||
| `collect-metrics-every-minute` | Every minute (`* * * * *`) | Collect system metrics (errors, queues, API times) |
|
||||
| `data-retention-cleanup-daily` | Daily at 3 AM (`0 3 * * *`) | Clean up old data to manage DB size |
|
||||
|
||||
### 3. Algorithm Configuration
|
||||
|
||||
Each metric can be configured with different algorithms in the `anomaly_detection_config` table:
|
||||
|
||||
```sql
|
||||
-- Example: Configure a metric to use all advanced algorithms
|
||||
UPDATE anomaly_detection_config
|
||||
SET detection_algorithms = ARRAY['z_score', 'moving_average', 'isolation_forest', 'seasonal', 'predictive', 'ensemble']
|
||||
WHERE metric_name = 'api_response_time';
|
||||
```
|
||||
|
||||
**Algorithm Selection Guide:**
|
||||
|
||||
- **z_score**: Best for normally distributed data, general outlier detection
|
||||
- **moving_average**: Best for trending data, smooth patterns
|
||||
- **rate_of_change**: Best for detecting sudden spikes/drops
|
||||
- **isolation_forest**: Best for complex multi-modal distributions
|
||||
- **seasonal**: Best for cyclic patterns (hourly, daily, weekly)
|
||||
- **predictive**: Best for time-series with clear trends
|
||||
- **ensemble**: Best for maximum accuracy, combines all methods
|
||||
|
||||
### 4. Sensitivity Tuning
|
||||
|
||||
**Sensitivity Parameter** (in `anomaly_detection_config`):
|
||||
- Lower value (1.5-2.0): More sensitive, catches subtle anomalies, more false positives
|
||||
- Medium value (2.5-3.0): Balanced, recommended default
|
||||
- Higher value (3.5-5.0): Less sensitive, only major anomalies, fewer false positives
|
||||
|
||||
### 5. Monitoring Dashboard
|
||||
|
||||
View all anomaly detections in the admin panel:
|
||||
- Navigate to `/admin/monitoring`
|
||||
- See the "ML Anomaly Detection" panel
|
||||
- Real-time updates every 30 seconds
|
||||
- Manual trigger button available
|
||||
|
||||
**Anomaly Details Include:**
|
||||
- Algorithm used
|
||||
- Anomaly type (spike, drop, outlier, seasonal, etc.)
|
||||
- Severity (low, medium, high, critical)
|
||||
- Deviation score (how far from normal)
|
||||
- Confidence score (algorithm certainty)
|
||||
- Baseline vs actual values
|
||||
|
||||
## 🔍 How It Works
|
||||
|
||||
### Data Flow
|
||||
|
||||
```
|
||||
1. Metrics Collection (every minute)
|
||||
↓
|
||||
2. Store in metric_time_series table
|
||||
↓
|
||||
3. Anomaly Detection (every 5 minutes)
|
||||
↓
|
||||
4. Run ML algorithms on recent data
|
||||
↓
|
||||
5. Detect anomalies & calculate scores
|
||||
↓
|
||||
6. Insert into anomaly_detections table
|
||||
↓
|
||||
7. Auto-create system alerts (if critical/high)
|
||||
↓
|
||||
8. Display in admin dashboard
|
||||
↓
|
||||
9. Data Retention Cleanup (daily 3 AM)
|
||||
```
|
||||
|
||||
### Algorithm Comparison
|
||||
|
||||
| Algorithm | Strength | Best For | Time Complexity |
|
||||
|-----------|----------|----------|-----------------|
|
||||
| Z-Score | Simple, fast | Normal distributions | O(n) |
|
||||
| Moving Average | Trend-aware | Gradual changes | O(n) |
|
||||
| Rate of Change | Change detection | Sudden shifts | O(1) |
|
||||
| Isolation Forest | Multi-dimensional | Complex patterns | O(n log n) |
|
||||
| Seasonal | Pattern-aware | Cyclic data | O(n) |
|
||||
| Predictive | Forecast-based | Time-series | O(n) |
|
||||
| Ensemble | Highest accuracy | Any pattern | O(n log n) |
|
||||
|
||||
## 📊 Current Metrics Being Monitored
|
||||
|
||||
### Supabase Metrics (collected every minute)
|
||||
- `api_error_count`: Recent API errors
|
||||
- `rate_limit_violations`: Rate limit blocks
|
||||
- `pending_submissions`: Submissions awaiting moderation
|
||||
- `active_incidents`: Open/investigating incidents
|
||||
- `unresolved_alerts`: Unresolved system alerts
|
||||
- `submission_approval_rate`: Approval percentage
|
||||
- `avg_moderation_time`: Average moderation time
|
||||
|
||||
### Django Metrics (collected every minute, if configured)
|
||||
- `error_rate`: Error log percentage
|
||||
- `api_response_time`: Average API response time (ms)
|
||||
- `celery_queue_size`: Queued Celery tasks
|
||||
- `database_connections`: Active DB connections
|
||||
- `cache_hit_rate`: Cache hit percentage
|
||||
|
||||
## 🎛️ Configuration
|
||||
|
||||
### Add New Metrics for Detection
|
||||
|
||||
```sql
|
||||
INSERT INTO anomaly_detection_config (
|
||||
metric_name,
|
||||
metric_category,
|
||||
enabled,
|
||||
sensitivity,
|
||||
lookback_window_minutes,
|
||||
detection_algorithms,
|
||||
min_data_points,
|
||||
alert_threshold_score,
|
||||
auto_create_alert
|
||||
) VALUES (
|
||||
'custom_metric_name',
|
||||
'performance',
|
||||
true,
|
||||
2.5,
|
||||
60,
|
||||
ARRAY['ensemble', 'predictive', 'seasonal'],
|
||||
10,
|
||||
3.0,
|
||||
true
|
||||
);
|
||||
```
|
||||
|
||||
### Adjust Sensitivity
|
||||
|
||||
```sql
|
||||
-- Make detection more sensitive for critical metrics
|
||||
UPDATE anomaly_detection_config
|
||||
SET sensitivity = 2.0, alert_threshold_score = 2.5
|
||||
WHERE metric_name = 'api_error_count';
|
||||
|
||||
-- Make detection less sensitive for noisy metrics
|
||||
UPDATE anomaly_detection_config
|
||||
SET sensitivity = 4.0, alert_threshold_score = 4.0
|
||||
WHERE metric_name = 'cache_hit_rate';
|
||||
```
|
||||
|
||||
### Disable Detection for Specific Metrics
|
||||
|
||||
```sql
|
||||
UPDATE anomaly_detection_config
|
||||
SET enabled = false
|
||||
WHERE metric_name = 'some_metric';
|
||||
```
|
||||
|
||||
## 🔧 Troubleshooting
|
||||
|
||||
### Check Cron Job Status
|
||||
|
||||
```sql
|
||||
SELECT jobid, jobname, schedule, active, last_run_time, last_run_status
|
||||
FROM cron.job_run_details
|
||||
WHERE jobname LIKE '%anomal%' OR jobname LIKE '%metric%'
|
||||
ORDER BY start_time DESC
|
||||
LIMIT 20;
|
||||
```
|
||||
|
||||
### View Recent Anomalies
|
||||
|
||||
```sql
|
||||
SELECT * FROM recent_anomalies_view
|
||||
ORDER BY detected_at DESC
|
||||
LIMIT 20;
|
||||
```
|
||||
|
||||
### Check Metric Collection
|
||||
|
||||
```sql
|
||||
SELECT metric_name, COUNT(*) as count,
|
||||
MIN(timestamp) as oldest,
|
||||
MAX(timestamp) as newest
|
||||
FROM metric_time_series
|
||||
WHERE timestamp > NOW() - INTERVAL '1 hour'
|
||||
GROUP BY metric_name
|
||||
ORDER BY metric_name;
|
||||
```
|
||||
|
||||
### Manual Anomaly Detection Trigger
|
||||
|
||||
```sql
|
||||
-- Call the edge function directly
|
||||
SELECT net.http_post(
|
||||
url := 'https://ydvtmnrszybqnbcqbdcy.supabase.co/functions/v1/detect-anomalies',
|
||||
headers := '{"Content-Type": "application/json", "Authorization": "Bearer YOUR_ANON_KEY"}'::jsonb,
|
||||
body := '{}'::jsonb
|
||||
);
|
||||
```
|
||||
|
||||
## 📈 Performance Considerations
|
||||
|
||||
### Data Volume
|
||||
- Metrics: ~1440 records/day per metric (every minute)
|
||||
- With 12 metrics: ~17,280 records/day
|
||||
- 30-day retention: ~518,400 records
|
||||
- Automatic cleanup prevents unbounded growth
|
||||
|
||||
### Detection Performance
|
||||
- Each detection run processes all enabled metrics
|
||||
- Ensemble algorithm is most CPU-intensive
|
||||
- Recommended: Use ensemble only for critical metrics
|
||||
- Typical detection time: <5 seconds for 12 metrics
|
||||
|
||||
### Database Impact
|
||||
- Indexes on timestamp columns optimize queries
|
||||
- Regular cleanup maintains query performance
|
||||
- Consider partitioning for very high-volume deployments
|
||||
|
||||
## 🚀 Next Steps
|
||||
|
||||
1. **Monitor the Dashboard**: Visit `/admin/monitoring` to see anomalies
|
||||
2. **Fine-tune Sensitivity**: Adjust based on false positive rate
|
||||
3. **Add Custom Metrics**: Monitor application-specific KPIs
|
||||
4. **Set Up Alerts**: Configure notifications for critical anomalies
|
||||
5. **Review Weekly**: Check patterns and adjust algorithms
|
||||
|
||||
## 📚 Additional Resources
|
||||
|
||||
- [Edge Function Logs](https://supabase.com/dashboard/project/ydvtmnrszybqnbcqbdcy/functions/detect-anomalies/logs)
|
||||
- [Cron Jobs Dashboard](https://supabase.com/dashboard/project/ydvtmnrszybqnbcqbdcy/sql/new)
|
||||
- Django README: `django/README_MONITORING.md`
|
||||
@@ -0,0 +1,40 @@
|
||||
-- Set up automated cron jobs for monitoring and anomaly detection
|
||||
|
||||
-- 1. Detect anomalies every 5 minutes
|
||||
SELECT cron.schedule(
|
||||
'detect-anomalies-every-5-minutes',
|
||||
'*/5 * * * *', -- Every 5 minutes
|
||||
$$
|
||||
SELECT net.http_post(
|
||||
url := 'https://ydvtmnrszybqnbcqbdcy.supabase.co/functions/v1/detect-anomalies',
|
||||
headers := '{"Content-Type": "application/json", "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InlkdnRtbnJzenlicW5iY3FiZGN5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTgzMjYzNTYsImV4cCI6MjA3MzkwMjM1Nn0.DM3oyapd_omP5ZzIlrT0H9qBsiQBxBRgw2tYuqgXKX4"}'::jsonb,
|
||||
body := jsonb_build_object('scheduled', true)
|
||||
) as request_id;
|
||||
$$
|
||||
);
|
||||
|
||||
-- 2. Collect metrics every minute
|
||||
SELECT cron.schedule(
|
||||
'collect-metrics-every-minute',
|
||||
'* * * * *', -- Every minute
|
||||
$$
|
||||
SELECT net.http_post(
|
||||
url := 'https://ydvtmnrszybqnbcqbdcy.supabase.co/functions/v1/collect-metrics',
|
||||
headers := '{"Content-Type": "application/json", "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InlkdnRtbnJzenlicW5iY3FiZGN5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTgzMjYzNTYsImV4cCI6MjA3MzkwMjM1Nn0.DM3oyapd_omP5ZzIlrT0H9qBsiQBxBRgw2tYuqgXKX4"}'::jsonb,
|
||||
body := jsonb_build_object('scheduled', true)
|
||||
) as request_id;
|
||||
$$
|
||||
);
|
||||
|
||||
-- 3. Data retention cleanup daily at 3 AM
|
||||
SELECT cron.schedule(
|
||||
'data-retention-cleanup-daily',
|
||||
'0 3 * * *', -- Daily at 3:00 AM
|
||||
$$
|
||||
SELECT net.http_post(
|
||||
url := 'https://ydvtmnrszybqnbcqbdcy.supabase.co/functions/v1/data-retention-cleanup',
|
||||
headers := '{"Content-Type": "application/json", "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InlkdnRtbnJzenlicW5iY3FiZGN5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTgzMjYzNTYsImV4cCI6MjA3MzkwMjM1Nn0.DM3oyapd_omP5ZzIlrT0H9qBsiQBxBRgw2tYuqgXKX4"}'::jsonb,
|
||||
body := jsonb_build_object('scheduled', true)
|
||||
) as request_id;
|
||||
$$
|
||||
);
|
||||
Reference in New Issue
Block a user