thrilltrack-explorer/RATE_LIMIT_MONITORING_SETUP.md
gpt-engineer-app[bot] 28fa2fd0d4 Monitor rate limits progress
Implement monitor-rate-limits edge function to compare metrics against alert configurations, trigger notifications, and record alerts; update config and groundwork for admin UI integration.
2025-11-11 00:19:13 +00:00

# Rate Limit Monitoring Setup
This document explains how to set up automated rate limit monitoring with alerts.
## Overview
The rate limit monitoring system consists of:
1. **Metrics Collection** - Tracks all rate limit checks in-memory
2. **Alert Configuration** - Database table with configurable thresholds
3. **Monitor Function** - Edge function that checks metrics and triggers alerts
4. **Cron Job** - Scheduled job that runs the monitor function periodically
## Setup Instructions
### Step 1: Enable Required Extensions
Run this SQL in your Supabase SQL Editor:
```sql
-- Enable pg_cron for scheduling
CREATE EXTENSION IF NOT EXISTS pg_cron;
-- Enable pg_net for HTTP requests
CREATE EXTENSION IF NOT EXISTS pg_net;
```
### Step 2: Create the Cron Job
Run this SQL to schedule the monitor to run every 5 minutes:
```sql
SELECT cron.schedule(
'monitor-rate-limits',
'*/5 * * * *', -- Every 5 minutes
$$
SELECT
net.http_post(
url:='https://api.thrillwiki.com/functions/v1/monitor-rate-limits',
headers:='{"Content-Type": "application/json", "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6InlkdnRtbnJzenlicW5iY3FiZGN5Iiwicm9sZSI6ImFub24iLCJpYXQiOjE3NTgzMjYzNTYsImV4cCI6MjA3MzkwMjM1Nn0.DM3oyapd_omP5ZzIlrT0H9qBsiQBxBRgw2tYuqgXKX4"}'::jsonb,
body:='{}'::jsonb
) as request_id;
$$
);
```
### Step 3: Verify the Cron Job
Check that the cron job was created:
```sql
SELECT * FROM cron.job WHERE jobname = 'monitor-rate-limits';
```
### Step 4: Configure Alert Thresholds
Visit the admin dashboard at `/admin/rate-limit-metrics` and navigate to the "Configuration" tab to:
- Enable/disable specific alerts
- Adjust threshold values
- Modify time windows
Default configurations are automatically created:
- **Block Rate Alert**: Triggers when >50% of requests are blocked in 5 minutes
- **Total Requests Alert**: Triggers when the request rate exceeds 1,000 requests per minute
- **Unique IPs Alert**: Triggers when >100 unique IPs in 5 minutes (disabled by default)
## How It Works
### 1. Metrics Collection
Every rate limit check (both allowed and blocked) is recorded with:
- Timestamp
- Function name
- Client IP
- User ID (if authenticated)
- Result (allowed/blocked)
- Remaining quota
- Rate limit tier
Metrics are stored in-memory for the last 10,000 checks.
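As a rough illustration of the in-memory store described above, the following is a minimal sketch, not the project's actual implementation. The names `RateLimitMetric`, `MetricsStore`, and all field names are assumptions chosen to mirror the list of recorded fields. Only the 10,000-entry cap comes from this document.

```typescript
// Hypothetical shape of one recorded rate limit check (field names are assumptions).
interface RateLimitMetric {
  timestamp: number;     // epoch milliseconds
  functionName: string;
  clientIp: string;
  userId?: string;       // present only for authenticated requests
  allowed: boolean;      // true = allowed, false = blocked
  remaining: number;     // remaining quota after this check
  tier: string;          // rate limit tier
}

// Cap taken from this document: only the last 10,000 checks are kept.
const MAX_METRICS = 10000;

class MetricsStore {
  private entries: RateLimitMetric[] = [];

  // Append a check and drop the oldest entries once the cap is exceeded.
  record(metric: RateLimitMetric): void {
    this.entries.push(metric);
    if (this.entries.length > MAX_METRICS) {
      this.entries.splice(0, this.entries.length - MAX_METRICS);
    }
  }

  // Return the checks recorded at or after (now - windowMs).
  since(windowMs: number, now: number = Date.now()): RateLimitMetric[] {
    const cutoff = now - windowMs;
    return this.entries.filter((m) => m.timestamp >= cutoff);
  }

  get size(): number {
    return this.entries.length;
  }
}
```

Because the store lives in process memory, a sketch like this loses its contents whenever the process restarts.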
### 2. Monitoring Process
Every 5 minutes, the monitor function:
1. Fetches enabled alert configurations from the database
2. Analyzes current metrics for each configuration's time window
3. Compares metrics against configured thresholds
4. For each exceeded threshold:
- Skips the alert if a recent unresolved alert already exists for that configuration (prevents spam)
- Otherwise, records the alert in the `rate_limit_alerts` table
- Sends a notification to moderators via Novu
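The threshold comparison in steps 2 to 4 can be sketched roughly as follows. This is an assumption-laden illustration, not the deployed function: the `AlertConfig` and `Check` shapes, the metric type names, and the choice to express block rate as a fraction (0.5 for 50%) are all hypothetical.

```typescript
// Hypothetical metric types matching the three default alerts.
type MetricType = "block_rate" | "total_requests" | "unique_ips";

// Hypothetical configuration row (column names are assumptions).
interface AlertConfig {
  id: string;
  metricType: MetricType;
  threshold: number; // block_rate as a fraction (0.5 = 50%), others as counts
  windowMs: number;  // time window to analyze, in milliseconds
  enabled: boolean;
}

// Minimal view of a recorded rate limit check.
interface Check {
  timestamp: number; // epoch milliseconds
  clientIp: string;
  allowed: boolean;
}

// Compute the value of one metric over a set of checks.
function measure(type: MetricType, checks: Check[]): number {
  switch (type) {
    case "block_rate": {
      if (checks.length === 0) return 0;
      const blocked = checks.filter((c) => !c.allowed).length;
      return blocked / checks.length;
    }
    case "total_requests":
      return checks.length;
    case "unique_ips":
      return new Set(checks.map((c) => c.clientIp)).size;
  }
}

// Return every enabled configuration whose threshold is exceeded in its window.
function exceededConfigs(
  configs: AlertConfig[],
  checks: Check[],
  now: number,
): { config: AlertConfig; value: number }[] {
  const exceeded: { config: AlertConfig; value: number }[] = [];
  for (const config of configs) {
    if (!config.enabled) continue;
    const cutoff = now - config.windowMs;
    const inWindow = checks.filter((c) => c.timestamp >= cutoff);
    const value = measure(config.metricType, inWindow);
    if (value > config.threshold) exceeded.push({ config, value });
  }
  return exceeded;
}
```

Each configuration is evaluated against its own time window, so a 5-minute block rate alert and a 1-minute request count alert can share the same list of checks.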
### 3. Alert Deduplication
Alerts are deduplicated using a 15-minute window. If an alert for the same configuration was triggered in the last 15 minutes and hasn't been resolved, no new alert is sent.
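The deduplication rule can be expressed as a small predicate. The sketch below assumes a hypothetical `AlertRecord` shape with `configId`, `triggeredAt`, and `resolvedAt` fields. The real `rate_limit_alerts` columns may differ, and the actual function presumably queries the database rather than an in-memory array.

```typescript
// Hypothetical view of a row in rate_limit_alerts (field names are assumptions).
interface AlertRecord {
  configId: string;
  triggeredAt: number;       // epoch milliseconds
  resolvedAt: number | null; // null while the alert is unresolved
}

// 15-minute deduplication window, as described in this document.
const DEDUP_WINDOW_MS = 15 * 60 * 1000;

// A new alert is sent only if no unresolved alert for the same
// configuration was triggered within the deduplication window.
function shouldSendAlert(configId: string, history: AlertRecord[], now: number): boolean {
  const cutoff = now - DEDUP_WINDOW_MS;
  const hasRecentUnresolved = history.some(
    (a) => a.configId === configId && a.resolvedAt === null && a.triggeredAt >= cutoff,
  );
  return !hasRecentUnresolved;
}
```

Under this rule, resolving an alert allows the same configuration to trigger again right away, even inside the 15-minute window.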
### 4. Notifications
Alerts are sent to all moderators via the "moderators" topic in Novu, including:
- Email notifications
- In-app notifications (if configured)
- Custom notification channels (if configured)
## Monitoring the Monitor
### Check Cron Job Status
```sql
-- View recent cron job runs
SELECT * FROM cron.job_run_details
WHERE jobid = (SELECT jobid FROM cron.job WHERE jobname = 'monitor-rate-limits')
ORDER BY start_time DESC
LIMIT 10;
```
### View Function Logs
Check the edge function logs in Supabase Dashboard:
`https://supabase.com/dashboard/project/ydvtmnrszybqnbcqbdcy/functions/monitor-rate-limits/logs`
### Test Manually
You can test the monitor function manually by calling it via HTTP:
```bash
curl -X POST https://api.thrillwiki.com/functions/v1/monitor-rate-limits \
  -H "Content-Type: application/json"
```
## Adjusting the Schedule
To change how often the monitor runs, update the cron schedule:
```sql
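-- cron.alter_job takes the numeric job ID, so look it up by job name
-- Update to run every 10 minutes instead
SELECT cron.alter_job((SELECT jobid FROM cron.job WHERE jobname = 'monitor-rate-limits'), schedule := '*/10 * * * *');
-- Update to run every hour
SELECT cron.alter_job((SELECT jobid FROM cron.job WHERE jobname = 'monitor-rate-limits'), schedule := '0 * * * *');
-- Update to run every minute (not recommended - may generate too many alerts)
SELECT cron.alter_job((SELECT jobid FROM cron.job WHERE jobname = 'monitor-rate-limits'), schedule := '* * * * *');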
```
## Removing the Cron Job
To stop monitoring, remove the scheduled job:
```sql
SELECT cron.unschedule('monitor-rate-limits');
```
## Troubleshooting
### No Alerts Being Triggered
1. Check if any alert configurations are enabled:
```sql
SELECT * FROM rate_limit_alert_config WHERE enabled = true;
```
2. Check if metrics are being collected:
- Visit `/admin/rate-limit-metrics` and check the "Recent Activity" tab
- If there is no activity, the rate limiter might not be in use
3. Check monitor function logs for errors
### Too Many Alerts
- Increase threshold values in the configuration
- Increase time windows for less sensitive detection
- Disable specific alert types that are too noisy
### Monitor Not Running
1. Verify cron job exists and is active
2. Check `cron.job_run_details` for error messages
3. Verify edge function deployed successfully
4. Check network connectivity between cron scheduler and edge function
## Database Tables
### `rate_limit_alert_config`
Stores alert threshold configurations. Only admins can modify.
### `rate_limit_alerts`
Stores history of all triggered alerts. Moderators can view and resolve.
## Security
- Alert configurations can only be modified by admin/superuser roles
- Alert history is only accessible to moderators and above
- The monitor function runs without JWT verification so that the cron job can invoke it
- All database operations respect Row Level Security policies
## Performance Considerations
- In-memory metrics store max 10,000 entries (auto-trimmed)
- The monitor only analyzes metrics within each configuration's time window, so entries older than the longest configured window do not affect alerting
- Monitor function typically runs in <500ms
- No significant database load (simple queries on small tables)
## Future Enhancements
Possible improvements:
- Function-specific alert thresholds
- Alert aggregation (daily/weekly summaries)
- Custom notification channels per alert type
- Machine learning-based anomaly detection
- Integration with external monitoring tools (Datadog, New Relic, etc.)