Mirror of https://github.com/pacnpal/markov-discord.git, synced 2025-12-20 11:01:04 -05:00.
- Added `optimization-plan.md` detailing strategies to reduce response latency and improve training throughput.
- Enhanced performance analysis in `performance-analysis.md` with identified bottlenecks and completed optimizations.
- Created `productContext.md` summarizing project goals, user scenarios, and implementation priorities.
- Developed `markov-store.ts` for high-performance serialized chain storage with alias-method sampling.
- Implemented database performance indexes in `1704067200000-AddPerformanceIndexes.ts`.
- Introduced `markov-worker.ts` for handling CPU-intensive operations in separate threads.
- Established a worker pool in `worker-pool.ts` to manage multiple worker threads efficiently.
4.4 KiB
[MEMORY BANK: ACTIVE] Optimization Plan - Further Performance Work
Date: 2025-09-25
Purpose: Reduce response latency and improve training throughput beyond existing optimizations.
Context: builds on `memory-bank/performance-analysis.md` and on changes already implemented in `src/train.ts` and `src/index.ts`.
Goals:
- Target: end-to-end response generation < 500ms for typical queries.
- Training throughput: process 1M messages/hour on dev hardware.
- Memory: keep max heap < 2GB during training on 16GB host.
Measurement & Profiling (first actions)
- Capture baseline metrics:
  - Run workload A (100k messages) and record CPU, memory, and latency histograms.
  - Tools: Clinic.js (flame graphs), `node --prof`, and pprof.
- Add short-term tracing: export traces for the top code paths in `src/index.ts` and `src/train.ts`.
- Create benchmark scripts: `bench/trace.sh` and `bench/load_test.ts` (synthetic).
High Priority (implement immediately)
- Persist precomputed Markov chains per channel/guild:
  - Add a serialized chain store: `src/markov-store.ts` (new).
  - On training, update the chain incrementally instead of rebuilding it.
  - Benefit: chain lookup during response generation becomes O(1).
- Use optimized sampling structures (alias method):
  - Replace repeated weighted selection with alias tables built per prefix.
  - File changes: `src/index.ts`, `src/markov-store.ts`.
- Offload CPU-bound work to worker threads:
  - Move chain building and heavy sampling into Node `worker_threads`.
  - Add a worker pool (4 threads by default) with backpressure.
  - Files: `src/train.ts`, `src/workers/markov-worker.ts`.
- Use an in-memory LRU cache for active chains:
  - Keep hot channels' chains in RAM; evict the least recently used.
  - Implement a TTL and a memory cap.
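The two central High Priority items above, incremental chain updates and alias-method sampling, could be combined in `src/markov-store.ts` roughly as sketched below. The class and method names are illustrative, not the actual implementation; the alias-table construction follows Vose's method (O(n) build, O(1) sample).

```typescript
// Sketch: incremental bigram counts per prefix, plus Vose alias tables
// for O(1) weighted sampling of the next word. Illustrative only.

type AliasTable = { words: string[]; prob: number[]; alias: number[] };

class MarkovStore {
  private counts = new Map<string, Map<string, number>>();
  private tables = new Map<string, AliasTable>(); // rebuilt lazily

  /** Incremental training: bump one prefix -> suffix count. */
  addTransition(prefix: string, suffix: string): void {
    let row = this.counts.get(prefix);
    if (!row) this.counts.set(prefix, (row = new Map()));
    row.set(suffix, (row.get(suffix) ?? 0) + 1);
    this.tables.delete(prefix); // stale alias table; rebuild on next sample
  }

  /** Vose's alias method: O(n) build per prefix. */
  private buildTable(prefix: string): AliasTable | undefined {
    const row = this.counts.get(prefix);
    if (!row || row.size === 0) return undefined;
    const words = [...row.keys()];
    const n = words.length;
    const total = [...row.values()].reduce((a, b) => a + b, 0);
    const scaled = words.map((w) => (row.get(w)! * n) / total);
    const prob = new Array<number>(n).fill(0);
    const alias = new Array<number>(n).fill(0);
    const small: number[] = [];
    const large: number[] = [];
    scaled.forEach((p, i) => (p < 1 ? small : large).push(i));
    while (small.length && large.length) {
      const s = small.pop()!;
      const l = large.pop()!;
      prob[s] = scaled[s];
      alias[s] = l;
      scaled[l] = scaled[l] + scaled[s] - 1; // donate probability mass
      (scaled[l] < 1 ? small : large).push(l);
    }
    for (const i of [...small, ...large]) prob[i] = 1;
    const table = { words, prob, alias };
    this.tables.set(prefix, table);
    return table;
  }

  /** O(1) weighted sample of the next word for a prefix. */
  sample(prefix: string): string | undefined {
    const t = this.tables.get(prefix) ?? this.buildTable(prefix);
    if (!t) return undefined;
    const i = Math.floor(Math.random() * t.words.length);
    return Math.random() < t.prob[i] ? t.words[i] : t.words[t.alias[i]];
  }
}
```

Because `addTransition` only invalidates the one prefix it touches, training stays incremental while hot prefixes keep their prebuilt tables.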
Medium Priority
- Optimize SQLite for runtime:
  - Enable WAL mode (`PRAGMA journal_mode = WAL`) and set `PRAGMA synchronous = NORMAL`.
  - Use prepared statements and transactions for bulk writes.
  - Temporarily disable non-essential indexes during major bulk imports.
  - File: `src/migration/1704067200000-AddPerformanceIndexes.ts`.
- Move heavy random-access data into a K/V store:
  - Consider LevelDB/LMDB or RocksDB for prefix -> suffix lists for faster reads.
- Incremental training API:
  - Add an HTTP or IPC endpoint to submit new messages and update the chain incrementally.
Low Priority / Long term
- Reimplement core hot loops in Rust via Neon or FFI for maximum throughput.
- Shard storage by guild and run independent workers per shard.
- Replace SQLite with a server DB (Postgres) only if concurrency demands it.
Implementation steps (concrete)
- Add profiling scripts + run baseline (1-2 days).
- Implement `src/markov-store.ts` with serialization and an alias-table builder (1-2 days).
- Wire up the worker pool and move chain building into workers (1-2 days).
- Add LRU cache around store and integrate with response path (0.5-1 day).
- Apply SQLite runtime tuning and test bulk import patterns (0.5 day).
- Add metrics & dashboards (Prometheus + Grafana or simple histograms) (1 day).
- Run load tests and iterate on bottlenecks (1-3 days).
Benchmarks to run
- Baseline: 100k messages, measure 95th percentile response latency.
- After chain-store: expect >5x faster generation.
- After workers + alias: expect ~10x faster generation in CPU-heavy scenarios.
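The 95th-percentile measurement used in these benchmarks could live in `bench/load_test.ts` along these lines (function names are hypothetical; `generate` stands in for the real response path):

```typescript
// Sketch: time an async operation N times and report the p95 latency.

export function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

export async function benchP95(
  generate: () => Promise<void>,
  iterations = 1000
): Promise<number> {
  const latencies: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await generate();
    latencies.push(Number(process.hrtime.bigint() - start) / 1e6); // ns -> ms
  }
  return percentile(latencies, 95);
}
```

`process.hrtime.bigint()` is monotonic, so the measurement is not affected by wall-clock adjustments during a long run.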
Rollout & Validation
- Feature-flag the new chain store and worker pool behind config toggles in `config/config.json`.
- Canary rollout to a single guild for 24h with load-test traffic.
- Compare metrics and only enable globally after verifying thresholds.
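The config toggles mentioned above could take a shape like this in `config/config.json` (key names are hypothetical; the actual config schema may differ):

```json
{
  "experiments": {
    "chainStore": false,
    "workerPool": false,
    "workerPoolSize": 4
  }
}
```

Defaulting every experiment to `false` means a bad deploy degrades to the current behavior rather than the untested path.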
Observability & Metrics
- Instrument: response latency histogram, chain-build time, cache hit ratio, DB query durations.
- Log queries slower than 50ms with context.
- Add alerts for cache thrashing and worker queue saturation.
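A minimal in-process version of the instrumentation above, a fixed-bucket latency histogram plus cache hit/miss counters, might look like this (bucket bounds and class names are illustrative; a real deployment would likely use a Prometheus client instead):

```typescript
// Sketch: latency histogram with a slow-operation log line, plus a
// cache hit-ratio counter. Not a real metrics library.

export class LatencyHistogram {
  private readonly bounds = [5, 10, 25, 50, 100, 250, 500, 1000]; // ms
  private readonly buckets = new Array<number>(this.bounds.length + 1).fill(0);

  record(ms: number): void {
    let i = this.bounds.findIndex((b) => ms <= b);
    if (i === -1) i = this.bounds.length; // overflow bucket
    this.buckets[i]++;
    if (ms > 50) console.warn(`slow operation: ${ms.toFixed(1)}ms`);
  }

  snapshot(): Record<string, number> {
    const out: Record<string, number> = {};
    this.bounds.forEach((b, i) => (out[`le_${b}`] = this.buckets[i]));
    out["le_inf"] = this.buckets[this.bounds.length];
    return out;
  }
}

export class CacheStats {
  hits = 0;
  misses = 0;
  get hitRatio(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```

A sustained drop in `hitRatio` on a hot channel is the cache-thrashing signal the alerts should key on.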
Risks & Mitigations
- Serialization format changes: include versioning and migration utilities.
- Worker crashes: add supervisor and restart/backoff.
- Memory blowup from caching: enforce strict memory caps and stats.
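The serialization-versioning mitigation above can be sketched as a version field checked on load, with a migration hook per old version (the format, field names, and v1 shape here are all hypothetical):

```typescript
// Sketch: version-tagged chain snapshots so old on-disk formats can be
// migrated instead of rejected when the format changes.

const FORMAT_VERSION = 2;

interface ChainSnapshotV2 {
  version: number;
  chains: Record<string, Record<string, number>>; // prefix -> suffix -> count
}

export function serialize(chains: ChainSnapshotV2["chains"]): string {
  return JSON.stringify({ version: FORMAT_VERSION, chains });
}

export function deserialize(raw: string): ChainSnapshotV2 {
  const parsed = JSON.parse(raw);
  switch (parsed.version) {
    case FORMAT_VERSION:
      return parsed as ChainSnapshotV2;
    case 1:
      // Hypothetical v1 -> v2 migration hook; real migrations would
      // transform whatever the old layout actually was.
      return { version: FORMAT_VERSION, chains: parsed.chains ?? {} };
    default:
      throw new Error(`unknown snapshot version: ${parsed.version}`);
  }
}
```

Failing loudly on an unknown version is deliberate: silently reinterpreting a future format is how chains get corrupted.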
Next actions for Code mode
- Create `src/markov-store.ts` and `src/workers/markov-worker.ts`, add bench scripts, and update `config/config.json` toggles.
- I will implement the highest-priority changes in Code mode when you approve.
End.