diff --git a/plan/RATE-LIMITER-PLAN.md b/plan/RATE-LIMITER-PLAN.md
index c6b975fe9..f00e55aaf 100644
--- a/plan/RATE-LIMITER-PLAN.md
+++ b/plan/RATE-LIMITER-PLAN.md
@@ -786,6 +786,231 @@ are env-var configurable at container start via `start.sh` defaults.
 
 ---
 
+## Rate Limiting — Architecture Diagrams
+
+### NAT Thundering Herd — Before the Fix
+
+During a live vote all councilmembers reload simultaneously. nginx sees one
+IP, exhausts its bucket, and returns 429 before Django is ever involved.
+Django's per-user counter (NAT-safe) is never consulted.
+
+```
+  Office / Chamber — behind one NAT IP
+  ┌──────────────────────────────────────────────────────┐
+  │  Councilmember A  browser reload ──┐                 │
+  │  Councilmember B  browser reload ──┤                 │
+  │  Councilmember C  browser reload ──┤  ~24 req/s      │
+  │  Staff tab 1      browser reload ──┤  same public IP │
+  │  Staff tab 2      browser reload ──┘                 │
+  └────────────────────────────┬─────────────────────────┘
+                               │  all requests look identical to nginx
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  nginx  sapl_general                │
+            │  rate=30r/m  burst=60  nodelay      │
+            │                                     │
+            │  token bucket: 0 tokens remaining   │
+            │  → 429 returned immediately         │
+            └──────────────────┬──────────────────┘
+                               │
+                    ╳  Django never reached
+                    ╳  rl:ip:{ip}:reqs never incremented
+                    ╳  rl:user:{uid}:reqs never consulted
+                               │
+                               ▼
+                  429 for all N users in the org
+                  recovery: nginx bucket refill (~3–10 min)
+                  NOT a Django 300s block — Redis never written
+```
+
+---
+
+### NAT Thundering Herd — After the Session Bypass Fix
+
+```
+  Office / Chamber — behind one NAT IP
+  ┌──────────────────────────────────────────────────────┐
+  │  Councilmember A  /voto-individual/  reload  ──┐     │
+  │  Councilmember B  /voto-individual/  reload  ──┤     │
+  │  Councilmember C  /sessao/2600/ordemdia ───────┤     │
+  │  Staff tab        /sessao/pauta-sessao/2600/ ──┘     │
+  └────────────────────────────┬─────────────────────────┘
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  nginx                              │
+            │                                     │
+            │  location ~ ^/voto-individual/ ─┐   │
+            │  location ~ ^/sessao/\d+       ─┤   │  no limit_req
+            │  location ~ ^/painel/\d+/dados ─┘   │  pass through
+            └──────────────────┬──────────────────┘
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  Django RateLimitMiddleware         │
+            │  RATE_LIMIT_BYPASS_PATHS match?     │
+            │  → yes: return get_response()       │
+            └──────────────────┬──────────────────┘
+                               ▼
+                         ✓ View served
+```
+
+---
+
+### nginx Zone Architecture — Before vs After
+
+**Before** — all traffic sharing one bucket per IP:
+
+```
+  /media/page.pdf ──┐
+  /materia/123/  ───┤──►  sapl_general  rate=30r/m  burst=60
+  /api/materia/? ───┘
+
+  Problem: 20 media attachments on a page burn 20 tokens
+           from the same budget as the HTML page load
+```
+
+**After** — four independent buckets:
+
+```
+  location /             ──►  sapl_general  rate=90r/m   burst=180
+  location /media/       ──►  sapl_media    rate=180r/m  burst=180
+  location /api/         ──►  sapl_api      rate=60r/m   burst=120
+  location /relatorios/  ──►  sapl_heavy    rate=10r/m   burst=20  (nodelay)
+  location /sessao/\d+   ──►  (no zone)     exempt
+  location /voto-indiv.. ──►  (no zone)     exempt
+  location /static/      ──►  (no zone)     disk-served, no Django
+```
+
+---
+
+### Anonymous /api/ NAT Problem — Before vs After
+
+**Before** — anonymous API hits polluted the global IP counter:
+
+```
+  10 staff, JS polling /api/ → 120 req/min from NAT IP
+                │
+                ▼
+  Django _evaluate_anonymous
+    INCR rl:ip:{ip}:reqs → 120 ≥ threshold
+    SET  rl:ip:{ip}:blocked EX 300   ◄── global block
+                │
+                ▼
+  Next GET /materia/ → 429 ip_blocked
+  Next GET /sessao/  → 429 ip_blocked
+  Entire org locked out of ALL paths for 300s
+```
+
+**After** — anonymous API skips the IP counter entirely:
+
+```
+  10 staff, JS polling /api/ → 120 req/min from NAT IP
+                │
+                ▼
+  nginx sapl_api  rate=60r/m  burst=120
+  (throttles sustained traffic)
+                │
+                ▼
+  Django quota check: 500/day not exceeded → pass
+  Anonymous /api/: early return, no _evaluate()
+    rl:ip:{ip}:reqs    NOT incremented
+    rl:ip:{ip}:blocked NOT written
+                │
+                ▼
+  Page requests from same IP: unaffected ✓
+  Worst case: 500 API req/day quota exhausted
+              → only API access blocked, pages still work
+```
+
+---
+
+### Authenticated Rate Breach — Before vs After
+
+```
+  BEFORE                                   AFTER
+  ──────────────────────────────────       ──────────────────────────────────
+  User clicks fast: 241 req in 60s         User clicks fast: 241 req in 60s
+          │                                        │
+          ▼                                        ▼
+  count ≥ 240 (auth threshold)             count ≥ 240 (auth threshold)
+          │                                        │
+          ▼                                        ▼
+  SET rl:user:{uid}:blocked EX 300         return 429 for this request only
+  ZADD rl:index:blocked_users              (no SET, no ZADD)
+          │                                        │
+          ▼                                        ▼
+  All requests for 300s → 429              T+60s: counter key expires
+  User locked out for 5 minutes            User recovers automatically
+  No self-recovery possible                No admin intervention needed
+```
+
+---
+
+### Enforcement Stack Per Path — Trade-off Summary
+
+```
+Path                   nginx zone         Django          Block key?  Notes
+─────────────────────  ─────────────────  ──────────────  ──────────  ──────────────────────
+/static/*              none               none            —           disk-served
+/painel//dados         none (bypass)      none (bypass)   —           high-freq polling
+/voto-individual/*     none (bypass)      none (bypass)   —           live vote
+/sessao//*             none (bypass)      none (bypass)   —           live session
+/media/*               sapl_media         anon counter    anon: yes   auth gate in serve_media
+                       180r/m b=180       runs            auth: no
+/api/* (anonymous)     sapl_api           quota only      no          ← no IP counter; no
+                       60r/m b=120        500/day                       collateral NAT block
+/api/* (auth)          sapl_api           per-user 240/m  no (soft)   per-uid, NAT-safe
+                       60r/m b=120        counter runs
+/relatorios/*          sapl_heavy         counter runs    anon: yes   tight — PDF generation
+                       10r/m b=20
+/* (everything else)   sapl_general       counter runs    anon: yes   normal browsing
+                       90r/m b=180                        auth: no    auth: 429, resets in 60s
+```
+
+`anon: yes` — anonymous IP gets a 300s block key on breach (all paths locked)
+`auth: no` — authenticated users get 429 for that request; counter expires in 60s
+
+---
+
+### The Fundamental NAT Constraint
+
+```
+  IP-based rate limiting cannot distinguish these two scenarios:
+
+  Legitimate (15 users, vote opens simultaneously)
+  ┌─────────────────────────────────────────────┐
+  │  User 1  ──► GET /voto-individual/          │
+  │  User 2  ──► GET /voto-individual/  15 req/s│
+  │  ...                                1 IP    │
+  │  User 15 ──► GET /sessao/2600/              │
+  └─────────────────────────────────────────────┘
+
+  Bot (1 process, 15 threads, scraping)
+  ┌─────────────────────────────────────────────┐
+  │  Thread 1  ──► GET /materia/1/              │
+  │  Thread 2  ──► GET /materia/2/      15 req/s│
+  │  ...                                1 IP    │
+  │  Thread 15 ──► GET /materia/15/             │
+  └─────────────────────────────────────────────┘
+
+  To nginx and an IP counter: identical.
+
+  Mitigations applied
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  Known safe high-freq paths  → bypass at both layers             │
+  │  Authenticated users         → per-user counter (uid), NAT-safe  │
+  │  Anonymous /api/             → quota only, no IP counter         │
+  │  Everything else (anon)      → IP counter + 300s block           │
+  └──────────────────────────────────────────────────────────────────┘
+
+  Long-term
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  APP_ACCESS_KEYs per tenant  → quota per org, not per IP         │
+  │  WebSocket push for voting   → eliminates polling bursts         │
+  └──────────────────────────────────────────────────────────────────┘
+```
+
+---
+
 ## Session/voting bypass (2026-05-06)
 
 ### Problem