From 69a68d0a74efb343af0c58392d8fc6edf68f7aec Mon Sep 17 00:00:00 2001
From: Edward Oliveira
Date: Thu, 7 May 2026 14:57:49 -0300
Subject: [PATCH] Add architecture diagrams to rate-limiter-incidents.md

Seven ASCII diagrams covering:
- NAT thundering herd before/after session bypass
- nginx zone split before/after
- Anonymous /api/ NAT problem before/after
- Authenticated rate breach before/after
- Per-path enforcement stack and trade-off table
- The fundamental NAT constraint and resolution strategies

Co-Authored-By: Claude Sonnet 4.6
---
 docs/rate-limiter-incidents.md | 306 +++++++++++++++++++++++++++++++++
 1 file changed, 306 insertions(+)

diff --git a/docs/rate-limiter-incidents.md b/docs/rate-limiter-incidents.md
index ddd4adf68..9171612f9 100644
--- a/docs/rate-limiter-incidents.md
+++ b/docs/rate-limiter-incidents.md
@@ -188,6 +188,312 @@ The nginx bypass is the correct operational fix for now. The WebSocket rewrite i
 
 ---
 
+## Diagrams — Issues, Solutions, and Trade-offs (2026-05-06/07)
+
+---
+
+### 1. NAT Thundering Herd — Before the Fix
+
+During a live vote all councilmembers reload simultaneously. nginx sees one
+IP, exhausts its bucket, and returns 429 before Django is ever involved.
+Django's per-user counter (NAT-safe) is never consulted.
+
+```
+  Office / Chamber — behind one NAT IP (200.175.17.66)
+  ┌──────────────────────────────────────────────────────┐
+  │  Councilmember A  browser reload ──┐                 │
+  │  Councilmember B  browser reload ──┤                 │
+  │  Councilmember C  browser reload ──┤  ~24 req/s      │
+  │  Staff tab 1      browser reload ──┤  same public IP │
+  │  Staff tab 2      browser reload ──┘                 │
+  └────────────────────────────┬─────────────────────────┘
+                               │  all requests look identical to nginx
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  nginx  sapl_general                │
+            │  rate=30r/m  burst=60  nodelay      │
+            │                                     │
+            │  token bucket: 0 tokens remaining   │
+            │  → 429 returned immediately         │
+            └──────────────────┬──────────────────┘
+                               │
+                               ╳  Django never reached
+                               ╳  rl:ip:{ip}:reqs never incremented
+                               ╳  rl:user:{uid}:reqs never incremented
+                               ╳  per-user NAT-safe counter never consulted
+                               │
+                               ▼
+                  429 for all N users in the org
+                  recovery: wait for nginx bucket refill
+                  (~3–10 min depending on depletion)
+                  NOT a Django 300s block (Redis never written)
+```
+
+---
+
+### 2. NAT Thundering Herd — After the Session Bypass Fix
+
+Session and voting paths have dedicated nginx `location` blocks with no
+`limit_req`. Regex locations take priority over `location /`.
+
+```
+  Office / Chamber — behind one NAT IP
+  ┌──────────────────────────────────────────────────────┐
+  │  Councilmember A  /voto-individual/ reload   ──┐     │
+  │  Councilmember B  /voto-individual/ reload   ──┤     │
+  │  Councilmember C  /sessao/2600/ordemdia ───────┤     │
+  │  Staff tab        /sessao/pauta-sessao/2600/ ──┘     │
+  └────────────────────────────┬─────────────────────────┘
+                               │
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  nginx                              │
+            │                                     │
+            │  location ~ ^/voto-individual/ ─┐   │
+            │  location ~ ^/sessao/\d+       ─┤   │  no limit_req
+            │  location ~ ^/painel/\d+/dados ─┘   │  pass through
+            └──────────────────┬──────────────────┘
+                               │
+                               ▼
+            ┌─────────────────────────────────────┐
+            │  Django RateLimitMiddleware         │
+            │                                     │
+            │  RATE_LIMIT_BYPASS_PATHS match?     │
+            │  → yes: return get_response()       │
+            │    (no counter, no block check)     │
+            └──────────────────┬──────────────────┘
+                               │
+                               ▼
+                  ✓ View served
+                  All N users get their page
+```
+
+---
+
+### 3. nginx Zone Architecture — Before vs After
+
+**Before (single shared zone)**
+
+```
+  All traffic (HTML + media + API)
+                  │
+                  ▼
+  ┌───────────────────────────────┐
+  │  sapl_general                 │  ← one bucket per IP
+  │  rate=30r/m  burst=60         │
+  │                               │
+  │  /media/page.pdf  ──────────┐ │  ← media request drains
+  │  /materia/123/    ──────────┤ │    the same bucket as
+  │  /api/materia/?   ──────────┘ │    the HTML page
+  └───────────────────────────────┘
+
+  Problem: a page with 20 media attachments
+  burns 20 tokens from the page-load budget
+```
+
+**After (four independent zones)**
+
+```
+  ┌─────────────────┐  location /
+  │  sapl_general   │  rate=90r/m   burst=180   — HTML page requests
+  └─────────────────┘
+
+  ┌─────────────────┐  location /media/
+  │  sapl_media     │  rate=180r/m  burst=180   — media downloads
+  └─────────────────┘                             own bucket, never
+                                                  drains page quota
+  ┌─────────────────┐  location /api/
+  │  sapl_api       │  rate=60r/m   burst=120   — API calls
+  └─────────────────┘                             quota layer is
+                                                  real constraint
+  ┌─────────────────┐  location /relatorios/
+  │  sapl_heavy     │  rate=10r/m   burst=20    — PDF generation
+  └─────────────────┘  nodelay                    tight by design
+
+  Session/voting paths: NO zone — exempt from limit_req entirely
+  Static files /static/: NO zone — served directly from disk
+```
+
+---
+
+### 4. Anonymous /api/ NAT Problem — Before vs After
+
+**Before**
+
+```
+  Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  nginx sapl_general (was shared)    │
+  │  burst not yet exhausted → pass     │
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  Django _evaluate_anonymous         │
+  │                                     │
+  │  INCR rl:ip:{ip}:reqs               │
+  │  count = 120 ≥ threshold (120/m)    │
+  │                                     │
+  │  SET rl:ip:{ip}:blocked EX 300      │◄── block key written
+  │  ZADD rl:index:blocked_ips          │    affects ALL paths
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+  Next request to /materia/, /sessao/, /voto-individual/ ...
+  → 429 ip_blocked (300s)
+  Entire org locked out of ALL SAPL pages
+  because of JS polling the API
+```
+
+**After**
+
+```
+  Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  nginx sapl_api                     │
+  │  rate=60r/m burst=120 nodelay       │
+  │  throttles burst, passes remainder  │
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  Django: quota check                │
+  │  500 req/day not exceeded → pass    │
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  Anonymous /api/ early return       │
+  │                                     │
+  │  rl:ip:{ip}:reqs    NOT incremented │◄── no counter
+  │  rl:ip:{ip}:blocked NOT written     │◄── no block key
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+        ✓ View served
+        Page requests from same IP unaffected
+```
+
+---
+
+### 5. Authenticated Rate Breach — Before vs After
+
+**Before**
+
+```
+  Authenticated user clicks rapidly: 241 requests in 60s
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  _evaluate_authenticated            │
+  │  INCR rl:user:{uid}:reqs → 241      │
+  │  count ≥ 240 (auth threshold)       │
+  │                                     │
+  │  SET rl:user:{uid}:blocked EX 300   │◄── 5-minute lockout
+  │  ZADD rl:index:blocked_users        │
+  └──────────────────┬──────────────────┘
+                     │
+                     ▼
+  All requests for 300s → 429 user_blocked
+  User must wait 5 minutes to do anything
+  No way to self-recover sooner
+```
+
+**After**
+
+```
+  Authenticated user clicks rapidly: 241 requests in 60s
+                     │
+                     ▼
+  ┌─────────────────────────────────────┐
+  │  _evaluate_authenticated            │
+  │  INCR rl:user:{uid}:reqs → 241      │
+  │  count ≥ 240 (auth threshold)       │
+  │                                     │
+  │  return 429 auth_user_rate          │◄── this request only
+  │  (no SET, no ZADD)                  │◄── no block key written
+  └──────────────────┬──────────────────┘
+                     │
+        Counter TTL = 60s (auth_window)
+                     │
+                     ▼
+  T+60s: rl:user:{uid}:reqs expires
+  User automatically recovers
+  No admin intervention needed
+```
+
+---
+
+### 6. Enforcement Stack Per Path — Trade-off Summary
+
+```
+Path                    nginx zone       Django counter    Block written?  Notes
+──────────────────────  ───────────────  ────────────────  ──────────────  ──────────────────────────────
+/static/*               none             none              —               disk-served, zero Django cost
+/painel//dados          none             none              —               bypass: high-freq polling
+/voto-individual/*      none             none              —               bypass: live vote
+/sessao//*              none             none              —               bypass: live session
+/sessao/pauta-sessao/*  none             none              —               bypass: live session
+/media/*                sapl_media       anon IP / auth    anon: yes       auth gate in serve_media()
+                        180r/m b=180     counter runs      auth: no
+/api/* (anonymous)      sapl_api         quota only        no              ← key change: no IP counter,
+                        60r/m b=120      500/day                           — no collateral NAT block
+/api/* (authenticated)  sapl_api         per-user 240/m    no (soft)       per-user, NAT-safe
+                        60r/m b=120      counter runs
+/relatorios/*           sapl_heavy       anon/auth runs    anon: yes       tight rate — PDF generation
+                        10r/m b=20       at Django         auth: no
+/* (everything else)    sapl_general     anon/auth runs    anon: yes       normal page navigation
+                        90r/m b=180      at Django         auth: no        auth gets 240/m soft limit
+```
+
+**Legend:**
+- `anon: yes` — anonymous IP gets a 300s block key on breach
+- `auth: no` — authenticated users get 429 for that request, window resets in 60s, no persistent block
+- `none` — no rate limiting at either layer (path is exempt)
+
+---
+
+### 7. The Fundamental NAT Constraint
+
+```
+  IP-based rate limiting cannot distinguish these two scenarios:
+
+  Scenario A — Legitimate (15 users, 1 tab each, vote opens)
+  ┌──────────────────────────────────────────────────────┐
+  │  User 1  ──► GET /voto-individual/                   │
+  │  User 2  ──► GET /voto-individual/     15 req/s      │
+  │  ...                                   1 public IP   │
+  │  User 15 ──► GET /sessao/2600/ordemdia               │
+  └──────────────────────────────────────────────────────┘
+
+  Scenario B — Bot (1 process, 15 threads, scraping)
+  ┌──────────────────────────────────────────────────────┐
+  │  Thread 1  ──► GET /materia/1/                       │
+  │  Thread 2  ──► GET /materia/2/         15 req/s      │
+  │  ...                                   1 public IP   │
+  │  Thread 15 ──► GET /materia/15/                      │
+  └──────────────────────────────────────────────────────┘
+
+  To nginx and an IP-based counter: identical.
+
+  Resolution strategies applied:
+  ┌──────────────────────────────────────────────────────────────────┐
+  │  Known safe high-freq paths  → nginx bypass + Django bypass      │
+  │  Authenticated users         → per-user counter (uid), NAT-safe  │
+  │  Anonymous /api/             → quota only, no IP counter         │
+  │  Everything else (anon)      → IP counter + 300s block on breach │
+  └──────────────────────────────────────────────────────────────────┘
+
+  Long-term: APP_ACCESS_KEYs per tenant → quota per org, not per IP
+             WebSocket push for voting  → eliminates polling bursts
+```
+
+---
+
 ## Pending Investigations
 
 The following incidents may or may not share the same root cause as the PatoBranco-PR event. Each should be investigated using the OpenSearch query patterns established above and documented here.