Add rate limiting architecture diagrams to RATE-LIMITER-PLAN.md

New section: Rate Limiting — Architecture Diagrams, covering:
- NAT thundering herd before/after session bypass
- nginx zone split before/after
- Anonymous /api/ NAT problem before/after
- Authenticated rate breach before/after
- Per-path enforcement stack and trade-off table
- The fundamental NAT constraint and mitigations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rate-limiter-2026
Edward Ribeiro 3 weeks ago
parent
commit
425519a967
  1. 225
      plan/RATE-LIMITER-PLAN.md

@@ -786,6 +786,231 @@ are env-var configurable at container start via `start.sh` defaults.
---
## Rate Limiting — Architecture Diagrams
### NAT Thundering Herd — Before the Fix
During a live vote, all councilmembers reload at once. Because they share one
NAT address, nginx sees a single IP, exhausts that IP's bucket, and returns 429
before the request reaches Django. Django's per-user counter, which is NAT-safe,
is therefore never consulted.
```
Office / Chamber — behind one NAT IP
┌──────────────────────────────────────────────────────┐
│  Councilmember A   browser reload ──┐                │
│  Councilmember B   browser reload ──┤                │
│  Councilmember C   browser reload ──┤ ~24 req/s      │
│  Staff tab 1       browser reload ──┤ same public IP │
│  Staff tab 2       browser reload ──┘                │
└────────────────────────────┬─────────────────────────┘
                             │  all requests look identical to nginx
          ┌─────────────────────────────────────┐
          │  nginx  sapl_general                │
          │  rate=30r/m  burst=60  nodelay      │
          │                                     │
          │  token bucket: 0 tokens remaining   │
          │  → 429 returned immediately         │
          └──────────────────┬──────────────────┘
                             ╳ Django never reached
                             ╳ rl:ip:{ip}:reqs never incremented
                             ╳ rl:user:{uid}:reqs never consulted

                    429 for all N users in the org
                    recovery: nginx bucket refill (~3–10 min)
                    NOT a Django 300s block: Redis never written
```
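
As a rough illustration of why the shared bucket empties so quickly, the sketch below models the `sapl_general` zone as a simple token bucket. It is an approximation under stated assumptions, not nginx's actual `limit_req` algorithm (which is a leaky bucket with finer time granularity). `simulate_bucket` is a hypothetical helper; only the rate, burst, and ~24 req/s figures come from the diagram.

```python
def simulate_bucket(rate_per_min, burst, arrivals_per_sec, seconds):
    """Return (served, rejected) for a steady request stream hitting one bucket.

    Simplified model: the bucket starts full, refills once per second,
    and every request costs one token (the "nodelay" behaviour).
    """
    refill_per_sec = rate_per_min / 60.0
    tokens = float(burst)
    served = rejected = 0
    for _ in range(seconds):
        # Refill, but never above the burst capacity.
        tokens = min(float(burst), tokens + refill_per_sec)
        for _ in range(arrivals_per_sec):
            if tokens >= 1:
                tokens -= 1
                served += 1
            else:
                rejected += 1  # nginx would answer 429 here
    return served, rejected


if __name__ == "__main__":
    # All users behind the NAT share this single bucket.
    print(simulate_bucket(rate_per_min=30, burst=60, arrivals_per_sec=24, seconds=10))
```

In this model the initial burst is spent within the first three seconds. After that the bucket admits roughly one request every two seconds, so almost every reload from the shared IP is rejected.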
---
### NAT Thundering Herd — After the Session Bypass Fix
```
Office / Chamber — behind one NAT IP
┌──────────────────────────────────────────────────────┐
│  Councilmember A   /voto-individual/ reload   ──┐    │
│  Councilmember B   /voto-individual/ reload   ──┤    │
│  Councilmember C   /sessao/2600/ordemdia ───────┤    │
│  Staff tab         /sessao/pauta-sessao/2600/ ──┘    │
└────────────────────────────┬─────────────────────────┘
          ┌─────────────────────────────────────┐
          │  nginx                              │
          │                                     │
          │  location ~ ^/voto-individual/ ─┐   │
          │  location ~ ^/sessao/\d+        ─┤   │  no limit_req
          │  location ~ ^/painel/\d+/dados  ─┘   │  pass through
          └──────────────────┬──────────────────┘
          ┌─────────────────────────────────────┐
          │  Django RateLimitMiddleware         │
          │  RATE_LIMIT_BYPASS_PATHS match?     │
          │  → yes: return get_response()       │
          └──────────────────┬──────────────────┘
                       ✓ View served
```
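
The Django half of the bypass can be sketched as follows. This is a minimal illustration, not the project's middleware: the regexes are copied from the nginx locations in the diagram and are assumed to match what `RATE_LIMIT_BYPASS_PATHS` contains, and the `evaluate` hook is a hypothetical stand-in for the real counter logic.

```python
import re

# Assumption: the Django setting mirrors the nginx location patterns above.
RATE_LIMIT_BYPASS_PATHS = [
    re.compile(r"^/voto-individual/"),
    re.compile(r"^/sessao/\d+"),
    re.compile(r"^/painel/\d+/dados"),
]


def is_bypassed(path):
    """True when the request path matches any bypass pattern."""
    return any(pattern.match(path) for pattern in RATE_LIMIT_BYPASS_PATHS)


class RateLimitMiddleware:
    """Sketch of a Django-style middleware with an early bypass."""

    def __init__(self, get_response, evaluate=None):
        self.get_response = get_response
        # Hypothetical hook: returns a 429 response, or None to allow.
        self.evaluate = evaluate or (lambda request: None)

    def __call__(self, request):
        if is_bypassed(request.path):
            # Live-session paths skip every counter and go straight to the view.
            return self.get_response(request)
        blocked = self.evaluate(request)
        if blocked is not None:
            return blocked
        return self.get_response(request)
```

Checking the bypass before any Redis access keeps the live-vote paths independent of counter state, which is the property the diagram relies on.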
---
### nginx Zone Architecture — Before vs After
**Before** — all traffic sharing one bucket per IP:
```
/media/page.pdf  ──┐
/materia/123/    ──┤──► sapl_general  rate=30r/m  burst=60
/api/materia/?   ──┘

Problem: 20 media attachments on a page burn 20 tokens
         from the same budget as the HTML page load
```
**After** — four independent buckets:
```
location /              ──► sapl_general  rate=90r/m   burst=180
location /media/        ──► sapl_media    rate=180r/m  burst=180
location /api/          ──► sapl_api      rate=60r/m   burst=120
location /relatorios/   ──► sapl_heavy    rate=10r/m   burst=20   (nodelay)
location /sessao/\d+    ──► (no zone)     exempt
location /voto-indiv..  ──► (no zone)     exempt
location /static/       ──► (no zone)     disk-served, no Django
```
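
To make the effect of the split concrete, the sketch below maps request paths to zones and counts how many tokens one page load draws from each bucket. It is a simplification: nginx's real location matching follows its own precedence rules rather than a first-match list, and `zone_for` and `tokens_spent` are hypothetical helpers.

```python
import re
from collections import Counter

# (pattern, zone) pairs checked in order; None means no limit_req zone.
ZONES = [
    (re.compile(r"^/static/"), None),
    (re.compile(r"^/sessao/\d+"), None),
    (re.compile(r"^/voto-individual/"), None),
    (re.compile(r"^/media/"), "sapl_media"),
    (re.compile(r"^/api/"), "sapl_api"),
    (re.compile(r"^/relatorios/"), "sapl_heavy"),
]


def zone_for(path):
    """Return the zone name charged for this path, or None if exempt."""
    for pattern, zone in ZONES:
        if pattern.match(path):
            return zone
    return "sapl_general"  # location / catches everything else


def tokens_spent(paths):
    """Count tokens drawn from each zone by a list of request paths."""
    return Counter(zone for zone in map(zone_for, paths) if zone is not None)


if __name__ == "__main__":
    # One HTML page plus 20 media attachments, as in the "Before" problem.
    page_load = ["/materia/123/"] + [f"/media/doc{i}.pdf" for i in range(20)]
    print(tokens_spent(page_load))
```

Under the single-zone layout the same page load would have cost 21 tokens from `sapl_general`; with the split it costs one token there and 20 from `sapl_media`.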
---
### Anonymous /api/ NAT Problem — Before vs After
**Before** — anonymous API hits polluted the global IP counter:
```
10 staff, JS polling /api/ → 120 req/min from NAT IP
Django _evaluate_anonymous
INCR rl:ip:{ip}:reqs → 120 ≥ threshold
SET rl:ip:{ip}:blocked EX 300 ◄── global block
Next GET /materia/ → 429 ip_blocked
Next GET /sessao/ → 429 ip_blocked
Entire org locked out of ALL paths for 300s
```
**After** — anonymous API skips the IP counter entirely:
```
10 staff, JS polling /api/ → 120 req/min from NAT IP
nginx sapl_api rate=60r/m burst=120
(throttles sustained traffic)
Django quota check: 500/day not exceeded → pass
Anonymous /api/: early return, no _evaluate()
rl:ip:{ip}:reqs NOT incremented
rl:ip:{ip}:blocked NOT written
Page requests from same IP: unaffected ✓
Worst case: 500 API req/day quota exhausted
→ only API access blocked, pages still work
```
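
The "After" flow can be sketched with a plain dict standing in for Redis. Several details here are assumptions made for illustration: the quota key name, the per-IP scope of the quota, and the anonymous threshold of 120 (the source only says the count reached the threshold). TTLs are omitted for brevity.

```python
ANON_API_DAILY_QUOTA = 500  # from the diagram
ANON_IP_THRESHOLD = 120     # hypothetical value; the real threshold is not stated


def check_anonymous(store, ip, path):
    """Return the HTTP status for one anonymous request.

    `store` is a dict standing in for Redis; expiry is not modelled.
    """
    if path.startswith("/api/"):
        quota_key = f"rl:quota:{ip}:api"  # hypothetical key name
        store[quota_key] = store.get(quota_key, 0) + 1
        if store[quota_key] > ANON_API_DAILY_QUOTA:
            return 429  # only API access is refused
        return 200      # early return: the IP counter is never touched

    if store.get(f"rl:ip:{ip}:blocked"):
        return 429
    reqs_key = f"rl:ip:{ip}:reqs"
    store[reqs_key] = store.get(reqs_key, 0) + 1
    if store[reqs_key] >= ANON_IP_THRESHOLD:
        store[f"rl:ip:{ip}:blocked"] = True  # real code: SET ... EX 300
        return 429
    return 200
```

Because the `/api/` branch returns before the IP counter is read or written, heavy polling from one office can exhaust the API quota without affecting page requests from the same address.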
---
### Authenticated Rate Breach — Before vs After
```
BEFORE                               AFTER
──────────────────────────────────   ──────────────────────────────────
User clicks fast: 241 req in 60s     User clicks fast: 241 req in 60s
        │                                    │
        ▼                                    ▼
count ≥ 240 (auth threshold)         count ≥ 240 (auth threshold)
        │                                    │
        ▼                                    ▼
SET rl:user:{uid}:blocked EX 300     return 429 for this request only
ZADD rl:index:blocked_users          (no SET, no ZADD)
        │                                    │
        ▼                                    ▼
All requests for 300s → 429          T+60s: counter key expires
User locked out for 5 minutes        User recovers automatically
No self-recovery possible            No admin intervention needed
```
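
A minimal sketch of the "After" behaviour, assuming a fixed 60-second window per user. The in-memory dict stands in for the `rl:user:{uid}:reqs` Redis key and its TTL. `SoftUserLimiter` is a hypothetical name, and the caller supplies the clock so the example stays deterministic.

```python
AUTH_THRESHOLD = 240  # requests per window, from the diagram
WINDOW_SECONDS = 60


class SoftUserLimiter:
    """Per-user counter that rejects over-limit requests but never sets a block key."""

    def __init__(self):
        # uid -> (window_start, count); models a Redis counter with a 60s expiry.
        self._windows = {}

    def check(self, uid, now):
        start, count = self._windows.get(uid, (now, 0))
        if now - start >= WINDOW_SECONDS:
            # The counter key has expired, so a fresh window begins.
            start, count = now, 0
        count += 1
        self._windows[uid] = (start, count)
        # Soft breach: 429 for this request only, no separate block state.
        return 429 if count >= AUTH_THRESHOLD else 200
```

Since nothing is stored beyond the counter itself, the user recovers automatically once the window expires, without admin intervention.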
---
### Enforcement Stack Per Path — Trade-off Summary
```
Path                   nginx zone         Django          Block key?  Notes
─────────────────────  ─────────────────  ──────────────  ──────────  ──────────────────────
/static/*              none               none            -           disk-served
/painel/<pk>/dados     none (bypass)      none (bypass)   -           high-freq polling
/voto-individual/*     none (bypass)      none (bypass)   -           live vote
/sessao/<pk>/*         none (bypass)      none (bypass)   -           live session
/media/*               sapl_media         anon counter    anon: yes   auth gate in serve_media
                       180r/m b=180       runs            auth: no
/api/* (anonymous)     sapl_api           quota only      no          ← no IP counter; no
                       60r/m b=120        500/day                     collateral NAT block
/api/* (auth)          sapl_api           per-user 240/m  no (soft)   per-uid, NAT-safe
                       60r/m b=120        counter runs
/relatorios/*          sapl_heavy         counter runs    anon: yes   tight (PDF generation)
                       10r/m b=20
/* (everything else)   sapl_general       counter runs    anon: yes   normal browsing
                       90r/m b=180                        auth: no    auth: 429, resets in 60s
```
- `anon: yes` means an anonymous IP gets a 300s block key on breach (all paths locked).
- `auth: no` means authenticated users get 429 for that request only; the counter expires in 60s.
---
### The Fundamental NAT Constraint
```
IP-based rate limiting cannot distinguish these two scenarios:
Legitimate (15 users, vote opens simultaneously)
┌─────────────────────────────────────────────┐
│  User 1  ──► GET /voto-individual/          │
│  User 2  ──► GET /voto-individual/  15 req/s│
│  ...                                1 IP    │
│  User 15 ──► GET /sessao/2600/              │
└─────────────────────────────────────────────┘

Bot (1 process, 15 threads, scraping)
┌─────────────────────────────────────────────┐
│  Thread 1  ──► GET /materia/1/              │
│  Thread 2  ──► GET /materia/2/      15 req/s│
│  ...                                1 IP    │
│  Thread 15 ──► GET /materia/15/             │
└─────────────────────────────────────────────┘

To nginx and an IP counter: identical.

Mitigations applied
┌──────────────────────────────────────────────────────────────────┐
│  Known safe high-freq paths → bypass at both layers              │
│  Authenticated users        → per-user counter (uid), NAT-safe   │
│  Anonymous /api/            → quota only, no IP counter          │
│  Everything else (anon)     → IP counter + 300s block            │
└──────────────────────────────────────────────────────────────────┘

Long-term
┌──────────────────────────────────────────────────────────────────┐
│  APP_ACCESS_KEYs per tenant → quota per org, not per IP          │
│  WebSocket push for voting  → eliminates polling bursts          │
└──────────────────────────────────────────────────────────────────┘
```
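
The constraint comes down to which key a request is counted under. The sketch below uses the key formats named in this section; `counter_key` and `load_per_key` are hypothetical helpers, and the IP is a documentation address.

```python
from collections import Counter


def counter_key(ip, uid=None):
    """Pick the counter key: per-user when authenticated, per-IP otherwise."""
    if uid is not None:
        return f"rl:user:{uid}:reqs"
    return f"rl:ip:{ip}:reqs"


def load_per_key(requests):
    """Count requests per counter key for a list of (ip, uid) pairs."""
    return Counter(counter_key(ip, uid) for ip, uid in requests)


if __name__ == "__main__":
    nat_ip = "203.0.113.7"
    office_anonymous = [(nat_ip, None)] * 15                      # 15 people, one IP
    scraper = [(nat_ip, None)] * 15                               # 15 threads, one IP
    office_logged_in = [(nat_ip, uid) for uid in range(1, 16)]    # same people, authenticated

    print(load_per_key(office_anonymous) == load_per_key(scraper))  # same single key
    print(len(load_per_key(office_logged_in)))                      # one key per user
```

Anonymous traffic from the office and from the scraper lands on the same single key, so no threshold can separate them. Once users authenticate, the load spreads across one key per uid, which is why the per-user counter is described as NAT-safe.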
---
## Session/voting bypass (2026-05-06)
### Problem
