@ -188,6 +188,312 @@ The nginx bypass is the correct operational fix for now. The WebSocket rewrite i
---
## Diagrams — Issues, Solutions, and Trade-offs (2026-05-06/07)
---
### 1. NAT Thundering Herd — Before the Fix
During a live vote all councilmembers reload simultaneously. nginx sees one
IP, exhausts its bucket, and returns 429 before Django is ever involved.
Django's per-user counter (NAT-safe) is never consulted.
```
Office / Chamber — behind one NAT IP (200.175.17.66)
┌──────────────────────────────────────────────────────┐
│ Councilmember A browser reload ──┐ │
│ Councilmember B browser reload ──┤ │
│ Councilmember C browser reload ──┤ ~24 req/s │
│ Staff tab 1 browser reload ──┤ same public IP │
│ Staff tab 2 browser reload ──┘ │
└────────────────────────────┬─────────────────────────┘
│ all requests look identical to nginx
▼
┌─────────────────────────────────────┐
│ nginx sapl_general │
│ rate=30r/m burst=60 nodelay │
│ │
│ token bucket: 0 tokens remaining │
│ → 429 returned immediately │
└──────────────────┬──────────────────┘
│
╳ Django never reached
╳ rl:ip:{ip}:reqs never incremented
╳ rl:user:{uid}:reqs never incremented
╳ per-user NAT-safe counter never consulted
│
▼
429 for all N users in the org
recovery: wait for nginx bucket refill
(~3–10 min depending on depletion)
NOT a Django 300s block (Redis never written)
```
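The exhaustion in the diagram can be reproduced with a small model. The sketch below is a simplified approximation of one `limit_req` key with `nodelay`, not nginx's actual implementation; it uses the `rate=30r/m burst=60` values from the diagram. The names `LeakyBucket` and `simulate_herd` and the millisecond clock are assumptions made for illustration.

```python
class LeakyBucket:
    """Simplified model of one limit_req key (one client IP) with nodelay."""

    def __init__(self, rate_per_min: int, burst: int) -> None:
        # Work in millirequests so the arithmetic stays in integers.
        self.rate_milli_per_s = rate_per_min * 1000 // 60
        self.burst_milli = burst * 1000
        self.excess_milli = 0
        self.last_ms = None  # time of the last accepted request

    def allow(self, now_ms: int) -> bool:
        if self.last_ms is None:
            # First request for this key is always accepted.
            self.last_ms = now_ms
            return True
        elapsed_ms = now_ms - self.last_ms
        drained = self.rate_milli_per_s * elapsed_ms // 1000
        excess = max(self.excess_milli - drained, 0) + 1000
        if excess > self.burst_milli:
            return False  # 429; rejected requests leave the state untouched
        self.excess_milli = excess
        self.last_ms = now_ms
        return True


def simulate_herd(n_requests: int, rate_per_min: int = 30, burst: int = 60):
    """Fire n_requests at the same instant from one IP; return (accepted, rejected)."""
    bucket = LeakyBucket(rate_per_min, burst)
    accepted = sum(bucket.allow(now_ms=0) for _ in range(n_requests))
    return accepted, n_requests - accepted


if __name__ == "__main__":
    print(simulate_herd(100))
```

Under this model the number of distinct users behind the NAT address is irrelevant: only the first `burst + 1` requests from the shared IP pass, and everything after that is rejected before Django is reached.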
---
### 2. NAT Thundering Herd — After the Session Bypass Fix
Session and voting paths have dedicated nginx `location` blocks with no
`limit_req`. Regex locations take priority over the prefix match `location /`.
```
Office / Chamber — behind one NAT IP
┌──────────────────────────────────────────────────────┐
│ Councilmember A /voto-individual/ reload ──┐ │
│ Councilmember B /voto-individual/ reload ──┤ │
│ Councilmember C /sessao/2600/ordemdia ───────┤ │
│ Staff tab /sessao/pauta-sessao/2600/ ──┘ │
└────────────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ nginx │
│ │
│ location ~ ^/voto-individual/ ─┐ │
│ location ~ ^/sessao/\d+ ─┤ │ no limit_req
│ location ~ ^/painel/\d+/dados ─┘ │ pass through
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django RateLimitMiddleware │
│ │
│ RATE_LIMIT_BYPASS_PATHS match? │
│ → yes: return get_response() │
│ (no counter, no block check) │
└──────────────────┬──────────────────┘
│
▼
✓ View served
All N users get their page
```
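The Django half of the bypass can be sketched as follows. This is a minimal illustration, assuming `RATE_LIMIT_BYPASS_PATHS` is a list of compiled regexes that mirror the nginx locations in the diagram; the real setting's format is not shown here. The `/sessao/pauta-sessao/` pattern is inferred from the staff-tab path in the diagram, and the `limiter` argument is a hypothetical stand-in for the counter and block checks.

```python
import re
from types import SimpleNamespace

# Assumed format: compiled regexes mirroring the nginx regex locations.
RATE_LIMIT_BYPASS_PATHS = [
    re.compile(r"^/voto-individual/"),
    re.compile(r"^/sessao/\d+"),
    re.compile(r"^/sessao/pauta-sessao/"),  # inferred from the diagram, not confirmed
    re.compile(r"^/painel/\d+/dados"),
]


class RateLimitMiddleware:
    """Sketch of the bypass branch only; counters are delegated to `limiter`."""

    def __init__(self, get_response, limiter=None):
        self.get_response = get_response
        # Hypothetical hook: returns a response to short-circuit, or None to continue.
        self.limiter = limiter or (lambda request: None)

    def __call__(self, request):
        if any(p.match(request.path) for p in RATE_LIMIT_BYPASS_PATHS):
            # Bypass: no counter is incremented and no block key is checked.
            return self.get_response(request)
        blocked = self.limiter(request)
        if blocked is not None:
            return blocked
        return self.get_response(request)


if __name__ == "__main__":
    mw = RateLimitMiddleware(lambda request: "view", limiter=lambda request: "429")
    print(mw(SimpleNamespace(path="/voto-individual/")))  # bypassed path
    print(mw(SimpleNamespace(path="/materia/1/")))        # limiter is consulted
```

Keeping the pattern list identical at both layers matters: a path exempted in nginx but not in Django would still be counted by the middleware.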
---
### 3. nginx Zone Architecture — Before vs After
**Before (single shared zone)**
```
All traffic (HTML + media + API)
│
▼
┌───────────────────────────────┐
│ sapl_general │ ← one bucket per IP
│ rate=30r/m burst=60 │
│ │
│ /media/page.pdf ──────────┐ │ ← media request drains
│ /materia/123/ ──────────┤ │ the same bucket as
│ /api/materia/? ──────────┘ │ the HTML page
└───────────────────────────────┘
Problem: a page with 20 media attachments
burns 20 tokens from the page-load budget
```
**After (four independent zones)**
```
┌─────────────────┐  location /
│  sapl_general   │  rate=90r/m   burst=180   — HTML page requests
└─────────────────┘
┌─────────────────┐  location /media/
│  sapl_media     │  rate=180r/m  burst=180   — media downloads
└─────────────────┘                             own bucket, never
                                                drains page quota
┌─────────────────┐  location /api/
│  sapl_api       │  rate=60r/m   burst=120   — API calls
└─────────────────┘                             quota layer is
                                                real constraint
┌─────────────────┐  location /relatorios/
│  sapl_heavy     │  rate=10r/m   burst=20    — PDF generation
└─────────────────┘  nodelay                    tight by design
Session/voting paths:   NO zone — exempt from limit_req entirely
Static files /static/:  NO zone — served directly from disk
```
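The token arithmetic behind the split can be checked with a short model. The sketch below maps paths to zones using the prefixes from the "After" diagram and counts how many tokens one page load spends per zone. It ignores the regex bypass locations, and it assumes that before the change every request hit `sapl_general`. The function names are illustrative only.

```python
from collections import Counter

# Prefix -> zone, taken from the "After" diagram; None means no limit_req.
ZONES_AFTER = {
    "/static/": None,
    "/media/": "sapl_media",
    "/api/": "sapl_api",
    "/relatorios/": "sapl_heavy",
    "/": "sapl_general",
}


def zone_after(path: str):
    """Longest matching prefix wins, as with nginx prefix locations."""
    prefix = max((p for p in ZONES_AFTER if path.startswith(p)), key=len)
    return ZONES_AFTER[prefix]


def zone_before(path: str):
    """Assumption: a single shared zone handled every request."""
    return "sapl_general"


def tokens_spent(paths, zone_for) -> Counter:
    """Count tokens consumed per zone for a list of request paths."""
    return Counter(z for z in map(zone_for, paths) if z is not None)


# One HTML page that references 20 media attachments.
page_load = ["/materia/123/"] + [f"/media/doc{i}.pdf" for i in range(20)]

if __name__ == "__main__":
    print(tokens_spent(page_load, zone_before))
    print(tokens_spent(page_load, zone_after))
```

With the shared zone the page load costs 21 tokens from one bucket; with separate zones the page itself costs a single `sapl_general` token and the attachments draw only on `sapl_media`.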
---
### 4. Anonymous /api/ NAT Problem — Before vs After
**Before**
```
Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
│
▼
┌─────────────────────────────────────┐
│ nginx sapl_general (was shared) │
│ burst not yet exhausted → pass │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django _evaluate_anonymous │
│ │
│ INCR rl:ip:{ip}:reqs │
│ count = 120 ≥ threshold (120/m) │
│ │
│ SET rl:ip:{ip}:blocked EX 300 │◄── block key written
│ ZADD rl:index:blocked_ips │ affects ALL paths
└──────────────────┬──────────────────┘
│
▼
Next request to /materia/, /sessao/, /voto-individual/ ...
→ 429 ip_blocked (300s)
Entire org locked out of ALL SAPL pages
because of JS polling the API
```
**After**
```
Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
│
▼
┌─────────────────────────────────────┐
│ nginx sapl_api │
│ rate=60r/m burst=120 nodelay │
│ throttles burst, passes remainder │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django: quota check │
│ 500 req/day not exceeded → pass │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Anonymous /api/ early return │
│ │
│ rl:ip:{ip}:reqs NOT incremented │◄── no counter
│ rl:ip:{ip}:blocked NOT written │◄── no block key
└──────────────────┬──────────────────┘
│
▼
✓ View served
Page requests from same IP unaffected
```
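The difference between the two flows can be sketched with a plain dict standing in for Redis. This is a minimal model under stated assumptions: TTLs are not simulated, the order of the checks is guessed from the diagrams, the breaching request itself is assumed to receive the 429, and the quota key name is hypothetical. The `rl:ip:{ip}:reqs` and `rl:ip:{ip}:blocked` keys and the 120/min and 500/day limits come from the diagrams.

```python
ANON_THRESHOLD = 120  # requests per window, from the diagram
DAILY_QUOTA = 500     # anonymous /api/ quota per day, from the diagram


def evaluate_anonymous(store: dict, ip: str, path: str, api_early_return: bool = True) -> int:
    """Return an HTTP status for one anonymous request.

    `store` is an in-memory stand-in for Redis. Set api_early_return=False
    to model the behaviour before the fix.
    """
    if f"rl:ip:{ip}:blocked" in store:
        return 429  # ip_blocked: applies to every path

    if path.startswith("/api/"):
        quota_key = f"quota:{ip}"  # hypothetical key name
        store[quota_key] = store.get(quota_key, 0) + 1
        if store[quota_key] > DAILY_QUOTA:
            return 429
        if api_early_return:
            return 200  # no IP counter incremented, no block key written

    reqs_key = f"rl:ip:{ip}:reqs"
    store[reqs_key] = store.get(reqs_key, 0) + 1
    if store[reqs_key] >= ANON_THRESHOLD:
        store[f"rl:ip:{ip}:blocked"] = True  # SET ... EX 300 in the real system
        return 429
    return 200


def poll_api(store: dict, ip: str, n: int, api_early_return: bool) -> list:
    """Simulate n anonymous /api/ polls from one NAT IP."""
    return [evaluate_anonymous(store, ip, "/api/materia/", api_early_return) for _ in range(n)]
```

Running 120 polls through both modes shows the collateral effect: with the early return disabled, a later request to `/materia/1/` from the same IP is rejected; with it enabled, the same request is served.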
---
### 5. Authenticated Rate Breach — Before vs After
**Before**
```
Authenticated user clicks rapidly: 241 requests in 60s
│
▼
┌─────────────────────────────────────┐
│ _evaluate_authenticated │
│ INCR rl:user:{uid}:reqs → 241 │
│ count ≥ 240 (auth threshold) │
│ │
│ SET rl:user:{uid}:blocked EX 300 │◄── 5-minute lockout
│ ZADD rl:index:blocked_users │
└──────────────────┬──────────────────┘
│
▼
All requests for 300s → 429 user_blocked
User must wait 5 minutes to do anything
No way to self-recover sooner
```
**After**
```
Authenticated user clicks rapidly: 241 requests in 60s
│
▼
┌─────────────────────────────────────┐
│ _evaluate_authenticated │
│ INCR rl:user:{uid}:reqs → 241 │
│ count ≥ 240 (auth threshold) │
│ │
│ return 429 auth_user_rate │◄── this request only
│ (no SET, no ZADD) │◄── no block key written
└──────────────────┬──────────────────┘
│
Counter TTL = 60s (auth_window)
│
▼
T+60s: rl:user:{uid}:reqs expires
User automatically recovers
No admin intervention needed
```
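The soft limit amounts to a counter with a TTL and no block key. The sketch below models that with an in-memory counter and an explicit clock. It assumes the TTL is set when the counter is first written and is not refreshed by later increments (a fixed window); `WindowCounter` is a hypothetical stand-in for the Redis key `rl:user:{uid}:reqs`.

```python
AUTH_THRESHOLD = 240  # requests per window, from the diagram
AUTH_WINDOW = 60      # seconds; counter TTL (auth_window)


class WindowCounter:
    """In-memory stand-in for a Redis INCR key with a TTL set on first write."""

    def __init__(self) -> None:
        self.count = 0
        self.expires_at = None

    def incr(self, now: int) -> int:
        if self.expires_at is None or now >= self.expires_at:
            # Key is missing or has expired: start a fresh window.
            self.count = 0
            self.expires_at = now + AUTH_WINDOW
        self.count += 1
        return self.count


def evaluate_authenticated(counter: WindowCounter, now: int) -> int:
    """Return an HTTP status for one authenticated request at time `now` (seconds)."""
    count = counter.incr(now)
    if count >= AUTH_THRESHOLD:
        return 429  # auth_user_rate: this request only, no block key written
    return 200
```

Because nothing persists beyond the counter itself, recovery needs no cleanup: once the window expires, the next request starts a new count.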
---
### 6. Enforcement Stack Per Path — Trade-off Summary
```
Path                    nginx zone    Django counter  Block written?  Notes
──────────────────────  ────────────  ──────────────  ──────────────  ──────────────────────────────
/static/*               none          none            —               disk-served, zero Django cost
/painel/<pk>/dados      none          none            —               bypass: high-freq polling
/voto-individual/*      none          none            —               bypass: live vote
/sessao/<pk>/*          none          none            —               bypass: live session
/sessao/pauta-sessao/*  none          none            —               bypass: live session
/media/*                sapl_media    anon IP / auth  anon: yes       auth gate in serve_media()
                        180r/m b=180  counter runs    auth: no
/api/* (anonymous)      sapl_api      quota only      no              ← key change: no IP counter,
                        60r/m b=120   500/day         —               no collateral NAT block
/api/* (authenticated)  sapl_api      per-user 240/m  no (soft)       per-user, NAT-safe
                        60r/m b=120   counter runs
/relatorios/*           sapl_heavy    anon/auth runs  anon: yes       tight rate — PDF generation
                        10r/m b=20    at Django       auth: no
/* (everything else)    sapl_general  anon/auth runs  anon: yes       normal page navigation
                        90r/m b=180   at Django       auth: no        auth gets 240/m soft limit
```
**Legend:**
- `anon: yes` — anonymous IP gets a 300s block key on breach
- `auth: no` — authenticated users get 429 for that request, window resets in 60s, no persistent block
- `none` — no rate limiting at either layer (path is exempt)
---
### 7. The Fundamental NAT Constraint
```
IP-based rate limiting cannot distinguish these two scenarios:
Scenario A — Legitimate (15 users, 1 tab each, vote opens)
┌──────────────────────────────────────────────────────┐
│ User 1 ──► GET /voto-individual/ │
│ User 2 ──► GET /voto-individual/ 15 req/s │
│ ... 1 public IP │
│ User 15 ──► GET /sessao/2600/ordemdia │
└──────────────────────────────────────────────────────┘
Scenario B — Bot (1 process, 15 threads, scraping)
┌──────────────────────────────────────────────────────┐
│ Thread 1 ──► GET /materia/1/ │
│ Thread 2 ──► GET /materia/2/ 15 req/s │
│ ... 1 public IP │
│ Thread 15 ──► GET /materia/15/ │
└──────────────────────────────────────────────────────┘
To nginx and an IP-based counter: identical.
Resolution strategies applied:
┌──────────────────────────────────────────────────────────────────┐
│ Known safe high-freq paths → nginx bypass + Django bypass │
│ Authenticated users → per-user counter (uid), NAT-safe │
│ Anonymous /api/ → quota only, no IP counter │
│ Everything else (anon) → IP counter + 300s block on breach │
└──────────────────────────────────────────────────────────────────┘
Long-term: APP_ACCESS_KEYs per tenant → quota per org, not per IP
WebSocket push for voting → eliminates polling bursts
```
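The resolution strategies reduce to a choice of counter key per request. The sketch below is one way to express that choice, assuming the bypass patterns from the earlier diagrams (the `/sessao/pauta-sessao/` pattern is inferred from the path table); the function name and strategy labels are illustrative, not taken from the codebase.

```python
import re
from typing import Optional, Tuple

# Assumed bypass patterns, mirroring the nginx regex locations.
BYPASS = re.compile(
    r"^/(voto-individual/|sessao/\d+|sessao/pauta-sessao/|painel/\d+/dados)"
)


def rate_limit_key(path: str, user_id: Optional[int], ip: str) -> Tuple[str, Optional[str]]:
    """Return (strategy, counter key) for one request; key is None when no counter runs."""
    if BYPASS.match(path):
        return ("bypass", None)                        # known safe high-frequency path
    if user_id is not None:
        return ("per_user", f"rl:user:{user_id}:reqs")  # NAT-safe: keyed by uid
    if path.startswith("/api/"):
        return ("quota_only", None)                    # daily quota, no IP counter
    return ("ip_counter", f"rl:ip:{ip}:reqs")          # 300s block on breach


if __name__ == "__main__":
    nat_ip = "200.175.17.66"
    # Fifteen authenticated users behind one IP get fifteen separate counters.
    user_keys = {rate_limit_key("/materia/1/", uid, nat_ip)[1] for uid in range(1, 16)}
    # Fifteen anonymous bot threads behind one IP share a single counter.
    bot_keys = {rate_limit_key(f"/materia/{n}/", None, nat_ip)[1] for n in range(1, 16)}
    print(len(user_keys), len(bot_keys))
```

The two scenarios become distinguishable only once a request carries an identity other than its IP, which is why the anonymous case still falls back to the IP counter.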
---
## Pending Investigations
The following incidents may or may not share the same root cause as the PatoBranco-PR event. Each should be investigated using the OpenSearch query patterns established above and documented here.