diff --git a/plan/RATE-LIMITER-PLAN.md b/plan/RATE-LIMITER-PLAN.md
index e7847c6f1..c6b975fe9 100644
--- a/plan/RATE-LIMITER-PLAN.md
+++ b/plan/RATE-LIMITER-PLAN.md
@@ -135,6 +135,11 @@ graph TD
 | `sendfile off` → `on` | **Bug** — flip to `on` | No valid production reason found; disabling userspace copy is always correct |
 | `/media/` serving | **X-Accel-Redirect** | Routes all `/media/` through Gunicorn so Django middleware runs; nginx serves bytes via internal location |
 | Cache backend switch | **At pod startup** via `start.sh` + waffle switch | Pod restart is acceptable; avoids per-request runtime overhead |
+| nginx zone splitting (2026-05-07) | **4 zones**: general / media / api / heavy | `/media/` and `/api/` requests were draining the same bucket as HTML page loads, causing false 429s on heavy pages |
+| Session/voting nginx bypass (2026-05-06) | **No `limit_req`** on `/voto-individual/` and `/sessao//` | Multiple councilmembers behind a NAT IP exhausted the nginx burst during live votes (PatoBranco-PR incident) |
+| Auth rate breach: no persistent block (2026-05-07) | **429 per-request only**, window resets after 60 s | A 300 s lockout is the wrong penalty for a logged-in user who clicked too fast; persistent block is appropriate for anonymous/bot traffic only |
+| Raise rate thresholds (2026-05-07) | anon 35→120/m · auth 120→240/m · 404 threshold 10→20 | SAPL pages fire 12–45 parallel requests; old thresholds blocked normal navigation for users in offices with multiple open tabs |
+| API quota increase (2026-05-07) | anon 50→500/day · auth 1 000→5 000/day | Previous anon quota of 50/day was exhausted by a developer testing the API before lunch |

 ---

@@ -224,8 +229,8 @@ The script stops and prints a summary as soon as a 429 is received.
 ### Examples

 ```bash
-# Hit the anonymous threshold (35 req/min) — fire 40 requests with minimal delay
-python scripts/test_ratelimiter.py http://localhost -n 40 -d 0.05
+# Hit the anonymous threshold (120 req/min) — fire 130 requests with minimal delay
+python scripts/test_ratelimiter.py http://localhost -n 130 -d 0.05

 # Slower fire — check that legitimate traffic is not rate-limited
 python scripts/test_ratelimiter.py http://localhost -n 20 -d 2
@@ -250,7 +255,7 @@ Summary:
   First 429 occurred at request: 36
 ```

-A first-429 near the configured anonymous threshold (35 req/min) confirms the
+A first-429 near the configured anonymous threshold (120 req/min) confirms the
 middleware is wired correctly. A first-429 much earlier points to nginx
 `limit_req` firing before Django sees the request.

@@ -598,10 +603,15 @@ separately.
 Defined in `docker/config/nginx/nginx.conf` (zones) and `sapl.conf` (burst).

 ```
-sapl_general rate=30r/m # 1 token every 2 s
-sapl_heavy rate=10r/m # 1 token every 6 s (PDF/report endpoints)
+sapl_general rate=90r/m # 1 token every 0.67 s (HTML page requests)
+sapl_media rate=180r/m # 1 token every 0.33 s (/media/ — own bucket)
+sapl_api rate=60r/m # 1 token every 1 s (/api/ — own bucket)
+sapl_heavy rate=10r/m # 1 token every 6 s (PDF/report endpoints)
 ```

+Each path has its own zone so media downloads and API calls cannot exhaust
+the page-load bucket for a user navigating normally.
+
 `burst=N nodelay` means nginx accepts up to N requests instantly above the
 current token level, then enforces the drip rate. Requests beyond the burst
 cap return 429 before reaching Gunicorn — **zero Python cost**.
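The token intervals quoted in the zone comments and the `burst=N nodelay` behaviour described in the hunk above can be sanity-checked with a few lines of Python. This is a rough sketch under stated assumptions, not nginx's actual implementation: it models `limit_req` as an "excess" counter that drains at the zone rate and rejects a request once the excess would exceed `burst`. The names `token_interval_seconds` and `simulate_limit_req` are hypothetical and do not appear in the repository.

```python
def token_interval_seconds(rate_per_minute: float) -> float:
    """Seconds between refills for a zone declared as rate=<N>r/m."""
    return 60.0 / rate_per_minute


def simulate_limit_req(arrivals, rate_per_minute, burst):
    """Simplified model of `limit_req ... burst=N nodelay` for one client key.

    `arrivals` is a sorted list of request timestamps in seconds.
    Returns one bool per request: True = passed on, False = rejected.
    """
    rate_per_second = rate_per_minute / 60.0
    excess = None   # requests currently "above" the drip rate; None = key unseen
    last = None     # timestamp of the last accepted request
    decisions = []
    for t in arrivals:
        if excess is None:
            candidate = 0.0  # first request for this key always passes
        else:
            # Drain by the elapsed time, then add this request.
            candidate = max(0.0, excess - rate_per_second * (t - last) + 1.0)
        if candidate > burst:
            decisions.append(False)  # rejected; state is left unchanged
        else:
            decisions.append(True)
            excess, last = candidate, t
    return decisions


if __name__ == "__main__":
    zones = [("sapl_general", 90), ("sapl_media", 180),
             ("sapl_api", 60), ("sapl_heavy", 10)]
    for zone, rate in zones:
        print(f"{zone}: 1 token every {token_interval_seconds(rate):.2f} s")

    # 200 requests arriving in the same instant against the general zone.
    flood = [0.0] * 200
    accepted = sum(simulate_limit_req(flood, rate_per_minute=90, burst=180))
    print(f"accepted {accepted} of {len(flood)} simultaneous requests")
```

Under this model a simultaneous flood gets `burst + 1` requests through (the first request plus the burst allowance), which is why a shared NAT address can exhaust a zone quickly when many clients reload at once.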
@@ -610,13 +620,17 @@ Burst values are set at container startup via env vars:

 | Env var | Default | Location |
 |---------|---------|----------|
-| `NGINX_BURST_GENERAL` | `60` | `location /`, `location /media/` |
-| `NGINX_BURST_API` | `60` | `location /api/` |
-| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` |
+| `NGINX_BURST_GENERAL` | `180` | `location /` |
+| `NGINX_BURST_MEDIA` | `180` | `location /media/` |
+| `NGINX_BURST_API` | `120` | `location /api/` |
+| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` (nodelay kept) |

 Defaults are 2× the zone's per-minute rate, so a user can spend a full
 minute's quota in a single burst before the leaky bucket takes over.

+**Session and voting paths are fully exempt from `limit_req`** — they have
+dedicated location blocks with no rate zone. See §Session/voting bypass below.
+
 ### Layer 2 — Django `RateLimitMiddleware` (sliding window)

 Defined in `sapl/middleware/ratelimit.py`, backed by Redis DB 1.
@@ -626,17 +640,21 @@ Requests that pass nginx reach Python. The middleware counts them in a

 | Env var | Default | Scope |
 |---------|---------|-------|
-| `RATE_LIMITER_RATE` | `35/m` | Anonymous IP |
-| `RATE_LIMITER_RATE_AUTHENTICATED` | `120/m` | Authenticated user |
+| `RATE_LIMITER_RATE` | `120/m` | Anonymous IP |
+| `RATE_LIMITER_RATE_AUTHENTICATED` | `240/m` | Authenticated user (keyed by user pk — NAT-safe) |
 | `RATE_LIMITER_RATE_BOT` | `5/m` | *(reserved — bots are currently blocked outright, not counted)* |
 | `RATE_LIMITER_UA_BLOCKLIST_REFRESH` | `60` s | How often each worker re-fetches `rl:bot:ua:blocked` from Redis |

-When the window count hits the threshold the IP/user block key is written
-atomically (Lua: `SET key 1 EX 300` + `ZADD index score key`) with a 300 s TTL
-and subsequent requests return 429 with `Retry-After: 300` — without touching
-the database. The ZADD records the full key name in `rl:index:blocked_ips` or
-`rl:index:blocked_users` with score = expiry unix timestamp, enabling O(log N)
-enumeration of all active blocks without a `SCAN`.
+**Anonymous breach** — when the window count hits the threshold the IP block key
+is written atomically (Lua: `SET key 1 EX 300` + `ZADD index score key`) with a
+300 s TTL. Subsequent requests from that IP return 429 without touching the
+database.
+
+**Authenticated breach** — returns 429 for the over-limit request only; **no
+persistent block key is written**. The counter expires after 60 s (the window
+TTL) and the user can proceed again automatically. A 300 s lockout is the wrong
+penalty for a logged-in user who clicked too fast; that severity is reserved for
+anonymous/bot traffic.

 Decision flow inside `RateLimitMiddleware.__call__()` / `_evaluate()`:

@@ -653,7 +671,7 @@ Decision flow inside `RateLimitMiddleware.__call__()` / `_evaluate()`:
 3. Authenticated user?
    3a. User in rl:{ns}:user:{uid}:blocked? → 429 reason=user_blocked
    3b. Suspicious headers (no Accept/AL)? → 429 reason=suspicious_headers_auth
+   3c. User request count ≥ auth threshold? → 429 (no block key) reason=auth_user_rate
 4. Anonymous:
    4a. Suspicious headers? → 429 reason=suspicious_headers
    4b. IP request count ≥ anon threshold? → SET blocked, 429 reason=ip_rate
@@ -698,9 +716,8 @@ flowchart TD
         C3B{"Suspicious headers?\nno Accept-Language + no Accept"}
         C3B -- yes --> R_SH([429\nsuspicious_headers_auth])
         C3B -- no --> C3C
-        C3C{"User rate ≥ 120/min?"}
-        C3C -- yes --> SUB["SET rl:ns:user:UID:blocked TTL 300 s"]
-        SUB --> R_AUR([429\nauth_user_rate])
+        C3C{"User rate ≥ 240/min?"}
+        C3C -- "yes — no block key written;\nwindow resets after 60 s" --> R_AUR([429\nauth_user_rate])
         C3C -- no --> PASS_A([✓ pass])
     end

@@ -708,11 +725,11 @@ flowchart TD
         C4A{"Suspicious headers?\nno Accept-Language + no Accept"}
         C4A -- yes --> R_ASH([429\nsuspicious_headers])
         C4A -- no --> C4B
-        C4B{"IP rate ≥ 35/min?"}
+        C4B{"IP rate ≥ 120/min?"}
         C4B -- yes --> SIPR["SET rl:ip:IP:blocked TTL 300 s"]
         SIPR --> R_IPR([429\nip_rate])
         C4B -- no --> C4C
-        C4C{"NS/IP window hit\n≥ 35 in bucket?"}
+        C4C{"NS/IP window hit\n≥ 120 in bucket?"}
         C4C -- yes --> SUAR["SET rl:ip:IP:blocked TTL 300 s"]
         SUAR --> R_UAR([429\nua_rotation])
         C4C -- no --> PASS_N([✓ pass])
@@ -726,14 +743,14 @@ Roll out to canary pods first; promote check-by-check in order of false-positive
 | Order | Check | Reason | Risk | Condition to promote |
 |-------|-------|--------|------|---------------------|
 | nginx | scanner extensions | `return 444` in `sapl.conf` for `.php`/`.asp`/etc. | Zero | Gunicorn never sees these requests |
-| 0th | `quota_daily` / `quota_weekly` | Per-consumer daily/weekly cap on `/api/` paths | Low | Limits set well above per-minute rate (200/day anon, 1000/day auth) |
+| 0th | `quota_daily` / `quota_weekly` | Per-consumer daily/weekly cap on `/api/` paths | Low | Limits set well above per-minute rate (500/day anon, 5 000/day auth) |
 | 1st | `known_ua` | Substring in hardcoded `BOT_UA_FRAGMENTS` list | Zero | UA strings are deterministic |
 | 2nd | `redis_ua` | Token hash in `rl:bot:ua:blocked` SET | Zero | Keys only set manually by operators |
 | 3rd | `ip_blocked` | Marker set by prior proven-bad requests | Zero | Fast-path only, no new blocks created |
-| 4th | `ip_rate` | Rolling IP counter ≥ 35/min | Low | Threshold calibrated from canary logs |
+| 4th | `ip_rate` | Rolling IP counter ≥ 120/min | Low | Threshold calibrated from canary logs |
 | 5th | `suspicious_headers` | No Accept-Language **and** no Accept | Medium | Confirmed no legitimate clients omit both headers |
-| 6th | `ua_rotation` (ns/window) | NS/IP clock-aligned bucket ≥ 35 | Medium | NAT IP allowlist in place (see Open Questions) |
-| 7th | `404_scan` | Anonymous IP accumulates ≥ 10 404s/min | Low | Catches path probes without known extensions |
+| 6th | `ua_rotation` (ns/window) | NS/IP clock-aligned bucket ≥ 120 | Medium | NAT IP allowlist in place (see Open Questions) |
+| 7th | `404_scan` | Anonymous IP accumulates ≥ 20 404s/min | Low | Catches path probes without known extensions |

 ### Decorator migration

@@ -757,12 +774,73 @@ For views where `django-ratelimit` decorators already exist:
 | **Tuned by** | Capacity of the server | Acceptable request volume per client |
 | **Failure mode** | Workers overwhelmed | Legitimate user over-browsing |

-A user loading a page quickly may fire 5–10 Django requests in two seconds.
-With `rate=30r/m` (1 token/2 s) and `burst=60` they absorb that fine; the
-leaky bucket refills before they click the next link. The Django threshold
-(35/m sliding window) catches sustained automated traffic from a single IP
-that looks like scraping even if it arrives slowly enough to beat the nginx
-burst cap.
+A SAPL page fires 12–45 parallel requests — most are `/static/` served
+directly by nginx (zero Django cost), but 5–15 may reach Gunicorn.
+With `rate=90r/m` and `burst=180` a user can load several heavy pages back-to-back
+before the leaky bucket takes over. The Django threshold (120/m fixed window
+for anonymous, 240/m for authenticated) catches sustained automated traffic that
+arrives slowly enough to pass the nginx burst cap.
+
+Note: nginx rates are hardcoded in `nginx.conf` (rebuild to change); burst values
+are env-var configurable at container start via `start.sh` defaults.
+
+---
+
+## Session/voting bypass (2026-05-06)
+
+### Problem
+
+Multiple councilmembers behind a shared NAT IP were receiving 429 errors during
+live plenary votes. Root cause: nginx's `limit_req` fires before any request
+reaches Django, so Django's per-user counters (which are NAT-safe) were never
+consulted. When a vote opened, 15+ users simultaneously reloaded their voting
+pages, exhausting the shared IP's nginx burst bucket.
+
+The `voto_individual.html` template contains `setTimeout(location.reload, 30000)`
+— the page reloads itself every 30 seconds. When councilmembers open the page at
+roughly the same time (vote announcement), their reload timers align and all fire
+in the same second.
+
+See `docs/rate-limiter-incidents.md` — PatoBranco-PR 2026-05-06 for full analysis.
+
+### Fix
+
+Dedicated nginx `location` blocks with **no `limit_req`** for session and voting
+paths. These regex locations take priority over `location /` by nginx matching
+rules. Mirrored in `RATE_LIMIT_BYPASS_PATHS` so Django's middleware also skips
+counting (defense-in-depth).
+
+```nginx
+# sapl.conf — no rate limiting on session/voting paths
+location ~ ^/painel/\d+/dados$ { proxy_pass http://sapl_server; }
+location ~ ^/voto-individual/ { proxy_pass http://sapl_server; }
+location ~ ^/sessao/\d+ { proxy_pass http://sapl_server; }
+```
+
+```python
+# settings.py
+RATE_LIMIT_BYPASS_PATHS = [
+    r'^/painel/\d+/dados$',
+    r'^/voto-individual/',
+    r'^/sessao/\d+',
+    r'^/sessao/pauta-sessao/\d+/',
+]
+```
+
+### Why these paths are safe to exempt
+
+- All meaningful actions require an authenticated session cookie.
+- Django's per-user counter (240/m, keyed by user pk) still applies as a backstop.
+- The real abuse vectors (scrapers, credential stuffing) target different URL patterns.
+- The cost of a false-positive block (councilmember unable to vote) far outweighs
+  the risk of abuse on these paths.
+
+### Long-term fix
+
+Replace `setTimeout(location.reload, 30000)` in `voto_individual.html` with
+server-push (WebSocket or SSE). Removes the synchronisation mechanism entirely —
+the thundering herd cannot occur if the server pushes vote-open events instead of
+clients polling by reloading.

 ---

@@ -824,8 +902,9 @@ location /static/ {
 }

 # Proxied to Gunicorn — Django middleware + serve_media() run here.
+# Own zone so media downloads don't drain the general page-load bucket.
 location /media/ {
-    limit_req zone=sapl_general burst=${NGINX_BURST_GENERAL} nodelay;
+    limit_req zone=sapl_media burst=${NGINX_BURST_MEDIA} nodelay;
     proxy_pass http://sapl_server;
 }

@@ -888,21 +967,21 @@ Redis PDF caching would solve "high request volume reaching the file layer" —
 | 0 | Page / view cache | `cache:{ns}:*` | 300 s (default) | — | `CACHES['default']` KEY_PREFIX |
 | 0 | Static file cache (logos) | `static:{ns}:{sha256}` | 3 – 24 h | — | *Future* (requires OpenResty/Lua) |
 | 0 | File content cache (≤ 360 KB) | `file:{ns}:{sha256}` | 1 h | — | *Future* |
-| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 35 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
-| 1 | IP 404 counter | `rl:ip:{ip}:404s` | 60 s | 10 (`RATE_LIMIT_404_THRESHOLD`) | `RL_IP_404S` |
+| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
+| 1 | IP 404 counter | `rl:ip:{ip}:404s` | 60 s | 20 (`RATE_LIMIT_404_THRESHOLD`) | `RL_IP_404S` |
 | 1 | IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | — | `RL_IP_BLOCKED` |
 | 1 | Blocked-IP ZSET index | `rl:index:blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_BLOCKED_IPS` |
-| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
-| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — | `RL_USER_BLOCKED` |
-| 1 | Blocked-user ZSET index | `rl:index:blocked_users` | permanent ZSET, score=expiry ts | — | `RL_INDEX_BLOCKED_USERS` |
-| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 35 (`RATE_LIMITER_RATE`) | `RL_NS_WINDOW` |
+| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 240 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
+| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — *(not written on rate breach; window resets naturally)* | `RL_USER_BLOCKED` |
+| 1 | Blocked-user ZSET index | `rl:index:blocked_users` | permanent ZSET, score=expiry ts | — *(not written on rate breach)* | `RL_INDEX_BLOCKED_USERS` |
+| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 120 (`RATE_LIMITER_RATE`) | `RL_NS_WINDOW` |
 | 1 | Path counter (`/media/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — (observability only) | `RL_PATH_REQUESTS` |
 | 1 | Path counter (`/static/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — | *Future* (requires OpenResty/Lua) |
 | 1 | UA deny list | `rl:bot:ua:blocked` | permanent SET | — (block on match) | `RL_UA_BLOCKLIST` |
-| 1 | API daily quota (anon) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 50 (`API_QUOTA_ANON_DAILY`) | `QUOTA_IP_DAILY` |
-| 1 | API weekly quota (anon) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 350 (`API_QUOTA_ANON_WEEKLY`) | `QUOTA_IP_WEEKLY` |
-| 1 | API daily quota (auth) | `quota:{ns}:daily:{date}:user:{uid}` | 24 h | 1000 (`API_QUOTA_AUTH_DAILY`) | `QUOTA_USER_DAILY` |
-| 1 | API weekly quota (auth) | `quota:{ns}:weekly:{week}:user:{uid}` | 7 d | 7000 (`API_QUOTA_AUTH_WEEKLY`) | `QUOTA_USER_WEEKLY` |
+| 1 | API daily quota (anon) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 500 (`API_QUOTA_ANON_DAILY`) | `QUOTA_IP_DAILY` |
+| 1 | API weekly quota (anon) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 3 500 (`API_QUOTA_ANON_WEEKLY`) | `QUOTA_IP_WEEKLY` |
+| 1 | API daily quota (auth) | `quota:{ns}:daily:{date}:user:{uid}` | 24 h | 5 000 (`API_QUOTA_AUTH_DAILY`) | `QUOTA_USER_DAILY` |
+| 1 | API weekly quota (auth) | `quota:{ns}:weekly:{week}:user:{uid}` | 7 d | 35 000 (`API_QUOTA_AUTH_WEEKLY`) | `QUOTA_USER_WEEKLY` |
 | 2 | Django Channels | `channels:*` | session TTL | — | *Future* |

 ### What each counter catches — and misses
@@ -971,7 +1050,7 @@ the requests come from many different IPs (e.g., distributed proxy pool), all
 requests share the same `uid` and accumulate in one counter.

 Misses: a credential that is shared across multiple legitimate users in the same
-office; all their activity adds up to one counter and can trip the 120/min
+office; all their activity adds up to one counter and can trip the 240/min
 threshold during a busy session.

 ---
@@ -982,14 +1061,16 @@ Written when `rl:{ns}:user:{uid}:reqs` hits the authenticated threshold (step 3c
 Checked at step 3a — before counting — so a blocked user never increments their
 counter on subsequent requests during the 300 s cooldown.

-Catches: credential-stuffing or runaway automation using a valid session — once the
-120/min threshold is hit, the account is locked out immediately for 300 s. Unlike
-the IP marker, the block is namespace-scoped, so the same user account can be
-blocked on one SAPL but still active on another.
+Previously caught: credential-stuffing or runaway automation using a valid session —
+once the 240/min threshold was hit the account was locked out for 300 s.

-Misses: same fixed-TTL weakness as the IP marker — a persistent attacker resumes
-after 300 s. An account shared by multiple legitimate users (e.g., a departmental
-login) can be locked out during peak collaborative use.
+**Changed (2026-05-07):** `_set_block` is no longer called on authenticated rate
+breach. The 429 is returned for the over-limit request; the counter expires after
+60 s and the user proceeds automatically. The `rl:{ns}:user:{uid}:blocked` marker
+and `rl:index:blocked_users` ZSET are therefore **not written on rate breach** —
+only legacy entries from before this change may exist. A 300 s lockout is wrong
+for a logged-in user who clicked too fast; that penalty is reserved for
+anonymous/bot traffic.

 ---
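
The difference this diff introduces between an anonymous breach (persistent 300 s block) and an authenticated breach (429 for that request only, counter expires with the 60 s window) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `sapl/middleware/ratelimit.py`: `InMemoryStore` is a hypothetical stand-in for Redis, the key names are simplified, the clock is passed in explicitly, and the UA, header, quota and ZSET-index steps are omitted.

```python
WINDOW_S = 60       # counter TTL (window length)
BLOCK_TTL_S = 300   # persistent block TTL, anonymous traffic only
ANON_LIMIT = 120    # mirrors RATE_LIMITER_RATE = 120/m
AUTH_LIMIT = 240    # mirrors RATE_LIMITER_RATE_AUTHENTICATED = 240/m


class InMemoryStore:
    """Hypothetical stand-in for Redis: values with an expiry timestamp."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            self._data.pop(key, None)  # lazily drop expired keys
            return None
        return entry

    def incr(self, key, ttl, now):
        """Increment a counter; the TTL is set only when the key is created."""
        entry = self._live(key, now)
        if entry is None:
            self._data[key] = (1, now + ttl)
            return 1
        self._data[key] = (entry[0] + 1, entry[1])
        return entry[0] + 1

    def set_flag(self, key, ttl, now):
        self._data[key] = (1, now + ttl)

    def exists(self, key, now):
        return self._live(key, now) is not None


def evaluate(store, now, ip, user_id=None):
    """Return (status, reason) for one request at time `now` (seconds)."""
    if user_id is not None:
        count = store.incr(f"user:{user_id}:reqs", WINDOW_S, now)
        if count >= AUTH_LIMIT:
            # Authenticated breach: reject this request, write no block key.
            return 429, "auth_user_rate"
        return 200, "pass"

    if store.exists(f"ip:{ip}:blocked", now):
        return 429, "ip_blocked"
    count = store.incr(f"ip:{ip}:reqs", WINDOW_S, now)
    if count >= ANON_LIMIT:
        # Anonymous breach: persistent block for BLOCK_TTL_S seconds.
        store.set_flag(f"ip:{ip}:blocked", BLOCK_TTL_S, now)
        return 429, "ip_rate"
    return 200, "pass"
```

In this sketch an authenticated user who trips the limit can proceed again once the counter key expires, while an anonymous IP keeps receiving `ip_blocked` until the block key's TTL runs out, even though its request counter has already reset.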