
Update RATE-LIMITER-PLAN.md with 2026-05-06/07 changes

- Decision log: zone split, session bypass, soft auth rate limiting,
  threshold increases, API quota increases
- nginx Layer 1: 4 zones replacing 2; updated burst table
- Django Layer 2: new thresholds, auth breach no longer writes block key
- Decision flow and mermaid diagram: updated thresholds and auth path
- Key schema: updated all thresholds; user blocked marker noted as dead
  for rate breach (window resets naturally)
- Session/voting bypass: new section with root cause, fix, and rationale
- Enforcement order, test script examples: updated thresholds throughout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rate-limiter-2026
Edward Ribeiro 3 weeks ago
parent
commit
405ba55d32
  1. 181
      plan/RATE-LIMITER-PLAN.md


@ -135,6 +135,11 @@ graph TD
| `sendfile off` → `on` | **Bug**: flip to `on` | No valid production reason found for `off`; avoiding the userspace copy is always correct |
| `/media/` serving | **X-Accel-Redirect** | Routes all `/media/` through Gunicorn so Django middleware runs; nginx serves bytes via internal location |
| Cache backend switch | **At pod startup** via `start.sh` + waffle switch | Pod restart is acceptable; avoids per-request runtime overhead |
| nginx zone splitting (2026-05-07) | **4 zones**: general / media / api / heavy | `/media/` and `/api/` requests were draining the same bucket as HTML page loads, causing false 429s on heavy pages |
| Session/voting nginx bypass (2026-05-06) | **No `limit_req`** on `/voto-individual/` and `/sessao/<pk>/` | Multiple councilmembers behind a NAT IP exhausted the nginx burst during live votes (PatoBranco-PR incident) |
| Auth rate breach: no persistent block (2026-05-07) | **429 per-request only**, window resets after 60 s | A 300 s lockout is the wrong penalty for a logged-in user who clicked too fast; persistent block is appropriate for anonymous/bot traffic only |
| Raise rate thresholds (2026-05-07) | anon 35→120/m · auth 120→240/m · 404 threshold 10→20 | SAPL pages fire 12–45 parallel requests; old thresholds blocked normal navigation for users in offices with multiple open tabs |
| API quota increase (2026-05-07) | anon 50→500/day · auth 1 000→5 000/day | Previous anon quota of 50/day was exhausted by a developer testing the API before lunch |
---
@ -224,8 +229,8 @@ The script stops and prints a summary as soon as a 429 is received.
### Examples
```bash
# Hit the anonymous threshold (35 req/min) — fire 40 requests with minimal delay
python scripts/test_ratelimiter.py http://localhost -n 40 -d 0.05
# Hit the anonymous threshold (120 req/min) — fire 130 requests with minimal delay
python scripts/test_ratelimiter.py http://localhost -n 130 -d 0.05
# Slower fire — check that legitimate traffic is not rate-limited
python scripts/test_ratelimiter.py http://localhost -n 20 -d 2
@ -250,7 +255,7 @@ Summary:
First 429 occurred at request: 36
```
A first-429 near the configured anonymous threshold (35 req/min) confirms the
A first-429 near the configured anonymous threshold (120 req/min) confirms the
middleware is wired correctly. A first-429 much earlier points to nginx `limit_req`
firing before Django sees the request.
@ -598,10 +603,15 @@ separately.
Defined in `docker/config/nginx/nginx.conf` (zones) and `sapl.conf` (burst).
```
sapl_general rate=30r/m # 1 token every 2 s
sapl_general rate=90r/m # 1 token every 0.67 s (HTML page requests)
sapl_media rate=180r/m # 1 token every 0.33 s (/media/ — own bucket)
sapl_api rate=60r/m # 1 token every 1 s (/api/ — own bucket)
sapl_heavy rate=10r/m # 1 token every 6 s (PDF/report endpoints)
```
Each path has its own zone so media downloads and API calls cannot exhaust
the page-load bucket for a user navigating normally.
`burst=N nodelay` means nginx accepts up to N requests instantly above the
current token level, then enforces the drip rate. Requests beyond the burst
cap return 429 before reaching Gunicorn — **zero Python cost**.
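The burst behaviour described above can be approximated with a small leaky-bucket model. This is a simplified sketch of how `limit_req` with `burst=N nodelay` behaves, not nginx's actual implementation; the class name `LeakyBucket` and its fields are hypothetical, and the values used are the `sapl_general` defaults from this plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Simplified model of nginx limit_req with `burst=N nodelay`."""
    rate_per_min: float            # zone rate, e.g. 90 for sapl_general
    burst: int                     # burst value, e.g. NGINX_BURST_GENERAL
    excess: float = 0.0            # requests currently above the drip rate
    last: Optional[float] = None   # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if the request passes, False if it would get a 429."""
        if self.last is None:
            # First request from this client: nothing accumulated yet.
            self.last = now
            return True
        drained = (now - self.last) * self.rate_per_min / 60.0
        excess = max(0.0, self.excess - drained) + 1.0
        if excess > self.burst:
            return False  # rejected requests do not change the bucket state
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate_per_min=90, burst=180)
accepted = sum(bucket.allow(0.0) for _ in range(300))
print(f"accepted instantly: {accepted} of 300")
```

In this model an idle client gets the whole burst through at once, after which further requests pass only as fast as the zone rate drains the bucket.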
@ -610,13 +620,17 @@ Burst values are set at container startup via env vars:
| Env var | Default | Location |
|---------|---------|----------|
| `NGINX_BURST_GENERAL` | `60` | `location /`, `location /media/` |
| `NGINX_BURST_API` | `60` | `location /api/` |
| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` |
| `NGINX_BURST_GENERAL` | `180` | `location /` |
| `NGINX_BURST_MEDIA` | `180` | `location /media/` |
| `NGINX_BURST_API` | `120` | `location /api/` |
| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` (nodelay kept) |
Defaults are 2× the zone's per-minute rate (`sapl_media` is the exception at 1×),
so a user can spend up to two minutes' worth of tokens in a single burst before
the leaky bucket takes over.
**Session and voting paths are fully exempt from `limit_req`** — they have
dedicated location blocks with no rate zone. See §Session/voting bypass below.
### Layer 2 — Django `RateLimitMiddleware` (sliding window)
Defined in `sapl/middleware/ratelimit.py`, backed by Redis DB 1.
@ -626,17 +640,21 @@ Requests that pass nginx reach Python. The middleware counts them in a
| Env var | Default | Scope |
|---------|---------|-------|
| `RATE_LIMITER_RATE` | `35/m` | Anonymous IP |
| `RATE_LIMITER_RATE_AUTHENTICATED` | `120/m` | Authenticated user |
| `RATE_LIMITER_RATE` | `120/m` | Anonymous IP |
| `RATE_LIMITER_RATE_AUTHENTICATED` | `240/m` | Authenticated user (keyed by user pk — NAT-safe) |
| `RATE_LIMITER_RATE_BOT` | `5/m` | *(reserved — bots are currently blocked outright, not counted)* |
| `RATE_LIMITER_UA_BLOCKLIST_REFRESH` | `60` s | How often each worker re-fetches `rl:bot:ua:blocked` from Redis |
When the window count hits the threshold the IP/user block key is written
atomically (Lua: `SET key 1 EX 300` + `ZADD index score key`) with a 300 s TTL
and subsequent requests return 429 with `Retry-After: 300` — without touching
the database. The ZADD records the full key name in `rl:index:blocked_ips` or
`rl:index:blocked_users` with score = expiry unix timestamp, enabling O(log N)
enumeration of all active blocks without a `SCAN`.
**Anonymous breach** — when the window count hits the threshold the IP block key
is written atomically (Lua: `SET key 1 EX 300` + `ZADD index score key`) with a
300 s TTL. Subsequent requests from that IP return 429 without touching the
database.
**Authenticated breach** — returns 429 for the over-limit request only; **no
persistent block key is written**. The counter expires after 60 s (the window
TTL) and the user can proceed again automatically. A 300 s lockout is the wrong
penalty for a logged-in user who clicked too fast; that severity is reserved for
anonymous/bot traffic.
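The difference between the two breach paths can be sketched as follows. This is an illustration under stated assumptions, not the code in `sapl/middleware/ratelimit.py`: `FakeStore` is a hypothetical in-memory stand-in for Redis, `evaluate` is a hypothetical helper, the ZSET index is omitted, and the comparison is written so that the first request *after* the threshold is the one rejected (matching the test-script output where the first 429 arrives one request past the limit).

```python
from typing import Dict, Optional, Tuple

ANON_LIMIT = 120    # RATE_LIMITER_RATE default, per minute
AUTH_LIMIT = 240    # RATE_LIMITER_RATE_AUTHENTICATED default, per minute
WINDOW_S = 60       # counter TTL
BLOCK_TTL_S = 300   # anonymous block TTL


class FakeStore:
    """Hypothetical in-memory stand-in for Redis keys with expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, float]] = {}

    def _live(self, key: str, now: float) -> Optional[Tuple[int, float]]:
        entry = self._data.get(key)
        if entry is not None and entry[1] <= now:
            del self._data[key]  # expired
            return None
        return entry

    def incr(self, key: str, ttl: int, now: float) -> int:
        entry = self._live(key, now)
        count, expires = (entry[0] + 1, entry[1]) if entry else (1, now + ttl)
        self._data[key] = (count, expires)
        return count

    def set_flag(self, key: str, ttl: int, now: float) -> None:
        self._data[key] = (1, now + ttl)

    def exists(self, key: str, now: float) -> bool:
        return self._live(key, now) is not None


def evaluate(store: FakeStore, now: float, ip: str,
             ns: str = "sapl", uid: Optional[int] = None) -> Tuple[int, str]:
    """Return (status, reason) for one request."""
    if uid is not None:
        # Authenticated: 429 for the over-limit request only, no block key.
        if store.incr(f"rl:{ns}:user:{uid}:reqs", WINDOW_S, now) > AUTH_LIMIT:
            return 429, "auth_user_rate"
        return 200, "pass"
    blocked_key = f"rl:ip:{ip}:blocked"
    if store.exists(blocked_key, now):
        return 429, "ip_blocked"
    if store.incr(f"rl:ip:{ip}:reqs", WINDOW_S, now) > ANON_LIMIT:
        # Anonymous: breach writes a persistent block for 300 s.
        store.set_flag(blocked_key, BLOCK_TTL_S, now)
        return 429, "ip_rate"
    return 200, "pass"
```

With this sketch an authenticated user recovers as soon as the 60 s counter expires, while an anonymous IP stays blocked until the 300 s marker expires.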
Decision flow inside `RateLimitMiddleware.__call__()` / `_evaluate()`:
@ -653,7 +671,7 @@ Decision flow inside `RateLimitMiddleware.__call__()` / `_evaluate()`:
3. Authenticated user?
3a. User in rl:{ns}:user:{uid}:blocked? → 429 reason=user_blocked
3b. Suspicious headers (no Accept/AL)? → 429 reason=suspicious_headers_auth
3c. User request count ≥ auth threshold? → SET blocked, 429 reason=auth_user_rate
3c. User request count ≥ auth threshold? → 429 (no block key) reason=auth_user_rate
4. Anonymous:
4a. Suspicious headers? → 429 reason=suspicious_headers
4b. IP request count ≥ anon threshold? → SET blocked, 429 reason=ip_rate
@ -698,9 +716,8 @@ flowchart TD
C3B{"Suspicious headers?\nno Accept-Language + no Accept"}
C3B -- yes --> R_SH([429\nsuspicious_headers_auth])
C3B -- no --> C3C
C3C{"User rate ≥ 120/min?"}
C3C -- yes --> SUB["SET rl:ns:user:UID:blocked TTL 300 s"]
SUB --> R_AUR([429\nauth_user_rate])
C3C{"User rate ≥ 240/min?"}
C3C -- "yes — no block key written;\nwindow resets after 60 s" --> R_AUR([429\nauth_user_rate])
C3C -- no --> PASS_A([✓ pass])
end
@ -708,11 +725,11 @@ flowchart TD
C4A{"Suspicious headers?\nno Accept-Language + no Accept"}
C4A -- yes --> R_ASH([429\nsuspicious_headers])
C4A -- no --> C4B
C4B{"IP rate ≥ 35/min?"}
C4B{"IP rate ≥ 120/min?"}
C4B -- yes --> SIPR["SET rl:ip:IP:blocked TTL 300 s"]
SIPR --> R_IPR([429\nip_rate])
C4B -- no --> C4C
C4C{"NS/IP window hit\n≥ 35 in bucket?"}
C4C{"NS/IP window hit\n≥ 120 in bucket?"}
C4C -- yes --> SUAR["SET rl:ip:IP:blocked TTL 300 s"]
SUAR --> R_UAR([429\nua_rotation])
C4C -- no --> PASS_N([✓ pass])
@ -726,14 +743,14 @@ Roll out to canary pods first; promote check-by-check in order of false-positive
| Order | Check | Reason | Risk | Condition to promote |
|-------|-------|--------|------|---------------------|
| nginx | scanner extensions | `return 444` in `sapl.conf` for `.php`/`.asp`/etc. | Zero | Gunicorn never sees these requests |
| 0th | `quota_daily` / `quota_weekly` | Per-consumer daily/weekly cap on `/api/` paths | Low | Limits set well above per-minute rate (200/day anon, 1000/day auth) |
| 0th | `quota_daily` / `quota_weekly` | Per-consumer daily/weekly cap on `/api/` paths | Low | Limits set well above per-minute rate (500/day anon, 5 000/day auth) |
| 1st | `known_ua` | Substring in hardcoded `BOT_UA_FRAGMENTS` list | Zero | UA strings are deterministic |
| 2nd | `redis_ua` | Token hash in `rl:bot:ua:blocked` SET | Zero | Keys only set manually by operators |
| 3rd | `ip_blocked` | Marker set by prior proven-bad requests | Zero | Fast-path only, no new blocks created |
| 4th | `ip_rate` | Rolling IP counter ≥ 35/min | Low | Threshold calibrated from canary logs |
| 4th | `ip_rate` | Rolling IP counter ≥ 120/min | Low | Threshold calibrated from canary logs |
| 5th | `suspicious_headers` | No Accept-Language **and** no Accept | Medium | Confirmed no legitimate clients omit both headers |
| 6th | `ua_rotation` (ns/window) | NS/IP clock-aligned bucket ≥ 35 | Medium | NAT IP allowlist in place (see Open Questions) |
| 7th | `404_scan` | Anonymous IP accumulates ≥ 10 404s/min | Low | Catches path probes without known extensions |
| 6th | `ua_rotation` (ns/window) | NS/IP clock-aligned bucket ≥ 120 | Medium | NAT IP allowlist in place (see Open Questions) |
| 7th | `404_scan` | Anonymous IP accumulates ≥ 20 404s/min | Low | Catches path probes without known extensions |
### Decorator migration
@ -757,12 +774,73 @@ For views where `django-ratelimit` decorators already exist:
| **Tuned by** | Capacity of the server | Acceptable request volume per client |
| **Failure mode** | Workers overwhelmed | Legitimate user over-browsing |
A user loading a page quickly may fire 5–10 Django requests in two seconds.
With `rate=30r/m` (1 token/2 s) and `burst=60` they absorb that fine; the
leaky bucket refills before they click the next link. The Django threshold
(35/m sliding window) catches sustained automated traffic from a single IP
that looks like scraping even if it arrives slowly enough to beat the nginx
burst cap.
A SAPL page fires 12–45 parallel requests — most are `/static/` served
directly by nginx (zero Django cost), but 5–15 may reach Gunicorn.
With `rate=90r/m` and `burst=180` a user can load several heavy pages back-to-back
before the leaky bucket takes over. The Django threshold (120/m fixed window
for anonymous, 240/m for authenticated) catches sustained automated traffic that
arrives slowly enough to pass the nginx burst cap.
Note: nginx rates are hardcoded in `nginx.conf` (rebuild to change); burst values
are configurable via env vars at container start, with defaults set in `start.sh`.
---
## Session/voting bypass (2026-05-06)
### Problem
Multiple councilmembers behind a shared NAT IP were receiving 429 errors during
live plenary votes. Root cause: nginx's `limit_req` fires before any request
reaches Django, so Django's per-user counters (which are NAT-safe) were never
consulted. When a vote opened, 15+ users simultaneously reloaded their voting
pages, exhausting the shared IP's nginx burst bucket.
The `voto_individual.html` template contains `setTimeout(location.reload, 30000)`
— the page reloads itself every 30 seconds. When councilmembers open the page at
roughly the same time (vote announcement), their reload timers align and all fire
in the same second.
See `docs/rate-limiter-incidents.md` — PatoBranco-PR 2026-05-06 for full analysis.
### Fix
Dedicated nginx `location` blocks with **no `limit_req`** for session and voting
paths. These regex locations take priority over `location /` by nginx matching
rules. Mirrored in `RATE_LIMIT_BYPASS_PATHS` so Django's middleware also skips
counting (defense-in-depth).
```nginx
# sapl.conf — no rate limiting on session/voting paths
location ~ ^/painel/\d+/dados$ { proxy_pass http://sapl_server; }
location ~ ^/voto-individual/ { proxy_pass http://sapl_server; }
location ~ ^/sessao/\d+ { proxy_pass http://sapl_server; }
```
```python
# settings.py
RATE_LIMIT_BYPASS_PATHS = [
r'^/painel/\d+/dados$',
r'^/voto-individual/',
r'^/sessao/\d+',
r'^/sessao/pauta-sessao/\d+/',
]
```
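One way the middleware could consult this list is sketched below. The helper name `is_bypassed` is hypothetical and the real check in `sapl/middleware/ratelimit.py` may differ; the patterns are copied from the settings above.

```python
import re

RATE_LIMIT_BYPASS_PATHS = [
    r'^/painel/\d+/dados$',
    r'^/voto-individual/',
    r'^/sessao/\d+',
    r'^/sessao/pauta-sessao/\d+/',
]

# Compile once at import time so the per-request check stays cheap.
_BYPASS_PATTERNS = [re.compile(p) for p in RATE_LIMIT_BYPASS_PATHS]


def is_bypassed(path: str) -> bool:
    """Return True if the request path should skip rate-limit counting."""
    return any(rx.search(path) for rx in _BYPASS_PATTERNS)


for sample in ("/voto-individual/", "/sessao/42/", "/materia/1"):
    print(sample, is_bypassed(sample))
```

Because every pattern is anchored with `^`, a path such as `/api/sessao/1` does not match and is still counted.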
### Why these paths are safe to exempt
- All meaningful actions require an authenticated session cookie.
- Django's per-user counter (240/m, keyed by user pk) still applies as a backstop.
- The real abuse vectors (scrapers, credential stuffing) target different URL patterns.
- The cost of a false-positive block (councilmember unable to vote) far outweighs
the risk of abuse on these paths.
### Long-term fix
Replace `setTimeout(location.reload, 30000)` in `voto_individual.html` with
server-push (WebSocket or SSE). Removes the synchronisation mechanism entirely —
the thundering herd cannot occur if the server pushes vote-open events instead of
clients polling by reloading.
---
@ -824,8 +902,9 @@ location /static/ {
}
# Proxied to Gunicorn — Django middleware + serve_media() run here.
# Own zone so media downloads don't drain the general page-load bucket.
location /media/ {
limit_req zone=sapl_general burst=${NGINX_BURST_GENERAL} nodelay;
limit_req zone=sapl_media burst=${NGINX_BURST_MEDIA} nodelay;
proxy_pass http://sapl_server;
}
@ -888,21 +967,21 @@ Redis PDF caching would solve "high request volume reaching the file layer" —
| 0 | Page / view cache | `cache:{ns}:*` | 300 s (default) | — | `CACHES['default']` KEY_PREFIX |
| 0 | Static file cache (logos) | `static:{ns}:{sha256}` | 3 – 24 h | — | *Future* (requires OpenResty/Lua) |
| 0 | File content cache (≤ 360 KB) | `file:{ns}:{sha256}` | 1 h | — | *Future* |
| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 35 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
| 1 | IP 404 counter | `rl:ip:{ip}:404s` | 60 s | 10 (`RATE_LIMIT_404_THRESHOLD`) | `RL_IP_404S` |
| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
| 1 | IP 404 counter | `rl:ip:{ip}:404s` | 60 s | 20 (`RATE_LIMIT_404_THRESHOLD`) | `RL_IP_404S` |
| 1 | IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | — | `RL_IP_BLOCKED` |
| 1 | Blocked-IP ZSET index | `rl:index:blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_BLOCKED_IPS` |
| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — | `RL_USER_BLOCKED` |
| 1 | Blocked-user ZSET index | `rl:index:blocked_users` | permanent ZSET, score=expiry ts | — | `RL_INDEX_BLOCKED_USERS` |
| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 35 (`RATE_LIMITER_RATE`) | `RL_NS_WINDOW` |
| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 240 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — *(not written on rate breach; window resets naturally)* | `RL_USER_BLOCKED` |
| 1 | Blocked-user ZSET index | `rl:index:blocked_users` | permanent ZSET, score=expiry ts | — *(not written on rate breach)* | `RL_INDEX_BLOCKED_USERS` |
| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 120 (`RATE_LIMITER_RATE`) | `RL_NS_WINDOW` |
| 1 | Path counter (`/media/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — (observability only) | `RL_PATH_REQUESTS` |
| 1 | Path counter (`/static/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — | *Future* (requires OpenResty/Lua) |
| 1 | UA deny list | `rl:bot:ua:blocked` | permanent SET | — (block on match) | `RL_UA_BLOCKLIST` |
| 1 | API daily quota (anon) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 50 (`API_QUOTA_ANON_DAILY`) | `QUOTA_IP_DAILY` |
| 1 | API weekly quota (anon) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 350 (`API_QUOTA_ANON_WEEKLY`) | `QUOTA_IP_WEEKLY` |
| 1 | API daily quota (auth) | `quota:{ns}:daily:{date}:user:{uid}` | 24 h | 1000 (`API_QUOTA_AUTH_DAILY`) | `QUOTA_USER_DAILY` |
| 1 | API weekly quota (auth) | `quota:{ns}:weekly:{week}:user:{uid}` | 7 d | 7000 (`API_QUOTA_AUTH_WEEKLY`) | `QUOTA_USER_WEEKLY` |
| 1 | API daily quota (anon) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 500 (`API_QUOTA_ANON_DAILY`) | `QUOTA_IP_DAILY` |
| 1 | API weekly quota (anon) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 3 500 (`API_QUOTA_ANON_WEEKLY`) | `QUOTA_IP_WEEKLY` |
| 1 | API daily quota (auth) | `quota:{ns}:daily:{date}:user:{uid}` | 24 h | 5 000 (`API_QUOTA_AUTH_DAILY`) | `QUOTA_USER_DAILY` |
| 1 | API weekly quota (auth) | `quota:{ns}:weekly:{week}:user:{uid}` | 7 d | 35 000 (`API_QUOTA_AUTH_WEEKLY`) | `QUOTA_USER_WEEKLY` |
| 2 | Django Channels | `channels:*` | session TTL | — | *Future* |
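The key patterns in the table can be expressed as small builder functions. This is a sketch only: the function names are hypothetical, and the `{date}` and `{week}` formats (ISO date and ISO year-week) are assumptions, since the table does not specify them.

```python
from datetime import date
from typing import Optional


def ip_counter_key(ip: str) -> str:
    return f"rl:ip:{ip}:reqs"


def user_counter_key(ns: str, uid: int) -> str:
    return f"rl:{ns}:user:{uid}:reqs"


def _consumer(ip: Optional[str], uid: Optional[int]) -> str:
    # A quota key belongs to exactly one consumer: an IP or a user.
    if (ip is None) == (uid is None):
        raise ValueError("pass exactly one of ip or uid")
    return f"user:{uid}" if uid is not None else f"ip:{ip}"


def daily_quota_key(ns: str, day: date, *, ip: Optional[str] = None,
                    uid: Optional[int] = None) -> str:
    # Assumption: {date} is rendered as an ISO date (YYYY-MM-DD).
    return f"quota:{ns}:daily:{day.isoformat()}:{_consumer(ip, uid)}"


def weekly_quota_key(ns: str, day: date, *, ip: Optional[str] = None,
                     uid: Optional[int] = None) -> str:
    # Assumption: {week} is rendered as ISO year and week (YYYY-Www).
    iso = day.isocalendar()
    return f"quota:{ns}:weekly:{iso[0]}-W{iso[1]:02d}:{_consumer(ip, uid)}"


print(daily_quota_key("sapl", date(2026, 5, 7), ip="203.0.113.9"))
print(weekly_quota_key("sapl", date(2026, 5, 7), uid=42))
```

Keeping the consumer part (`ip:` or `user:`) in one helper ensures the anonymous and authenticated quota keys differ only in that segment, as in the table.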
### What each counter catches — and misses
@ -971,7 +1050,7 @@ the requests come from many different IPs (e.g., distributed proxy pool), all
requests share the same `uid` and accumulate in one counter.
Misses: a credential that is shared across multiple legitimate users in the same
office; all their activity adds up to one counter and can trip the 120/min
office; all their activity adds up to one counter and can trip the 240/min
threshold during a busy session.
---
@ -982,14 +1061,16 @@ Written when `rl:{ns}:user:{uid}:reqs` hits the authenticated threshold (step 3c
Checked at step 3a — before counting — so a blocked user never increments their
counter on subsequent requests during the 300 s cooldown.
Catches: credential-stuffing or runaway automation using a valid session — once the
120/min threshold is hit, the account is locked out immediately for 300 s. Unlike
the IP marker, the block is namespace-scoped, so the same user account can be
blocked on one SAPL but still active on another.
Previously caught: credential-stuffing or runaway automation using a valid session —
once the 240/min threshold was hit the account was locked out for 300 s.
Misses: same fixed-TTL weakness as the IP marker — a persistent attacker resumes
after 300 s. An account shared by multiple legitimate users (e.g., a departmental
login) can be locked out during peak collaborative use.
**Changed (2026-05-07):** `_set_block` is no longer called on authenticated rate
breach. The 429 is returned for the over-limit request; the counter expires after
60 s and the user proceeds automatically. The `rl:{ns}:user:{uid}:blocked` marker
and `rl:index:blocked_users` ZSET are therefore **not written on rate breach**;
only legacy entries from before this change may exist. A 300 s lockout is wrong
for a logged-in user who clicked too fast; that penalty is reserved for
anonymous/bot traffic.
---
