diff --git a/docker/config/nginx/sapl.conf b/docker/config/nginx/sapl.conf
index 0102d6476..24427dd37 100644
--- a/docker/config/nginx/sapl.conf
+++ b/docker/config/nginx/sapl.conf
@@ -134,6 +134,30 @@ server {
         proxy_pass http://sapl_server;
     }
+    # ----------------------------------------------------------------
+    # Session voting paths — high-frequency during live votes; exempt
+    # from rate limiting. Multiple authenticated users share NAT IPs.
+    # Covers: /voto-individual/, /sessao//ordemdia, /sessao//expediente,
+    # /sessao//matordemdia/*, /sessao/pauta-sessao//, etc.
+    # ----------------------------------------------------------------
+    location ~ ^/voto-individual/ {
+        proxy_set_header X-Request-ID $req_id;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Host $http_host;
+        proxy_redirect off;
+        proxy_pass http://sapl_server;
+    }
+
+    location ~ ^/sessao/\d+ {
+        proxy_set_header X-Request-ID $req_id;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_set_header Host $http_host;
+        proxy_redirect off;
+        proxy_pass http://sapl_server;
+    }
+
+    # ----------------------------------------------------------------
     # Scanner extension probes (.php, .asp, etc.) — SAPL never serves
     # these. Drop the connection before reaching Gunicorn.
diff --git a/docs/rate-limiter-incidents.md b/docs/rate-limiter-incidents.md
new file mode 100644
index 000000000..ddd4adf68
--- /dev/null
+++ b/docs/rate-limiter-incidents.md
@@ -0,0 +1,197 @@
+# Rate Limiter Incidents
+
+This document records real rate-limiting incidents, the root-cause analysis performed for each, the fixes applied, and the architectural discussion that followed. New incidents should be appended under their own section.
+
+---
+
+## PatoBranco-PR — 2026-05-06
+
+### Symptom
+
+Councilmembers reported being unable to access the voting interface during a live plenary session.
The error was HTTP 429. Two blocking events occurred:
+
+| Event | Start | Recovery | Duration |
+|-------|-------|----------|----------|
+| 1 | 13:51:23 | ~14:01 | ~10 min |
+| 2 | 14:22:30 | ~14:25 | ~3 min |
+
+Neither duration matches the Django `BLOCK_TTL` of 300 seconds (event 1 lasted roughly twice as long, event 2 recovered sooner), which was the first diagnostic clue.
+
+### Environment
+
+- NAT IP: `200.175.17.66` (reported range `200.175.17.66/29`)
+- Secondary range: `187.109.99.234/30`
+- Peak observed: **24 requests/second** from that IP (confirmed in OpenSearch, 14:22:31)
+- Paths involved: `/voto-individual/`, `/sessao/pauta-sessao/2600/`, `/sessao/2600/ordemdia`
+
+### Root Cause
+
+Multiple councilmembers share a single public IP via NAT. When a vote opened, all of them reloaded their browser simultaneously. Nginx saw the combined traffic as a single client exhausting its burst bucket and returned 429 — before any request reached Django.
+
+```
+                   ┌─────────────────────────────────────────┐
+                   │                  nginx                  │
+                   │                                         │
+Councilmember A ──►│ IP: 200.175.17.66                       │
+Councilmember B ──►│ IP: 200.175.17.66 ──► burst bucket      │──► 429 (bucket full)
+Councilmember C ──►│ IP: 200.175.17.66     exhausted         │
+        ...        │                                         │
+                   └─────────────────────────────────────────┘
+                                        │
+                                        │ (never reached)
+                                        ▼
+                   ┌─────────────────────────────────────────┐
+                   │            Django middleware            │
+                   │                                         │
+                   │ rl:ip::reqs          (never incremented)│
+                   │ rl:user::reqs        (never incremented)│
+                   │ Redis block key      (never written)    │
+                   └─────────────────────────────────────────┘
+```
+
+### Why Recovery Did Not Match the 300-Second TTL
+
+Django's block mechanism (`_set_block()`) was **never triggered**. The NAT IP was never written to Redis. The 429s came entirely from nginx's token bucket being exhausted.
+
+Recovery happened when the synchronized burst subsided (vote ended, users stopped reloading). The nginx bucket refilled at its configured rate.
No TTL expiry was involved — recovery time was variable because it depended on how depleted the bucket was at the end of each burst, not on a fixed timer.
+
+Had Django's block fired, the outage would have been exactly 300 seconds both times. The variable durations (10 min vs 3 min) confirm nginx was the sole actor.
+
+### The Polling Source
+
+`voto_individual.html` contains a `setTimeout(location.reload, 30000)` — the page reloads itself every 30 seconds. When councilmembers opened the voting page at roughly the same time (vote announcement), their reload timers aligned. Each 30-second tick fired a synchronized burst from all clients behind the NAT.
+
+`/sessao//ordemdia` and `/sessao/pauta-sessao//` are not polled by JavaScript — they are normal page navigations. They appeared in the burst because councilmembers navigated to them at the same moment as the vote opened.
+
+### Two-Layer Rate Limiting Architecture
+
+```
+                      ┌──────────────────────────────────────────────────────┐
+  Incoming request    │                        nginx                         │
+─────────────────────►│                                                      │
+                      │  limit_req zone=sapl_general  ← IP-only, no auth     │
+                      │  burst=${NGINX_BURST_GENERAL} nodelay                │
+                      │                                                      │
+                      │  If bucket full → 429 immediately                    │
+                      │  Redis: nothing written                              │
+                      └───────────────────────┬──────────────────────────────┘
+                                              │ (only if bucket has room)
+                                              ▼
+                      ┌──────────────────────────────────────────────────────┐
+                      │              Django RateLimitMiddleware              │
+                      │                                                      │
+                      │  1. Bypass check    ← RATE_LIMIT_BYPASS_PATHS        │
+                      │  2. API quota check (if /api/)                       │
+                      │  3. _evaluate()                                      │
+                      │     a. IP block check   (Redis rl:ip::blocked)       │
+                      │     b. User block check (Redis rl:user::blocked)     │
+                      │     c. Rate counter     (rl:ip::reqs)                │
+                      │     d. User counter     (rl:user::reqs)              │
+                      │                                                      │
+                      │  If rate exceeded → SET block key (TTL=300s)         │
+                      │                   → ZADD rl:index:blocked_ips        │
+                      └──────────────────────────────────────────────────────┘
+```
+
+**The core mismatch:** Django tracks per-user buckets (`rl:user::reqs`) which are NAT-safe.
Nginx tracks per-IP buckets which collapse all users behind a NAT into one. Nginx fires first, so Django's smarter per-user accounting is never consulted during a burst.
+
+### Fix Applied
+
+Added nginx `location` blocks for session and voting paths that pass requests through **without** `limit_req`. These regex locations take priority over the catch-all `location /` by nginx matching rules.
+
+**`docker/config/nginx/sapl.conf`:**
+```nginx
+location ~ ^/voto-individual/ {
+    proxy_set_header X-Request-ID $req_id;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+    proxy_set_header Host $http_host;
+    proxy_redirect off;
+    proxy_pass http://sapl_server;
+}
+
+location ~ ^/sessao/\d+ {
+    proxy_set_header X-Request-ID $req_id;
+    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+    proxy_set_header X-Forwarded-Proto $scheme;
+    proxy_set_header Host $http_host;
+    proxy_redirect off;
+    proxy_pass http://sapl_server;
+}
+```
+
+**`sapl/settings.py`** — `RATE_LIMIT_BYPASS_PATHS` extended to match:
+```python
+RATE_LIMIT_BYPASS_PATHS = [
+    r'^/painel/\d+/dados$',
+    r'^/voto-individual/',
+    r'^/sessao/\d+',
+    r'^/sessao/pauta-sessao/\d+/',
+]
+```
+
+These paths are safe to exempt because:
+- They require an authenticated session cookie to perform any meaningful action.
+- Django's per-user rate counter still runs as a backstop.
+- The cost of a false-positive block (councilmember unable to vote) outweighs the risk of abuse on these URLs.
+
+---
+
+## Architectural Discussion
+
+### Why Increasing Burst Rates Is Not the Right Fix
+
+`burst` controls how many requests above the sustained rate are allowed in a spike before 429 fires. A larger burst absorbs thundering herds from NAT but also allows more requests from burst-style attackers (scanners, credential stuffers) before the throttle engages. It is the same knob, pulled in opposite directions.
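
To make the trade-off concrete, the sketch below models the limiter as a simple token bucket and counts how many requests in a simultaneous burst are rejected at different `burst` settings. It is a simplified model, not nginx's exact `limit_req` accounting. The `TokenBucket` class, the `count_rejected` helper, and the sustained rate of 5 requests/second are assumptions for illustration; the 24-request burst mirrors the peak observed in the incident.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Simplified token bucket: `burst` is the capacity, `rate` the refill per second."""
    rate: float
    burst: int
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket, as an idle client would have.
        self.tokens = float(self.burst)

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_rejected(n_requests: int, rate: float, burst: int) -> int:
    """Send n_requests at the same instant (t=0) and count the rejections."""
    bucket = TokenBucket(rate=rate, burst=burst)
    return sum(not bucket.allow(now=0.0) for _ in range(n_requests))


# Hypothetical settings: 5 req/s sustained, 24 requests arriving together.
rejected_small_burst = count_rejected(24, rate=5.0, burst=10)
rejected_large_burst = count_rejected(24, rate=5.0, burst=24)
```

In this model, `burst=10` rejects 14 of the 24 simultaneous requests, while `burst=24` rejects none. The same allowance would apply to any single automated client, which is the trade-off described above.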
+
+The correct dimensioning formula for a legislative house would be:
+
+```
+burst ≥ users_behind_NAT × tabs_per_user × requests_per_page_load
+```
+
+For a large state assembly (90 deputies, 2 tabs each) this is at least 180 even at one request per page load, a burst value that renders rate limiting ineffective against automated tools.
+
+**Conclusion:** burst tuning is not the right tool for this problem. Exemption of known-safe high-frequency paths is.
+
+### The Multi-Tab Problem
+
+Staff members commonly open multiple SAPL tabs simultaneously. Each tab has its own reload timer. If 10 staff members have 3 tabs open, a synchronized event generates 30+ requests from the NAT IP even before councilmembers are counted. This means the bucket can be pre-exhausted before the voting burst even starts.
+
+This further reinforces that IP-based rate limiting is the wrong unit for authenticated traffic in a shared-office environment.
+
+### Why Per-User Rate Limiting Does Not Fully Solve This
+
+Django already increments both `rl:ip::reqs` and `rl:user::reqs`. The per-user counter is NAT-safe. But it never runs during a burst because nginx drops the request first.
+
+Moving rate limiting for authenticated users to the Django layer only (removing nginx `limit_req` for authenticated paths) would make per-user counting the effective control. The obstacle is that nginx cannot distinguish authenticated from anonymous requests without reading and resolving the session cookie — which requires a database or Redis lookup nginx cannot perform natively.
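
The difference between the two counters can be illustrated with a small in-memory sketch. It assumes a hypothetical per-window limit of 10 requests and uses plain dictionary keys rather than the real Redis key names. `count_requests`, `over_limit`, and the member names are illustrative and not part of the SAPL middleware.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (client IP as seen by the server, authenticated user name)
Request = Tuple[str, str]


def count_requests(requests: Iterable[Request]) -> Tuple[Counter, Counter]:
    """Count the same traffic twice: once keyed by IP, once keyed by user."""
    per_ip: Counter = Counter()
    per_user: Counter = Counter()
    for ip, user in requests:
        per_ip[ip] += 1
        per_user[user] += 1
    return per_ip, per_user


def over_limit(counts: Counter, limit: int) -> List[str]:
    """Return the keys whose request count exceeds the limit."""
    return sorted(key for key, n in counts.items() if n > limit)


NAT_IP = "200.175.17.66"
LIMIT = 10  # hypothetical per-window limit, for illustration only

# Three users behind the same NAT IP, eight requests each.
traffic = [(NAT_IP, user)
           for user in ("member_a", "member_b", "member_c")
           for _ in range(8)]

per_ip, per_user = count_requests(traffic)
blocked_ips = over_limit(per_ip, LIMIT)      # the shared IP exceeds the limit
blocked_users = over_limit(per_user, LIMIT)  # no individual user does
```

Keyed by IP, the three users add up to 24 requests and exceed the limit. Keyed by user, each stays at 8 and none is blocked.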
+
+### Architectural Solutions
+
+| Approach | Effort | Solves NAT problem | Protects against bots |
+|----------|--------|-------------------|----------------------|
+| Nginx bypass for known session paths *(done)* | Low | Yes, for bypassed paths | Yes, general paths still rate-limited |
+| Increase `NGINX_BURST_GENERAL` | Trivial | Partially | Weakens bot protection |
+| Nginx `limit_req_zone` keyed on session cookie string | Medium | Yes (per-session token, not per-IP) | Yes, each session has own bucket |
+| Move all auth-path rate limiting to Django only | Medium | Yes | Depends on Django rate correctly tuned |
+| Replace `setTimeout(location.reload)` with WebSocket/SSE push | High | Yes — eliminates synchronized reloads entirely | N/A |
+
+### WebSocket / SSE Consideration
+
+The 30-second self-reload in `voto_individual.html` is the synchronization mechanism that creates the thundering herd. If the server pushed state changes (vote opened, result published) instead of the client polling by reloading, the synchronized burst would not exist regardless of how many users or tabs are open.
+
+A full WebSocket rewrite (Django Channels + Redis pub/sub channel layer) would:
+- Eliminate polling bursts on session/voting paths
+- Make vote-state updates instantaneous instead of up to 30 seconds late
+- Require nginx configuration for `Upgrade: websocket` proxying
+
+The nginx bypass is the correct operational fix for now. The WebSocket rewrite is the correct architectural fix for the future. They are not substitutes — the bypass would remain useful for the initial WebSocket handshake, which is still an HTTP request subject to burst limits.
+
+---
+
+## Pending Investigations
+
+The following incidents may or may not share the same root cause as the PatoBranco-PR event. Each should be investigated using the OpenSearch query patterns established above and documented here.
+
+- [ ] Other houses reporting intermittent 429s during session hours
+- [ ] Azure crawler bot (`52.167.144.162`) — 2 × 429 observed on 2026-05-06 at patobranco-pr; appears to be a legitimate Microsoft indexer hitting non-session paths; confirm it is correctly rate-limited and not causing collateral blocks on shared IPs
+- [ ] Investigate whether `187.109.99.234/30` (patobranco secondary NAT range) experienced any blocks independently of `200.175.17.66/29`
diff --git a/sapl/settings.py b/sapl/settings.py
index ccc07de76..01db72bad 100644
--- a/sapl/settings.py
+++ b/sapl/settings.py
@@ -151,6 +151,7 @@ MIDDLEWARE = [
     'django.contrib.messages.middleware.MessageMiddleware',
     'django.middleware.clickjacking.XFrameOptionsMiddleware',
     'django.middleware.security.SecurityMiddleware',
+    'sapl.middleware.api_emergency_block.ApiEmergencySameSiteOnlyMiddleware', # TODO: REMOVE AFTER RL WORKS!
     'whitenoise.middleware.WhiteNoiseMiddleware',
     'waffle.middleware.WaffleMiddleware',
     'sapl.middleware.check_password.CheckWeakPasswordMiddleware',
@@ -431,7 +432,9 @@ RATE_LIMIT_404_THRESHOLD = config('RATE_LIMIT_404_THRESHOLD', default=10, cast=i
 # it is also exempt at the nginx layer (location block with no limit_req).
 RATE_LIMIT_BYPASS_PATHS = [
     r'^/painel/\d+/dados$',
-    r'^/voto-individual/$',
+    r'^/voto-individual/',
+    r'^/sessao/\d+',
+    r'^/sessao/pauta-sessao/\d+/',
 ]

 # API quota — daily and weekly call caps per consumer (Redis-only, no DB migration).
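
A quick way to sanity-check the bypass patterns is to run them against the paths named in the incident report. The sketch below assumes the middleware applies each pattern with `re.match` (not confirmed by this diff) and that Python's `re` behaves like nginx's PCRE for these simple expressions. `NGINX_EXEMPT`, `INCIDENT_PATHS`, and `matches_any` are illustrative names.

```python
import re

# Patterns from RATE_LIMIT_BYPASS_PATHS after this change.
DJANGO_BYPASS = [
    r'^/painel/\d+/dados$',
    r'^/voto-individual/',
    r'^/sessao/\d+',
    r'^/sessao/pauta-sessao/\d+/',
]

# Regexes of the two nginx `location ~` blocks added in this diff.
NGINX_EXEMPT = [
    r'^/voto-individual/',
    r'^/sessao/\d+',
]

# Paths listed under "Paths involved" in the incident report.
INCIDENT_PATHS = [
    '/voto-individual/',
    '/sessao/pauta-sessao/2600/',
    '/sessao/2600/ordemdia',
]


def matches_any(path: str, patterns: list) -> bool:
    """True if any pattern matches the path from its start (assumed semantics)."""
    return any(re.match(pattern, path) for pattern in patterns)


# Maps each path to (exempt at nginx, bypassed in Django).
coverage = {
    path: (matches_any(path, NGINX_EXEMPT), matches_any(path, DJANGO_BYPASS))
    for path in INCIDENT_PATHS
}
```

Under these assumptions, `/voto-individual/` and `/sessao/2600/ordemdia` are covered at both layers. `/sessao/pauta-sessao/2600/` is bypassed in Django but matched by neither nginx regex, because `^/sessao/\d+` requires digits immediately after `/sessao/`. That path may therefore still fall through to the rate-limited `location /` unless another location block outside this diff covers it.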