Rate Limiter Incidents
This document records real rate-limiting incidents, the root-cause analysis performed for each, the fixes applied, and the architectural discussion that followed. New incidents should be appended under their own section.
PatoBranco-PR — 2026-05-06
Symptom
Councilmembers reported being unable to access the voting interface during a live plenary session. The error was HTTP 429. Two blocking events occurred:
| Event | Start | Recovery | Duration |
|---|---|---|---|
| 1 | 13:51:23 | ~14:01 | ~10 min |
| 2 | 14:22:30 | ~14:25 | ~3 min |
Neither duration matched the Django BLOCK_TTL of 300 seconds (event 1 lasted about twice as long, event 2 recovered well before it), which was the first diagnostic clue.
Environment
- NAT IP: 200.175.17.66 (reported range 200.175.17.66/29)
- Secondary range: 187.109.99.234/30
- Peak observed: 24 requests/second from that IP (confirmed in OpenSearch, 14:22:31)
- Paths involved: /voto-individual/, /sessao/pauta-sessao/2600/, /sessao/2600/ordemdia
Root Cause
Multiple councilmembers share a single public IP via NAT. When a vote opened, all of them reloaded their browsers simultaneously. Nginx saw the combined traffic as a single client exhausting its burst bucket and returned 429 before the rejected requests could reach Django.
                   ┌─────────────────────────────────────────┐
                   │                  nginx                  │
                   │                                         │
Councilmember A ──►│ IP: 200.175.17.66                       │
Councilmember B ──►│ IP: 200.175.17.66 ──► burst bucket      │──► 429 (bucket full)
Councilmember C ──►│ IP: 200.175.17.66     exhausted         │
       ...         │                                         │
                   └─────────────────────────────────────────┘
                                        │
                                        │ (never reached)
                                        ▼
                   ┌─────────────────────────────────────────┐
                   │            Django middleware            │
                   │                                         │
                   │ rl:ip:<ip>:reqs    (never incremented)  │
                   │ rl:user:<id>:reqs  (never incremented)  │
                   │ Redis block key    (never written)      │
                   └─────────────────────────────────────────┘
Why Recovery Did Not Follow the 300-Second TTL
Django's block mechanism (_set_block()) was never triggered. The NAT IP was never written to Redis. The 429s came entirely from nginx's token bucket being exhausted.
Recovery happened when the synchronized burst subsided (vote ended, users stopped reloading). The nginx bucket refilled at its configured rate. No TTL expiry was involved — recovery time was variable because it depended on how depleted the bucket was at the end of each burst, not on a fixed timer.
Had Django's block fired, each outage would have lasted close to the fixed 300-second TTL. The variable durations (10 min vs 3 min), together with the absence of any block key in Redis, point to nginx as the sole actor.
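The variable recovery time can be reproduced with a small simulation. The sketch below is a simplified model of nginx limit_req accounting, not nginx's actual code: an "excess" counter drains at the configured rate, and a request is rejected once the counter would exceed burst. The class and function names are illustrative; rate=30r/m and burst=60 are the pre-fix values shown in the diagrams later in this document, and 24 requests/second is the observed peak.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LimitReqBucket:
    """Simplified model of one nginx limit_req bucket (one per client IP)."""
    rate_per_s: float             # sustained rate, e.g. 30r/m -> 0.5
    burst: int                    # extra requests tolerated above the rate
    excess: float = 0.0           # how far the client is ahead of the rate
    last: Optional[float] = None  # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:               # first request from this IP
            self.last = now
            return True
        drained = self.rate_per_s * (now - self.last)
        excess = max(self.excess - drained, 0.0) + 1.0
        if excess > self.burst:
            return False                    # 429; state is left untouched
        self.excess, self.last = excess, now
        return True

    def seconds_until_empty(self) -> float:
        """Drain time for the excess, measured from the last accepted request."""
        return self.excess / self.rate_per_s


def simulate(bucket: LimitReqBucket, req_per_s: int, seconds: int):
    """Send a steady stream of requests and count accepted vs rejected."""
    accepted = rejected = 0
    for i in range(req_per_s * seconds):
        if bucket.allow(i / req_per_s):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected


bucket = LimitReqBucket(rate_per_s=30 / 60, burst=60)
accepted, rejected = simulate(bucket, req_per_s=24, seconds=10)
print(accepted, rejected, round(bucket.seconds_until_empty()))
```

In this model most of a 10-second burst at 24 requests/second is rejected, and the bucket needs up to burst / rate = 120 seconds of quiet to drain fully. How long users keep seeing 429s therefore depends on how long the synchronized traffic continues, not on a fixed timer.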
The Polling Source
voto_individual.html contains a setTimeout(location.reload, 30000) — the page reloads itself every 30 seconds. When councilmembers opened the voting page at roughly the same time (vote announcement), their reload timers aligned. Each 30-second tick fired a synchronized burst from all clients behind the NAT.
/sessao/<pk>/ordemdia and /sessao/pauta-sessao/<pk>/ are not polled by JavaScript — they are normal page navigations. They appeared in the burst because councilmembers navigated to them at the same moment as the vote opened.
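The timer alignment can be illustrated with a short calculation. This is a sketch under simplifying assumptions: each reload counts as a single request (real page loads issue more), and the client open times are invented for illustration. Only the 30-second period comes from voto_individual.html.

```python
from collections import Counter

RELOAD_PERIOD_MS = 30_000  # the 30 s self-reload described above


def peak_requests_per_second(open_times_ms, horizon_ms=120_000):
    """Return the busiest one-second bin across all clients' reload ticks."""
    per_second = Counter()
    for opened in open_times_ms:
        tick = opened + RELOAD_PERIOD_MS      # first automatic reload
        while tick < horizon_ms:
            per_second[tick // 1000] += 1
            tick += RELOAD_PERIOD_MS
    return max(per_second.values(), default=0)


# 15 clients open the page within 1.5 s of the vote announcement...
aligned = [i * 100 for i in range(15)]
# ...versus the same 15 clients opening it 2 s apart (hypothetical).
spread = [i * 2000 for i in range(15)]

print(peak_requests_per_second(aligned))  # 10
print(peak_requests_per_second(spread))   # 1
```

Because every client uses the same fixed period, whatever clustering exists when the pages are opened is repeated on every subsequent tick.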
Two-Layer Rate Limiting Architecture
                      ┌──────────────────────────────────────────────────────┐
Incoming request      │                        nginx                         │
─────────────────────►│                                                      │
                      │ limit_req zone=sapl_general  ← IP-only, no auth      │
                      │ burst=${NGINX_BURST_GENERAL} nodelay                 │
                      │                                                      │
                      │ If bucket full → 429 immediately                     │
                      │ Redis: nothing written                               │
                      └───────────────────────┬──────────────────────────────┘
                                              │ (only if bucket has room)
                                              ▼
                      ┌──────────────────────────────────────────────────────┐
                      │              Django RateLimitMiddleware              │
                      │                                                      │
                      │ 1. Bypass check  ← RATE_LIMIT_BYPASS_PATHS           │
                      │ 2. API quota check (if /api/)                        │
                      │ 3. _evaluate()                                       │
                      │    a. IP block check   (Redis rl:ip:<ip>:blocked)    │
                      │    b. User block check (Redis rl:user:<id>:blocked)  │
                      │    c. Rate counter     (rl:ip:<ip>:reqs)             │
                      │    d. User counter     (rl:user:<id>:reqs)           │
                      │                                                      │
                      │ If rate exceeded → SET block key (TTL=300s)          │
                      │                  → ZADD rl:index:blocked_ips         │
                      └──────────────────────────────────────────────────────┘
The core mismatch: Django tracks per-user buckets (rl:user:<id>:reqs) which are NAT-safe. Nginx tracks per-IP buckets which collapse all users behind a NAT into one. Nginx fires first, so Django's smarter per-user accounting is never consulted during a burst.
Fix Applied
Added nginx location blocks for session and voting paths that pass requests through without limit_req. Because nginx tests regex locations before falling back to the longest matching prefix location, these blocks take priority over the catch-all location /.
docker/config/nginx/sapl.conf:
location ~ ^/voto-individual/ {
proxy_set_header X-Request-ID $req_id;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://sapl_server;
}
location ~ ^/sessao/\d+ {
proxy_set_header X-Request-ID $req_id;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Host $http_host;
proxy_redirect off;
proxy_pass http://sapl_server;
}
sapl/settings.py — RATE_LIMIT_BYPASS_PATHS extended to match:
RATE_LIMIT_BYPASS_PATHS = [
r'^/painel/\d+/dados$',
r'^/voto-individual/',
r'^/sessao/\d+',
r'^/sessao/pauta-sessao/\d+/',
]
These paths are safe to exempt because:
- They require an authenticated session cookie to perform any meaningful action.
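- The exemption is narrowly scoped: these paths skip limit_req in nginx and, being listed in RATE_LIMIT_BYPASS_PATHS, skip the Django counters and block checks as well, while every other path remains rate-limited at both layers.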
- The cost of a false-positive block (councilmember unable to vote) outweighs the risk of abuse on these URLs.
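A quick way to sanity-check the bypass list is to run the incident's paths through it. The patterns below are copied from the settings excerpt; is_bypassed is a hypothetical helper, and whether the middleware matches with re.search or re.match is an assumption (both behave the same for these ^-anchored patterns).

```python
import re

# Patterns copied from the settings excerpt above.
RATE_LIMIT_BYPASS_PATHS = [
    r'^/painel/\d+/dados$',
    r'^/voto-individual/',
    r'^/sessao/\d+',
    r'^/sessao/pauta-sessao/\d+/',
]

_COMPILED = [re.compile(p) for p in RATE_LIMIT_BYPASS_PATHS]


def is_bypassed(path: str) -> bool:
    """Hypothetical helper: True if any bypass pattern matches the path."""
    return any(rx.search(path) for rx in _COMPILED)


for path in ('/voto-individual/', '/sessao/2600/ordemdia',
             '/sessao/pauta-sessao/2600/', '/materia/123/'):
    print(path, is_bypassed(path))
```

Note that ^/sessao/\d+ on its own does not match /sessao/pauta-sessao/2600/, which is why the list carries a separate entry for it. The same regex semantics apply to the nginx location ~ ^/sessao/\d+ block, so that path would need its own location to be exempt at the nginx layer; whether one exists outside the excerpt above is not shown here.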
Architectural Discussion
Why Increasing Burst Rates Is Not the Right Fix
burst controls how many requests above the sustained rate are allowed in a spike before 429 fires. A larger burst absorbs thundering herds from NAT but also allows more requests from burst-style attackers (scanners, credential stuffers) before the throttle engages. It is the same knob, pulled in opposite directions.
The correct dimensioning formula for a legislative house would be:
burst ≥ users_behind_NAT × tabs_per_user × requests_per_page_load
For a large state assembly (90 deputies, 2 tabs each) this is at least 180 even at one request per page load, a burst value that renders rate limiting ineffective against automated tools.
Conclusion: burst tuning is not the right tool for this problem. Exemption of known-safe high-frequency paths is.
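The dimensioning formula is simple enough to evaluate directly. In the sketch below the function name is illustrative and the figure of 5 requests per page load is a hypothetical value, included only to show how quickly the product grows.

```python
def min_burst(users_behind_nat: int, tabs_per_user: int,
              requests_per_page_load: int) -> int:
    """Smallest burst that absorbs one fully synchronized reload."""
    return users_behind_nat * tabs_per_user * requests_per_page_load


print(min_burst(90, 2, 1))  # 180: the state-assembly example above
print(min_burst(90, 2, 5))  # 900: with a hypothetical 5 requests per page
```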
The Multi-Tab Problem
Staff members commonly open multiple SAPL tabs simultaneously. Each tab has its own reload timer. If 10 staff members have 3 tabs open, a synchronized event generates 30+ requests from the NAT IP even before councilmembers are counted. This means the bucket can be pre-exhausted before the voting burst even starts.
This further reinforces that IP-based rate limiting is the wrong unit for authenticated traffic in a shared-office environment.
Why Per-User Rate Limiting Does Not Fully Solve This
Django already increments both rl:ip:<ip>:reqs and rl:user:<id>:reqs. The per-user counter is NAT-safe. But it never runs during a burst because nginx drops the request first.
Moving rate limiting for authenticated users to the Django layer only (removing nginx limit_req for authenticated paths) would make per-user counting the effective control. The obstacle is that nginx cannot distinguish authenticated from anonymous requests without reading and resolving the session cookie — which requires a database or Redis lookup nginx cannot perform natively.
Architectural Solutions
| Approach | Effort | Solves NAT problem | Protects against bots |
|---|---|---|---|
| Nginx bypass for known session paths (done) | Low | Yes, for bypassed paths | Yes, general paths still rate-limited |
| Increase NGINX_BURST_GENERAL | Trivial | Partially | Weakens bot protection |
| Nginx limit_req_zone keyed on session cookie string | Medium | Yes (per-session token, not per-IP) | Yes, each session has own bucket |
| Move all auth-path rate limiting to Django only | Medium | Yes | Depends on Django rate correctly tuned |
| Replace setTimeout(location.reload) with WebSocket/SSE push | High | Yes, eliminates synchronized reloads entirely | N/A |
WebSocket / SSE Consideration
The 30-second self-reload in voto_individual.html is the synchronization mechanism that creates the thundering herd. If the server pushed state changes (vote opened, result published) instead of the client polling by reloading, the synchronized burst would not exist regardless of how many users or tabs are open.
A full WebSocket rewrite (Django Channels + Redis pub/sub channel layer) would:
- Eliminate polling bursts on session/voting paths
- Make vote-state updates instantaneous instead of up to 30 seconds late
- Require nginx configuration for Upgrade: websocket proxying
The nginx bypass is the correct operational fix for now. The WebSocket rewrite is the correct architectural fix for the future. They are not substitutes — the bypass would remain useful for the initial WebSocket handshake, which is still an HTTP request subject to burst limits.
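For the SSE variant of the push approach, the server side reduces to emitting text frames in the Server-Sent Events wire format. The sketch below shows only that framing; the event names and payload are hypothetical, and nothing here is taken from the SAPL codebase. In a Django view, a generator like vote_stream could be passed to StreamingHttpResponse with content_type="text/event-stream", though that wiring is not shown.

```python
import json
from typing import Iterable, Iterator, Tuple


def format_sse(event: str, data: dict) -> str:
    """Encode one Server-Sent Events message (event name + JSON data)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def vote_stream(changes: Iterable[Tuple[str, dict]]) -> Iterator[str]:
    """Yield one SSE frame per state change instead of waiting for a reload."""
    for event, payload in changes:
        yield format_sse(event, payload)


# Hypothetical state changes for session 2600.
changes = [("vote_opened", {"sessao": 2600}),
           ("vote_closed", {"sessao": 2600})]
for chunk in vote_stream(changes):
    print(chunk, end="")
```

Each client holds one long-lived connection and receives a frame only when state changes, so there is no per-client timer left to synchronize.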
Diagrams — Issues, Solutions, and Trade-offs (2026-05-06/07)
1. NAT Thundering Herd — Before the Fix
During a live vote all councilmembers reload simultaneously. nginx sees one IP, exhausts its bucket, and returns 429 before Django is ever involved. Django's per-user counter (NAT-safe) is never consulted.
Office / Chamber — behind one NAT IP (200.175.17.66)
┌──────────────────────────────────────────────────────┐
│ Councilmember A browser reload ──┐ │
│ Councilmember B browser reload ──┤ │
│ Councilmember C browser reload ──┤ ~24 req/s │
│ Staff tab 1 browser reload ──┤ same public IP │
│ Staff tab 2 browser reload ──┘ │
└────────────────────────────┬─────────────────────────┘
│ all requests look identical to nginx
▼
┌─────────────────────────────────────┐
│ nginx sapl_general │
│ rate=30r/m burst=60 nodelay │
│ │
│ token bucket: 0 tokens remaining │
│ → 429 returned immediately │
└──────────────────┬──────────────────┘
│
╳ Django never reached
╳ rl:ip:{ip}:reqs never incremented
╳ rl:user:{uid}:reqs never incremented
╳ per-user NAT-safe counter never consulted
│
▼
429 for all N users in the org
recovery: wait for nginx bucket refill
(~3–10 min depending on depletion)
NOT a Django 300s block (Redis never written)
2. NAT Thundering Herd — After the Session Bypass Fix
Session and voting paths have dedicated nginx location blocks with no limit_req. Regex locations take priority over location /.
Office / Chamber — behind one NAT IP
┌──────────────────────────────────────────────────────┐
│ Councilmember A /voto-individual/ reload ──┐ │
│ Councilmember B /voto-individual/ reload ──┤ │
│ Councilmember C /sessao/2600/ordemdia ───────┤ │
│ Staff tab /sessao/pauta-sessao/2600/ ──┘ │
└────────────────────────────┬─────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ nginx │
│ │
│ location ~ ^/voto-individual/ ─┐ │
│ location ~ ^/sessao/\d+ ─┤ │ no limit_req
│ location ~ ^/painel/\d+/dados ─┘ │ pass through
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django RateLimitMiddleware │
│ │
│ RATE_LIMIT_BYPASS_PATHS match? │
│ → yes: return get_response() │
│ (no counter, no block check) │
└──────────────────┬──────────────────┘
│
▼
✓ View served
All N users get their page
3. nginx Zone Architecture — Before vs After
Before (single shared zone)
All traffic (HTML + media + API)
│
▼
┌───────────────────────────────┐
│ sapl_general │ ← one bucket per IP
│ rate=30r/m burst=60 │
│ │
│ /media/page.pdf ──────────┐ │ ← media request drains
│ /materia/123/ ──────────┤ │ the same bucket as
│ /api/materia/? ──────────┘ │ the HTML page
└───────────────────────────────┘
Problem: a page with 20 media attachments
burns 20 tokens from the page-load budget
After (four independent zones)
┌─────────────────┐ location /
│ sapl_general │ rate=90r/m burst=180 — HTML page requests
└─────────────────┘
┌─────────────────┐ location /media/
│ sapl_media │ rate=180r/m burst=180 — media downloads
└─────────────────┘ own bucket, never
drains page quota
┌─────────────────┐ location /api/
│ sapl_api │ rate=60r/m burst=120 — API calls
└─────────────────┘ quota layer is
real constraint
┌─────────────────┐ location /relatorios/
│ sapl_heavy │ rate=10r/m burst=20 — PDF generation
└─────────────────┘ nodelay tight by design
Session/voting paths: NO zone — exempt from limit_req entirely
Static files /static/: NO zone — served directly from disk
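The effect of splitting the zones can be checked by counting which bucket each request of a page load draws from. This is a sketch: zone_for is a hypothetical Python mirror of the location blocks described above, not nginx's matching algorithm, and the "before" mode simply charges every request to sapl_general.

```python
import re
from collections import Counter

# Hypothetical mirror of the exempt regex locations and the zone prefixes.
EXEMPT = [re.compile(p) for p in (
    r'^/voto-individual/', r'^/sessao/\d+', r'^/painel/\d+/dados')]
ZONE_BY_PREFIX = [            # checked in order; "/" is the catch-all
    ('/static/', None),
    ('/media/', 'sapl_media'),
    ('/api/', 'sapl_api'),
    ('/relatorios/', 'sapl_heavy'),
    ('/', 'sapl_general'),
]


def zone_for(path, split_zones=True):
    """Return the zone a request is charged to, or None if exempt."""
    if not split_zones:                       # "before": one shared zone
        return 'sapl_general'
    if any(rx.search(path) for rx in EXEMPT):
        return None
    for prefix, zone in ZONE_BY_PREFIX:
        if path.startswith(prefix):
            return zone
    return 'sapl_general'


def tokens_spent(paths, split_zones):
    """Count tokens drawn from each zone for a list of request paths."""
    zones = (zone_for(p, split_zones) for p in paths)
    return Counter(z for z in zones if z is not None)


# One HTML page plus 20 media attachments (file names are made up).
page_load = ['/materia/123/'] + [f'/media/doc{i}.pdf' for i in range(20)]
print(tokens_spent(page_load, split_zones=False))
print(tokens_spent(page_load, split_zones=True))
```

With a single zone the page load costs 21 tokens from sapl_general; with separate zones it costs 1 from sapl_general and 20 from sapl_media.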
4. Anonymous /api/ NAT Problem — Before vs After
Before
Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
│
▼
┌─────────────────────────────────────┐
│ nginx sapl_general (was shared) │
│ burst not yet exhausted → pass │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django _evaluate_anonymous │
│ │
│ INCR rl:ip:{ip}:reqs │
│ count = 120 ≥ threshold (120/m) │
│ │
│ SET rl:ip:{ip}:blocked EX 300 │◄── block key written
│ ZADD rl:index:blocked_ips │ affects ALL paths
└──────────────────┬──────────────────┘
│
▼
Next request to /materia/, /sessao/, /voto-individual/ ...
→ 429 ip_blocked (300s)
Entire org locked out of ALL SAPL pages
because of JS polling the API
After
Office — 10 staff, JS polling /api/ every 5s = 120 req/min combined
│
▼
┌─────────────────────────────────────┐
│ nginx sapl_api │
│ rate=60r/m burst=120 nodelay │
│ throttles burst, passes remainder │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Django: quota check │
│ 500 req/day not exceeded → pass │
└──────────────────┬──────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Anonymous /api/ early return │
│ │
│ rl:ip:{ip}:reqs NOT incremented │◄── no counter
│ rl:ip:{ip}:blocked NOT written │◄── no block key
└──────────────────┬──────────────────┘
│
▼
✓ View served
Page requests from same IP unaffected
5. Authenticated Rate Breach — Before vs After
Before
Authenticated user clicks rapidly: 241 requests in 60s
│
▼
┌─────────────────────────────────────┐
│ _evaluate_authenticated │
│ INCR rl:user:{uid}:reqs → 241 │
│ count ≥ 240 (auth threshold) │
│ │
│ SET rl:user:{uid}:blocked EX 300 │◄── 5-minute lockout
│ ZADD rl:index:blocked_users │
└──────────────────┬──────────────────┘
│
▼
All requests for 300s → 429 user_blocked
User must wait 5 minutes to do anything
No way to self-recover sooner
After
Authenticated user clicks rapidly: 241 requests in 60s
│
▼
┌─────────────────────────────────────┐
│ _evaluate_authenticated │
│ INCR rl:user:{uid}:reqs → 241 │
│ count ≥ 240 (auth threshold) │
│ │
│ return 429 auth_user_rate │◄── this request only
│ (no SET, no ZADD) │◄── no block key written
└──────────────────┬──────────────────┘
│
Counter TTL = 60s (auth_window)
│
▼
T+60s: rl:user:{uid}:reqs expires
User automatically recovers
No admin intervention needed
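The difference between the two behaviours can be sketched with an in-memory stand-in for the Redis keys. This is not the middleware's code: AuthLimiter and its attributes are hypothetical, time is passed in explicitly instead of relying on key TTLs, and a breach is taken as count ≥ threshold as in the diagrams above.

```python
class AuthLimiter:
    """In-memory stand-in for the per-user Redis counter and block key."""

    def __init__(self, threshold=240, window_s=60, block_ttl_s=300,
                 hard_block=False):
        self.threshold = threshold
        self.window_s = window_s
        self.block_ttl_s = block_ttl_s
        self.hard_block = hard_block
        self._window = {}         # uid -> (window_start, count)
        self._blocked_until = {}  # uid -> expiry time of the block key

    def check(self, uid, now):
        if now < self._blocked_until.get(uid, 0):
            return 'user_blocked'
        start, count = self._window.get(uid, (now, 0))
        if now - start >= self.window_s:      # counter key has expired
            start, count = now, 0
        count += 1
        self._window[uid] = (start, count)
        if count < self.threshold:
            return 'ok'
        if self.hard_block:                    # "before": write a block key
            self._blocked_until[uid] = now + self.block_ttl_s
            return 'user_blocked'
        return 'auth_user_rate'                # "after": reject, write nothing


def breach(limiter):
    """Fire 240 requests at t=10 s, then one more at t=75 s."""
    results = [limiter.check('u1', now=10) for _ in range(240)]
    return results[-1], limiter.check('u1', now=75)


print(breach(AuthLimiter(hard_block=True)))   # ('user_blocked', 'user_blocked')
print(breach(AuthLimiter(hard_block=False)))  # ('auth_user_rate', 'ok')
```

In the soft mode the user recovers as soon as the 60-second counter window rolls over; in the hard mode the same user is still locked out at t=75 s and stays so until the 300-second block expires.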
6. Enforcement Stack Per Path — Trade-off Summary
Path nginx zone Django counter Block written? Notes
────────────────────── ─────────────── ──────────────── ────────────── ──────────────────────────────
/static/* none none — disk-served, zero Django cost
/painel/<pk>/dados none none — bypass: high-freq polling
/voto-individual/* none none — bypass: live vote
/sessao/<pk>/* none none — bypass: live session
/sessao/pauta-sessao/* none none — bypass: live session
/media/* sapl_media anon IP / auth anon: yes auth gate in serve_media()
180r/m b=180 counter runs auth: no
/api/* (anonymous) sapl_api quota only no ← key change: no IP counter,
60r/m b=120 500/day — no collateral NAT block
/api/* (authenticated) sapl_api per-user 240/m no (soft) per-user, NAT-safe
60r/m b=120 counter runs
/relatorios/* sapl_heavy anon/auth runs anon: yes tight rate — PDF generation
10r/m b=20 at Django auth: no
/* (everything else) sapl_general anon/auth runs anon: yes normal page navigation
90r/m b=180 at Django auth: no auth gets 240/m soft limit
Legend:
- anon: yes means the anonymous IP gets a 300s block key on breach
- auth: no means authenticated users get 429 for that request, the window resets in 60s, and no persistent block is written
- none means no rate limiting at either layer (path is exempt)
7. The Fundamental NAT Constraint
IP-based rate limiting cannot distinguish these two scenarios:
Scenario A — Legitimate (15 users, 1 tab each, vote opens)
┌──────────────────────────────────────────────────────┐
│ User 1 ──► GET /voto-individual/ │
│ User 2 ──► GET /voto-individual/ 15 req/s │
│ ... 1 public IP │
│ User 15 ──► GET /sessao/2600/ordemdia │
└──────────────────────────────────────────────────────┘
Scenario B — Bot (1 process, 15 threads, scraping)
┌──────────────────────────────────────────────────────┐
│ Thread 1 ──► GET /materia/1/ │
│ Thread 2 ──► GET /materia/2/ 15 req/s │
│ ... 1 public IP │
│ Thread 15 ──► GET /materia/15/ │
└──────────────────────────────────────────────────────┘
To nginx and an IP-based counter: identical.
Resolution strategies applied:
┌──────────────────────────────────────────────────────────────────┐
│ Known safe high-freq paths → nginx bypass + Django bypass │
│ Authenticated users → per-user counter (uid), NAT-safe │
│ Anonymous /api/ → quota only, no IP counter │
│ Everything else (anon) → IP counter + 300s block on breach │
└──────────────────────────────────────────────────────────────────┘
Long-term: APP_ACCESS_KEYs per tenant → quota per org, not per IP
WebSocket push for voting → eliminates polling bursts
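The constraint and the per-user resolution can be shown by counting the same traffic under two different keys. In this sketch counter_key is a hypothetical helper that reuses the rl:user and rl:ip key names from this document; it assumes the Scenario A users are authenticated and the Scenario B bot is anonymous, and the bot's IP is a made-up documentation address.

```python
from collections import Counter


def counter_key(ip, uid=None):
    """Per-user key for authenticated requests, per-IP key otherwise."""
    return f"rl:user:{uid}:reqs" if uid is not None else f"rl:ip:{ip}:reqs"


NAT_IP = "200.175.17.66"
BOT_IP = "203.0.113.7"   # hypothetical address

# (ip, uid) pairs for one second of traffic in each scenario.
scenario_a = [(NAT_IP, uid) for uid in range(1, 16)]  # 15 logged-in users
scenario_b = [(BOT_IP, None) for _ in range(15)]      # 15 anonymous threads


def busiest(counts):
    return max(counts.values())


for name, reqs in (("A", scenario_a), ("B", scenario_b)):
    by_ip = Counter(ip for ip, _ in reqs)
    by_key = Counter(counter_key(ip, uid) for ip, uid in reqs)
    print(name, busiest(by_ip), busiest(by_key))
```

Keyed by IP, both scenarios peak at 15 requests on one counter. Keyed by user where a user is known, Scenario A spreads across 15 counters of 1 each while Scenario B still concentrates on a single counter.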
Pending Investigations
The following incidents may or may not share the same root cause as the PatoBranco-PR event. Each should be investigated using the OpenSearch query patterns established above and documented here.
- Other houses reporting intermittent 429s during session hours
- Azure crawler bot (52.167.144.162): 2 × 429 observed on 2026-05-06 at patobranco-pr; appears to be a legitimate Microsoft indexer hitting non-session paths; confirm it is correctly rate-limited and not causing collateral blocks on shared IPs
- Investigate whether 187.109.99.234/30 (patobranco secondary NAT range) experienced any blocks independently of 200.175.17.66/29