Sistema de Apoio ao Processo Legislativo

Rate Limiter Incidents

This document records real rate-limiting incidents, the root-cause analysis performed for each, the fixes applied, and the architectural discussion that followed. New incidents should be appended under their own section.


PatoBranco-PR — 2026-05-06

Symptom

Councilmembers reported being unable to access the voting interface during a live plenary session. The error was HTTP 429. Two blocking events occurred:

Event | Start    | Recovery | Duration
1     | 13:51:23 | ~14:01   | ~10 min
2     | 14:22:30 | ~14:25   | ~3 min

Neither duration matched the Django BLOCK_TTL of 300 seconds (event 1 lasted about twice as long, event 2 recovered well before it), which was the first diagnostic clue.

Environment

  • NAT IP: 200.175.17.66 (reported range 200.175.17.66/29)
  • Secondary range: 187.109.99.234/30
  • Peak observed: 24 requests/second from that IP (confirmed in OpenSearch, 14:22:31)
  • Paths involved: /voto-individual/, /sessao/pauta-sessao/2600/, /sessao/2600/ordemdia

Root Cause

Multiple councilmembers share a single public IP via NAT. When a vote opened, all of them reloaded their browser simultaneously. Nginx saw the combined traffic as a single client exhausting its burst bucket and returned 429 — before any request reached Django.

                     ┌─────────────────────────────────────────┐
                     │              nginx                       │
                     │                                         │
  Councilmember A ──►│  IP: 200.175.17.66                      │
  Councilmember B ──►│  IP: 200.175.17.66  ──► burst bucket    │──► 429 (bucket full)
  Councilmember C ──►│  IP: 200.175.17.66      exhausted       │
        ...          │                                         │
                     └─────────────────────────────────────────┘
                                        │
                                        │ (never reached)
                                        ▼
                     ┌─────────────────────────────────────────┐
                     │           Django middleware              │
                     │                                         │
                     │  rl:ip:<ip>:reqs     (never incremented)│
                     │  rl:user:<id>:reqs   (never incremented)│
                     │  Redis block key     (never written)    │
                     └─────────────────────────────────────────┘

Why Recovery Did Not Follow the 300-Second TTL

Django's block mechanism (_set_block()) was never triggered, so the NAT IP was never written to Redis. The 429s came entirely from nginx's limit_req allowance (sustained rate plus burst) being exhausted.

Recovery happened when the synchronized burst subsided (vote ended, users stopped reloading) and the nginx limiter regained headroom at its configured rate. No TTL expiry was involved: recovery time varied because it depended on how long the bursts continued and how depleted the allowance was at the end of each one, not on a fixed timer.

Had Django's block fired, the outage would have been exactly 300 seconds both times. The variable durations (10 min vs 3 min) confirm nginx was the sole actor.
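
To make the "no fixed timer" point concrete, here is a simplified Python model of the per-key accounting that limit_req performs with nodelay. It is a sketch, not nginx's code: the class name LeakyBucket and the values rate=5 and burst=10 are assumptions chosen for illustration, since the deployed rate and burst are not recorded in this document. The 24 simultaneous requests mirror the peak observed in OpenSearch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Simplified model of one limit_req key with `nodelay` (illustrative only)."""

    rate: float                   # sustained requests per second (assumed value)
    burst: int                    # requests tolerated above the rate (assumed value)
    excess: float = 0.0           # requests currently counted above the rate
    last: Optional[float] = None  # time of the last accepted request

    def allow(self, now: float) -> bool:
        if self.last is None:     # first request seen for this key
            self.last = now
            return True
        # Drain what leaked out since the last accepted request, then add this one.
        excess = max(self.excess - self.rate * (now - self.last) + 1.0, 0.0)
        if excess > self.burst:
            return False          # rejected: state is left untouched
        self.excess, self.last = excess, now
        return True


bucket = LeakyBucket(rate=5.0, burst=10)
burst_results = [bucket.allow(0.0) for _ in range(24)]  # 24 requests in the same instant
print(burst_results.count(True), "accepted,", burst_results.count(False), "rejected")
print("0.1 s later:", bucket.allow(0.1))
print("1.0 s later:", bucket.allow(1.0))
```

In this model the key accepts requests again as soon as traffic falls back under the sustained rate. Nothing in it behaves like a 300-second block, which is consistent with the variable outage durations.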

The Polling Source

voto_individual.html contains a setTimeout(location.reload, 30000) — the page reloads itself every 30 seconds. When councilmembers opened the voting page at roughly the same time (vote announcement), their reload timers aligned. Each 30-second tick fired a synchronized burst from all clients behind the NAT.

/sessao/<pk>/ordemdia and /sessao/pauta-sessao/<pk>/ are not polled by JavaScript — they are normal page navigations. They appeared in the burst because councilmembers navigated to them at the same moment as the vote opened.
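
The timer alignment can be illustrated with a few lines of Python. This is only a sketch under assumed numbers: 12 clients opening the page within about one second of each other is hypothetical, and requests_per_second is a helper invented for this example. Only the 30-second interval comes from voto_individual.html.

```python
from collections import Counter

RELOAD_INTERVAL = 30  # seconds, as in voto_individual.html


def requests_per_second(open_times, horizon):
    """Count page loads per whole second for clients that reload every 30 s."""
    hits = Counter()
    for opened in open_times:
        t = opened
        while t <= horizon:
            hits[int(t)] += 1
            t += RELOAD_INTERVAL
    return hits


# Hypothetical: 12 clients open the page within roughly one second of each other.
open_times = [i / 10 for i in range(12)]
hits = requests_per_second(open_times, horizon=89)
for second, count in sorted(hits.items()):
    print(f"t={second:>2}s  requests={count}")
```

Because every client keeps the offset it started with, the cluster of requests repeats at each 30-second tick instead of spreading out over time.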

Two-Layer Rate Limiting Architecture

                        ┌──────────────────────────────────────────────────────┐
  Incoming request       │  nginx                                                │
  ─────────────────────►│                                                       │
                        │  limit_req zone=sapl_general  ← IP-only, no auth     │
                        │  burst=${NGINX_BURST_GENERAL} nodelay                 │
                        │                                                       │
                        │  If bucket full → 429 immediately                    │
                        │  Redis: nothing written                               │
                        └───────────────────────┬──────────────────────────────┘
                                                │ (only if bucket has room)
                                                ▼
                        ┌──────────────────────────────────────────────────────┐
                        │  Django RateLimitMiddleware                           │
                        │                                                       │
                        │  1. Bypass check  ← RATE_LIMIT_BYPASS_PATHS          │
                        │  2. API quota check (if /api/)                       │
                        │  3. _evaluate()                                       │
                        │     a. IP block check  (Redis rl:ip:<ip>:blocked)    │
                        │     b. User block check (Redis rl:user:<id>:blocked) │
                        │     c. Rate counter    (rl:ip:<ip>:reqs)             │
                        │     d. User counter    (rl:user:<id>:reqs)           │
                        │                                                       │
                        │  If rate exceeded → SET block key (TTL=300s)         │
                        │                  → ZADD rl:index:blocked_ips         │
                        └──────────────────────────────────────────────────────┘

The core mismatch: Django tracks per-user buckets (rl:user:<id>:reqs) which are NAT-safe. Nginx tracks per-IP buckets which collapse all users behind a NAT into one. Nginx fires first, so Django's smarter per-user accounting is never consulted during a burst.
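
The difference between the two keys can be sketched in Python. A Counter stands in for Redis here, and the record function is hypothetical; only the key names come from the middleware described above. The burst of 12 users with 2 requests each is an assumed scenario that happens to add up to the 24 requests/second peak observed.

```python
from collections import Counter

counters = Counter()  # in-memory stand-in for the Redis counters


def record(ip, user_id):
    """Increment the per-IP key and, for authenticated requests, the per-user key."""
    counters[f"rl:ip:{ip}:reqs"] += 1
    if user_id is not None:
        counters[f"rl:user:{user_id}:reqs"] += 1


# Hypothetical burst: 12 authenticated users behind one NAT IP, 2 requests each.
for user_id in range(1, 13):
    for _ in range(2):
        record("200.175.17.66", user_id)

print(counters["rl:ip:200.175.17.66:reqs"])  # 24: everyone behind the NAT combined
print(counters["rl:user:1:reqs"])            # 2: one user's own traffic
```

A limiter keyed on the IP sees the combined figure, while a limiter keyed on the user sees only each person's own requests.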

Fix Applied

Added nginx location blocks for session and voting paths that pass requests through without limit_req. These regex locations take priority over the catch-all location / by nginx matching rules.

docker/config/nginx/sapl.conf:

# Voting page: proxied without limit_req, so it is exempt from the per-IP limiter.
location ~ ^/voto-individual/ {
    proxy_set_header X-Request-ID      $req_id;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host              $http_host;
    proxy_redirect off;
    proxy_pass http://sapl_server;
}

# Session pages under /sessao/<pk>...: same proxy settings, also without limit_req.
location ~ ^/sessao/\d+ {
    proxy_set_header X-Request-ID      $req_id;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host              $http_host;
    proxy_redirect off;
    proxy_pass http://sapl_server;
}

sapl/settings.py: RATE_LIMIT_BYPASS_PATHS extended to match:

RATE_LIMIT_BYPASS_PATHS = [
    r'^/painel/\d+/dados$',
    r'^/voto-individual/',
    r'^/sessao/\d+',
    r'^/sessao/pauta-sessao/\d+/',
]
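
A quick way to check which of the incident's paths these patterns cover is to run them in Python. This is a sketch: is_bypassed is a hypothetical helper, the use of re.search is an assumption about how the middleware applies the list, and /materia/123 is a made-up non-session path.

```python
import re

RATE_LIMIT_BYPASS_PATHS = [
    r'^/painel/\d+/dados$',
    r'^/voto-individual/',
    r'^/sessao/\d+',
    r'^/sessao/pauta-sessao/\d+/',
]
_compiled = [re.compile(pattern) for pattern in RATE_LIMIT_BYPASS_PATHS]


def is_bypassed(path):
    """Return True if any bypass pattern matches the request path."""
    return any(pattern.search(path) for pattern in _compiled)


for path in [
    "/voto-individual/",
    "/sessao/pauta-sessao/2600/",
    "/sessao/2600/ordemdia",
    "/materia/123",  # hypothetical path that should stay rate-limited
]:
    print(path, is_bypassed(path))

# The numeric session pattern on its own does not cover the agenda path:
print(re.search(r'^/sessao/\d+', "/sessao/pauta-sessao/2600/"))  # None
```

The last line shows why the Django list carries a separate pauta-sessao pattern. The nginx snippet above lists only the `^/voto-individual/` and `^/sessao/\d+` locations, so it may be worth confirming against the deployed sapl.conf how /sessao/pauta-sessao/<pk>/ is handled at the nginx layer.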

These paths are safe to exempt because:

  • They require an authenticated session cookie to perform any meaningful action.
  • Django's per-user rate counter still runs as a backstop.
  • The cost of a false-positive block (councilmember unable to vote) outweighs the risk of abuse on these URLs.

Architectural Discussion

Why Increasing Burst Rates Is Not the Right Fix

burst controls how many requests above the sustained rate are allowed in a spike before 429 fires. A larger burst absorbs thundering herds from NAT but also allows more requests from burst-style attackers (scanners, credential stuffers) before the throttle engages. It is the same knob, pulled in opposite directions.

The correct dimensioning formula for a legislative house would be:

burst ≥ users_behind_NAT × tabs_per_user × requests_per_page_load

For a large state assembly (90 deputies, 2 tabs each) this is at least 180 even at a single request per page load, a burst value that renders rate limiting ineffective against automated tools.
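
The arithmetic is simple enough to check directly. In this sketch the function name required_burst is hypothetical, and the figure of 3 requests per page load is an assumption for illustration. The 90 deputies and 2 tabs come from the example above.

```python
def required_burst(users_behind_nat, tabs_per_user, requests_per_page_load):
    """Lower bound on burst needed to absorb one synchronized reload."""
    return users_behind_nat * tabs_per_user * requests_per_page_load


print(required_burst(90, 2, 1))  # 180: one request per page load
print(required_burst(90, 2, 3))  # 540: assumed 3 requests per page load
```

Any static assets or extra requests triggered by a page load push the required value well past the 180 floor.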

Conclusion: burst tuning is not the right tool for this problem. Exemption of known-safe high-frequency paths is.

The Multi-Tab Problem

Staff members commonly open multiple SAPL tabs simultaneously. Each tab has its own reload timer. If 10 staff members have 3 tabs open, a synchronized event generates 30+ requests from the NAT IP even before councilmembers are counted. This means the bucket can be pre-exhausted before the voting burst even starts.

This further reinforces that IP-based rate limiting is the wrong unit for authenticated traffic in a shared-office environment.

Why Per-User Rate Limiting Does Not Fully Solve This

Django already increments both rl:ip:<ip>:reqs and rl:user:<id>:reqs. The per-user counter is NAT-safe. But it never runs during a burst because nginx drops the request first.

Moving rate limiting for authenticated users to the Django layer only (removing nginx limit_req for authenticated paths) would make per-user counting the effective control. The obstacle is that nginx cannot distinguish authenticated from anonymous requests without reading and resolving the session cookie — which requires a database or Redis lookup nginx cannot perform natively.

Architectural Solutions

Approach | Effort | Solves NAT problem | Protects against bots
Nginx bypass for known session paths (done) | Low | Yes, for bypassed paths | Yes, general paths still rate-limited
Increase NGINX_BURST_GENERAL | Trivial | Partially | Weakens bot protection
Nginx limit_req_zone keyed on session cookie string | Medium | Yes (per-session token, not per-IP) | Yes, each session has own bucket
Move all auth-path rate limiting to Django only | Medium | Yes | Depends on the Django rate being correctly tuned
Replace setTimeout(location.reload) with WebSocket/SSE push | High | Yes, eliminates synchronized reloads entirely | N/A

WebSocket / SSE Consideration

The 30-second self-reload in voto_individual.html is the synchronization mechanism that creates the thundering herd. If the server pushed state changes (vote opened, result published) instead of the client polling by reloading, the synchronized burst would not exist regardless of how many users or tabs are open.

A full WebSocket rewrite (Django Channels + Redis pub/sub channel layer) would:

  • Eliminate polling bursts on session/voting paths
  • Make vote-state updates instantaneous instead of up to 30 seconds late
  • Require nginx configuration for Upgrade: websocket proxying

The nginx bypass is the correct operational fix for now. The WebSocket rewrite is the correct architectural fix for the future. They are not substitutes — the bypass would remain useful for the initial WebSocket handshake, which is still an HTTP request subject to burst limits.
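
For the SSE variant, the server-side change amounts to writing small framed text messages to a long-lived response instead of serving full page reloads. The sketch below only shows the framing defined by the Server-Sent Events format. The function name sse_event, the event name vote_opened, and the payload fields are hypothetical, and no existing SAPL code is implied.

```python
import json


def sse_event(event, payload):
    """Frame one Server-Sent Events message: an event line, a data line, a blank line."""
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


# Hypothetical message pushed once when a vote opens in session 2600.
message = sse_event("vote_opened", {"sessao": 2600})
print(repr(message))
```

With a push model each state change is sent once per connected client, so the request rate tracks real events rather than the 30-second reload cycle.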


Pending Investigations

The following incidents may or may not share the same root cause as the PatoBranco-PR event. Each should be investigated with the same OpenSearch analysis used for that incident and documented here.

  • Other houses reporting intermittent 429s during session hours
  • Azure crawler bot (52.167.144.162) — 2 × 429 observed on 2026-05-06 at patobranco-pr; appears to be a legitimate Microsoft indexer hitting non-session paths; confirm it is correctly rate-limited and not causing collateral blocks on shared IPs
  • Investigate whether 187.109.99.234/30 (patobranco secondary NAT range) experienced any blocks independently of 200.175.17.66/29