From 6ce8d8c4eb5508b24593504f7e272b184defe450 Mon Sep 17 00:00:00 2001
From: Edward Oliveira
Date: Wed, 13 May 2026 17:07:26 -0300
Subject: [PATCH] Update RATE-LIMITER-PLAN.md with 2026-05-11 post-initial changes

- Key schema: ns-scoped API keys, unified IP-based quota, removed auth rows
- Redis budget: remove auth quota rows, add ns-scoped API rate key rows
- _handle_api section: updated step table (auth not checked), settings table (threshold 120, block TTL 60 s, quota 100k/700k), files changed
- Decision log: 7 new entries covering auth parity, namespace isolation, threshold/TTL/quota tuning

Co-Authored-By: Claude Sonnet 4.6
---
 plan/RATE-LIMITER-PLAN.md | 85 ++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 41 deletions(-)

diff --git a/plan/RATE-LIMITER-PLAN.md b/plan/RATE-LIMITER-PLAN.md
index 121a616dc..c881d1122 100644
--- a/plan/RATE-LIMITER-PLAN.md
+++ b/plan/RATE-LIMITER-PLAN.md
@@ -114,6 +114,8 @@ graph TD
 | NS/IP/window counter | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 1 | ~0.6 MB |
 | API daily quota (all callers, by IP) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 1 | negligible |
 | API weekly quota (all callers, by IP) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 1 | negligible |
+| API rate counter (ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:reqs` | 60 s | 1 | negligible |
+| API block marker (ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:blocked` | 60 s | 1 | negligible |
 | Redis overhead (× 1.5) | | | | ~1.6 GB |
 | **Total ceiling** | | | | **~5 GB** |

@@ -138,11 +140,12 @@ graph TD
 | Auth rate breach: no persistent block (2026-05-07) | **429 per-request only**, window resets after 60 s | A 300 s lockout is the wrong penalty for a logged-in user who clicked too fast; persistent block is appropriate for anonymous/bot traffic only |
 | Raise rate thresholds (2026-05-07) | anon 35→120/m · auth 120→240/m · 404 threshold 10→20 | SAPL pages fire 12–45 parallel requests; old thresholds blocked normal navigation for users in offices with multiple open tabs |
 | API quota increase (2026-05-07) | anon 50→500/day · auth 1 000→5 000/day | Previous anon quota of 50/day was exhausted by a developer testing the API before lunch |
-| /api/ rate limiter (2026-05-11) | `_handle_api` pipeline replaces `ApiEmergencySameSiteOnlyMiddleware` | Hard 403-block replaced with graduated rate limiting; same-origin pass; NAT-safe API-only block key |
-| Auth parity on /api/ (2026-05-11) | Auth users subject to same 35/min cap and 1 000/day quota as anon | Authenticating must not bypass /api/ rate controls; both caller types keyed by IP |
-| API threshold (2026-05-11) | 60 → 35 req/min | Forces sane polling intervals; 35/min is still well above any legitimate use case |
-| API quota recalibrated (2026-05-11) | 500/day · 3 500/week → 1 000/day · 7 000/week | Old 500/day was exhausted in ~14 min at 35/min; new cap targets slow-drip scrapers (10–20 req/min all day) |
-| API key namespace (2026-05-11) | `rl:api:ip:{ip}:*` → `rl:api:ns:{ns}:ip:{ip}:*` | Block in one k8s tenant must not leak to other tenants sharing the same Redis |
+| Auth not exempt from `/api/` rate limit (2026-05-11) | **All callers keyed by IP** — auth status not checked | Authenticating must not bypass the per-minute cap; `_evaluate` (240/min per-user) still governs non-`/api/` paths |
+| Auth-specific API quotas removed (2026-05-11) | **Single `API_QUOTA_DAILY/WEEKLY`** for all callers by IP | Per-user quota added false precision; IP-based cap is sufficient alongside the per-minute block |
+| API rate limit keys namespaced (2026-05-11) | `rl:api:ns:{ns}:ip:{ip}:reqs/blocked` | Without `{ns}` a block in one k8s pod namespace leaked into all tenants sharing the same Redis instance |
+| API threshold raised (2026-05-11) | 60→120 req/min | Aligns with legitimate integration patterns; slow-drip abuse is caught by the daily quota |
+| API block TTL reduced (2026-05-11) | 300→60 s | Shorter cooldown reduces false-positive lockout duration for shared IPs |
+| API quota raised (2026-05-11) | 1 000→100 000/day · 7 000→700 000/week | Quota serves as outer envelope for all-day slow scrapers; 1 000/day was exhausted too quickly for legitimate integrations |

 ---

@@ -663,8 +666,8 @@ Decision flow inside `RateLimitMiddleware.__call__()` / `_evaluate()`:

 ```
 0. /api/
- path AND IP daily/weekly quota exceeded? → 429 reason=quota_daily / quota_weekly
- (all callers keyed by IP regardless of auth status; fail-open when Redis unavailable)
+ path AND consumer daily/weekly quota exceeded? → 429 reason=quota_daily / quota_weekly
+ (per-consumer: auth users by pk, anon by masked IP; fail-open when Redis unavailable)
 1. IP in allowlist? → pass (no further checks)
 1a. UA matches BOT_UA_FRAGMENTS list? → 429 reason=known_ua
 1b. UA token hash in rl:bot:ua:blocked SET? → 429 reason=redis_ua
@@ -689,7 +692,7 @@ flowchart TD
 REQ([Request]) --> C0

 C0{"/api/ path AND\ndaily/weekly quota exceeded?"}
- C0 -- "yes — quota:{ns}:daily/weekly:{period}:ip exceeded" --> R_QUOTA([429\nquota_daily / quota_weekly])
+ C0 -- "yes — quota:{ns}:daily/weekly:{period}:user/ip exceeded" --> R_QUOTA([429\nquota_daily / quota_weekly])
 C0 -- no --> C1

 C1{"Known bot UA?"}
@@ -746,7 +749,7 @@ Roll out to canary pods first; promote check-by-check in order of false-positive
 | Order | Check | Reason | Risk | Condition to promote |
 |-------|-------|--------|------|---------------------|
 | nginx | scanner extensions | `return 444` in `sapl.conf` for `.php`/`.asp`/etc. | Zero | Gunicorn never sees these requests |
-| 0th | `quota_daily` / `quota_weekly` | Per-IP daily/weekly cap on `/api/` paths (all callers) | Low | Limits set above per-minute rate (1 000/day · 7 000/week) |
+| 0th | `quota_daily` / `quota_weekly` | Per-consumer daily/weekly cap on `/api/` paths | Low | Limits set well above per-minute rate (500/day anon, 5 000/day auth) |
 | 1st | `known_ua` | Substring in hardcoded `BOT_UA_FRAGMENTS` list | Zero | UA strings are deterministic |
 | 2nd | `redis_ua` | Token hash in `rl:bot:ua:blocked` SET | Zero | Keys only set manually by operators |
 | 3rd | `ip_blocked` | Marker set by prior proven-bad requests | Zero | Fast-path only, no new blocks created |
@@ -913,15 +916,15 @@ Django's per-user counter (NAT-safe) is never consulted.
 (throttles sustained traffic)
 │
 ▼
- Django quota check: 1 000/day not exceeded → pass
- _handle_api: 35/min counter incremented (IP-keyed, all callers)
+ Django quota check: 500/day not exceeded → pass
+ Anonymous /api/: early return, no _evaluate()
 rl:ip:{ip}:reqs NOT incremented
 rl:ip:{ip}:blocked NOT written
 │
 ▼
 Page requests from same IP: unaffected ✓
- Worst case: 35/min threshold → rl:api:ns:{ns}:ip:{ip}:blocked
- → only /api/ access blocked, pages still work
+ Worst case: 500 API req/day quota exhausted
+ → only API access blocked, pages still work
 ```

 ---
@@ -959,17 +962,18 @@ Path nginx zone Django Block key? Notes
 /sessao//* none (bypass) none (bypass) — live session
 /media/* sapl_media anon counter anon: yes auth gate in serve_media
 180r/m b=180 runs auth: no
-/api/* sapl_api 35/min + quota api-only ← rl:api:ns:{ns}:ip:*
- 60r/m b=120 1 000/day never rl:ip:*:blocked
+/api/* (anonymous) sapl_api quota only no ← no IP counter; no
+ 60r/m b=120 500/day collateral NAT block
+/api/* (auth) sapl_api per-user 240/m no (soft) per-uid, NAT-safe
+ 60r/m b=120 counter runs
 /relatorios/* sapl_heavy counter runs anon: yes tight — PDF generation
 10r/m b=20
 /* (everything else) sapl_general counter runs anon: yes normal browsing
 90r/m b=180 auth: no auth: 429, resets in 60s
 ```

-`anon: yes` — anonymous IP gets a 300 s block key on breach (all paths locked)
-`auth: no` — authenticated users get 429 for that request; counter expires in 60 s
-`api-only` — only `rl:api:ns:{ns}:ip:{ip}:blocked` written; global `rl:ip:{ip}:blocked` never touched
+`anon: yes` — anonymous IP gets a 300s block key on breach (all paths locked)
+`auth: no` — authenticated users get 429 for that request; counter expires in 60s

 ---

@@ -999,10 +1003,9 @@ Path nginx zone Django Block key? Notes
 Mitigations applied
 ┌──────────────────────────────────────────────────────────────────┐
 │ Known safe high-freq paths → bypass at both layers │
- │ Authenticated users (non-api) → per-user counter (uid), NAT-safe│
- │ /api/* (all callers) → api-only block key, no global │
- │ IP counter; pages unaffected │
- │ Everything else (anon) → IP counter + 300 s block │
+ │ Authenticated users → per-user counter (uid), NAT-safe │
+ │ Anonymous /api/ → quota only, no IP counter │
+ │ Everything else (anon) → IP counter + 300s block │
 └──────────────────────────────────────────────────────────────────┘

 Long-term
@@ -1206,11 +1209,11 @@ Redis PDF caching would solve "high request volume reaching the file layer" —
 | 1 | Path counter (`/media/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — (observability only) | `RL_PATH_REQUESTS` |
 | 1 | Path counter (`/static/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — | *Future* (requires OpenResty/Lua) |
 | 1 | UA deny list | `rl:bot:ua:blocked` | permanent SET | — (block on match) | `RL_UA_BLOCKLIST` |
-| 1 | API daily quota (all callers, by IP) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 1 000 (`API_QUOTA_DAILY`) | `QUOTA_IP_DAILY` |
-| 1 | API weekly quota (all callers, by IP) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 7 000 (`API_QUOTA_WEEKLY`) | `QUOTA_IP_WEEKLY` |
-| 1 | API IP rate counter (all callers) | `rl:api:ns:{ns}:ip:{ip}:reqs` | 60 s (`API_RATE_LIMIT_WINDOW_SECONDS`) | 35 (`API_RATE_LIMIT_THRESHOLD`) | `RL_API_IP_REQUESTS` |
-| 1 | API IP block marker (all callers) | `rl:api:ns:{ns}:ip:{ip}:blocked` | 300 s (`API_RATE_LIMIT_BLOCK_SECONDS`) | — | `RL_API_IP_BLOCKED` |
-| 1 | API blocked-IP ZSET index | `rl:index:api:blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_API_BLOCKED_IPS` |
+| 1 | API daily quota (all callers, by IP) | `quota:{ns}:daily:{date}:ip:{ip}` | 24 h | 100 000 (`API_QUOTA_DAILY`) | `QUOTA_IP_DAILY` |
+| 1 | API weekly quota (all callers, by IP) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 700 000 (`API_QUOTA_WEEKLY`) | `QUOTA_IP_WEEKLY` |
+| 1 | API IP rate counter (all callers, ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:reqs` | 60 s (`API_RATE_LIMIT_WINDOW_SECONDS`) | 120 (`API_RATE_LIMIT_THRESHOLD`) | `RL_API_IP_REQUESTS` |
+| 1 | API IP block marker (ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:blocked` | 60 s (`API_RATE_LIMIT_BLOCK_SECONDS`) | — | `RL_API_IP_BLOCKED` |
+| 1 | API blocked-IP ZSET index | `rl:index:api_blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_API_BLOCKED_IPS` |
 | 2 | Django Channels | `channels:*` | session TTL | — | *Future* |

 ### What each counter catches — and misses
@@ -1356,15 +1359,15 @@ insufficient:
 | 1 | `OPTIONS` method | Pass — CORS preflight must never be blocked |
 | 2 | Same-origin (`_is_same_origin`) | Pass — SAPL's own browser polling; no counter |
 | 3 | `rl:ip::blocked` exists | 429 `global_ip_blocked` — global block also covers `/api/` |
-| 4 | `rl:api:ns:{ns}:ip:{ip}:blocked` exists | 429 `api_ip_blocked` — tenant-scoped API-only block |
-| 5 | Daily/weekly quota exceeded (IP-keyed, all callers) | 429 `quota_daily` / `quota_weekly` |
-| 6 | API counter ≥ 35/min (all callers, auth status not checked) | Write `rl:api:ns:{ns}:ip:{ip}:blocked`; 429 `api_threshold_exceeded` |
-| — | Otherwise | Pass |
+| 4 | `rl:api:ns::ip::blocked` exists | 429 `api_ip_blocked` — API-only, tenant-scoped block |
+| 5 | Daily/weekly quota exceeded | 429 `quota_daily` / `quota_weekly` |
+| 6 | API counter ≥ threshold (all callers) | Write `rl:api:ns::ip::blocked`; 429 `api_threshold_exceeded` |
+| — | Under threshold | Pass |

-**Key invariants**:
-- `rl:ip::blocked` is **never written** because of `/api/` abuse — page requests from the same NAT are unaffected.
-- Auth status is not checked at any step — authenticating cannot bypass the 35/min cap or the daily quota.
-- Block keys include `{ns}` — a block in one k8s tenant does not affect other tenants.
+Auth status is **not checked**. Authenticated and anonymous callers are treated identically — both keyed by IP, both subject to the same threshold and quota. `_evaluate` (240/min per-user) still governs all non-`/api/` paths.
+
+**Key invariant**: `rl:ip::blocked` is **never written** because of `/api/` abuse.
+`rl:api:ns::ip::blocked` is tenant-scoped and blocks only `/api/` — page requests from the same NAT continue, and a block in one k8s namespace does not affect other tenants.

 ### Same-origin detection — `_is_same_origin`

@@ -1385,20 +1388,20 @@ header means the browser knows this is cross-origin, regardless of what `Referer
 | Setting | Env var | Default | Purpose |
 |---------|---------|---------|---------|
 | `API_RATE_LIMIT_ENABLED` | same | `True` | Master switch; set False to revert to quota-only |
-| `API_RATE_LIMIT_THRESHOLD` | same | `35` | Requests per window before API block (all callers) |
+| `API_RATE_LIMIT_THRESHOLD` | same | `120` | Requests per window before API block |
 | `API_RATE_LIMIT_WINDOW_SECONDS` | same | `60` | Counter TTL (seconds) |
-| `API_RATE_LIMIT_BLOCK_SECONDS` | same | `300` | `rl:api:ns:{ns}:ip:{ip}:blocked` TTL |
+| `API_RATE_LIMIT_BLOCK_SECONDS` | same | `60` | `rl:api:ns::ip::blocked` TTL |
 | `API_RATE_LIMIT_SAME_ORIGIN_BYPASS` | same | `True` | Disable to test without same-origin pass |
-| `API_QUOTA_DAILY` | same | `1000` | Daily cap per IP (all callers) |
-| `API_QUOTA_WEEKLY` | same | `7000` | Weekly cap per IP (all callers) |
+| `API_QUOTA_DAILY` | same | `100 000` | Daily call cap per IP (all callers) |
+| `API_QUOTA_WEEKLY` | same | `700 000` | Weekly call cap per IP (7× daily) |

 ### Files changed

 | File | Change |
 |------|--------|
 | `sapl/middleware/api_emergency_block.py` | Deleted |
-| `sapl/settings.py` | Removed `ApiEmergencySameSiteOnlyMiddleware` from `MIDDLEWARE`; added `API_RATE_LIMIT_*` settings; added `API_QUOTA_DAILY` / `API_QUOTA_WEEKLY` (replaces `API_QUOTA_ANON_*` and `API_QUOTA_AUTH_*`); threshold default 35; quota defaults 1 000/7 000 |
-| `sapl/middleware/ratelimit.py` | Added `RL_API_IP_REQUESTS`, `RL_API_IP_BLOCKED` (both ns-scoped), `RL_INDEX_API_BLOCKED_IPS` constants; added `_is_same_origin`; extended `__init__`; added `_handle_api`, `_api_block_response`; refactored `__call__`; `_check_api_quota` uses IP key for all callers; auth users no longer delegate to `_evaluate` for `/api/` |
+| `sapl/settings.py` | Removed `ApiEmergencySameSiteOnlyMiddleware` from `MIDDLEWARE`; added `API_RATE_LIMIT_*` and `API_QUOTA_*` settings; auth-specific quota settings removed; threshold 60→120; block TTL 300→60 s; quota 1 000→100 000/day |
+| `sapl/middleware/ratelimit.py` | Added `RL_API_IP_REQUESTS`, `RL_API_IP_BLOCKED` (both ns-scoped), `RL_INDEX_API_BLOCKED_IPS` constants; added `_is_same_origin`; extended `__init__`; added `_handle_api`, `_api_block_response`; auth check removed from `_handle_api` and `_check_api_quota` — all callers keyed by IP |
 | `sapl/middleware/test_ratelimiter.py` | Extended `_make_middleware`; added 17 new tests |

 ---