
Shard blocked-IP ZSET indexes and add inline pruning

- Add RATE_LIMITER_INDEX_SHARDS setting (default 3): each blocked-IP write
  routes to rl:index:blocked_ips:{shard} via md5(ip) % N, distributing
  write contention across N keys.
- _BLOCK_LUA now runs ZREMRANGEBYSCORE before ZADD, pruning expired entries
  from the target shard inline. Each write trims its shard back to
  active-only members; no separate maintenance job needed.
- _index_shard(ip, index_base) computes the sharded key; all four _set_block
  call sites updated.
- Fix 5 pre-existing test failures: suspicious-headers tests needed
  HTTP_USER_AGENT removed; auth_user_rate block assertion corrected (no
  persistent block key by design); ip_rate / ua_rotation tests now mock
  _set_block directly instead of checking mock_cache.set.
- Update RATE-LIMITER-PLAN.md: key schema table, Redis CLI examples, and
  ZSET index description reflect sharded keys and inline pruning.
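
The shard routing described in the first bullet can be sketched as below. This is a minimal illustration rather than the committed code: `shard_key` and its explicit `num_shards` parameter are hypothetical stand-ins for `_index_shard`, which reads `settings.RATE_LIMITER_INDEX_SHARDS` instead, and the guard against a shard count below 1 is an added assumption.

```python
import hashlib


def shard_key(ip: str, index_base: str, num_shards: int = 3) -> str:
    """Return the sharded index key for an IP (hypothetical helper)."""
    if num_shards < 1:
        # Assumption: reject 0 or negative counts instead of failing on the modulo.
        raise ValueError("num_shards must be at least 1")
    # md5 serves only as a stable, well-spread hash here, not for security.
    digest = hashlib.md5(ip.encode()).hexdigest()
    shard = int(digest, 16) % num_shards
    return f"{index_base}:{shard}"


key = shard_key("1.2.3.4", "rl:index:blocked_ips")
```

Because the result depends on `num_shards`, changing the shard count can route the same IP to a different key.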

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rate-limiter-2026
Edward Ribeiro 3 days ago
parent
commit
6425354e34
  1. plan/RATE-LIMITER-PLAN.md (34)
  2. sapl/middleware/ratelimit.py (41)
  3. sapl/middleware/test_ratelimiter.py (69)
  4. sapl/settings.py (5)

34
plan/RATE-LIMITER-PLAN.md

@ -105,7 +105,7 @@ graph TD
| Static cache (images/logos) | `static:{ns}:{sha256}` | 3–24 h | 0 | ~2.4 GB |
| IP request counter | `rl:ip:{ip}:reqs` | 60 s | 1 | ~0.6 MB |
| IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | 1 | ~0.06 MB |
| Blocked-IP index | `rl:index:blocked_ips` | permanent ZSET | 1 | ~0.01 MB |
| Blocked-IP index | `rl:index:blocked_ips:{0..N-1}` | self-pruning ZSET (N=3) | 1 | ~0.01 MB |
| User request counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 1 | negligible |
| User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | 1 | negligible |
| Blocked-user index | `rl:index:blocked_users` | permanent ZSET | 1 | negligible |
@ -374,19 +374,25 @@ rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
Via port-forward (local machine — run `kubectl port-forward svc/redis -n sapl-redis 6379:6379` first):
```bash
# All active blocked IPs via ZSET index (O(log N), no SCAN)
# All active blocked IPs via sharded ZSET index (O(log N), no SCAN)
# IPs are distributed across N shards (default 3) via md5(ip) % N.
NOW=$(date +%s)
redis-cli -n 1 ZRANGEBYSCORE rl:index:blocked_ips $NOW +inf WITHSCORES
for i in 0 1 2; do
redis-cli -n 1 ZRANGEBYSCORE rl:index:blocked_ips:$i $NOW +inf WITHSCORES
done
# All active blocked users via ZSET index
redis-cli -n 1 ZRANGEBYSCORE rl:index:blocked_users $NOW +inf WITHSCORES
# Count of currently active blocked IPs
redis-cli -n 1 ZCOUNT rl:index:blocked_ips $NOW +inf
# Count of currently active blocked IPs (sum across shards)
for i in 0 1 2; do
redis-cli -n 1 ZCOUNT rl:index:blocked_ips:$i $NOW +inf
done
# Prune expired entries from both indexes (safe to run anytime)
redis-cli -n 1 ZREMRANGEBYSCORE rl:index:blocked_ips 0 $((NOW - 1))
redis-cli -n 1 ZREMRANGEBYSCORE rl:index:blocked_users 0 $((NOW - 1))
# Pruning is automatic — each _set_block write prunes expired entries from its
# shard inline (ZREMRANGEBYSCORE inside _BLOCK_LUA). Manual pruning no longer needed.
# To delete the legacy unsharded keys (harmless; their entries all go stale within one BLOCK_TTL after deploy):
redis-cli -n 1 DEL rl:index:blocked_ips rl:index:api_blocked_ips
# Legacy: blocked IPs with value and remaining TTL (still works; slower on large key spaces)
redis-cli -n 1 --scan --pattern 'rl:ip:*:blocked' | while read key; do
@ -1200,7 +1206,7 @@ Redis PDF caching would solve "high request volume reaching the file layer" —
| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
| 1 | IP 404 counter | `rl:ip:{ip}:404s` | 60 s | 20 (`RATE_LIMIT_404_THRESHOLD`) | `RL_IP_404S` |
| 1 | IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | — | `RL_IP_BLOCKED` |
| 1 | Blocked-IP ZSET index | `rl:index:blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_BLOCKED_IPS` |
| 1 | Blocked-IP ZSET index | `rl:index:blocked_ips:{0..N-1}` | self-pruning ZSET, score=expiry ts, N=`RATE_LIMITER_INDEX_SHARDS` (default 3) | — | `RL_INDEX_BLOCKED_IPS` |
| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 240 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — *(not written on rate breach; window resets naturally)* | `RL_USER_BLOCKED` |
| 1 | Blocked-user ZSET index | `rl:index:blocked_users` | permanent ZSET, score=expiry ts | — *(not written on rate breach)* | `RL_INDEX_BLOCKED_USERS` |
@ -1212,7 +1218,7 @@ Redis PDF caching would solve "high request volume reaching the file layer" —
| 1 | API weekly quota (all callers, by IP) | `quota:{ns}:weekly:{week}:ip:{ip}` | 7 d | 700 000 (`API_QUOTA_WEEKLY`) | `QUOTA_IP_WEEKLY` |
| 1 | API IP rate counter (all callers, ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:reqs` | 60 s (`API_RATE_LIMIT_WINDOW_SECONDS`) | 120 (`API_RATE_LIMIT_THRESHOLD`) | `RL_API_IP_REQUESTS` |
| 1 | API IP block marker (ns-scoped) | `rl:api:ns:{ns}:ip:{ip}:blocked` | 60 s (`API_RATE_LIMIT_BLOCK_SECONDS`) | — | `RL_API_IP_BLOCKED` |
| 1 | API blocked-IP ZSET index | `rl:index:api_blocked_ips` | permanent ZSET, score=expiry ts | — | `RL_INDEX_API_BLOCKED_IPS` |
| 1 | API blocked-IP ZSET index | `rl:index:api_blocked_ips:{0..N-1}` | self-pruning ZSET, score=expiry ts, N=`RATE_LIMITER_INDEX_SHARDS` (default 3) | — | `RL_INDEX_API_BLOCKED_IPS` |
| 2 | Django Channels | `channels:*` | session TTL | — | *Future* |
### What each counter catches — and misses
@ -1317,13 +1323,13 @@ pre-warming or public interest event).
---
**`rl:index:blocked_ips` / `rl:index:blocked_users` — ZSET enumeration indexes**
**`rl:index:blocked_ips:{0..N-1}` / `rl:index:blocked_users` — ZSET enumeration indexes**
Written atomically alongside every block-key write via `_BLOCK_LUA` (Lua: `SET key 1 EX ttl` + `ZADD index expire_ts key`). Score = unix expiry timestamp.
Written atomically alongside every block-key write via `_BLOCK_LUA` (Lua: `SET key 1 EX ttl` + `ZREMRANGEBYSCORE index -inf now-1` + `ZADD index expire_ts key`). Score = unix expiry timestamp. IPs are routed to a shard via `md5(ip) % N` (default N=3, configurable via `RATE_LIMITER_INDEX_SHARDS`).
Catches: gives monitoring and admin tooling an O(log N) view of all active blocks — `ZRANGEBYSCORE index <now> +inf` — without a fleet-wide `SCAN` that would block Redis during large key spaces. Also enables fast `ZCOUNT` for alerting on block-rate spikes.
Catches: gives monitoring and admin tooling an O(log N) view of all active blocks — `ZRANGEBYSCORE index:<shard> <now> +inf` across all shards — without a fleet-wide `SCAN`. Distributes write contention across N keys. Inline `ZREMRANGEBYSCORE` keeps each shard bounded to active-only entries (no unbounded growth).
Misses: stale entries (blocks that expired naturally) accumulate in the ZSET because Redis does not auto-remove ZSET members when the referenced key expires. Prune periodically with `ZREMRANGEBYSCORE index 0 <now-1>`. The fallback path (Redis unavailable) skips the ZADD — the actual block key is still set via `cache.set`, but the index entry is lost for that event.
Misses: the fallback path (Redis unavailable) skips the ZADD — the actual block key is still set via `cache.set`, but the index entry is lost for that event. Querying all blocked IPs requires iterating all N shards.
---
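
Since querying all blocked IPs now means visiting every shard, admin tooling could wrap the loop in a helper along these lines. This is a sketch under assumptions: `active_blocks` is a hypothetical name, and `client` is assumed to be a redis-py style connection (for example the object returned by `get_redis_connection('ratelimit')`) whose `zrangebyscore` accepts `withscores=True`.

```python
import time


def active_blocks(client, index_base, num_shards=3, now=None):
    """Collect (member, expiry) pairs with expiry >= now from every shard."""
    now = int(time.time()) if now is None else now
    results = []
    for shard in range(num_shards):
        key = f"{index_base}:{shard}"
        # Same query as the CLI example: ZRANGEBYSCORE <key> <now> +inf WITHSCORES
        results.extend(client.zrangebyscore(key, now, "+inf", withscores=True))
    return results
```

A call such as `active_blocks(client, "rl:index:blocked_ips")` would then return the combined list; `num_shards` has to match the value of `RATE_LIMITER_INDEX_SHARDS` used by the writers, otherwise some shards are skipped.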

41
sapl/middleware/ratelimit.py

@ -115,16 +115,34 @@ _INCR_LUA = """
return n
"""
# Atomically write a block key and record it in the ZSET index in one round-trip.
# KEYS[1] = block key KEYS[2] = index key
# Atomically write a block key and record it in the sharded ZSET index.
# Prunes expired entries from the target shard before inserting the new one,
# keeping each shard bounded to only active blocks (amortised O(1) cleanup).
# KEYS[1] = block key KEYS[2] = shard index key
# ARGV[1] = ttl (seconds) ARGV[2] = expiry unix timestamp (now + ttl)
# ARGV[3] = current unix timestamp (for pruning: remove score < now)
_BLOCK_LUA = """
local now = tonumber(ARGV[3])
redis.call('SET', KEYS[1], '1', 'EX', ARGV[1])
redis.call('ZREMRANGEBYSCORE', KEYS[2], '-inf', now - 1)
redis.call('ZADD', KEYS[2], ARGV[2], KEYS[1])
return 1
"""
def _index_shard(ip, index_base):
"""
Return the sharded ZSET key for an IP.
IPs are distributed across RATE_LIMITER_INDEX_SHARDS shards using
md5(ip) % N, spreading write contention and bounding each shard's size.
The mapping is deterministic: the same IP always routes to the same shard.
"""
n = settings.RATE_LIMITER_INDEX_SHARDS
shard = int(hashlib.md5(ip.encode()).hexdigest(), 16) % n
return f'{index_base}:{shard}'
def make_ratelimit_cache_key(key, key_prefix, version):
"""
Pass-through cache key function for the 'ratelimit' Django cache backend.
@ -273,16 +291,17 @@ def _incr_with_ttl(key, ttl):
def _set_block(block_key, index_key, ttl):
"""
Atomically set a block key (with TTL) and record it in a ZSET index.
Score = expiry unix timestamp so the index can be pruned with
ZREMRANGEBYSCORE <index_key> 0 <now>.
Atomically set a block key (with TTL) and record it in a sharded ZSET index.
Score = expiry unix timestamp. Prunes expired entries from the target shard
before inserting (inline cleanup; no separate maintenance job needed).
Falls back to a plain cache.set when Redis is unavailable (index skipped).
"""
expire_at = int(time.time()) + ttl
now = int(time.time())
expire_at = now + ttl
try:
from django_redis import get_redis_connection
client = get_redis_connection('ratelimit')
client.eval(_BLOCK_LUA, 2, block_key, index_key, ttl, expire_at)
client.eval(_BLOCK_LUA, 2, block_key, index_key, ttl, expire_at, now)
except Exception:
caches['ratelimit'].set(block_key, 1, timeout=ttl)
@ -430,7 +449,7 @@ class RateLimitMiddleware:
if self.api_rate_limit_enabled:
count = self._incr_with_ttl(RL_API_IP_REQUESTS.format(ns=_NAMESPACE, ip=ip), self.api_window)
if count >= self.api_threshold:
_set_block(RL_API_IP_BLOCKED.format(ns=_NAMESPACE, ip=ip), RL_INDEX_API_BLOCKED_IPS, self.api_block_seconds)
_set_block(RL_API_IP_BLOCKED.format(ns=_NAMESPACE, ip=ip), _index_shard(ip, RL_INDEX_API_BLOCKED_IPS), self.api_block_seconds)
logger.warning(
'api_rate_limit_block reason=api_threshold_exceeded '
'ip=%s path=%s user_agent=%s count=%s threshold=%s',
@ -498,7 +517,7 @@ class RateLimitMiddleware:
# Check 4b: IP request rate
count = self._incr_with_ttl(RL_IP_REQUESTS.format(ip=ip), ttl=self.anon_window)
if count >= self.anon_threshold:
_set_block(RL_IP_BLOCKED.format(ip=ip), RL_INDEX_BLOCKED_IPS, self.BLOCK_TTL)
_set_block(RL_IP_BLOCKED.format(ip=ip), _index_shard(ip, RL_INDEX_BLOCKED_IPS), self.BLOCK_TTL)
return {'action': 'block', 'reason': 'ip_rate', 'ip': ip}
# Check 4c: per-namespace/IP/window (catches UA rotators behind NAT)
@ -508,7 +527,7 @@ class RateLimitMiddleware:
ttl=self.anon_window * 2,
)
if count >= self.anon_threshold:
_set_block(RL_IP_BLOCKED.format(ip=ip), RL_INDEX_BLOCKED_IPS, self.BLOCK_TTL)
_set_block(RL_IP_BLOCKED.format(ip=ip), _index_shard(ip, RL_INDEX_BLOCKED_IPS), self.BLOCK_TTL)
return {'action': 'block', 'reason': 'ua_rotation', 'ip': ip}
return {'action': 'pass', 'ip': ip}
@ -530,7 +549,7 @@ class RateLimitMiddleware:
return
count = self._incr_with_ttl(RL_IP_404S.format(ip=ip), ttl=self.anon_window)
if count >= self.not_found_threshold:
_set_block(RL_IP_BLOCKED.format(ip=ip), RL_INDEX_BLOCKED_IPS, self.BLOCK_TTL)
_set_block(RL_IP_BLOCKED.format(ip=ip), _index_shard(ip, RL_INDEX_BLOCKED_IPS), self.BLOCK_TTL)
logger.warning(
'ratelimit_block layer=django reason=404_scan ip=%s path=%s namespace=%s',
ip, request.path, _NAMESPACE,
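
The prune-then-insert behaviour that `_BLOCK_LUA` applies to a single shard can be modelled without Redis. The sketch below is only an illustration: `block_and_index` is a hypothetical function, a plain dict of member to score stands in for the ZSET, and the `SET ... EX` half of the script is left out.

```python
def block_and_index(index: dict, block_key: str, ttl: int, now: int) -> dict:
    """Model the prune-then-insert step on one shard (dict used as a ZSET)."""
    # ZREMRANGEBYSCORE shard -inf (now - 1): drop members whose score is <= now - 1.
    live = {member: score for member, score in index.items() if score > now - 1}
    # ZADD shard (now + ttl) block_key: insert the new block or refresh its expiry.
    live[block_key] = now + ttl
    return live


shard = {"rl:ip:10.0.0.1:blocked": 900, "rl:ip:10.0.0.2:blocked": 1200}
shard = block_and_index(shard, "rl:ip:10.0.0.3:blocked", ttl=300, now=1000)
```

In this model the entry with score 900 is dropped because it expired before `now`, while the entry expiring at 1200 survives alongside the new one. Pruning only runs when a shard receives a write, so a shard that sees no new blocks keeps its expired members until the next write reaches it.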

69
sapl/middleware/test_ratelimiter.py

@ -13,6 +13,7 @@ from django.test import RequestFactory
from sapl.middleware.ratelimit import (
_NAMESPACE,
_index_shard,
_is_same_origin,
_is_suspicious_headers,
_parse_rate,
@ -21,6 +22,7 @@ from sapl.middleware.ratelimit import (
RateLimitMiddleware,
RL_API_IP_BLOCKED,
RL_API_IP_REQUESTS,
RL_INDEX_BLOCKED_IPS,
RL_IP_BLOCKED,
RL_USER_BLOCKED,
smart_key,
@ -100,6 +102,7 @@ def _make_middleware(
mock_settings.API_RATE_LIMIT_WINDOW_SECONDS = api_window
mock_settings.API_RATE_LIMIT_BLOCK_SECONDS = api_block_seconds
mock_settings.API_RATE_LIMIT_SAME_ORIGIN_BYPASS = api_same_origin_bypass
mock_settings.RATE_LIMITER_INDEX_SHARDS = 3
with (
patch('sapl.middleware.ratelimit.caches') as mock_caches,
@ -228,6 +231,45 @@ def test_smart_rate_auth_returns_auth_rate():
assert smart_rate(None, _auth_req()) == '120/m'
# ---------------------------------------------------------------------------
# _index_shard — sharded ZSET key routing
# ---------------------------------------------------------------------------
def test_index_shard_is_deterministic():
"""Same IP always maps to the same shard."""
from sapl.middleware.ratelimit import _index_shard
with patch('sapl.middleware.ratelimit.settings') as mock_s:
mock_s.RATE_LIMITER_INDEX_SHARDS = 3
key1 = _index_shard('1.2.3.4', 'rl:index:blocked_ips')
key2 = _index_shard('1.2.3.4', 'rl:index:blocked_ips')
assert key1 == key2
def test_index_shard_stays_within_range():
"""Shard suffix is always 0 … N-1."""
from sapl.middleware.ratelimit import _index_shard
import re
with patch('sapl.middleware.ratelimit.settings') as mock_s:
mock_s.RATE_LIMITER_INDEX_SHARDS = 3
ips = [f'10.0.0.{i}' for i in range(50)]
for ip in ips:
key = _index_shard(ip, 'rl:index:blocked_ips')
m = re.search(r':(\d+)$', key)
assert m and 0 <= int(m.group(1)) < 3, f'out-of-range shard for {ip}: {key}'
def test_index_shard_distributes_across_shards():
"""With enough IPs, all 3 shards are used."""
from sapl.middleware.ratelimit import _index_shard
with patch('sapl.middleware.ratelimit.settings') as mock_s:
mock_s.RATE_LIMITER_INDEX_SHARDS = 3
shards_seen = {
_index_shard(f'192.168.{i}.{j}', 'rl:index:blocked_ips').split(':')[-1]
for i in range(5) for j in range(10)
}
assert shards_seen == {'0', '1', '2'}
# ---------------------------------------------------------------------------
# Check 1 — known bot User-Agent
# ---------------------------------------------------------------------------
@ -291,6 +333,7 @@ def test_auth_suspicious_headers_blocked():
r = _auth_req()
r.META.pop('HTTP_ACCEPT', None)
r.META.pop('HTTP_ACCEPT_LANGUAGE', None)
r.META.pop('HTTP_USER_AGENT', None)
result = mw._evaluate(r)
assert result == {'action': 'block', 'reason': 'suspicious_headers_auth', 'ip': '1.2.3.4'}
@ -303,12 +346,9 @@ def test_auth_rate_exceeded_blocks_and_marks_user_blocked():
mw, mock_cache = _make_middleware(auth_rate='5/m')
mw._incr_with_ttl = MagicMock(return_value=5) # exactly at threshold
result = mw._evaluate(_auth_req(uid=7))
# auth_user_rate has no persistent block key — the window resets naturally
assert result == {'action': 'block', 'reason': 'auth_user_rate', 'ip': '1.2.3.4'}
mock_cache.set.assert_called_once_with(
RL_USER_BLOCKED.format(ns=_NAMESPACE, uid='7'),
1,
timeout=RateLimitMiddleware.BLOCK_TTL,
)
mock_cache.set.assert_not_called()
def test_auth_under_rate_passes():
@ -328,6 +368,7 @@ def test_anon_suspicious_headers_blocked():
r = _anon_req()
r.META.pop('HTTP_ACCEPT', None)
r.META.pop('HTTP_ACCEPT_LANGUAGE', None)
r.META.pop('HTTP_USER_AGENT', None)
result = mw._evaluate(r)
assert result == {'action': 'block', 'reason': 'suspicious_headers', 'ip': '1.2.3.4'}
@ -337,14 +378,15 @@ def test_anon_suspicious_headers_blocked():
# ---------------------------------------------------------------------------
def test_anon_ip_rate_exceeded_blocks_and_marks_ip_blocked():
mw, mock_cache = _make_middleware(anon_rate='5/m')
mw, _ = _make_middleware(anon_rate='5/m')
mw._incr_with_ttl = MagicMock(return_value=5) # first call (IP counter) hits threshold
with patch('sapl.middleware.ratelimit._set_block') as mock_set_block:
result = mw._evaluate(_anon_req())
assert result == {'action': 'block', 'reason': 'ip_rate', 'ip': '1.2.3.4'}
mock_cache.set.assert_called_once_with(
mock_set_block.assert_called_once_with(
RL_IP_BLOCKED.format(ip='1.2.3.4'),
1,
timeout=RateLimitMiddleware.BLOCK_TTL,
_index_shard('1.2.3.4', RL_INDEX_BLOCKED_IPS),
RateLimitMiddleware.BLOCK_TTL,
)
@ -353,15 +395,16 @@ def test_anon_ip_rate_exceeded_blocks_and_marks_ip_blocked():
# ---------------------------------------------------------------------------
def test_anon_ua_rotation_detected_blocks_and_marks_ip_blocked():
mw, mock_cache = _make_middleware(anon_rate='5/m')
mw, _ = _make_middleware(anon_rate='5/m')
# First call (IP counter) is under threshold; second (window counter) hits it.
mw._incr_with_ttl = MagicMock(side_effect=[4, 5])
with patch('sapl.middleware.ratelimit._set_block') as mock_set_block:
result = mw._evaluate(_anon_req())
assert result == {'action': 'block', 'reason': 'ua_rotation', 'ip': '1.2.3.4'}
mock_cache.set.assert_called_once_with(
mock_set_block.assert_called_once_with(
RL_IP_BLOCKED.format(ip='1.2.3.4'),
1,
timeout=RateLimitMiddleware.BLOCK_TTL,
_index_shard('1.2.3.4', RL_INDEX_BLOCKED_IPS),
RateLimitMiddleware.BLOCK_TTL,
)

5
sapl/settings.py

@ -413,6 +413,11 @@ RATE_LIMITER_RATE_BOT = config('RATE_LIMITER_RATE_BOT', default='5/m')
# Lower values pick up new blocked UAs faster; higher values reduce Redis round-trips.
RATE_LIMITER_UA_BLOCKLIST_REFRESH = config('RATE_LIMITER_UA_BLOCKLIST_REFRESH', default=60, cast=int)
# Number of shards for the blocked-IP ZSET indexes.
# Each shard receives IPs deterministically via md5(ip) % N, distributing
# write contention across N keys. Increase for high-throughput deployments.
RATE_LIMITER_INDEX_SHARDS = config('RATE_LIMITER_INDEX_SHARDS', default=3, cast=int)
# Maximum 404 responses from one anonymous IP in one anon window before the IP
# is blocked. Catches path-probing scanners that don't use recognised extensions.
RATE_LIMIT_404_THRESHOLD = config('RATE_LIMIT_404_THRESHOLD', default=20, cast=int)
