
Phase 6: scanner probe blocking, plan consolidation, and flow diagram

Code:
- Block IPs dynamically on scanner extension probes (.php, .asp, .aspx,
  .jsp, .cgi, .env) — writes rl:ip:{ip}:blocked on first hit; subsequent
  requests short-circuit at check 2 with zero counting overhead
- Add RATE_LIMIT_SCANNER_EXTENSIONS setting (space-separated, env-overridable)
- Import os in ratelimit.py for os.path.splitext

Plan (RATE_LIMITER_PLAN.md → RATE-LIMITER-PLAN.md):
- Rename to kebab-case for consistency with rate-limiter-v2.md
- Merge missing content from rate-limiter-v2.md: context & problem statement,
  component diagram (DB0/DB1 split), decision log, Gunicorn tuning, nginx
  real-IP fixes, upload settings, N+1 fix (synced to actual implementation),
  enforcement graduation order, decorator migration table, file serving
  decision matrix, dynamic page caching guidelines, open questions
- Add Mermaid decision flow diagram for RateLimitMiddleware._evaluate()
- Add rationale section for rl:{ns}:ip:{ip}:w:{bucket} namespace scoping
  (5 arguments covering attack pattern match, gaming resistance, key
  orthogonality, multi-portal fairness, and isolation contract)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rate-limiter-2026
Edward Ribeiro, 3 weeks ago
parent commit c5eea025ab
Changed files (lines changed):
1. plan/RATE-LIMITER-PLAN.md (967)
2. plan/RATE_LIMITER_PLAN.md (474)
3. sapl/base/media.py (40)
4. sapl/middleware/ratelimit.py (9)
5. sapl/settings.py (15)

plan/RATE-LIMITER-PLAN.md (967)

@@ -0,0 +1,967 @@
# SAPL — Rate Limiter & Redis Operations
> **Scope**: Django / Gunicorn / nginx / Kubernetes fleet of 1,200+ pods.
> Each pod has a dedicated PostgreSQL instance. A K8s Ingress sits in front of all tenants.
> **This document is canonical** — all earlier session notes are consolidated here.
---
## Context & Problem Statement
### Fleet
| Item | Detail |
|------|--------|
| System | SAPL — Django 2.2, legislative management for Brazilian municipal chambers |
| Fleet | ~1,200 Kubernetes pods, each with a dedicated PostgreSQL pod |
| Pod limits | 1 core CPU (limit) / 35m (request) · 1600Mi RAM (limit) / 800Mi (request) |
| Users | Legislative house staff, often behind NAT (many users, one public IP) |
| Workloads | PDF generation (synchronous, ReportLab), file uploads up to 150 MB, WebSocket voting panel |
### OOM Kill Pattern
Workers grow from ~35 MB at birth to 800–900 MB within 2–3 minutes, then are killed and replaced in a continuous cycle.
Root causes:
- Bot scraping triggers synchronous PDF generation — entire document built in RAM (ReportLab)
- `worker_max_memory_per_child` only checks **between requests**; workers blocked on long requests are never recycled
- `TIMEOUT=300` lets bots hold threads for up to 5 minutes while memory accumulates
- 3 workers × 300 MB each = ~900 MB — breaching the 800Mi request threshold
### Bot Traffic Profile (Barueri pod, 16 days, 662 k requests)
| Actor | Requests | % of total |
|-------|----------|-----------|
| Googlebot | ~154,000 | 23.2% |
| Chrome/98.0.4758 (spoofed scraper) | 90,774 | 13.7% |
| kube-probe (healthcheck) | 69,065 | 10.4% |
| meta-externalagent | 28,325 | 4.3% |
| GPTBot | 11,489 | 1.7% |
| bingbot | 7,639 | 1.1% |
| OAI-SearchBot + Applebot | 6,681 | 1.0% |
| **Total identified bots** | **~377,000** | **~56.9%** |
**Botnet fingerprint:**
- Rotates User-Agents (Chrome/121, Chrome/122, Firefox/123, Safari/17…) across requests
- Crawls all sub-endpoints of the same matéria within 1 second from different IPs
- Distributes crawling across tenants — each pod stays under the per-pod rate limit, never triggering it
- Primary targets: `/relatorios/{id}/etiqueta-materia-legislativa` (~40 KB PDF) and all `/materia/{id}/*` sub-endpoints
### Static File Traffic (from CSV analysis)
| Category | Requests | Transfers |
|----------|----------|----------|
| Logos / images | 62,776 | ~24 GB |
| PDFs | 8,869 | 5.1 GB |
| Parliamentarian photos | 11,856 | ~0.5 GB |
| **Total** | **83,501** | **~30 GB** |
Top offender: `Brasão - Foz do Iguaçu.png` — 14,512 requests, 5.6 GB from a single 392 KB file.
### Hard Constraints
| Constraint | Impact |
|------------|--------|
| Per-pod PostgreSQL | Rate-limit counters not shared across pods |
| NAT environments | IP-based rate limiting causes false positives |
| `TIMEOUT=300` / uploads to 150 MB | Must not be broken — intentional for slow workflows |
---
## Architecture Overview
### Component Diagram
```mermaid
graph TD
Client([Bot / Human Client])
nginx[nginx]
gunicorn[Gunicorn\n2 workers / 4 threads]
mw[Django Middleware\nRateLimitMiddleware]
view[View Layer\nCBV + decorators]
db0[(Redis DB0\npage cache)]
db1[(Redis DB1\nrate limiter)]
pg[(PostgreSQL\nper-pod)]
fs[Filesystem\nPDFs / media]
Client -->|HTTP| nginx
nginx -->|proxy_pass| gunicorn
gunicorn --> mw
mw -->|pass| view
mw -->|429| nginx
view --> pg
view --> fs
view -->|read/write cached pages| db0
mw -->|counters + blocked markers| db1
```
> DB2 is reserved for Django Channels (WebSocket — future).
### Redis Memory Budget
| Key type | Key schema | TTL | DB | Est. size |
|----------|-----------|-----|----|----------|
| Page / view cache | `cache:{ns}:*` | 60–600 s | 0 | ~0.5 GB |
| Static cache (images/logos) | `static:{ns}:{sha256}` | 3–24 h | 0 | ~2.4 GB |
| IP request counter | `rl:ip:{ip}:reqs` | 60 s | 1 | ~0.6 MB |
| IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | 1 | ~0.06 MB |
| User request counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 1 | negligible |
| User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | 1 | negligible |
| Path counter | `rl:{ns}:path:{sha256}:reqs` | 60 s | 1 | ~0.3 MB |
| UA deny list | `rl:bot:ua:blocked` | permanent SET | 1 | ~0.03 MB |
| NS/IP/window counter | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 1 | ~0.6 MB |
| Redis overhead (× 1.5) | | | | ~1.6 GB |
| **Total ceiling** | | | | **~5 GB** |
---
## Decision Log
| Decision | Chosen | Rationale |
|----------|--------|-----------|
| Redis topology | **Single pod** (no Sentinel, no Cluster) | 65 MB of active data fits comfortably; cluster complexity not justified |
| PDF caching in Redis | **No** — ETags + sendfile are sufficient | Once rate limiting + ETags are active, repeat requests become 304s with zero bytes transferred |
| Rate-limit enforcement | **Django middleware** with shared Redis | No nginx image changes required; solves cross-pod consistency immediately |
| `worker_max_memory_per_child` | **400 MB** | Pod limit 1600Mi, 2 workers × 400 MB = 800 MB — leaves 800 Mi headroom |
| `sendfile off` → `on` | **Bug** — flip to `on` | No valid production reason found for `off`; letting the kernel skip the userspace copy has no downside here |
| `/media/` serving | **X-Accel-Redirect** | Routes all `/media/` through Gunicorn so Django middleware runs; nginx serves bytes via internal location |
| Cache backend switch | **At pod startup** via `start.sh` + waffle switch | Pod restart is acceptable; avoids per-request runtime overhead |
---
## Directory layout
```
docker/k8s/
└── redis/
    ├── redis-configmap.yaml   # redis.conf — no persistence, allkeys-lru, 5 GB ceiling
    ├── redis-deployment.yaml  # Deployment (1 replica, redis:7-alpine)
    └── redis-service.yaml     # ClusterIP service on port 6379
```
---
## Prerequisites
- `kubectl` configured to talk to the target cluster.
- A `sapl-redis` namespace (created below if it doesn't exist).
---
## Deploy
```bash
# 1. Create the namespace (idempotent)
rancher kubectl create namespace sapl-redis --dry-run=client -o yaml | rancher kubectl apply -f -
# 2. Apply all three manifests
rancher kubectl apply -f docker/k8s/redis/redis-configmap.yaml
rancher kubectl apply -f docker/k8s/redis/redis-deployment.yaml
rancher kubectl apply -f docker/k8s/redis/redis-service.yaml
# 3. Verify the pod is Running
rancher kubectl -n sapl-redis get pods -l app=sapl-redis
```
Expected output:
```
NAME READY STATUS RESTARTS AGE
sapl-redis-6d9f8b7c4d-xk2lm 1/1 Running 0 30s
```
---
## Verify the rate limiter
`scripts/test_ratelimiter.py` fires repeated GET requests at a SAPL URL and reports
when the first 429 is returned.
### Usage
```
python scripts/test_ratelimiter.py <URL> [-n NUM] [-d DELAY] [-t TIMEOUT]
```
| Flag | Default | Meaning |
|------|---------|---------|
| `url` | *(required)* | Full URL including scheme, e.g. `http://localhost` |
| `-n`, `--num-requests` | `50` | Maximum requests to send |
| `-d`, `--delay` | `0.1` | Seconds between requests |
| `-t`, `--timeout` | `10` | Per-request timeout in seconds |
The script stops and prints a summary as soon as a 429 is received.
### Examples
```bash
# Hit the anonymous threshold (35 req/min) — fire 40 requests with minimal delay
python scripts/test_ratelimiter.py http://localhost -n 40 -d 0.05
# Slower fire — check that legitimate traffic is not rate-limited
python scripts/test_ratelimiter.py http://localhost -n 20 -d 2
# Test against a staging pod via port-forward
rancher kubectl port-forward -n <NAMESPACE> deploy/sapl 8080:80 &
python scripts/test_ratelimiter.py http://localhost:8080 -n 40 -d 0.05
```
### Reading the output
```
Request 1: Status 200 | Time: 0.045s
...
Request 36: Status 429 | Time: 0.038s
-> Rate limited on request 36
Summary:
Total requests attempted: 36
Successful (200): 35
Rate limited (429): 1
First 429 occurred at request: 36
```
A first-429 near the configured anonymous threshold (35 req/min) confirms the
middleware is wired correctly. A first-429 much earlier points to nginx `limit_req`
firing before Django sees the request.
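For reference, the core of the probe is only a few lines. The sketch below is illustrative, not the real script (which lives in `scripts/test_ratelimiter.py` and adds full argparse handling); it assumes only the `requests` library:
```python
# Minimal sketch of the test loop — illustrative, not scripts/test_ratelimiter.py itself.
import sys
import time

import requests

def probe(url, num=50, delay=0.1, timeout=10):
    for i in range(1, num + 1):
        t0 = time.monotonic()
        r = requests.get(url, timeout=timeout)
        print(f'Request {i}: Status {r.status_code} | Time: {time.monotonic() - t0:.3f}s')
        if r.status_code == 429:
            print(f'-> Rate limited on request {i}')
            return i
        time.sleep(delay)
    return None  # threshold never hit within `num` requests

if __name__ == '__main__':
    probe(sys.argv[1])
```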
---
## Inject REDIS_URL into SAPL instances
`REDIS_URL` points at the shared instance:
```
redis://redis.sapl-redis.svc.cluster.local:6379
        ^^^^^ ^^^^^^^^^^
        svc   namespace
```
`start.sh` picks it up on every pod startup and sets the `REDIS_CACHE` waffle switch
automatically — no further intervention needed.
### Fleet-wide rollout
Uses the `app.kubernetes.io/name=sapl` pod label to discover every SAPL namespace
automatically — onboarding a new municipality requires no script changes.
```bash
for ns in $(rancher kubectl get pods -A -l app.kubernetes.io/name=sapl \
-o jsonpath='{.items[*].metadata.namespace}' | tr ' ' '\n' | sort -u); do
rancher kubectl set env deployment/sapl \
REDIS_URL=redis://redis.sapl-redis.svc.cluster.local:6379 \
-n $ns
done
```
### Roll back
```bash
for ns in $(rancher kubectl get pods -A -l app.kubernetes.io/name=sapl \
-o jsonpath='{.items[*].metadata.namespace}' | tr ' ' '\n' | sort -u); do
rancher kubectl set env deployment/sapl REDIS_URL- -n $ns
done
```
`kubectl set env deployment/sapl REDIS_URL-` (trailing `-`) removes the variable.
`start.sh` then falls back to file-based cache automatically.
---
## Monitor
### Pod and events
```bash
# Pod status
rancher kubectl -n sapl-redis get pods -l app=sapl-redis -o wide
# Deployment events (useful right after apply)
rancher kubectl -n sapl-redis describe deployment sapl-redis
# Pod events (OOMKill, restarts, etc.)
rancher kubectl -n sapl-redis describe pod -l app=sapl-redis
```
### Logs
```bash
# Tail live logs
rancher kubectl -n sapl-redis logs -f deploy/sapl-redis
# Last 100 lines
rancher kubectl -n sapl-redis logs deploy/sapl-redis --tail=100
```
### Redis INFO
```bash
# Memory usage
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli info memory \
| grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
# Connection pressure
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli info stats \
| grep -E 'rejected_connections|instantaneous_ops_per_sec'
# Key distribution per DB
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli info keyspace
# Recent slow queries
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli slowlog get 10
# Live command sampling (1-second window)
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli --latency-history -i 1
```
### Rate-limiter keys (DB 1)
```bash
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli -n 1 dbsize
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli -n 1 --scan --pattern 'rl:ip:*' | head -20
```
---
## Seed the UA deny list (once after first deploy)
`rl:bot:ua:blocked` is a permanent Redis SET in DB 1. Each member is the
SHA-256 of a **UA token** — the identifying fragment extracted after splitting
on `/`, spaces, `;`, `(`, `)`, e.g.:
```
UA string: "GPTBot/1.1 (+https://openai.com/gptbot)"
Tokens: GPTBot 1.1 +https: ...
Hash stored: sha256("GPTBot")
```
The middleware (`_is_redis_blocked_ua`) tokenises the incoming UA the same
way and checks each token hash against the cached set. The SET is fetched
from Redis at most once per `RATE_LIMITER_UA_BLOCKLIST_REFRESH` seconds (default 60)
per worker process.
The bots in `BOT_UA_FRAGMENTS` (Python list, always active) and this Redis
SET are **independent** — the Python list provides the baseline and the Redis
SET allows adding new offenders at runtime **without a code deploy**.
```bash
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli -n 1 \
SADD rl:bot:ua:blocked \
"$(echo -n 'GPTBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'ClaudeBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'PerplexityBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'Bytespider' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'AhrefsBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'meta-externalagent' | sha256sum | cut -d' ' -f1)"
# Add a new offender at runtime (picked up within RATE_LIMITER_UA_BLOCKLIST_REFRESH seconds)
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli -n 1 \
SADD rl:bot:ua:blocked "$(echo -n 'NewBot' | sha256sum | cut -d' ' -f1)"
```
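The matching side can be pictured as the sketch below. This illustrates the tokenise-and-hash logic described above; it is not the actual `_is_redis_blocked_ua` implementation, and `blocked_hashes` stands in for the worker's cached copy of the SET:
```python
# Sketch of UA token matching. `blocked_hashes` represents the cached
# contents of the rl:bot:ua:blocked SET (refreshed periodically per worker).
import hashlib
import re

def ua_tokens(ua):
    # Split on '/', whitespace, ';', '(' and ')' — same rule used when seeding.
    return [t for t in re.split(r'[/\s;()]+', ua) if t]

def is_blocked_ua(ua, blocked_hashes):
    return any(
        hashlib.sha256(token.encode()).hexdigest() in blocked_hashes
        for token in ua_tokens(ua)
    )

blocked = {hashlib.sha256(b'GPTBot').hexdigest()}
assert is_blocked_ua('GPTBot/1.1 (+https://openai.com/gptbot)', blocked)
```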
---
## Local standalone Redis (development / testing)
No Kubernetes? Run Redis directly with Docker:
```bash
sudo docker run --rm -p 6379:6379 redis:7-alpine \
redis-server --save "" --appendonly no
```
Then point Django at it by exporting the env var before starting the dev server:
```bash
export REDIS_URL="redis://localhost:6379"
export CACHE_BACKEND="redis"
python manage.py runserver
```
Or add them to your local `.env` file:
```
REDIS_URL=redis://localhost:6379
CACHE_BACKEND=redis
```
> **Note**: the waffle switch `REDIS_CACHE` must also be `on` in your local
> database for `start.sh` to activate the Redis backend. Run:
> ```bash
> python manage.py waffle_switch REDIS_CACHE on --create
> ```
---
## Update `redis.conf` without redeploying
```bash
# Edit the ConfigMap
rancher kubectl -n sapl-redis edit configmap redis-config
# Restart the pod to pick up the new config
rancher kubectl -n sapl-redis rollout restart deployment/sapl-redis
```
---
## Gunicorn tuning
`docker/startup_scripts/gunicorn.conf.py` — resolved values for the current pod budget (1600Mi RAM, 1 CPU):
```python
NUM_WORKERS = int(os.getenv("WEB_CONCURRENCY", "2")) # was 3
THREADS = int(os.getenv("GUNICORN_THREADS", "4")) # was 8
TIMEOUT = int(os.getenv("GUNICORN_TIMEOUT", "120")) # was 300
max_requests = 1000
max_requests_jitter = 200
worker_max_memory_per_child = 400 * 1024 * 1024 # 400 MB — was 300 MB
```
**Per-location timeout strategy** — nginx overrides the global Gunicorn timeout per-path:
| Operation | Timeout | Rationale |
|-----------|---------|-----------|
| Normal page rendering | 60 s | No legitimate page should take > 60 s |
| API endpoints | 30 s | Stateless, fast by design |
| PDF download (cached / nginx) | 30 s | nginx serves from disk, worker not involved |
| PDF generation (uncached) | 180 s | Kept high — addressed in a future phase |
| Large file upload | 180 s | nginx buffers upload; worker processes after |
---
## nginx real-IP and core fixes
Added to `docker/config/nginx/nginx.conf` (http {} block):
```nginx
# Kernel bypass — was off (bug)
sendfile on;
tcp_nopush on;
tcp_nodelay on;
# Real client IP from X-Forwarded-For set by K8s Ingress
real_ip_header X-Forwarded-For;
real_ip_recursive on;
set_real_ip_from 10.0.0.0/8;
set_real_ip_from 172.16.0.0/12;
set_real_ip_from 192.168.0.0/16;
```
Without this block, `$remote_addr` inside the pod is always the Ingress IP, making IP-based rate limiting and blocking meaningless. `real_ip_recursive on` makes nginx walk `X-Forwarded-For` right to left, skipping the trusted proxy ranges in `set_real_ip_from` until it reaches the original client address.
---
## Django upload settings
Added to `sapl/settings.py` — files above 2 MB are streamed to disk rather than held in worker RAM. Critical for 150 MB upload support without OOM pressure:
```python
FILE_UPLOAD_MAX_MEMORY_SIZE = 2 * 1024 * 1024 # 2 MB
DATA_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024 # 10 MB
MAX_DOC_UPLOAD_SIZE = 150 * 1024 * 1024 # 150 MB
FILE_UPLOAD_TEMP_DIR = '/var/interlegis/sapl/tmp'
```
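To see the handler switch in action, a throwaway view like the sketch below can be used (hypothetical: the `upload_debug` view and the `arquivo` field name are not part of SAPL). Django hands the view an `InMemoryUploadedFile` below the 2 MB threshold and a disk-backed `TemporaryUploadedFile` above it:
```python
# Hypothetical debug view: shows which upload handler received the file.
from django.core.files.uploadedfile import (
    InMemoryUploadedFile,
    TemporaryUploadedFile,
)
from django.http import HttpResponse

def upload_debug(request):
    f = request.FILES['arquivo']  # hypothetical form field name
    if isinstance(f, TemporaryUploadedFile):
        # Above FILE_UPLOAD_MAX_MEMORY_SIZE: streamed to FILE_UPLOAD_TEMP_DIR.
        detail = f'on disk at {f.temporary_file_path()}'
    elif isinstance(f, InMemoryUploadedFile):
        detail = 'held in memory (below FILE_UPLOAD_MAX_MEMORY_SIZE)'
    else:
        detail = f.__class__.__name__
    return HttpResponse(f'{f.size} bytes, {detail}')
```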
---
## N+1 fix — `get_etiqueta_protocolos`
`sapl/relatorios/views.py` — previously called `MateriaLegislativa.objects.filter()` inside a loop over protocols. Fixed to **three queries total** regardless of volume (one for protocols, one for materias, one for documentos):
```python
# sapl/relatorios/views.py
from django.db.models import Q

from sapl.materia.models import MateriaLegislativa
from sapl.protocoloadm.models import DocumentoAdministrativo


def get_etiqueta_protocolos(prots):
    prot_list = list(prots)
    if not prot_list:
        return []
    # Pre-fetch MateriaLegislativa for all protocols in one query.
    materia_query = Q()
    for p in prot_list:
        materia_query |= Q(numero_protocolo=p.numero, ano=p.ano)
    materias_map = {
        (m.numero_protocolo, m.ano): m
        for m in MateriaLegislativa.objects.filter(
            materia_query).select_related('tipo')
    }
    # Pre-fetch DocumentoAdministrativo for all protocols in one query.
    documentos_map = {
        doc.protocolo_id: doc
        for doc in DocumentoAdministrativo.objects.filter(
            protocolo__in=prot_list).select_related('tipo')
    }
    protocolos = []
    for p in prot_list:
        dic = {}
        dic['titulo'] = str(p.numero) + '/' + str(p.ano)
        # ... timestamp / assunto / interessado / autor fields ...
        materia = materias_map.get((p.numero, p.ano))
        dic['num_materia'] = (
            materia.tipo.sigla + ' ' + str(materia.numero) + '/' + str(materia.ano)
            if materia else ''
        )
        documento = documentos_map.get(p.pk)
        dic['num_documento'] = (
            documento.tipo.sigla + ' ' + str(documento.numero) + '/' + str(documento.ano)
            if documento else ''
        )
        dic['ident_processo'] = dic['num_materia'] or dic['num_documento']
        protocolos.append(dic)
    return protocolos
```
---
## Rate limiting — two layers, two jobs
SAPL enforces rate limits at two independent layers. They use different
algorithms and protect different things; their thresholds must be tuned
separately.
### Layer 1 — nginx `limit_req` (leaky bucket)
Defined in `docker/config/nginx/nginx.conf` (zones) and `sapl.conf` (burst).
```
sapl_general rate=30r/m # 1 token every 2 s
sapl_heavy rate=10r/m # 1 token every 6 s (PDF/report endpoints)
```
`burst=N nodelay` means nginx accepts up to N requests instantly above the
current token level, then enforces the drip rate. Requests beyond the burst
cap return 429 before reaching Gunicorn — **zero Python cost**.
Burst values are set at container startup via env vars:
| Env var | Default | Location |
|---------|---------|----------|
| `NGINX_BURST_GENERAL` | `60` | `location /`, `location /media/` |
| `NGINX_BURST_API` | `60` | `location /api/` |
| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` |
Defaults are 2× the zone's per-minute rate, so a user can spend a full
minute's quota in a single burst before the leaky bucket takes over.
### Layer 2 — Django `RateLimitMiddleware` (sliding window)
Defined in `sapl/middleware/ratelimit.py`, backed by Redis DB 1.
Requests that pass nginx reach Python. The middleware counts them in a
60-second sliding window per IP (anonymous) or per user (authenticated):
| Env var | Default | Scope |
|---------|---------|-------|
| `RATE_LIMITER_RATE` | `35/m` | Anonymous IP |
| `RATE_LIMITER_RATE_AUTHENTICATED` | `120/m` | Authenticated user |
| `RATE_LIMITER_RATE_BOT` | `5/m` | *(reserved — bots are currently blocked outright, not counted)* |
| `RATE_LIMITER_UA_BLOCKLIST_REFRESH` | `60` s | How often each worker re-fetches `rl:bot:ua:blocked` from Redis |
When the window count hits the threshold the IP/user is written to a Redis
blocked-set with a 300 s TTL and subsequent requests return 429 with
`Retry-After: 300` — without touching the database.
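The counting step can be pictured as the sketch below, written against the key schema in this document rather than copied from the middleware; `rl_cache` stands for the Django cache backend bound to Redis DB 1:
```python
# Sketch of the count-then-block step for an anonymous IP (illustrative).
ANON_THRESHOLD = 35   # RATE_LIMITER_RATE
BLOCK_TTL = 300       # seconds — blocked-marker cooldown
WINDOW_TTL = 60       # seconds — counting window

def count_and_maybe_block(rl_cache, ip):
    reqs_key = f'rl:ip:{ip}:reqs'
    # add() only creates the key (with its TTL) if absent, so the 60 s
    # window starts rolling from the IP's first request.
    if rl_cache.add(reqs_key, 1, timeout=WINDOW_TTL):
        count = 1
    else:
        count = rl_cache.incr(reqs_key)
    if count >= ANON_THRESHOLD:
        rl_cache.set(f'rl:ip:{ip}:blocked', 1, timeout=BLOCK_TTL)
        return {'action': 'block', 'reason': 'ip_rate', 'ip': ip}
    return {'action': 'pass'}
```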
Decision flow inside `RateLimitMiddleware._evaluate()`:
```
1. IP in whitelist? → pass (no further checks)
1a. UA matches BOT_UA_FRAGMENTS list? → 429 reason=known_ua
1b. UA token hash in rl:bot:ua:blocked SET? → 429 reason=redis_ua
2. IP in rl:ip:{ip}:blocked? → 429 reason=ip_blocked
2b. Path extension in RATE_LIMIT_SCANNER_EXTENSIONS? → SET blocked, 429 reason=scanner_probe
3. Authenticated user?
3a. User in rl:{ns}:user:{uid}:blocked? → 429 reason=user_blocked
3b. Suspicious headers (no Accept/AL)? → 429 reason=suspicious_headers_auth
3c. User request count ≥ auth threshold? → SET blocked, 429 reason=auth_user_rate
4. Anonymous:
4a. Suspicious headers? → 429 reason=suspicious_headers
4b. IP request count ≥ anon threshold? → SET blocked, 429 reason=ip_rate
4c. NS/IP window count ≥ anon threshold? → SET blocked, 429 reason=ua_rotation
→ pass
```
### Decision flow diagram
```mermaid
flowchart TD
REQ([Request]) --> C1
C1{"Known bot UA?"}
C1 -- "yes — substring in BOT_UA_FRAGMENTS" --> R_UA([429\nknown_ua])
C1 -- no --> C1B
C1B{"Redis UA deny list?"}
C1B -- "yes — token hash in rl:bot:ua:blocked" --> R_RUA([429\nredis_ua])
C1B -- no --> C2
C2{"IP blocked?"}
C2 -- "yes — rl:ip:IP:blocked exists" --> R_IPB([429\nip_blocked])
C2 -- no --> C2B
C2B{"Scanner extension?\n.php .asp .aspx …"}
C2B -- yes --> SIPB["SET rl:ip:IP:blocked TTL 300 s"]
SIPB --> R_SCN([429\nscanner_probe])
C2B -- no --> C3
C3{"Authenticated?"}
C3 -- yes --> C3A
C3 -- no --> C4A
subgraph AUTH ["Authenticated"]
C3A{"User blocked?"}
C3A -- "yes — rl:ns:user:UID:blocked" --> R_UB([429\nuser_blocked])
C3A -- no --> C3B
C3B{"Suspicious headers?\nno Accept-Language + no Accept"}
C3B -- yes --> R_SH([429\nsuspicious_headers_auth])
C3B -- no --> C3C
C3C{"User rate ≥ 120/min?"}
C3C -- yes --> SUB["SET rl:ns:user:UID:blocked TTL 300 s"]
SUB --> R_AUR([429\nauth_user_rate])
C3C -- no --> PASS_A([✓ pass])
end
subgraph ANON ["Anonymous"]
C4A{"Suspicious headers?\nno Accept-Language + no Accept"}
C4A -- yes --> R_ASH([429\nsuspicious_headers])
C4A -- no --> C4B
C4B{"IP rate ≥ 35/min?"}
C4B -- yes --> SIPR["SET rl:ip:IP:blocked TTL 300 s"]
SIPR --> R_IPR([429\nip_rate])
C4B -- no --> C4C
C4C{"NS/IP window hit\n≥ 35 in bucket?"}
C4C -- yes --> SUAR["SET rl:ip:IP:blocked TTL 300 s"]
SUAR --> R_UAR([429\nua_rotation])
C4C -- no --> PASS_N([✓ pass])
end
```
### Enforcement graduation order
Roll out to canary pods first; promote check-by-check in order of false-positive risk:
| Order | Check | Reason | Risk | Condition to promote |
|-------|-------|--------|------|---------------------|
| 1st | `known_ua` | Substring in hardcoded `BOT_UA_FRAGMENTS` list | Zero | UA strings are deterministic |
| 2nd | `redis_ua` | Token hash in `rl:bot:ua:blocked` SET | Zero | Keys only set manually by operators |
| 3rd | `ip_blocked` | Marker set by prior proven-bad requests | Zero | Fast-path only, no new blocks created |
| 4th | `scanner_probe` | Path ext in `RATE_LIMIT_SCANNER_EXTENSIONS` | Zero | Django never legitimately serves `.php`/`.asp`/etc. |
| 5th | `ip_rate` | Rolling IP counter ≥ 35/min | Low | Threshold calibrated from canary logs |
| 6th | `suspicious_headers` | No Accept-Language **and** no Accept | Medium | Confirmed no legitimate clients omit both headers |
| 7th | `ua_rotation` (ns/window) | NS/IP clock-aligned bucket ≥ 35 | Medium | NAT IP whitelist in place (see Open Questions) |
### Decorator migration
For views where `django-ratelimit` decorators already exist:
| Endpoint type | Action | Reason |
|---------------|--------|--------|
| List views (GET) | Remove after middleware stable | Middleware covers equivalent threshold |
| Detail views (GET) | Remove after middleware stable | Middleware covers equivalent threshold |
| Search / filter views | Remove last | Expensive queries — keep stricter per-view limit until traffic data confirms safety |
| PDF / file generation | **Keep permanently** | Most expensive endpoint; per-view limit tighter than global |
| Write endpoints (POST/PUT/DELETE) | **Keep permanently** | Different abuse surface |
| Auth endpoints (login, reset) | **Keep permanently** | Credential stuffing; must be independent of IP rate |
### Why they are not the same number
| | nginx burst | Django threshold |
|-|------------|-----------------|
| **Algorithm** | Leaky bucket — token refills over time | Sliding window — hard count per 60 s |
| **Protects** | Gunicorn workers from being flooded | Per-client fairness, business policy |
| **Tuned by** | Capacity of the server | Acceptable request volume per client |
| **Failure mode** | Workers overwhelmed | Legitimate user over-browsing |
A user loading a page quickly may fire 5–10 Django requests in two seconds.
With `rate=30r/m` (1 token/2 s) and `burst=60` they absorb that fine; the
leaky bucket refills before they click the next link. The Django threshold
(35/m sliding window) catches sustained automated traffic from a single IP
that looks like scraping even if it arrives slowly enough to beat the nginx
burst cap.
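To make the algorithmic difference concrete, here is a toy leaky-bucket model using the `sapl_general` numbers. It illustrates the behaviour; it is not nginx's implementation:
```python
# Toy leaky bucket: tokens refill at `rate` per second up to `burst`;
# each request consumes one token or is rejected (nginx would answer 429).
import time

class LeakyBucket:
    def __init__(self, rate_per_min, burst):
        self.rate = rate_per_min / 60.0     # tokens per second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = LeakyBucket(rate_per_min=30, burst=60)  # sapl_general defaults
```
A fresh client can burn all 60 burst tokens instantly; after that one request is admitted roughly every 2 s, which is exactly the drip rate quoted above.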
---
## Request routing — how nginx reaches Django
`proxy_pass http://sapl_server` forwards the HTTP request — with the original
path intact — to the Gunicorn Unix socket. Django doesn't know or care that
nginx is in front; it sees a standard HTTP request.
```
GET /media/foo.pdf
  nginx (sapl.conf)
    location /media/ → proxy_pass to Unix socket
  Gunicorn (WSGI server)
    receives raw HTTP, calls Django WSGI application
  Django middleware stack (settings.MIDDLEWARE)
    RateLimitMiddleware → pass or 429
  Django URL router (sapl/urls.py)
    r'^media/(?P<path>.*)$' → serve_media
  serve_media(request, path='foo.pdf')
    returns HttpResponse with X-Accel-Redirect: /_accel/media/foo.pdf
  nginx sees X-Accel-Redirect header
    /_accel/media/ internal location → reads file from disk → sends to client
```
nginx does no routing beyond picking a `location` block. The mapping from
URL path to Python function lives entirely in `sapl/urls.py`. `proxy_pass` is
just a pipe.
---
## Media file serving — `serve_media` and X-Accel-Redirect
All `/media/` requests (public and private) are routed through Gunicorn so that
Django middleware runs on every hit. Nginx serves the file bytes via
`X-Accel-Redirect` — the Gunicorn worker is freed as soon as it sends the
response headers.
### nginx locations (`docker/config/nginx/sapl.conf`)
```nginx
# Proxied to Gunicorn — Django middleware + serve_media() run here.
location /media/ {
    limit_req zone=sapl_general burst=${NGINX_BURST_GENERAL} nodelay;
    proxy_pass http://sapl_server;
}

# Internal — only reachable via X-Accel-Redirect, not by external clients.
location /_accel/media/ {
    internal;
    alias /var/interlegis/sapl/media/;
    sendfile on;
    etag on;
}
```
### Django view (`sapl/base/media.py`)
`serve_media(request, path)` — registered at `^media/(?P<path>.*)$` in `sapl/urls.py`.
Per-request steps:
1. **Path traversal guard** — `os.path.abspath` check; raises 404 on escape.
2. **Auth gate** — `documentos_privados/` paths require an authenticated session; redirects to login otherwise.
3. **Path counter** — increments `rl:{ns}:path:{sha256}:reqs` in Redis DB 1 (TTL = `MEDIA_PATH_COUNTER_TTL`).
4. **Serve** — in DEBUG: `django.views.static.serve` directly. In production: `X-Accel-Redirect: /_accel/media/<path>`. Nginx sets `Content-Type` from its own `mime.types`.
### Settings
| Setting | Default | Purpose |
|---------|---------|---------|
| `MEDIA_PATH_COUNTER_TTL` | `60` s | TTL for both URL-path and storage-path counters (DB 1) |
### File serving decision matrix
| File type | Size | Strategy |
|-----------|------|----------|
| Logos / images | Any | nginx `alias` + `sendfile` + ETag + `Cache-Control` |
| Small PDFs | ≤ 360 KB | nginx direct + ETag |
| Medium PDFs | 360 KB – 2 MB | nginx direct + ETag + rate limit |
| Large PDFs | > 2 MB | nginx direct + strict rate limit; never Redis |
| LGPD-restricted | Any | Django `serve_media` → `X-Accel-Redirect` → nginx (access control enforced) |
| Public `/media/` | Any | Django `serve_media` → `X-Accel-Redirect` → nginx (middleware runs; path counter written) |
### Why Redis is not needed for PDFs
With the full mitigation stack active:
- **ASN blocking** drops datacenter bot traffic at nginx (zero Python cost)
- **UA blocking** drops known-UA bots at nginx (zero Python cost)
- **Shared Redis rate counters** enforce limits across all pods
- **ETags** convert repeat requests to 304 responses with zero bytes transferred
- **`sendfile on`** means disk reads bypass userspace entirely
Redis PDF caching would solve "high request volume reaching the file layer" — but that problem no longer exists once the above stack is active. For `Brasão - Foz do Iguaçu.png` (392 KB × 14,512 requests = 5.6 GB), a 50% conditional-request hit rate saves ~2.8 GB immediately — without any Redis.
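The arithmetic behind that estimate, kept here so it can be re-run against future CSV pulls (the 50% conditional-request hit rate is an assumption, not a measurement):
```python
# Savings estimate for the top offender, in decimal GB to match the figures above.
size_bytes = 392_000        # Brasão - Foz do Iguaçu.png
requests = 14_512
hit_rate = 0.50             # assumed share of requests answered with 304

total_gb = size_bytes * requests / 1e9   # ≈ 5.7 GB transferred today
saved_gb = total_gb * hit_rate           # ≈ 2.8 GB avoided
print(f'{total_gb:.1f} GB total, {saved_gb:.1f} GB saved')
```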
---
## Key schema reference
| DB | Use case | Key pattern | TTL | Threshold | Constant |
|----|----------|-------------|-----|-----------|----------|
| 0 | Page / view cache | `cache:{ns}:*` | 300 s (default) | — | `CACHES['default']` KEY_PREFIX |
| 0 | Static file cache (logos) | `static:{ns}:{sha256}` | 3 – 24 h | — | *Future* (requires OpenResty/Lua) |
| 0 | File content cache (≤ 360 KB) | `file:{ns}:{sha256}` | 1 h | — | *Future* |
| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | 35 (`RATE_LIMITER_RATE`) | `RL_IP_REQUESTS` |
| 1 | IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | — | `RL_IP_BLOCKED` |
| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | 120 (`RATE_LIMITER_RATE_AUTHENTICATED`) | `RL_USER_REQUESTS` |
| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | — | `RL_USER_BLOCKED` |
| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | 35 (`RATE_LIMITER_RATE`) | `RL_NS_WINDOW` |
| 1 | Path counter (`/media/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — (observability only) | `RL_PATH_REQUESTS` |
| 1 | Path counter (`/static/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | — | *Future* (requires OpenResty/Lua) |
| 1 | UA deny list | `rl:bot:ua:blocked` | permanent SET | — (block on match) | `RL_UA_BLOCKLIST` |
| 2 | Django Channels | `channels:*` | session TTL | — | *Future* |
### What each counter catches — and misses
**`rl:ip:{ip}:reqs` — global rolling IP counter**
Catches: any sustained anonymous volume from a single IP regardless of namespace,
path, or User-Agent — pure request rate.
Misses: a user legitimately accessing several municipality SAPLs simultaneously;
their requests accumulate across namespaces into one global count and may trip the
threshold even though no individual SAPL is being abused. Also misses a
timing-aware scraper that paces exactly 34 req/min: the 60 s TTL resets from the
first request, so the attacker can safely send 34, wait for reset, repeat forever.
---
**`rl:ip:{ip}:blocked` — IP short-circuit marker**
Written when `rl:ip:{ip}:reqs` hits the anonymous threshold (step 4b) or when the
namespace/IP bucket hits the threshold (step 4c). Checked at step 2 — before any
counting — so a blocked IP never increments any counter on subsequent requests.
Catches: saves Redis INCR + EXPIRE calls for every request from an already-blocked
IP; the 300 s TTL is a hard cooldown regardless of how many requests arrive.
Misses: the TTL is fixed — a persistent attacker simply waits 300 s and gets
another full window quota. Also, because the key is global (no namespace), an IP
blocked for one municipal SAPL is blocked for all SAPLs on the same pod —
collateral effect for shared IPs.
---
**`rl:{ns}:ip:{ip}:w:{bucket}` — namespace-scoped clock-aligned bucket**
Catches: sustained scraping against a *specific* municipal SAPL that stays just
under the global threshold; a scraper pacing 34 req/min globally across namespaces
still accumulates in the per-namespace bucket. Clock alignment (bucket =
`time() // 60`) means a burst straddling a minute boundary still contributes to
the *next* bucket for 120 s (2× TTL), making precise timing attacks harder.
Misses: an IP that floods one namespace to exactly 34 req/min: it never reaches 35
in the bucket either. Cross-namespace legitimate traffic that happens to land
within the same clock minute — same blind spot as `rl:ip:*` but scoped lower.
**Why this key is namespace-scoped**
Five arguments for `rl:{ns}:ip:{ip}:w:{bucket}` over a global `rl:ip:{ip}:w:{bucket}`:
1. **Matches the observed attack pattern.** The botnet in §Bot Traffic Profile targets one SAPL at a time, not the fleet evenly. A scraper hammering `fortaleza-ce` at 34 req/min has a namespace counter of 34 and a global counter of 34. Without the namespace the two keys are redundant — the window adds no new signal. With it, a scraper that legitimately distributes across 5 SAPLs (7 req/min each, 35 globally) is caught globally but *not* per-SAPL — correct behaviour, since no single SAPL is being abused.
2. **Two counters defeat two different gaming strategies.** `rl:ip:{ip}:reqs` uses a rolling TTL (starts on the first INCR). A scraper that knows this can send 34 requests, wait ~61 s for the key to expire, and repeat indefinitely. The clock-aligned window resets at wall-clock minute boundaries. To game *both* simultaneously the attacker must time bursts to expire the rolling key *and* land entirely within one clock window — two independent constraints that are hard to satisfy together.
3. **Without the namespace it duplicates the global counter.** All pods share the same Redis. A global `rl:ip:{ip}:w:{bucket}` would aggregate that IP's traffic from every pod — exactly what `rl:ip:{ip}:reqs` already does, just with different reset timing. Two keys measuring the same dimension is wasted INCR overhead with no added signal.
4. **Multi-SAPL legitimate IPs are not penalised.** Municipal IT departments, ISP shared exit nodes, and Googlebot all produce high global request rates while being individually harmless to any one SAPL. A namespaced window lets them access 10 SAPLs at 3 req/min each without triggering a per-SAPL block, while the global counter still catches them if their total rate is abusive.
5. **Consistent with the established `{ns}` isolation contract.** All user-keyed (`rl:{ns}:user:{uid}:*`) and path-keyed (`rl:{ns}:path:{sha256}:reqs`) entries are namespace-scoped. A global window key would break the invariant that per-tenant data is isolated — complicating key-space inspection, `SCAN`-based dashboards, and future per-tenant rate adjustments.
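For clarity, the bucket computation all five arguments assume looks like the sketch below (illustrative; the middleware's key template is the `RL_NS_WINDOW` constant):
```python
# Sketch: clock-aligned window key for the namespace-scoped counter.
import time

RL_NS_WINDOW = 'rl:{ns}:ip:{ip}:w:{bucket}'

def ns_window_key(ns, ip, now=None):
    # All requests landing in the same wall-clock minute share one bucket.
    bucket = int(now if now is not None else time.time()) // 60
    return RL_NS_WINDOW.format(ns=ns, ip=ip, bucket=bucket)

# The key lives for 120 s (2× the window), so a burst straddling a minute
# boundary is still visible while the next bucket fills.
print(ns_window_key('fortaleza-ce', '203.0.113.7', now=1_700_000_000))
# -> rl:fortaleza-ce:ip:203.0.113.7:w:28333333
```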
---
**`rl:{ns}:user:{uid}:reqs` — authenticated user counter**
Catches: an authenticated account being used as a scraping credential — even if
the requests come from many different IPs (e.g., distributed proxy pool), all
requests share the same `uid` and accumulate in one counter.
Misses: a credential that is shared across multiple legitimate users in the same
office; all their activity adds up to one counter and can trip the 120/min
threshold during a busy session.
---
**`rl:{ns}:user:{uid}:blocked` — authenticated user short-circuit marker**
Written when `rl:{ns}:user:{uid}:reqs` hits the authenticated threshold (step 3c).
Checked at step 3a — before counting — so a blocked user never increments their
counter on subsequent requests during the 300 s cooldown.
Catches: credential-stuffing or runaway automation using a valid session — once the
120/min threshold is hit, the account is locked out immediately for 300 s. Unlike
the IP marker, the block is namespace-scoped, so the same user account can be
blocked on one SAPL but still active on another.
Misses: same fixed-TTL weakness as the IP marker — a persistent attacker resumes
after 300 s. An account shared by multiple legitimate users (e.g., a departmental
login) can be locked out during peak collaborative use.
---
**`rl:{ns}:path:{sha256}:reqs` — per-media-file URL counter**
Currently observability-only (no threshold enforced). Intended for future
hot-file detection: a single document being hammered by many IPs would show
a spike in this counter even if no individual IP exceeds the IP threshold.
Misses: nothing is blocked today. Once a threshold is added, it will miss
distributed access where many IPs each download the file once (legitimate CDN
pre-warming or public interest event).
---
**`rl:bot:ua:blocked` — runtime UA deny list**
Catches: new bot UA tokens added at runtime via `redis-cli SADD` without a code
deploy; picked up within `RATE_LIMITER_UA_BLOCKLIST_REFRESH` seconds (default 60)
per worker. Complements the hardcoded `BOT_UA_FRAGMENTS` Python list.
Misses: bots that rotate UA tokens on every request (no single token accumulates);
bots that impersonate a valid browser UA completely (no known fragment to match).
---
## Dynamic page caching
**Goal**: Eliminate ORM queries for anonymous bot requests on list views.
**Prerequisite**: Phase 1 (shared Redis, `CACHE_BACKEND=redis`).
Many SAPL list views (`pesquisar-materia`, `norma`, etc.) are not truly dynamic for anonymous users between edits. A bot hammering `?page=1` through `?page=100` triggers 100 ORM queries per pod. With Redis page cache, each unique URL is queried once per TTL across the entire fleet.
```python
# Apply to anonymous list views only — AnonCachePageMixin is already wired
# to materia/sessao detail views.
from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_page
from django_filters.views import FilterView  # SAPL list views build on django-filter


@method_decorator(cache_page(60 * 5), name='dispatch')  # 5-minute TTL
class PesquisarMateriaView(FilterView):
    ...
```
> **Safety check**: `cache_page` sets `Cache-Control: private` for authenticated sessions automatically.
> Verify this is working before deploying — accidentally caching a session-aware response is a data leak.
### Cache TTL guidelines
| View type | TTL | Reasoning |
|-----------|-----|-----------|
| Matéria list (anonymous) | 300 s | Changes infrequently between sessions |
| Norma list (anonymous) | 300 s | Same |
| Parlamentar list | 3600 s | Changes rarely |
| Search results | 60 s | Query-dependent; shorter TTL safer |
| Authenticated views | Never | `cache_page` respects this automatically |
| PDF generation | Never | Too large — serve from disk via nginx |
---
## Open Questions
| # | Question | Status | Blocks |
|---|----------|--------|--------|
| 1 | Does Chrome/98.0.4758 impersonator appear consistently in nginx access logs? | Needs investigation | UA block safety |
| 2 | Which legislative house IPs can be pre-whitelisted in `RATE_LIMIT_WHITELIST_IPS`? | No list yet — obtain in the future. Setting is **optional / future**. | Enforcement safety for NAT users |
| 3 | `CONN_MAX_AGE` tuning | Currently **300 s** (`sapl/settings.py`). Evaluate whether to reduce given worker recycling at 400 MB. | Gunicorn tuning |
| 4 | WebSocket voting panel priority | Separate project. Resumes after Redis is on k8s, bot siege addressed, and OOM pressure reduced. | Phase 5 sequencing |

plan/RATE_LIMITER_PLAN.md (474)

@@ -1,474 +0,0 @@
# SAPL — Kubernetes Redis
Manifests for the shared Redis instance used by all SAPL pods for
cross-pod rate limiting (DB 1) and view/static-file caching (DB 0).
---
## Directory layout
```
docker/k8s/
└── redis/
    ├── redis-configmap.yaml   # redis.conf — no persistence, allkeys-lru, 5 GB ceiling
    ├── redis-deployment.yaml  # Deployment (1 replica, redis:7-alpine)
    └── redis-service.yaml     # ClusterIP service on port 6379
```
---
## Prerequisites
- `kubectl` configured to talk to the target cluster.
- A `sapl-redis` namespace (created below if it doesn't exist).
---
## Deploy
```bash
# 1. Create the namespace (idempotent)
rancher kubectl create namespace sapl-redis --dry-run=client -o yaml | rancher kubectl apply -f -
# 2. Apply all three manifests
rancher kubectl apply -f docker/k8s/redis/redis-configmap.yaml
rancher kubectl apply -f docker/k8s/redis/redis-deployment.yaml
rancher kubectl apply -f docker/k8s/redis/redis-service.yaml
# 3. Verify the pod is Running
rancher kubectl -n sapl-redis get pods -l app=sapl-redis
```
Expected output:
```
NAME READY STATUS RESTARTS AGE
sapl-redis-6d9f8b7c4d-xk2lm 1/1 Running 0 30s
```
---
## Verify the rate limiter
`scripts/test_ratelimiter.py` fires repeated GET requests at a SAPL URL and reports
when the first 429 is returned.
### Usage
```
python scripts/test_ratelimiter.py <URL> [-n NUM] [-d DELAY] [-t TIMEOUT]
```
| Flag | Default | Meaning |
|------|---------|---------|
| `url` | *(required)* | Full URL including scheme, e.g. `http://localhost` |
| `-n`, `--num-requests` | `50` | Maximum requests to send |
| `-d`, `--delay` | `0.1` | Seconds between requests |
| `-t`, `--timeout` | `10` | Per-request timeout in seconds |
The script stops and prints a summary as soon as a 429 is received.
### Examples
```bash
# Hit the anonymous threshold (35 req/min) — fire 40 requests with minimal delay
python scripts/test_ratelimiter.py http://localhost -n 40 -d 0.05
# Slower fire — check that legitimate traffic is not rate-limited
python scripts/test_ratelimiter.py http://localhost -n 20 -d 2
# Test against a staging pod via port-forward
rancher kubectl port-forward -n <NAMESPACE> deploy/sapl 8080:80 &
python scripts/test_ratelimiter.py http://localhost:8080 -n 40 -d 0.05
```
### Reading the output
```
Request 1: Status 200 | Time: 0.045s
...
Request 36: Status 429 | Time: 0.038s
-> Rate limited on request 36
Summary:
Total requests attempted: 36
Successful (200): 35
Rate limited (429): 1
First 429 occurred at request: 36
```
A first-429 near the configured anonymous threshold (35 req/min) confirms the
middleware is wired correctly. A first-429 much earlier points to nginx `limit_req`
firing before Django sees the request.
---
## Inject REDIS_URL into SAPL instances
`REDIS_URL` points at the shared instance:
```
redis://redis.sapl-redis.svc.cluster.local:6379
        ^^^^^ ^^^^^^^^^^
        svc   namespace
```
`start.sh` picks it up on every pod startup and sets the `REDIS_CACHE` waffle switch
automatically — no further intervention needed.
### Fleet-wide rollout
Uses the `app.kubernetes.io/name=sapl` pod label to discover every SAPL namespace
automatically — onboarding a new municipality requires no script changes.
```bash
for ns in $(rancher kubectl get pods -A -l app.kubernetes.io/name=sapl \
-o jsonpath='{.items[*].metadata.namespace}' | tr ' ' '\n' | sort -u); do
rancher kubectl set env deployment/sapl \
REDIS_URL=redis://redis.sapl-redis.svc.cluster.local:6379 \
-n $ns
done
```
### Roll back
```bash
for ns in $(rancher kubectl get pods -A -l app.kubernetes.io/name=sapl \
-o jsonpath='{.items[*].metadata.namespace}' | tr ' ' '\n' | sort -u); do
rancher kubectl set env deployment/sapl REDIS_URL- -n $ns
done
```
`kubectl set env deployment/sapl REDIS_URL-` (trailing `-`) removes the variable.
`start.sh` then falls back to file-based cache automatically.
---
## Monitor
### Pod and events
```bash
# Pod status
rancher kubectl -n sapl-redis get pods -l app=sapl-redis -o wide
# Deployment events (useful right after apply)
rancher kubectl -n sapl-redis describe deployment sapl-redis
# Pod events (OOMKill, restarts, etc.)
rancher kubectl -n sapl-redis describe pod -l app=sapl-redis
```
### Logs
```bash
# Tail live logs
rancher kubectl -n sapl-redis logs -f deploy/sapl-redis
# Last 100 lines
rancher kubectl -n sapl-redis logs deploy/sapl-redis --tail=100
```
### Redis INFO
```bash
# Memory usage
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli info memory \
| grep -E 'used_memory_human|maxmemory_human|mem_fragmentation_ratio'
# Connection pressure
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli info stats \
| grep -E 'rejected_connections|instantaneous_ops_per_sec'
# Key distribution per DB
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli info keyspace
# Recent slow queries
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli slowlog get 10
# Live command sampling (1-second window)
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli --latency-history -i 1
```
### Rate-limiter keys (DB 1)
```bash
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli -n 1 dbsize
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- \
redis-cli -n 1 --scan --pattern 'rl:ip:*' | head -20
```
---
## Seed the UA deny list (once after first deploy)
`rl:bot:ua:blocked` is a permanent Redis SET in DB 1. Each member is the
SHA-256 of a **UA token** — the identifying fragment extracted after splitting
on `/`, spaces, `;`, `(`, `)`, e.g.:
```
UA string: "GPTBot/1.1 (+https://openai.com/gptbot)"
Tokens: GPTBot 1.1 +https: ...
Hash stored: sha256("GPTBot")
```
The middleware (`_is_redis_blocked_ua`) tokenises the incoming UA the same
way and checks each token hash against the cached set. The SET is fetched
from Redis at most once per `RATE_LIMITER_UA_BLOCKLIST_REFRESH` seconds (default 60)
per worker process.
The bots in `BOT_UA_FRAGMENTS` (Python list, always active) and this Redis
SET are **independent** — the Python list provides the baseline and the Redis
SET allows adding new offenders at runtime **without a code deploy**.
```bash
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli -n 1 \
SADD rl:bot:ua:blocked \
"$(echo -n 'GPTBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'ClaudeBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'PerplexityBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'Bytespider' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'AhrefsBot' | sha256sum | cut -d' ' -f1)" \
"$(echo -n 'meta-externalagent' | sha256sum | cut -d' ' -f1)"
# Add a new offender at runtime (picked up within RATE_LIMITER_UA_BLOCKLIST_REFRESH seconds)
rancher kubectl exec -n sapl-redis deploy/sapl-redis -- redis-cli -n 1 \
SADD rl:bot:ua:blocked "$(echo -n 'NewBot' | sha256sum | cut -d' ' -f1)"
```
---
## Local standalone Redis (development / testing)
No Kubernetes? Run Redis directly with Docker:
```bash
sudo docker run --rm -p 6379:6379 redis:7-alpine \
redis-server --save "" --appendonly no
```
Then point Django at it by exporting the env var before starting the dev server:
```bash
export REDIS_URL="redis://localhost:6379"
export CACHE_BACKEND="redis"
python manage.py runserver
```
Or add them to your local `.env` file:
```
REDIS_URL=redis://localhost:6379
CACHE_BACKEND=redis
```
> **Note**: the waffle switch `REDIS_CACHE` must also be `on` in your local
> database for `start.sh` to activate the Redis backend. Run:
> ```bash
> python manage.py waffle_switch REDIS_CACHE on --create
> ```
---
## Update `redis.conf` without redeploying
```bash
# Edit the ConfigMap
rancher kubectl -n sapl-redis edit configmap redis-config
# Restart the pod to pick up the new config
rancher kubectl -n sapl-redis rollout restart deployment/sapl-redis
```
---
## Rate limiting — two layers, two jobs
SAPL enforces rate limits at two independent layers. They use different
algorithms and protect different things; their thresholds must be tuned
separately.
### Layer 1 — nginx `limit_req` (leaky bucket)
Defined in `docker/config/nginx/nginx.conf` (zones) and `sapl.conf` (burst).
```
sapl_general rate=30r/m # 1 token every 2 s
sapl_heavy rate=10r/m # 1 token every 6 s (PDF/report endpoints)
```
`burst=N nodelay` means nginx accepts up to N requests instantly above the
current token level, then enforces the drip rate. Requests beyond the burst
cap return 429 before reaching Gunicorn — **zero Python cost**.
Burst values are set at container startup via env vars:
| Env var | Default | Location |
|---------|---------|----------|
| `NGINX_BURST_GENERAL` | `60` | `location /`, `location /media/` |
| `NGINX_BURST_API` | `60` | `location /api/` |
| `NGINX_BURST_HEAVY` | `20` | `location /relatorios/` |
Defaults are 2× the zone's per-minute rate, so a user can spend a full
minute's quota in a single burst before the leaky bucket takes over.
### Layer 2 — Django `RateLimitMiddleware` (sliding window)
Defined in `sapl/middleware/ratelimit.py`, backed by Redis DB 1.
Requests that pass nginx reach Python. The middleware counts them in a
60-second sliding window per IP (anonymous) or per user (authenticated):
| Env var | Default | Scope |
|---------|---------|-------|
| `RATE_LIMITER_RATE` | `35/m` | Anonymous IP |
| `RATE_LIMITER_RATE_AUTHENTICATED` | `120/m` | Authenticated user |
| `RATE_LIMITER_RATE_BOT` | `5/m` | *(reserved — bots are currently blocked outright, not counted)* |
| `RATE_LIMITER_UA_BLOCKLIST_REFRESH` | `60` s | How often each worker re-fetches `rl:bot:ua:blocked` from Redis |
When the window count hits the threshold the IP/user is written to a Redis
blocked-set with a 300 s TTL and subsequent requests return 429 with
`Retry-After: 300` — without touching the database.
Decision flow inside `RateLimitMiddleware._evaluate()`:
```
1. IP in whitelist? → pass (no further checks)
1a. UA matches BOT_UA_FRAGMENTS list? → 429 reason=known_ua
1b. UA token hash in rl:bot:ua:blocked SET? → 429 reason=redis_ua
2. IP in rl:ip:{ip}:blocked? → 429 reason=ip_blocked
3. Authenticated user?
3a. User in rl:{ns}:user:{uid}:blocked? → 429 reason=user_blocked
3b. Suspicious headers (no Accept/AL)? → 429 reason=suspicious_headers_auth
3c. User request count ≥ auth threshold? → SET blocked, 429 reason=auth_user_rate
4. Anonymous:
4a. Suspicious headers? → 429 reason=suspicious_headers
4b. IP request count ≥ anon threshold? → SET blocked, 429 reason=ip_rate
4c. NS/IP window count ≥ anon threshold? → SET blocked, 429 reason=ua_rotation
→ pass
```
### Why they are not the same number
| | nginx burst | Django threshold |
|-|------------|-----------------|
| **Algorithm** | Leaky bucket — token refills over time | Sliding window — hard count per 60 s |
| **Protects** | Gunicorn workers from being flooded | Per-client fairness, business policy |
| **Tuned by** | Capacity of the server | Acceptable request volume per client |
| **Failure mode** | Workers overwhelmed | Legitimate user over-browsing |
A user loading a page quickly may fire 5–10 Django requests in two seconds.
With `rate=30r/m` (1 token/2 s) and `burst=60` they absorb that fine; the
leaky bucket refills before they click the next link. The Django threshold
(35/m sliding window) catches sustained automated traffic from a single IP
that looks like scraping even if it arrives slowly enough to beat the nginx
burst cap.
---
## Request routing — how nginx reaches Django
`proxy_pass http://sapl_server` forwards the HTTP request — with the original
path intact — to the Gunicorn Unix socket. Django doesn't know or care that
nginx is in front; it sees a standard HTTP request.
```
GET /media/foo.pdf
  nginx (sapl.conf)
    location /media/ → proxy_pass to Unix socket
  Gunicorn (WSGI server)
    receives raw HTTP, calls Django WSGI application
  Django middleware stack (settings.MIDDLEWARE)
    RateLimitMiddleware → pass or 429
  Django URL router (sapl/urls.py)
    r'^media/(?P<path>.*)$' → serve_media
  serve_media(request, path='foo.pdf')
    returns HttpResponse with X-Accel-Redirect: /_accel/media/foo.pdf
  nginx sees X-Accel-Redirect header
    /_accel/media/ internal location → reads file from disk → sends to client
```
nginx does no routing beyond picking a `location` block. The mapping from
URL path to Python function lives entirely in `sapl/urls.py`. `proxy_pass` is
just a pipe.
---
## Media file serving — `serve_media` and X-Accel-Redirect
All `/media/` requests (public and private) are routed through Gunicorn so that
Django middleware runs on every hit. Nginx serves the file bytes via
`X-Accel-Redirect` — the Gunicorn worker is freed as soon as it sends the
response headers.
### nginx locations (`docker/config/nginx/sapl.conf`)
```nginx
# Proxied to Gunicorn — Django middleware + serve_media() run here.
location /media/ {
    limit_req zone=sapl_general burst=${NGINX_BURST_GENERAL} nodelay;
    proxy_pass http://sapl_server;
}

# Internal — only reachable via X-Accel-Redirect, not by external clients.
location /_accel/media/ {
    internal;
    alias /var/interlegis/sapl/media/;
    sendfile on;
    etag on;
}
```
### Django view (`sapl/base/media.py`)
`serve_media(request, path)` — registered at `^media/(?P<path>.*)$` in `sapl/urls.py`.
Per-request steps:
1. **Path traversal guard** — `os.path.abspath` check; raises 404 on escape.
2. **Auth gate** — `documentos_privados/` paths require an authenticated session; redirects to login otherwise.
3. **Path counter** — increments `rl:{ns}:path:{sha256}:reqs` in Redis DB 1 (TTL = `MEDIA_PATH_COUNTER_TTL`).
4. **Content-type cache** — reads `file:{ns}:{sha256}` from Django default cache (DB 0); on miss, calls `mimetypes.guess_type`, stores result (TTL = `MEDIA_FILE_CACHE_TTL`).
5. **Serve** — in DEBUG: `django.views.static.serve` directly. In production: `X-Accel-Redirect: /_accel/media/<path>`.
### Settings
| Setting | Default | Purpose |
|---------|---------|---------|
| `FILE_META_KEY` | `'file:{ns}:{sha256}'` | Key template for content-type cache (DB 0) |
| `MEDIA_PATH_COUNTER_TTL` | `60` s | Per-path counter window |
| `MEDIA_FILE_CACHE_TTL` | `3600` s | Content-type metadata TTL |
---
## Key schema reference
| DB | Use case | Key pattern | TTL | Constant |
|----|----------|-------------|-----|----------|
| 0 | Page / view cache | `cache:{ns}:*` | 300 s (default) | `CACHES['default']` KEY_PREFIX |
| 0 | Static file cache (logos) | `static:{ns}:{sha256}` | 3 – 24 h | *Future* (requires OpenResty/Lua) |
| 0 | Media file content-type cache | `file:{ns}:{sha256}` | 1 h | `FILE_META_KEY` |
| 1 | IP rate-limit counter | `rl:ip:{ip}:reqs` | 60 s | `RL_IP_REQUESTS` |
| 1 | IP blocked marker | `rl:ip:{ip}:blocked` | 300 s | `RL_IP_BLOCKED` |
| 1 | User rate-limit counter | `rl:{ns}:user:{uid}:reqs` | 60 s | `RL_USER_REQUESTS` |
| 1 | User blocked marker | `rl:{ns}:user:{uid}:blocked` | 300 s | `RL_USER_BLOCKED` |
| 1 | Namespace/IP sliding window | `rl:{ns}:ip:{ip}:w:{bucket}` | 120 s | `RL_NS_WINDOW` |
| 1 | Path counter (`/media/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | `RL_PATH_REQUESTS` |
| 1 | Path counter (`/static/`) | `rl:{ns}:path:{sha256}:reqs` | 60 s | *Future* (requires OpenResty/Lua) |
| 1 | UA deny list | `rl:bot:ua:blocked` | permanent SET | `RL_UA_BLOCKLIST` |
| 2 | Django Channels | `channels:*` | session TTL | *Future* |

sapl/base/media.py (40)

@@ -4,34 +4,26 @@ serve_media — X-Accel-Redirect gate for all /media/ files.
 Production flow (nginx proxies /media/ to Gunicorn):
 1. Django middleware runs (IP rate-limit, bot UA check, etc.).
 2. serve_media() runs auth check for documentos_privados/, writes
-   per-path counter to Redis DB 1, caches content-type in Redis DB 0.
-3. Returns an empty 200 with X-Accel-Redirect pointing to the nginx
-   internal location /_accel/media/<path>. Nginx serves the bytes
-   directly from disk — Gunicorn worker is freed immediately.
+   URL-path counter to Redis DB 1, then returns X-Accel-Redirect.
+   Nginx serves the bytes directly from disk — Gunicorn worker freed immediately.
 Development flow (DEBUG=True, nginx absent):
 Falls back to django.views.static.serve for live file serving.
-Redis side-effects per request:
-DB 1 rl:{ns}:path:{sha256}:reqs per-path access counter, TTL=MEDIA_PATH_COUNTER_TTL
-DB 0 file:{ns}:{sha256} content-type metadata, TTL=MEDIA_FILE_CACHE_TTL
-(sha256 is of the URL path, e.g. sha256('/media/2024/01/doc.pdf'))
-Key template: FILE_META_KEY (sapl/middleware/ratelimit.py); TTLs in sapl/settings.py
+Redis side-effects per request (DB 1, TTL=MEDIA_PATH_COUNTER_TTL):
+rl:{ns}:path:{sha256('/media/<path>')}:reqs URL-path access counter
 """
 import hashlib
-import mimetypes
 import os
 from django.conf import settings
-from django.core.cache import caches
 from django.http import Http404, HttpResponse
 from django.views.static import serve
 from sapl import settings as sapl_settings
 from sapl.middleware.ratelimit import (
     _NAMESPACE,
-    FILE_META_KEY,
     RL_PATH_REQUESTS,
     _incr_with_ttl,
 )
@@ -65,31 +57,23 @@ def serve_media(request, path):
         from django.contrib.auth.views import redirect_to_login
         return redirect_to_login(request.get_full_path())
-    # Per-path rate counter (DB 1) — key uses URL path so that storage
-    # location changes in the next PR don't reset existing counters.
-    path_hash = hashlib.sha256(f'/media/{path}'.encode()).hexdigest()
+    # 404 before writing any counters.
+    if not os.path.isfile(abs_path):
+        raise Http404
+    # URL-path counter (DB 1).
     _incr_with_ttl(
-        RL_PATH_REQUESTS.format(ns=_NAMESPACE, sha256=path_hash),
+        RL_PATH_REQUESTS.format(ns=_NAMESPACE, sha256=hashlib.sha256(f'/media/{path}'.encode()).hexdigest()),
         ttl=sapl_settings.MEDIA_PATH_COUNTER_TTL,
     )
-    # Content-type metadata cache (DB 0) — avoids mimetypes.guess_type
-    # and os.path.isfile on every hit for hot files.
-    file_key = FILE_META_KEY.format(ns=_NAMESPACE, sha256=path_hash)
-    content_type = caches['default'].get(file_key)
-    if content_type is None:
-        if not os.path.isfile(abs_path):
-            raise Http404
-        guessed, _ = mimetypes.guess_type(abs_path)
-        content_type = guessed or 'application/octet-stream'
-        caches['default'].set(file_key, content_type, timeout=sapl_settings.MEDIA_FILE_CACHE_TTL)
     if settings.DEBUG:
         # Development: no nginx present; serve the file directly.
         return serve(request, path, document_root=settings.MEDIA_ROOT)
     # Production: tell nginx to serve the file from the internal location.
-    response = HttpResponse(content_type=content_type)
+    # Nginx sets Content-Type from its own mime.types when serving the file.
+    response = HttpResponse()
     response['X-Accel-Redirect'] = f'/_accel/media/{path}'
     response['Cache-Control'] = 'public, max-age=86400, stale-while-revalidate=3600'
     response['X-Robots-Tag'] = 'noindex'

sapl/middleware/ratelimit.py (9)

@@ -5,6 +5,7 @@ Decision flow (per request):
 1. Known bot UA? 429 (Python list substring match)
 1b. Redis UA deny list? 429 (runtime SET token hash match, refreshed every 60 s)
 2. IP in blocked set? 429
+2b. Path extension in scanner set? SET RL_IP_BLOCKED, 429
 3. Authenticated user?
 a. User blocked? 429
 b. Suspicious hdrs? 429
@@ -27,6 +28,7 @@ no per-request lookup is needed or correct.
 import hashlib
 import logging
+import os
 import re
 import time
@@ -55,7 +57,6 @@ RL_USER_BLOCKED = 'rl:{ns}:user:{uid}:blocked'
 RL_NS_WINDOW = 'rl:{ns}:ip:{ip}:w:{bucket}'
 RL_PATH_REQUESTS = 'rl:{ns}:path:{sha256}:reqs'
 RL_UA_BLOCKLIST = 'rl:bot:ua:blocked'  # permanent SET — runtime UA deny list
-FILE_META_KEY = 'file:{ns}:{sha256}'  # content-type metadata cache (DB 0)
 # ---------------------------------------------------------------------------
 # Bot UA fragments
@@ -260,6 +261,12 @@ class RateLimitMiddleware:
         if self._rl_cache.get(RL_IP_BLOCKED.format(ip=ip)):
             return {'action': 'block', 'reason': 'ip_blocked', 'ip': ip}
+        # Check 2b: scanner probe (e.g. .php, .asp) — Django never serves these.
+        ext = os.path.splitext(request.path)[1].lower()
+        if ext in settings.RATE_LIMIT_SCANNER_EXTENSIONS:
+            self._rl_cache.set(RL_IP_BLOCKED.format(ip=ip), 1, timeout=self.BLOCK_TTL)
+            return {'action': 'block', 'reason': 'scanner_probe', 'ip': ip}
         user = getattr(request, 'user', None)
         if user is not None and user.is_authenticated:
             return self._evaluate_authenticated(request, ip)

sapl/settings.py (15)

@@ -415,9 +415,20 @@ RATE_LIMIT_WHITELIST_IPS = config(
 # Lower values pick up new blocked UAs faster; higher values reduce Redis round-trips.
 RATE_LIMITER_UA_BLOCKLIST_REFRESH = config('RATE_LIMITER_UA_BLOCKLIST_REFRESH', default=60, cast=int)
+# File extensions that indicate a scanner probe (e.g. PHP/ASP app fingerprinting).
+# Requests for these extensions are blocked immediately and the IP is written to
+# rl:ip:{ip}:blocked for BLOCK_TTL seconds — Django never legitimately serves them.
+RATE_LIMIT_SCANNER_EXTENSIONS = frozenset(
+    config(
+        'RATE_LIMIT_SCANNER_EXTENSIONS',
+        default='.php .asp .aspx .jsp .cgi .env',
+        cast=lambda v: [x.strip() for x in v.split() if x.strip()],
+    )
+)
 # Media file serving — serve_media (sapl/base/media.py) via X-Accel-Redirect.
-MEDIA_PATH_COUNTER_TTL = config('MEDIA_PATH_COUNTER_TTL', default=60, cast=int)  # seconds — per-path counter window
-MEDIA_FILE_CACHE_TTL = config('MEDIA_FILE_CACHE_TTL', default=3600, cast=int)  # seconds — content-type metadata TTL
+# TTL for both URL-path and storage-path access counters (DB 1).
+MEDIA_PATH_COUNTER_TTL = config('MEDIA_PATH_COUNTER_TTL', default=60, cast=int)
 # ---------------------------------------------------------------------------
 # Anonymous page caching — AnonCachePageMixin (sapl/middleware/page_cache.py)
