mirror of https://github.com/interlegis/sapl.git
Import fixes (all three imported get_client_ip/ratelimit_ip from sapl.utils, which no longer exports them, causing the ImportError at startup):
- sapl/materia/forms.py: move get_client_ip to sapl.middleware.ratelimit
- sapl/materia/views.py: move get_client_ip + ratelimit_ip; keep RATE_LIMITER_RATE in sapl.settings (used by @ratelimit decorators)
- sapl/base/views.py: same pattern as materia/views.py

Docs:
- rate-limiter-v2.md: remove Phase 5 section (§8); renumber Open Questions to §8; update Table of Contents
- work_queues.md (new): Async PDF via Celery + Django Channels WebSocket voting panel, with full context, Redis B topology rationale, k8s manifest list, and open questions. Planned start: after rate-limiter-2026 is stable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rate-limiter-2026
7 changed files with 191 additions and 95 deletions
@ -0,0 +1,173 @@
# SAPL — Work Queues & Real-Time: Async PDF + WebSocket Voting

> **Status**: Planned follow-up mini-project.
> **Prerequisite**: Redis A (cache + rate-limiter pod, `rate-limiter-2026` branch) must be
> deployed to production and stable, with reduced OOM pressure confirmed, before this work starts.
> **Scope**: Django 2.2 / Gunicorn / Celery / Django Channels, on the same fleet of 1,200+ pods.

---

## Table of Contents

1. [Context & Motivation](#1-context--motivation)
2. [Redis Topology for Work Queues](#2-redis-topology-for-work-queues)
3. [Phase 1 — Async PDF via Celery](#3-phase-1--async-pdf-via-celery)
4. [Phase 2 — Django Channels (WebSocket Voting Panel)](#4-phase-2--django-channels-websocket-voting-panel)
5. [Open Questions](#5-open-questions)

---

## 1. Context & Motivation

Pain points that remain after `rate-limiter-2026` ships:

| Remaining pain point | Current behaviour | Target |
|---|---|---|
| PDF generation | Holds a Gunicorn worker thread for the full build duration (10–60 s). Workers are capped at 400 MB, so each PDF request ties up one slot for up to a minute | Enqueue via Celery; respond 202 immediately; worker is freed |
| WebSocket voting panel | Not implemented; councillors use a polling page | Persistent connection via Django Channels backed by Redis |

---

## 2. Redis Topology for Work Queues

> **Critical constraint**: The Celery broker **must** be a **separate** Redis instance (Redis B)
> with the `noeviction` policy.
> Redis A (cache + rate-limiter) uses `allkeys-lru`, so tasks enqueued there could be silently
> evicted under memory pressure, and jobs would vanish without error.

| Instance | Role | Eviction policy | Persistence |
|---|---|---|---|
| **Redis A** (existing) | Page cache (DB0), rate limiter (DB1), Django Channels (DB2) | `allkeys-lru` | none |
| **Redis B** (new) | Celery broker + result backend | `noeviction` | AOF + RDB snapshot |

```yaml
# docker/k8s/redis-celery-configmap.yaml
data:
  redis.conf: |
    # redis.conf does not accept trailing comments, so each note sits on its own line.
    # Never evict queued tasks.
    maxmemory-policy noeviction
    # AOF persistence on.
    appendonly yes
    # RDB snapshot every 15 min if at least 1 key changed.
    save 900 1
    # DB0 = broker queue, DB1 = result backend.
    databases 2
```
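Because a broker running under an evicting policy loses tasks without raising any error, a startup check can make the constraint explicit. The following is a minimal sketch and not part of the branch: `require_noeviction` and `FakeRedis` are hypothetical names, and the helper assumes a redis-py style client whose `config_get()` returns a dict of settings. The fake client only exists so the sketch runs without a server.

```python
def require_noeviction(client):
    """Raise if the broker Redis could evict queued Celery tasks.

    `client` is assumed to behave like a redis-py client: config_get(pattern)
    returns a dict mapping setting names to values.
    """
    settings = client.config_get("maxmemory-policy")
    policy = settings.get("maxmemory-policy")
    if isinstance(policy, bytes):  # some client configurations return bytes
        policy = policy.decode()
    if policy != "noeviction":
        raise RuntimeError(
            "Celery broker Redis uses maxmemory-policy=%r; expected 'noeviction'" % policy
        )
    return policy


class FakeRedis:
    """Hypothetical stand-in for a real client, used only for illustration."""

    def __init__(self, policy):
        self._policy = policy

    def config_get(self, pattern):
        return {pattern: self._policy}


if __name__ == "__main__":
    print(require_noeviction(FakeRedis("noeviction")))
```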

---

## 3. Phase 1 — Async PDF via Celery

### 3.1 Current (synchronous) flow

The Gunicorn worker and its memory are held for the entire PDF build:

```mermaid
sequenceDiagram
    participant B as Browser
    participant G as Gunicorn worker
    participant ORM as PostgreSQL
    participant RL as ReportLab

    B->>G: GET /pdf/materia/12345
    G->>ORM: N+1 queries (get_etiqueta_protocolos)
    ORM-->>G: data
    G->>RL: build entire PDF in RAM
    RL-->>G: PDF bytes (held in worker memory)
    G-->>B: stream response
    note over G: worker blocked for full duration
```

### 3.2 Target (async) flow

The Gunicorn worker is freed as soon as the task is enqueued:

```mermaid
sequenceDiagram
    participant B as Browser
    participant G as Gunicorn worker
    participant Q as Redis B (Celery queue)
    participant W as Celery worker
    participant D as Disk / nginx

    B->>G: POST /pdf/materia/12345
    G->>Q: enqueue task
    G-->>B: 202 Accepted + task_id
    W->>W: build PDF (out of band)
    W->>D: write PDF to /media/pdf/task_id.pdf
    B->>G: GET /pdf/status/task_id
    G-->>B: 302 → nginx /media/pdf/task_id.pdf
```
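The polling half of this flow can be sketched as a pure function that maps a Celery task state to the HTTP response the status endpoint would return. This is an illustration under assumptions: `status_response`, `MEDIA_PDF_PREFIX`, and the `Retry-After` hint are hypothetical, the state names are Celery's built-in ones, and answering 500 on failure is one possible choice that the design does not specify.

```python
MEDIA_PDF_PREFIX = "/media/pdf/"  # nginx-served location taken from the diagram

# Celery's built-in states that mean "not finished yet".
# Note: Celery also reports PENDING for unknown task ids, so a real view
# may need its own bookkeeping to tell "queued" from "never existed".
IN_PROGRESS_STATES = {"PENDING", "RECEIVED", "STARTED", "RETRY"}


def status_response(task_id, state):
    """Return (status_code, headers) for GET /pdf/status/<task_id>."""
    if state == "SUCCESS":
        # The PDF is on disk; hand the download over to nginx.
        return 302, {"Location": "%s%s.pdf" % (MEDIA_PDF_PREFIX, task_id)}
    if state in IN_PROGRESS_STATES:
        # Still queued or building; the browser should poll again.
        return 202, {"Retry-After": "2"}
    # FAILURE, REVOKED, or anything unexpected.
    return 500, {}


if __name__ == "__main__":
    for state in ("PENDING", "STARTED", "SUCCESS", "FAILURE"):
        print(state, status_response("abc123", state))
```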

### 3.3 Celery settings

```python
# sapl/settings.py additions
# Assumes `config` (python-decouple) is already imported in settings.py and that the
# Celery app loads Django settings with namespace='CELERY'.
CELERY_BROKER_URL = config('CELERY_BROKER_URL', default='')
CELERY_RESULT_BACKEND = config('CELERY_RESULT_BACKEND', default='')

# Recycle a worker child once its resident memory exceeds 400 MB (value is in KiB).
# The child is replaced after its current task finishes, which keeps Celery workers
# inside the same memory envelope as Gunicorn workers.
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 400 * 1024  # KiB
CELERY_TASK_SOFT_TIME_LIMIT = 120  # seconds; raises SoftTimeLimitExceeded inside the task
CELERY_TASK_TIME_LIMIT = 180       # seconds; the worker process is killed and replaced
```
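The units in these settings are easy to mix up: the memory limit is expressed in KiB and the time limits in seconds. A small sanity check, sketched below under assumptions, makes the intended relationships explicit. `check_celery_limits` is a hypothetical helper, and its 400 MB default mirrors the Gunicorn worker cap mentioned in §1.

```python
def check_celery_limits(max_memory_kib, soft_limit_s, hard_limit_s, gunicorn_cap_mb=400):
    """Return a list of problems; an empty list means the limits are coherent."""
    problems = []
    max_memory_mb = max_memory_kib / 1024  # KiB -> MiB
    if max_memory_mb > gunicorn_cap_mb:
        problems.append(
            "memory limit %.0f MB exceeds the %d MB Gunicorn cap"
            % (max_memory_mb, gunicorn_cap_mb)
        )
    if soft_limit_s >= hard_limit_s:
        # The soft limit must fire first so the task can clean up before it is killed.
        problems.append("soft time limit must be lower than the hard time limit")
    return problems


if __name__ == "__main__":
    # Values taken from the Celery settings in this section.
    print(check_celery_limits(400 * 1024, 120, 180))
```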

### 3.4 k8s manifests

New files to be created under `docker/k8s/`:

- `redis-celery-configmap.yaml`: Redis B config (noeviction, AOF)
- `redis-celery-deployment.yaml`: single-replica Redis B pod
- `redis-celery-service.yaml`: ClusterIP service
- `celery-deployment.yaml`: Celery worker deployment (same image as SAPL)

### 3.5 Environment variables (per-namespace Secret)

| Variable | Example value | Notes |
|---|---|---|
| `CELERY_BROKER_URL` | `redis://sapl-redis-celery.redis.svc:6379/0` | Redis B, DB0 |
| `CELERY_RESULT_BACKEND` | `redis://sapl-redis-celery.redis.svc:6379/1` | Redis B, DB1 |
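
Both variables point at the same Redis B instance and differ only in the database index, so a misconfigured Secret is easy to miss. The sketch below shows one way to check the pair using only the standard library. `redis_db_index` and `check_celery_urls` are hypothetical helpers, and the same-host rule is an assumption drawn from the example values in the table.

```python
from urllib.parse import urlparse


def redis_db_index(url):
    """Extract the database index from a redis:// URL (defaults to 0)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("redis", "rediss"):
        raise ValueError("not a Redis URL: %r" % url)
    path = parsed.path.lstrip("/")
    return int(path) if path else 0


def check_celery_urls(broker_url, result_backend_url):
    """Return (broker_db, result_db) after checking host and database separation."""
    broker, backend = urlparse(broker_url), urlparse(result_backend_url)
    if (broker.hostname, broker.port) != (backend.hostname, backend.port):
        raise ValueError("broker and result backend should both point at Redis B")
    broker_db = redis_db_index(broker_url)
    result_db = redis_db_index(result_backend_url)
    if broker_db == result_db:
        raise ValueError("broker and result backend must use different databases")
    return broker_db, result_db


if __name__ == "__main__":
    print(check_celery_urls(
        "redis://sapl-redis-celery.redis.svc:6379/0",
        "redis://sapl-redis-celery.redis.svc:6379/1",
    ))
```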

---

## 4. Phase 2 — Django Channels (WebSocket Voting Panel)

Uses **Redis A DB2**, which is reserved in the existing key-layout table, so no new infrastructure
is needed beyond what ships in `rate-limiter-2026`.

### 4.1 Channel layer settings

```python
# sapl/settings.py additions
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # channels_redis takes the database index from the host URL, not from a
            # separate "db" key. DB2 is reserved for channels (see rate-limiter-v2.md §0.2).
            "hosts": ["redis://sapl-redis.redis.svc.cluster.local:6379/2"],
            "capacity": 1500,
            "expiry": 10,
        },
    }
}
```

### 4.2 Prerequisites before starting

- [ ] Redis A stable in production (rate limiter + cache confirmed working)
- [ ] OOM kill rate reduced to near-zero
- [ ] Bot siege resolved (Phase 0–2 metrics reviewed)
- [ ] Decision on ASGI server (Daphne vs Uvicorn + channels); Gunicorn alone cannot serve WebSockets

---

## 5. Open Questions

| # | Question | Blocks |
|---|---|---|
| 1 | Which PDF endpoints are highest priority for async migration (`/relatorios/`, `/materia/pdf/`, other)? | Phase 1 scope |
| 2 | Should the Celery worker run in the same pod as Gunicorn (sidecar) or a dedicated deployment? | Phase 1 k8s design |
| 3 | Result backend TTL: how long should generated PDFs be retained before cleanup? | Phase 1 storage design |
| 4 | ASGI server selection for Channels (Daphne vs Uvicorn + channels) | Phase 2 |
| 5 | WebSocket voting panel: is per-session or per-pod state acceptable? | Phase 2 architecture |

---

*Planned work: begins after `rate-limiter-2026` is stable in production.*