mirror of https://github.com/interlegis/sapl.git
Import fixes (all three modules imported `get_client_ip`/`ratelimit_ip` from `sapl.utils`, which no longer exports them, causing the ImportError at startup):
- `sapl/materia/forms.py`: import `get_client_ip` from `sapl.middleware.ratelimit`
- `sapl/materia/views.py`: import `get_client_ip` + `ratelimit_ip` from `sapl.middleware.ratelimit`; keep `RATE_LIMITER_RATE` in `sapl.settings` (used by the `@ratelimit` decorators)
- `sapl/base/views.py`: same pattern as `materia/views.py`

Docs:
- `rate-limiter-v2.md`: remove the Phase 5 section (§8); renumber Open Questions to §8; update the Table of Contents
- `work_queues.md` (new): async PDF via Celery + a Django Channels WebSocket voting panel, with full context, the Redis B topology rationale, the k8s manifest list, and open questions. Planned start: after `rate-limiter-2026` is stable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Branch: `rate-limiter-2026`
7 changed files with 191 additions and 95 deletions
@@ -0,0 +1,173 @@
# SAPL — Work Queues & Real-Time: Async PDF + WebSocket Voting

> **Status**: Planned follow-up mini-project.
> **Prerequisite**: Redis A (cache + rate-limiter pod, `rate-limiter-2026` branch) must be
> deployed to production, stable, and OOM pressure confirmed reduced before starting this work.
> **Scope**: Django 2.2 / Gunicorn / Celery / Django Channels — same fleet of 1,200+ pods.

---

## Table of Contents

1. [Context & Motivation](#1-context--motivation)
2. [Redis Topology for Work Queues](#2-redis-topology-for-work-queues)
3. [Phase 1 — Async PDF via Celery](#3-phase-1--async-pdf-via-celery)
4. [Phase 2 — Django Channels (WebSocket Voting Panel)](#4-phase-2--django-channels-websocket-voting-panel)
5. [Open Questions](#5-open-questions)

---

## 1. Context & Motivation

After `rate-limiter-2026` ships:

| Remaining pain point | Current behaviour | Target |
|---|---|---|
| PDF generation | Holds a Gunicorn worker thread for the full build duration (10–60 s). Workers are capped at 400 MB — a single PDF request burns one worker slot for up to a minute | Enqueue via Celery; respond 202 immediately; the worker slot is freed |
| WebSocket voting panel | Not implemented; councillors use a polling page | Persistent connection via Django Channels backed by Redis |

---

## 2. Redis Topology for Work Queues

> **Critical constraint**: the Celery broker **must** be a **separate** Redis instance (Redis B)
> with the `noeviction` policy.
> Redis A (cache + rate-limiter) uses `allkeys-lru` — tasks enqueued there would be silently
> evicted under memory pressure, causing jobs to vanish without error.

| Instance | Role | Eviction policy | Persistence |
|---|---|---|---|
| **Redis A** (existing) | Page cache (DB0), rate limiter (DB1), Django Channels (DB2) | `allkeys-lru` | none |
| **Redis B** (new) | Celery broker + result backend | `noeviction` | AOF + RDB snapshot |

```yaml
# docker/k8s/redis-celery-configmap.yaml
data:
  redis.conf: |
    maxmemory-policy noeviction   # never evict tasks
    appendonly yes                # AOF persistence ON
    save 900 1                    # RDB snapshot every 15 min if >=1 change
    databases 2                   # DB0 = broker queue, DB1 = result backend
```
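The failure mode the constraint guards against can be illustrated without Redis at all. The sketch below is a toy in-memory model (the `LruStore`/`NoEvictionStore` classes and the tiny 3-key capacity are invented for illustration): under `allkeys-lru`, ordinary cache traffic silently evicts a queued task; under `noeviction`, the write fails loudly instead.

```python
from collections import OrderedDict

class LruStore:
    """Toy model of a Redis instance running maxmemory-policy allkeys-lru."""
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.max_keys:
            self.data.popitem(last=False)  # evict the least-recently-used key

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

class NoEvictionStore:
    """Toy model of maxmemory-policy noeviction: a full store rejects writes."""
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = {}

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.max_keys:
            raise MemoryError("OOM command not allowed: store is full")
        self.data[key] = value

lru = LruStore(max_keys=3)
lru.set("celery-task-1", "generate pdf 12345")  # the queued job
lru.set("cache:page:/a", "...")                 # cache traffic pours in
lru.set("cache:page:/b", "...")
lru.set("cache:page:/c", "...")
print(lru.get("celery-task-1"))  # → None: the job vanished without any error

strict = NoEvictionStore(max_keys=1)
strict.set("celery-task-1", "generate pdf 12345")
try:
    strict.set("cache:page:/a", "...")
except MemoryError as exc:
    print("write rejected:", exc)  # a loud failure instead of silent data loss
```

This is why mixing the cache and the broker on Redis A would lose jobs: eviction is indistinguishable from success from the producer's point of view.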

---

## 3. Phase 1 — Async PDF via Celery

### 3.1 Current (synchronous) flow

Holds worker memory for the entire PDF build:

```mermaid
sequenceDiagram
    participant B as Browser
    participant G as Gunicorn worker
    participant ORM as PostgreSQL
    participant RL as ReportLab

    B->>G: GET /pdf/materia/12345
    G->>ORM: N+1 queries (get_etiqueta_protocolos)
    ORM-->>G: data
    G->>RL: build entire PDF in RAM
    RL-->>G: PDF bytes (held in worker memory)
    G-->>B: stream response
    note over G: worker blocked for full duration
```

### 3.2 Target (async) flow

Worker freed immediately after enqueueing:

```mermaid
sequenceDiagram
    participant B as Browser
    participant G as Gunicorn worker
    participant Q as Redis B (Celery queue)
    participant W as Celery worker
    participant D as Disk / nginx

    B->>G: POST /pdf/materia/12345
    G->>Q: enqueue task
    G-->>B: 202 Accepted + task_id
    W->>W: build PDF (out of band)
    W->>D: write PDF to /media/pdf/task_id.pdf
    B->>G: GET /pdf/status/task_id
    G-->>B: 302 → nginx /media/pdf/task_id.pdf
```
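The enqueue-then-poll contract in the diagram can be sketched in plain Python. This is a self-contained simulation, not the real implementation: `enqueue_pdf`, `poll_status`, and `build_pdf` are hypothetical names, and a thread pool plus a dict stand in for Redis B and the Celery result backend.

```python
import concurrent.futures
import time
import uuid

# Stand-ins for Redis B (queue) and the result backend.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
tasks = {}  # task_id -> Future

def build_pdf(materia_id):
    """Stand-in for the ReportLab build running out of band in a Celery worker."""
    time.sleep(0.1)  # pretend this takes 10-60 s in production
    return f"/media/pdf/{materia_id}.pdf"

def enqueue_pdf(materia_id):
    """The view's only job: enqueue and answer immediately (HTTP 202)."""
    task_id = uuid.uuid4().hex
    tasks[task_id] = executor.submit(build_pdf, materia_id)
    return {"status": 202, "task_id": task_id}

def poll_status(task_id):
    """The status endpoint: redirect to the finished file, else 202 again."""
    future = tasks[task_id]
    if future.done():
        return {"status": 302, "location": future.result()}
    return {"status": 202}  # still building; the browser retries later

resp = enqueue_pdf(12345)                  # returns instantly: the slot is freed
tasks[resp["task_id"]].result()            # wait (the real browser would poll)
print(poll_status(resp["task_id"]))        # → {'status': 302, 'location': '/media/pdf/12345.pdf'}
```

The key property mirrored here is that the request handler never blocks on the build; it only hands out a `task_id` and answers status queries.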

### 3.3 Celery settings

```python
# sapl/settings.py additions
CELERY_BROKER_URL = config('CELERY_BROKER_URL', default='')
CELERY_RESULT_BACKEND = config('CELERY_RESULT_BACKEND', default='')

# Recycle a Celery worker child once it exceeds 400 MB of resident memory
# (the setting is expressed in KB). This keeps Celery workers inside the same
# 400 MB memory envelope as the Gunicorn workers. The check runs between
# tasks, so a single runaway task is bounded by the time limits below.
CELERY_WORKER_MAX_MEMORY_PER_CHILD = 400 * 1024  # KB
CELERY_TASK_SOFT_TIME_LIMIT = 120  # seconds — raises SoftTimeLimitExceeded inside the task
CELERY_TASK_TIME_LIMIT = 180       # seconds — hard limit, the worker is SIGKILLed
```

### 3.4 k8s manifests

New files to be created under `docker/k8s/`:

- `redis-celery-configmap.yaml` — Redis B config (noeviction, AOF)
- `redis-celery-deployment.yaml` — single-replica Redis B pod
- `redis-celery-service.yaml` — ClusterIP service
- `celery-deployment.yaml` — Celery worker deployment (same image as SAPL)

### 3.5 Environment variables (per-namespace Secret)

| Variable | Example value | Notes |
|---|---|---|
| `CELERY_BROKER_URL` | `redis://sapl-redis-celery.redis.svc:6379/0` | Redis B, DB0 |
| `CELERY_RESULT_BACKEND` | `redis://sapl-redis-celery.redis.svc:6379/1` | Redis B, DB1 |
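
The DB0/DB1 split above is easy to get wrong when copying Secrets between namespaces. A small sanity check (a sketch; `check_celery_redis_split` is a hypothetical helper, not existing SAPL code) can verify that both URLs point at the same Redis B host but at distinct logical databases:

```python
from urllib.parse import urlparse

def check_celery_redis_split(broker_url, backend_url):
    """Verify broker and result backend share one Redis host, distinct DBs."""
    broker, backend = urlparse(broker_url), urlparse(backend_url)
    assert broker.scheme == backend.scheme == "redis", "expected redis:// URLs"
    assert broker.netloc == backend.netloc, "broker and backend should both be Redis B"
    broker_db = int(broker.path.lstrip("/") or 0)    # path "/0" -> DB 0
    backend_db = int(backend.path.lstrip("/") or 0)  # path "/1" -> DB 1
    assert broker_db != backend_db, "broker and backend must use different DBs"
    return broker_db, backend_db

print(check_celery_redis_split(
    "redis://sapl-redis-celery.redis.svc:6379/0",
    "redis://sapl-redis-celery.redis.svc:6379/1",
))  # → (0, 1)
```

Such a check could run at Django startup so a misconfigured namespace fails fast instead of mixing queue entries with results.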

---

## 4. Phase 2 — Django Channels (WebSocket Voting Panel)

Uses **Redis A DB2** (reserved in the existing key-layout table — no new infra needed beyond
what ships in `rate-limiter-2026`).

### 4.1 Channel layer settings

```python
# sapl/settings.py additions
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # DB2 is reserved for Channels (see rate-limiter-v2.md §0.2).
            # channels_redis selects the database from the host URL; there
            # is no separate "db" key in CONFIG.
            "hosts": ["redis://sapl-redis.redis.svc.cluster.local:6379/2"],
            "capacity": 1500,  # max messages buffered per channel
            "expiry": 10,      # seconds before an undelivered message is dropped
        },
    }
}
```
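
What the channel layer buys the voting panel is group broadcast: every connected councillor's socket joins one group, and a single `group_send` fans out to all of them. The sketch below models that semantics in-process (a toy stand-in for `channels_redis`; the class, the group name `votacao-sessao-77`, and the channel names are invented for illustration):

```python
import asyncio

class InMemoryChannelLayer:
    """Toy channel layer: per-channel queues plus named broadcast groups."""
    def __init__(self):
        self.channels = {}  # channel name -> asyncio.Queue
        self.groups = {}    # group name -> set of channel names

    def _queue(self, channel):
        return self.channels.setdefault(channel, asyncio.Queue())

    async def group_add(self, group, channel):
        self.groups.setdefault(group, set()).add(channel)

    async def group_send(self, group, message):
        # Deliver one message to every channel currently in the group.
        for channel in self.groups.get(group, ()):
            await self._queue(channel).put(message)

    async def receive(self, channel):
        return await self._queue(channel).get()

async def main():
    layer = InMemoryChannelLayer()
    # Two councillors' browsers join the voting-panel group.
    await layer.group_add("votacao-sessao-77", "ws.conn.a")
    await layer.group_add("votacao-sessao-77", "ws.conn.b")
    # The tally view broadcasts once; every group member receives it.
    await layer.group_send("votacao-sessao-77",
                           {"type": "vote.update", "sim": 12, "nao": 5})
    return [await layer.receive(c) for c in ("ws.conn.a", "ws.conn.b")]

print(asyncio.run(main()))  # both connections got the same tally message
```

In production, `channels_redis` provides exactly this `group_add`/`group_send` interface, with Redis A DB2 holding the queues so the broadcast works across all 1,200+ pods rather than inside one process.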

### 4.2 Prerequisites before starting

- [ ] Redis A stable in production (rate limiter + cache confirmed working)
- [ ] OOM kill rate reduced to near-zero
- [ ] Bot siege resolved (Phase 0–2 metrics reviewed)
- [ ] Decision on ASGI server (Daphne vs Uvicorn + Channels) — Gunicorn alone cannot serve WebSockets

---

## 5. Open Questions

| # | Question | Blocks |
|---|---|---|
| 1 | Which PDF endpoints are highest priority for async migration (`/relatorios/`, `/materia/pdf/`, others)? | Phase 1 scope |
| 2 | Should the Celery worker run in the same pod as Gunicorn (sidecar) or as a dedicated deployment? | Phase 1 k8s design |
| 3 | Result backend TTL — how long should generated PDFs be retained before cleanup? | Phase 1 storage design |
| 4 | ASGI server selection for Channels (Daphne vs Uvicorn + Channels) | Phase 2 |
| 5 | WebSocket voting panel: is per-session or per-pod state acceptable? | Phase 2 architecture |

---

*Planned work — begins after `rate-limiter-2026` is stable in production.*