
feat: add new fields to FileMetadata

file-metafields
Edward Ribeiro 2 weeks ago
commit cdca221f58
Changed files:

1. .gitignore (2)
2. CLAUDE.md (151)
3. docker/docker-compose.yaml (8)
4. docker/startup_scripts/start.sh (2)
5. sapl/base/fields.py (297)
6. sapl/base/management/commands/backfill_file_metadata.py (149)
7. sapl/base/management/commands/backfill_file_metadata_hashes.py (111)
8. sapl/base/management/commands/backfill_file_metadata_structural.py (153)
9. sapl/base/migrations/0062_filemetadata_owner_fields.py (43)
10. sapl/base/models.py (11)
11. sapl/base/views.py (119)

.gitignore (2)

@@ -110,3 +110,5 @@ media/*
!media/.gitkeep
restauracoes/*
.claude/

CLAUDE.md (151)

@@ -0,0 +1,151 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SAPL (Sistema de Apoio ao Processo Legislativo) is a Django-based legislative management system used by Brazilian municipal and state legislative houses. It manages bills, parliamentary sessions, committees, norms, protocols, and related legislative workflows.
## Design references
@/Users/eribeiro/projects/sapl-docs/rfc/files-metafields-impl.md
## Commands
### Development
```bash
# Run dev server
python manage.py runserver
# Docker (dev, without bundled DB)
docker-compose -f docker/docker-compose-dev.yml up
# Docker (dev, with PostgreSQL container)
docker-compose -f docker/docker-compose-dev-db.yml up
```
### Database Setup (local PostgreSQL)
```bash
sudo -u postgres psql -c "CREATE ROLE sapl LOGIN ENCRYPTED PASSWORD 'sapl' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE NOREPLICATION;"
sudo -u postgres psql -c "CREATE DATABASE sapl WITH OWNER=sapl ENCODING='UTF8' LC_COLLATE='pt_BR.UTF-8' LC_CTYPE='pt_BR.UTF-8' CONNECTION LIMIT=-1 TEMPLATE template0;"
python manage.py migrate
```
### Testing
```bash
# All tests (reuses DB by default for speed)
pytest
# Single test file or test function
pytest sapl/materia/tests/test_materia.py
pytest sapl/materia/tests/test_materia.py::test_function_name
# Force DB recreation
pytest --create-db
# With coverage
pytest --cov=sapl
```
Tests require `DJANGO_SETTINGS_MODULE=sapl.settings` (set in `pytest.ini`). All tests must be marked with `@pytest.mark.django_db`. The `conftest.py` root fixture provides an `app` fixture (WebTest `DjangoTestApp`).
### Linting / Formatting
```bash
flake8 .
isort .
autopep8 --in-place <file.py>
```
### Restore Database from Backup
```bash
./scripts/restore_db.sh -f /path/to/dump
./scripts/restore_db.sh -f /path/to/dump -p 5433 # Docker port
```
## Architecture
### Django Apps
Apps are under `sapl/` and follow domain boundaries:
| App | Domain |
|-----|--------|
| `base` | `CasaLegislativa` (legislative house config), `AppConfig`, `Autor` (authorship) |
| `parlamentares` | `Parlamentar`, `Legislatura`, `SessaoLegislativa`, `Coligacao` |
| `materia` | Bills (`MateriaLegislativa`), types, tracking, annexes |
| `norma` | Laws/norms (`NormaJuridica`) and hierarchies |
| `sessao` | Plenary sessions, agenda, attendance, voting |
| `comissoes` | Committees (`Comissao`) and meetings (`Reuniao`) |
| `protocoloadm` | Administrative protocols and document intake |
| `compilacao` | Structured/articulated texts (LexML-like tree structure) |
| `lexml` | LexML XML standard integration |
| `audiencia` | Public hearings |
| `painel` | Real-time session display panel |
| `relatorios` | PDF report generation |
| `api` | REST API entry point (auto-generated ViewSets) |
| `crud` | Generic CRUD base views |
| `rules` | Business rules and permission definitions |
### REST API
The API uses a custom `drfautoapi` package (`drfautoapi/drfautoapi.py`) that auto-generates DRF ViewSets, Serializers, and FilterSets from Django models. Authentication is Token + Session. Permissions use a custom `SaplModelPermissions` class that maps HTTP methods to Django model permissions.
OpenAPI 3.0 docs are generated by drf-spectacular.
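A token-authenticated request against one of the auto-generated endpoints might look like this (sketch; the path is illustrative, since actual routes depend on the drfautoapi registration):
```python
import requests

# Illustrative only — the exact route depends on drfautoapi's ViewSet registration.
BASE = 'http://localhost:8000'
resp = requests.get(
    f'{BASE}/api/materia/materialegislativa/',
    headers={'Authorization': 'Token <token>'},  # standard DRF TokenAuthentication header
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```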
### Caching
- **Default:** File-based (`/var/tmp/django_cache`)
- **Production:** Redis via django-redis; configured at startup by `configure_redis_cache()` in `sapl/settings.py`
- **Cache key prefix:** `cache:{POD_NAMESPACE}:` (namespace-isolated for multi-tenant k8s)
- **Rate limiter state** is shared via Redis keys
### Feature Flags
django-waffle is used for feature flags. Switches (global on/off) can be toggled via:
```bash
python manage.py waffle_switch <switch_name> on|off
```
### Key Environment Variables
| Variable | Purpose |
|----------|---------|
| `DATABASE_URL` | PostgreSQL connection string |
| `SECRET_KEY` | Django secret key |
| `DEBUG` | Debug mode |
| `REDIS_URL` | Redis host:port |
| `CACHE_BACKEND` | `file` or `redis` |
| `POD_NAMESPACE` | K8s namespace (used in cache key prefix) |
| `USE_SOLR` | Enable Haystack/Solr full-text search |
| `SOLR_URL` / `SOLR_COLLECTION` | Solr connection |
### Docker Build
The production build requires a MaxMind GeoLite2-ASN license key (for nginx ASN-based bot blocking):
```bash
docker build --secret id=maxmind_key,src=.env -f docker/Dockerfile -t sapl:local .
```
Optional build args: `WITH_NGINX`, `WITH_GRAPHVIZ`, `WITH_POPPLER`, `WITH_PSQL_CLIENT`.
### Key File Locations
| File | Purpose |
|------|---------|
| `sapl/settings.py` | All Django settings, including cache/rate-limit setup |
| `pytest.ini` | Test configuration (DJANGO_SETTINGS_MODULE, addopts) |
| `conftest.py` | Root pytest fixtures |
| `drfautoapi/drfautoapi.py` | Auto-API generation logic |
| `docker/startup_scripts/start.sh` | Container entrypoint (migrations, waffle, gunicorn) |
| `requirements/requirements.txt` | Production deps |
| `requirements/test-requirements.txt` | Test deps |
| `requirements/dev-requirements.txt` | Dev/lint deps |

docker/docker-compose.yaml (8)

@@ -33,10 +33,10 @@ services:
networks:
- sapl-net
sapl:
image: interlegis/sapl:3.1.165-RC2
# build:
# context: ../
# dockerfile: ./docker/Dockerfile
# image: interlegis/sapl:3.1.165-RC2
build:
context: ../
dockerfile: ./docker/Dockerfile
container_name: sapl
labels:
NAME: "sapl"

docker/startup_scripts/start.sh (2)

@@ -273,7 +273,7 @@ main() {
# deployed. Runs as a background job so pod startup is not delayed.
# Must be after migrate_db so the base_file_metadata table exists.
# Once all instances have been fully backfilled this line can be removed.
python3 manage.py backfill_file_metadata --rate-limit=20 &
python3 manage.py backfill_file_metadata_structural &
configure_solr || true
configure_sapn
create_admin

sapl/base/fields.py (297)

@@ -1,13 +1,24 @@
import hashlib
import logging
import posixpath
import unicodedata
from pathlib import Path
from urllib.parse import quote
from uuid import uuid4
from django.core.files import File
import magic
from django.core.exceptions import ValidationError
from django.core.files.storage import default_storage
from django.db import models
from django.db import models, transaction
from django.db.models.fields.files import FieldFile
from django.db.models.signals import post_save
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# MetadataFieldFile — FieldFile subclass with semantic URL support
# ---------------------------------------------------------------------------
class MetadataFieldFile(FieldFile):
"""
@@ -28,8 +39,6 @@ class MetadataFieldFile(FieldFile):
meta = getattr(self.instance, meta_attr, None)
if meta and meta.original_filename:
return meta.original_filename
# Fallback: basename of storage path (may be UUID for newly uploaded files
# whose metadata row hasn't been committed yet).
return Path(self.name).name if self.name else ''
@property
@@ -41,16 +50,12 @@
meta_attr = f'{self.field.name}_metadata'
meta = getattr(instance, meta_attr, None)
# Fallback: no metadata row yet (pre-backfill existing file or first save
# before commit) → return the raw storage URL so nothing breaks.
if meta is None:
return self.storage.url(self.name)
pk = getattr(instance, 'pk', None)
if pk is not None:
# Saved instance — return the semantic alias.
# Lazy import avoids a circular dependency at module load time.
from django.urls import reverse
return reverse(
'serve_model_file',
@@ -62,16 +67,112 @@
},
)
# Unsaved instance — return canonical UUID form.
return f'/documentos/{meta.uuid}/'
def _compute_size_and_hash(field_file):
"""
Read the file content once to compute size and SHA-256 digest.
field_file must be open-able via field_file.open().
Returns (size_in_bytes, hex_digest).
"""
# ---------------------------------------------------------------------------
# Filename sanitization
# ---------------------------------------------------------------------------
def _sanitize_filename(name: str) -> str:
name = name.strip()
name = ''.join(c for c in name if ord(c) >= 0x20 and ord(c) != 0x7F)
name = unicodedata.normalize('NFC', name)
name = ' '.join(name.split())
name = name[:255]
if not name:
name = 'untitled'
return name
# ---------------------------------------------------------------------------
# MIME validation
# ---------------------------------------------------------------------------
FIELD_ALLOWED_TYPES = {
('materia', 'materialegislativa', 'texto_original'): frozenset([
'application/pdf', 'application/msword',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'application/rtf', 'text/plain',
]),
('materia', 'documentoacessorio', 'arquivo'): frozenset([
'application/pdf', 'application/msword',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'image/jpeg', 'image/png', 'image/tiff',
]),
('materia', 'proposicao', 'texto_original'): frozenset([
'application/pdf', 'application/msword',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
]),
('protocoloadm', 'documentoadministrativo', 'texto_integral'): frozenset([
'application/pdf',
]),
('protocoloadm', 'documentoacessorioadministrativo', 'arquivo'): frozenset([
'application/pdf', 'image/jpeg', 'image/png', 'image/tiff',
]),
('norma', 'normajuridica', 'texto_integral'): frozenset([
'application/pdf', 'application/msword',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
]),
('norma', 'anexonormajuridica', 'anexo_arquivo'): frozenset([
'application/pdf', 'image/jpeg', 'image/png',
]),
**{
(app, model, field): frozenset(['application/pdf', 'image/jpeg', 'image/png', 'image/tiff'])
for app, model, field in [
('sessao', 'sessaoplenaria', 'upload_pauta'),
('sessao', 'sessaoplenaria', 'upload_ata'),
('sessao', 'sessaoplenaria', 'upload_anexo'),
('sessao', 'justificativaausencia', 'upload_anexo'),
('comissoes', 'reuniao', 'upload_pauta'),
('comissoes', 'reuniao', 'upload_ata'),
('comissoes', 'reuniao', 'upload_anexo'),
('comissoes', 'documentoacessorio', 'arquivo'),
('audiencia', 'audienciapublica', 'upload_pauta'),
('audiencia', 'audienciapublica', 'upload_ata'),
('audiencia', 'audienciapublica', 'upload_anexo'),
('audiencia', 'anexoaudienciapublica', 'arquivo'),
]
},
}
_MIME_TO_EXTENSIONS = {
'application/pdf': {'.pdf'},
'application/msword': {'.doc'},
'application/vnd.openxmlformats-officedocument.wordprocessingml.document': {'.docx'},
'application/rtf': {'.rtf'},
'text/plain': {'.txt'},
'image/jpeg': {'.jpg', '.jpeg'},
'image/png': {'.png'},
'image/tiff': {'.tif', '.tiff'},
}
def _validate_file_type(file_obj, app_label, model_name, field_name):
allowed = FIELD_ALLOWED_TYPES.get((app_label, model_name, field_name))
if allowed is None:
return
header = file_obj.file.read(512)
file_obj.file.seek(0)
sniffed = magic.from_buffer(header, mime=True)
if sniffed not in allowed:
raise ValidationError(
f'Tipo de arquivo não permitido: {sniffed}. '
f'Permitidos: {", ".join(sorted(allowed))}'
)
ext = Path(file_obj.name).suffix.lower()
if ext and sniffed in _MIME_TO_EXTENSIONS:
if ext not in _MIME_TO_EXTENSIONS[sniffed]:
raise ValidationError(
f'Extensão {ext!r} não corresponde ao tipo detectado ({sniffed}).'
)
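# Illustrative behaviour sketch: an upload whose bytes sniff as, say,
# 'application/x-dosexec' on a field that only allows PDFs raises
# ValidationError('Tipo de arquivo não permitido: ...'), while a file named
# 'foo.png' whose bytes sniff as 'application/pdf' trips the
# extension-mismatch branch above.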
# ---------------------------------------------------------------------------
# Content hash + size
# ---------------------------------------------------------------------------
def _compute_size_and_hash(field_file) -> tuple:
h = hashlib.sha256()
size = 0
with field_file.open('rb') as fh:
@@ -81,37 +182,67 @@ def _compute_size_and_hash(field_file):
return size, h.hexdigest()
# ---------------------------------------------------------------------------
# Blob deletion
# ---------------------------------------------------------------------------
def _delete_blob_safe(storage_name: str) -> None:
try:
default_storage.delete(storage_name)
except Exception:
logger.warning('Failed to delete blob %s', storage_name, exc_info=True)
# ---------------------------------------------------------------------------
# Content-Disposition
# ---------------------------------------------------------------------------
_INLINE_EXTENSIONS = frozenset(['.pdf'])
def _content_disposition(filename: str) -> str:
ext = Path(filename).suffix.lower()
disposition = 'inline' if ext in _INLINE_EXTENSIONS else 'attachment'
safe_ascii = filename.encode('ascii', 'replace').decode()
encoded = quote(filename, safe='')
return f'{disposition}; filename="{safe_ascii}"; filename*=UTF-8\'\'{encoded}'
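# Example output (illustrative): _content_disposition('relatório.pdf') yields
#   inline; filename="relat?rio.pdf"; filename*=UTF-8''relat%C3%B3rio.pdf
# i.e. an ASCII-fallback filename plus the RFC 6266 UTF-8 form.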
# ---------------------------------------------------------------------------
# Visibility / cache helpers
# ---------------------------------------------------------------------------
_PRIVATE_FIELDS = frozenset({
('materia', 'proposicao', 'texto_original'),
('protocoloadm', 'documentoadministrativo', 'texto_integral'),
('protocoloadm', 'documentoacessorioadministrativo', 'arquivo'),
})
def _visibility(meta) -> str:
return 'private' if (meta.app_label, meta.model_name, meta.field_name) in _PRIVATE_FIELDS else 'public'
def _is_public(meta) -> bool:
return (meta.app_label, meta.model_name, meta.field_name) not in _PRIVATE_FIELDS
# ---------------------------------------------------------------------------
# MetadataFileField
# ---------------------------------------------------------------------------
class MetadataFileField(models.FileField):
"""
Drop-in replacement for models.FileField.
Uses MetadataFieldFile as its descriptor so that .url returns the semantic
alias /<app>/<model>/<pk>/<field>/download for saved instances, and falls
back to /documentos/<uuid>/ for unsaved instances.
In addition to normal FileField behaviour, this field:
1. Injects a companion ForeignKey '<fieldname>_metadata' pointing to
base.FileMetadata on the owning model class at class-definition time.
2. In pre_save, creates or updates the FileMetadata row that tracks the
stable uuid, storage_name, original_filename, size, and hash for the
uploaded file.
Four lifecycle scenarios handled in pre_save:
Case 1 first upload : create a new FileMetadata row; set the FK.
Case 2 replacement : delete the old physical file; update the existing
FileMetadata row in-place so the uuid (and therefore
/documentos/<uuid>/) never changes.
Case 3 field cleared : nullify the FK; delete the FileMetadata row;
physical file cleanup is deferred to the
clean_orphan_files management command.
Case 4 no-op re-save : nothing touched.
Injects a companion FK '<fieldname>_metadata' → FileMetadata and maintains
it across upload, replacement, and clear operations.
"""
attr_class = MetadataFieldFile
def contribute_to_class(self, cls, name):
super().contribute_to_class(cls, name)
# Inject companion FK: e.g. texto_original → texto_original_metadata_id
fk = models.ForeignKey(
'base.FileMetadata',
null=True,
@@ -121,39 +252,35 @@ class MetadataFileField(models.FileField):
verbose_name='File metadata',
)
cls.add_to_class(f'{name}_metadata', fk)
post_save.connect(self._fix_owner_pk, sender=cls, weak=False)
def _fix_owner_pk(self, sender, instance, created, **kwargs):
"""Fills owner_pk after INSERT — pk is None at pre_save time for new objects."""
if not created:
return
meta = getattr(instance, f'{self.attname}_metadata', None)
if meta is not None and meta.owner_pk is None:
meta.owner_pk = instance.pk
meta.save(update_fields=['owner_pk'])
def generate_filename(self, instance, filename):
"""
Override: substitute a UUID for the filename in the upload_to path so
that newly uploaded files get stable, unguessable storage paths like
sapl/public/normajuridica/2025/9395/<uuid>.pdf (RFC §6.3).
For replacement uploads (Case 2) the existing meta UUID is reused so
the physical path and therefore /documentos/<uuid>/ stays stable.
For first uploads (Case 1) a fresh UUID is generated and stashed on the
instance under _pending_uuid_<fieldname> for pre_save to pick up when
creating the FileMetadata row.
Substitutes a UUID for the filename in the upload_to path so newly
uploaded files get stable, unguessable storage paths (RFC §6.3).
"""
# 1. Let upload_to produce the directory + original filename.
if callable(self.upload_to):
upload_name = self.upload_to(instance, filename)
else:
upload_name = posixpath.join(self.upload_to, filename)
# 2. Determine the UUID to embed in the path.
meta_attr = f'{self.name}_metadata'
meta = getattr(instance, meta_attr, None)
if meta is not None:
# Replacement (Case 2): reuse existing UUID — path stays identical,
# OverwriteStorage replaces the bytes in-place.
file_uuid = str(meta.uuid)
else:
# First upload (Case 1): generate a fresh UUID and stash it so
# pre_save can wire it into the new FileMetadata row.
file_uuid = str(uuid4())
setattr(instance, f'_pending_uuid_{self.name}', file_uuid)
# 3. Replace the filename portion with <uuid><ext>.
ext = Path(filename).suffix.lower()
directory = posixpath.dirname(upload_name)
new_name = posixpath.join(directory, f'{file_uuid}{ext}') if directory else f'{file_uuid}{ext}'
@@ -167,78 +294,76 @@ class MetadataFileField(models.FileField):
file_before = getattr(instance, self.attname)
meta_before = getattr(instance, meta_attr, None)
# Capture intent BEFORE super() — storage.save() sets _committed=True,
# erasing the distinction between "new upload" and "already committed".
has_new_upload = bool(file_before) and not getattr(file_before, '_committed', True)
is_clearing = not file_before and meta_before is not None
# Capture browser-supplied filename before storage renames it to the UUID path.
# file_before.name is set to the original upload name by FileDescriptor.__get__
# when it wraps the UploadedFile — more reliable than file_before.file.name
# which for TemporaryUploadedFile is the NamedTemporaryFile path.
# Sanitize original filename before storage renames it.
# file_before.name holds the user-supplied name at this point.
if has_new_upload:
original_filename = Path(file_before.name).name if file_before.name else ''
original_filename = _sanitize_filename(Path(file_before.name).name) if file_before.name else 'untitled'
else:
original_filename = ''
if has_new_upload:
_validate_file_type(file_before, instance._meta.app_label,
instance._meta.model_name, self.attname)
file = super().pre_save(instance, add)
# file.name is now the UUID-based storage path from generate_filename,
# e.g. "sapl/public/normajuridica/2025/9395/<uuid>.pdf"
if is_clearing:
# Case 3: ClearableFileInput submitted with clear=True.
# Nullify FK on the in-memory instance immediately so the subsequent
# model.save() writes NULL — do NOT rely on SET_NULL cascade, which
# only fires in the DB. Physical file left for offline cleanup.
old_storage_name = meta_before.storage_name
setattr(instance, f'{meta_attr}_id', None)
meta_before.delete()
transaction.on_commit(lambda: _delete_blob_safe(old_storage_name))
elif file and has_new_upload:
storage_name = file.name
size, digest = _compute_size_and_hash(file)
if meta_before is None:
# Case 1: first upload — create a new FileMetadata row.
# Use the UUID that generate_filename already baked into the path
# so that FileMetadata.uuid matches the on-disk filename.
# Case 1: first upload.
pending_uuid = getattr(instance, f'_pending_uuid_{self.name}', None)
meta_kwargs = dict(
storage_name=storage_name,
original_filename=original_filename,
file_size_bytes=size,
content_hash=digest,
app_label=instance._meta.app_label,
model_name=instance._meta.model_name,
field_name=self.attname,
owner_pk=instance.pk or None,
)
if pending_uuid:
from uuid import UUID
meta_kwargs['uuid'] = UUID(pending_uuid)
# Clean up the temporary stash attribute.
try:
delattr(instance, f'_pending_uuid_{self.name}')
except AttributeError:
pass
with transaction.atomic():
meta = FileMetadata(**meta_kwargs)
meta.save()
setattr(instance, f'{meta_attr}_id', meta.pk)
else:
# Case 2: replacement — reuse the existing row so the stable uuid
# (and /documentos/<uuid>/) never changes.
# For UUID-based paths the old and new paths are identical
# (same uuid, same ext) — OverwriteStorage already replaced the
# bytes in-place, so we must NOT delete the newly written file.
if meta_before.storage_name != storage_name:
# Legacy (non-UUID) paths: old path differs → delete old file.
try:
default_storage.delete(meta_before.storage_name)
except OSError:
pass # already gone — proceed
meta_before.version += 1
meta_before.storage_name = storage_name
meta_before.original_filename = original_filename
meta_before.file_size_bytes = size
meta_before.content_hash = digest
meta_before.save(update_fields=[
# Case 2: replacement — reuse existing row so uuid (and /documentos/<uuid>/) never changes.
with transaction.atomic():
locked = FileMetadata.objects.select_for_update().get(pk=meta_before.pk)
old_storage_name = locked.storage_name
locked.version += 1
locked.storage_name = storage_name
locked.original_filename = original_filename
locked.file_size_bytes = size
locked.content_hash = digest
locked.app_label = instance._meta.app_label
locked.model_name = instance._meta.model_name
locked.field_name = self.attname
locked.owner_pk = instance.pk
locked.save(update_fields=[
'version', 'storage_name', 'original_filename',
'file_size_bytes', 'content_hash',
'app_label', 'model_name', 'field_name', 'owner_pk',
])
if old_storage_name != storage_name:
transaction.on_commit(lambda: _delete_blob_safe(old_storage_name))
return file
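Taken together, a minimal usage sketch of the field (hypothetical model; names and paths are illustrative, behaviour follows the docstrings above):
```python
from django.db import models
from sapl.base.fields import MetadataFileField

class Relatorio(models.Model):  # hypothetical model, for illustration only
    arquivo = MetadataFileField(upload_to='sapl/public/relatorio/')
    # contribute_to_class injects the companion FK automatically:
    #   arquivo_metadata -> base.FileMetadata (null=True)

# After instance.save(), instance.arquivo.url reverses the semantic alias
# /<app>/<model>/<pk>/<field>/download; unsaved instances with a metadata
# row fall back to /documentos/<uuid>/.
```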

sapl/base/management/commands/backfill_file_metadata.py (149)

@@ -2,14 +2,24 @@
Backfill FileMetadata rows for all existing uploaded files.
Run once after deploying MetadataFileField to production. Safe to interrupt
and re-run: processes only rows where the _metadata FK is NULL and the file
field is non-empty.
and re-run: idempotent in both phases.
Phase 1 creates missing FileMetadata rows for parent-model instances whose
_metadata FK is NULL (files uploaded before MetadataFileField was deployed).
Phase 2 fills in content_hash / file_size_bytes for FileMetadata rows that
exist but are missing those lazy fields (RFC §8 query:
WHERE content_hash = '' OR file_size_bytes IS NULL).
By default both phases run. Use --phase1 or --phase2 to run only one.
Usage:
python manage.py backfill_file_metadata
python manage.py backfill_file_metadata --batch-size=200 --rate-limit=20
python manage.py backfill_file_metadata # both phases
python manage.py backfill_file_metadata --phase1 # create missing rows only
python manage.py backfill_file_metadata --phase2 # fill size/hash only
python manage.py backfill_file_metadata --phase1 --rate-limit=20
python manage.py backfill_file_metadata --phase1 --app materia --model materialegislativa
python manage.py backfill_file_metadata --phase2 --skip-hash
python manage.py backfill_file_metadata --dry-run
python manage.py backfill_file_metadata --app materia --model materialegislativa
"""
import hashlib
import os
@@ -20,6 +30,7 @@ from django.apps import apps
from django.conf import settings
from django.core.management.base import BaseCommand
from django.db import transaction
from django.db.models import Q
from django.utils import timezone
# Every (app_label, model_name, field_name) that uses MetadataFileField.
@@ -61,7 +72,8 @@ def _compute_hash(path):
class Command(BaseCommand):
help = (
'Backfill FileMetadata rows for all existing uploaded files. '
'Idempotent — skips rows that already have a _metadata FK set.'
'Idempotent — Phase 1 creates missing rows, Phase 2 fills in '
'content_hash / file_size_bytes for incomplete rows (RFC §8).'
)
def add_arguments(self, parser):
@@ -79,16 +91,24 @@
)
parser.add_argument(
'--app', type=str, default=None,
help='Restrict to a single app_label.',
help='Restrict Phase 1 to a single app_label.',
)
parser.add_argument(
'--model', type=str, default=None,
help='Restrict to a single model_name (requires --app).',
help='Restrict Phase 1 to a single model_name (requires --app).',
)
parser.add_argument(
'--skip-hash', action='store_true',
help='Skip content_hash computation (only stat for file_size_bytes).',
)
parser.add_argument(
'--phase1', action='store_true',
help='Run Phase 1 only (create missing FileMetadata rows).',
)
parser.add_argument(
'--phase2', action='store_true',
help='Run Phase 2 only (fill in content_hash / file_size_bytes for incomplete rows).',
)
def handle(self, *args, **options):
from sapl.base.models import FileMetadata
@@ -99,10 +119,25 @@
only_app = options['app']
only_model = options['model']
skip_hash = options['skip_hash']
run_phase1 = options['phase1']
run_phase2 = options['phase2']
# if neither flag given, run both
if not run_phase1 and not run_phase2:
run_phase1 = run_phase2 = True
start_time = time.time()
if dry_run:
self.stdout.write(self.style.WARNING('DRY RUN — no changes will be written.'))
total_created = 0
total_updated = 0
total_errors = 0
# ── Phase 1: create FileMetadata rows for FK-NULL parent instances ────────
if run_phase1:
self.stdout.write('Phase 1: creating missing FileMetadata rows...')
targets = [
(app, model, field)
for (app, model, field) in METADATA_FILE_FIELDS
@@ -110,20 +145,15 @@
and (only_model is None or model == only_model)
]
total_created = 0
total_skipped = 0
total_errors = 0
for app_label, model_name, field_name in targets:
try:
Model = apps.get_model(app_label, model_name)
except LookupError:
self.stdout.write(
self.style.ERROR(f'Model {app_label}.{model_name} not found — skipping.'))
self.style.ERROR(f' Model {app_label}.{model_name} not found — skipping.'))
continue
meta_fk = f'{field_name}_metadata'
# Only process rows where file is set but metadata FK is NULL.
qs = Model.objects.filter(
**{f'{field_name}__isnull': False,
f'{meta_fk}__isnull': True}
@@ -134,20 +164,19 @@
count = qs.count()
if count == 0:
self.stdout.write(
f'{app_label}.{model_name}.{field_name}: already up to date.')
f' {app_label}.{model_name}.{field_name}: up to date.')
continue
self.stdout.write(
f'{app_label}.{model_name}.{field_name}: {count} rows to backfill...')
f' {app_label}.{model_name}.{field_name}: {count} rows to backfill...')
batch_start = time.time()
processed = 0
for instance in qs.iterator(chunk_size=batch_size):
field_file = getattr(instance, field_name)
storage_name = field_file.name # relative path stored in DB
storage_name = field_file.name
original_filename = Path(storage_name).name
full_path = os.path.join(settings.MEDIA_ROOT, storage_name)
file_size_bytes = None
@@ -178,6 +207,10 @@
file_size_bytes=file_size_bytes,
content_hash=content_hash,
backfilled_at=timezone.now(),
app_label=app_label,
model_name=model_name,
field_name=field_name,
owner_pk=instance.pk,
)
meta.save()
setattr(instance, f'{meta_fk}_id', meta.pk)
@@ -193,12 +226,84 @@
time.sleep(target_elapsed - elapsed)
self.stdout.write(
self.style.SUCCESS(
f' done: {processed} rows processed.'))
self.style.SUCCESS(f' done: {processed} rows created.'))
else:
self.stdout.write('Phase 1: skipped.')
# ── Phase 2: fill in content_hash / file_size_bytes for incomplete rows ──
# RFC §8: query WHERE content_hash = '' OR file_size_bytes IS NULL.
# --app/--model do not restrict Phase 2 (it operates on FileMetadata directly).
if run_phase2:
self.stdout.write('Phase 2: filling in missing size/hash for existing rows...')
incomplete_qs = FileMetadata.objects.filter(
Q(content_hash='') | Q(file_size_bytes__isnull=True)
)
count = incomplete_qs.count()
if count == 0:
self.stdout.write(' all rows already have size and hash.')
else:
self.stdout.write(f' {count} rows to update...')
batch_start = time.time()
processed = 0
for meta in incomplete_qs.iterator(chunk_size=batch_size):
full_path = os.path.join(settings.MEDIA_ROOT, meta.storage_name)
try:
stat = os.stat(full_path)
file_size_bytes = stat.st_size
content_hash = (
_compute_hash(full_path)
if not skip_hash and not meta.content_hash
else meta.content_hash
)
except OSError as e:
self.stdout.write(
self.style.WARNING(
f' uuid={meta.uuid}: cannot read {full_path}: {e}'))
total_errors += 1
continue
if dry_run:
self.stdout.write(
f' [dry-run] uuid={meta.uuid} → size={file_size_bytes}')
total_updated += 1
processed += 1
continue
update_fields = ['backfilled_at']
if meta.file_size_bytes is None:
meta.file_size_bytes = file_size_bytes
update_fields.append('file_size_bytes')
if not meta.content_hash and not skip_hash:
meta.content_hash = content_hash
update_fields.append('content_hash')
meta.backfilled_at = timezone.now()
meta.save(update_fields=update_fields)
total_updated += 1
processed += 1
if rate_limit > 0:
elapsed = time.time() - batch_start
target_elapsed = processed / rate_limit
if target_elapsed > elapsed:
time.sleep(target_elapsed - elapsed)
self.stdout.write(
self.style.SUCCESS(f' done: {processed} rows updated.'))
else:
self.stdout.write('Phase 2: skipped.')
elapsed = time.time() - start_time
self.stdout.write('')
self.stdout.write(
f'Backfill complete — created: {total_created}, '
f'errors (file missing): {total_errors}.')
f'Backfill complete — created: {total_created}, updated: {total_updated}, '
f'errors (file missing): {total_errors}, elapsed: {elapsed:.1f}s'
)
if dry_run:
self.stdout.write(self.style.WARNING('(dry run — nothing was written)'))

sapl/base/management/commands/backfill_file_metadata_hashes.py (111)

@@ -0,0 +1,111 @@
"""
Phase 2 backfill fills file_size_bytes and content_hash for FileMetadata rows
where backfilled_at IS NULL. I/O-bound; designed to run at low priority over
days or weeks without affecting production traffic.
Safe to interrupt and resume: sets backfilled_at on each completed row.
Usage:
python manage.py backfill_file_metadata_hashes [--batch-size=200] [--rate-limit=20]
Progress query:
SELECT
COUNT(*) FILTER (WHERE backfilled_at IS NULL) AS pending,
COUNT(*) FILTER (WHERE backfilled_at IS NOT NULL) AS done
FROM base_file_metadata;
"""
import hashlib
import os
import time
from django.conf import settings
from django.core.management.base import BaseCommand
from django.utils import timezone
def _compute_hash(path):
h = hashlib.sha256()
with open(path, 'rb') as fh:
for chunk in iter(lambda: fh.read(65536), b''):
h.update(chunk)
return h.hexdigest()
class Command(BaseCommand):
help = (
'Phase 2 backfill: fill file_size_bytes and content_hash for FileMetadata rows '
'where backfilled_at IS NULL. Resumable and rate-limited.'
)
def add_arguments(self, parser):
parser.add_argument(
'--batch-size', type=int, default=200,
help='Rows per iteration (default: 200).',
)
parser.add_argument(
'--rate-limit', type=int, default=0,
help='Max rows per second (0 = unlimited).',
)
parser.add_argument(
'--skip-hash', action='store_true',
help='Only populate file_size_bytes, skip SHA-256 (faster).',
)
parser.add_argument(
'--dry-run', action='store_true',
help='Report counts without writing.',
)
def handle(self, *args, **options):
from sapl.base.models import FileMetadata
batch_size = options['batch_size']
rate_limit = options['rate_limit']
skip_hash = options['skip_hash']
dry_run = options['dry_run']
if dry_run:
self.stdout.write(self.style.WARNING('DRY RUN — nothing will be written.'))
qs = FileMetadata.objects.filter(backfilled_at__isnull=True)
total = qs.count()
self.stdout.write(f'Phase 2: {total} rows to process...')
processed = 0
errors = 0
batch_start = time.time()
for meta in qs.iterator(chunk_size=batch_size):
full_path = os.path.join(settings.MEDIA_ROOT, meta.storage_name)
try:
stat = os.stat(full_path)
file_size_bytes = stat.st_size
content_hash = _compute_hash(full_path) if not skip_hash else ''
except OSError as e:
self.stdout.write(self.style.WARNING(
f' uuid={meta.uuid}: cannot read {full_path}: {e}'))
errors += 1
continue
if not dry_run:
update_fields = ['backfilled_at']
meta.backfilled_at = timezone.now()
if meta.file_size_bytes is None:
meta.file_size_bytes = file_size_bytes
update_fields.append('file_size_bytes')
if not meta.content_hash and not skip_hash:
meta.content_hash = content_hash
update_fields.append('content_hash')
meta.save(update_fields=update_fields)
processed += 1
if rate_limit > 0:
elapsed = time.time() - batch_start
target = processed / rate_limit
if target > elapsed:
time.sleep(target - elapsed)
self.stdout.write(self.style.SUCCESS(
f'Phase 2 complete — processed: {processed}, errors: {errors}'))
if dry_run:
self.stdout.write(self.style.WARNING('(dry run — nothing was written)'))

sapl/base/management/commands/backfill_file_metadata_structural.py (153)

@@ -0,0 +1,153 @@
"""
Phase 1 backfill creates FileMetadata rows for parent-model instances whose
_metadata FK is NULL. Pure DB work, no disk I/O. Completes in seconds.
Safe to re-run (idempotent). Safe to run from multiple workers simultaneously
(bulk_create with ignore_conflicts=True handles races).
Usage:
python manage.py backfill_file_metadata_structural [--batch-size=1000]
"""
from pathlib import Path
from django.apps import apps
from django.core.management.base import BaseCommand
from django.db import transaction
METADATA_FILE_FIELDS = [
('materia', 'materialegislativa', 'texto_original'),
('materia', 'documentoacessorio', 'arquivo'),
('materia', 'proposicao', 'texto_original'),
('protocoloadm', 'documentoadministrativo', 'texto_integral'),
('protocoloadm', 'documentoacessorioadministrativo', 'arquivo'),
('norma', 'normajuridica', 'texto_integral'),
('norma', 'anexonormajuridica', 'anexo_arquivo'),
('comissoes', 'reuniao', 'upload_pauta'),
('comissoes', 'reuniao', 'upload_ata'),
('comissoes', 'reuniao', 'upload_anexo'),
('comissoes', 'documentoacessorio', 'arquivo'),
('audiencia', 'audienciapublica', 'upload_pauta'),
('audiencia', 'audienciapublica', 'upload_ata'),
('audiencia', 'audienciapublica', 'upload_anexo'),
('audiencia', 'anexoaudienciapublica', 'arquivo'),
('sessao', 'sessaoplenaria', 'upload_pauta'),
('sessao', 'sessaoplenaria', 'upload_ata'),
('sessao', 'sessaoplenaria', 'upload_anexo'),
('sessao', 'justificativaausencia', 'upload_anexo'),
('sessao', 'orador', 'upload_anexo'),
('sessao', 'oradorexpediente', 'upload_anexo'),
('sessao', 'oradorordemdia', 'upload_anexo'),
]
class Command(BaseCommand):
help = (
'Phase 1 backfill: create FileMetadata rows for existing files (no disk I/O). '
'Idempotent — skips instances where _metadata FK is already set.'
)
def add_arguments(self, parser):
parser.add_argument(
'--batch-size', type=int, default=1000,
help='Rows per bulk_create batch (default: 1000).',
)
parser.add_argument(
'--dry-run', action='store_true',
help='Report counts without writing.',
)
def handle(self, *args, **options):
from sapl.base.models import FileMetadata
batch_size = options['batch_size']
dry_run = options['dry_run']
if dry_run:
self.stdout.write(self.style.WARNING('DRY RUN — nothing will be written.'))
total_created = 0
for app_label, model_name, field_name in METADATA_FILE_FIELDS:
try:
Model = apps.get_model(app_label, model_name)
except LookupError:
self.stdout.write(self.style.ERROR(
f' {app_label}.{model_name} not found — skipping.'))
continue
meta_fk = f'{field_name}_metadata'
qs = (
Model.objects
.filter(**{f'{field_name}__isnull': False, f'{meta_fk}__isnull': True})
.exclude(**{field_name: ''})
.only('pk', field_name, meta_fk)
)
count = qs.count()
if count == 0:
self.stdout.write(f' {app_label}.{model_name}.{field_name}: up to date.')
continue
self.stdout.write(f' {app_label}.{model_name}.{field_name}: {count} rows...')
if dry_run:
total_created += count
continue
# Process in batches: bulk_create rows, then bulk_update the FK back.
batch = []
instances = []
for instance in qs.iterator(chunk_size=batch_size):
field_file = getattr(instance, field_name)
storage_name = field_file.name
batch.append(FileMetadata(
storage_name=storage_name,
original_filename=Path(storage_name).name,
app_label=app_label,
model_name=model_name,
field_name=field_name,
owner_pk=instance.pk,
))
instances.append(instance)
if len(batch) >= batch_size:
total_created += self._flush(
batch, instances, meta_fk, field_name, Model)
batch = []
instances = []
if batch:
total_created += self._flush(batch, instances, meta_fk, field_name, Model)
self.stdout.write(self.style.SUCCESS(' done.'))
self.stdout.write(f'Structural backfill complete — created: {total_created}')
if dry_run:
self.stdout.write(self.style.WARNING('(dry run — nothing was written)'))
def _flush(self, batch, instances, meta_fk, field_name, Model):
from sapl.base.models import FileMetadata
with transaction.atomic():
created = FileMetadata.objects.bulk_create(batch, ignore_conflicts=True)
# Re-query to get PKs for the rows we just inserted (bulk_create may not
# return PKs on all DB backends, and ignore_conflicts rows have pk=None).
storage_names = [m.storage_name for m in batch]
meta_map = {
m.storage_name: m.pk
for m in FileMetadata.objects.filter(storage_name__in=storage_names)
}
update_instances = []
for instance in instances:
field_file = getattr(instance, field_name)
pk = meta_map.get(field_file.name)
if pk:
setattr(instance, f'{meta_fk}_id', pk)
update_instances.append(instance)
if update_instances:
with transaction.atomic():
Model.objects.bulk_update(update_instances, [f'{meta_fk}_id'])
return len(created)

sapl/base/migrations/0062_filemetadata_owner_fields.py (43)

@@ -0,0 +1,43 @@
from django.db import migrations, models
class Migration(migrations.Migration):
dependencies = [
('base', '0061_file_metadata'),
]
operations = [
migrations.AddField(
model_name='filemetadata',
name='app_label',
field=models.CharField(blank=True, default='', max_length=100),
),
migrations.AddField(
model_name='filemetadata',
name='model_name',
field=models.CharField(blank=True, default='', max_length=100),
),
migrations.AddField(
model_name='filemetadata',
name='field_name',
field=models.CharField(blank=True, default='', max_length=100),
),
migrations.AddField(
model_name='filemetadata',
name='owner_pk',
field=models.PositiveIntegerField(blank=True, null=True),
),
migrations.AlterField(
model_name='filemetadata',
name='storage_name',
field=models.CharField(editable=False, max_length=512, verbose_name='Storage name'),
),
migrations.AddIndex(
model_name='filemetadata',
index=models.Index(
fields=['app_label', 'model_name', 'field_name'],
name='filemetadata_owner_context_idx',
),
),
]

sapl/base/models.py (11)

@@ -511,6 +511,7 @@ class FileMetadata(models.Model):
)
storage_name = models.CharField(
max_length=512,
editable=False,
verbose_name=_('Storage name'),
)
original_filename = models.CharField(
@@ -537,11 +538,21 @@
blank=True,
verbose_name=_('Backfilled at'),
)
app_label = models.CharField(max_length=100, default='', blank=True)
model_name = models.CharField(max_length=100, default='', blank=True)
field_name = models.CharField(max_length=100, default='', blank=True)
owner_pk = models.PositiveIntegerField(null=True, blank=True)
class Meta:
db_table = 'base_file_metadata'
verbose_name = _('File Metadata')
verbose_name_plural = _('File Metadata')
indexes = [
models.Index(
fields=['app_label', 'model_name', 'field_name'],
name='filemetadata_owner_context_idx',
),
]
def __str__(self):
return f'{self.original_filename} (v{self.version})'
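With the owner fields above, resolving a metadata row back to its owning object is a plain query. A sketch mirroring what can_download_file in views.py does:
```python
from django.apps import apps
from sapl.base.models import FileMetadata

meta = FileMetadata.objects.filter(
    app_label='norma', model_name='normajuridica', field_name='texto_integral',
).first()  # lookup served by filemetadata_owner_context_idx
if meta and meta.owner_pk is not None:
    owner_model = apps.get_model(meta.app_label, meta.model_name)
    owner = owner_model.objects.filter(pk=meta.owner_pk).first()
    print(meta.uuid, meta.original_filename, owner)
```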

sapl/base/views.py (119)

@@ -1575,11 +1575,26 @@ def pesquisa_textual(request):
# File-serving views (RFC §6.4, §9)
# ---------------------------------------------------------------------------
import logging as _logging # noqa: E402
import os as _os # noqa: E402
from urllib.parse import quote # noqa: E402 — kept near usage site
from django.http import FileResponse, HttpResponse, HttpResponseForbidden # noqa: E402
from sapl.base.fields import ( # noqa: E402
_PRIVATE_FIELDS, _content_disposition, _is_public,
)
_serve_logger = _logging.getLogger(__name__)
try:
from prometheus_client import Counter as _Counter
_metadata_fallback_counter = _Counter(
'sapl_metadata_fallback_total',
'Requests served via legacy pre-metadata fallback',
['app', 'model', 'field'],
)
except Exception:
_metadata_fallback_counter = None
SERVE_FILE_FIELDS = frozenset({
('materia', 'materialegislativa', 'texto_original'),
@@ -1604,31 +1619,63 @@
})
def serve_file(request, file_uuid):
def can_download_file(user, meta, owner_object=None) -> bool:
"""Single authorization gate for all file downloads (RFC §6.4.1)."""
key = (meta.app_label, meta.model_name, meta.field_name)
if key not in _PRIVATE_FIELDS:
return True
# Lazy-load owner_object from DB when not provided.
if owner_object is None and meta.owner_pk is not None:
try:
model = apps.get_model(meta.app_label, meta.model_name)
owner_object = model.objects.filter(pk=meta.owner_pk).first()
except LookupError:
pass
if key == ('materia', 'proposicao', 'texto_original'):
if user.is_staff:
return True
if owner_object and hasattr(owner_object, 'autor'):
return owner_object.autor.filter(user_profile__user=user).exists()
return False
if meta.app_label == 'protocoloadm' and meta.model_name == 'documentoadministrativo':
if user.is_staff:
return True
return owner_object is not None and not owner_object.restrito
if key == ('protocoloadm', 'documentoacessorioadministrativo', 'arquivo'):
if user.is_staff:
return True
parent = getattr(owner_object, 'documento', None) if owner_object else None
return parent is not None and not parent.restrito
return False
def serve_file(request, file_uuid, owner_object=None):
"""
Secure file-serving view: the single chokepoint for all document downloads.
Resolves uuid → FileMetadata → storage_name, performs a permission check
(currently public files pass unconditionally), then delegates the actual
byte transfer to nginx via X-Accel-Redirect. Django never reads file bytes
into Python memory; nginx's sendfile delivers them zero-copy.
The /media/ location in nginx must be marked 'internal' so that clients
cannot bypass this view and fetch files directly.
Resolves uuid → FileMetadata → storage_name, enforces access control, then
delegates byte transfer to nginx via X-Accel-Redirect. Django never reads
file bytes into memory.
"""
from sapl.base.models import FileMetadata
from django.shortcuts import get_object_or_404 as _get_or_404
meta = _get_or_404(FileMetadata, uuid=file_uuid)
# Permission check — currently unconditional for public files.
# When DocumentoAdministrativo.restrito / nivel_restricao is wired,
# insert the per-file check here (RFC §6.4).
if not can_download_file(request.user, meta, owner_object):
if not request.user.is_authenticated:
raise Http404
return HttpResponseForbidden()
display_name = meta.original_filename or Path(meta.storage_name).name
if settings.DEBUG:
# runserver has no nginx: serve the bytes directly from the filesystem.
file_path = _os.path.join(settings.MEDIA_ROOT, meta.storage_name)
try:
fh = open(file_path, 'rb')
@@ -1636,30 +1683,16 @@ def serve_file(request, file_uuid):
raise Http404
return FileResponse(fh, as_attachment=False, filename=display_name)
# Production: delegate byte transfer to nginx via X-Accel-Redirect.
# storage_name is relative to MEDIA_ROOT (e.g. "sapl/public/norma/…/file.pdf").
internal_path = f'/media/{meta.storage_name}'
response = HttpResponse()
response['X-Accel-Redirect'] = internal_path
# RFC 6266 — dual filename parameter: ASCII fallback + UTF-8 encoded.
filename_ascii = display_name.encode('ascii', 'replace').decode()
filename_encoded = quote(display_name, safe='')
response['Content-Disposition'] = (
f'inline; filename="{filename_ascii}"'
f"; filename*=UTF-8''{filename_encoded}"
)
response['X-Accel-Redirect'] = f'/media/{meta.storage_name}'
response['Content-Disposition'] = _content_disposition(display_name)
response['Cache-Control'] = 'public, max-age=300' if _is_public(meta) else 'no-store'
return response
def serve_model_file(request, app_label, model_name, pk, field_name):
"""
Semantic alias for file downloads: /<app>/<model>/<pk>/<field>/download
Validates the (app_label, model_name, field_name) triple against an explicit
allowlist, fetches the parent model instance, resolves the _metadata FK, then
delegates to serve_file. All permission logic lives in serve_file only.
"""
from django.shortcuts import get_object_or_404 as _get_or_404
@@ -1673,10 +1706,32 @@ def serve_model_file(request, app_label, model_name, pk, field_name):
instance = _get_or_404(model, pk=pk)
meta = getattr(instance, f'{field_name}_metadata', None)
if meta is None:
# Temporary fallback while backfill is in progress.
field_file = getattr(instance, field_name)
if not field_file:
raise Http404
_serve_logger.warning(
'serve_model_file fallback: metadata NULL for %s/%s/%s pk=%s',
app_label, model_name, field_name, pk,
)
if _metadata_fallback_counter is not None:
_metadata_fallback_counter.labels(
app=app_label, model=model_name, field=field_name,
).inc()
if settings.DEBUG:
file_path = _os.path.join(settings.MEDIA_ROOT, field_file.name)
try:
fh = open(file_path, 'rb')
except OSError:
raise Http404
return FileResponse(fh)
response = HttpResponse()
response['X-Accel-Redirect'] = f'/media/{field_file.name}'
return response
return serve_file(request, file_uuid=meta.uuid)
return serve_file(request, file_uuid=meta.uuid, owner_object=instance)
# Image fields served via X-Accel-Redirect — same nginx internal mechanism as
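A quick smoke test of the alias route could look like this (sketch; assumes the URL pattern from the docstring is mounted at the site root, and that pk 9395 exists locally):
```python
from django.test import Client

client = Client()
# Semantic alias validated against SERVE_FILE_FIELDS by serve_model_file.
resp = client.get('/norma/normajuridica/9395/texto_integral/download')
print(resp.status_code, resp.get('Content-Disposition'))
```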
