---
date: 2026-03-23
topic: "motion content enrichment - next steps"
status: draft
---
## Problem Statement
We successfully ingested SyncFeed motion content, fetched body texts, re-embedded motions, ran fusion (SVD-based), and rebuilt the similarity cache. The pipeline ran end-to-end but showed intermittent failures (embedding-provider batch errors, connection-pool warnings), left a small number of motions without body_text, and produced a few potentially spurious similarity hits.
**Goal:** Stabilize and harden the motion content enrichment + embedding/fusion/similarity pipeline so it runs reliably, is testable, and produces high-quality similarity results for production use.
## Constraints
- **Do not modify** `app.py` or `scheduler.py`.
- Use **DuckDB only** (`data/motions.db`) and open/close connections per method; avoid long-lived global connections (see the sketch after this list).
- No `print()` calls in library modules; use `logging.getLogger(__name__)`.
- Tests must continue to run under the existing pytest setup and monkeypatching in CI.
- Avoid YAGNI features: only add monitoring/metrics that are actionable and low-effort.
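The per-method connection constraint in practice; a minimal sketch, assuming a hypothetical `fetch_missing_body_ids` helper (the query and column names are illustrative, not existing code):

```python
import duckdb

DB_PATH = "data/motions.db"

def fetch_missing_body_ids(limit: int = 50) -> list[str]:
    """Illustrative helper: open a connection per method, close it on exit."""
    con = duckdb.connect(DB_PATH)
    try:
        rows = con.execute(
            "SELECT externe_identifier FROM motions "
            "WHERE body_text IS NULL AND externe_identifier IS NOT NULL "
            "LIMIT ?",
            [limit],
        ).fetchall()
        return [r[0] for r in rows]
    finally:
        con.close()
```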
## Approach (chosen)
I'm going with an **incremental hardening** approach: small, high-impact fixes and QA steps first (low effort, immediate benefit), followed by a short round of robustness improvements (retries, backoff, audit events) and targeted tests. This minimizes risk and gives quick confidence that the bulk import can be re-run safely.
Alternatives considered:
- Full rewrite of the SyncFeed walker as a resilient state machine (higher effort; unnecessary today).
- Heavy-duty observability (Prometheus + Grafana) from day one (high overhead; targeted metrics and logs come first).
I chose incremental hardening because it fixes the concrete failures we saw (provider batch errors, connection pool warnings, one 404 body) quickly and keeps the codebase small and testable.
## Architecture
High-level components:
- **SyncFeed sync script** (scripts/sync_motion_content.py): walk feeds, build title/ext-id maps, fetch body text, update DB.
- **Text embedding pipeline** (pipeline/text_pipeline.py, scripts/rerun_embeddings.py): convert selected text into embeddings, with provider retry logic.
- **Fusion/SVD pipeline** (pipeline/fusion.py, pipeline/svd_pipeline.py): fuse embeddings per-window and produce fused vectors.
- **Similarity compute & lookup** (similarity/compute.py, similarity/lookup.py): compute pairwise similarities and populate cache.
- **DB layer** (database.py, migrations): motions table (body_text, externe_identifier), fused_embeddings, svd_vectors, similarity_cache, and audit_events.
- **Audit & continuity** (thoughts/ledgers/*, audit_events table): record run summaries and per-window results.
Responsibilities are unchanged; we add a small **ai_provider wrapper** and an **operations script** for QA and rerun orchestration.
## Components & Responsibilities
- **sync_motion_content.py**: keep the core logic as-is; add more granular logging and a CLI flag to limit runs to a subset of motions (for QA). Remains responsible for idempotent updates.
- **_fetch_body_text / fetch_body_texts**: reduce max_workers and/or retry transient HTTP errors; mount a requests HTTPAdapter on the session to cap the connection-pool size.
- **text_pipeline.ai_provider**: add a small retry/backoff wrapper that retries failed batches with exponential backoff and falls back to a smaller batch_size (see the sketch after this list).
- **scripts/rerun_embeddings.py**: expose a `--retry-missing` mode that detects missing embeddings and retries them with smaller batches.
- **similarity.compute**: keep the padding logic; add a filter to avoid trivial 1.0 matches for extremely short titles (the query/UI layer should also filter, but a DB-side filter is a safety net).
- **migrations**: add an audit_events table, or otherwise flag motions whose fetch/embedding failed, for manual review.
- **tests**: add deterministic tests for retry behavior and for the QA-sample similarity checks.
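A sketch of the ai_provider retry/backoff wrapper mentioned above. `embed_batch` stands in for whatever callable the provider exposes; the function name and defaults are assumptions, not existing code:

```python
import logging
import time

logger = logging.getLogger(__name__)

def embed_with_retry(embed_batch, texts, max_attempts=3, base_delay=1.0, min_batch=8):
    """Retry a failed batch with exponential backoff; on persistent failure,
    split the batch and retry the halves instead of failing the whole run."""
    for attempt in range(max_attempts):
        try:
            return embed_batch(texts)
        except Exception:
            logger.warning("batch of %d failed (attempt %d/%d)",
                           len(texts), attempt + 1, max_attempts)
            time.sleep(base_delay * 2 ** attempt)
    if len(texts) <= min_batch:
        # A production version would record these ids in audit_events
        # instead of raising, so one bad batch does not block the run.
        raise RuntimeError(f"embedding failed for batch of {len(texts)}")
    mid = len(texts) // 2
    return (embed_with_retry(embed_batch, texts[:mid], max_attempts, base_delay, min_batch)
            + embed_with_retry(embed_batch, texts[mid:], max_attempts, base_delay, min_batch))
```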
## Data Flow
1. Walk SyncFeed (Besluit, Zaak, Document, DocumentVersie) → parse elements.
2. Build **title_map** and **ext_id_map** in-memory.
3. Fetch body_texts in parallel (ThreadPoolExecutor) → map ext_id → body_text.
4. Update the motions table with title, externe_identifier, body_text.
5. Run text embeddings for motions (COALESCE priority: layman_explanation → body_text → description → title; see the query sketch after this list).
6. Fuse embeddings per-window (svd_vectors) → produce fused_embeddings.
7. Compute similarity cache per-window and insert rows.
8. QA checks and audit logs produced for runs.
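For step 5, the selection priority maps onto a single COALESCE. A sketch of the query (column names are taken from the priority list above and assumed to match the schema):

```python
# Sketch: pick the richest available text per motion for embedding.
SELECT_EMBED_TEXT = """
    SELECT id,
           COALESCE(layman_explanation, body_text, description, title) AS embed_text
    FROM motions
    WHERE COALESCE(layman_explanation, body_text, description, title) IS NOT NULL
"""
```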
## Error Handling Strategy
- **HTTP / body fetches:** add per-ext_id retries (3 attempts) with short exponential backoff; capture failures in the audit_events table for manual follow-up.
- **Connection pool warnings:** reduce ThreadPoolExecutor concurrency (configurable flag) and mount a requests HTTPAdapter with a limited pool size to avoid 'Connection pool is full' warnings (see the session sketch after this list).
- **Embedding provider failures:** implement a wrapper which:
- retries batches up to N times with exponential backoff,
- on persistent failure, retry missing items with a smaller batch_size,
- mark failed motion ids in an audit table rather than blocking the entire run.
- **Similarity anomalies (1.0 scores):** filter out identity matches and very-short-text matches when building similarity cache; record these in diagnostics output.
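A sketch of the bounded-pool session for body fetches. Sizing the pool to match the executor's max_workers is what makes the warning disappear; names and defaults here are assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(pool_size: int = 10, retries: int = 3) -> requests.Session:
    """Session with a bounded connection pool and transient-error retries.

    Keep pool_size >= the ThreadPoolExecutor's max_workers so urllib3
    never has to discard connections ('Connection pool is full')."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,  # sleeps ~0.5s, 1s, 2s between attempts
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET"}),
    )
    adapter = HTTPAdapter(pool_connections=pool_size,
                          pool_maxsize=pool_size,
                          max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```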
## Testing Strategy
- Extend the existing unit tests for the parser functions to cover edge cases seen in real SyncFeed XML.
- Add a unit test for the ai_provider retry wrapper that simulates provider failures and verifies the fallback to smaller batches (sketched after this list).
- Add an integration QA script (scripts/qa_similarity.py) that:
- samples N motions across windows,
- runs lookup.similarity and asserts results are within expected ranges (e.g., top-5 not all 1.0 unless identical text),
- outputs a short summary JSON saved to thoughts/ledgers/ for each run.
- CI: run the new provider-retry test and the QA script with a small dataset (mocked provider) to ensure no regressions.
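A deterministic shape for the retry-wrapper test, fitting the existing pytest/monkeypatch setup (the import path and wrapper signature follow the sketch above and are assumptions):

```python
# from pipeline.text_pipeline import embed_with_retry  # assumed location

def test_retry_falls_back_to_smaller_batches(monkeypatch):
    calls = []

    def flaky_embed(texts):
        calls.append(len(texts))
        if len(texts) > 2:  # the full batch always fails
            raise RuntimeError("provider error")
        return [[0.0, 0.0, 0.0] for _ in texts]  # small batches succeed

    monkeypatch.setattr("time.sleep", lambda _: None)  # skip real backoff in CI

    vectors = embed_with_retry(flaky_embed, ["a", "b", "c", "d"], min_batch=2)

    assert len(vectors) == 4
    assert 4 in calls and 2 in calls  # full batch tried first, then halves
```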
## Actionable Next Steps (prioritized)
1. Quick QA (1 day) — sample 50 motions and inspect similarity quality.
- Implement scripts/qa_similarity.py (sample + assert heuristics).
- Run locally and record summary in thoughts/ledgers.
2. Small robustness fixes (1–2 days) — low-risk changes with big wins.
- Add ai_provider retry/backoff wrapper and unit tests.
- Add `--max-body-workers` CLI flag and drop default to 10; add per-request retries.
- Add `--retry-missing` mode to rerun_embeddings to retry failed batches with smaller sizes.
3. Observability & audit (1 day) — make failures visible and actionable.
- Add audit_events table rows when body_text fetch or embedding fails.
- Write an end-of-run JSON summary (already done) and attach per-window stats to ledger.
4. Safety filters & dedupe (0.5 day)
- Add a small DB-side filter to skip trivial identical-title matches in similarity cache.
- Audit SVD windows for duplication and dedupe if needed.
5. Run full re-run (off-peak) and validate (1 day)
- Re-run embeddings, fusion and similarity; run QA script and review ledgers.
Estimated total: 3–5 days of focused work.
## Open Questions
- Do we want to persist per-item failure flags in DuckDB (audit_events) or just in ledgers? I recommend adding an **audit_events** table to speed triage.
- What SLA / acceptance criteria should we use for similarity quality? E.g., maximum allowed fraction of top-1 exact-title matches for non-identical motions.
- Are we comfortable reducing body fetch concurrency by default, or should we attempt a more adaptive concurrency strategy?