---
date: 2026-03-23
topic: "Motion Content Enrichment via SyncFeed"
status: validated
---

# Motion Content Enrichment via SyncFeed

## Problem Statement

All 25,521 motions in the DB have NULL `body_text` and NULL `layman_explanation`. Their `title`/`description` are outcome strings ("Aangenomen.", "Verworpen.") because the bulk downloader used `skip_details=True`. The text embedding pipeline uses `COALESCE(layman_explanation, description, title)`, so every embedding is effectively an embedding of "Aangenomen." — zero semantic signal.

Goal: populate real motion titles (Zaak.Onderwerp) and motion body text (officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete dataset.

## Constraints

- Do NOT modify `app.py` or `scheduler.py`
- DuckDB only; open/close per method
- Use Python logging, no `print()` in library code
- `motions.id` is an INTEGER autoincrement primary key; `motions.url` contains `https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid}` — the UUID is the Besluit.Id in the Tweede Kamer data model
- The `database.py` CREATE TABLE for motions is missing the `body_text` and `externe_identifier` columns even though INSERT statements reference them — the schema must be fixed

## Approach

Use the **SyncFeed API** (`https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed`) to bulk-walk 4 entity types and build a complete local join index. This replaces the per-motion OData chain (3 API calls × 25,521 motions = 76,000+ calls) with roughly 2,000–4,000 paginated feed pages across all entity types.

Alternatives considered:

- **OData per-motion** (`_get_motion_details`): 76k+ calls, estimated 10+ hours. Rejected.
- **OData bulk $expand**: works for titles (~100 pages), but getting ExterneIdentifier still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of SyncFeed, which handles everything in one pass.
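The feed walk described above can be sketched as follows. This is a minimal illustration, assuming the feed is Atom XML and paginates via a `rel="next"` link carrying the skiptoken; the function names are illustrative, not final:

```python
"""Sketch of the SyncFeed walker (pagination scheme is an assumption)."""
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_URL = "https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed"


def next_page_url(root):
    """Return the href of the Atom rel="next" link (it carries the
    skiptoken), or None when the last page has been reached."""
    for link in root.iter(f"{ATOM}link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def fetch_with_backoff(url, max_retries=5):
    """GET with exponential backoff on 429 and transient 5xx errors."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code in (429, 500, 502, 503) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise


def walk_feed(category):
    """Yield raw Atom <entry> elements for one entity type, following
    the next-page link until the feed is exhausted."""
    url = f"{FEED_URL}?category={category}"
    while url:
        root = ET.fromstring(fetch_with_backoff(url))
        yield from root.iter(f"{ATOM}entry")
        url = next_page_url(root)
```

The per-entity parsers then consume the raw `<entry>` elements; keeping pagination, backoff, and parsing separate makes each piece unit-testable with fixture XML.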
## Architecture

```
SyncFeed walk (4 feeds)
  ├─ category=Besluit        → {besluit_id: [zaak_ids]}
  ├─ category=Zaak           → {zaak_id: {onderwerp, soort}}
  ├─ category=Document       → {document_id: [zaak_ids]}
  └─ category=DocumentVersie → {document_id: externe_identifier}
        ↓
In-memory join:
  besluit_id → zaak_id → onderwerp (title)
  besluit_id → zaak_id → document_id → ext_id (ExterneIdentifier)
        ↓
DB update pass:
  UPDATE motions SET title=?, externe_identifier=? WHERE url LIKE ?
        ↓
Parallel HTML fetch (thread pool, 20 workers):
  GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text
  → UPDATE motions.body_text
        ↓
Pipeline re-run:
  clear embeddings → text pipeline → fusion (all windows)
  → similarity cache (all windows)
```

## Components

### `scripts/sync_motion_content.py` (new)

Orchestrates the full enrichment:

1. **SyncFeed walker** — generic paginated Atom/XML reader that follows the feed's next-page link (skiptoken) until the feed is exhausted, yielding parsed entity dicts per page. Respects 429/rate limits via exponential backoff.
2. **Entity parsers** — one per entity type:
   - `parse_besluit(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_zaak(xml)` → `{id, onderwerp, soort, verwijderd}`
   - `parse_document(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_documentversie(xml)` → `{id, document_id, externe_identifier, extensie, verwijderd}`
3. **Join builder** — after all 4 feeds are walked:
   - `build_title_map(besluit_index, zaak_index)` → `{besluit_id: onderwerp}`
   - `build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index)` → `{besluit_id: externe_identifier}`
   - For motions with multiple Zaak, prefer Soort="Motie"; fall back to the first.
4. **DB updater** — open DuckDB, bulk UPDATE motions using the join maps. Extract `besluit_id` from the `url` column via string split.
5. **Body text fetcher** — thread pool (20 workers), fetch HTML from `zoek.officielebekendmakingen.nl/{ext_id}.html`, strip HTML tags with regex (reusing the existing `_fetch_body_text` logic), UPDATE `motions.body_text`.
6.
**Progress reporting** — log counts: motions updated with a title, motions with an ExterneIdentifier found, body texts fetched, failures.

### `database.py` schema fix

Add the missing columns to the `CREATE TABLE motions` DDL:

- `body_text TEXT`
- `externe_identifier TEXT`

Also add `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` guard calls in `_init_database()` for existing DBs that don't have these columns yet.

### `pipeline/text_pipeline.py` change

Update the `_select_text` SQL:

```
COALESCE(m.layman_explanation, m.body_text, m.description, m.title)
```

(adds `m.body_text` as the second-priority fallback)

### `scripts/rerun_embeddings.py` (new, or inline in the sync script)

After enrichment:

1. `DELETE FROM embeddings` — wipe all stale embeddings (they are all "Aangenomen.").
2. Run `pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)`.
3. Run `pipeline.fusion.fuse_for_window(window_id, db_path)` for all 20 windows.
4. Run `similarity.compute.compute_similarities(vector_type='fused', window_id=w)` for all 20 windows.

## Data Flow

```
motions.url
  → extract besluit_uuid (split on '/')
  → look up in title_map  → UPDATE motions.title, motions.description
  → look up in ext_id_map → UPDATE motions.externe_identifier
  → fetch HTML            → UPDATE motions.body_text

text_pipeline._select_text
  → COALESCE(layman_explanation, body_text, description, title)
  → now returns real motion text for ~60–80% of motions
  → outcome-string fallback for the rest

fused_embeddings → [svd_vector || text_vector] (text now has semantic content)
similarity_cache → re-computed for all 20 windows with meaningful vectors
```

## Error Handling Strategy

- **SyncFeed**: exponential backoff on 429/5xx; log and skip individual malformed entries; checkpoint the skiptoken to disk so the walk can resume after a crash.
- **Body text fetch**: catch all per-URL exceptions, log, continue; motions without body text fall back to Zaak.Onderwerp in the COALESCE.
- **DB update**: use DuckDB transactions per batch of 1,000; rollback on failure.
- **Missing Zaak/Document**:
expected for procedural votes; log counts. These motions keep title = NULL, so the COALESCE falls back to "Aangenomen." as before.

## Testing Strategy

- Unit tests for each XML parser using hardcoded fixture XML strings.
- Unit test for `build_title_map` with a small synthetic index.
- Integration test: walk 1 page of the Besluit SyncFeed live, assert > 0 entries returned.
- After the full run: query `SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.', 'Verworpen.', 'Gestaakt.')` — expect > 10,000.
- After embeddings: spot-check that cosine similarity between two related motions (same topic) is higher than between unrelated motions.

## Open Questions

- **Document–Zaak relationship**: a SyncFeed Document entity may reference multiple Zaak IDs. For motions with multiple linked documents, we prefer the one whose Zaak has Soort="Motie". Edge cases may need manual inspection.
- **SyncFeed total record count**: unknown until walked. Estimate 2,000–6,000 pages total across the 4 feeds; could be more for Document/DocumentVersie.
- **Rate limits**: the SyncFeed documentation doesn't specify limits. Start at 1 req/s and increase if no 429s appear.
- **Body text coverage**: not every motion has an associated kamerstuk document. Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect 40–60% body text coverage.
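To make the join and URL-extraction steps concrete, here is a minimal sketch of `build_title_map` and the besluit-UUID split. Function names follow the Components section; the index shapes are assumptions consistent with the parser outputs listed there, and the Soort="Motie" preference matches the rule stated above:

```python
def besluit_id_from_url(url):
    """Extract the Besluit UUID from motions.url (its last path segment)."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def build_title_map(besluit_index, zaak_index):
    """Map besluit_id -> Zaak.Onderwerp (the real motion title).

    Assumed shapes:
      besluit_index: {besluit_id: {"zaak_refs": [zaak_id, ...]}}
      zaak_index:    {zaak_id: {"onderwerp": str, "soort": str}}

    Prefers the linked Zaak with Soort == "Motie"; falls back to the
    first linked Zaak present in the index. Besluiten with no resolvable
    Zaak (procedural votes) are skipped, so their title stays NULL.
    """
    title_map = {}
    for besluit_id, besluit in besluit_index.items():
        zaken = [zaak_index[z] for z in besluit.get("zaak_refs", [])
                 if z in zaak_index]
        if not zaken:
            continue  # no linked Zaak: COALESCE keeps the old fallback
        motie = next((z for z in zaken if z.get("soort") == "Motie"), zaken[0])
        title_map[besluit_id] = motie["onderwerp"]
    return title_map
```

Because both helpers are pure functions over in-memory dicts, the "small synthetic index" unit test called for in the Testing Strategy needs no network or database access.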