date: 2026-03-23
topic: Motion Content Enrichment via SyncFeed
status: validated

Motion Content Enrichment via SyncFeed

Problem Statement

All 25,521 motions in the DB have NULL body_text and NULL layman_explanation. Their title/description are outcome strings ("Aangenomen.", "Verworpen.") because the bulk downloader used skip_details=True. The text embedding pipeline uses COALESCE(layman_explanation, description, title), so all embeddings are effectively embeddings of "Aangenomen." — zero semantic signal.

Goal: populate real motion titles (Zaak.Onderwerp) and motion body text (officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete dataset.

Constraints

  • Do NOT modify app.py or scheduler.py
  • DuckDB only; open/close per method
  • Use Python logging, no print() in library code
  • motions.id primary key is an INTEGER autoincrement; motions.url contains https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid} — the UUID is the Besluit.Id in the Tweede Kamer data model
  • database.py CREATE TABLE for motions is missing body_text and externe_identifier columns even though INSERT statements reference them — schema must be fixed

Approach

Use the SyncFeed API (https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed) to bulk-walk 4 entity types and build a complete local join index. This replaces the per-motion OData chain (3 API calls × 25,521 = 76,000+ calls) with ~2,000–4,000 paginated feed pages across all entity types.

Alternatives considered:

  • OData per-motion (_get_motion_details): 76k+ calls, estimated 10+ hours. Rejected.
  • OData bulk $expand: Works for titles (~100 pages) but getting ExterneIdentifier still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of SyncFeed which handles everything in one pass.

Architecture

SyncFeed walk (4 feeds)
  ├─ category=Besluit     → {besluit_id: [zaak_ids]}
  ├─ category=Zaak        → {zaak_id: {onderwerp, soort}}
  ├─ category=Document    → {document_id: [zaak_ids]}
  └─ category=DocumentVersie → {document_id: externe_identifier}
         ↓
  In-memory join:
  besluit_id → zaak_id → onderwerp          (title)
  besluit_id → zaak_id → document_id → ext_id  (ExterneIdentifier)
         ↓
  DB update pass: UPDATE motions SET title=?, externe_identifier=? WHERE url LIKE ?
         ↓
  Parallel HTML fetch (thread pool, 20 workers):
  GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
         ↓
  Pipeline re-run:
  clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
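
The walker at the top of this diagram could look roughly like the sketch below. It only illustrates the "follow <link rel="next"> until exhausted" loop; the `fetch` callable, the function name `walk_feed`, and the choice to yield raw Atom `<entry>` elements are assumptions for illustration, not the final interface.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List

# Standard Atom namespace; entries and paging links live in it.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def walk_feed(start_url: str, fetch: Callable[[str], str]) -> Iterator[List[ET.Element]]:
    """Yield the list of <entry> elements of each page, following rel="next".

    `fetch` is a hypothetical callable that returns the page body as text;
    the real script would wrap an HTTP client with backoff here.
    """
    url = start_url
    seen = set()  # guard against a feed that links back to itself
    while url and url not in seen:
        seen.add(url)
        root = ET.fromstring(fetch(url))
        yield root.findall(f"{ATOM_NS}entry")
        url = next(
            (link.get("href")
             for link in root.findall(f"{ATOM_NS}link")
             if link.get("rel") == "next"),
            None,
        )
```

Injecting `fetch` keeps the paging logic testable against fixture pages without touching the live feed.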

Components

scripts/sync_motion_content.py (new)

Orchestrates the full enrichment:

  1. SyncFeed walker — generic paginated Atom/XML reader that follows <link rel="next"> until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit via exponential backoff.

  2. Entity parsers — one per entity type:

    • parse_besluit(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
    • parse_zaak(xml) → {id, onderwerp, soort, verwijderd}
    • parse_document(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
    • parse_documentversie(xml) → {id, document_id, externe_identifier, extensie, verwijderd}
  3. Join builder — after all 4 feeds are walked:

    • build_title_map(besluit_index, zaak_index) → {besluit_id: onderwerp}
    • build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index) → {besluit_id: externe_identifier}
    • For motions with multiple Zaak, prefer Soort="Motie"; fall back to first
  4. DB updater — open DuckDB, bulk UPDATE motions using the join maps. Extract besluit_id from url column via string split.

  5. Body text fetcher — thread pool (20 workers), fetch HTML from zoek.officielebekendmakingen.nl/{ext_id}.html, strip HTML tags with regex (reuse existing _fetch_body_text logic), UPDATE motions.body_text.

  6. Progress reporting — log counts: motions updated with title, motions with ExterneIdentifier found, body text fetched, failures.
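
A minimal sketch of the title half of the join builder (step 3), assuming the index shapes from the Architecture section: besluit_index maps a Besluit id to a list of Zaak ids, and zaak_index maps a Zaak id to a dict with onderwerp and soort. The helper `pick_zaak` is a hypothetical name introduced here to hold the "prefer Soort="Motie", fall back to first" rule.

```python
from typing import Dict, List, Optional


def pick_zaak(zaak_ids: List[str], zaak_index: Dict[str, dict]) -> Optional[str]:
    """Prefer a Zaak with soort == "Motie"; otherwise take the first known one."""
    known = [z for z in zaak_ids if z in zaak_index]
    for zaak_id in known:
        if zaak_index[zaak_id].get("soort") == "Motie":
            return zaak_id
    return known[0] if known else None


def build_title_map(besluit_index: Dict[str, List[str]],
                    zaak_index: Dict[str, dict]) -> Dict[str, str]:
    """Return {besluit_id: onderwerp} for every Besluit with a usable Zaak."""
    title_map: Dict[str, str] = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        if zaak_id is None:
            continue  # procedural vote or Zaak missing from the feed
        onderwerp = zaak_index[zaak_id].get("onderwerp")
        if onderwerp:
            title_map[besluit_id] = onderwerp
    return title_map
```

Besluit ids without any known Zaak are simply absent from the map, which makes the "no title found" count easy to log.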

database.py schema fix

Add missing columns to CREATE TABLE motions DDL:

  • body_text TEXT
  • externe_identifier TEXT

Also add guarded migrations in _init_database() for existing DBs that don't have these columns yet, i.e. ALTER TABLE motions ADD COLUMN IF NOT EXISTS ... for each of the two columns.
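
One way to keep the guard explicit is to generate the ALTER statements from the set of required columns and let _init_database() execute them on its own connection. The sketch below only builds the SQL strings; the function name `missing_column_statements` and the `REQUIRED_COLUMNS` constant are hypothetical, and the `ADD COLUMN IF NOT EXISTS` form is assumed to be accepted by the DuckDB version in use.

```python
from typing import Iterable, List

# Columns the INSERT statements already reference but the DDL lacks.
REQUIRED_COLUMNS = {"body_text": "TEXT", "externe_identifier": "TEXT"}


def missing_column_statements(existing_columns: Iterable[str]) -> List[str]:
    """Return one ALTER TABLE statement per required column that is absent."""
    existing = {name.lower() for name in existing_columns}
    return [
        f"ALTER TABLE motions ADD COLUMN IF NOT EXISTS {name} {sql_type}"
        for name, sql_type in REQUIRED_COLUMNS.items()
        if name not in existing
    ]
```

Checking the existing columns first and still emitting IF NOT EXISTS is deliberately redundant, so the migration stays safe if it runs twice.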

pipeline/text_pipeline.py change

Update _select_text SQL:

COALESCE(m.layman_explanation, m.body_text, m.description, m.title)

(adds m.body_text as second-priority fallback)

scripts/rerun_embeddings.py (new or inline in sync script)

After enrichment:

  1. DELETE FROM embeddings — wipe all stale embeddings (they're all "Aangenomen.")
  2. Run pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)
  3. Run pipeline.fusion.fuse_for_window(window_id, db_path) for all 20 windows
  4. Run similarity.compute.compute_similarities(vector_type='fused', window_id=w) for all 20 windows
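
The ordering of these four steps can be sketched as a small orchestrator. The callables are injected because the exact signatures of the pipeline functions are not fixed in this document; `rerun_embeddings` and its parameter names are assumptions for illustration.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def rerun_embeddings(clear_embeddings: Callable[[], None],
                     embed_text: Callable[[], None],
                     fuse_window: Callable[[int], None],
                     compute_similarity: Callable[[int], None],
                     window_ids: Iterable[int]) -> int:
    """Run clear, text embedding, fusion, similarity in that order."""
    windows = list(window_ids)
    clear_embeddings()              # step 1: wipe stale embeddings
    embed_text()                    # step 2: text pipeline over all motions
    for window_id in windows:       # step 3: fuse every window
        fuse_window(window_id)
    for window_id in windows:       # step 4: similarity cache per window
        compute_similarity(window_id)
    logger.info("re-ran embeddings for %d windows", len(windows))
    return len(windows)
```

In the real script the callables would wrap ensure_text_embeddings, fuse_for_window and compute_similarities with db_path, model and batch_size already bound.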

Data Flow

motions.url
  → extract besluit_uuid (split on '/')
  → look up in title_map → UPDATE motions.title, motions.description
  → look up in ext_id_map → UPDATE motions.externe_identifier
  → fetch HTML → UPDATE motions.body_text

text_pipeline._select_text
  → COALESCE(layman_explanation, body_text, description, title)
  → now returns real motion text for ~60-80% of motions
  → outcome string fallback for the rest

fused_embeddings
  → [svd_vector || text_vector]  (text now has semantic content)

similarity_cache
  → re-computed for all 20 windows with meaningful vectors
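
The first step of this flow, pulling the Besluit UUID out of motions.url, might look like the sketch below. Validating with the standard uuid module and normalising to lowercase are assumptions added here so that lookups against the SyncFeed ids do not fail on casing; `extract_besluit_id` is a hypothetical name.

```python
import uuid
from typing import Optional


def extract_besluit_id(url: Optional[str]) -> Optional[str]:
    """Return the trailing UUID of a stemmingsuitslagen URL, or None."""
    if not url:
        return None
    # Drop any query string, then take the last path segment.
    candidate = url.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    try:
        return str(uuid.UUID(candidate))  # canonical lowercase form
    except ValueError:
        return None  # last segment is not a UUID
```

Returning None instead of raising lets the DB updater count and log malformed URLs without aborting the batch.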

Error Handling Strategy

  • SyncFeed: exponential backoff on 429/5xx; log and skip individual malformed entries; checkpoint skiptoken to disk so walk can resume after crash
  • Body text fetch: catch all per-URL exceptions, log, continue; motions without body text fall back to Zaak.Onderwerp in COALESCE
  • DB update: use DuckDB transactions per batch of 1000; rollback on failure
  • Missing Zaak/Document: expected for procedural votes; log counts; the join yields no title for these motions, so they keep their outcome string and COALESCE falls back to "Aangenomen." as before
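
The "catch all per-URL exceptions, log, continue" rule for the body text fetch could be sketched as follows. The regex-based `html_to_text` is a stand-in for the existing _fetch_body_text logic, and `fetch_html` is a hypothetical callable that takes an ExterneIdentifier and returns raw HTML; both are assumptions for illustration.

```python
import html
import logging
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger(__name__)

_SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.S | re.I)
_TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Strip script/style blocks and tags, decode entities, collapse whitespace."""
    text = _SCRIPT_RE.sub(" ", markup)
    text = _TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def fetch_body_texts(ext_ids: Iterable[str],
                     fetch_html: Callable[[str], str],
                     max_workers: int = 20) -> Tuple[Dict[str, str], List[str]]:
    """Fetch all documents in parallel; one failure never stops the others."""
    results: Dict[str, str] = {}
    failures: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_html, ext_id): ext_id for ext_id in ext_ids}
        for future in as_completed(futures):
            ext_id = futures[future]
            try:
                results[ext_id] = html_to_text(future.result())
            except Exception:
                logger.exception("body text fetch failed for %s", ext_id)
                failures.append(ext_id)
    return results, failures
```

The DB write is left to the caller so that the worker threads never touch DuckDB, which fits the open/close-per-method constraint.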

Testing Strategy

  • Unit tests for each XML parser using hardcoded fixture XML strings
  • Unit test for build_title_map with a small synthetic index
  • Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
  • After full run: query SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.', 'Verworpen.', 'Gestaakt.') — expect > 10,000
  • After embeddings: spot-check cosine similarity between two related motions (same topic) is higher than between unrelated motions
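
For the spot-check in the last bullet, a plain-Python cosine similarity is enough to compare a handful of vectors pulled from the DB. The function below is a generic sketch; how the vectors are loaded from the embeddings table is not shown and is left to the test.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The check then reduces to asserting that cosine_similarity(related_a, related_b) exceeds cosine_similarity(related_a, unrelated) for a few hand-picked motion pairs.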

Open Questions

  • Document–Zaak relationship: The SyncFeed Document entity may reference multiple Zaak IDs. For motions with multiple linked documents, we prefer the one with Soort="Motie" on the Zaak. Edge cases may need manual inspection.
  • SyncFeed total record count: Unknown until walked. Estimate 2,000–6,000 pages total across 4 feeds (the Approach section assumes the lower end, ~2,000–4,000). Could be more for Document/DocumentVersie.
  • Rate limits: SyncFeed documentation doesn't specify limits. Start at 1 req/s, increase if no 429s.
  • Body text coverage: Not all motions have an associated kamerstuk document. Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect 40–60% body text coverage.
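
Until the rate-limit question is answered, a conservative retry wrapper is probably the safest default. The sketch below retries on 429 and 5xx with exponential backoff starting at 1 second; the `fetch` callable returning a (status, body) pair, the retry count, and the name `fetch_with_backoff` are assumptions for illustration.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(url: str,
                       fetch: Callable[[str], Tuple[int, str]],
                       max_retries: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Return the body on HTTP 200; back off and retry on 429 or 5xx."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        retryable = status == 429 or 500 <= status < 600
        if not retryable:
            raise RuntimeError(f"unexpected status {status} for {url}")
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up on {url} after {max_retries} retries")
```

Passing `sleep` as a parameter lets unit tests record the delays instead of actually waiting.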