date	topic	status
2026-03-23	Motion Content Enrichment via SyncFeed	validated

Motion Content Enrichment via SyncFeed

Problem Statement

All 25,521 motions in the DB have NULL body_text and NULL layman_explanation. Their title and description hold outcome strings ("Aangenomen.", "Verworpen.") because the bulk downloader ran with skip_details=True. The text embedding pipeline selects COALESCE(layman_explanation, description, title), so every embedding is effectively an embedding of an outcome string such as "Aangenomen." and carries zero semantic signal.

Goal: populate real motion titles (Zaak.Onderwerp) and motion body text (officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete dataset.

Constraints

Do NOT modify app.py or scheduler.py
DuckDB only; open/close per method
Use Python logging, no print() in library code
motions.id primary key is an INTEGER autoincrement; motions.url contains https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid} — the UUID is the Besluit.Id in the Tweede Kamer data model
database.py CREATE TABLE for motions is missing body_text and externe_identifier columns even though INSERT statements reference them — schema must be fixed
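Because the join key lives inside motions.url rather than in its own column, the UUID has to be recovered from the URL. A minimal sketch of that extraction, assuming the UUID is always the last path segment; the function name extract_besluit_id is hypothetical and not taken from the codebase:

```python
import uuid
from typing import Optional
from urllib.parse import urlparse


def extract_besluit_id(url: str) -> Optional[str]:
    """Return the trailing Besluit UUID of a motions.url value, or None.

    Assumption: the UUID is the last non-empty path segment of the URL.
    """
    if not url:
        return None
    # Split on '/' and keep the last segment, ignoring a trailing slash.
    segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    try:
        # uuid.UUID validates the segment; str() yields the lowercase
        # hyphenated form, so lookups are case-insensitive on the input.
        return str(uuid.UUID(segment))
    except ValueError:
        return None
```

Validating with uuid.UUID rather than trusting the split keeps malformed URLs out of the join; such rows simply yield None and can be counted in the progress log.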

Approach

Use the SyncFeed API (https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed) to bulk-walk 4 entity types and build a complete local join index. This replaces the per-motion OData chain (3 API calls × 25,521 motions = 76,563 calls) with an estimated 2,000–6,000 paginated feed pages across all entity types (the exact total is unknown until the feeds are walked).

Alternatives considered:

OData per-motion (_get_motion_details): 76k+ calls, estimated 10+ hours. Rejected.
OData bulk $expand: Works for titles (~100 pages) but getting ExterneIdentifier still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of SyncFeed which handles everything in one pass.
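The call-count comparison behind this decision can be checked with a few lines of arithmetic. The page counts and the 1 request per second pace are this document's own estimates, not measurements:

```python
MOTIONS = 25_521
CALLS_PER_MOTION = 3  # length of the per-motion OData chain

odata_calls = MOTIONS * CALLS_PER_MOTION  # 76,563 calls

# Assumption: 10 hours is the lower bound quoted for the OData run,
# which implies roughly this many seconds per call.
implied_seconds_per_call = 10 * 3600 / odata_calls

# Assumption: SyncFeed is walked at 1 request per second.
# Minutes needed for several plausible total page counts.
syncfeed_minutes = {pages: pages / 60 for pages in (2_000, 4_000, 6_000)}
```

Even at the pessimistic end of 6,000 pages, the walk takes 100 minutes at that pace, against 10+ hours for the per-motion chain.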

Architecture

SyncFeed walk (4 feeds)
  ├─ category=Besluit     → {besluit_id: [zaak_ids]}
  ├─ category=Zaak        → {zaak_id: {onderwerp, soort}}
  ├─ category=Document    → {document_id: [zaak_ids]}
  └─ category=DocumentVersie → {document_id: externe_identifier}
         ↓
  In-memory join:
  besluit_id → zaak_id → onderwerp          (title)
  besluit_id → zaak_id → document_id → ext_id  (ExterneIdentifier)
         ↓
  DB update pass: UPDATE motions SET title=?, description=?, externe_identifier=? WHERE url LIKE ?
         ↓
  Parallel HTML fetch (thread pool, 20 workers):
  GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
         ↓
  Pipeline re-run:
  clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
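The feed walk at the top of this diagram can be sketched as a generator that follows pagination links. This assumes the feed is standard Atom XML with a rel="next" link on every page except the last; parse_page, walk_feed and the injected fetch callable are hypothetical names, and the HTTP layer is deliberately left out:

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Optional, Tuple

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace


def parse_page(xml_text: str) -> Tuple[List[ET.Element], Optional[str]]:
    """Return (entries, next_url) for one Atom page."""
    root = ET.fromstring(xml_text)
    entries = root.findall(f"{ATOM}entry")
    next_url = None
    for link in root.findall(f"{ATOM}link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
            break
    return entries, next_url


def walk_feed(start_url: str, fetch: Callable[[str], str]) -> Iterator[ET.Element]:
    """Yield every entry, following rel="next" links until none remains.

    `fetch` maps a URL to the page's XML text; injecting it keeps the
    walker testable without network access.
    """
    url: Optional[str] = start_url
    while url:
        entries, url = parse_page(fetch(url))
        yield from entries
```

In a real run, fetch would issue the HTTP GET against the SyncFeed URL for one category and the entity parsers would consume the yielded entries; the same walker serves all four feeds.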

Components

`scripts/sync_motion_content.py` (new)

Orchestrates the full enrichment:

SyncFeed walker — generic paginated Atom/XML reader that follows <link rel="next"> until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit via exponential backoff.
Entity parsers — one per entity type:
- parse_besluit(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
- parse_zaak(xml) → {id, onderwerp, soort, verwijderd}
- parse_document(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
- parse_documentversie(xml) → {id, document_id, externe_identifier, extensie, verwijderd}
Join builder — after all 4 feeds are walked:
- build_title_map(besluit_index, zaak_index) → {besluit_id: onderwerp}
- build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index) → {besluit_id: externe_identifier}
- For motions with multiple Zaak, prefer Soort="Motie"; fall back to first
DB updater — open DuckDB, bulk UPDATE motions using the join maps. Extract besluit_id from url column via string split.
Body text fetcher — thread pool (20 workers), fetch HTML from zoek.officielebekendmakingen.nl/{ext_id}.html, strip HTML tags with regex (reuse existing _fetch_body_text logic), UPDATE motions.body_text.
Progress reporting — log counts: motions updated with title, motions with ExterneIdentifier found, body text fetched, failures.
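The join builder described above can be sketched with plain dicts. The index shapes follow the Architecture diagram; pick_zaak is a hypothetical helper, and taking the first document that has an ExterneIdentifier is an assumption rather than a settled rule:

```python
from typing import Dict, List, Optional


def pick_zaak(zaak_ids: List[str], zaak_index: Dict[str, dict]) -> Optional[str]:
    """Prefer a Zaak with soort == "Motie"; otherwise the first known one."""
    known = [z for z in zaak_ids if z in zaak_index]
    for z in known:
        if zaak_index[z].get("soort") == "Motie":
            return z
    return known[0] if known else None


def build_title_map(besluit_index: Dict[str, List[str]],
                    zaak_index: Dict[str, dict]) -> Dict[str, str]:
    """Map besluit_id -> onderwerp of the preferred Zaak."""
    title_map = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        if zaak_id and zaak_index[zaak_id].get("onderwerp"):
            title_map[besluit_id] = zaak_index[zaak_id]["onderwerp"]
    return title_map


def build_ext_id_map(besluit_index: Dict[str, List[str]],
                     zaak_index: Dict[str, dict],
                     doc_index: Dict[str, List[str]],
                     docversie_index: Dict[str, str]) -> Dict[str, str]:
    """Map besluit_id -> externe_identifier via Zaak and Document."""
    # Invert {document_id: [zaak_ids]} so documents can be found per Zaak.
    docs_by_zaak: Dict[str, List[str]] = {}
    for doc_id, zaak_ids in doc_index.items():
        for z in zaak_ids:
            docs_by_zaak.setdefault(z, []).append(doc_id)

    ext_map = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        for doc_id in docs_by_zaak.get(zaak_id, []):
            ext_id = docversie_index.get(doc_id)
            if ext_id:  # assumption: first document with an identifier wins
                ext_map[besluit_id] = ext_id
                break
    return ext_map
```

Besluit entries whose Zaak or Document is missing simply do not appear in the maps, which makes the coverage counts for progress reporting a matter of comparing map sizes with the number of motions.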

`database.py` schema fix

Add missing columns to CREATE TABLE motions DDL:

body_text TEXT
externe_identifier TEXT

Also add guarded ALTER TABLE motions ADD COLUMN IF NOT EXISTS statements to _init_database() so that existing DBs that lack these columns are migrated.
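A minimal sketch of the migration guard, assuming the installed DuckDB version accepts ADD COLUMN IF NOT EXISTS (worth confirming against the DuckDB ALTER TABLE documentation). The helper names are hypothetical; the connection is passed in so the sketch does not dictate how _init_database() opens it:

```python
from typing import List, Tuple

# Columns named in this section; TEXT matches the DDL fragment above.
MISSING_COLUMNS: Tuple[Tuple[str, str], ...] = (
    ("body_text", "TEXT"),
    ("externe_identifier", "TEXT"),
)


def migration_statements(table: str = "motions") -> List[str]:
    """Build one idempotent ALTER TABLE statement per missing column."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {sql_type}"
        for name, sql_type in MISSING_COLUMNS
    ]


def ensure_motion_columns(con) -> None:
    """Run the guards on an open connection (any object with .execute(sql))."""
    for stmt in migration_statements():
        con.execute(stmt)
```

Because each statement is idempotent, running the guard on every startup is safe for both fresh and existing databases.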

`pipeline/text_pipeline.py` change

Update _select_text SQL:

COALESCE(m.layman_explanation, m.body_text, m.description, m.title)

(adds m.body_text as second-priority fallback)
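The fallback order matters because COALESCE returns the first non-NULL argument, so a stale description shadows an enriched title. The sketch below demonstrates this with the standard-library sqlite3 module as a stand-in for DuckDB (COALESCE has the same first-non-NULL semantics in both); the rows are made-up placeholders:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE motions (id INTEGER PRIMARY KEY, layman_explanation TEXT, "
    "body_text TEXT, description TEXT, title TEXT)"
)
rows = [
    # body text present: it wins over description and title
    (1, None, "placeholder body text", "Aangenomen.", "Motie over X"),
    # title enriched but description left as the outcome string
    (2, None, None, "Aangenomen.", "Motie over X"),
    # description and title both enriched
    (3, None, None, "Motie over X", "Motie over X"),
]
con.executemany("INSERT INTO motions VALUES (?, ?, ?, ?, ?)", rows)

SELECT_TEXT = (
    "SELECT m.id, COALESCE(m.layman_explanation, m.body_text, "
    "m.description, m.title) FROM motions m ORDER BY m.id"
)
selected = dict(con.execute(SELECT_TEXT).fetchall())
con.close()
```

Row 2 still selects "Aangenomen.", which is why the enrichment has to overwrite description as well as title for motions that end up without body text.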

`scripts/rerun_embeddings.py` (new or inline in sync script)

After enrichment:

DELETE FROM embeddings — wipe all stale embeddings (they're all "Aangenomen.")
Run pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)
Run pipeline.fusion.fuse_for_window(window_id, db_path) for all 20 windows
Run similarity.compute.compute_similarities(vector_type='fused', window_id=w) for all 20 windows

Data Flow

motions.url
  → extract besluit_uuid (split on '/')
  → look up in title_map → UPDATE motions.title, motions.description
  → look up in ext_id_map → UPDATE motions.externe_identifier
  → fetch HTML → UPDATE motions.body_text

text_pipeline._select_text
  → COALESCE(layman_explanation, body_text, description, title)
  → now returns real motion text (body text, else the Zaak.Onderwerp title) for ~60–80% of motions
  → outcome string fallback for the rest

fused_embeddings
  → [svd_vector || text_vector]  (text now has semantic content)

similarity_cache
  → re-computed for all 20 windows with meaningful vectors
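The "fetch HTML → UPDATE motions.body_text" step of this flow can be sketched with a thread pool that never lets one bad URL abort the run. The regex stripping here is a stand-in for the existing _fetch_body_text logic, which is not shown in this document; strip_html, fetch_bodies and the injected fetch callable are hypothetical names:

```python
import html
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def strip_html(markup: str) -> str:
    """Crude regex tag stripping; it does not drop script/style contents."""
    text = TAG_RE.sub(" ", markup)
    return WS_RE.sub(" ", html.unescape(text)).strip()


def fetch_bodies(
    ext_ids: Iterable[str],
    fetch: Callable[[str], str],
    workers: int = 20,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return ({ext_id: text}, {ext_id: error}); per-URL errors never raise."""
    bodies: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, ext_id): ext_id for ext_id in ext_ids}
        for future in as_completed(futures):
            ext_id = futures[future]
            try:
                bodies[ext_id] = strip_html(future.result())
            except Exception as exc:  # record the failure and continue
                failures[ext_id] = repr(exc)
    return bodies, failures
```

Results are collected in the main thread, so the DuckDB writes can happen afterwards on a single connection instead of from 20 workers at once.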

Error Handling Strategy

SyncFeed: exponential backoff on 429/5xx; log and skip individual malformed entries; checkpoint skiptoken to disk so walk can resume after crash
Body text fetch: catch all per-URL exceptions, log, continue; motions without body text fall back in COALESCE to description/title, which by then hold Zaak.Onderwerp
DB update: use DuckDB transactions per batch of 1000; rollback on failure
Missing Zaak/Document: expected for procedural votes; log counts; these motions get no title from the join and are left unchanged, so COALESCE returns the outcome string ("Aangenomen.") as before
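The SyncFeed retry and checkpoint behaviour can be sketched as follows. The attempt limit, base delay, jitter and checkpoint file layout are assumptions; the checkpoint stores the next page URL (which carries the skiptoken) rather than the bare token, and http_get is a hypothetical callable returning (status, body):

```python
import json
import os
import random
import time
from typing import Callable, Optional, Tuple


def is_retryable(status: int) -> bool:
    """429 and any 5xx response are retried."""
    return status == 429 or 500 <= status < 600


def get_with_backoff(
    url: str,
    http_get: Callable[[str], Tuple[int, str]],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch url, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        status, body = http_get(url)
        if status == 200:
            return body
        if not is_retryable(status):
            raise RuntimeError(f"unexpected HTTP {status} for {url}")
        if attempt < max_attempts - 1:
            delay = base_delay * 2 ** attempt
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")


def save_checkpoint(path: str, next_url: str) -> None:
    """Write the resume point atomically (temp file, then rename)."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"next_url": next_url}, fh)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> Optional[str]:
    """Return the saved resume URL, or None when no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)["next_url"]
    except FileNotFoundError:
        return None
```

Writing the checkpoint after each page has been fully processed means a crash costs at most one page of repeated work.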

Testing Strategy

Unit tests for each XML parser using hardcoded fixture XML strings
Unit test for build_title_map with a small synthetic index
Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
After full run: query SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.', 'Verworpen.', 'Gestaakt.') — expect > 10,000
After embeddings: spot-check cosine similarity between two related motions (same topic) is higher than between unrelated motions
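The embedding spot-check can be expressed as a small assertion helper. This is a pure-Python sketch with hypothetical function names; how the vectors are loaded from the DB is left open, and the real check would compare the stored fused or text vectors of hand-picked motions:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, avoids division by zero
    return dot / (norm_a * norm_b)


def related_beats_unrelated(
    anchor: Sequence[float],
    related: Sequence[float],
    unrelated: Sequence[float],
) -> bool:
    """True when the same-topic motion is closer to the anchor."""
    return cosine_similarity(anchor, related) > cosine_similarity(anchor, unrelated)
```

With the old "Aangenomen." embeddings this check cannot discriminate, since all text vectors are near-identical; after enrichment it should pass for clearly related pairs.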

Open Questions

Document–Zaak relationship: The SyncFeed Document entity may reference multiple Zaak IDs. For motions with multiple linked documents, we prefer the one with Soort="Motie" on the Zaak. Edge cases may need manual inspection.
SyncFeed total record count: Unknown until walked. Estimate 2,000–6,000 pages total across 4 feeds. Could be more for Document/DocumentVersie.
Rate limits: SyncFeed documentation doesn't specify limits. Start at 1 req/s, increase if no 429s.
Body text coverage: Not all motions have an associated kamerstuk document. Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect 40–60% body text coverage.
