| date | topic | status |
|---|---|---|
| 2026-03-23 | Motion Content Enrichment via SyncFeed | validated |
Motion Content Enrichment via SyncFeed
Problem Statement
All 25,521 motions in the DB have NULL body_text and NULL layman_explanation. Their
title/description are outcome strings ("Aangenomen.", "Verworpen.") because the bulk
downloader used skip_details=True. The text embedding pipeline uses
COALESCE(layman_explanation, description, title), so all embeddings are effectively
embeddings of "Aangenomen." — zero semantic signal.
Goal: populate real motion titles (Zaak.Onderwerp) and motion body text (officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete dataset.
Constraints
- Do NOT modify app.py or scheduler.py
- DuckDB only; open/close per method
- Use Python logging, no print() in library code
- motions.id primary key is an INTEGER autoincrement; motions.url contains https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid}, where the UUID is the Besluit.Id in the Tweede Kamer data model
- database.py CREATE TABLE for motions is missing body_text and externe_identifier columns even though INSERT statements reference them; schema must be fixed
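Because motions.url embeds the Besluit.Id, the join key can be recovered without any API call. A minimal sketch, assuming the UUID is always the last path segment of the URL; the helper name besluit_id_from_url is hypothetical:

```python
import uuid
from typing import Optional
from urllib.parse import urlparse


def besluit_id_from_url(url: str) -> Optional[str]:
    """Return the Besluit UUID embedded in a motions.url value, or None."""
    # Assumption: the UUID is the last non-empty path segment.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    try:
        # uuid.UUID validates the format and normalises to lowercase.
        return str(uuid.UUID(segments[-1]))
    except ValueError:
        return None
```

Normalising to lowercase matters only if the feed and the stored URLs differ in case; whether they do is not known here, so the normalisation is a precaution.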
Approach
Use the SyncFeed API (https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed) to
bulk-walk 4 entity types and build a complete local join index. This replaces the
per-motion OData chain (3 API calls × 25,521 = 76,000+ calls) with ~2,000–4,000 paginated
feed pages across all entity types.
Alternatives considered:
- OData per-motion (_get_motion_details): 76k+ calls, estimated 10+ hours. Rejected.
- OData bulk $expand: Works for titles (~100 pages) but getting ExterneIdentifier still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of SyncFeed, which handles everything in one pass.
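The paginated walk can be sketched as a generator that keeps following the feed-level next link. This is an illustration under assumptions: the feed is taken to be standard Atom, and walk_feed and its injected fetch callable are hypothetical names, so the HTTP client and retry policy stay outside the sketch.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Optional

ATOM = "{http://www.w3.org/2005/Atom}"
FEED_URL = "https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed"


def walk_feed(start_url: str, fetch: Callable[[str], bytes]) -> Iterator[List[ET.Element]]:
    """Yield the <entry> elements of each page until no rel="next" link remains."""
    url: Optional[str] = start_url
    while url:
        root = ET.fromstring(fetch(url))
        yield root.findall(f"{ATOM}entry")
        # Look only at direct children of <feed>; entries may carry their own links.
        url = next(
            (link.get("href") for link in root.findall(f"{ATOM}link")
             if link.get("rel") == "next"),
            None,
        )
```

A call such as walk_feed(f"{FEED_URL}?category=Zaak", fetch) would then be issued once per entity type. Passing fetch in keeps the walker testable against canned pages.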
Architecture
SyncFeed walk (4 feeds)
├─ category=Besluit → {besluit_id: [zaak_ids]}
├─ category=Zaak → {zaak_id: {onderwerp, soort}}
├─ category=Document → {document_id: [zaak_ids]}
└─ category=DocumentVersie → {document_id: externe_identifier}
↓
In-memory join:
besluit_id → zaak_id → onderwerp (title)
besluit_id → zaak_id → document_id → ext_id (ExterneIdentifier)
↓
DB update pass: UPDATE motions SET title=?, externe_identifier=? WHERE url LIKE ?
↓
Parallel HTML fetch (thread pool, 20 workers):
GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
↓
Pipeline re-run:
clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
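The in-memory join step can be sketched with plain dicts shaped like the four indexes in the diagram. The function names build_title_map and build_ext_id_map come from this design; the helper pick_zaak and the choice of the first document that has an ExterneIdentifier are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def pick_zaak(zaak_ids: List[str], zaak_index: Dict[str, dict]) -> Optional[str]:
    """Prefer a Zaak with soort == "Motie"; otherwise take the first known one."""
    known = [z for z in zaak_ids if z in zaak_index]
    for z in known:
        if zaak_index[z].get("soort") == "Motie":
            return z
    return known[0] if known else None


def build_title_map(besluit_index, zaak_index) -> Dict[str, str]:
    titles = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        if zaak_id and zaak_index[zaak_id].get("onderwerp"):
            titles[besluit_id] = zaak_index[zaak_id]["onderwerp"]
    return titles


def build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index) -> Dict[str, str]:
    # Invert document -> zaak so each Zaak can find its documents.
    docs_by_zaak: Dict[str, List[str]] = {}
    for doc_id, zaak_ids in doc_index.items():
        for z in zaak_ids:
            docs_by_zaak.setdefault(z, []).append(doc_id)
    ext_ids = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        # Assumption: the first linked document with an ExterneIdentifier wins.
        for doc_id in docs_by_zaak.get(zaak_id, []):
            if doc_id in docversie_index:
                ext_ids[besluit_id] = docversie_index[doc_id]
                break
    return ext_ids
```

Besluit entries whose Zaak or Document is missing simply do not appear in the resulting maps, which makes the coverage counts easy to log.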
Components
scripts/sync_motion_content.py (new)
Orchestrates the full enrichment:
- SyncFeed walker: generic paginated Atom/XML reader that follows <link rel="next"> until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit via exponential backoff.
- Entity parsers: one per entity type:
  - parse_besluit(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
  - parse_zaak(xml) → {id, onderwerp, soort, verwijderd}
  - parse_document(xml) → {id, zaak_refs: [uuid, ...], verwijderd}
  - parse_documentversie(xml) → {id, document_id, externe_identifier, extensie, verwijderd}
- Join builder: after all 4 feeds are walked:
  - build_title_map(besluit_index, zaak_index) → {besluit_id: onderwerp}
  - build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index) → {besluit_id: externe_identifier}
  - For motions with multiple Zaak, prefer Soort="Motie"; fall back to first
- DB updater: open DuckDB, bulk UPDATE motions using the join maps. Extract besluit_id from url column via string split.
- Body text fetcher: thread pool (20 workers), fetch HTML from zoek.officielebekendmakingen.nl/{ext_id}.html, strip HTML tags with regex (reuse existing _fetch_body_text logic), UPDATE motions.body_text.
- Progress reporting: log counts of motions updated with title, motions with ExterneIdentifier found, body text fetched, and failures.
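The body text fetcher might look roughly like the sketch below. The regex stripper is a stand-in for the existing _fetch_body_text logic, which is not reproduced here. The names html_to_text and fetch_bodies and the injected fetch callable are hypothetical.

```python
import html
import logging
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

log = logging.getLogger(__name__)

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Crude regex-based tag stripper; not a full HTML parser."""
    text = _SCRIPT_STYLE.sub(" ", markup)
    text = _TAG.sub(" ", text)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def fetch_bodies(ext_ids: Iterable[str], fetch: Callable[[str], str],
                 workers: int = 20) -> Dict[str, str]:
    """Fetch and strip HTML per ExterneIdentifier; failures are logged and skipped."""
    def job(ext_id: str) -> str:
        url = f"https://zoek.officielebekendmakingen.nl/{ext_id}.html"
        return html_to_text(fetch(url))

    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(job, e): e for e in ext_ids}
        for future in as_completed(futures):
            ext_id = futures[future]
            try:
                results[ext_id] = future.result()
            except Exception:
                log.exception("body text fetch failed for %s", ext_id)
    return results
```

Returning a dict rather than writing from the worker threads keeps all DuckDB access in the calling thread, which fits the open/close-per-method constraint.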
database.py schema fix
Add missing columns to CREATE TABLE motions DDL:
- body_text TEXT
- externe_identifier TEXT
Also add guarded ALTER TABLE motions ADD COLUMN calls in _init_database(), executed only when the
column does not exist yet, so that existing DBs without these columns are migrated in place.
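One way to express the guard is to compute the missing columns first and emit ALTER statements only for those. A minimal sketch; the helper name missing_column_statements is hypothetical, and how the existing column names are obtained (for example from information_schema.columns) is left to _init_database():

```python
from typing import Iterable, List

# Columns the motions table must have, with their SQL types.
REQUIRED_COLUMNS = {"body_text": "TEXT", "externe_identifier": "TEXT"}


def missing_column_statements(existing: Iterable[str]) -> List[str]:
    """Return ALTER statements for required motions columns not yet present."""
    present = {name.lower() for name in existing}
    return [
        f"ALTER TABLE motions ADD COLUMN {name} {sql_type}"
        for name, sql_type in REQUIRED_COLUMNS.items()
        if name not in present
    ]
```

Running the returned statements is a no-op on a fresh database created from the fixed DDL, so the same code path serves both new and existing installs.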
pipeline/text_pipeline.py change
Update _select_text SQL:
COALESCE(m.layman_explanation, m.body_text, m.description, m.title)
(adds m.body_text as second-priority fallback)
scripts/rerun_embeddings.py (new or inline in sync script)
After enrichment:
- DELETE FROM embeddings: wipe all stale embeddings (they're all "Aangenomen.")
- Run pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)
- Run pipeline.fusion.fuse_for_window(window_id, db_path) for all 20 windows
- Run similarity.compute.compute_similarities(vector_type='fused', window_id=w) for all 20 windows
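The ordering matters: fusion needs fresh text vectors, and the similarity cache needs fused vectors. A sketch of the orchestration with the four steps passed in as callables, so the real pipeline functions named above can be bound at the call site; the function rerun_embeddings and its parameter names are hypothetical:

```python
import logging
from typing import Callable, Iterable

log = logging.getLogger(__name__)


def rerun_embeddings(
    clear_embeddings: Callable[[], None],
    embed_text: Callable[[], None],
    fuse_window: Callable[[str], None],
    compute_similarity: Callable[[str], None],
    window_ids: Iterable[str],
) -> None:
    """Run the four rebuild steps in dependency order."""
    windows = list(window_ids)
    clear_embeddings()        # e.g. DELETE FROM embeddings
    embed_text()              # e.g. ensure_text_embeddings(db_path, model, batch_size)
    for w in windows:         # fusion reads the fresh text vectors
        fuse_window(w)
    for w in windows:         # similarity reads the fused vectors
        compute_similarity(w)
    log.info("re-ran embeddings for %d windows", len(windows))
```

Finishing fusion for every window before any similarity computation is a conservative choice; interleaving per window would also satisfy the dependencies.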
Data Flow
motions.url
→ extract besluit_uuid (split on '/')
→ look up in title_map → UPDATE motions.title, motions.description
→ look up in ext_id_map → UPDATE motions.externe_identifier
→ fetch HTML → UPDATE motions.body_text
text_pipeline._select_text
→ COALESCE(layman_explanation, body_text, description, title)
→ now returns real motion text for ~60-80% of motions
→ outcome string fallback for the rest
fused_embeddings
→ [svd_vector || text_vector] (text now has semantic content)
similarity_cache
→ re-computed for all 20 windows with meaningful vectors
Error Handling Strategy
- SyncFeed: exponential backoff on 429/5xx; log and skip individual malformed entries; checkpoint skiptoken to disk so walk can resume after crash
- Body text fetch: catch all per-URL exceptions, log, continue; motions without body text fall back to Zaak.Onderwerp in COALESCE
- DB update: use DuckDB transactions per batch of 1000; rollback on failure
- Missing Zaak/Document: expected for procedural votes; log counts; these motions get title = NULL → COALESCE falls back to "Aangenomen." as before
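The backoff and checkpoint behaviour for the SyncFeed walk can be sketched as below. Everything here is illustrative: RetryableError, with_backoff, save_checkpoint and load_checkpoints are hypothetical names, and the checkpoint is assumed to be a small JSON file mapping each feed category to the next page URL (which carries the skiptoken).

```python
import json
import os
import time


class RetryableError(Exception):
    """Raised by the caller's fetch function on 429 or 5xx responses."""


def with_backoff(fetch, url, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), doubling the delay after each RetryableError."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


def load_checkpoints(path: str) -> dict:
    """Return {category: next_url}, or an empty dict if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(path: str, category: str, next_url: str) -> None:
    """Record the next page URL for one feed category."""
    state = load_checkpoints(path)
    state[category] = next_url
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    # Write to a temp file and swap it in, so a crash mid-write
    # does not leave a truncated checkpoint behind.
    os.replace(tmp, path)
```

On restart, the walker would begin each category at load_checkpoints(path).get(category) when present, and at the first page otherwise.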
Testing Strategy
- Unit tests for each XML parser using hardcoded fixture XML strings
- Unit test for build_title_map with a small synthetic index
- Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
- After full run: query SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.', 'Verworpen.', 'Gestaakt.'); expect > 10,000
- After embeddings: spot-check cosine similarity between two related motions (same topic) is higher than between unrelated motions
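A parser unit test with a hardcoded fixture could take the following shape. The fixture is hypothetical: its element names, attribute names and namespace are assumptions rather than a copy of a real SyncFeed entry, and the parser matches on local names so that the exact namespace does not matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical fixture; not copied from a real SyncFeed response.
ZAAK_FIXTURE = """
<zaak xmlns="urn:example:tk" id="11111111-2222-3333-4444-555555555555" verwijderd="false">
  <soort>Motie</soort>
  <onderwerp>Motie over woningbouw</onderwerp>
</zaak>
""".strip()


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1]


def parse_zaak(xml: str) -> dict:
    root = ET.fromstring(xml)
    fields = {_local(child.tag): (child.text or "").strip() for child in root}
    attrs = {_local(k): v for k, v in root.attrib.items()}
    return {
        "id": attrs.get("id"),
        "onderwerp": fields.get("onderwerp") or None,
        "soort": fields.get("soort") or None,
        "verwijderd": attrs.get("verwijderd") == "true",
    }


def test_parse_zaak_fixture():
    assert parse_zaak(ZAAK_FIXTURE) == {
        "id": "11111111-2222-3333-4444-555555555555",
        "onderwerp": "Motie over woningbouw",
        "soort": "Motie",
        "verwijderd": False,
    }
```

The test function uses plain assert, so it can be called directly or collected by a runner such as pytest.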
Open Questions
- Document–Zaak relationship: The SyncFeed Document entity may reference multiple Zaak IDs. For motions with multiple linked documents, we prefer the one with Soort="Motie" on the Zaak. Edge cases may need manual inspection.
- SyncFeed total record count: Unknown until walked. Estimate 2,000–6,000 pages total across 4 feeds. Could be more for Document/DocumentVersie.
- Rate limits: SyncFeed documentation doesn't specify limits. Start at 1 req/s, increase if no 429s.
- Body text coverage: Not all motions have an associated kamerstuk document. Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect 40–60% body text coverage.