---
date: 2026-03-23
topic: "Motion Content Enrichment via SyncFeed"
status: validated
---

# Motion Content Enrichment via SyncFeed

## Problem Statement

All 25,521 motions in the DB have NULL `body_text` and NULL `layman_explanation`. Their
`title`/`description` are outcome strings ("Aangenomen.", "Verworpen.") because the bulk
downloader used `skip_details=True`. The text embedding pipeline uses
`COALESCE(layman_explanation, description, title)`, so every embedding is effectively an
embedding of an outcome string such as "Aangenomen." and carries zero semantic signal.

Goal: populate real motion titles (Zaak.Onderwerp) and motion body text
(officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete
dataset.

## Constraints

- Do NOT modify `app.py` or `scheduler.py`
- DuckDB only; open/close per method
- Use Python logging, no print() in library code
- `motions.id` primary key is an INTEGER autoincrement; `motions.url` contains
  `https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid}`, where the UUID
  is the Besluit.Id in the Tweede Kamer data model
- `database.py` CREATE TABLE for motions is missing `body_text` and `externe_identifier`
  columns even though INSERT statements reference them; the schema must be fixed

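Given the URL convention in the constraints, the Besluit UUID can be recovered from `motions.url` without any API call. The following is a minimal sketch; the function name `extract_besluit_id` is hypothetical, and it assumes the UUID is always the last path segment.

```python
import uuid
from typing import Optional
from urllib.parse import urlparse


def extract_besluit_id(url: str) -> Optional[str]:
    """Return the trailing Besluit UUID from a motions.url value, or None.

    Assumption: the UUID is the last path segment of the URL.
    """
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    try:
        # uuid.UUID validates the format and normalises to lowercase.
        return str(uuid.UUID(last_segment))
    except ValueError:
        # Not a UUID (e.g. a listing page URL); the caller should skip the row.
        return None
```

Validating with `uuid.UUID` rather than a bare string split means malformed URLs yield `None` instead of a bogus join key.
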
## Approach

Use the **SyncFeed API** (`https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed`) to
bulk-walk 4 entity types and build a complete local join index. This replaces the
per-motion OData chain (3 API calls × 25,521 motions = 76,563 calls) with an estimated
2,000–6,000 paginated feed pages across all entity types (see Open Questions).

Alternatives considered:
- **OData per-motion** (`_get_motion_details`): 76k+ calls, estimated 10+ hours. Rejected.
- **OData bulk $expand**: Works for titles (~100 pages), but getting ExterneIdentifier
  still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of
  SyncFeed, which handles everything in one pass.

## Architecture

```
SyncFeed walk (4 feeds)
  ├─ category=Besluit        → {besluit_id: [zaak_ids]}
  ├─ category=Zaak           → {zaak_id: {onderwerp, soort}}
  ├─ category=Document       → {document_id: [zaak_ids]}
  └─ category=DocumentVersie → {document_id: externe_identifier}
        ↓
In-memory join:
  besluit_id → zaak_id → onderwerp               (title)
  besluit_id → zaak_id → document_id → ext_id    (ExterneIdentifier)
        ↓
DB update pass: UPDATE motions SET title=?, description=?, externe_identifier=? WHERE url LIKE ?
        ↓
Parallel HTML fetch (thread pool, 20 workers):
  GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
        ↓
Pipeline re-run:
  clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
```

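The in-memory join in the diagram could look roughly like the sketch below. The index shapes follow the diagram; `pick_zaak` is a hypothetical helper, and the rule "prefer Soort="Motie", else the first known Zaak" is taken from the join builder description in Components. Treat it as an illustration, not the final implementation.

```python
from typing import Dict, List, Optional


def pick_zaak(zaak_ids: List[str], zaak_index: Dict[str, dict]) -> Optional[str]:
    """Prefer a Zaak with soort == 'Motie'; otherwise the first known Zaak."""
    known = [z for z in zaak_ids if z in zaak_index]
    for zaak_id in known:
        if zaak_index[zaak_id].get("soort") == "Motie":
            return zaak_id
    return known[0] if known else None


def build_title_map(besluit_index: Dict[str, List[str]],
                    zaak_index: Dict[str, dict]) -> Dict[str, str]:
    """besluit_id -> onderwerp, skipping Besluit rows without a usable Zaak."""
    title_map = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        if zaak_id is None:
            continue
        onderwerp = zaak_index[zaak_id].get("onderwerp")
        if onderwerp:
            title_map[besluit_id] = onderwerp
    return title_map


def build_ext_id_map(besluit_index: Dict[str, List[str]],
                     zaak_index: Dict[str, dict],
                     doc_index: Dict[str, List[str]],
                     docversie_index: Dict[str, str]) -> Dict[str, str]:
    """besluit_id -> externe_identifier via zaak -> document -> documentversie."""
    # Invert {document_id: [zaak_ids]} into {zaak_id: [document_ids]}.
    docs_by_zaak: Dict[str, List[str]] = {}
    for document_id, zaak_ids in doc_index.items():
        for zaak_id in zaak_ids:
            docs_by_zaak.setdefault(zaak_id, []).append(document_id)

    ext_id_map = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaak_id = pick_zaak(zaak_ids, zaak_index)
        for document_id in docs_by_zaak.get(zaak_id, []):
            ext_id = docversie_index.get(document_id)
            if ext_id:
                ext_id_map[besluit_id] = ext_id
                break  # first document with an identifier wins
    return ext_id_map
```

Besluit rows with no matching Zaak are simply absent from both maps, which is what lets the DB update pass leave those motions untouched.
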
## Components

### `scripts/sync_motion_content.py` (new)

Orchestrates the full enrichment:

1. **SyncFeed walker**: generic paginated Atom/XML reader that follows `<link rel="next">`
   until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit
   responses via exponential backoff.

2. **Entity parsers**: one per entity type:
   - `parse_besluit(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_zaak(xml)` → `{id, onderwerp, soort, verwijderd}`
   - `parse_document(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_documentversie(xml)` → `{id, document_id, externe_identifier, extensie, verwijderd}`

3. **Join builder**: after all 4 feeds are walked:
   - `build_title_map(besluit_index, zaak_index)` → `{besluit_id: onderwerp}`
   - `build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index)`
     → `{besluit_id: externe_identifier}`
   - For motions with multiple Zaak, prefer Soort="Motie"; fall back to first

4. **DB updater**: open DuckDB, bulk UPDATE motions using the join maps. Extract
   `besluit_id` from the `url` column via string split.

5. **Body text fetcher**: thread pool (20 workers), fetch HTML from
   `zoek.officielebekendmakingen.nl/{ext_id}.html`, strip HTML tags with regex (reuse
   existing `_fetch_body_text` logic), UPDATE `motions.body_text`.

6. **Progress reporting**: log counts of motions updated with a title, motions with an
   ExterneIdentifier found, body texts fetched, and failures.

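As an illustration of the walker in step 1, here is a minimal sketch of Atom pagination via `<link rel="next">`. The name `walk_feed` and the injected `fetch` callable are assumptions made so the loop can be exercised without network access; backoff and checkpointing are deliberately left out, and entries are yielded as raw XML elements for the entity parsers to handle.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Optional

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace


def walk_feed(start_url: str,
              fetch: Callable[[str], bytes]) -> Iterator[List[ET.Element]]:
    """Yield the <entry> elements of each page until no rel="next" link remains.

    `fetch` is a hypothetical callable returning the raw response body for a URL.
    """
    url: Optional[str] = start_url
    while url:
        root = ET.fromstring(fetch(url))
        yield root.findall(f"{ATOM}entry")

        # Follow the feed-level next link, if any.
        url = None
        for link in root.findall(f"{ATOM}link"):
            if link.get("rel") == "next":
                url = link.get("href")
                break
```

Passing `fetch` in keeps the pagination logic testable against fixture XML; in the real script it would wrap the HTTP client together with the retry policy.
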
### `database.py` schema fix

Add missing columns to `CREATE TABLE motions` DDL:
- `body_text TEXT`
- `externe_identifier TEXT`

Also add `ALTER TABLE motions ADD COLUMN IF NOT EXISTS ...` guard calls in `_init_database()`
for existing DBs that don't have these columns yet.

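A guard for existing databases might look like the sketch below. The helper name `ensure_motion_columns` is hypothetical, `conn` is assumed to be any object with an `execute(sql)` method (such as a DuckDB connection), and the `ADD COLUMN IF NOT EXISTS` form should be checked against the DuckDB version in use.

```python
# Columns referenced by INSERT statements but absent from the original DDL.
MOTION_COLUMNS = (
    ("body_text", "TEXT"),
    ("externe_identifier", "TEXT"),
)


def ensure_motion_columns(conn) -> None:
    """Add the missing motions columns; a no-op where they already exist."""
    for name, sql_type in MOTION_COLUMNS:
        # Names come from the constant above, never from user input.
        conn.execute(
            f"ALTER TABLE motions ADD COLUMN IF NOT EXISTS {name} {sql_type}"
        )
```

Because the statement is idempotent, it can run on every call to `_init_database()` without a separate schema-version check.
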
### `pipeline/text_pipeline.py` change

Update `_select_text` SQL:
```
COALESCE(m.layman_explanation, m.body_text, m.description, m.title)
```
(adds `m.body_text` as second-priority fallback)

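To make the fallback order concrete, the snippet below runs the new expression against three fixture rows. It uses the standard library's `sqlite3` purely as a stand-in so the example runs anywhere; the project itself is DuckDB-only, and `COALESCE` returns the first non-NULL argument in both engines. The row contents are invented fixtures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE motions (id INTEGER PRIMARY KEY, title TEXT, description TEXT,"
    " body_text TEXT, layman_explanation TEXT)"
)
conn.executemany(
    "INSERT INTO motions VALUES (?, ?, ?, ?, ?)",
    [
        # Fully enriched: body text wins over description/title.
        (1, "Motie over X", "Motie over X", "Full motion text (fixture)", None),
        # Title found but no kamerstuk document: falls back to the description.
        (2, "Motie over Y", "Motie over Y", None, None),
        # Procedural vote, nothing found: outcome string as before.
        (3, "Aangenomen.", "Aangenomen.", None, None),
    ],
)
rows = conn.execute(
    "SELECT m.id, COALESCE(m.layman_explanation, m.body_text, m.description, m.title)"
    " FROM motions m ORDER BY m.id"
).fetchall()
selected = dict(rows)
conn.close()
```

Row 2 shows why `description` has to be overwritten along with `title`: it sits ahead of `title` in the fallback chain, so a stale outcome string there would mask the new title.
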
### `scripts/rerun_embeddings.py` (new or inline in sync script)

After enrichment:
1. `DELETE FROM embeddings`: wipe all stale embeddings (they all encode outcome strings such as "Aangenomen.")
2. Run `pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)`
3. Run `pipeline.fusion.fuse_for_window(window_id, db_path)` for all 20 windows
4. Run `similarity.compute.compute_similarities(vector_type='fused', window_id=w)` for
   all 20 windows

## Data Flow

```
motions.url
  → extract besluit_uuid (split on '/')
  → look up in title_map  → UPDATE motions.title, motions.description
  → look up in ext_id_map → UPDATE motions.externe_identifier
  → fetch HTML            → UPDATE motions.body_text

text_pipeline._select_text
  → COALESCE(layman_explanation, body_text, description, title)
  → now returns real motion text for ~60-80% of motions
  → outcome string fallback for the rest

fused_embeddings
  → [svd_vector || text_vector]  (text now has semantic content)

similarity_cache
  → re-computed for all 20 windows with meaningful vectors
```

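The "fetch HTML → UPDATE motions.body_text" step could be sketched as follows. The existing `_fetch_body_text` logic is not shown in this document, so the regexes here are an assumption about what "strip HTML tags with regex" means; `fetch_body_texts` and the injected `fetch` callable are hypothetical names.

```python
import html
import logging
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def strip_html(markup: str) -> str:
    """Drop script/style blocks and tags, decode entities, collapse whitespace."""
    text = _SCRIPT_STYLE.sub(" ", markup)
    text = _TAG.sub(" ", text)
    return _WHITESPACE.sub(" ", html.unescape(text)).strip()


def fetch_body_texts(ext_ids: Iterable[str],
                     fetch: Callable[[str], str],
                     workers: int = 20) -> Dict[str, str]:
    """Fetch and clean body text per ExterneIdentifier; failures are logged and skipped."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, ext_id): ext_id for ext_id in ext_ids}
        for future in as_completed(futures):
            ext_id = futures[future]
            try:
                results[ext_id] = strip_html(future.result())
            except Exception:
                # One bad URL must not abort the whole run.
                logger.exception("body text fetch failed for %s", ext_id)
    return results
```

Collecting results in the main thread and writing them to DuckDB afterwards avoids sharing a connection across worker threads.
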
## Error Handling Strategy

- **SyncFeed**: exponential backoff on 429/5xx; log and skip individual malformed entries;
  checkpoint skiptoken to disk so the walk can resume after a crash
- **Body text fetch**: catch all per-URL exceptions, log, continue; motions without body
  text fall back to Zaak.Onderwerp in COALESCE
- **DB update**: use DuckDB transactions per batch of 1000; rollback on failure
- **Missing Zaak/Document**: expected for procedural votes; log counts; these motions get
  title = NULL, so COALESCE falls back to "Aangenomen." as before

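The backoff policy for the SyncFeed walk can be sketched as below. `RetryableError`, `with_backoff`, and the default parameters are all hypothetical; the real code would raise `RetryableError` when it sees a 429 or 5xx response. The `sleep` parameter exists only so the policy can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Raised by the caller for responses worth retrying (e.g. 429 or 5xx)."""


def with_backoff(call: Callable[[], T],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on RetryableError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, the waits are 1, 2, 4, and 8 seconds before the fifth and final attempt.
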
## Testing Strategy

- Unit tests for each XML parser using hardcoded fixture XML strings
- Unit test for `build_title_map` with a small synthetic index
- Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
- After full run: query
  `SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.', 'Verworpen.', 'Gestaakt.')`
  and expect > 10,000
- After embeddings: spot-check that cosine similarity between two related motions (same
  topic) is higher than between unrelated motions

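For the final spot-check, a dependency-free cosine similarity is enough. This is a sketch: the function name is hypothetical, and loading the actual vectors from the embeddings table is left to the caller.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The check then reduces to asserting `cosine_similarity(v_a, v_related) > cosine_similarity(v_a, v_unrelated)` for a few hand-picked motion triples.
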
## Open Questions

- **Document–Zaak relationship**: The SyncFeed Document entity may reference multiple
  Zaak IDs. For motions with multiple linked documents, we prefer the one with
  Soort="Motie" on the Zaak. Edge cases may need manual inspection.
- **SyncFeed total record count**: Unknown until walked. Estimate 2,000–6,000 pages total
  across 4 feeds. Could be more for Document/DocumentVersie.
- **Rate limits**: SyncFeed documentation doesn't specify limits. Start at 1 req/s,
  increase if no 429s.
- **Body text coverage**: Not all motions have an associated kamerstuk document.
  Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect
  40–60% body text coverage.