# Session: stemwijzer - Parliamentary Embedding Pipeline

Updated: 2026-03-22T16:00:00Z

## Goal

2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.

## Constraints

- DuckDB only (`data/motions.db`); open/close `duckdb.connect(self.db_path)` per method
- Vectors stored as JSON text (no external vector DB)
- Logging via `logging.getLogger(__name__)`; no `print()` in library modules
- Tests run offline (network monkeypatched); use `.venv/bin/python -m pytest -q`
- Do NOT modify `app.py` or `scheduler.py`
- Use `.venv/bin/python` (Arch Linux system Python is externally managed)

## Current DB State (verified 2026-03-22 ~16:00)

| Table | Rows |
|---|---|
| motions | 10,613 |
| embeddings | 10,753 |
| svd_vectors | 24,528 |
| fused_embeddings | **10,613** (1:1 with motions, 0 duplicates) |
| similarity_cache | **212,206** (top_k=20, all annual windows) |
| mp_votes | 199,967 |
| mp_metadata | 798 |

## Annual Window Coverage

| Year | Motions | Fused | Similarity |
|---|---|---|---|
| 2016 | 132 | 132 | 2,640 |
| 2017 | 30 | 30 | 600 |
| 2018 | 100 | 100 | 2,000 |
| 2019 | 3 | 3 | 6 |
| 2020 | 0 | 0 | 0 (no data) |
| 2021 | 0 | 0 | 0 (no data) |
| 2022 | 4,116 | 4,116 | 82,320 |
| 2023 | 621 | 621 | 12,420 |
| 2024 | 948 | 948 | 18,960 |
| 2025 | 3,715 | 3,715 | 74,300 |
| 2026 | 948 | 948 | 18,960 |

## Completed This Session

- [x] Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
- [x] Re-ran `extract_mp_votes` on all motions → 111,978 new rows (party-level votes backfilled)
- [x] SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
- [x] Fixed `store_fused_embedding` double-counting bug: added DELETE before INSERT
- [x] Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
- [x] Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
- [x] Test suite: **34 passed, 2 skipped** ✅

## Key Decisions

- `store_fused_embedding` (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
- Annual windows chosen for historical political compass (2016–2026).
- top_k=20 for similarity cache.
- Party-level votes (e.g. `{"PVV": "voor"}`) handled in `extract_mp_votes`: actor without comma → `party=actor_name`.

## Open Items (not blocking, data coverage gaps)

1. **2020–2021 data gap**: No motions in DB at all. Need to run downloader with `--start-date 2019-01-01 --end-date 2021-12-31` if data exists in API.
2. **2024 gap ~3,020 motions**: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear; needs investigation of URL-based dedup in `insert_motion`.
3. **"Verworpen." dedup**: Short-text motions (title="Verworpen.") get spurious similarity=1.0. UI/query layer should filter `score < 0.999 OR title != 'Verworpen.'`.
4. **svd_vectors has duplicates**: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.

## Key File Paths

- DB: `data/motions.db`
- Venv: `.venv/bin/python`
- Pipeline entry: `pipeline/run_pipeline.py`
- Fusion: `pipeline/fusion.py`
- SVD: `pipeline/svd_pipeline.py`
- Text embeddings: `pipeline/text_pipeline.py`
- MP votes extraction: `pipeline/extract_mp_votes.py`
- Database layer: `database.py`
- Similarity compute: `similarity/compute.py`
- Similarity lookup: `similarity/lookup.py`
- Tests: `tests/` (pytest, offline)

## Branch

`main`
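
## Sketch: coverage table consistency check

The per-window similarity counts in the Annual Window Coverage table appear to follow `motions × min(top_k, motions − 1)`: each motion gets up to 20 neighbours, and the 3-motion 2019 window can only supply 2 each. That formula is an observation from the table, not something read out of `similarity/compute.py`. A minimal sketch that re-checks the numbers (the `coverage` dict and `expected_rows` helper are illustrative names only):

```python
TOP_K = 20

# (motions, similarity rows) per annual window, copied from the coverage table.
coverage = {
    2016: (132, 2_640),
    2017: (30, 600),
    2018: (100, 2_000),
    2019: (3, 6),
    2020: (0, 0),
    2021: (0, 0),
    2022: (4_116, 82_320),
    2023: (621, 12_420),
    2024: (948, 18_960),
    2025: (3_715, 74_300),
    2026: (948, 18_960),
}


def expected_rows(motions: int, top_k: int = TOP_K) -> int:
    """Assumed rule: every motion stores up to top_k neighbours, never itself."""
    return motions * min(top_k, max(motions - 1, 0))


# Windows whose stored count differs from the assumed rule.
mismatches = {
    year: (motions, rows)
    for year, (motions, rows) in coverage.items()
    if expected_rows(motions) != rows
}
total_motions = sum(motions for motions, _ in coverage.values())
total_rows = sum(rows for _, rows in coverage.values())
```

With the table as recorded, no window deviates from the rule, and the totals line up with the `motions` and `similarity_cache` row counts under Current DB State.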
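
## Sketch: party-level vote rule

The comma rule from Key Decisions (actor without comma → `party=actor_name`) can be illustrated roughly as below. This is not the code in `pipeline/extract_mp_votes.py`: the `VoteRow` shape and `classify_actor` helper are hypothetical, the "Surname, Initials" form for individual MPs is an assumption inferred from the comma rule, and `"Jansen, A."` is a made-up name.

```python
from typing import NamedTuple, Optional


class VoteRow(NamedTuple):
    mp_name: Optional[str]  # None for a party-level vote (assumed convention)
    party: Optional[str]    # None when the party cannot be read from the name
    vote: str


def classify_actor(actor_name: str, vote: str) -> VoteRow:
    """Apply the comma rule: a comma means an individual MP, otherwise a party."""
    name = actor_name.strip()
    if "," in name:
        # Assumed "Surname, Initials" form: individual MP, party resolved elsewhere.
        return VoteRow(mp_name=name, party=None, vote=vote)
    # No comma: the actor name itself is the party.
    return VoteRow(mp_name=None, party=name, vote=vote)


votes = {"PVV": "voor", "Jansen, A.": "tegen"}
rows = [classify_actor(actor, vote) for actor, vote in votes.items()]
```

Here `{"PVV": "voor"}` becomes a row with `party="PVV"` and no MP name, while the comma-bearing actor is kept as an individual.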
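
## Sketch: "Verworpen." result filter

Open item 3 proposes keeping only rows where `score < 0.999 OR title != 'Verworpen.'`, which drops exactly the near-perfect matches on the placeholder title. A minimal sketch of that predicate applied in Python, assuming each result is a dict with `title` and `score` keys (the row shape, the `keep_result` name, and the sample titles are assumptions, not the actual `similarity/lookup.py` interface):

```python
from typing import Any, Dict, List

SCORE_CUTOFF = 0.999
PLACEHOLDER_TITLE = "Verworpen."


def keep_result(row: Dict[str, Any]) -> bool:
    """Keep a row unless it is a near-perfect match on the placeholder title."""
    return row["score"] < SCORE_CUTOFF or row["title"] != PLACEHOLDER_TITLE


# Made-up sample rows for illustration.
results: List[Dict[str, Any]] = [
    {"title": "Verworpen.", "score": 1.0},              # spurious match: dropped
    {"title": "Verworpen.", "score": 0.42},             # low score: kept
    {"title": "Motie over woningbouw", "score": 1.0},   # real title: kept
]
filtered = [row for row in results if keep_result(row)]
```

Only the first sample row is removed; a genuine motion with a perfect score survives because its title differs from the placeholder.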