motief/thoughts/ledgers/CONTINUITY_stemwijzer.md

Session: stemwijzer — Parliamentary Embedding Pipeline

Updated: 2026-03-22T16:00:00Z

Goal

Build a 2D political compass and a motion similarity search from parliamentary votes and motion text. Targets: full historical coverage 2016–2026, a precomputed similarity cache, and fused (SVD + text) embeddings.

Constraints

  • DuckDB only (data/motions.db); each method opens its own duckdb.connect(self.db_path) connection and closes it before returning
  • Vectors stored as JSON text (no external vector DB)
  • Logging via logging.getLogger(__name__); no print() in library modules
  • Tests run offline (network monkeypatched) — use .venv/bin/python -m pytest -q
  • Do NOT modify app.py or scheduler.py
  • Use .venv/bin/python (Arch Linux system Python is externally managed)
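The connection-per-method and JSON-text constraints can be sketched as follows. This is a minimal illustration, not the project's database.py: the class name `VectorStore`, the `embeddings` schema, and the method names are hypothetical, and the standard-library `sqlite3` module stands in for DuckDB so the sketch runs without extra dependencies. With DuckDB, the `sqlite3.connect` calls would become `duckdb.connect(self.db_path)`.

```python
import json
import logging
import sqlite3  # stand-in for duckdb (assumption, keeps the sketch stdlib-only)
import tempfile
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)  # library code logs, never prints


class VectorStore:
    """Hypothetical store: one connection per method, vectors kept as JSON text."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path

    def init_schema(self) -> None:
        con = sqlite3.connect(self.db_path)
        try:
            con.execute(
                "CREATE TABLE IF NOT EXISTS embeddings (motion_id INTEGER, vector TEXT)"
            )
            con.commit()
        finally:
            con.close()  # always release the connection before returning

    def store_vector(self, motion_id: int, vector: list) -> None:
        con = sqlite3.connect(self.db_path)
        try:
            con.execute(
                "INSERT INTO embeddings (motion_id, vector) VALUES (?, ?)",
                (motion_id, json.dumps(vector)),  # serialise the vector to JSON text
            )
            con.commit()
            logger.info("stored vector for motion %s (dim=%d)", motion_id, len(vector))
        finally:
            con.close()

    def load_vector(self, motion_id: int) -> Optional[list]:
        con = sqlite3.connect(self.db_path)
        try:
            row = con.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ?", (motion_id,)
            ).fetchone()
        finally:
            con.close()
        return json.loads(row[0]) if row else None


# Round trip against a throwaway database file.
with tempfile.TemporaryDirectory() as tmp:
    store = VectorStore(str(Path(tmp) / "motions.db"))
    store.init_schema()
    store.store_vector(1, [0.25, -0.5, 1.0])
    loaded = store.load_vector(1)
    missing = store.load_vector(2)
```

Opening a short-lived connection in each method avoids holding a file lock across calls, at the cost of some connection overhead per call.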

Current DB State (verified 2026-03-22 ~16:00)

| Table | Rows | Notes |
|---|---|---|
| motions | 10,613 | |
| embeddings | 10,753 | |
| svd_vectors | 24,528 | |
| fused_embeddings | 10,613 | 1:1 with motions, 0 duplicates |
| similarity_cache | 212,206 | top_k=20, all annual windows |
| mp_votes | 199,967 | |
| mp_metadata | 798 | |

Annual Window Coverage

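| Year | Motions | Fused | Similarity | Notes |
|---|---|---|---|---|
| 2016 | 132 | 132 | 2,640 | |
| 2017 | 30 | 30 | 600 | |
| 2018 | 100 | 100 | 2,000 | |
| 2019 | 3 | 3 | 6 | |
| 2020 | 0 | 0 | 0 | no data |
| 2021 | 0 | 0 | 0 | no data |
| 2022 | 4,116 | 4,116 | 82,320 | |
| 2023 | 621 | 621 | 12,420 | |
| 2024 | 948 | 948 | 18,960 | |
| 2025 | 3,715 | 3,715 | 74,300 | |
| 2026 | 948 | 948 | 18,960 | |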
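The Similarity column can be cross-checked with a little arithmetic. The counts are consistent with each motion keeping min(top_k, n − 1) neighbours within its window, where n is the number of motions in that window; this formula is an inference from the numbers, not something confirmed against similarity/compute.py. The function name below is hypothetical.

```python
TOP_K = 20

# Motions per annual window, copied from the table above.
motions_per_year = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions: int, top_k: int = TOP_K) -> int:
    """Rows if every motion keeps its top_k nearest neighbours, excluding itself."""
    neighbours = min(top_k, max(n_motions - 1, 0))
    return n_motions * neighbours


expected = {year: expected_similarity_rows(n) for year, n in motions_per_year.items()}
total_motions = sum(motions_per_year.values())
total_rows = sum(expected.values())
active_windows = sum(1 for n in motions_per_year.values() if n > 0)
```

Under this assumption 2019 yields 3 × 2 = 6 rows rather than 3 × 20, and the per-year values sum to the 212,206 rows reported for similarity_cache across 9 active windows.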

Completed This Session

  • Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
  • Re-ran extract_mp_votes on all motions → 111,978 new rows (party-level votes backfilled)
  • SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
  • Fixed store_fused_embedding double-counting bug: added DELETE before INSERT
  • Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
  • Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
  • Test suite: 34 passed, 2 skipped
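The DELETE-before-INSERT fix listed above makes the write idempotent: running fusion again replaces a motion's row instead of adding a second one. A minimal sketch of the pattern follows, assuming a hypothetical `fused_embeddings(motion_id, window_id, vector)` schema and a hypothetical function name. The real store_fused_embedding in database.py may key on different columns, and `sqlite3` stands in for DuckDB here.

```python
import json
import sqlite3  # stand-in for DuckDB (assumption); the statements are plain SQL


def store_fused_sketch(con: sqlite3.Connection, motion_id: int, window_id: str,
                       vector: list) -> None:
    """Replace any existing row for (motion_id, window_id), then insert the new one."""
    with con:  # commits both statements together, rolls back if either fails
        con.execute(
            "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
            (motion_id, window_id),
        )
        con.execute(
            "INSERT INTO fused_embeddings (motion_id, window_id, vector) VALUES (?, ?, ?)",
            (motion_id, window_id, json.dumps(vector)),
        )


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fused_embeddings (motion_id INTEGER, window_id TEXT, vector TEXT)"
)

# Simulate running the fusion step twice for the same motion and window.
store_fused_sketch(con, 42, "2025", [0.1, 0.2])
store_fused_sketch(con, 42, "2025", [0.3, 0.4])

row_count = con.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
latest = json.loads(con.execute("SELECT vector FROM fused_embeddings").fetchone()[0])
con.close()
```

After the second call the table still holds one row, carrying the most recent vector. Wrapping both statements in one transaction keeps a failed INSERT from leaving the motion with no row at all.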

Key Decisions

  • store_fused_embedding (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
  • Annual windows chosen for historical political compass (2016–2026).
  • top_k=20 for similarity cache.
  • Party-level votes (e.g. {"PVV": "voor"}) are handled in extract_mp_votes: an actor name without a comma is treated as a party, so party=actor_name.
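The comma rule for party-level votes can be sketched like this. The names `VoteRecord` and `classify_vote` are hypothetical, the "Surname, Initials" shape for individual MPs is an assumption, and "Jansen, A." is an invented example; the actual logic lives in pipeline/extract_mp_votes.py and may differ.

```python
from typing import NamedTuple, Optional


class VoteRecord(NamedTuple):
    actor_name: str
    party: Optional[str]   # None when the party must be resolved elsewhere
    vote: str
    is_party_level: bool


def classify_vote(actor_name: str, vote: str) -> VoteRecord:
    """Apply the comma rule: no comma means the actor is itself a party."""
    name = actor_name.strip()
    if "," in name:
        # Assumed to be an individual MP; the party is not derivable from the name.
        return VoteRecord(name, None, vote, False)
    # No comma: treat the actor as a party, so party = actor_name.
    return VoteRecord(name, name, vote, True)


party_vote = classify_vote("PVV", "voor")
mp_vote = classify_vote("Jansen, A.", "tegen")
```

In this sketch an individual MP's party is left as `None`; a real pipeline would presumably fill it from a lookup such as mp_metadata.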

Open Items (not blocking, data coverage gaps)

  1. 2020–2021 data gap: No motions in DB at all. Need to run downloader with --start-date 2019-01-01 --end-date 2021-12-31 if data exists in API.
  2. 2024 gap ~3,020 motions: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear — needs investigation of URL-based dedup in insert_motion.
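  3. "Verworpen." dedup: Short-text motions (title="Verworpen.") get a spurious similarity=1.0. The UI/query layer should keep only results where score < 0.999 OR title != 'Verworpen.', i.e. drop a result only when it has both the placeholder title and a near-perfect score.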
  4. svd_vectors has duplicates: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
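The "Verworpen." filter from item 3 could look like the following at the query layer. This is a sketch only: the `Hit` type, the function names, and the sample titles are invented for illustration, and the 0.999 cutoff is the one proposed in the item.

```python
from dataclasses import dataclass
from typing import Iterable, List

SCORE_CUTOFF = 0.999
PLACEHOLDER_TITLE = "Verworpen."


@dataclass(frozen=True)
class Hit:
    motion_id: int
    title: str
    score: float


def keep_hit(hit: Hit) -> bool:
    """Keep unless the hit has both the placeholder title and a near-perfect score."""
    return hit.score < SCORE_CUTOFF or hit.title != PLACEHOLDER_TITLE


def filter_hits(hits: Iterable[Hit]) -> List[Hit]:
    return [h for h in hits if keep_hit(h)]


hits = [
    Hit(1, "Verworpen.", 1.0),             # dropped: placeholder title, score 1.0
    Hit(2, "Verworpen.", 0.62),            # kept: score below the cutoff
    Hit(3, "Motie over woningbouw", 1.0),  # kept: real title, even at score 1.0
    Hit(4, "Motie over stikstof", 0.87),   # kept
]
kept_ids = [h.motion_id for h in filter_hits(hits)]
```

Because the condition is an OR, a genuine exact match between two motions with real titles still passes the filter; only placeholder-title hits at the top of the score range are removed.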

Key File Paths

  • DB: data/motions.db
  • Venv: .venv/bin/python
  • Pipeline entry: pipeline/run_pipeline.py
  • Fusion: pipeline/fusion.py
  • SVD: pipeline/svd_pipeline.py
  • Text embeddings: pipeline/text_pipeline.py
  • MP votes extraction: pipeline/extract_mp_votes.py
  • Database layer: database.py
  • Similarity compute: similarity/compute.py
  • Similarity lookup: similarity/lookup.py
  • Tests: tests/ (pytest, offline)

Branch

main