motief/thoughts/ledgers/CONTINUITY_stemwijzer.md

Session: stemwijzer

Updated: 2026-03-31T12:40:00Z

Goal

2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.

Constraints

  • DuckDB only (data/motions.db); open/close duckdb.connect(self.db_path) per method
  • Vectors stored as JSON text (no external vector DB)
  • Logging via logging.getLogger(__name__); no print() in library modules
  • Tests run offline (network monkeypatched) — use .venv/bin/python -m pytest -q
  • Do NOT modify app.py or scheduler.py
  • Use .venv/bin/python (Arch Linux system Python is externally managed)
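
The first three constraints can be illustrated together with a minimal sketch. It is not the code in database.py: the class name `VectorStore`, the table layout, and the method names are hypothetical, and `sqlite3` stands in for `duckdb` so the example runs on the standard library alone. In the project the connection would come from `duckdb.connect(self.db_path)`.

```python
import json
import logging
import sqlite3  # stand-in for duckdb so the sketch needs no third-party package
from contextlib import closing
from typing import List, Optional

logger = logging.getLogger(__name__)  # library modules log, they do not print()


class VectorStore:
    """Hypothetical helper; names and schema are assumptions for illustration."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path

    def init_schema(self) -> None:
        # One connection per method call: open, use, close.
        with closing(sqlite3.connect(self.db_path)) as con:
            con.execute(
                "CREATE TABLE IF NOT EXISTS embeddings "
                "(motion_id INTEGER, vector TEXT)"
            )
            con.commit()

    def store_vector(self, motion_id: int, vector: List[float]) -> None:
        with closing(sqlite3.connect(self.db_path)) as con:
            # The vector is serialised to JSON text, not a native array type.
            con.execute(
                "INSERT INTO embeddings (motion_id, vector) VALUES (?, ?)",
                (motion_id, json.dumps(vector)),
            )
            con.commit()
        logger.info("stored vector for motion %s (dim=%d)", motion_id, len(vector))

    def load_vector(self, motion_id: int) -> Optional[List[float]]:
        with closing(sqlite3.connect(self.db_path)) as con:
            row = con.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ?",
                (motion_id,),
            ).fetchone()
        return json.loads(row[0]) if row else None
```

Because no connection outlives a method call, the store needs a file path rather than an in-memory database; each call would otherwise see an empty database.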

Current DB State (verified 2026-03-22 ~16:00; additional run summary 2026-03-23)

| Table | Rows | Notes |
|---|---|---|
| motions | 10,613 | |
| embeddings | 10,753 | |
| svd_vectors | 24,528 | |
| fused_embeddings | 10,613 | 1:1 with motions, 0 duplicates. The per-run fusion summary reported larger aggregate inserts (see Critical Context); mapping UNCONFIRMED |
| similarity_cache | 212,206 | top_k=20, all annual windows. The fusion+similarity run produced a larger set of inserted rows (see Critical Context); mapping UNCONFIRMED |
| mp_votes | 199,967 | |
| mp_metadata | 798 | |

Annual Window Coverage

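| Year | Motions | Fused | Similarity |
|---|---|---|---|
| 2016 | 132 | 132 | 2,640 |
| 2017 | 30 | 30 | 600 |
| 2018 | 100 | 100 | 2,000 |
| 2019 | 3 | 3 | 6 |
| 2020 | 0 | 0 | 0 (no data) |
| 2021 | 0 | 0 | 0 (no data) |
| 2022 | 4,116 | 4,116 | 82,320 |
| 2023 | 621 | 621 | 12,420 |
| 2024 | 948 | 948 | 18,960 |
| 2025 | 3,715 | 3,715 | 74,300 |
| 2026 | 948 | 948 | 18,960 |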
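
The per-year figures can be cross-checked against the totals in Current DB State. The sketch below assumes each motion contributes min(top_k, n - 1) similarity rows within its window. That rule is inferred from the numbers (for example 2019: 3 motions, 6 rows) and is not a confirmed description of similarity/compute.py.

```python
TOP_K = 20

# Motions per annual window, copied from the coverage table.
motions_per_year = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions: int, top_k: int = TOP_K) -> int:
    # Assumption: every motion gets up to top_k neighbours, never itself.
    return n_motions * min(top_k, max(n_motions - 1, 0))


rows_per_year = {
    year: expected_similarity_rows(n) for year, n in motions_per_year.items()
}
total_motions = sum(motions_per_year.values())    # 10,613 (motions table)
total_rows = sum(rows_per_year.values())          # 212,206 (similarity_cache)
active_windows = sum(1 for n in motions_per_year.values() if n > 0)  # 9
```

Under that assumption the table is internally consistent: the yearly motion counts sum to the motions row count, and the yearly similarity counts sum to the similarity_cache row count.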

Completed This Session

  • Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
  • Re-ran extract_mp_votes on all motions → 111,978 new rows (party-level votes backfilled)
  • SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
  • Fixed store_fused_embedding double-counting bug: added DELETE before INSERT
  • Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
  • Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
  • Test suite: 34 passed, 2 skipped
  • Embedding rerun (scripts/rerun_embeddings.py) completed: 28,172 embeddings stored (final), as recorded in the fusion+similarity run summary. How this maps to the embeddings table (10,753 rows when verified on 2026-03-22) is UNCONFIRMED.
  • Fusion + similarity run completed (per-window processing) — aggregate inserts recorded in thoughts/ledgers/fusion_similarity_summary.json
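
The DELETE-before-INSERT fix for store_fused_embedding can be sketched as follows. This is an illustration, not the code at database.py line 686: the column names, the (motion_id, window_id) key, and the function names are assumptions, and `sqlite3` stands in for `duckdb` so the example runs without extra packages.

```python
import json
import sqlite3  # stand-in for duckdb; the project uses duckdb.connect(...)
from contextlib import closing

# Hypothetical schema: the real column names are not shown in this ledger.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS fused_embeddings "
    "(motion_id INTEGER, window_id TEXT, vector TEXT)"
)


def store_fused_embedding_sketch(db_path, motion_id, window_id, vector):
    """Write one fused vector so a re-run replaces the row instead of adding one."""
    with closing(sqlite3.connect(db_path)) as con:
        con.execute(SCHEMA)
        # Remove any row left by an earlier run for the same key ...
        con.execute(
            "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
            (motion_id, window_id),
        )
        # ... then insert the fresh vector, so the write is idempotent.
        con.execute(
            "INSERT INTO fused_embeddings (motion_id, window_id, vector) "
            "VALUES (?, ?, ?)",
            (motion_id, window_id, json.dumps(vector)),
        )
        con.commit()


def count_fused_rows(db_path):
    with closing(sqlite3.connect(db_path)) as con:
        return con.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
```

Calling the function twice with the same key leaves one row. With a plain INSERT the second call would add a duplicate, which is the double-counting described above.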

Key Decisions

  • store_fused_embedding (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
  • Annual windows chosen for historical political compass (2016–2026).
  • top_k=20 for similarity cache.
  • Party-level votes (e.g. {"PVV": "voor"}) handled in extract_mp_votes: an actor name without a comma is treated as a party (party=actor_name).
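
The comma rule for party-level votes can be shown in a few lines. The function name and the returned dictionary shape are hypothetical, as is the example MP name in the usage note. The ledger states only the party branch, so treating a name with a comma as an individual MP is an assumption.

```python
def classify_actor(actor_name: str) -> dict:
    """Classify a vote actor by the comma rule described in the ledger."""
    name = actor_name.strip()
    if "," in name:
        # Assumption: names with a comma ("Surname, Initials") are individual MPs.
        return {"kind": "mp", "mp_name": name, "party": None}
    # No comma: the actor is a party, so party = actor_name.
    return {"kind": "party", "mp_name": None, "party": name}
```

For example, `classify_actor("PVV")` yields a party record, while a made-up name such as `"Jansen, A."` yields an MP record.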

Open Items (not blocking, data coverage gaps)

  1. 2020–2021 data gap: No motions in DB at all. Need to run downloader with --start-date 2019-01-01 --end-date 2021-12-31 if data exists in API.
  2. 2024 gap ~3,020 motions: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear — needs investigation of URL-based dedup in insert_motion.
  3. "Verworpen." dedup: Short-text motions (title="Verworpen.") get spurious similarity=1.0. The UI/query layer should keep only rows where score < 0.999 OR title != 'Verworpen.', which drops the near-1.0 matches on "Verworpen." titles.
  4. svd_vectors has duplicates: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
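
The filter from open item 3 could look like this in the query layer. It is a sketch only: rows are assumed to be dictionaries with hypothetical `score` and `title` keys, which may not match what similarity/lookup.py returns.

```python
PLACEHOLDER_TITLE = "Verworpen."
SCORE_CEILING = 0.999


def keep_row(row: dict) -> bool:
    # Keep a row unless it is both a near-1.0 score and a "Verworpen." title.
    return row["score"] < SCORE_CEILING or row["title"] != PLACEHOLDER_TITLE


def filter_similar(rows: list) -> list:
    """Drop spurious matches on placeholder titles; leave everything else."""
    return [row for row in rows if keep_row(row)]
```

A row with the placeholder title and a lower score is kept, since the condition only targets the spurious exact matches.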

Key File Paths

  • DB: data/motions.db
  • Venv: .venv/bin/python
  • Pipeline entry: pipeline/run_pipeline.py
  • Fusion: pipeline/fusion.py
  • SVD: pipeline/svd_pipeline.py
  • Text embeddings: pipeline/text_pipeline.py
  • MP votes extraction: pipeline/extract_mp_votes.py
  • Database layer: database.py
  • Similarity compute: similarity/compute.py
  • Similarity lookup: similarity/lookup.py
  • Tests: tests/ (pytest, offline)

Branch

main

Progress

Done

  • All items listed under "Completed This Session" above

In Progress

  • Short QA: sample similarity lookups and sanity checks (N=20-50) against fused_embeddings/similarity results
    • Purpose: validate fused vectors, detect padding/anomalies, and confirm similarity rows are sensible
    • Estimated effort: 30–60 minutes
  • Trajectories tab: chart not rendering — root cause found (silent exception in st.plotly_chart)
    • Fix applied: commit 72d1c20 — shows st.error + diagnostics when rendering fails
    • Pending: user to verify fix by running Explorer with EXPLORER_DEBUG_TRAJECTORIES=1

Blocked

  • Nothing blocks QA. Earlier provider failures affected the embedding rerun, but the rerun completed according to the fusion run summary (UNCONFIRMED).

Key Decisions

  • Retry strategy on provider failure: On repeated provider failures, retry embedding batches with a smaller batch_size (e.g. 50 → 20) or switch provider. Rationale: smaller batches reduce per-request risk and increase the chance of partial success; switch provider if failures persist. (UNCONFIRMED)
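
A minimal sketch of the shrinking-batch retry, under stated assumptions: `embed_batch` is a hypothetical callable that takes a list of texts and returns one vector per text, raising on provider failure. Switching provider is not shown, and real code would catch the provider's specific error type rather than `Exception`.

```python
import logging
from typing import Callable, List, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)

EmbedFn = Callable[[Sequence[str]], List[List[float]]]


def embed_with_shrinking_batches(
    texts: Sequence[str],
    embed_batch: EmbedFn,
    batch_sizes: Tuple[int, ...] = (50, 20),
) -> Tuple[List[Optional[List[float]]], List[int]]:
    """Embed texts; a failed batch is retried at the next smaller batch size.

    Returns the vectors (None where embedding failed) and the failed indices.
    """
    vectors: List[Optional[List[float]]] = [None] * len(texts)
    failed: List[int] = []

    def run(start: int, end: int, level: int) -> None:
        size = batch_sizes[level]
        for lo in range(start, end, size):
            hi = min(lo + size, end)
            try:
                result = embed_batch(texts[lo:hi])
            except Exception:  # broad on purpose in this sketch
                if level + 1 < len(batch_sizes):
                    logger.warning("batch %d..%d failed, retrying smaller", lo, hi)
                    run(lo, hi, level + 1)
                else:
                    logger.error("batch %d..%d failed at smallest size", lo, hi)
                    failed.extend(range(lo, hi))
                continue
            for index, vector in zip(range(lo, hi), result):
                vectors[index] = vector

    run(0, len(texts), 0)
    return vectors, failed
```

The returned list of failed indices is what Next Steps item 5 would archive for a later retry.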

Next Steps

  1. Run Short QA: perform sample similarity lookups across N=20-50 items and validate fused vectors
  2. Inspect thoughts/ledgers/fusion_similarity_summary.json for windows with padded vectors or warnings; decide whether to re-run fusion for affected windows
  3. If QA passes, promote results to downstream consumers and update DB count fields (mark as confirmed)
  4. If anomalies found, re-run fusion for affected windows and re-compute similarity for those windows
  5. Archive list of any failed motion IDs from embedding run and consider retry with smaller batch_size or alternate provider (if any failures remain) (UNCONFIRMED)
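
The Short QA in step 1 could start from structural checks like the ones below. The sketch assumes similarity results are already loaded into memory as a mapping of motion_id to (neighbour_id, score) pairs. That shape, the function name, and the choice of checks are hypothetical; the 0.999 threshold is borrowed from the "Verworpen." open item.

```python
import random

NEAR_DUPLICATE = 0.999  # threshold borrowed from the "Verworpen." open item


def qa_sample(neighbours, sample_size=20, top_k=20, seed=0):
    """Return (motion_id, problem) pairs for a random sample of motions.

    `neighbours` maps motion_id -> list of (neighbour_id, score) pairs.
    """
    rng = random.Random(seed)  # fixed seed so a QA run is repeatable
    motion_ids = sorted(neighbours)
    sample = rng.sample(motion_ids, min(sample_size, len(motion_ids)))
    problems = []
    for motion_id in sample:
        rows = neighbours[motion_id]
        if len(rows) > top_k:
            problems.append((motion_id, "more rows than top_k"))
        if any(other == motion_id for other, _ in rows):
            problems.append((motion_id, "motion listed as its own neighbour"))
        if any(score >= NEAR_DUPLICATE for _, score in rows):
            problems.append((motion_id, "score at or above 0.999"))
    return problems
```

An empty result means the sample passed these checks only; it says nothing about whether the neighbours are semantically sensible, which still needs a manual read.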

File Operations

Read

  • data/motions.db
  • scripts/rerun_embeddings.py (invoked)
  • thoughts/ledgers/fusion_similarity_summary.json (run summary)

Modified

  • thoughts/ledgers/CONTINUITY_stemwijzer.md (this file)
  • thoughts/ledgers/fusion_similarity_summary.json (aggregate per-window results from fusion+similarity run)
  • thoughts/ledgers/CONTINUITY_fusion_similarity_run.md

Critical Context

  • Rerun embeddings started 2026-03-23T01:42Z; final embedding count recorded by fusion run = 28,172 (see thoughts/ledgers/fusion_similarity_summary.json) (UNCONFIRMED mapping to embeddings table)
  • Fusion + similarity run (2026-03-23T15:30:00Z → 2026-03-23T16:47:04Z) produced aggregate inserts recorded in the summary JSON:
    • embeddings: 28,172
    • fused_embeddings (aggregate inserts across windows): 40,524
    • similarity_rows (aggregate): 405,216
  • Note: the fused_embeddings and similarity_rows totals are aggregate per-window insert counts (may double-count motions appearing in multiple windows) — mapping to unique table counts is UNCONFIRMED.
  • Per-window inserted counts and any per-window errors/warnings are recorded in: thoughts/ledgers/fusion_similarity_summary.json.
  • Padding was applied in windows with inconsistent vector dimensions; warnings were logged per window (see summary JSON). Padding kept the pipeline moving, but the decision should be reviewed (see Next Steps, item 2).
  • Earlier provider error: batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}}. The failed batch was retried or covered in the rerun captured by the fusion run (UNCONFIRMED; check failed IDs in summary JSON).
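
To make the padding note concrete, here is one way such padding might work. The ledger does not say how pipeline/fusion.py pads, so right-padding with zeros to the longest vector in the window is an assumption, as is the function name.

```python
from typing import Dict, List, Tuple


def pad_vectors(
    vectors: Dict[int, List[float]],
) -> Tuple[Dict[int, List[float]], List[int]]:
    """Zero-pad every vector to the longest length found in the window.

    Returns the padded vectors and the ids that needed padding, so the
    affected motions can be reviewed afterwards.
    """
    target = max((len(v) for v in vectors.values()), default=0)
    padded: Dict[int, List[float]] = {}
    padded_ids: List[int] = []
    for motion_id, vector in vectors.items():
        missing = target - len(vector)
        if missing > 0:
            padded_ids.append(motion_id)
            vector = vector + [0.0] * missing  # new list; the input is not mutated
        padded[motion_id] = vector
    return padded, padded_ids
```

Zero-padding makes vectors comparable in length but does not add information, so a window with many padded ids is a candidate for re-running fusion.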

Working Set

  • Branch: main
  • Key files: data/motions.db, scripts/rerun_embeddings.py, thoughts/ledgers/CONTINUITY_stemwijzer.md, thoughts/ledgers/fusion_similarity_summary.json, thoughts/ledgers/CONTINUITY_fusion_similarity_run.md