motief/thoughts/ledgers/CONTINUITY_stemwijzer.md

Session: stemwijzer

Updated: 2026-03-23T09:00:00Z

Goal

2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.

Constraints

  • DuckDB only (data/motions.db); open/close duckdb.connect(self.db_path) per method
  • Vectors stored as JSON text (no external vector DB)
  • Logging via logging.getLogger(__name__); no print() in library modules
  • Tests run offline (network monkeypatched) — use .venv/bin/python -m pytest -q
  • Do NOT modify app.py or scheduler.py
  • Use .venv/bin/python (Arch Linux system Python is externally managed)
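
The "vectors stored as JSON text" and "no print() in library modules" constraints can be illustrated with a small pair of helpers. This is a minimal sketch, not the code in database.py: the names encode_vector and decode_vector are hypothetical, and rejecting non-finite values is an assumption about what the pipeline would want.

```python
import json
import logging
import math

logger = logging.getLogger(__name__)  # library modules log; they never print()


def encode_vector(vector):
    """Serialize a sequence of numbers to JSON text for a TEXT column."""
    values = [float(v) for v in vector]
    if not all(math.isfinite(v) for v in values):
        # Strict JSON has no NaN/Infinity, so refuse to store them.
        raise ValueError("vector contains NaN or infinite values")
    return json.dumps(values)


def decode_vector(text):
    """Parse JSON text back into a list of floats; return None if malformed."""
    try:
        values = json.loads(text)
        if not isinstance(values, list):
            raise ValueError("stored value is not a JSON array")
        return [float(v) for v in values]
    except (TypeError, ValueError) as exc:
        logger.warning("could not decode stored vector: %s", exc)
        return None
```

Returning None on a bad row, rather than raising, lets a batch job log the problem and continue; a stricter caller could raise instead.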

Current DB State (verified 2026-03-22 ~16:00)

Table             Rows     Notes
motions           10,613
embeddings        10,753
svd_vectors       24,528
fused_embeddings  10,613   1:1 with motions, 0 duplicates
similarity_cache  212,206  top_k=20, all annual windows
mp_votes          199,967
mp_metadata       798

Annual Window Coverage

Year  Motions  Fused  Similarity rows
2016  132      132    2,640
2017  30       30     600
2018  100      100    2,000
2019  3        3      6
2020  0        0      0 (no data)
2021  0        0      0 (no data)
2022  4,116    4,116  82,320
2023  621      621    12,420
2024  948      948    18,960
2025  3,715    3,715  74,300
2026  948      948    18,960
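
The similarity counts in the table above are consistent with a simple rule: each motion keeps its top_k nearest neighbours within its own annual window, capped at the number of other motions in that window. The sketch below checks that arithmetic. The rule itself is an inference from the numbers, not something read from similarity/compute.py, and expected_similarity_rows is a hypothetical helper.

```python
TOP_K = 20

# Motions per annual window, copied from the coverage table.
MOTIONS_PER_YEAR = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions, top_k=TOP_K):
    """Rows if every motion stores min(top_k, n_motions - 1) neighbours."""
    if n_motions < 2:
        return 0  # a window with 0 or 1 motions has no pairs to store
    return n_motions * min(top_k, n_motions - 1)


expected = {year: expected_similarity_rows(n) for year, n in MOTIONS_PER_YEAR.items()}
total = sum(expected.values())
```

Under this assumption the 2019 window (3 motions) yields 3 × 2 = 6 rows rather than 60, and the per-year values sum to the 212,206 rows reported for similarity_cache.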

Completed This Session

  • Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
  • Re-ran extract_mp_votes on all motions → 111,978 new rows (party-level votes backfilled)
  • SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
  • Fixed store_fused_embedding double-counting bug: added DELETE before INSERT
  • Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
  • Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
  • Test suite: 34 passed, 2 skipped
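
The DELETE-before-INSERT fix can be sketched as follows. This is an illustration of the pattern only: it uses the standard-library sqlite3 module as a stand-in so the example runs without extra dependencies (the project itself uses DuckDB), and the function name, signature, and column names (motion_id, window_id, vector) are assumptions rather than the actual schema in database.py.

```python
import json
import sqlite3


def store_fused_embedding_sketch(conn, motion_id, window_id, vector):
    """Replace any existing row for (motion_id, window_id), then insert."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
            (motion_id, window_id),
        )
        conn.execute(
            "INSERT INTO fused_embeddings (motion_id, window_id, vector) "
            "VALUES (?, ?, ?)",
            (motion_id, window_id, json.dumps(vector)),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fused_embeddings (motion_id INTEGER, window_id TEXT, vector TEXT)"
)
for _ in range(2):  # simulate running the fusion step twice
    store_fused_embedding_sketch(conn, 1, "2025", [0.1, 0.2])
row_count = conn.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
```

Because the delete and insert share one transaction, a re-run leaves exactly one row per key; with a plain INSERT the second run would have doubled the count.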

Key Decisions

  • store_fused_embedding (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
  • Annual windows chosen for historical political compass (2016–2026).
  • top_k=20 for similarity cache.
  • Party-level votes (e.g. {"PVV": "voor"}) handled in extract_mp_votes — actor without comma → party=actor_name.
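
The party-level rule in the last bullet can be sketched as below. This is a hedged illustration, not the code in pipeline/extract_mp_votes.py: classify_actor and extract_votes are hypothetical names, the "Lastname, Initials" format for individual MPs is an assumption, and so is leaving party as None for individuals (to be resolved elsewhere, e.g. from mp_metadata).

```python
def classify_actor(actor_name):
    """Return (name, party) for a vote actor.

    Assumption: individual MPs contain a comma ("Lastname, Initials");
    a bare name such as "PVV" is a party voting as a block.
    """
    name = actor_name.strip()
    if "," in name:
        return name, None  # individual MP; party resolved elsewhere
    return name, name      # party-level vote: party = actor_name


def extract_votes(vote_map):
    """Flatten {"actor": "vote"} into a list of row dicts."""
    rows = []
    for actor, vote in vote_map.items():
        name, party = classify_actor(actor)
        rows.append({"actor": name, "party": party, "vote": vote})
    return rows
```

For example, extract_votes({"PVV": "voor"}) yields one row with party "PVV", while a hypothetical actor "Jansen, A." yields a row whose party is still unset.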

Open Items (not blocking, data coverage gaps)

  1. 2020–2021 data gap: No motions in DB at all. Need to run downloader with --start-date 2019-01-01 --end-date 2021-12-31 if data exists in API.
  2. 2024 gap of ~3,020 motions: the OData API has ~3,968 motions for 2024, but only 948 are in the DB. Root cause unclear; needs investigation of the URL-based dedup in insert_motion.
  3. "Verworpen." dedup: Short-text motions (title="Verworpen.") get spurious similarity=1.0. The UI/query layer should keep only rows where score < 0.999 OR title != 'Verworpen.', which drops exactly the rows that are both near-1.0 and titled "Verworpen.".
  4. svd_vectors has duplicates: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
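
The keep-condition from item 3 can be written as a small predicate. This is a minimal sketch of what a query-layer filter could look like; keep_match and filter_matches are hypothetical names, and the dict keys "score" and "title" are assumptions about the row shape.

```python
SPURIOUS_TITLE = "Verworpen."
SCORE_CUTOFF = 0.999


def keep_match(score, title):
    """False only for near-1.0 hits whose title is the 'Verworpen.' stub."""
    return score < SCORE_CUTOFF or title != SPURIOUS_TITLE


def filter_matches(matches):
    """Drop spurious rows from a list of {"score": ..., "title": ...} dicts."""
    return [m for m in matches if keep_match(m["score"], m["title"])]
```

By De Morgan's law this rejects a row only when score >= 0.999 and the title is "Verworpen." at the same time, so a genuine near-duplicate with a real title is still returned.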

Key File Paths

  • DB: data/motions.db
  • Venv: .venv/bin/python
  • Pipeline entry: pipeline/run_pipeline.py
  • Fusion: pipeline/fusion.py
  • SVD: pipeline/svd_pipeline.py
  • Text embeddings: pipeline/text_pipeline.py
  • MP votes extraction: pipeline/extract_mp_votes.py
  • Database layer: database.py
  • Similarity compute: similarity/compute.py
  • Similarity lookup: similarity/lookup.py
  • Tests: tests/ (pytest, offline)

Branch

main

Progress

Done

  • All items listed under "Completed This Session" above

In Progress

  • Rerun embeddings: started scripts/rerun_embeddings.py against data/motions.db
    • Start time: 2026-03-23T01:42:00Z (approx)
    • Current progress: embeddings stored = 950 / total motions = 28,172
    • fused_embeddings = 0 (not started)
    • similarity_cache = 0 (not started)

Blocked

  • Not fully blocked, but encountering provider failures and warnings that slow progress:
    • Batch 951..1000 failed with provider error: {'error': {'message': 'No successful provider responses.', 'code': 404}} (recorded)
    • Occasional connection pool warnings during earlier body fetch phase (logged)
    • Provider failures are transient but may require retries or provider change if repeated

Key Decisions

  • Retry strategy on provider failure (UNCONFIRMED): if provider failures repeat, retry the failed embedding batches with a smaller batch_size (e.g. 50 -> 20); if failures persist, switch provider. Rationale: smaller batches put fewer items at risk per request and raise the chance of partial success.
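
The smaller-batch part of this strategy could look roughly like the sketch below. It is not taken from scripts/rerun_embeddings.py: embed_with_fallback is a hypothetical helper, embed_batch stands for whatever callable sends one request to the provider, and switching provider is left out.

```python
import logging

logger = logging.getLogger(__name__)


def embed_with_fallback(texts, embed_batch, batch_sizes=(50, 20)):
    """Embed texts, re-splitting a failed chunk at the next smaller size.

    embed_batch(list_of_texts) must return one vector per text or raise.
    Returns (vectors, failed); vectors[i] is None where embedding failed.
    """
    vectors = [None] * len(texts)
    failed = []

    def run(start, stop, level):
        size = batch_sizes[level]
        for lo in range(start, stop, size):
            hi = min(lo + size, stop)
            try:
                result = embed_batch(texts[lo:hi])
                if len(result) != hi - lo:
                    raise ValueError("provider returned wrong vector count")
                vectors[lo:hi] = result
            except Exception as exc:  # provider errors vary; log and degrade
                logger.warning("batch %d..%d failed: %s", lo, hi - 1, exc)
                if level + 1 < len(batch_sizes):
                    run(lo, hi, level + 1)  # retry this chunk in smaller pieces
                else:
                    failed.extend(range(lo, hi))  # give up; report for later

    run(0, len(texts), 0)
    return vectors, failed
```

The returned failed list gives the indices that still need a retry, which matches the plan to record failed motion IDs in the ledger on completion.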

Next Steps

  1. Continue the rerun_embeddings job until completion; monitor batches closely
  2. If provider failures repeat, retry failed batches with smaller batch_size (50 -> 20) or switch provider (as above)
  3. On completion, update ledger with final counts and list any failed motion IDs
  4. If fused_embeddings / similarity_cache are still 0 after the embeddings run finishes, run the fusion and similarity recompute pipelines

File Operations

Read

  • data/motions.db
  • scripts/rerun_embeddings.py (invoked)

Modified

  • thoughts/ledgers/CONTINUITY_stemwijzer.md (this file)

Critical Context

  • Rerun started 2026-03-23T01:42Z; current embeddings stored = 950 of 28,172 total motions.
  • Recent error: Batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}}. This batch still needs to be retried.
  • ETA: approx 1.5–2.5 hours remaining at current rate (UNCONFIRMED, depends on provider stability)
  • Earlier stage produced occasional connection pool warnings while fetching motion bodies; these did not stop progress but may indicate transient network instability.

Working Set

  • Branch: main
  • Key files: data/motions.db, scripts/rerun_embeddings.py, thoughts/ledgers/CONTINUITY_stemwijzer.md