Session: stemwijzer

Updated: 2026-03-31T12:40:00Z

Goal

2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.

Constraints

DuckDB only (data/motions.db); open/close duckdb.connect(self.db_path) per method
Vectors stored as JSON text (no external vector DB)
Logging via logging.getLogger(__name__); no print() in library modules
Tests run offline (network monkeypatched) — use .venv/bin/python -m pytest -q
Do NOT modify app.py or scheduler.py
Use .venv/bin/python (Arch Linux system Python is externally managed)
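The "vectors stored as JSON text" and logging constraints can be illustrated with a small round-trip helper. This is a minimal sketch, not the project's code: the function names vector_to_json and vector_from_json are hypothetical, and the finite-value check is an added assumption (JSON has no NaN or Infinity).

```python
import json
import logging
import math

# Library modules log through the module logger; no print().
logger = logging.getLogger(__name__)


def vector_to_json(vector):
    """Serialize a numeric vector to JSON text for a text column."""
    values = [float(x) for x in vector]
    if not all(math.isfinite(v) for v in values):
        # JSON has no NaN/Infinity, so refuse instead of storing invalid text.
        raise ValueError("vector contains NaN or infinite values")
    return json.dumps(values)


def vector_from_json(text):
    """Parse JSON text back into a list of floats."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    vector = [float(x) for x in values]
    logger.debug("decoded vector with %d dimensions", len(vector))
    return vector
```

Storing plain JSON arrays keeps the DB readable without a vector extension, at the cost of parsing on every read.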

Current DB State (verified 2026-03-22 ~16:00; additional run summary 2026-03-23)

Table	Rows
motions	10,613
embeddings	10,753
svd_vectors	24,528
fused_embeddings	10,613 (1:1 with motions, 0 duplicates). The 2026-03-23 run summary reports a larger aggregate insert count (see Critical Context); its mapping to this table is UNCONFIRMED
similarity_cache	212,206 (top_k=20, all annual windows). The 2026-03-23 fusion+similarity run reports a larger aggregate insert count (see Critical Context); its mapping to this table is UNCONFIRMED
mp_votes	199,967
mp_metadata	798

Annual Window Coverage

Year	Motions	Fused	Similarity
2016	132	132	2,640
2017	30	30	600
2018	100	100	2,000
2019	3	3	6
2020	0	0	0 (no data)
2021	0	0	0 (no data)
2022	4,116	4,116	82,320
2023	621	621	12,420
2024	948	948	18,960
2025	3,715	3,715	74,300
2026	948	948	18,960
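The Similarity column above is consistent with each motion storing up to top_k neighbours from its own window, excluding itself, i.e. n * min(top_k, n - 1) rows for a window of n motions. That reading is an inference from the numbers, not something stated elsewhere in this ledger. A quick arithmetic check (the function name is hypothetical):

```python
TOP_K = 20

# Motions per annual window, copied from the table above.
motions_per_year = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions, top_k=TOP_K):
    """Rows expected if every motion keeps up to top_k neighbours, never itself."""
    if n_motions < 2:
        return 0
    return n_motions * min(top_k, n_motions - 1)


expected = {year: expected_similarity_rows(n) for year, n in motions_per_year.items()}
total = sum(expected.values())
```

Under this assumption 2019 yields 3 * 2 = 6 rows and the total comes to 212,206, matching the similarity_cache count.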

Completed This Session

Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
Re-ran extract_mp_votes on all motions → 111,978 new rows (party-level votes backfilled)
SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
Fixed store_fused_embedding double-counting bug: added DELETE before INSERT
Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
Test suite: 34 passed, 2 skipped ✅
Embedding rerun (scripts/rerun_embeddings.py) completed: 28,172 embeddings stored (final count as recorded in the fusion+similarity run summary; mapping to the embeddings table is UNCONFIRMED)
Fusion + similarity run completed (per-window processing) — aggregate inserts recorded in thoughts/ledgers/fusion_similarity_summary.json
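The DELETE-before-INSERT fix to store_fused_embedding listed above can be sketched as follows. Assumptions, stated loudly: the column names (motion_id, window_id, vector) and the choice of key are hypothetical, and the standard-library sqlite3 module is used purely as a stand-in so the sketch runs without third-party packages. The project itself uses duckdb.connect(self.db_path).

```python
import json
import sqlite3


def store_fused_embedding(conn, motion_id, window_id, vector):
    """Replace any existing row for (motion_id, window_id), then insert."""
    # Hypothetical schema: the real table and key columns may differ.
    conn.execute(
        "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
        (motion_id, window_id),
    )
    conn.execute(
        "INSERT INTO fused_embeddings (motion_id, window_id, vector) VALUES (?, ?, ?)",
        (motion_id, window_id, json.dumps(vector)),
    )


conn = sqlite3.connect(":memory:")  # stand-in for duckdb.connect(db_path)
conn.execute(
    "CREATE TABLE fused_embeddings (motion_id INTEGER, window_id TEXT, vector TEXT)"
)
# Simulate a re-run: the same motion/window pair is stored twice.
store_fused_embedding(conn, 1, "2025", [0.1, 0.2])
store_fused_embedding(conn, 1, "2025", [0.3, 0.4])
row_count = conn.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
stored = conn.execute("SELECT vector FROM fused_embeddings").fetchone()[0]
conn.close()
```

After the second call the table still holds one row, carrying the newer vector, which is the behaviour a re-run needs.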

Key Decisions

store_fused_embedding (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
Annual windows chosen for historical political compass (2016–2026).
top_k=20 for similarity cache.
Party-level votes (e.g. {"PVV": "voor"}) handled in extract_mp_votes — actor without comma → party=actor_name.
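The party-level vote rule above (actor without comma means party=actor_name) might look like this in isolation. It is a sketch only: classify_actor and party_votes_to_rows are hypothetical names, the row shape is invented for illustration, and it assumes individual MPs are written with a comma (for example "Jansen, A."), which the rule implies but this ledger does not spell out.

```python
def classify_actor(actor_name):
    """Return (kind, party) for a vote actor.

    Assumption: names containing a comma are individual MPs; anything
    else is treated as a party-level vote.
    """
    name = actor_name.strip()
    if "," in name:
        # Individual MP: the party must come from elsewhere (e.g. mp_metadata).
        return ("mp", None)
    return ("party", name)


def party_votes_to_rows(motion_id, votes):
    """Turn a mapping like {"PVV": "voor"} into flat vote rows."""
    rows = []
    for actor_name, vote in votes.items():
        kind, party = classify_actor(actor_name)
        rows.append({
            "motion_id": motion_id,
            "actor": actor_name.strip(),
            "party": party,
            "vote": vote,
            "kind": kind,
        })
    return rows
```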

Open Items (not blocking, data coverage gaps)

2020–2021 data gap: No motions in DB at all. Need to run downloader with --start-date 2019-01-01 --end-date 2021-12-31 if data exists in API.
2024 gap ~3,020 motions: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear — needs investigation of URL-based dedup in insert_motion.
"Verworpen." dedup: Short-text motions (title="Verworpen.") get spurious similarity=1.0. The UI/query layer should keep only rows where score < 0.999 OR title != 'Verworpen.', i.e. drop near-perfect matches on that title.
svd_vectors has duplicates: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
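The "Verworpen." filter from the open items above can be expressed as a small predicate. This is an illustrative sketch: the row shape (dicts with "score" and "title" keys) and the function names are assumptions, and the 0.999 threshold is the one proposed in the item.

```python
def is_spurious_match(score, title, threshold=0.999):
    """True for near-perfect matches on the placeholder title 'Verworpen.'."""
    return score >= threshold and title.strip() == "Verworpen."


def filter_similar(rows):
    """Keep rows where score < 0.999 OR title != 'Verworpen.'."""
    return [r for r in rows if not is_spurious_match(r["score"], r["title"])]
```

Negating the "spurious" condition gives exactly the keep rule from the item, so a genuine 1.0 match on a normal title still passes through.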

Key File Paths

DB: data/motions.db
Venv: .venv/bin/python
Pipeline entry: pipeline/run_pipeline.py
Fusion: pipeline/fusion.py
SVD: pipeline/svd_pipeline.py
Text embeddings: pipeline/text_pipeline.py
MP votes extraction: pipeline/extract_mp_votes.py
Database layer: database.py
Similarity compute: similarity/compute.py
Similarity lookup: similarity/lookup.py
Tests: tests/ (pytest, offline)

Branch

main

Progress

Done

All items listed under "Completed This Session" above

In Progress

Short QA: sample similarity lookups and sanity checks (N=20-50) against fused_embeddings/similarity results
- Purpose: validate fused vectors, detect padding/anomalies, and confirm similarity rows are sensible
- Estimated effort: 30–60 minutes
Trajectories tab: chart not rendering — root cause found (silent exception in st.plotly_chart)
- Fix applied: commit 72d1c20 — shows st.error + diagnostics when rendering fails
- Pending: user to verify fix by running Explorer with EXPLORER_DEBUG_TRAJECTORIES=1

Blocked

Nothing blocks QA. Earlier provider failures affected the embedding rerun, but the rerun completed according to the fusion run summary (UNCONFIRMED).

Key Decisions

Retry strategy on provider failure: on repeated provider failures, retry embedding batches with a smaller batch_size (e.g. 50 -> 20), and switch provider if failures persist. Rationale: smaller batches reduce per-request risk and increase the chance of partial success. (UNCONFIRMED)
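One way to sketch the smaller-batch retry is shown below. Everything here is hypothetical: embed_with_fallback and the embed_batch callable (list of texts in, list of vectors out) are illustration names, not the project's API, and switching provider is left out.

```python
import logging

logger = logging.getLogger(__name__)


def embed_with_fallback(texts, embed_batch, batch_sizes=(50, 20)):
    """Embed texts, retrying failed batches with smaller batch sizes.

    Returns (results, failed): results maps text index -> vector, failed
    lists the indices that still failed at the smallest batch size.
    """
    results = {}
    pending = list(range(len(texts)))
    for size in batch_sizes:
        still_failing = []
        for start in range(0, len(pending), size):
            chunk = pending[start:start + size]
            try:
                vectors = embed_batch([texts[i] for i in chunk])
            except Exception as exc:  # provider errors vary; kept broad in this sketch
                logger.warning("batch of %d failed at size %d: %s", len(chunk), size, exc)
                still_failing.extend(chunk)
                continue
            results.update(zip(chunk, vectors))
        pending = still_failing
        if not pending:
            break
    return results, pending
```

Returning the leftover indices keeps the failed motion IDs available for archiving or a later retry with another provider.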

Next Steps

Run Short QA: perform sample similarity lookups across N=20-50 items and validate fused vectors
Inspect thoughts/ledgers/fusion_similarity_summary.json for windows with padded vectors or warnings; decide whether to re-run fusion for affected windows
If QA passes, promote results to downstream consumers and update DB count fields (mark as confirmed)
If anomalies found, re-run fusion for affected windows and re-compute similarity for those windows
Archive list of any failed motion IDs from embedding run and consider retry with smaller batch_size or alternate provider (if any failures remain) (UNCONFIRMED)
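The Short QA step could start from a few mechanical checks before any manual reading of results. The sketch below is an assumption-heavy illustration: the row shape (source_id, target_id, score), the function names, and the expectation that scores lie in [-1, 1] (as they would for cosine similarity) are not confirmed by this ledger.

```python
import math
import random


def check_similarity_rows(rows, top_k=20, eps=1e-6):
    """Return a list of human-readable problems found in similarity rows."""
    problems = []
    per_source = {}
    for source_id, target_id, score in rows:
        if source_id == target_id:
            problems.append(f"{source_id}: self-match")
        if not math.isfinite(score) or not (-1.0 - eps <= score <= 1.0 + eps):
            problems.append(f"{source_id}->{target_id}: score {score} out of range")
        per_source[source_id] = per_source.get(source_id, 0) + 1
    for source_id, count in per_source.items():
        if count > top_k:
            problems.append(f"{source_id}: {count} neighbours exceeds top_k={top_k}")
    return problems


def sample_ids(all_ids, n=20, seed=0):
    """Pick a reproducible sample of ids for manual inspection."""
    rng = random.Random(seed)
    ids = sorted(all_ids)
    return rng.sample(ids, min(n, len(ids)))
```

A fixed seed makes the N=20-50 sample repeatable, so a later session can re-inspect the same motions.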

File Operations

Read

data/motions.db
scripts/rerun_embeddings.py (invoked)
thoughts/ledgers/fusion_similarity_summary.json (run summary)

Modified

thoughts/ledgers/CONTINUITY_stemwijzer.md (this file)
thoughts/ledgers/fusion_similarity_summary.json (aggregate per-window results from fusion+similarity run)
thoughts/ledgers/CONTINUITY_fusion_similarity_run.md

Critical Context

Rerun embeddings started 2026-03-23T01:42Z; final embedding count recorded by fusion run = 28,172 (see thoughts/ledgers/fusion_similarity_summary.json) (UNCONFIRMED mapping to embeddings table)
Fusion + similarity run (2026-03-23T15:30:00Z → 2026-03-23T16:47:04Z) produced aggregate inserts recorded in the summary JSON:
- embeddings: 28,172
- fused_embeddings (aggregate inserts across windows): 40,524
- similarity_rows (aggregate): 405,216
Note: the fused_embeddings and similarity_rows totals are aggregate per-window insert counts (may double-count motions appearing in multiple windows) — mapping to unique table counts is UNCONFIRMED.
Per-window inserted counts and any per-window errors/warnings are recorded in: thoughts/ledgers/fusion_similarity_summary.json.
Padding was applied in windows with inconsistent vector dimensions, with warnings logged per window (see summary JSON). Padding kept the pipeline moving, but the decision should be reviewed (see Next Steps).
Earlier provider error: Batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}} — these batches were retried/covered in the rerun captured by the fusion run (UNCONFIRMED; check failed IDs in summary JSON).
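For reviewing the padding decision noted above, it may help to see what padding to a common dimension does. The ledger does not say how the pipeline pads, so this sketch assumes right-padding with zeros; pad_to_common_dim is a hypothetical name.

```python
def pad_to_common_dim(vectors):
    """Right-pad shorter vectors with zeros; return (padded, n_padded)."""
    if not vectors:
        return [], 0
    target = max(len(v) for v in vectors)
    padded = []
    n_padded = 0
    for v in vectors:
        missing = target - len(v)
        if missing:
            n_padded += 1
        # Copy so the caller's vectors are left untouched.
        padded.append(list(v) + [0.0] * missing)
    return padded, n_padded
```

Zero padding leaves each vector's norm unchanged, but it treats the missing dimensions as zero, which is why windows that logged padding warnings deserve a closer look during QA.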

Working Set

Branch: main
Key files: data/motions.db, scripts/rerun_embeddings.py, thoughts/ledgers/CONTINUITY_stemwijzer.md, thoughts/ledgers/fusion_similarity_summary.json, thoughts/ledgers/CONTINUITY_fusion_similarity_run.md
