Session: stemwijzer
Updated: 2026-03-23T09:00:00Z
Goal
2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.
Constraints
- DuckDB only (`data/motions.db`); open/close `duckdb.connect(self.db_path)` per method
- Vectors stored as JSON text (no external vector DB)
- Logging via `logging.getLogger(__name__)`; no `print()` in library modules
- Tests run offline (network monkeypatched); use `.venv/bin/python -m pytest -q`
- Do NOT modify `app.py` or `scheduler.py`
- Use `.venv/bin/python` (Arch Linux system Python is externally managed)
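A minimal sketch of two of the constraints above: vectors stored as JSON text, and module-level logging instead of `print()`. The helper names `vector_to_json` and `vector_from_json` are hypothetical and not taken from the codebase.

```python
import json
import logging

# Module-level logger, per the logging constraint (no print() in library code).
logger = logging.getLogger(__name__)


def vector_to_json(vector):
    """Serialize a vector to JSON text for a text column (hypothetical helper)."""
    return json.dumps([float(x) for x in vector])


def vector_from_json(text):
    """Parse a JSON text column back into a list of floats (hypothetical helper)."""
    vector = [float(x) for x in json.loads(text)]
    logger.debug("decoded vector with %d dimensions", len(vector))
    return vector
```

Storing plain JSON keeps the database free of extension types, at the cost of parsing every vector on read.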
Current DB State (verified 2026-03-22 ~16:00)
| Table | Rows |
|---|---|
| motions | 10,613 |
| embeddings | 10,753 |
| svd_vectors | 24,528 |
| fused_embeddings | 10,613 (1:1 with motions, 0 duplicates) |
| similarity_cache | 212,206 (top_k=20, all annual windows) |
| mp_votes | 199,967 |
| mp_metadata | 798 |
Annual Window Coverage
| Year | Motions | Fused | Similarity |
|---|---|---|---|
| 2016 | 132 | 132 | 2,640 |
| 2017 | 30 | 30 | 600 |
| 2018 | 100 | 100 | 2,000 |
| 2019 | 3 | 3 | 6 |
| 2020 | 0 | 0 | 0 (no data) |
| 2021 | 0 | 0 | 0 (no data) |
| 2022 | 4,116 | 4,116 | 82,320 |
| 2023 | 621 | 621 | 12,420 |
| 2024 | 948 | 948 | 18,960 |
| 2025 | 3,715 | 3,715 | 74,300 |
| 2026 | 948 | 948 | 18,960 |
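The similarity counts in the table can be cross-checked with simple arithmetic. The sketch below assumes each motion keeps at most `top_k` neighbours and never more than the other motions in its window; that rule is an inference that matches the table (including the 3-motion 2019 window), not something the pipeline code confirms.

```python
TOP_K = 20

# Motions per annual window, copied from the coverage table.
MOTIONS_PER_YEAR = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions, top_k=TOP_K):
    """Rows for one window: n * min(top_k, n - 1), or 0 if fewer than 2 motions."""
    if n_motions < 2:
        return 0
    return n_motions * min(top_k, n_motions - 1)


expected = {year: expected_similarity_rows(n) for year, n in MOTIONS_PER_YEAR.items()}
total_motions = sum(MOTIONS_PER_YEAR.values())
total_rows = sum(expected.values())
```

Under that assumption the totals come out to 10,613 motions and 212,206 similarity rows, which agrees with the DB state table.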
Completed This Session
- Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
- Re-ran `extract_mp_votes` on all motions → 111,978 new rows (party-level votes backfilled)
- SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
- Fixed `store_fused_embedding` double-counting bug: added DELETE before INSERT
- Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
- Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
- Test suite: 34 passed, 2 skipped ✅
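The DELETE-before-INSERT fix can be illustrated as follows. This is a sketch only: the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra packages, and the table columns (`motion_id`, `window_id`, `vector`) and the function name are assumptions rather than the real schema in `database.py`.

```python
import json
import sqlite3


def store_fused_embedding_sketch(conn, motion_id, window_id, vector):
    """Replace any existing row for (motion_id, window_id), then insert the new one."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
            (motion_id, window_id),
        )
        conn.execute(
            "INSERT INTO fused_embeddings (motion_id, window_id, vector) VALUES (?, ?, ?)",
            (motion_id, window_id, json.dumps(vector)),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fused_embeddings (motion_id INTEGER, window_id TEXT, vector TEXT)")

# Storing the same key twice should leave exactly one row, holding the latest vector.
store_fused_embedding_sketch(conn, 1, "2025", [0.1, 0.2])
store_fused_embedding_sketch(conn, 1, "2025", [0.3, 0.4])
row_count = conn.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
stored_vector = json.loads(conn.execute("SELECT vector FROM fused_embeddings").fetchone()[0])
```

Wrapping both statements in one transaction means a failed INSERT does not leave the row deleted.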
Key Decisions
- `store_fused_embedding` (`database.py` line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
- Annual windows chosen for historical political compass (2016–2026).
- top_k=20 for similarity cache.
- Party-level votes (e.g. `{"PVV": "voor"}`) handled in `extract_mp_votes`: actor without comma → `party=actor_name`.
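The party-level rule in the last decision could look roughly like this. The function name and return shape are hypothetical, the assumption that individual MPs are written as "Surname, Initials" is inferred from the comma rule, and "Jansen, A." is a made-up example actor.

```python
def classify_vote_actor(actor_name):
    """Split a vote actor into MP-level or party-level (hypothetical helper).

    Assumption: individual MPs are written "Surname, Initials", so a name
    without a comma is treated as a party (party = actor_name).
    """
    name = actor_name.strip()
    if "," not in name:
        return {"mp_name": None, "party": name}
    # Resolving an individual MP's party is out of scope for this sketch.
    return {"mp_name": name, "party": None}


# "PVV" comes from the ledger; "Jansen, A." is an invented MP-style actor.
votes = {"PVV": "voor", "Jansen, A.": "tegen"}
rows = [dict(classify_vote_actor(actor), vote=vote) for actor, vote in votes.items()]
```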
Open Items (not blocking, data coverage gaps)
- 2020–2021 data gap: No motions in DB at all. Need to run downloader with `--start-date 2019-01-01 --end-date 2021-12-31` if data exists in API.
- 2024 gap ~3,020 motions: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear; needs investigation of URL-based dedup in `insert_motion`.
- "Verworpen." dedup: Short-text motions (title="Verworpen.") get spurious similarity=1.0. UI/query layer should keep only rows where `score < 0.999 OR title != 'Verworpen.'`.
- svd_vectors has duplicates: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
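The "Verworpen." filter from the open items can be expressed as a small predicate. The row shape and sample rows below are invented for illustration, and the sketch assumes `title` refers to the matched motion's title.

```python
def keep_similarity_row(score, title):
    """Keep a row unless it is a near-perfect match on the placeholder title."""
    return score < 0.999 or title != "Verworpen."


# Made-up rows: only the first one is a spurious 1.0 match on "Verworpen.".
sample_rows = [
    {"motion_id": 1, "score": 1.0, "title": "Verworpen."},
    {"motion_id": 2, "score": 0.87, "title": "Verworpen."},
    {"motion_id": 3, "score": 1.0, "title": "Some other motion"},
]
filtered = [r for r in sample_rows if keep_similarity_row(r["score"], r["title"])]
kept_ids = [r["motion_id"] for r in filtered]
```

A genuine 1.0 match between two motions with real titles still passes, since only the placeholder title is excluded.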
Key File Paths
- DB: `data/motions.db`
- Venv: `.venv/bin/python`
- Pipeline entry: `pipeline/run_pipeline.py`
- Fusion: `pipeline/fusion.py`
- SVD: `pipeline/svd_pipeline.py`
- Text embeddings: `pipeline/text_pipeline.py`
- MP votes extraction: `pipeline/extract_mp_votes.py`
- Database layer: `database.py`
- Similarity compute: `similarity/compute.py`
- Similarity lookup: `similarity/lookup.py`
- Tests: `tests/` (pytest, offline)
Branch
main
Progress
Done
- All items listed under "Completed This Session" above
In Progress
- Rerun embeddings: started `scripts/rerun_embeddings.py` against `data/motions.db`
- Start time: 2026-03-23T01:42:00Z (approx)
- Current progress: embeddings stored = 950 / total motions = 28,172
- fused_embeddings = 0 (not started)
- similarity_cache = 0 (not started)
Blocked
- Not fully blocked, but encountering provider failures and warnings that slow progress:
- Batch 951..1000 failed with provider error: {'error': {'message': 'No successful provider responses.', 'code': 404}} (recorded)
- Occasional connection pool warnings during earlier body fetch phase (logged)
- Provider failures are transient but may require retries or provider change if repeated
Key Decisions
- Retry strategy on provider failure: On repeated provider failures, retry embedding batches with smaller batch_size (e.g. 50 -> 20) or switch provider. Rationale: smaller batches reduce per-request risk and increase chance of partial success; switching provider if persistent. (UNCONFIRMED)
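One possible shape for the (still unconfirmed) retry strategy is sketched below. `embed_batch` is a hypothetical callable that takes a list of texts and returns one vector per text; the real provider client and the way `scripts/rerun_embeddings.py` batches work are not shown in this ledger.

```python
import logging

logger = logging.getLogger(__name__)


def embed_with_shrinking_batches(texts, embed_batch, batch_sizes=(50, 20)):
    """Embed texts; when a batch fails, retry that batch in smaller pieces.

    Returns (vectors_by_index, failed_indices).
    """
    vectors, failed = {}, []

    def attempt(indices, level):
        size = batch_sizes[level]
        for start in range(0, len(indices), size):
            chunk = indices[start:start + size]
            try:
                result = embed_batch([texts[i] for i in chunk])
            except Exception as exc:
                logger.warning("batch of %d failed: %s", len(chunk), exc)
                if level + 1 < len(batch_sizes):
                    attempt(chunk, level + 1)  # retry with the next smaller size
                else:
                    failed.extend(chunk)  # keep for a later retry or provider switch
                continue
            vectors.update(zip(chunk, result))

    attempt(list(range(len(texts))), 0)
    return vectors, failed
```

Returning the failed indices rather than raising lets the job continue and makes it straightforward to list failed motion IDs in the ledger afterwards.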
Next Steps
- Continue the rerun_embeddings job until completion; monitor batches closely
- If provider failures repeat, retry failed batches with smaller batch_size (50 -> 20) or switch provider (as above)
- On completion, update ledger with final counts and list any failed motion IDs
- If fused_embeddings / similarity_cache remain 0 after embeddings finished, run fusion and similarity recompute pipelines
File Operations
Read
- `data/motions.db`
- `scripts/rerun_embeddings.py` (invoked)
Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (this file)
Critical Context
- Rerun started 2026-03-23T01:42Z; current embeddings stored = 950 of 28,172 total motions.
- Recent error: Batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}}. This batch should be retried; the error payload is kept here for reference.
- ETA: approx 1.5–2.5 hours remaining at current rate (UNCONFIRMED, depends on provider stability)
- Earlier stage produced occasional connection pool warnings while fetching motion bodies; these did not stop progress but may indicate transient network instability.
Working Set
- Branch: `main`
- Key files: `data/motions.db`, `scripts/rerun_embeddings.py`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`