# Session: stemwijzer
Updated: 2026-03-31T12:40:00Z

## Goal
2D political compass and motion similarity search built from parliamentary votes and motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.

## Constraints
- DuckDB only (`data/motions.db`); open/close `duckdb.connect(self.db_path)` per method
- Vectors stored as JSON text (no external vector DB)
- Logging via `logging.getLogger(__name__)`; no `print()` in library modules
- Tests run offline (network monkeypatched); run them with `.venv/bin/python -m pytest -q`
- Do NOT modify `app.py` or `scheduler.py`
- Use `.venv/bin/python` (Arch Linux system Python is externally managed)

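The "vectors as JSON text" and "logging, no `print()`" constraints can be illustrated with a minimal round-trip sketch. The helper names `vector_to_json` and `vector_from_json` are hypothetical and are not taken from `database.py`; the finite-value check is an added assumption, not a documented rule of the project.

```python
import json
import logging
import math

logger = logging.getLogger(__name__)  # library modules log; they never print()


def vector_to_json(vector):
    """Serialise a vector to JSON text for storage in a text column."""
    values = [float(x) for x in vector]
    if not all(math.isfinite(x) for x in values):
        # Assumption: NaN/inf are rejected so the stored text stays valid JSON.
        raise ValueError("vector contains NaN or infinity")
    return json.dumps(values)


def vector_from_json(text):
    """Parse JSON text back into a list of floats."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    logger.debug("decoded vector with %d dimensions", len(values))
    return [float(x) for x in values]
```

Storing plain JSON arrays keeps the vectors readable from any SQL client, at the cost of parsing on every read.
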
## Current DB State (verified 2026-03-22 ~16:00; additional run summary 2026-03-23)

| Table | Rows |
|---|---|
| motions | 10,613 |
| embeddings | 10,753 |
| svd_vectors | 24,528 |
| fused_embeddings | **10,613** (1:1 with motions, 0 duplicates); the per-run fusion summary reported larger aggregate inserts (see Critical Context; mapping UNCONFIRMED) |
| similarity_cache | **212,206** (top_k=20, all annual windows); the fusion+similarity run reported a larger number of inserted rows (see Critical Context; mapping UNCONFIRMED) |
| mp_votes | 199,967 |
| mp_metadata | 798 |

## Annual Window Coverage

| Year | Motions | Fused | Similarity |
|---|---|---|---|
| 2016 | 132 | 132 | 2,640 |
| 2017 | 30 | 30 | 600 |
| 2018 | 100 | 100 | 2,000 |
| 2019 | 3 | 3 | 6 |
| 2020 | 0 | 0 | 0 (no data) |
| 2021 | 0 | 0 | 0 (no data) |
| 2022 | 4,116 | 4,116 | 82,320 |
| 2023 | 621 | 621 | 12,420 |
| 2024 | 948 | 948 | 18,960 |
| 2025 | 3,715 | 3,715 | 74,300 |
| 2026 | 948 | 948 | 18,960 |

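The per-year similarity counts in the table above are consistent with each motion keeping `min(top_k, n - 1)` neighbours inside its window (for example 3 motions in 2019 give 3 × 2 = 6 rows). That formula is an inference from the numbers, not something read from `similarity/compute.py`; the sketch below only re-does the arithmetic.

```python
TOP_K = 20

# Motions per annual window, copied from the coverage table.
MOTIONS_PER_YEAR = {
    2016: 132, 2017: 30, 2018: 100, 2019: 3, 2020: 0, 2021: 0,
    2022: 4116, 2023: 621, 2024: 948, 2025: 3715, 2026: 948,
}


def expected_similarity_rows(n_motions, top_k=TOP_K):
    """Rows for one window, assuming each motion keeps min(top_k, n - 1) neighbours."""
    if n_motions < 2:
        return 0
    return n_motions * min(top_k, n_motions - 1)


per_year = {year: expected_similarity_rows(n) for year, n in MOTIONS_PER_YEAR.items()}
total_motions = sum(MOTIONS_PER_YEAR.values())
total_rows = sum(per_year.values())
active_windows = sum(1 for n in MOTIONS_PER_YEAR.values() if n > 0)
```

Under this assumption the totals match the DB state table: 10,613 motions, 212,206 similarity rows, and 9 active windows.
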
## Completed This Session
- [x] Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
- [x] Re-ran `extract_mp_votes` on all motions → 111,978 new rows (party-level votes backfilled)
- [x] SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
- [x] Fixed `store_fused_embedding` double-counting bug: added DELETE before INSERT
- [x] Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
- [x] Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
- [x] Test suite: **34 passed, 2 skipped** ✅
- [x] Rerun embeddings (`scripts/rerun_embeddings.py`) completed: embeddings stored = **28,172** (final), as recorded in the fusion+similarity run summary (mapping to the `embeddings` table UNCONFIRMED)
- [x] Fusion + similarity run completed (per-window processing); aggregate inserts recorded in `thoughts/ledgers/fusion_similarity_summary.json`

## Key Decisions
- `store_fused_embedding` (database.py line 686): now does DELETE+INSERT instead of a plain INSERT to prevent duplicates on re-runs.
- Annual windows chosen for historical political compass (2016–2026).
- top_k=20 for similarity cache.
- Party-level votes (e.g. `{"PVV": "voor"}`) handled in `extract_mp_votes`: an actor name without a comma is treated as a party (`party=actor_name`).

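The party-level vote rule in the last bullet can be sketched as follows. Only the "no comma means party" branch comes from this ledger; the handling of names that do contain a comma (kept as an individual MP with the party left unknown), the function names, and the row shape are assumptions for illustration and may differ from `pipeline/extract_mp_votes.py`.

```python
def classify_actor(actor_name):
    """Split a vote actor into (mp_name, party).

    Rule from the ledger: no comma in the name means a party-level vote,
    so party = actor_name. The comma branch is an assumption: the name is
    kept as an individual MP and the party is left unknown.
    """
    name = actor_name.strip()
    if "," not in name:
        return None, name
    return name, None


def extract_rows(votes):
    """Turn {"actor": "vote"} into row dicts (hypothetical row shape)."""
    rows = []
    for actor, vote in votes.items():
        mp_name, party = classify_actor(actor)
        rows.append({"mp_name": mp_name, "party": party, "vote": vote})
    return rows
```

For the example from the bullet, `extract_rows({"PVV": "voor"})` yields a single row with `party` set to `"PVV"` and no MP name.
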
## Open Items (not blocking, data coverage gaps)
1. **2020–2021 data gap**: No motions in DB at all. Need to run the downloader with `--start-date 2019-01-01 --end-date 2021-12-31` if the data exists in the API.
2. **2024 gap ~3,020 motions**: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear; needs investigation of URL-based dedup in `insert_motion`.
3. **"Verworpen." dedup**: Short-text motions (title="Verworpen.") get spurious similarity=1.0. UI/query layer should filter `score < 0.999 OR title != 'Verworpen.'`.
4. **svd_vectors has duplicates**: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.

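The filter from item 3 keeps a result unless it is both a near-perfect match and titled "Verworpen.". A minimal sketch of that predicate at the query layer, assuming results arrive as dicts with `score` and `title` keys (a hypothetical shape; `similarity/lookup.py` may return something else):

```python
def keep_result(row, threshold=0.999, boilerplate_title="Verworpen."):
    """Return False only for near-perfect matches on the boilerplate title."""
    return row["score"] < threshold or row["title"] != boilerplate_title


def filter_results(rows):
    """Drop spurious 'Verworpen.' matches, keep everything else in order."""
    return [row for row in rows if keep_result(row)]
```

Genuine near-duplicates with a different title still pass, as do low-scoring "Verworpen." rows, which matches the `OR` in the proposed filter.
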
## Key File Paths
- DB: `data/motions.db`
- Venv: `.venv/bin/python`
- Pipeline entry: `pipeline/run_pipeline.py`
- Fusion: `pipeline/fusion.py`
- SVD: `pipeline/svd_pipeline.py`
- Text embeddings: `pipeline/text_pipeline.py`
- MP votes extraction: `pipeline/extract_mp_votes.py`
- Database layer: `database.py`
- Similarity compute: `similarity/compute.py`
- Similarity lookup: `similarity/lookup.py`
- Tests: `tests/` (pytest, offline)

## Branch
`main`

## Progress
### Done
- [x] All items listed under "Completed This Session" above

### In Progress
- [ ] Short QA: sample similarity lookups and sanity checks (N=20-50) against `fused_embeddings`/similarity results
  - Purpose: validate fused vectors, detect padding/anomalies, and confirm similarity rows are sensible
  - Estimated effort: 30–60 minutes
- [ ] Trajectories tab: chart not rendering; root cause found (silent exception in `st.plotly_chart`)
  - Fix applied: commit 72d1c20, which shows `st.error` + diagnostics when rendering fails
  - Pending: user to verify fix by running Explorer with `EXPLORER_DEBUG_TRAJECTORIES=1`

### Blocked
- None blocking for QA. Earlier provider failures affected the embedding rerun, but the rerun completed according to the fusion run summary (UNCONFIRMED)

## Key Decisions
- **Retry strategy on provider failure**: On repeated provider failures, retry embedding batches with a smaller batch_size (e.g. 50 -> 20) or switch provider. Rationale: smaller batches reduce per-request risk and increase the chance of partial success; switch provider if failures persist. (UNCONFIRMED)

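One way the retry strategy could look in code is sketched below. `ProviderError`, `embed_with_fallback`, and the `embed_batch` callable are all hypothetical names introduced for illustration; the real logic in `pipeline/text_pipeline.py` or `scripts/rerun_embeddings.py` is not shown in this ledger. The batch sizes 50 and 20 are the example values from the bullet above.

```python
import logging

logger = logging.getLogger(__name__)


class ProviderError(Exception):
    """Hypothetical exception raised when the embedding provider fails."""


def embed_with_fallback(texts, embed_batch, batch_sizes=(50, 20)):
    """Embed texts, retrying the failed remainder with smaller batches.

    embed_batch is a caller-supplied callable (hypothetical signature):
    it takes a list of texts and returns one vector per text, or raises
    ProviderError. Returns (vectors, failed_indices).
    """
    vectors = [None] * len(texts)
    pending = list(range(len(texts)))
    for size in batch_sizes:
        still_failed = []
        for start in range(0, len(pending), size):
            chunk = pending[start:start + size]
            try:
                result = embed_batch([texts[i] for i in chunk])
            except ProviderError as exc:
                logger.warning("batch of %d failed: %s", len(chunk), exc)
                still_failed.extend(chunk)
                continue
            for i, vec in zip(chunk, result):
                vectors[i] = vec
        pending = still_failed
        if not pending:
            break
    return vectors, pending
```

The returned `failed_indices` list is what would be archived for a later retry or a provider switch; switching providers itself is left out of this sketch.
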
## Next Steps
1. Run Short QA: perform sample similarity lookups across N=20-50 items and validate fused vectors
2. Inspect `thoughts/ledgers/fusion_similarity_summary.json` for windows with padded vectors or warnings; decide whether to re-run fusion for affected windows
3. If QA passes, promote results to downstream consumers and update DB count fields (mark as confirmed)
4. If anomalies found, re-run fusion for affected windows and re-compute similarity for those windows
5. Archive list of any failed motion IDs from embedding run and consider retry with smaller batch_size or alternate provider (if any failures remain) (UNCONFIRMED)

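The short QA in step 1 could start from a few mechanical checks on sampled vectors before any manual review of similarity results. The sketch below assumes the fused vectors have already been loaded into a dict mapping motion id to a list of floats (a hypothetical in-memory shape), and the specific checks are suggestions rather than an agreed QA protocol.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def qa_sample(vectors, n=20, seed=0):
    """Sample up to n motion ids and report basic anomalies per vector.

    vectors maps motion_id -> list of floats (hypothetical in-memory shape).
    """
    rng = random.Random(seed)
    ids = rng.sample(sorted(vectors), min(n, len(vectors)))
    expected_dim = max((len(v) for v in vectors.values()), default=0)
    issues = {}
    for motion_id in ids:
        vec = vectors[motion_id]
        found = []
        if len(vec) != expected_dim:
            found.append("dimension mismatch")
        if not all(math.isfinite(x) for x in vec):
            found.append("non-finite value")
        elif not any(vec):
            found.append("all zeros")
        elif abs(cosine(vec, vec) - 1.0) > 1e-9:
            found.append("self-similarity is not 1")
        if found:
            issues[motion_id] = found
    return issues
```

An empty result only means the sampled vectors are well-formed; whether the nearest neighbours are sensible still needs a human look at titles and scores.
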
## File Operations
### Read
- `data/motions.db`
- `scripts/rerun_embeddings.py` (invoked)
- `thoughts/ledgers/fusion_similarity_summary.json` (run summary)

### Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (this file)
- `thoughts/ledgers/fusion_similarity_summary.json` (aggregate per-window results from fusion+similarity run)
- `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`

## Critical Context
- Rerun embeddings started 2026-03-23T01:42Z; final embedding count recorded by the fusion run = **28,172** (see `thoughts/ledgers/fusion_similarity_summary.json`) (mapping to the `embeddings` table UNCONFIRMED)
- Fusion + similarity run (2026-03-23T15:30:00Z → 2026-03-23T16:47:04Z) produced aggregate inserts recorded in the summary JSON:
  - embeddings: 28,172
  - fused_embeddings (aggregate inserts across windows): 40,524
  - similarity_rows (aggregate): 405,216
  - Note: the fused_embeddings and similarity_rows totals are aggregate per-window insert counts (they may double-count motions appearing in multiple windows); mapping to unique table counts is UNCONFIRMED.
- Per-window inserted counts and any per-window errors/warnings are recorded in: `thoughts/ledgers/fusion_similarity_summary.json`.
- Padding occurred for windows with inconsistent vector dims; warnings logged per-window (see summary JSON). Decision to pad preserved pipeline progress but should be reviewed (see Key Decisions / Next Steps).
- Earlier provider error: Batch 951..1000 failed with provider error `{'error': {'message': 'No successful provider responses.', 'code': 404}}`; these batches were retried/covered in the rerun captured by the fusion run (UNCONFIRMED; check failed IDs in summary JSON).

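The ledger does not say how vectors with inconsistent dimensions were padded. As a reference point for the review, here is a minimal sketch that assumes right-hand zero-padding to the longest vector in a window; `pad_vectors` is a hypothetical helper, not the function used in `pipeline/fusion.py`.

```python
def pad_vectors(vectors):
    """Zero-pad every vector to the longest length in the window.

    Returns (padded_vectors, padded_indices) so callers can log which
    rows were changed.
    """
    if not vectors:
        return [], []
    target = max(len(v) for v in vectors)
    padded, changed = [], []
    for index, vec in enumerate(vectors):
        missing = target - len(vec)
        if missing:
            changed.append(index)
        padded.append(list(vec) + [0.0] * missing)
    return padded, changed
```

Zero-padding leaves dot products between originally same-length vectors unchanged, but it makes short and long vectors comparable only on their shared leading dimensions, which is one reason the padded windows deserve a second look.
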
## Working Set
- Branch: `main`
- Key files: `data/motions.db`, `scripts/rerun_embeddings.py`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`, `thoughts/ledgers/fusion_similarity_summary.json`, `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`