diff --git a/.github/workflows/mindmodel-validate.yml b/.github/workflows/mindmodel-validate.yml
deleted file mode 100644
index 1a43872..0000000
--- a/.github/workflows/mindmodel-validate.yml
+++ /dev/null
@@ -1,39 +0,0 @@
-name: mindmodel validate
-
-on:
-  push:
-    branches: [ main ]
-  pull_request:
-    branches: [ main ]
-  schedule:
-    - cron: '0 4 * * 0' # weekly
-
-jobs:
-  validate:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.11'
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r requirements.txt || true
-
-      - name: Run tests
-        run: |
-          python -m pytest -q
-
-      - name: Run mindmodel validator if manifest exists
-        if: ${{ always() }}
-        run: |
-          if [ -f .mindmodel/manifest.yaml ]; then
-            python -m scripts.mindmodel.cli || true
-          else
-            echo "No .mindmodel/manifest.yaml present — skipping validator"
-          fi
diff --git a/EMBEDDING_ANALYSIS.md b/EMBEDDING_ANALYSIS.md
deleted file mode 100644
index 72dc359..0000000
--- a/EMBEDDING_ANALYSIS.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# Tweede Kamer Parliamentary Embedding Analysis
-
-## Goal
-
-Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.
-
-## Data
-
-|Source|Content|
-|------|-------|
-|MP × motion vote matrix|yes / no / abstain per MP per motion|
-|Motion text|Dutch-language motion descriptions|
-|MP metadata|name, party, entry/exit dates|
-|Timestamps|date of each vote|
-
-## Approach: Late Fusion
-
-Two independent embedding signals, combined per motion.
-
-### 1. Vote embeddings (SVD)
-
-- Build a sparse MP × motion matrix per time window
-- Apply SVD to get latent vectors for both MPs and motions
-- Encodes political alignment from actual voting behavior
-
-### 2. Text embeddings (Qwen3-0.6B)
-
-- Embed each motion's text using Qwen3-0.6B (multilingual, Dutch supported)
-- Encodes semantic/policy topic of the motion
-- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
-
-### 3. Fusion
-
-Concatenate (or weighted sum) the SVD motion vector and text vector into a single motion embedding. MPs retain their SVD vectors only.
-
-## Temporal Tracking
-
-### Time windows
-
-- Default: **quarterly** (flexible — can be per half-year or per N votes)
-- Adaptive option: fixed number of votes per window (e.g. 200) for stable SVD regardless of parliamentary rhythm
-
-### Procrustes alignment
-
-SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous, using overlapping MPs as anchors.
-
-```
-R = argmin || W1[common] - W2[common] @ R ||
-W2_aligned = W2 @ R  # applied to all MPs, including newcomers
-```
-
-- Only overlapping MPs are needed to estimate R
-- New MPs are placed into the aligned space via their voting pattern
-- High Procrustes disparity score = structural political shift, not just individual drift
-
-### Election transitions
-
-At term boundaries (~60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and first quarter of the new term, using only returning MPs.
-
-## Analysis
-
-|Question|Method|
-|--------|------|
-|MP drift over time|trajectory of MP vector across aligned windows|
-|Political axis|first SVD component, or defined by anchor parties (e.g. VVD vs SP)|
-|Swing voters|MPs closest to the boundary between party clusters|
-|Thematic clustering|UMAP on fused motion embeddings|
-|Cross-party coalitions|motions where party cluster boundaries blur|
-|Party cohesion|variance of MP vectors within a party per window|
-
-## Stack
-
-|Component|Tool|
-|---------|----|
-|Matrix factorization|`scipy.sparse.linalg.svds`|
-|Procrustes alignment|`scipy.spatial.procrustes`|
-|Text embeddings|Qwen3-0.6B via `sentence-transformers` or vLLM|
-|Dimensionality reduction|UMAP|
-|Visualization|Plotly (interactive trajectories)|
-|Data handling|ibis / pandas|
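The deleted analysis file's alignment step is compact enough to pin down with code. Below is a minimal, illustrative sketch (not project code) of the per-window SVD plus the Procrustes rotation described above. It uses `scipy.linalg.orthogonal_procrustes`, which solves the stated argmin directly; note that `scipy.spatial.procrustes`, named in the Stack table, standardizes its inputs and returns a disparity score rather than the rotation matrix.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def embed_window(votes: csr_matrix, k: int = 10):
    """Factor an MP x motion vote matrix (+1 yes / -1 no / 0 abstain)."""
    U, s, Vt = svds(votes.asfptype(), k=k)
    # MP vectors scaled by singular values; motion vectors as rows
    return U * s, Vt.T


def align_windows(W1: np.ndarray, W2: np.ndarray, common: np.ndarray) -> np.ndarray:
    """Rotate window 2's MP vectors onto window 1's axes via shared MPs.

    Solves R = argmin ||W1[common] - W2[common] @ R||_F over orthogonal R,
    then applies R to all rows, so newcomers land in the aligned space too.
    """
    R, _ = orthogonal_procrustes(W2[common], W1[common])
    return W2 @ R


def disparity(W1: np.ndarray, W2_aligned: np.ndarray, common: np.ndarray) -> float:
    """Relative residual after alignment; large values suggest structural shift."""
    return float(np.linalg.norm(W1[common] - W2_aligned[common])
                 / np.linalg.norm(W1[common]))
```

Both windows must share the same MP row ordering for `common` (an integer index array of returning MPs) to be meaningful.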
diff --git a/fix_database.py b/fix_database.py
deleted file mode 100644
index 6183618..0000000
--- a/fix_database.py
+++ /dev/null
@@ -1,67 +0,0 @@
-# fix_database.py (updated version)
-import os
-import duckdb
-from config import config
-
-def fix_database():
-    """Completely reset the database with correct schema"""
-
-    # Remove the existing database file completely
-    if os.path.exists(config.DATABASE_PATH):
-        os.remove(config.DATABASE_PATH)
-        print("Removed existing database file")
-
-    # Create directory if it doesn't exist
-    os.makedirs(os.path.dirname(config.DATABASE_PATH), exist_ok=True)
-
-    # Initialize with correct schema
-    conn = duckdb.connect(config.DATABASE_PATH)
-
-    # Create sequence for auto-incrementing IDs
-    conn.execute("CREATE SEQUENCE motions_id_seq START 1")
-
-    # Create motions table with sequence-based auto-increment
-    conn.execute("""
-        CREATE TABLE motions (
-            id INTEGER DEFAULT nextval('motions_id_seq'),
-            title TEXT NOT NULL,
-            description TEXT,
-            date DATE,
-            policy_area TEXT,
-            voting_results JSON,
-            winning_margin FLOAT,
-            controversy_score FLOAT,
-            layman_explanation TEXT,
-            url TEXT UNIQUE,
-            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-            PRIMARY KEY (id)
-        )
-    """)
-
-    conn.execute("""
-        CREATE TABLE user_sessions (
-            session_id TEXT PRIMARY KEY,
-            user_votes JSON,
-            completed_motions INTEGER DEFAULT 0,
-            total_motions INTEGER DEFAULT 10,
-            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-            last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
-        )
-    """)
-
-    conn.execute("""
-        CREATE TABLE party_results (
-            session_id TEXT,
-            party_name TEXT,
-            agreement_percentage FLOAT,
-            agreed_motions JSON,
-            disagreed_motions JSON,
-            PRIMARY KEY (session_id, party_name)
-        )
-    """)
-
-    conn.close()
-    print("Database recreated with correct schema using sequences")
-
-if __name__ == "__main__":
-    fix_database()
diff --git a/read.py b/read.py
deleted file mode 100644
index b9f55cf..0000000
--- a/read.py
+++ /dev/null
@@ -1,9 +0,0 @@
-import ibis
-
-con = ibis.duckdb.connect('data/motions.db')
-
-print(con.tables)
-
-for t in con.tables:
-    print(con.table(t).head().execute().to_string())
-
diff --git a/reset.py b/reset.py
deleted file mode 100644
index 6d49ddc..0000000
--- a/reset.py
+++ /dev/null
@@ -1,3 +0,0 @@
-# Run this to reset your database
-from database import db
-db.reset_database()
diff --git a/test.py b/test.py
deleted file mode 100644
index e3ce42e..0000000
--- a/test.py
+++ /dev/null
@@ -1,16 +0,0 @@
-# test_single_insert.py
-from database import db
-
-test_motion = {
-    'title': 'Test Motion',
-    'description': 'This is a test motion',
-    'date': '2024-01-01',
-    'policy_area': 'Test',
-    'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
-    'winning_margin': 0.5,
-    'url': 'https://test.com/motion1'
-}
-
-success = db.insert_motion(test_motion)
-print(f"Insert successful: {success}")
-
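One detail of the deleted fix_database.py worth noting: DuckDB lacks a MySQL-style AUTO_INCREMENT keyword, so generated ids come from a sequence supplying the column default. A self-contained, in-memory demo of that pattern (table trimmed to the relevant columns):

```python
import duckdb

con = duckdb.connect()  # in-memory database for the demo
con.execute("CREATE SEQUENCE motions_id_seq START 1")
con.execute("""
    CREATE TABLE motions (
        id INTEGER DEFAULT nextval('motions_id_seq'),
        title TEXT NOT NULL,
        url TEXT UNIQUE,
        PRIMARY KEY (id)
    )
""")
# No id supplied: each insert draws the next value from the sequence
con.execute("INSERT INTO motions (title, url) VALUES ('Motion A', 'https://example.org/a')")
con.execute("INSERT INTO motions (title, url) VALUES ('Motion B', 'https://example.org/b')")
print(con.execute("SELECT id, title FROM motions ORDER BY id").fetchall())
# -> [(1, 'Motion A'), (2, 'Motion B')]
```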
diff --git a/thoughts/ledgers/CONTINUITY_stemwijzer.md b/thoughts/ledgers/CONTINUITY_stemwijzer.md
index 5608a92..1c60d29 100644
--- a/thoughts/ledgers/CONTINUITY_stemwijzer.md
+++ b/thoughts/ledgers/CONTINUITY_stemwijzer.md
@@ -1,5 +1,5 @@
 # Session: stemwijzer
-Updated: 2026-03-23T09:00:00Z
+Updated: 2026-03-25T12:00:00Z
 
 ## Goal
 2D political compass + motion similarity search from parliamentary votes + motion text. Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.
@@ -12,15 +12,15 @@ Updated: 2026-03-23T09:00:00Z
 - Do NOT modify `app.py` or `scheduler.py`
 - Use `.venv/bin/python` (Arch Linux system Python is externally managed)
 
-## Current DB State (verified 2026-03-22 ~16:00)
+## Current DB State (verified 2026-03-22 ~16:00; additional run summary 2026-03-23)
 
 | Table | Rows |
 |---|---|
 | motions | 10,613 |
 | embeddings | 10,753 |
 | svd_vectors | 24,528 |
-| fused_embeddings | **10,613** (1:1 with motions, 0 duplicates) |
-| similarity_cache | **212,206** (top_k=20, all annual windows) |
+| fused_embeddings | **10,613** (1:1 with motions, 0 duplicates) — per-run fusion summary reported larger aggregate inserts (see Critical Context) (UNCONFIRMED mapping) |
+| similarity_cache | **212,206** (top_k=20, all annual windows) — fusion+similarity run produced a larger set of inserted rows (see Critical Context) (UNCONFIRMED mapping) |
 | mp_votes | 199,967 |
 | mp_metadata | 798 |
@@ -48,6 +48,8 @@ Updated: 2026-03-23T09:00:00Z
 - [x] Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
 - [x] Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
 - [x] Test suite: **34 passed, 2 skipped** ✅
+- [x] Rerun embeddings (scripts/rerun_embeddings.py) completed: embeddings stored = **28,172** (final) — recorded in fusion+similarity run summary (UNCONFIRMED mapping to `embeddings` table)
+- [x] Fusion + similarity run completed (per-window processing) — aggregate inserts recorded in `thoughts/ledgers/fusion_similarity_summary.json`
 
 ## Key Decisions
 - `store_fused_embedding` (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs.
@@ -82,41 +84,45 @@ Updated: 2026-03-23T09:00:00Z
 - [x] All items listed under "Completed This Session" above
 
 ### In Progress
-- [ ] Rerun embeddings: started scripts/rerun_embeddings.py against `data/motions.db`
-  - Start time: 2026-03-23T01:42:00Z (approx)
-  - Current progress: embeddings stored = 950 / total motions = 28,172
-  - fused_embeddings = 0 (not started)
-  - similarity_cache = 0 (not started)
+- [ ] Short QA: sample similarity lookups and sanity checks (N=20–50) against `fused_embeddings`/similarity results
+  - Purpose: validate fused vectors, detect padding/anomalies, and confirm similarity rows are sensible
+  - Estimated effort: 30–60 minutes
 
 ### Blocked
-- Not fully blocked, but encountering provider failures and warnings that slow progress:
-  - Batch 951..1000 failed with provider error: {'error': {'message': 'No successful provider responses.', 'code': 404}} (recorded)
-  - Occasional connection pool warnings during earlier body fetch phase (logged)
-  - Provider failures are transient but may require retries or provider change if repeated
+- None blocking for QA; earlier provider failures affected the embedding rerun, but the rerun was completed per the fusion run summary (UNCONFIRMED)
 
 ## Key Decisions
 - **Retry strategy on provider failure**: On repeated provider failures, retry embedding batches with smaller batch_size (e.g. 50 -> 20) or switch provider. Rationale: smaller batches reduce per-request risk and increase the chance of partial success; switch provider if failures persist. (UNCONFIRMED)
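That retry decision can be sketched concretely. In the sketch below, `embed_batch` is a hypothetical stand-in for the project's provider adapter (the real call signature is not recorded in this ledger): it takes a list of texts, returns a list of vectors, and raises on provider errors.

```python
import time


def embed_with_fallback(texts, embed_batch, sizes=(50, 20), retries=2, delay=5.0):
    """Retry embedding with progressively smaller batch sizes before giving up."""
    last_error = None
    for batch_size in sizes:
        for attempt in range(1, retries + 1):
            try:
                vectors = []
                for i in range(0, len(texts), batch_size):
                    vectors.extend(embed_batch(texts[i:i + batch_size]))
                return vectors
            except Exception as exc:  # e.g. "No successful provider responses."
                last_error = exc
                time.sleep(delay * attempt)  # linear backoff before retrying
    # Retries restart the whole pass for simplicity; a real version would
    # checkpoint completed batches. All sizes exhausted: switch provider.
    raise RuntimeError("embedding failed at all batch sizes") from last_error
```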
 
 ## Next Steps
-1. Continue the rerun_embeddings job until completion; monitor batches closely
-2. If provider failures repeat, retry failed batches with smaller batch_size (50 -> 20) or switch provider (as above)
-3. On completion, update ledger with final counts and list any failed motion IDs
-4. If fused_embeddings / similarity_cache remain 0 after embeddings finished, run fusion and similarity recompute pipelines
+1. Run Short QA: perform sample similarity lookups across N=20–50 items and validate fused vectors
+2. Inspect `thoughts/ledgers/fusion_similarity_summary.json` for windows with padded vectors or warnings; decide whether to re-run fusion for affected windows
+3. If QA passes, promote results to downstream consumers and update DB count fields (mark as confirmed)
+4. If anomalies are found, re-run fusion for the affected windows and recompute similarity for those windows
+5. Archive the list of any failed motion IDs from the embedding run and consider retrying with a smaller batch_size or an alternate provider (if any failures remain) (UNCONFIRMED)
 
 ## File Operations
 
 ### Read
 - `data/motions.db`
 - `scripts/rerun_embeddings.py` (invoked)
+- `thoughts/ledgers/fusion_similarity_summary.json` (run summary)
 
 ### Modified
 - `thoughts/ledgers/CONTINUITY_stemwijzer.md` (this file)
+- `thoughts/ledgers/fusion_similarity_summary.json` (aggregate per-window results from fusion+similarity run)
+- `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`
 
 ## Critical Context
-- Rerun started 2026-03-23T01:42Z; current embeddings stored = 950 of 28,172 total motions.
-- Recent error: Batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}} — these batch numbers and error payload should be retried.
-- ETA: approx 1.5–2.5 hours remaining at current rate (UNCONFIRMED, depends on provider stability)
-- Earlier stage produced occasional connection pool warnings while fetching motion bodies; these did not stop progress but may indicate transient network instability.
+- Rerun embeddings started 2026-03-23T01:42Z; final embedding count recorded by the fusion run = **28,172** (see `thoughts/ledgers/fusion_similarity_summary.json`) (UNCONFIRMED mapping to `embeddings` table)
+- Fusion + similarity run (2026-03-23T15:30:00Z → 2026-03-23T16:47:04Z) produced aggregate inserts recorded in the summary JSON:
+  - embeddings: 28,172
+  - fused_embeddings (aggregate inserts across windows): 40,524
+  - similarity_rows (aggregate): 405,216
+- Note: the fused_embeddings and similarity_rows totals are aggregate per-window insert counts (they may double-count motions appearing in multiple windows) — the mapping to unique table counts is UNCONFIRMED.
+- Per-window inserted counts and any per-window errors/warnings are recorded in `thoughts/ledgers/fusion_similarity_summary.json`.
+- Padding occurred for windows with inconsistent vector dims; warnings were logged per window (see summary JSON). The decision to pad preserved pipeline progress but should be reviewed (see Key Decisions / Next Steps).
+- Earlier provider error: Batch 951..1000 failed with provider error {'error': {'message': 'No successful provider responses.', 'code': 404}} — these batches were retried/covered in the rerun captured by the fusion run (UNCONFIRMED; check failed IDs in the summary JSON).
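For Next Steps item 1, the QA pass could look like the sketch below. The table and column names (`motion_id`, `vector`, `similar_id`, `score`) are assumptions based on this ledger's wording; check them against the actual schema in database.py before running.

```python
import json
import random

import duckdb

con = duckdb.connect("data/motions.db", read_only=True)

# Sample motion ids that have a fused embedding (column names are assumed)
ids = [r[0] for r in con.execute("SELECT motion_id FROM fused_embeddings").fetchall()]
for motion_id in random.sample(ids, k=min(20, len(ids))):
    # Vectors are assumed stored as JSON text; parse and sanity-check
    (raw,) = con.execute(
        "SELECT vector FROM fused_embeddings WHERE motion_id = ?",
        [motion_id]).fetchone()
    vec = json.loads(raw)
    assert any(abs(x) > 1e-9 for x in vec), f"{motion_id}: zero vector (padding?)"
    # Top cached neighbours should exist and be ordered by score
    rows = con.execute(
        "SELECT similar_id, score FROM similarity_cache "
        "WHERE motion_id = ? ORDER BY score DESC LIMIT 5",
        [motion_id]).fetchall()
    print(motion_id, "dim:", len(vec), "top neighbours:", rows)
```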
 
 ## Working Set
 - Branch: `main`
-- Key files: `data/motions.db`, `scripts/rerun_embeddings.py`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`
+- Key files: `data/motions.db`, `scripts/rerun_embeddings.py`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`, `thoughts/ledgers/fusion_similarity_summary.json`, `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`
diff --git a/thoughts/thoughts/shared/plans/2026-03-19-stemwijzer-plan.md b/thoughts/thoughts/shared/plans/2026-03-19-stemwijzer-plan.md
deleted file mode 100644
index 2748634..0000000
--- a/thoughts/thoughts/shared/plans/2026-03-19-stemwijzer-plan.md
+++ /dev/null
@@ -1,106 +0,0 @@
----
-date: 2026-03-19
-topic: "Stemwijzer AI & DB implementation plan"
-status: draft
----
-
-## Summary
-
-Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
-Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive, and well-tested.
-
-## High-level approach (chosen)
-
-- Add **ai_provider**: adapter exposing get_embedding(text) and chat_completion(messages) with retries and ProviderError.
-- Add **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan; see the sketch after the schedule section).
-- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
-- Refactor summarizer to call ai_provider and optionally store embeddings.
-- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.
-
-## Micro-tasks (11 tasks)
-
-All tasks are intentionally small (file-level changes + tests). Estimates assume one developer full-time; see the Estimates & schedule and Risks & mitigations sections below.
-
-Batch 1 (foundation, parallelizable)
-
-1. Add test fixtures for a temporary DuckDB (tests/conftest.py) — 2h — low risk
-2. Add migration SQL to create the embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
-3. Add ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
-4. Add scraper SCRAPING_DELAY default (src/scraper.py) + tests — 1h — low risk
-5. Fix the reset script to run migrations (src/reset.py) + tests — 2h — low risk
-
-Batch 2 (core modules)
-
-6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
-7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
-8. Refactor summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk
-
-Batch 3 (integration)
-
-9. Add CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
-10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk
-
-Batch 4 (docs/config)
-
-11. Add .env.example entries for new env vars — 1h — low risk
-
-## PR order (recommended, small focused PRs)
-
-1. PR A — tests/conftest (fixtures)
-2. PR B — migration SQL (embeddings table)
-3. PR C — ai_provider + tests
-4. PR D — database store/search helpers + tests
-5. PR E — query_dal + tests
-6. PR F — summarizer refactor + tests
-7. PR G — cli_search + tests
-8. PR H — app read changes + tests
-9. PR I — scraper/reset small fixes + tests
-10. PR J — .env.example
-
-## Estimates & schedule (one dev, full-time ~8h/day)
-
-- Total estimated effort: ~50 hours (~6.25 days) + buffer → ~7 calendar days.
-- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batch 3 (1 day), Buffer/Review (1 day).
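As a concrete reading of Task 6's "naive Python cosine scan": the sketch below assumes vectors are stored as JSON text per the Summary; the function and column names follow the plan's wording but are placeholders, not the final database.py API.

```python
import json
import math


def store_embedding(con, motion_id: int, vector: list[float]) -> None:
    """Persist one embedding as JSON text (the plan's minimal approach)."""
    con.execute(
        "INSERT INTO embeddings (motion_id, vector) VALUES (?, ?)",
        [motion_id, json.dumps(vector)],
    )


def search_similar(con, query: list[float], top_k: int = 5):
    """Naive scan: load every stored vector and rank by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    rows = con.execute("SELECT motion_id, vector FROM embeddings").fetchall()
    scored = [(mid, cosine(query, json.loads(raw))) for mid, raw in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```

Each query is a full O(n·d) scan, which is exactly the performance risk flagged under Risks & mitigations; swapping in ANN/FAISS later would only replace search_similar.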
-
-## DB migration steps
-
-- Add migrations/2026-03-19-add-embeddings.sql (additive).
-- Apply on staging first; back up the DB, run the migration, verify `SELECT count(*) FROM embeddings`.
-- No changes to the motions table in the first iteration.
-
-## Testing strategy
-
-- Unit tests for ai_provider (mock HTTP responses). Use monkeypatch to avoid network calls.
-- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
-- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse JSON fields.
-- Summarizer tests mock ai_provider to assert DB writes (summary and optional embedding).
-
-## Error handling
-
-- ai_provider: retry/backoff for transient errors; raise ProviderError for terminal failures.
-- Summarizer: non-fatal on AI failures — write a fallback/empty summary, log it, and surface a message in the UI when interactive.
-- DB functions: keep try/except patterns and ensure connections are closed on error.
-
-## Risks & mitigations
-
-- ai_provider changes: medium risk — mitigate with retries, a clear ProviderError, and thorough unit tests.
-- Embedding search: medium risk (naive scan performance) — mitigate by keeping the implementation simple and planning for ANN/FAISS later.
-- ibis usage: medium risk — mitigate with tests and keep query_dal narrow.
-
-## Next actions (what I'll do now)
-
-- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
-- I will NOT start applying code changes automatically. If you want, I can:
-  - (A) Create the first PR patch (tests/conftest.py + migration) and open a draft for review, or
-  - (B) Start implementing Task 3 (ai_provider) next.
-
-Interrupt if you want changes to the plan or a different PR ordering. Otherwise tell me which task to start and I'll create the first patch.
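To make the testing strategy concrete, Task 1's temporary-DuckDB fixture might look like the sketch below; the inline CREATE TABLE stands in for applying the real migration from Task 2, and the schema is an assumption.

```python
# tests/conftest.py (sketch)
import duckdb
import pytest


@pytest.fixture
def temp_db(tmp_path):
    """Yield a connection to a fresh, throwaway DuckDB file per test."""
    con = duckdb.connect(str(tmp_path / "test_motions.db"))
    # Stand-in for applying migrations/2026-03-19-add-embeddings.sql
    con.execute("""
        CREATE TABLE embeddings (
            motion_id INTEGER,
            vector JSON
        )
    """)
    yield con
    con.close()
```

A test then takes `temp_db` as a parameter and gets an isolated database, so storing and searching embeddings never touches `data/motions.db`.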