Cleanup performed by assistant: removed generated caches and stale files: __pycache__, *.pyc, .pytest_cache, .ruff_cache, dummy/, test.py, read.py, reset.py, fix_database.py, thoughts/thoughts/, .github/workflows/mindmodel-validate.yml. No push performed.

parent 867fcd1989
commit a20bd834fc
@@ -1,39 +0,0 @@
name: mindmodel validate

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 4 * * 0' # weekly

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt || true

      - name: Run tests
        run: |
          python -m pytest -q

      - name: Run mindmodel validator if manifest exists
        if: ${{ always() }}
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            python -m scripts.mindmodel.cli || true
          else
            echo "No .mindmodel/manifest.yaml present — skipping validator"
          fi
@@ -1,90 +0,0 @@
# Tweede Kamer Parliamentary Embedding Analysis

## Goal

Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.

## Data

| Source | Content |
|--------|---------|
| MP × motion vote matrix | yes / no / abstain per MP per motion |
| Motion text | Dutch-language motion descriptions |
| MP metadata | name, party, entry/exit dates |
| Timestamps | date of each vote |
## Approach: Late Fusion

Two independent embedding signals, combined per motion.

### 1. Vote embeddings (SVD)

- Build a sparse MP × motion matrix per time window
- Apply SVD to get latent vectors for both MPs and motions
- Encodes political alignment from actual voting behavior
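The steps above can be sketched as follows (a toy example: the +1/-1/0 vote encoding and the tiny matrix are illustrative assumptions, not the real dataset):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy MP x motion vote matrix: +1 = yes, -1 = no, 0 = abstain/absent.
# Rows are MPs, columns are motions; a real window would be far larger.
votes = csr_matrix(np.array([
    [ 1,  1, -1, -1],
    [ 1,  1, -1,  0],
    [-1, -1,  1,  1],
    [-1,  0,  1,  1],
], dtype=float))

k = 2  # latent dimensions; must satisfy k < min(votes.shape)
U, s, Vt = svds(votes, k=k)

mp_vecs = U * s        # one latent vector per MP
motion_vecs = Vt.T     # one latent vector per motion, in the same space
```

Dot products between MP vectors then reflect voting alignment: the first two MPs (a voting bloc) land close together, opposite the last two.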
### 2. Text embeddings (Qwen3-0.6B)

- Embed each motion's text using Qwen3-0.6B (multilingual, Dutch supported)
- Encodes semantic/policy topic of the motion
- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
### 3. Fusion

Concatenate (or weighted sum) the SVD motion vector and text vector into a single motion embedding. MPs retain their SVD vectors only.
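A minimal fusion sketch, assuming concatenation with per-signal L2 normalization so neither signal dominates (the normalization and weighting scheme are assumptions, not specified above):

```python
import numpy as np

def fuse_motion(svd_vec, text_vec, w=0.5):
    """Concatenate the vote-based and text-based vectors for one motion.

    Each signal is L2-normalized first; w balances their contributions.
    """
    a = np.asarray(svd_vec, dtype=float)
    b = np.asarray(text_vec, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.concatenate([w * a, (1.0 - w) * b])

fused = fuse_motion([3.0, 4.0], [1.0, 0.0, 0.0])
```

Since MPs keep only their (lower-dimensional) SVD vectors, MP-to-motion comparisons use just the first block of the fused vector.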
## Temporal Tracking

### Time windows

- Default: **quarterly** (flexible — can be per half-year or per N votes)
- Adaptive option: fixed number of votes per window (e.g. 200) for stable SVD regardless of parliamentary rhythm
### Procrustes alignment

SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous, using overlapping MPs as anchors.

```
R = argmin || W1[common] - W2[common] @ R ||
W2_aligned = W2 @ R   # applied to all MPs, including newcomers
```

- Only overlapping MPs are needed to estimate R
- New MPs are placed into the aligned space via their voting pattern
- High Procrustes disparity score = structural political shift, not just individual drift
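The two pseudocode lines above can be made concrete with `scipy.linalg.orthogonal_procrustes`, shown here on synthetic data (the Stack section lists `scipy.spatial.procrustes` instead, which additionally centers and scales; this sketch solves only the rotation):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

W1 = rng.normal(size=(10, 3))                 # window-1 MP vectors
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # an arbitrary basis rotation
W2 = W1 @ Q                                   # window 2: same MPs, rotated axes

common = slice(0, 6)                          # pretend only 6 MPs overlap
R, _ = orthogonal_procrustes(W2[common], W1[common])

W2_aligned = W2 @ R                           # applied to ALL rows, incl. newcomers
```

Because R is estimated from the overlap only, rows outside `common` (newcomers) are carried into the aligned space for free.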
### Election transitions

At term boundaries (~60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and first quarter of the new term, using only returning MPs.
## Analysis

| Question | Method |
|----------|--------|
| MP drift over time | trajectory of MP vector across aligned windows |
| Political axis | first SVD component, or defined by anchor parties (e.g. VVD vs SP) |
| Swing voters | MPs closest to the boundary between party clusters |
| Thematic clustering | UMAP on fused motion embeddings |
| Cross-party coalitions | motions where party cluster boundaries blur |
| Party cohesion | variance of MP vectors within a party per window |
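The party-cohesion metric in the table could be computed as follows (a sketch: the aggregation, mean per-dimension variance, is an assumption):

```python
import numpy as np

def party_cohesion(mp_vecs, parties):
    """Within-party variance of MP vectors for one window; lower = more cohesive.

    mp_vecs: (n_mps, dim) array of aligned MP vectors.
    parties: sequence of party labels, one per MP.
    """
    parties = np.asarray(parties)
    return {
        p: float(mp_vecs[parties == p].var(axis=0).mean())
        for p in np.unique(parties)
    }

vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
cohesion = party_cohesion(vecs, ["A", "A", "B", "B"])
```

Here party A votes as one bloc (zero variance) while party B is split, so B's score is higher.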
## Stack

| Component | Tool |
|-----------|------|
| Matrix factorization | `scipy.sparse.linalg.svds` |
| Procrustes alignment | `scipy.spatial.procrustes` |
| Text embeddings | Qwen3-0.6B via `sentence-transformers` or vLLM |
| Dimensionality reduction | UMAP |
| Visualization | Plotly (interactive trajectories) |
| Data handling | ibis / pandas |
@@ -1,67 +0,0 @@
# fix_database.py (updated version)
import os
import duckdb
from config import config


def fix_database():
    """Completely reset the database with correct schema"""

    # Remove the existing database file completely
    if os.path.exists(config.DATABASE_PATH):
        os.remove(config.DATABASE_PATH)
        print("Removed existing database file")

    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(config.DATABASE_PATH), exist_ok=True)

    # Initialize with correct schema
    conn = duckdb.connect(config.DATABASE_PATH)

    # Create sequence for auto-incrementing IDs
    conn.execute("CREATE SEQUENCE motions_id_seq START 1")

    # Create motions table with sequence-based auto-increment
    conn.execute("""
        CREATE TABLE motions (
            id INTEGER DEFAULT nextval('motions_id_seq'),
            title TEXT NOT NULL,
            description TEXT,
            date DATE,
            policy_area TEXT,
            voting_results JSON,
            winning_margin FLOAT,
            controversy_score FLOAT,
            layman_explanation TEXT,
            url TEXT UNIQUE,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (id)
        )
    """)

    conn.execute("""
        CREATE TABLE user_sessions (
            session_id TEXT PRIMARY KEY,
            user_votes JSON,
            completed_motions INTEGER DEFAULT 0,
            total_motions INTEGER DEFAULT 10,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE party_results (
            session_id TEXT,
            party_name TEXT,
            agreement_percentage FLOAT,
            agreed_motions JSON,
            disagreed_motions JSON,
            PRIMARY KEY (session_id, party_name)
        )
    """)

    conn.close()
    print("Database recreated with correct schema using sequences")


if __name__ == "__main__":
    fix_database()
@@ -1,9 +0,0 @@
import ibis

con = ibis.duckdb.connect('data/motions.db')

print(con.tables)

for t in con.tables:
    print(con.table(t).head().execute().to_string())
@@ -1,3 +0,0 @@
# Run this to reset your database
from database import db
db.reset_database()
@@ -1,16 +0,0 @@
# test_single_insert.py
from database import db

test_motion = {
    'title': 'Test Motion',
    'description': 'This is a test motion',
    'date': '2024-01-01',
    'policy_area': 'Test',
    'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
    'winning_margin': 0.5,
    'url': 'https://test.com/motion1'
}

success = db.insert_motion(test_motion)
print(f"Insert successful: {success}")
@@ -1,106 +0,0 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB implementation plan"
status: draft
---

## Summary

Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive and well-tested.

## High-level approach (chosen)

- Add **ai_provider**: adapter exposing get_embedding(text) and chat_completion(messages) with retries and ProviderError.
- Add **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan).
- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
- Refactor summarizer to call ai_provider and optionally store embeddings.
- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.
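The ai_provider adapter described above might look like this (a sketch: the class and method names follow the plan, but the retry count, backoff policy, and the injected `client` interface are assumptions):

```python
import time

class ProviderError(Exception):
    """Raised when a provider call fails after all retries."""

class AIProvider:
    """Adapter around a concrete AI backend (hypothetical client object
    exposing .embed(text) and .chat(messages))."""

    def __init__(self, client, max_retries=3, backoff=0.5):
        self.client = client
        self.max_retries = max_retries
        self.backoff = backoff

    def _call(self, fn, *args):
        for attempt in range(self.max_retries):
            try:
                return fn(*args)
            except Exception as exc:
                if attempt == self.max_retries - 1:
                    raise ProviderError(str(exc)) from exc
                time.sleep(self.backoff * 2 ** attempt)  # exponential backoff

    def get_embedding(self, text):
        return self._call(self.client.embed, text)

    def chat_completion(self, messages):
        return self._call(self.client.chat, messages)
```

Injecting the backend as a plain object keeps the adapter trivial to unit-test with a fake client and no network.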
## Micro-tasks (11 tasks)

All tasks are intentionally small (file-level changes + tests). Estimates assume one developer full-time; see Risk and Calendar section below.

Batch 1 (foundation, parallelizable)

1. Add test fixtures for temporary DuckDB (tests/conftest.py) — 2h — low risk
2. Add migration SQL to create embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
3. Add ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
4. Add scraper SCRAPING_DELAY default (src/scraper.py) + tests — 1h — low risk
5. Fix reset script to run migrations (src/reset.py) + tests — 2h — low risk

Batch 2 (core modules)

6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
8. Refactor summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk

Batch 3 (integration)

9. Add CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk

Batch 4 (docs/config)

11. Add .env.example entries for new env vars — 1h — low risk
## PR order (recommended, small focused PRs)

1. PR A — tests/conftest (fixtures)
2. PR B — migration SQL (embeddings table)
3. PR C — ai_provider + tests
4. PR D — database store/search helpers + tests
5. PR E — query_dal + tests
6. PR F — summarizer refactor + tests
7. PR G — cli_search + tests
8. PR H — app read changes + tests
9. PR I — scraper/reset small fixes + tests
10. PR J — .env.example
## Estimates & schedule (one dev, full-time ~8h/day)

- Total estimated effort: the task estimates above sum to ~40 hours (~5 days); with buffer → ~7 calendar days.
- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batch 3 (1 day), Buffer/Review (1 day).
## DB migration steps

- Add migrations/2026-03-19-add-embeddings.sql (additive).
- Apply on staging first; backup DB, run migration, verify `SELECT count(*) FROM embeddings`.
- No changes to motions table in first iteration.
## Testing strategy

- Unit tests for ai_provider (mock HTTP responses). Use monkeypatch to avoid network.
- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse JSON fields.
- Summarizer tests mock ai_provider to assert DB writes (summary and optional embedding).
## Error handling

- ai_provider: retry/backoff for transient errors; raise ProviderError for terminal failures.
- Summarizer: non-fatal on AI failures — write fallback/empty summary, log, and surface message in UI when interactive.
- DB functions: keep try/except patterns and ensure connections closed on error.
## Risks & mitigations

- ai_provider changes: medium risk — mitigate with retries, clear ProviderError, and thorough unit tests.
- Embedding search: medium (naive scan performance) — mitigate by keeping implementation simple and planning for ANN/FAISS later.
- ibis usage: medium — mitigate with tests and keep query_dal narrow.
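The naive cosine scan behind the embedding-search risk could be sketched like this (hypothetical shapes; `stored` stands in for (id, vector) rows read from the DuckDB embeddings table):

```python
import numpy as np

def search_similar(query_vec, stored, top_k=5):
    """Full scan: cosine similarity of the query against every stored
    embedding. O(n * dim) per query, fine for small tables; swap in
    ANN/FAISS when it gets slow."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for motion_id, vec in stored:
        v = np.asarray(vec, dtype=float)
        scored.append((motion_id, float(q @ (v / np.linalg.norm(v)))))
    scored.sort(key=lambda item: -item[1])
    return scored[:top_k]

hits = search_similar(
    [1.0, 0.0],
    [(1, [1.0, 0.1]), (2, [0.0, 1.0]), (3, [-1.0, 0.0])],
    top_k=2,
)
```

Keeping the scan in plain Python matches the plan's "simple first" mitigation: the storage schema stays unchanged when an ANN index is added later.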
## Next actions (what I'll do now)

- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
- I will NOT start applying code changes automatically. If you want, I can:
  - (A) Create the first PR patch (tests/conftest.py + migration) and open a draft for review, or
  - (B) Start implementing Task 1.1 (ai_provider) next.

Interrupt if you want changes to the plan or a different PR ordering. Otherwise tell me which task to start and I'll create the first patch.