Cleanup performed by assistant: removed generated caches and stale files: __pycache__, *.pyc, .pytest_cache, .ruff_cache, dummy/, test.py, read.py, reset.py, fix_database.py, thoughts/thoughts/, .github/workflows/mindmodel-validate.yml. No push was performed.
parent 867fcd1989, commit a20bd834fc
@@ -1,39 +0,0 @@
name: mindmodel validate

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 4 * * 0'  # weekly

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt || true

      - name: Run tests
        run: |
          python -m pytest -q

      - name: Run mindmodel validator if manifest exists
        if: ${{ always() }}
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            python -m scripts.mindmodel.cli || true
          else
            echo "No .mindmodel/manifest.yaml present — skipping validator"
          fi
@@ -1,90 +0,0 @@
# Tweede Kamer Parliamentary Embedding Analysis

## Goal

Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.

## Data

|Source|Content|
|------|-------|
|MP × motion vote matrix|yes / no / abstain per MP per motion|
|Motion text|Dutch-language motion descriptions|
|MP metadata|name, party, entry/exit dates|
|Timestamps|date of each vote|

## Approach: Late Fusion

Two independent embedding signals, combined per motion.

### 1. Vote embeddings (SVD)

- Build a sparse MP × motion matrix per time window
- Apply SVD to get latent vectors for both MPs and motions
- Encodes political alignment from actual voting behavior
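The per-window factorization above can be sketched with `scipy.sparse.linalg.svds`. The vote coding (yes = +1, no = -1, abstain/absent = 0), the toy matrix size, and the choice of `k` are illustrative assumptions, not values fixed by this document.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy window: 6 MPs x 5 motions, coded yes=+1, no=-1, abstain/absent=0
rng = np.random.default_rng(0)
votes = csr_matrix(rng.choice([-1.0, 0.0, 1.0], size=(6, 5)))

k = 2  # number of latent dimensions; must satisfy k < min(votes.shape)
U, S, Vt = svds(votes, k=k)

# Split the singular values across both sides so MPs and motions
# land in a comparable latent space
mp_vectors = U * np.sqrt(S)         # one row per MP
motion_vectors = Vt.T * np.sqrt(S)  # one row per motion
```

In the real pipeline the matrix would be built from the vote table per time window, with MP and motion index mappings kept alongside the vectors.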

### 2. Text embeddings (Qwen3-0.6B)

- Embed each motion's text using Qwen3-0.6B (multilingual; Dutch is supported)
- Encodes the semantic/policy topic of the motion
- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
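One way to apply the English task instruction is to prefix it to each motion before encoding. A sketch, with the encoder injected as a callable so it runs without downloading a model; the `Instruct:/Query:` prompt shape follows common Qwen embedding usage but should be verified against the model card of whichever checkpoint is deployed.

```python
import numpy as np

INSTRUCTION = "Retrieve semantically similar Dutch parliamentary motions"

def embed_motions(texts, encode):
    """Embed motion texts with an instruction prefix.

    `encode` is any callable mapping list[str] -> array-like, e.g. the
    .encode method of a sentence-transformers model (assumed interface).
    """
    prompts = [f"Instruct: {INSTRUCTION}\nQuery: {t}" for t in texts]
    vecs = np.asarray(encode(prompts), dtype=np.float32)
    # L2-normalize so cosine similarity reduces to a dot product
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```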

### 3. Fusion

Concatenate (or take a weighted sum of) the SVD motion vector and the text vector into a single motion embedding. MPs retain their SVD vectors only.

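The concatenation variant can be sketched as a normalize-then-concatenate helper; the weight parameters are illustrative knobs, not values from this document.

```python
import numpy as np

def fuse_motion(svd_vec, text_vec, w_svd=1.0, w_text=1.0):
    """Concatenate the vote-SVD and text vectors into one motion embedding.

    Each part is L2-normalized first so neither signal dominates by raw
    scale; the weights then set their relative importance explicitly.
    """
    a = np.asarray(svd_vec, dtype=np.float64)
    b = np.asarray(text_vec, dtype=np.float64)
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    return np.concatenate([w_svd * a, w_text * b])
```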
## Temporal Tracking

### Time windows

- Default: **quarterly** (flexible — can be per half-year or per N votes)
- Adaptive option: a fixed number of votes per window (e.g. 200) for a stable SVD regardless of parliamentary rhythm
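The adaptive option amounts to chunking the chronologically sorted motions into fixed-size groups; a minimal sketch, with the window size taken from the example above:

```python
def vote_windows(motion_ids, window_size=200):
    """Split chronologically ordered motion IDs into fixed-size windows.

    The final window holds the remainder; in practice you may want to
    merge it into the previous window if it is too small for a stable SVD.
    """
    return [motion_ids[i:i + window_size]
            for i in range(0, len(motion_ids), window_size)]
```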

### Procrustes alignment

SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous one, using the overlapping MPs as anchors.

```
R = argmin || W1[common] - W2[common] @ R ||
W2_aligned = W2 @ R  # applied to all MPs, including newcomers
```

- Only the overlapping MPs are needed to estimate R
- New MPs are placed into the aligned space via their voting pattern
- A high Procrustes disparity score signals a structural political shift, not just individual drift
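A runnable sketch of the alignment step. It uses `scipy.linalg.orthogonal_procrustes`, which returns the rotation R directly (the `scipy.spatial.procrustes` listed in the Stack table also recenters and rescales its inputs, which is not what you want when newcomers must be carried along); the toy data is constructed so the rotation is recovered exactly.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 2))   # window-1 vectors for 8 overlapping MPs
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))
W2 = W1 @ Q                    # window 2: same structure, rotated axes

# R minimizes || W2[common] @ R - W1[common] ||_F over orthogonal R
R, _ = orthogonal_procrustes(W2, W1)
W2_aligned = W2 @ R            # apply to ALL window-2 MPs, newcomers included
```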

### Election transitions

At term boundaries (roughly 60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and the first quarter of the new term, using only the returning MPs.

## Analysis

|Question|Method|
|--------|------|
|MP drift over time|trajectory of an MP's vector across aligned windows|
|Political axis|first SVD component, or an axis defined by anchor parties (e.g. VVD vs SP)|
|Swing voters|MPs closest to the boundary between party clusters|
|Thematic clustering|UMAP on the fused motion embeddings|
|Cross-party coalitions|motions where party cluster boundaries blur|
|Party cohesion|variance of MP vectors within a party per window|

## Stack

|Component|Tool|
|---------|----|
|Matrix factorization|`scipy.sparse.linalg.svds`|
|Procrustes alignment|`scipy.spatial.procrustes`|
|Text embeddings|Qwen3-0.6B via `sentence-transformers` or vLLM|
|Dimensionality reduction|UMAP|
|Visualization|Plotly (interactive trajectories)|
|Data handling|ibis / pandas|
@@ -1,67 +0,0 @@
# fix_database.py (updated version)
import os

import duckdb

from config import config


def fix_database():
    """Completely reset the database with the correct schema."""

    # Remove the existing database file completely
    if os.path.exists(config.DATABASE_PATH):
        os.remove(config.DATABASE_PATH)
        print("Removed existing database file")

    # Create the parent directory if it doesn't exist
    os.makedirs(os.path.dirname(config.DATABASE_PATH), exist_ok=True)

    # Initialize with the correct schema
    conn = duckdb.connect(config.DATABASE_PATH)

    # Create a sequence for auto-incrementing IDs
    conn.execute("CREATE SEQUENCE motions_id_seq START 1")

    # Create the motions table with sequence-based auto-increment
    conn.execute("""
        CREATE TABLE motions (
            id INTEGER DEFAULT nextval('motions_id_seq'),
            title TEXT NOT NULL,
            description TEXT,
            date DATE,
            policy_area TEXT,
            voting_results JSON,
            winning_margin FLOAT,
            controversy_score FLOAT,
            layman_explanation TEXT,
            url TEXT UNIQUE,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (id)
        )
    """)

    conn.execute("""
        CREATE TABLE user_sessions (
            session_id TEXT PRIMARY KEY,
            user_votes JSON,
            completed_motions INTEGER DEFAULT 0,
            total_motions INTEGER DEFAULT 10,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE party_results (
            session_id TEXT,
            party_name TEXT,
            agreement_percentage FLOAT,
            agreed_motions JSON,
            disagreed_motions JSON,
            PRIMARY KEY (session_id, party_name)
        )
    """)

    conn.close()
    print("Database recreated with correct schema using sequences")


if __name__ == "__main__":
    fix_database()
@@ -1,9 +0,0 @@
import ibis

con = ibis.duckdb.connect('data/motions.db')

print(con.tables)

for t in con.tables:
    print(con.table(t).head().execute().to_string())
@@ -1,3 +0,0 @@
# Run this to reset your database
from database import db

db.reset_database()
@@ -1,16 +0,0 @@
# test_single_insert.py
from database import db

test_motion = {
    'title': 'Test Motion',
    'description': 'This is a test motion',
    'date': '2024-01-01',
    'policy_area': 'Test',
    'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
    'winning_margin': 0.5,
    'url': 'https://test.com/motion1',
}

success = db.insert_motion(test_motion)
print(f"Insert successful: {success}")
@@ -1,106 +0,0 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB implementation plan"
status: draft
---

## Summary

Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive, and well-tested.

## High-level approach (chosen)

- Add **ai_provider**: an adapter exposing get_embedding(text) and chat_completion(messages), with retries and a ProviderError.
- Add an **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan).
- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
- Refactor the summarizer to call ai_provider and optionally store embeddings.
- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.
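The ai_provider bullet can be sketched as follows. `ProviderError` and the two method names come from the plan; the injected `transport` callable, retry count, and backoff constants are illustrative choices, not settled API.

```python
import time

class ProviderError(Exception):
    """Terminal provider failure after retries are exhausted."""

class AIProvider:
    """Adapter sketch: retries transient failures, raises ProviderError otherwise.

    `transport` is any callable doing the actual API call; it is injected
    so the class can be unit-tested without network access.
    """
    def __init__(self, transport, retries=3, backoff=0.5):
        self.transport = transport
        self.retries = retries
        self.backoff = backoff

    def _call(self, payload):
        for attempt in range(self.retries):
            try:
                return self.transport(payload)
            except (TimeoutError, ConnectionError):
                time.sleep(self.backoff * (2 ** attempt))  # exponential backoff
        raise ProviderError(f"gave up after {self.retries} attempts")

    def get_embedding(self, text):
        return self._call({"op": "embed", "input": text})

    def chat_completion(self, messages):
        return self._call({"op": "chat", "messages": messages})
```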
## Micro-tasks (11 tasks)

All tasks are intentionally small (file-level changes plus tests). Estimates assume one developer full-time; see the Estimates and Risks sections below.

Batch 1 (foundation, parallelizable)

1. Add test fixtures for a temporary DuckDB (tests/conftest.py) — 2h — low risk
2. Add migration SQL to create the embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
3. Add the ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
4. Add a SCRAPING_DELAY default to the scraper (src/scraper.py) + tests — 1h — low risk
5. Fix the reset script to run migrations (src/reset.py) + tests — 2h — low risk

Batch 2 (core modules)

6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
8. Refactor the summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk
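Task 6's naive cosine scan can be modeled in plain Python over rows of (id, JSON-encoded vector), mirroring the planned DuckDB JSON storage; the function names follow the task, while the row format is an assumption.

```python
import json
import math

def store_embedding(rows, motion_id, vector):
    """Append (id, JSON-encoded vector) -- stand-in for a DuckDB INSERT."""
    rows.append((motion_id, json.dumps(list(vector))))

def search_similar(rows, query, top_k=3):
    """Naive cosine scan over every stored embedding (fine for small tables)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-12)
    scored = [(mid, cos(query, json.loads(vec))) for mid, vec in rows]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]
```

This is O(rows) per query, which matches the "keep it simple, plan for ANN/FAISS later" risk note below.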

Batch 3 (integration)

9. Add a CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk

Batch 4 (docs/config)

11. Add .env.example entries for the new env vars — 1h — low risk

## PR order (recommended, small focused PRs)

1. PR A — tests/conftest (fixtures)
2. PR B — migration SQL (embeddings table)
3. PR C — ai_provider + tests
4. PR D — database store/search helpers + tests
5. PR E — query_dal + tests
6. PR F — summarizer refactor + tests
7. PR G — cli_search + tests
8. PR H — app read changes + tests
9. PR I — scraper/reset small fixes + tests
10. PR J — .env.example

## Estimates & schedule (one dev, full-time ~8h/day)

- Total estimated effort: the listed tasks sum to ~40 hours (~5 days); with buffer and review, ~7 calendar days.
- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batches 3–4 (1 day), buffer/review (1 day).

## DB migration steps

- Add migrations/2026-03-19-add-embeddings.sql (additive only).
- Apply on staging first: back up the DB, run the migration, then verify with `SELECT count(*) FROM embeddings`.
- No changes to the motions table in the first iteration.
## Testing strategy

- Unit tests for ai_provider mock the HTTP responses; use monkeypatch to avoid network access.
- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse the JSON fields.
- Summarizer tests mock ai_provider to assert the DB writes (summary and optional embedding).
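The summarizer-test bullet can also be sketched with a hand-rolled fake provider instead of monkeypatching HTTP; `summarize_motion` here is a hypothetical stand-in for the real function in src/summarizer.py.

```python
class FakeProvider:
    """Test double matching the planned ai_provider surface."""
    def get_embedding(self, text):
        return [0.0, 1.0]

    def chat_completion(self, messages):
        return "one-line summary"

def summarize_motion(motion_text, provider):
    """Hypothetical summarizer shape: summarize, then embed, then return
    what would be written to the DB."""
    summary = provider.chat_completion(
        [{"role": "user", "content": f"Summarize: {motion_text}"}]
    )
    embedding = provider.get_embedding(motion_text)
    return {"summary": summary, "embedding": embedding}

result = summarize_motion("Motie over stikstof", FakeProvider())
```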

## Error handling

- ai_provider: retry with backoff on transient errors; raise ProviderError on terminal failures.
- Summarizer: non-fatal on AI failures — write a fallback/empty summary, log the error, and surface a message in the UI when interactive.
- DB functions: keep the try/except patterns and ensure connections are closed on error.

## Risks & mitigations

- ai_provider changes: medium risk — mitigate with retries, a clear ProviderError, and thorough unit tests.
- Embedding search: medium risk (naive-scan performance) — mitigate by keeping the implementation simple and planning for ANN/FAISS later.
- ibis usage: medium risk — mitigate with tests and keep query_dal narrow.

## Next actions (what I'll do now)

- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
- I will NOT start applying code changes automatically. If you want, I can:
  - (A) create the first PR patch (tests/conftest.py + the migration) and open a draft for review, or
  - (B) start implementing Task 3 (ai_provider) next.

Interrupt if you want changes to the plan or a different PR ordering. Otherwise, tell me which task to start and I'll create the first patch.