feat(pipeline): implement parliamentary embedding pipeline MVP

- Add 4 migration files: mp_votes, mp_metadata, svd_vectors, fused_embeddings
- Extend database.py with 5 new helper methods and table init
- Add pipeline/ package: extract_mp_votes, fetch_mp_metadata, text_pipeline,
  svd_pipeline (with Procrustes alignment), fusion
- Add full test suite (17 tests) covering all pipeline modules and migrations
- Fix Procrustes alignment bug: scipy scale is a norm value, not a multiplier
- Fix DuckDB date type handling in test assertions (datetime.date vs string)
- Remove duckdb.py shim; tests now run against real duckdb + scipy via uv

Ref: thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md
Branch: main
Author: Sven Geboers (1 month ago)
Parent: c498c3467e
Commit: a36e6cba4e
Changed files (68) — lines added per file:

    38  .drone.yml
    10  .gitignore
     1  .python-version
   126  ARCHITECTURE.md
   118  CODE_STYLE.md
    36  Dockerfile
    90  EMBEDDING_ANALYSIS.md
     0  README.md
   188  ai_provider.py
   389  api_client.py
   310  app.py
    51  config.py
   582  database.py
    20  docker-compose.yml
    72  docs/admin/recompute_similarity.md
    67  fix_database.py
     6  main.py
    11  migrations/2026-03-19-add-embeddings.sql
     6  migrations/2026-03-20-add-body-text.sql
    24  migrations/2026-03-22-add-audit-events.sql
    15  migrations/2026-03-22-add-similarity-cache.sql
    13  migrations/2026_03_21__create_fused_embeddings.sql
     9  migrations/2026_03_21__create_mp_metadata.sql
    13  migrations/2026_03_21__create_mp_votes.sql
    13  migrations/2026_03_21__create_svd_vectors.sql
     0  pipeline/__init__.py
    75  pipeline/extract_mp_votes.py
    94  pipeline/fetch_mp_metadata.py
   116  pipeline/fusion.py
   206  pipeline/svd_pipeline.py
   122  pipeline/text_pipeline.py
    18  pyproject.toml
     9  read.py
     3  reset.py
   264  scheduler.py
   183  scraper.py
   128  scripts/compute_test_batch.py
    35  src/types/motion_types.py
   101  summarizer.py
    16  test.py
     1  tests/__init__.py
    63  tests/conftest.py
     1  tests/fixtures/__init__.py
    40  tests/fixtures/sample_voting_results.json
     0  tests/integration/__init__.py
    87  tests/integration/test_pipeline_end_to_end.py
    58  tests/migrations/test_2026_03_22_add_audit_events.py
    85  tests/migrations/test_2026_03_22_add_similarity_cache.py
    29  tests/migrations/test_migration_fixtures_smoke.py
    49  tests/test_ai_provider.py
    74  tests/test_extract_mp_votes.py
   103  tests/test_fetch_mp_metadata.py
    79  tests/test_fusion.py
    31  tests/test_migration_embeddings.py
   219  tests/test_migration_pipeline_tables.py
     5  tests/test_pyproject_deps.py
    63  tests/test_svd_pipeline.py
    80  tests/test_text_pipeline.py
    22  tests/types/test_motion_types.py
    66  tests/utils/migration_fixtures.py
    50  thoughts/ledgers/CONTINUITY_stemwijzer.md
    98  thoughts/shared/designs/2026-03-19-stemwijzer-design.md
   116  thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md
   335  thoughts/shared/plans/2026-03-21-motions-guided-explorer-plan.md
   106  thoughts/thoughts/shared/plans/2026-03-19-stemwijzer-plan.md
   129  tools/query_tk_api.py
  1246  uv.lock
     9  verify.py

.drone.yml
@@ -0,0 +1,38 @@
kind: pipeline
type: docker
name: default
steps:
- name: build
image: docker:24.0.2
environment:
DOCKER_BUILDKIT: "1"
commands:
- docker build -t ${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} .
- docker tag ${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
- name: push
image: docker:24.0.2
commands:
- echo "Logging into registry"
- docker login -u ${DOCKER_USERNAME} -p ${DOCKER_PASSWORD} ${DOCKER_REGISTRY}
- docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA}
- docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
- name: deploy
image: appleboy/drone-ssh
settings:
host: ${DEPLOY_HOST}
port: ${DEPLOY_SSH_PORT}
username: ${DEPLOY_USER}
password: ${DEPLOY_PASSWORD}
script: |
set -e
cd /srv/stemwijzer
docker pull ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
docker-compose pull
docker-compose up -d
trigger:
branch:
- main

.gitignore
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info
# Virtual environments
.venv

ARCHITECTURE.md
@@ -0,0 +1,126 @@
ARCHITECTURE
============
Overview
--------
- Small Python project that collects, stores, and presents Dutch parliamentary motions (Tweede Kamer). It
ingests votes (via the OData API or HTML scraping), stores motions in a DuckDB file, generates short
plain-language summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results.
Tech stack
----------
- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db), ibis used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
- HTTP: requests
- HTML parsing: BeautifulSoup (scraper.py)
- Scheduling: schedule (scheduler.py)
- LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config)
- Packaging: pyproject.toml present
Top-level layout (annotated)
----------------------------
./
- app.py — Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py — minimal CLI entry / small script
- database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py — TweedeKamerAPI: fetch OData voting records and group into motions
- scraper.py — MotionScraper: HTML fallback scraper for motion pages
- summarizer.py — MotionSummarizer: LLM integration to generate layman_explanation
- scheduler.py — DataUpdateScheduler: initial historical loads + periodic scheduled updates
- config.py — Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- read.py — small ibis + duckdb demonstration/utility
- fix_database.py — script to recreate/reset DuckDB schema
- reset.py / verify.py — small maintenance scripts that call into database module
- test.py — ad-hoc test script (manual insert/verification)
- data/ — data/motions.db (DuckDB file)
- pyproject.toml — project metadata / dependencies
- .env — environment variables (not printed here)
Core components
---------------
- Streamlit UI (app.py)
- Presents the voting UI, reads filtered motions from database, creates sessions, writes user votes
- Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(),
database.calculate_party_matches(), summarizer.update_motion_summaries()
- Storage (database.py)
- MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
- Exposes a module-level instance `db = MotionDatabase()` used across the codebase
- Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote,
calculate_party_matches
- Ingestion (api_client.py + scraper.py)
- api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
- scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
- Both provide structured motion dicts consumed by database.insert_motion()
- Summarization (summarizer.py)
- Wraps an OpenAI-compatible client to produce short layman explanations and persists them to DB
- Reads motions without layman_explanation and updates rows
- Orchestration (scheduler.py)
- Runs initial historical ingestion and schedules periodic updates (using schedule)
- Calls API client and summarizer and writes to the database
Data flow (high level)
----------------------
1. Ingestion
- scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
- Each produced motion dict is passed to MotionDatabase.insert_motion()
- insert_motion writes to DuckDB (data/motions.db)
2. Enrichment
- summarizer.update_motion_summaries() reads motions lacking layman_explanation,
calls the LLM client (openai.OpenAI) and writes summary text back to the DB
3. Presentation / Interaction
- app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
- Users vote; app.py writes votes into the database via db.update_user_vote()
- app.py calls db.calculate_party_matches() to compute match percentages for parties
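A condensed sketch of this flow using the module-level singletons named above — a one-off manual run for illustration, not the scheduler's actual code:
```
from api_client import TweedeKamerAPI
from database import db
from summarizer import summarizer

api = TweedeKamerAPI()

# 1. Ingestion: fetch grouped motions and store them
for motion in api.get_motions(limit=100):
    db.insert_motion(motion)

# 2. Enrichment: fill in missing layman explanations via the LLM client
summarizer.update_motion_summaries()

# 3. Presentation: app.py then reads db.get_filtered_motions() and records votes
```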
External integrations & dependencies
-----------------------------------
- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
- DuckDB (database file at data/motions.db)
- ibis (read.py demonstrates an ibis.duckdb connection)
- Streamlit for UI
- OpenAI-compatible LLM client (summarizer.py) — configured with environment variables in config.py
Configuration
-------------
- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
- config.DATABASE_PATH (default "data/motions.db")
- OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
- QWEN_MODEL (or other model identifier) referenced in summarizer.py
- API timeout / batch size constants
- A .env file is present at the repo root (do not commit secrets). No .env.example was observed in the repository.
- Packaging metadata: pyproject.toml
Build, run & development notes
------------------------------
- Install dependencies via the project's Python packaging (pyproject.toml). A Dockerfile and a Drone CI
pipeline (.drone.yml) are included for container builds and deployment.
- Streamlit app: run `streamlit run app.py` from project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: run scheduler.run_once() (script or import) or run scheduler.run_scheduler() for periodic ingestion.
Tests
-----
- A pytest suite lives under tests/ (unit, migration, and integration tests). The ad-hoc script `test.py` also remains for manual insert verification.
Notes / caveats
----------------
- Project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons
(e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py,
scraper.py). Logging is not centralized (print statements used).
Where to look first (for contributors)
-------------------------------------
- app.py — follow the UI flow and see how votes & sessions are used
- database.py — core data model and calculations
- api_client.py — OData ingestion logic
- summarizer.py — LLM usage and environment variables
- scheduler.py — how ingestion is orchestrated over time

CODE_STYLE.md
@@ -0,0 +1,118 @@
CODE STYLE
==========
Purpose
-------
This document records the conventions already in use in the codebase so new contributors and AI
agents can produce code that fits the repository's existing style.
General
-------
- Language: Python (3.x)
- Project uses one file-per-module with descriptive snake_case filenames (e.g., api_client.py, database.py)
- Top-level module singletons are exposed when a single shared instance is desired (e.g. `db = MotionDatabase()`)
- Keep code synchronous unless you introduce async consistently across modules (none currently use async/await)
Naming
------
- Files / modules: snake_case.py (e.g., motion_scraper -> scraper.py, api_client.py)
- Classes: PascalCase (e.g., MotionDatabase, MotionSummarizer, TweedeKamerAPI)
- Functions and methods: snake_case (including private helpers with a single leading underscore)
- Constants / config fields: UPPER_SNAKE_CASE (placed in config.py and referenced via `from config import config`)
File organization
-----------------
- Keep top-level domain modules in the repository root (this repo uses a flat layout)
- Each module should contain one primary responsibility (e.g., database.py for DB logic)
- Module-level singletons: create at module bottom and import from other modules (pattern used widely)
Imports
-------
- Group imports in this order with a blank line between groups:
1. Standard library (datetime, json, typing)
2. Third-party libraries (requests, duckdb, ibis, streamlit)
3. Local imports (from config import config, from database import db)
- Use absolute imports (module name) rather than relative imports
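For instance, a module header following that ordering could look like this (names taken from modules already in this repo):
```
from datetime import datetime
from typing import Dict, List

import duckdb
import requests

from config import config
from database import db
```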
Typing
------
- Add type hints to public function signatures where helpful (project uses typing in several places).
- Use typing.Dict, typing.List, typing.Optional for simple container annotations.
Error handling & logging
------------------------
- Current pattern: functions catch broad Exception and print error messages, then return a safe default
(False, [], None). Examples in database.py and api_client.py.
- When updating code, prefer to:
- Keep the existing behavior (return safe fallback) to avoid breaking call sites
- Consider adding structured logging (use logging module) rather than print, but maintain similar
high-level error flows unless refactoring intentionally.
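A minimal sketch of that preference — the same safe-fallback flow, but using the logging module instead of print (the function and its `_do_query` helper are hypothetical):
```
import logging

logger = logging.getLogger(__name__)

def load_items() -> list:
    try:
        return _do_query()  # hypothetical helper doing the actual work
    except Exception:
        # keep the existing safe fallback, but log with a stack trace instead of printing
        logger.exception("Error loading items")
        return []
```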
LLM / external API calls
------------------------
- OpenAI-compatible client usage is in summarizer.py. Environment variables are read from config.py.
- Do NOT commit API keys or secrets. Use environment variables (OPENROUTER_API_KEY, etc.) and
reference them by name.
- Network calls are synchronous using requests. Keep request timeouts and error handling consistent with
existing patterns (catch requests.exceptions.RequestException and return safe fallback values).
Database patterns
-----------------
- Database is DuckDB stored at data/motions.db. The MotionDatabase class opens short-lived duckdb
connections inside methods (conn = duckdb.connect(self.db_path)). This pattern is used widely.
- Queries and schema initialization happen inside MotionDatabase._init_database(). Keep DDL grouped there.
- When writing methods that modify DB, follow the try/except + conn.close() pattern to guarantee cleanup.
Testing
-------
- Currently the project uses ad-hoc test scripts (test.py). If adding tests, follow pytest conventions:
- Place tests in tests/ directory
- Use filenames test_*.py and functions test_* with assertions
- Mock external APIs (requests, LLM client) via monkeypatch or unittest.mock
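A hypothetical test following these conventions, mocking the HTTP session so no real request is made:
```
# tests/test_api_client.py (illustrative only)
from unittest.mock import MagicMock

from api_client import TweedeKamerAPI

def test_get_motions_returns_empty_list_on_api_error(monkeypatch):
    api = TweedeKamerAPI()
    # Replace the requests.Session so the call fails without touching the network;
    # get_motions catches the error and returns its safe fallback.
    monkeypatch.setattr(api, "session", MagicMock(get=MagicMock(side_effect=Exception("boom"))))
    assert api.get_motions() == []
```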
Patterns observed (use these when adding new code)
-----------------------------------------------
- Singletons: expose module-level instance (e.g. `db = MotionDatabase()`), import it elsewhere
- Private helpers: name with a single leading underscore (e.g., _get_voting_records)
- Config: centralize in config.py and reference via `from config import config` (don't hardcode paths)
Do's and Don'ts
---------------
Do:
- Follow existing naming: snake_case for files/functions
- Add simple type hints for clarity
- Return the same safe fallback values used in existing functions on error
- Use module-level singletons for shared services if helpful
Don't:
- Don't add async/await in a single module without broader coordination
- Don't print secret values or commit .env files
- Don't create circular imports (be careful when modules instantiate singletons at import time)
Example snippets
----------------
Conformant class and method:
```
import typing

import duckdb

from config import config


class ExampleService:
    def __init__(self, param: str = config.DATABASE_PATH):
        self.param = param

    def do_work(self, items: typing.List[dict]) -> bool:
        try:
            # short-lived DB/HTTP usage
            conn = duckdb.connect(config.DATABASE_PATH)
            # ... perform work
            conn.close()
            return True
        except Exception as e:
            print(f"Error in do_work: {e}")
            if "conn" in locals():
                conn.close()
            return False
```
Adding a new module
-------------------
1. Create snake_case file (e.g., new_service.py)
2. Add a PascalCase class implementing the behavior and small helper functions prefixed with _
3. If you need a shared instance, create `service = NewService()` at the module bottom
4. Import via `from new_service import service` in other modules
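Put together, a hypothetical `new_service.py` skeleton following steps 1-4:
```
from typing import List

from config import config


class NewService:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path

    def _load_items(self) -> List[dict]:
        # private helper, single leading underscore
        return []

    def run(self) -> bool:
        try:
            self._load_items()
            return True
        except Exception as e:
            print(f"Error in run: {e}")
            return False


# shared instance, imported elsewhere via `from new_service import service`
service = NewService()
```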

Dockerfile
@@ -0,0 +1,36 @@
FROM python:3.13-slim
# Install minimal system deps
RUN apt-get update \
&& apt-get install -y --no-install-recommends build-essential curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user for running the app
RUN useradd -m -s /bin/bash app
WORKDIR /home/app/app
# Copy project files
COPY . /home/app/app
# Upgrade pip and install either pinned requirements or runtime defaults
RUN python -m pip install --upgrade pip
RUN if [ -f requirements.txt ]; then \
pip install -r requirements.txt; \
else \
pip install uv streamlit duckdb; \
fi
# Fix permissions
RUN chown -R app:app /home/app
USER app
ENV PYTHONPATH=/home/app/app
EXPOSE 8501
# Simple healthcheck that queries the Streamlit root
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s CMD curl -f http://localhost:8501/ || exit 1
# Run the Streamlit app via uv as preferred in this project
CMD ["uv", "run", "streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

EMBEDDING_ANALYSIS.md
@@ -0,0 +1,90 @@
# Tweede Kamer Parliamentary Embedding Analysis
## Goal
Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.
## Data
|Source|Content|
|------|-------|
|MP × motion vote matrix|yes / no / abstain per MP per motion|
|Motion text|Dutch-language motion descriptions|
|MP metadata|name, party, entry/exit dates|
|Timestamps|date of each vote|
## Approach: Late Fusion
Two independent embedding signals, combined per motion.
### 1. Vote embeddings (SVD)
- Build a sparse MP × motion matrix per time window
- Apply SVD to get latent vectors for both MPs and motions
- Encodes political alignment from actual voting behavior
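A minimal sketch of this step, assuming votes are encoded as +1 (voor) / -1 (tegen) / 0 (abstain or absent) and using `scipy.sparse.linalg.svds`; names and shapes are illustrative, not the exact `svd_pipeline` API:
```
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(votes, n_mps, n_motions, k=16):
    # votes: iterable of (mp_index, motion_index, value) triples
    rows, cols, vals = zip(*votes)
    matrix = csr_matrix((vals, (rows, cols)), shape=(n_mps, n_motions), dtype=float)
    u, s, vt = svds(matrix, k=k)          # k must be < min(n_mps, n_motions)
    mp_vectors = u * s                    # (n_mps, k) latent MP positions
    motion_vectors = vt.T * s             # (n_motions, k) latent motion positions
    return mp_vectors, motion_vectors
```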
### 2. Text embeddings (Qwen3-0.6B)
- Embed each motion's text using Qwen3-0.6B (multilingual, Dutch supported)
- Encodes semantic/policy topic of the motion
- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
### 3. Fusion
Concatenate (or weighted sum) the SVD motion vector and text vector into a single motion embedding. MPs retain their SVD vectors only.
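A concatenation-based fusion sketch (the weights are illustrative knobs, not values the pipeline necessarily uses):
```
import numpy as np

def fuse(svd_vec, text_vec, svd_weight=1.0, text_weight=1.0):
    # Late fusion by concatenation; the weights let either signal dominate downstream distances.
    return np.concatenate([svd_weight * np.asarray(svd_vec), text_weight * np.asarray(text_vec)])
```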
## Temporal Tracking
### Time windows
- Default: **quarterly** (flexible — can be per half-year or per N votes)
- Adaptive option: fixed number of votes per window (e.g. 200) for stable SVD regardless of parliamentary rhythm
### Procrustes alignment
SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous, using overlapping MPs as anchors.
```
R = argmin || W1[common] - W2[common] @ R ||
W2_aligned = W2 @ R # applied to all MPs, including newcomers
```
- Only overlapping MPs are needed to estimate R
- New MPs are placed into the aligned space via their voting pattern
- High Procrustes disparity score = structural political shift, not just individual drift
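One way to estimate R with SciPy is `scipy.linalg.orthogonal_procrustes` (a sketch under assumed shapes; the actual `svd_pipeline` code may differ). Note that its second return value is a norm (sum of singular values), not a scale multiplier — the bug called out in the commit message:
```
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_windows(w1_common, w2_common, w2_all):
    # w1_common, w2_common: (n_overlapping_mps, k) vectors for MPs present in both windows
    # w2_all: (n_mps, k) all window-2 vectors, including newcomers
    R, _norm = orthogonal_procrustes(w2_common, w1_common)  # minimizes ||w2_common @ R - w1_common||
    return w2_all @ R
```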
### Election transitions
At term boundaries (~60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and first quarter of the new term, using only returning MPs.
## Analysis
|Question|Method|
|--------|------|
|MP drift over time|trajectory of MP vector across aligned windows|
|Political axis|first SVD component, or defined by anchor parties (e.g. VVD vs SP)|
|Swing voters|MPs closest to the boundary between party clusters|
|Thematic clustering|UMAP on fused motion embeddings|
|Cross-party coalitions|motions where party cluster boundaries blur|
|Party cohesion|variance of MP vectors within a party per window|
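For example, party cohesion per window could be computed as within-party variance of aligned MP vectors (a sketch with assumed inputs):
```
import numpy as np

def party_cohesion(mp_vectors, mp_party):
    # mp_vectors: {mp_name: aligned vector}, mp_party: {mp_name: party}
    by_party = {}
    for mp, vec in mp_vectors.items():
        by_party.setdefault(mp_party[mp], []).append(np.asarray(vec))
    # lower total variance = tighter voting bloc in this window
    return {p: float(np.stack(vs).var(axis=0).sum()) for p, vs in by_party.items() if len(vs) > 1}
```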
## Stack
|Component|Tool|
|---------|----|
|Matrix factorization|`scipy.sparse.linalg.svds`|
|Procrustes alignment|`scipy.spatial.procrustes`|
|Text embeddings|Qwen3-0.6B via `sentence-transformers` or vLLM|
|Dimensionality reduction|UMAP|
|Visualization|Plotly (interactive trajectories)|
|Data handling|ibis / pandas|

ai_provider.py
@@ -0,0 +1,188 @@
"""Thin AI provider adapter for OpenRouter-compatible backends.
Provides simple helpers for embeddings and chat completions using requests.
This module is intentionally small and dependency-light to make testing easy.
"""
from __future__ import annotations
import os
import time
import random
from typing import Any
import requests
class ProviderError(Exception):
"""Terminal provider error (non-retryable or configuration issues)."""
def _get_base_url() -> str:
# Support multiple env var names and fall back to OpenRouter default
return os.environ.get(
"OPENROUTER_URL",
os.environ.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
)
def _get_api_key() -> str:
# Accept several common env var names for convenience
for name in ("OPENROUTER_API_KEY", "OPENROUTER_KEY", "OPENAI_API_KEY", "API_KEY"):
key = os.environ.get(name)
if key:
return key
raise ProviderError(
"OPENROUTER_API_KEY (or OPENAI_API_KEY) environment variable is required"
)
def _post_with_retries(
path: str, json: dict[str, Any], retries: int = 3
) -> requests.Response:
"""POST to the provider with a small retry/backoff for transient errors.
Retries on network errors (requests.ConnectionError) and 5xx responses.
"""
url = _get_base_url().rstrip("/") + path
headers = {
"Authorization": f"Bearer {_get_api_key()}",
"Content-Type": "application/json",
}
backoff = 0.5
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, json=json, headers=headers, timeout=10)
except requests.ConnectionError as exc:
if attempt == retries:
raise ProviderError(
f"Connection error when calling provider: {exc}"
) from exc
sleep = backoff * (2 ** (attempt - 1))
sleep = sleep + random.uniform(0, sleep * 0.1)
time.sleep(sleep)
continue
# Treat 5xx as transient
if 500 <= getattr(resp, "status_code", 0) < 600:
if attempt == retries:
raise ProviderError(f"Provider returned HTTP {resp.status_code}")
sleep = backoff * (2 ** (attempt - 1))
sleep = sleep + random.uniform(0, sleep * 0.1)
time.sleep(sleep)
continue
return resp
# Should not reach here
raise ProviderError("Failed to call provider after retries")
def get_embedding(text: str, model: str | None = None) -> list[float]:
"""Return an embedding vector for `text` using the configured provider.
Raises ProviderError for configuration or provider-side failures.
"""
if not isinstance(text, str):
raise ProviderError("text must be a string")
# Resolve model: prefer explicit arg, then env vars, then sensible Qwen default
if model is None:
model = (
os.environ.get("EMBEDDING_MODEL")
or os.environ.get("QWEN_EMBEDDING_MODEL")
or "qwen/qwen3-embedding-4b"
)
resp = _post_with_retries("/embeddings", json={"model": model, "input": text})
try:
data = resp.json()
except Exception as exc:
raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc
# Expecting {"data": [{"embedding": [...]}, ...]}
try:
embedding = data["data"][0]["embedding"]
except Exception as exc:
# If provider returns an error JSON, allow a local fallback when explicitly enabled
fallback = os.environ.get("ALLOW_LOCAL_EMBED_FALLBACK", "false").lower() in (
"1",
"true",
"yes",
)
if fallback:
# choose fallback dim via env or default
dim = int(os.environ.get("LOCAL_EMBED_DIM", "64"))
return _local_embedding(text, dim=dim)
raise ProviderError(f"Unexpected embedding response shape: {data}") from exc
if not isinstance(embedding, list):
raise ProviderError("Embedding is not a list")
return [float(x) for x in embedding]
def _local_embedding(text: str, dim: int = 64) -> list[float]:
"""Deterministic local fallback embedding based on SHA256.
Returns a list of `dim` floats in range [-1, 1]. Not semantically rich but useful
for local testing when provider embeddings are unavailable.
"""
import hashlib
h = hashlib.sha256(text.encode("utf8")).digest()
values = []
i = 0
# Expand digest if needed
while len(values) < dim:
# take 8 bytes -> 64-bit int
chunk = h[i % len(h) : (i % len(h)) + 8]
if len(chunk) < 8:
chunk = chunk.ljust(8, b"\0")
val = int.from_bytes(chunk, "big", signed=False)
# normalize to [-1,1]
valscale = (val / (2**64 - 1)) * 2.0 - 1.0
values.append(valscale)
i += 1
# re-hash occasionally to get more entropy
if i % (len(h) // 2 + 1) == 0:
h = hashlib.sha256(h + chunk).digest()
return values[:dim]
def chat_completion(messages: list[dict], model: str | None = None) -> str:
"""Return the assistant's content string for a chat completion request.
messages should be a list of dicts like {"role": "user", "content": "..."}.
"""
if not isinstance(messages, list):
raise ProviderError("messages must be a list of dicts")
# Resolve chat model: prefer explicit arg, then env var QWEN_MODEL, then a sensible default
if model is None:
model = (
os.environ.get("QWEN_MODEL")
or os.environ.get("CHAT_MODEL")
or "qwen/qwen-3.2"
)
resp = _post_with_retries(
"/chat/completions", json={"model": model, "messages": messages}
)
try:
data = resp.json()
except Exception as exc:
raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc
# Expecting {"choices": [{"message": {"content": "..."}}]}
try:
content = data["choices"][0]["message"]["content"]
except Exception as exc:
raise ProviderError(
f"Unexpected chat completion response shape: {data}"
) from exc
return str(content)

api_client.py
@@ -0,0 +1,389 @@
# api_client.py (complete updated version)
import requests
import json
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from config import config
import time
from collections import defaultdict
class TweedeKamerAPI:
def __init__(self):
self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
self.session = requests.Session()
self.session.headers.update(
{
"Accept": "application/json",
"User-Agent": "Dutch-Political-Compass-Tool/1.0",
}
)
def get_motions(
self, start_date: datetime = None, end_date: datetime = None, limit: int = 500
) -> List[Dict]:
"""Get motions with voting results using OData API"""
if not start_date:
start_date = datetime.now() - timedelta(days=730) # 2 years ago
try:
# Get voting records
voting_records = self._get_voting_records(start_date, end_date, limit)
print(f"Fetched {len(voting_records)} voting records from API")
# Group by Besluit_Id (decision/motion) and get motion details
motions = self._process_voting_records(voting_records)
print(f"Processed into {len(motions)} unique motions")
return motions
except Exception as e:
print(f"Error fetching motions from API: {e}")
return []
def _get_voting_records(
self, start_date: datetime, end_date: datetime = None, limit: int = 500
) -> List[Dict]:
"""Get individual voting records from the API"""
# Format date properly for OData
start_date_str = start_date.strftime("%Y-%m-%d")
filter_query = f"GewijzigdOp ge {start_date_str}T00:00:00Z"
if end_date:
end_date_str = end_date.strftime("%Y-%m-%d")
filter_query += f" and GewijzigdOp le {end_date_str}T23:59:59Z"
# Add filter to exclude deleted records
filter_query += " and Verwijderd eq false"
url = f"{self.odata_base_url}/Stemming"
params = {
"$filter": filter_query,
"$top": limit,
"$orderby": "GewijzigdOp desc",
}
try:
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
data = response.json()
voting_records = data.get("value", [])
# If we got the maximum, there might be more data
if len(voting_records) == limit:
print(
f"Retrieved maximum {limit} records, there might be more data available"
)
return voting_records
except requests.exceptions.RequestException as e:
print(f"API request failed: {e}")
if hasattr(e, "response") and e.response is not None:
print(f"Response status: {e.response.status_code}")
print(f"Response text: {e.response.text[:500]}")
return []
def _process_voting_records(self, records: List[Dict]) -> List[Dict]:
"""Process individual voting records into grouped motions"""
# Group records by Besluit_Id (decision/motion)
motion_groups = defaultdict(
lambda: {"votes": {}, "besluit_id": None, "latest_date": None}
)
for record in records:
besluit_id = record.get("Besluit_Id")
if not besluit_id:
continue
# Extract party and vote information
party_name = record.get("ActorNaam")
vote_type = record.get("Soort", "").lower()
record_date = record.get("GewijzigdOp", "")
if not party_name:
continue
# Map vote types to our format
if vote_type == "voor":
vote = "voor"
elif vote_type == "tegen":
vote = "tegen"
else:
vote = "afwezig"
# Store the vote
motion_groups[besluit_id]["votes"][party_name] = vote
motion_groups[besluit_id]["besluit_id"] = besluit_id
# Track the latest date for this motion
if (
not motion_groups[besluit_id]["latest_date"]
or record_date > motion_groups[besluit_id]["latest_date"]
):
motion_groups[besluit_id]["latest_date"] = record_date
# Now get motion details for each unique Besluit_Id
motions = []
for besluit_id, motion_data in motion_groups.items():
if len(motion_data["votes"]) < 3: # Skip motions with too few votes
continue
# Get motion details
motion_details = self._get_motion_details(besluit_id)
if not motion_details:
# Create basic motion data if we can't get details
motion_details = {
"title": f"Motion {besluit_id[:8]}",
"description": "No description available",
"date": motion_data["latest_date"].split("T")[0]
if motion_data["latest_date"]
else datetime.now().strftime("%Y-%m-%d"),
}
# Calculate winning margin
voting_results = motion_data["votes"]
total_votes = sum(
1 for vote in voting_results.values() if vote in ["voor", "tegen"]
)
if total_votes == 0:
continue
votes_for = sum(1 for vote in voting_results.values() if vote == "voor")
winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes
motion = {
"title": motion_details["title"],
"description": motion_details["description"],
"date": motion_details["date"],
"policy_area": self._determine_policy_area(
motion_details["title"], motion_details["description"]
),
"voting_results": voting_results,
"winning_margin": winning_margin,
"url": f"https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id}",
"externe_identifier": motion_details.get("externe_identifier"),
"body_text": motion_details.get("body_text"),
}
motions.append(motion)
return motions
def _get_motion_details(self, besluit_id: str) -> Optional[Dict]:
"""Get motion details from Besluit endpoint.
Fetches Zaak.Onderwerp for the human-readable title, then follows the
Zaak Document DocumentVersie chain to get the ExterneIdentifier,
which is used to scrape the full motion body text from
zoek.officielebekendmakingen.nl.
"""
try:
# Step 1: Besluit → Zaak (title) + Zaak.Id for document lookup
url = f"{self.odata_base_url}/Besluit({besluit_id})"
params = {"$expand": "Zaak($select=Id,Onderwerp)"}
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
record = response.json()
zaak_list = record.get("Zaak", [])
onderwerp = None
zaak_id = None
if zaak_list:
onderwerp = zaak_list[0].get("Onderwerp")
zaak_id = zaak_list[0].get("Id")
besluit_tekst = record.get("BesluitTekst") or ""
date_str = record.get("GewijzigdOp", "")
date = (
date_str.split("T")[0]
if date_str
else datetime.now().strftime("%Y-%m-%d")
)
title = onderwerp or f"Motion {besluit_id[:8]}"
description = onderwerp or besluit_tekst or "Geen beschrijving beschikbaar"
# Step 2: Fetch ExterneIdentifier via Zaak → Document → DocumentVersie
externe_identifier = None
body_text = None
if zaak_id:
externe_identifier = self._get_externe_identifier(zaak_id)
if externe_identifier:
body_text = self._fetch_body_text(externe_identifier)
return {
"title": title,
"description": body_text or description,
"date": date,
"externe_identifier": externe_identifier,
"body_text": body_text,
}
except Exception as e:
print(f"Error getting motion details for {besluit_id}: {e}")
return None
def _get_externe_identifier(self, zaak_id: str) -> Optional[str]:
"""Fetch the ExterneIdentifier for the first non-deleted DocumentVersie of a Zaak."""
try:
url = f"{self.odata_base_url}/Zaak({zaak_id})"
params = {
"$expand": "Document($expand=DocumentVersie($select=Id,ExterneIdentifier,Extensie,Verwijderd))"
}
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
data = response.json()
for doc in data.get("Document", []):
for versie in doc.get("DocumentVersie", []):
if versie.get("Verwijderd"):
continue
ext_id = versie.get("ExterneIdentifier")
if ext_id:
return ext_id
except Exception as e:
print(f"Error fetching ExterneIdentifier for zaak {zaak_id}: {e}")
return None
def _fetch_body_text(self, externe_identifier: str) -> Optional[str]:
"""Scrape full motion body text from zoek.officielebekendmakingen.nl."""
try:
url = f"https://zoek.officielebekendmakingen.nl/{externe_identifier}.html"
response = self.session.get(url, timeout=config.API_TIMEOUT)
response.raise_for_status()
html = response.text
# Strip tags
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"&[a-z]+;", " ", text)
text = re.sub(r"\s+", " ", text).strip()
# Find the motion body starting at the first relevant keyword
start_keywords = [
"constaterende",
"overwegende",
"verzoekt",
"spreekt uit",
"roept op",
"de kamer,",
]
start_pos = len(text)
for kw in start_keywords:
pos = text.lower().find(kw)
if pos != -1 and pos < start_pos:
start_pos = pos
if start_pos == len(text):
return None # No motion body found
body = text[start_pos:]
# Trim at end markers
end_markers = [
"gaat over tot de orde van de dag",
"naar boven",
"deze motie is",
"nr.",
]
for marker in end_markers:
pos = body.lower().find(marker)
if pos != -1:
body = body[:pos]
body = body.strip()
return body if len(body) > 50 else None
except Exception as e:
print(f"Error fetching body text for {externe_identifier}: {e}")
return None
def _determine_policy_area(self, title: str, description: str) -> str:
"""Determine policy area from motion title and description"""
text = (title + " " + description).lower()
# Policy area keyword mapping
policy_mapping = {
"Economie": [
"economie",
"belasting",
"budget",
"financiën",
"werkgelegenheid",
"bedrijven",
"economisch",
],
"Klimaat": [
"klimaat",
"co2",
"duurzaam",
"energie",
"milieu",
"uitstoot",
"klimaatverandering",
],
"Immigratie": [
"migratie",
"asiel",
"vreemdeling",
"integratie",
"naturalisatie",
"immigratie",
],
"Zorg": [
"zorg",
"gezondheid",
"ziekenhuis",
"medicijn",
"arts",
"patiënt",
"gezondheidszorg",
],
"Onderwijs": [
"onderwijs",
"school",
"universiteit",
"student",
"leraar",
"educatie",
],
"Defensie": [
"defensie",
"militair",
"veiligheid",
"oorlog",
"leger",
"veiligheidsdienst",
],
}
for area, keywords in policy_mapping.items():
if any(keyword in text for keyword in keywords):
return area
return "Algemeen"
def test_api_connection(self) -> bool:
"""Test if API is accessible"""
try:
url = f"{self.odata_base_url}/Stemming"
params = {"$top": 1}
response = self.session.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
return len(data.get("value", [])) > 0
except Exception as e:
print(f"API connection test failed: {e}")
return False

app.py
@@ -0,0 +1,310 @@
# app.py
import streamlit as st
import pandas as pd
from datetime import datetime
from database import db
from summarizer import summarizer
from config import config
import json
# Page config
st.set_page_config(
page_title="Nederlandse Politieke Kompas", page_icon="🇳🇱", layout="wide"
)
def main():
st.title("🇳🇱 Nederlandse Politieke Kompas")
st.markdown(
"Ontdek welke politieke partij het beste bij jouw idealen past door te stemmen op echte Tweede Kamer moties."
)
# Initialize session state
if "session_id" not in st.session_state:
st.session_state.session_id = None
if "current_motion_index" not in st.session_state:
st.session_state.current_motion_index = 0
if "motions" not in st.session_state:
st.session_state.motions = []
if "show_results" not in st.session_state:
st.session_state.show_results = False
# Sidebar configuration
with st.sidebar:
st.header("Instellingen")
motion_count = st.slider(
"Aantal moties",
min_value=5,
max_value=25,
value=config.DEFAULT_MOTION_COUNT,
)
policy_area = st.selectbox("Beleidsgebied", config.POLICY_AREAS)
margin_range = st.slider(
"Controversiële moties (%)",
min_value=0,
max_value=100,
value=(
config.DEFAULT_WINNING_MARGIN_MIN,
config.DEFAULT_WINNING_MARGIN_MAX,
),
)
if st.button("Start Nieuwe Sessie"):
start_new_session(motion_count, policy_area, margin_range)
if st.button("Genereer AI Samenvattingen"):
with st.spinner("Genereren van samenvattingen..."):
summarizer.update_motion_summaries()
st.success("Samenvattingen bijgewerkt!")
# Main content
if not st.session_state.session_id:
show_welcome_screen(motion_count, policy_area, margin_range)
elif st.session_state.show_results:
show_results()
else:
show_motion_interface()
def start_new_session(motion_count, policy_area, margin_range):
"""Start a new voting session"""
# Get filtered motions
motions = db.get_filtered_motions(
policy_area=policy_area,
min_margin=margin_range[0] / 100,
max_margin=margin_range[1] / 100,
limit=motion_count,
)
if len(motions) < motion_count:
st.warning(
f"Slechts {len(motions)} moties gevonden met de geselecteerde criteria."
)
# Create session
session_id = db.create_session(motion_count)
# Update session state
st.session_state.session_id = session_id
st.session_state.motions = motions[:motion_count]
st.session_state.current_motion_index = 0
st.session_state.show_results = False
st.rerun()
def show_welcome_screen(motion_count, policy_area, margin_range):
"""Show welcome screen with start button"""
col1, col2, col3 = st.columns([1, 2, 1])
with col2:
st.markdown("### Welkom bij de Nederlandse Politieke Kompas!")
st.markdown(f"""
**Jouw instellingen:**
- 📊 **{motion_count} moties** uit het beleidsgebied **{policy_area}**
- 🎯 **Controversiële moties** tussen {margin_range[0]}% en {margin_range[1]}% marge
Klik op "Start Nieuwe Sessie" in de zijbalk om te beginnen met stemmen.
""")
st.info(
"💡 **Tip**: Kies 'Alle' als beleidsgebied voor een breed overzicht van verschillende onderwerpen."
)
def show_motion_interface():
"""Show motion voting interface"""
if not st.session_state.motions:
st.error("Geen moties gevonden. Start een nieuwe sessie.")
return
current_index = st.session_state.current_motion_index
total_motions = len(st.session_state.motions)
# Progress bar
progress = (current_index) / total_motions
st.progress(progress, text=f"Motie {current_index + 1} van {total_motions}")
if current_index >= total_motions:
st.session_state.show_results = True
st.rerun()
return
motion = st.session_state.motions[current_index]
# Motion display
st.header(f"Motie {current_index + 1}: {motion['title']}")
# Policy area tag
st.markdown(f"**Beleidsgebied:** {motion['policy_area']}")
# Layman explanation (prominent)
if motion.get("layman_explanation"):
st.markdown("### 📝 Uitleg in begrijpelijke taal:")
st.markdown(f"*{motion['layman_explanation']}*")
# Original description (collapsible)
motion_text = motion.get("body_text") or motion.get("description", "")
if motion_text:
label = (
"📋 Volledige motietekst"
if motion.get("body_text")
else "📋 Originele motiebeschrijving"
)
with st.expander(label):
st.write(motion_text)
# Voting buttons
st.markdown("### 🗳 Hoe zou jij stemmen?")
col1, col2, col3 = st.columns(3)
with col1:
if st.button("✅ Voor", use_container_width=True, type="primary"):
cast_vote("Voor")
with col2:
if st.button("❌ Tegen", use_container_width=True):
cast_vote("Tegen")
with col3:
if st.button("🚫 Geen stem", use_container_width=True):
cast_vote("Geen stem")
def cast_vote(vote_choice):
"""Record user vote and move to next motion"""
current_motion = st.session_state.motions[st.session_state.current_motion_index]
# Save vote to database
db.update_user_vote(st.session_state.session_id, current_motion["id"], vote_choice)
# Move to next motion
st.session_state.current_motion_index += 1
st.rerun()
def show_results():
"""Show voting results and party matches"""
st.header("🎯 Jouw Resultaten")
# Calculate party matches
party_matches = db.calculate_party_matches(st.session_state.session_id)
if not party_matches:
st.error("Geen resultaten beschikbaar.")
return
# Party ranking table
st.subheader("📊 Partij Overeenkomsten (van hoog naar laag)")
df = pd.DataFrame(party_matches)
df.columns = ["Partij", "Overeenkomst %", "Eens", "Totaal"]
# Style the dataframe
def color_agreement(val):
if val >= 80:
return "background-color: #d4edda"
elif val >= 60:
return "background-color: #fff3cd"
else:
return "background-color: #f8d7da"
styled_df = df.style.applymap(color_agreement, subset=["Overeenkomst %"])
st.dataframe(styled_df, use_container_width=True, hide_index=True)
# Top match highlight
top_match = party_matches[0]
st.success(
f"🏆 **Beste match:** {top_match['party']} ({top_match['agreement_percentage']}% overeenkomst)"
)
# Detailed motion overview
st.subheader("📋 Gedetailleerd Overzicht per Motie")
show_detailed_motion_results()
# New session button
if st.button("🔄 Start Nieuwe Sessie"):
# Clear session state
for key in ["session_id", "motions", "current_motion_index", "show_results"]:
if key in st.session_state:
del st.session_state[key]
st.rerun()
def show_detailed_motion_results():
"""Show detailed voting results for each motion"""
import duckdb
conn = duckdb.connect(config.DATABASE_PATH)
# Get user votes
user_data = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(st.session_state.session_id,),
).fetchone()
if not user_data:
return
user_votes = json.loads(user_data[0])
# Get motion details
motion_ids = list(user_votes.keys())
if motion_ids:
placeholders = ",".join(["?" for _ in motion_ids])
motions = conn.execute(
f"""
SELECT id, title, layman_explanation, body_text, description, voting_results FROM motions
WHERE id IN ({placeholders})
""",
motion_ids,
).fetchall()
for (
motion_id,
title,
layman_explanation,
body_text,
description,
voting_results_json,
) in motions:
voting_results = json.loads(voting_results_json)
user_vote = user_votes[str(motion_id)]
with st.expander(f"**{title}** (Jouw stem: {user_vote})"):
# Show layman explanation prominently
if layman_explanation:
st.markdown("**📝 Uitleg:**")
st.markdown(f"*{layman_explanation}*")
# Show full motion body text if available, otherwise description
motion_text = body_text or description
if motion_text:
st.markdown("**📋 Motiebeschrijving:**")
st.write(motion_text)
# Create voting overview
parties_voor = [p for p, v in voting_results.items() if v == "voor"]
parties_tegen = [p for p, v in voting_results.items() if v == "tegen"]
col1, col2 = st.columns(2)
with col1:
st.markdown("**Voor:**")
st.write(", ".join(parties_voor) if parties_voor else "Geen")
with col2:
st.markdown("**Tegen:**")
st.write(", ".join(parties_tegen) if parties_tegen else "Geen")
conn.close()
if __name__ == "__main__":
main()

config.py
@@ -0,0 +1,51 @@
# config.py (complete updated version)
import os
from dataclasses import dataclass
from typing import List
@dataclass
class Config:
# Database settings
DATABASE_PATH = "data/motions.db"
# API settings (updated)
TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
API_TIMEOUT = 30
API_BATCH_SIZE = 250 # Increased based on API capabilities
API_MAX_LIMIT = 250
# AI settings
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
QWEN_MODEL = "qwen/qwen-2.5-72b-instruct"
# App settings
DEFAULT_MOTION_COUNT = 10
DEFAULT_WINNING_MARGIN_MIN = (
0 # % - include all, filter by layman_explanation instead
)
DEFAULT_WINNING_MARGIN_MAX = 100 # %
SESSION_TIMEOUT_DAYS = 30
# Policy areas
POLICY_AREAS = [
"Alle",
"Economie",
"Klimaat",
"Immigratie",
"Zorg",
"Onderwijs",
"Defensie",
"Sociale Zaken",
"Algemeen",
]
# Scraper defaults (previously missing)
BASE_URL = (
"https://www.tweedekamer.nl/zoeken/zoekresultaten" # base for scraping motions
)
SCRAPING_DELAY = int(os.getenv("SCRAPING_DELAY", "5"))
config = Config()

database.py
@@ -0,0 +1,582 @@
# database.py (final working version)
import duckdb
import json
import uuid
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
from config import config
import logging
_logger = logging.getLogger(__name__)
class MotionDatabase:
def __init__(self, db_path: str = config.DATABASE_PATH):
self.db_path = db_path
self._init_database()
def _init_database(self):
"""Initialize database with required tables"""
# Create directory if it doesn't exist
import os
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
conn = duckdb.connect(self.db_path)
# Create sequence for auto-incrementing IDs
try:
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception:
pass
# Create tables with proper ID handling
conn.execute("""
CREATE TABLE IF NOT EXISTS motions (
id INTEGER DEFAULT nextval('motions_id_seq'),
title TEXT NOT NULL,
description TEXT,
date DATE,
policy_area TEXT,
voting_results JSON,
winning_margin FLOAT,
controversy_score FLOAT,
layman_explanation TEXT,
url TEXT UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS user_sessions (
session_id TEXT PRIMARY KEY,
user_votes JSON,
completed_motions INTEGER DEFAULT 0,
total_motions INTEGER DEFAULT 10,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS party_results (
session_id TEXT,
party_name TEXT,
agreement_percentage FLOAT,
agreed_motions JSON,
disagreed_motions JSON,
PRIMARY KEY (session_id, party_name)
)
""")
# New pipeline tables
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS mp_votes (
id INTEGER DEFAULT nextval('mp_votes_id_seq'),
motion_id INTEGER NOT NULL,
mp_name TEXT NOT NULL,
party TEXT,
vote TEXT NOT NULL,
date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS mp_metadata (
mp_name TEXT PRIMARY KEY,
party TEXT,
van DATE,
tot_en_met DATE,
persoon_id TEXT
)
""")
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS svd_vectors (
id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
window_id TEXT NOT NULL,
entity_type TEXT NOT NULL,
entity_id TEXT NOT NULL,
vector JSON NOT NULL,
model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.close()
def reset_database(self):
"""Development helper: drop known tables and re-run initialization.
WARNING: intended for dev/test only. This will remove tables and recreate schema.
"""
conn = duckdb.connect(self.db_path)
try:
# Drop known tables if they exist
for t in ("party_results", "user_sessions", "motions"):
try:
conn.execute(f"DROP TABLE IF EXISTS {t}")
except Exception:
pass
# Recreate schema
conn.close()
self._init_database()
finally:
try:
conn.close()
except Exception:
pass
def insert_motion(self, motion_data: Dict) -> bool:
"""Insert a new motion into database"""
try:
conn = duckdb.connect(self.db_path)
# Check if motion already exists by URL to avoid duplicates
existing = conn.execute(
"""
SELECT COUNT(*) FROM motions WHERE url = ?
""",
(motion_data["url"],),
).fetchone()
if existing and existing[0] > 0:
conn.close()
return False # Motion already exists
# Insert motion - id will be auto-generated by sequence
conn.execute(
"""
INSERT INTO motions
(title, description, date, policy_area, voting_results,
winning_margin, controversy_score, url, externe_identifier, body_text, created_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(
motion_data["title"],
motion_data["description"] or "",
motion_data["date"],
motion_data["policy_area"],
json.dumps(motion_data["voting_results"]),
motion_data["winning_margin"],
1 - motion_data["winning_margin"], # controversy score
motion_data["url"],
motion_data.get("externe_identifier"),
motion_data.get("body_text"),
),
)
conn.close()
return True
except Exception as e:
print(f"Error inserting motion: {e}")
if "conn" in locals():
conn.close()
return False
def get_filtered_motions(
self,
policy_area: str = "Alle",
min_margin: float = 0.2,
max_margin: float = 0.8,
limit: int = 100,
) -> List[Dict]:
"""Get motions filtered by criteria"""
conn = duckdb.connect(self.db_path)
query = """
SELECT * FROM motions
WHERE winning_margin BETWEEN ? AND ?
AND layman_explanation IS NOT NULL
AND layman_explanation != ''
"""
params = [min_margin, max_margin]
if policy_area != "Alle":
query += " AND policy_area = ?"
params.append(policy_area)
query += " ORDER BY controversy_score DESC LIMIT ?"
params.append(limit)
try:
result = conn.execute(query, params).fetchall()
columns = [desc[0] for desc in conn.description]
conn.close()
return [dict(zip(columns, row)) for row in result]
except Exception as e:
print(f"Error querying motions: {e}")
conn.close()
return []
def create_session(self, total_motions: int = 10) -> str:
"""Create new user session"""
session_id = str(uuid.uuid4())
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO user_sessions (session_id, user_votes, total_motions)
VALUES (?, '{}', ?)
""",
(session_id, total_motions),
)
conn.close()
return session_id
def update_user_vote(self, session_id: str, motion_id: int, vote: str):
"""Update user vote for a motion"""
conn = duckdb.connect(self.db_path)
# Get current votes
current_votes = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(session_id,),
).fetchone()
if current_votes:
votes_dict = json.loads(current_votes[0])
votes_dict[str(motion_id)] = vote
conn.execute(
"""
UPDATE user_sessions
SET user_votes = ?,
completed_motions = ?,
last_updated = CURRENT_TIMESTAMP
WHERE session_id = ?
""",
(json.dumps(votes_dict), len(votes_dict), session_id),
)
conn.close()
def calculate_party_matches(self, session_id: str) -> List[Dict]:
"""Calculate party agreement percentages"""
conn = duckdb.connect(self.db_path)
# Get user votes and motion data
user_data = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(session_id,),
).fetchone()
if not user_data:
return []
user_votes = json.loads(user_data[0])
motion_ids = list(user_votes.keys())
if not motion_ids:
return []
# Get motion voting results
placeholders = ",".join(["?" for _ in motion_ids])
motions = conn.execute(
f"""
SELECT id, voting_results FROM motions
WHERE id IN ({placeholders})
""",
motion_ids,
).fetchall()
conn.close()
# Calculate agreements
party_scores = {}
for motion_id, voting_results_json in motions:
voting_results = json.loads(voting_results_json)
user_vote = user_votes[str(motion_id)]
if user_vote == "Geen stem": # Skip abstentions
continue
for party, party_vote in voting_results.items():
# Skip individual MP names (contain comma, e.g. "Yesilgöz-Zegerius, D.")
# Party/fractie names never contain a comma.
if "," in party:
continue
if party not in party_scores:
party_scores[party] = {"agreed": 0, "total": 0}
party_scores[party]["total"] += 1
# Check agreement
if (user_vote == "Voor" and party_vote == "voor") or (
user_vote == "Tegen" and party_vote == "tegen"
):
party_scores[party]["agreed"] += 1
# Convert to percentages and sort
results = []
for party, scores in party_scores.items():
if scores["total"] > 0:
agreement_pct = (scores["agreed"] / scores["total"]) * 100
results.append(
{
"party": party,
"agreement_percentage": round(agreement_pct, 1),
"agreed_motions": scores["agreed"],
"total_motions": scores["total"],
}
)
return sorted(results, key=lambda x: x["agreement_percentage"], reverse=True)
def store_embedding(self, motion_id: int, model: str, vector: List[float]) -> int:
"""Store an embedding for a motion. Returns inserted row id or -1 on failure."""
try:
conn = duckdb.connect(self.db_path)
# store vector as JSON
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
(motion_id, model, json.dumps(vector)),
)
row = conn.execute("SELECT max(id) FROM embeddings").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
print(f"Error storing embedding: {e}")
try:
conn.close()
except Exception:
pass
return -1
def search_similar(
self, query_vector: List[float], top_k: int = 5, model: Optional[str] = None
) -> List[Dict]:
"""Naive in-Python cosine similarity search over stored embeddings.
Returns list of dicts with keys: id, motion_id, model, score, created_at
"""
try:
conn = duckdb.connect(self.db_path)
if model:
rows = conn.execute(
"SELECT id, motion_id, model, vector, created_at FROM embeddings WHERE model = ?",
(model,),
).fetchall()
else:
rows = conn.execute(
"SELECT id, motion_id, model, vector, created_at FROM embeddings"
).fetchall()
conn.close()
results = []
import math
for r in rows:
id_, motion_id, mdl, vector_json, created_at = r
try:
vec = json.loads(vector_json)
except Exception:
continue
# cosine similarity
try:
dot = sum(float(a) * float(b) for a, b in zip(query_vector, vec))
na = math.sqrt(sum(float(a) * float(a) for a in query_vector))
nb = math.sqrt(sum(float(b) * float(b) for b in vec))
score = dot / (na * nb) if na and nb else 0.0
except Exception:
score = 0.0
results.append(
{
"id": id_,
"motion_id": motion_id,
"model": mdl,
"score": score,
"created_at": created_at,
}
)
results.sort(key=lambda x: x["score"], reverse=True)
return results[:top_k]
except Exception as e:
print(f"Error searching embeddings: {e}")
try:
conn.close()
except Exception:
pass
return []
def mp_votes_exists_for_motion(self, motion_id: int) -> bool:
try:
conn = duckdb.connect(self.db_path)
row = conn.execute(
"SELECT COUNT(*) FROM mp_votes WHERE motion_id = ?",
(motion_id,),
).fetchone()
conn.close()
return bool(row and row[0] > 0)
except Exception as e:
_logger.error(f"Error checking mp_votes existence: {e}")
try:
conn.close()
except Exception:
pass
return False
def insert_mp_vote(
self,
motion_id: int,
mp_name: str,
vote: str,
date: Optional[str] = None,
party: Optional[str] = None,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO mp_votes (motion_id, mp_name, party, vote, date, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(motion_id, mp_name, party, vote, date),
)
row = conn.execute("SELECT max(id) FROM mp_votes").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error inserting mp_vote: {e}")
try:
conn.close()
except Exception:
pass
return -1
def upsert_mp_metadata(
self,
mp_name: str,
party: Optional[str],
van: Optional[str],
tot_en_met: Optional[str],
persoon_id: Optional[str],
) -> None:
try:
conn = duckdb.connect(self.db_path)
exists = conn.execute(
"SELECT COUNT(*) FROM mp_metadata WHERE mp_name = ?", (mp_name,)
).fetchone()
if exists and exists[0] > 0:
conn.execute(
"""
UPDATE mp_metadata SET party = ?, van = ?, tot_en_met = ?, persoon_id = ?
WHERE mp_name = ?
""",
(party, van, tot_en_met, persoon_id, mp_name),
)
else:
conn.execute(
"""
INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id)
VALUES (?, ?, ?, ?, ?)
""",
(mp_name, party, van, tot_en_met, persoon_id),
)
conn.close()
except Exception as e:
_logger.error(f"Error upserting mp_metadata: {e}")
try:
conn.close()
except Exception:
pass
def store_svd_vector(
self,
window_id: str,
entity_type: str,
entity_id: str,
vector: List[float],
model: Optional[str] = None,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(window_id, entity_type, entity_id, json.dumps(vector), model),
)
row = conn.execute("SELECT max(id) FROM svd_vectors").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error storing svd_vector: {e}")
try:
conn.close()
except Exception:
pass
return -1
def store_fused_embedding(
self,
motion_id: int,
window_id: str,
vector: List[float],
svd_dims: int,
text_dims: int,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(motion_id, window_id, json.dumps(vector), svd_dims, text_dims),
)
row = conn.execute("SELECT max(id) FROM fused_embeddings").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error storing fused_embedding: {e}")
try:
conn.close()
except Exception:
pass
return -1
db = MotionDatabase()

@ -0,0 +1,20 @@
version: '3.8'
services:
stemwijzer:
build: .
image: stemwijzer:latest
container_name: stemwijzer_app
restart: unless-stopped
ports:
- "8501:8501"
volumes:
- ./data:/home/app/app/data:rw
environment:
- PYTHONPATH=/home/app/app
- OPENROUTER_API_KEY
- OTHER_SECRET
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8501/"]
interval: 30s
timeout: 3s
retries: 3

@ -0,0 +1,72 @@
# Recomputing Similarity (Admin)
This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.
## What this does
- Recompute similarity vectors/scores for existing records in the database.
- (Optionally) run the clusterer job that groups similar items based on recomputed vectors.
These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.
## Migration filenames
When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like `recompute_similarity` or `clusterer` in the filename, for example:
- `20260101_001_recompute_similarity.py`
- `20260215_002_clusterer_migration.py`
Check your migrations folder for the exact filenames used in your environment.
## Environment variables
When running the CLI locally you may need to set the following environment variables.
- `TEST_DB_URL` — connection string for a test/development database (used by local runs when you don't want to touch production data).
- `AI_PROVIDER_MOCK` — when set, the AI/embedding provider is mocked so you don't make real API calls during development. Any non-empty value (e.g. `1`, `true`, `yes`) is treated as truthy.
- `SIMILARITY_TOP_N` — default number of top similar items to compute/keep for each record. The CLI `--top-n` flag overrides this value for the duration of the run.
Examples:
- Export in a shell (persistent for your session):
export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
export AI_PROVIDER_MOCK="true"
export SIMILARITY_TOP_N="50"
- Inline for a single command (non-persistent):
TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100
Notes:
- The `--top-n` CLI flag takes precedence over `SIMILARITY_TOP_N` when both are provided (see the sketch below).
- Set `AI_PROVIDER_MOCK` to a truthy value (e.g. `1`, `true`, `yes`) to avoid real external AI calls during local runs.
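As a rough illustration of the precedence rule (the function name and the default of 25 below are assumptions for this sketch, not the actual CLI code), the resolution looks like:

```
import os

def resolve_top_n(cli_top_n=None, default=25):
    """Effective top-n: the --top-n flag wins, then SIMILARITY_TOP_N, then a default."""
    if cli_top_n is not None:
        return int(cli_top_n)
    env = os.environ.get("SIMILARITY_TOP_N")
    return int(env) if env else default
```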
## Running locally (development)
The CLI lives under `src/cli`. Use the module runner to execute the recompute script. Example commands:
Run a dry-run that doesn't persist changes:
```
python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run
```
Run for real (writes results to the DB):
```
python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
```
Common flags
- `--top-n` — override SIMILARITY_TOP_N for this run.
- `--batch-size` — number of records to process per batch.
- `--dry-run` — inspect what would be changed without writing to the DB.
Notes
- Always point `TEST_DB_URL` at a non-production database when experimenting.
- Use `AI_PROVIDER_MOCK=true` to skip external calls and speed up local dev.
- If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.
If you need help or encounter mismatches between migration files and the CLI, check the migrations folder and speak with the team member who authored the change.

@ -0,0 +1,67 @@
# fix_database.py (updated version)
import os
import duckdb
from config import config
def fix_database():
"""Completely reset the database with correct schema"""
# Remove the existing database file completely
if os.path.exists(config.DATABASE_PATH):
os.remove(config.DATABASE_PATH)
print("Removed existing database file")
# Create directory if it doesn't exist
os.makedirs(os.path.dirname(config.DATABASE_PATH), exist_ok=True)
# Initialize with correct schema
conn = duckdb.connect(config.DATABASE_PATH)
# Create sequence for auto-incrementing IDs
conn.execute("CREATE SEQUENCE motions_id_seq START 1")
# Create motions table with sequence-based auto-increment
conn.execute("""
CREATE TABLE motions (
id INTEGER DEFAULT nextval('motions_id_seq'),
title TEXT NOT NULL,
description TEXT,
date DATE,
policy_area TEXT,
voting_results JSON,
winning_margin FLOAT,
controversy_score FLOAT,
layman_explanation TEXT,
url TEXT UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE user_sessions (
session_id TEXT PRIMARY KEY,
user_votes JSON,
completed_motions INTEGER DEFAULT 0,
total_motions INTEGER DEFAULT 10,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE party_results (
session_id TEXT,
party_name TEXT,
agreement_percentage FLOAT,
agreed_motions JSON,
disagreed_motions JSON,
PRIMARY KEY (session_id, party_name)
)
""")
conn.close()
print("Database recreated with correct schema using sequences")
if __name__ == "__main__":
fix_database()

@ -0,0 +1,6 @@
def main():
print("Hello from stemwijzer!")
if __name__ == "__main__":
main()

@ -0,0 +1,11 @@
-- Add a separate embeddings table for semantic search and storage of vectors (DuckDB-compatible)
CREATE TABLE IF NOT EXISTS embeddings (
id INTEGER,
motion_id INTEGER NOT NULL,
model TEXT NOT NULL,
vector JSON NOT NULL,
created_at TIMESTAMP DEFAULT current_timestamp
);
-- DuckDB does not support AUTOINCREMENT; emulate id via a sequence if needed elsewhere
CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1;
-- Populating the id via a trigger-like insert pattern is handled by application code (select nextval when inserting)

@ -0,0 +1,6 @@
-- Migration: add externe_identifier and body_text columns to motions
-- externe_identifier: e.g. "kst-36600-VII-28" from DocumentVersie.ExterneIdentifier
-- body_text: full plain-text motion body scraped from officielebekendmakingen.nl
ALTER TABLE motions ADD COLUMN IF NOT EXISTS externe_identifier VARCHAR;
ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text VARCHAR;

@ -0,0 +1,24 @@
-- Migration: create audit_events table
-- Date: 2026-03-22
-- Description: Placeholder migration to add an audit_events table to record audit logs.
--
-- Decision: The actual SQL is intentionally left commented out to avoid making
-- database changes during test runs. When ready to apply, uncomment and
-- adapt the SQL for your database engine.
/*
CREATE TABLE audit_events (
id UUID PRIMARY KEY,
actor_id UUID NOT NULL,
action TEXT NOT NULL,
target_type TEXT,
target_id UUID,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
-- Add indexes as needed, e.g.:
-- CREATE INDEX ON audit_events (actor_id);
*/
-- End of migration placeholder

@ -0,0 +1,15 @@
-- 2026-03-22-add-similarity-cache.sql
-- Placeholder migration for adding a similarity_cache table
-- Decision: Keep SQL commented out so CI does not accidentally modify databases.
/*
-- Example (commented out):
CREATE TABLE similarity_cache (
id SERIAL PRIMARY KEY,
key TEXT NOT NULL,
vector FLOAT8[] NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
*/
-- No executable SQL in this file. Intentionally left as a safe no-op.

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1;
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,9 @@
----SQL
CREATE TABLE IF NOT EXISTS mp_metadata (
mp_name TEXT PRIMARY KEY,
party TEXT,
van DATE,
tot_en_met DATE,
persoon_id TEXT
);
----END

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1;
CREATE TABLE IF NOT EXISTS mp_votes (
id INTEGER DEFAULT nextval('mp_votes_id_seq'),
motion_id INTEGER NOT NULL,
mp_name TEXT NOT NULL,
party TEXT,
vote TEXT NOT NULL,
date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1;
CREATE TABLE IF NOT EXISTS svd_vectors (
id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
window_id TEXT NOT NULL,
entity_type TEXT NOT NULL,
entity_id TEXT NOT NULL,
vector JSON NOT NULL,
model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,75 @@
import json
import logging
from typing import Optional
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
def extract_mp_votes(db_path: Optional[str] = None, limit: Optional[int] = None):
"""Extract individual MP votes from motions.voting_results and store them
in the mp_votes table.
Returns a dict with summary counts:
- motions_scanned: number of motions inspected
- mp_rows_inserted: number of mp_votes rows inserted
- motions_skipped: number of motions skipped because mp_votes already existed
"""
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
conn = duckdb.connect(db.db_path)
try:
# support optional limit to only scan a subset of motions
if limit is not None:
rows = conn.execute(
"SELECT id, voting_results, date FROM motions LIMIT ?", (limit,)
).fetchall()
else:
rows = conn.execute(
"SELECT id, voting_results, date FROM motions"
).fetchall()
finally:
conn.close()
mp_rows_inserted = 0
motions_skipped = 0
motions_scanned = 0
for motion_id, voting_results_json, date in rows:
motions_scanned += 1
try:
if db.mp_votes_exists_for_motion(motion_id):
_logger.debug(
"Skipping motion %s because mp_votes already exist", motion_id
)
motions_skipped += 1
continue
# voting_results may be stored as JSON text or as native JSON; ensure it's a dict
if isinstance(voting_results_json, str):
voting_results = json.loads(voting_results_json)
else:
voting_results = voting_results_json
for actor, vote in (voting_results or {}).items():
# Individual MP names contain a comma (e.g. "Last, F.")
if "," not in actor:
continue
inserted_id = db.insert_mp_vote(
motion_id=motion_id, mp_name=actor, vote=vote, date=date, party=None
)
if inserted_id and inserted_id > 0:
mp_rows_inserted += 1
except Exception as e:
_logger.error("Error processing motion %s: %s", motion_id, e)
return {
"motions_scanned": motions_scanned,
"mp_rows_inserted": mp_rows_inserted,
"motions_skipped": motions_skipped,
}

@ -0,0 +1,94 @@
import logging
from typing import Optional
import requests
from database import MotionDatabase
logger = logging.getLogger(__name__)
def normalize_mp_name(
achternaam: str, initialen: Optional[str], tussenvoegsel: Optional[str]
) -> str:
"""Reconstruct ActorNaam format used in voting_results keys.
Format: "{Tussenvoegsel} {Achternaam}, {Initialen}" with sensible stripping when
tussenvoegsel is missing.
"""
parts = []
if tussenvoegsel:
parts.append(tussenvoegsel)
parts.append(achternaam)
name = " ".join(parts).strip()
# Ensure the displayed name starts with an uppercase letter so
# ORDER BY mp_name behaves predictably across databases that may
# sort uppercase before lowercase. Only change the first character
# to upper-case to avoid lowercasing other letters (e.g. hyphenated
# or already capitalized parts).
if name and name[0].islower():
name = name[0].upper() + name[1:]
if initialen:
name = f"{name}, {initialen}"
return name
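# Examples (matching the test fixtures):
#   normalize_mp_name("Plas", "C.", "van der")           -> "Van der Plas, C."
#   normalize_mp_name("Yesilgöz-Zegerius", "D.", None)   -> "Yesilgöz-Zegerius, D."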
def fetch_mp_metadata(
db_path: str, odata_url: str = "https://odata.example/FractieZetelPersoon"
) -> int:
"""Fetch MP party membership and tenure from OData and upsert into DB.
Returns the number of records processed (inserted or updated).
"""
session = requests.Session()
try:
resp = session.get(odata_url)
resp.raise_for_status()
data = resp.json()
except Exception as e:
logger.error("Failed to fetch MP metadata: %s", e)
raise
values = data.get("value") if isinstance(data, dict) else None
if values is None:
logger.error("Unexpected OData payload; missing 'value' list")
return 0
db = MotionDatabase(db_path)
processed = 0
for item in values:
try:
persoon = item.get("Persoon") or {}
fractiezetel = item.get("FractieZetel") or {}
fractie = fractiezetel.get("Fractie") or {}
achternaam = persoon.get("Achternaam")
initialen = persoon.get("Initialen")
tussenvoegsel = persoon.get("Tussenvoegsel")
persoon_id = persoon.get("Id")
party = fractie.get("NaamNL")
van = item.get("Van")
tot_en_met = item.get("TotEnMet")
if not achternaam:
logger.debug("Skipping record without achternaam: %s", item)
continue
mp_name = normalize_mp_name(achternaam, initialen, tussenvoegsel)
db.upsert_mp_metadata(
mp_name=mp_name,
party=party,
van=van,
tot_en_met=tot_en_met,
persoon_id=persoon_id,
)
processed += 1
except Exception:
logger.exception("Error processing OData item: %s", item)
logger.info("Processed %d MP metadata records", processed)
return processed

@ -0,0 +1,116 @@
import json
import logging
from typing import Dict, Optional
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
def fuse_for_window(
window_id: str, db_path: Optional[str] = None, model: Optional[str] = None
) -> Dict[str, int]:
"""Fuse SVD vectors with text embeddings for motions in a window.
Parameters:
- window_id: id of the window to process
- db_path: optional path to duckdb database (if None MotionDatabase default is used)
- model: optional model name to filter text embeddings
Returns a dict with counts: inserted, skipped_missing_text, skipped_missing_svd, errors
"""
# Create MotionDatabase using provided path if given, otherwise use default
if db_path:
db = MotionDatabase(db_path=db_path)
conn = duckdb.connect(db_path)
else:
db = MotionDatabase()
# MotionDatabase always exposes the path it uses
conn = duckdb.connect(db.db_path)
# Fetch svd vectors for the window and entity_type=motion
rows = conn.execute(
"SELECT entity_id, vector FROM svd_vectors WHERE window_id = ? AND entity_type = ?",
(window_id, "motion"),
).fetchall()
# debug
_logger.debug("Found %d svd rows for window %s", len(rows), window_id)
inserted = 0
skipped_missing_text = 0
skipped_missing_svd = 0
errors = 0
for entity_id, svd_json in rows:
try:
svd_vec = json.loads(svd_json)
except Exception:
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
skipped_missing_svd += 1
continue
# Look up text embedding for this motion (most recent). If model is provided
# filter by model as well.
if model:
emb_row = conn.execute(
"SELECT vector FROM embeddings WHERE motion_id = ? AND model = ? ORDER BY created_at DESC LIMIT 1",
(int(entity_id), model),
).fetchone()
else:
emb_row = conn.execute(
"SELECT vector FROM embeddings WHERE motion_id = ? ORDER BY created_at DESC LIMIT 1",
(int(entity_id),),
).fetchone()
if not emb_row:
skipped_missing_text += 1
continue
try:
text_vec = json.loads(emb_row[0])
except Exception:
_logger.exception("Invalid text embedding JSON for motion %s", entity_id)
skipped_missing_text += 1
continue
try:
fused = list(svd_vec) + list(text_vec)
except Exception:
_logger.exception("Error concatenating vectors for motion %s", entity_id)
errors += 1
continue
# store fused embedding and check result
try:
res = db.store_fused_embedding(
int(entity_id),
window_id,
fused,
svd_dims=len(svd_vec),
text_dims=len(text_vec),
)
if res and res > 0:
inserted += 1
else:
errors += 1
_logger.error(
"Failed to store fused embedding for motion %s (db returned %s)",
entity_id,
res,
)
except Exception:
_logger.exception(
"Exception while storing fused embedding for motion %s", entity_id
)
errors += 1
conn.close()
return {
"inserted": inserted,
"skipped_missing_text": skipped_missing_text,
"skipped_missing_svd": skipped_missing_svd,
"errors": errors,
}

@ -0,0 +1,206 @@
import json
import logging
from typing import Optional, Dict, List, Tuple
import numpy as np
try:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from scipy.linalg import orthogonal_procrustes
_HAS_SCIPY = True
except Exception:
# Provide lightweight fallbacks for environments without scipy
csr_matrix = lambda x: x
def svds(a, k=1):
# fallback to numpy.linalg.svd on dense arrays
U, s, Vt = np.linalg.svd(np.array(a), full_matrices=False)
# numpy.linalg.svd returns components in descending singular-value order, so take
# the first k (the largest) to mimic scipy.sparse.linalg.svds
return U[:, :k], s[:k], Vt[:k, :]
def orthogonal_procrustes(A, B):
# simple orthogonal Procrustes via SVD: find R minimizing ||A R - B||
U, _, Vt = np.linalg.svd(A.T.dot(B))
R = U.dot(Vt)
scale = 1.0
return R, scale
_HAS_SCIPY = False
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
# Map textual votes to numeric values for SVD
VOTE_MAP = {
"Voor": 1.0,
"voor": 1.0,
"Tegen": -1.0,
"tegen": -1.0,
"Geen stem": 0.0,
"Onbekend": 0.0,
"Onbekend stem": 0.0,
"Blanco": 0.0,
}
def _safe_k(mat: np.ndarray, k: int) -> int:
"""Return a safe k for svds: must be < min(mat.shape)."""
if mat is None:
return 0
m, n = mat.shape
min_dim = min(m, n)
# svds requires k < min_dim
if min_dim <= 1:
return 0
return min(k, min_dim - 1)
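# Example: for a 5 x 3 vote matrix and k=50, _safe_k returns 2, because svds
# requires k to be strictly smaller than min(n_rows, n_cols).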
def _build_vote_matrix(
db: MotionDatabase, start_date: str, end_date: str
) -> Tuple[np.ndarray, List[str], List[int]]:
"""Build dense vote matrix (mp x motion) for votes between start_date and end_date.
Returns (matrix, mp_names, motion_ids)
"""
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
(start_date, end_date),
).fetchall()
conn.close()
if not rows:
return np.zeros((0, 0)), [], []
motion_ids = sorted({int(r[0]) for r in rows})
mp_names = sorted({r[1] for r in rows})
m = len(mp_names)
n = len(motion_ids)
mat = np.zeros((m, n), dtype=float)
mp_index = {name: i for i, name in enumerate(mp_names)}
motion_index = {mid: j for j, mid in enumerate(motion_ids)}
for motion_id, mp_name, vote in rows:
i = mp_index[mp_name]
j = motion_index[int(motion_id)]
val = VOTE_MAP.get(
vote, VOTE_MAP.get(vote.strip() if isinstance(vote, str) else vote, 0.0)
)
try:
mat[i, j] = float(val)
except Exception:
mat[i, j] = 0.0
return mat, mp_names, motion_ids
def _procrustes_align(
reference_anchor: np.ndarray,
current_anchor: np.ndarray,
min_overlap: int = 3,
) -> np.ndarray:
"""Align current_anchor to reference_anchor using orthogonal Procrustes.
This function will only attempt alignment when there is a reasonable number of
overlapping rows (default: min_overlap). If the overlap is too small or if any
input is invalid, the original current_anchor is returned unchanged.
Returns transformed_current_anchor
"""
# basic validation
if reference_anchor is None or current_anchor is None:
return current_anchor
if not isinstance(reference_anchor, np.ndarray) or not isinstance(
current_anchor, np.ndarray
):
return current_anchor
# Determine overlap by number of available rows. If too small, skip alignment.
n_ref = reference_anchor.shape[0]
n_cur = current_anchor.shape[0]
overlap = min(n_ref, n_cur)
if overlap < min_overlap:
_logger.debug(
"Procrustes alignment skipped: overlap %s < min_overlap %s",
overlap,
min_overlap,
)
return current_anchor
# Use only the overlapping rows to compute the orthogonal transform.
ref_sub = reference_anchor[:overlap, :]
cur_sub = current_anchor[:overlap, :]
try:
# orthogonal_procrustes(A, B) returns (R, scale) where R minimizes ||A @ R - B||_F.
# The scale value is the sum of singular values (a norm), not a multiplicative
# factor, so it is deliberately ignored here. To align current_anchor with
# reference_anchor we call orthogonal_procrustes(cur_sub, ref_sub) and apply only R.
R, _scale = orthogonal_procrustes(cur_sub, ref_sub)
transformed = current_anchor.dot(R)
return transformed
except Exception:
_logger.exception("Procrustes alignment failed")
return current_anchor
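# Illustrative usage (variable names are assumptions): align the anchor rows of a new
# window onto the reference window before comparing vectors across windows, e.g.
#   aligned = _procrustes_align(ref_window_anchors, new_window_anchors)
# All rows of the current matrix are rotated by the same R computed on the overlap.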
def run_svd_for_window(
db: MotionDatabase,
window_id: str,
start_date: str,
end_date: str,
k: int = 50,
) -> Dict:
"""Run SVD on votes in given date window and store vectors in DB.
Returns metadata dict with keys: k_used, stored_mp, stored_motion
"""
mat, mp_names, motion_ids = _build_vote_matrix(db, start_date, end_date)
if mat.size == 0 or mat.shape[0] == 0 or mat.shape[1] == 0:
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}
k_used = _safe_k(mat, k)
if k_used <= 0:
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}
# use sparse svds for efficiency
try:
A = csr_matrix(mat)
U, s, Vt = svds(A, k=k_used)
# svds does not guarantee ordering of singular values; sort descending
idx = np.argsort(s)[::-1]
s = s[idx]
U = U[:, idx]
Vt = Vt[idx, :]
# weight by singular values
mp_vecs = (U * s.reshape(1, -1)).tolist() # m x k
motion_vecs = (Vt.T * s.reshape(1, -1)).tolist() # n x k
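# Scaling rows of U (MPs) and rows of Vt.T (motions) by the singular values is
# equivalent to U @ diag(s) and V @ diag(s), so distances in the resulting
# embedding reflect how much variance each component explains.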
stored_mp = 0
stored_motion = 0
for i, mp_name in enumerate(mp_names):
db.store_svd_vector(window_id, "mp", mp_name, mp_vecs[i])
stored_mp += 1
for j, motion_id in enumerate(motion_ids):
db.store_svd_vector(window_id, "motion", str(motion_id), motion_vecs[j])
stored_motion += 1
return {
"k_used": k_used,
"stored_mp": stored_mp,
"stored_motion": stored_motion,
}
except Exception:
_logger.exception("SVD failed for window")
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}

@ -0,0 +1,122 @@
import logging
import json
from typing import Optional, List, Tuple
import duckdb
from database import MotionDatabase, db as default_db
import ai_provider
_logger = logging.getLogger(__name__)
DEFAULT_MODEL = "qwen/qwen3-embedding-4b"
def _select_text(
db: MotionDatabase, model: str, limit: Optional[int] = None
) -> List[Tuple[int, Optional[str]]]:
"""Select motions that do not yet have an embedding for `model`.
Returns list of (motion_id, text).
"""
conn = duckdb.connect(db.db_path)
params = [model]
# prefer layman_explanation > description > title (keep compatibility with existing tests)
sql = (
"SELECT m.id, COALESCE(m.layman_explanation, m.description, m.title) AS text"
" FROM motions m"
" LEFT JOIN embeddings e ON e.motion_id = m.id AND e.model = ?"
" WHERE e.id IS NULL"
)
if limit:
sql += " LIMIT ?"
params.append(limit)
try:
rows = conn.execute(sql, params).fetchall()
conn.close()
results: List[Tuple[int, Optional[str]]] = []
for r in rows:
text_val = r[1]
# treat empty strings as no text
if text_val is None:
text = None
else:
text = str(text_val).strip() or None
results.append((int(r[0]), text))
return results
except Exception as exc:
_logger.error("Error selecting motions for embeddings: %s", exc)
try:
conn.close()
except Exception:
pass
return []
def ensure_text_embeddings(
db_path: Optional[str] = None, model: Optional[str] = None
) -> Tuple[int, int, int, int]:
"""Ensure all motions have text embeddings for `model`.
Returns tuple (stored_count, skipped_existing, skipped_no_text, errors).
"""
model = model or DEFAULT_MODEL
db = MotionDatabase(db_path) if db_path else default_db
# motions to process
to_process = _select_text(db, model)
# how many already exist
conn = duckdb.connect(db.db_path)
try:
total_motions = conn.execute("SELECT COUNT(*) FROM motions").fetchone()[0]
except Exception:
total_motions = 0
try:
existing = conn.execute(
"SELECT COUNT(DISTINCT motion_id) FROM embeddings WHERE model = ?", (model,)
).fetchone()[0]
except Exception:
existing = 0
conn.close()
stored = 0
skipped_no_text = 0
errors = 0
for motion_id, text in to_process:
if not text:
_logger.info("Skipping motion %s: no text available", motion_id)
skipped_no_text += 1
continue
try:
vec = ai_provider.get_embedding(text, model=model)
if not isinstance(vec, list):
_logger.warning(
"Embedding provider returned non-list for motion %s", motion_id
)
errors += 1
continue
res = db.store_embedding(motion_id, model, vec)
if res and res > 0:
stored += 1
else:
_logger.error(
"Failed to store embedding for motion %s (store returned %s)",
motion_id,
res,
)
errors += 1
except Exception as exc:
_logger.error(
"Error computing/storing embedding for motion %s: %s", motion_id, exc
)
errors += 1
skipped_existing = int(existing)
return stored, skipped_existing, skipped_no_text, errors

@ -0,0 +1,18 @@
[project]
name = "stemwijzer"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"duckdb>=1.3.2",
"ibis-framework[duckdb]>=10.8.0",
"openai>=1.99.7",
"scipy>=1.11",
"umap-learn>=0.5",
"plotly>=5.0",
"pytest>=9.0.2",
"requests>=2.32.4",
"schedule>=1.2.2",
"streamlit>=1.48.0",
]

@ -0,0 +1,9 @@
import ibis
con = ibis.duckdb.connect('data/motions.db')
print(con.tables)
for t in con.tables:
print(con.table(t).head().execute().to_string())

@ -0,0 +1,3 @@
# Run this to reset your database
from database import db
db.reset_database()

@ -0,0 +1,264 @@
# scheduler.py (fixed infinite loop issue)
import schedule
import time
import duckdb
from datetime import datetime, timedelta
from api_client import TweedeKamerAPI
from summarizer import summarizer
from database import db
from config import config
class DataUpdateScheduler:
def __init__(self):
self.api_client = TweedeKamerAPI()
def test_api_connection(self) -> bool:
"""Test API connection before proceeding"""
print("Testing API connection...")
if self.api_client.test_api_connection():
print("✅ API connection successful")
return True
else:
print("❌ API connection failed")
return False
def check_database_has_data(self) -> bool:
"""Check if database has any motion data"""
try:
conn = duckdb.connect(config.DATABASE_PATH)
result = conn.execute("SELECT COUNT(*) FROM motions").fetchone()
conn.close()
return result[0] > 0 if result else False
except Exception as e:
print(f"Error checking database: {e}")
return False
def update_motions_data(self, days_back: int = 30, max_records: int = 1000):
"""Fetch new motions from API and update database"""
print(f"Starting motion data update at {datetime.now()}")
if not self.test_api_connection():
return False
try:
# Fetch recent motions from API (respecting API limits)
start_date = datetime.now() - timedelta(days=days_back)
motions = self.api_client.get_motions(
start_date=start_date,
limit=max_records
)
print(f"Fetched {len(motions)} motions from API")
if not motions:
print("No motions received from API")
return False
# Insert new motions into database
successful_inserts = 0
duplicate_count = 0
for motion in motions:
if db.insert_motion(motion):
successful_inserts += 1
else:
duplicate_count += 1
print(f"Successfully inserted {successful_inserts} new motions")
if duplicate_count > 0:
print(f"Skipped {duplicate_count} duplicate motions")
# Generate AI summaries for new motions (only if we have new data)
if successful_inserts > 0:
print("Generating AI summaries for new motions...")
summarizer.update_motion_summaries()
print("Motion data update completed successfully")
return True
except Exception as e:
print(f"Error during motion data update: {e}")
return False
def initial_data_load(self):
"""Perform initial data load with comprehensive data"""
print("Performing initial comprehensive data load...")
if not self.test_api_connection():
return False
try:
# Start from 2 years ago but make sure we don't go into the future
start_date = datetime.now() - timedelta(days=730)
end_date = datetime.now()
print(f"Loading data from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
# Use a single request for recent data first, then expand if needed
chunk_days = 90 # 3-month chunks
current_date = start_date
all_motions = []
chunks_processed = 0
max_chunks = 10 # Safety limit to prevent infinite loops
while current_date < end_date and chunks_processed < max_chunks:
chunk_end_date = min(current_date + timedelta(days=chunk_days), end_date)
print(f"Fetching chunk {chunks_processed + 1}/{max_chunks}: {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}")
try:
# Fetch data for this time chunk
chunk_motions = self.api_client.get_motions(
start_date=current_date,
end_date=chunk_end_date,
limit=250 # Reasonable limit per chunk
)
if chunk_motions:
all_motions.extend(chunk_motions)
print(f"✅ Found {len(chunk_motions)} motions in this chunk (Total: {len(all_motions)})")
else:
print(f" No motions found in chunk {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}")
except Exception as e:
print(f"❌ Error fetching chunk {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}: {e}")
# IMPORTANT: Always increment the date to avoid infinite loop
current_date = chunk_end_date
chunks_processed += 1
# Add delay between chunks
if chunks_processed < max_chunks and current_date < end_date:
time.sleep(2)
print(f"Data collection completed. Total motions fetched: {len(all_motions)}")
if not all_motions:
print("❌ No motions retrieved from API. This might be normal if the API doesn't have recent data.")
print("💡 Try adjusting the date range or check if the API has data for the selected period.")
# Try a broader date range as fallback
print("🔄 Trying broader date range (last 30 days)...")
fallback_start = datetime.now() - timedelta(days=30)
fallback_motions = self.api_client.get_motions(
start_date=fallback_start,
limit=250
)
if fallback_motions:
all_motions = fallback_motions
print(f"✅ Fallback successful: Found {len(fallback_motions)} motions")
else:
print("❌ No data found even with broader date range")
return False
# Insert all motions with progress tracking
successful_inserts = 0
duplicate_count = 0
print(f"Inserting {len(all_motions)} motions into database...")
for i, motion in enumerate(all_motions):
if i % 25 == 0: # Progress indicator every 25 motions
print(f"Processing motion {i+1}/{len(all_motions)} ({((i+1)/len(all_motions)*100):.1f}%)")
if db.insert_motion(motion):
successful_inserts += 1
else:
duplicate_count += 1
print(f"✅ Successfully inserted {successful_inserts} motions")
if duplicate_count > 0:
print(f" Skipped {duplicate_count} duplicate motions")
# Generate summaries if we have data
if successful_inserts > 0:
print("🤖 Generating AI summaries...")
summarizer.update_motion_summaries()
print("🎉 Initial data load completed!")
return successful_inserts > 0
except Exception as e:
print(f"❌ Error during initial data load: {e}")
return False
def weekly_update_job(self):
"""Weekly job to update with new motions"""
print(f"Starting weekly update job at {datetime.now()}")
# Use smaller limits for regular updates
self.update_motions_data(days_back=14, max_records=250)
print("Weekly update job completed")
def run_scheduler(self):
"""Main scheduler function"""
print("=" * 50)
print("Dutch Political Compass Data Scheduler")
print("=" * 50)
# Check if database has data
has_data = self.check_database_has_data()
print(f"Database has existing data: {has_data}")
if not has_data:
print("\n🔄 No data found in database. Running initial data load...")
success = self.initial_data_load()
if success:
print("✅ Initial data load completed successfully!")
else:
print("❌ Initial data load failed or no data available.")
print("💡 You may need to check the API or adjust the date range.")
return
else:
print("✅ Database already contains motion data.")
# Ask if user wants to update anyway
try:
response = input("\nDo you want to fetch recent motions anyway? (y/n): ").lower().strip()
if response in ['y', 'yes']:
print("🔄 Updating with recent motions...")
self.update_motions_data(days_back=7, max_records=250)
except KeyboardInterrupt:
print("\nSkipping manual update.")
# Schedule regular updates
print("\n📅 Scheduling regular updates...")
schedule.every().monday.at("02:00").do(self.weekly_update_job)
schedule.every().thursday.at("14:00").do(lambda: self.update_motions_data(days_back=7, max_records=250))
print("Jobs scheduled:")
print("- Weekly motion update: Every Monday at 02:00")
print("- Mid-week update: Every Thursday at 14:00")
print(f"- API limit per request: {config.API_MAX_LIMIT} records")
print("\n🔄 Scheduler is now running. Press Ctrl+C to stop.")
try:
while True:
schedule.run_pending()
time.sleep(3600) # Check every hour
except KeyboardInterrupt:
print("\n👋 Scheduler stopped by user.")
def run_once():
"""Run data update once and exit"""
scheduler = DataUpdateScheduler()
print("Running one-time data update...")
has_data = scheduler.check_database_has_data()
if not has_data:
print("No existing data found. Running initial data load...")
scheduler.initial_data_load()
else:
print("Updating existing data with recent motions...")
scheduler.update_motions_data(days_back=14, max_records=250)
print("One-time update completed!")
if __name__ == "__main__":
import sys
if len(sys.argv) > 1 and sys.argv[1] == "--once":
run_once()
else:
scheduler = DataUpdateScheduler()
scheduler.run_scheduler()

@ -0,0 +1,183 @@
# scraper.py
import requests
from bs4 import BeautifulSoup
import time
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from database import db
from config import config
class MotionScraper:
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
def scrape_motion_list(self, start_date: datetime = None, end_date: datetime = None) -> List[str]:
"""Scrape motion URLs from the main page"""
if not start_date:
start_date = datetime.now() - timedelta(days=730) # 2 years ago
if not end_date:
end_date = datetime.now()
motion_urls = []
page = 1
while True:
try:
url = f"{config.BASE_URL}?page={page}"
response = self.session.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Find motion links (adjust selectors based on actual HTML structure)
motion_links = soup.find_all('a', href=re.compile(r'/stemmingsuitslagen/'))
if not motion_links:
break
for link in motion_links:
href = link.get('href')
if href and href not in motion_urls:
motion_urls.append(href)
page += 1
time.sleep(config.SCRAPING_DELAY)
except Exception as e:
print(f"Error scraping page {page}: {e}")
break
return motion_urls
def parse_motion_detail(self, motion_url: str) -> Optional[Dict]:
"""Parse individual motion details"""
try:
full_url = f"https://www.tweedekamer.nl{motion_url}" if motion_url.startswith('/') else motion_url
response = self.session.get(full_url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Extract motion data (adjust selectors based on actual HTML structure)
title = self._extract_title(soup)
description = self._extract_description(soup)
date = self._extract_date(soup)
policy_area = self._extract_policy_area(soup)
voting_results = self._extract_voting_results(soup)
if not all([title, voting_results]):
return None
# Calculate winning margin
total_votes = sum(1 for vote in voting_results.values() if vote in ['voor', 'tegen'])
if total_votes == 0:
return None
votes_for = sum(1 for vote in voting_results.values() if vote == 'voor')
winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes
return {
'title': title,
'description': description or '',
'date': date,
'policy_area': policy_area or 'Onbekend',
'voting_results': voting_results,
'winning_margin': winning_margin,
'url': full_url
}
except Exception as e:
print(f"Error parsing motion {motion_url}: {e}")
return None
def _extract_title(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion title"""
# Look for common title selectors
selectors = ['h1', '.motion-title', '.title', 'h2']
for selector in selectors:
element = soup.select_one(selector)
if element:
return element.get_text(strip=True)
return None
def _extract_description(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion description"""
# Look for description elements
selectors = ['.motion-description', '.description', '.content', 'p']
for selector in selectors:
elements = soup.select(selector)
if elements:
return ' '.join(el.get_text(strip=True) for el in elements[:3])
return None
def _extract_date(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion date"""
# Look for date patterns
date_pattern = re.compile(r'\d{1,2}-\d{1,2}-\d{4}|\d{4}-\d{1,2}-\d{1,2}')
text = soup.get_text()
match = date_pattern.search(text)
if match:
return match.group()
return datetime.now().strftime('%Y-%m-%d')
def _extract_policy_area(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract policy area/category"""
# Look for category indicators
text = soup.get_text().lower()
for area in config.POLICY_AREAS[1:]: # Skip "Alle"
if area.lower() in text:
return area
return "Algemeen"
def _extract_voting_results(self, soup: BeautifulSoup) -> Dict[str, str]:
"""Extract party voting results"""
# This is a simplified extraction - you'll need to adjust based on actual HTML
voting_results = {}
# Look for voting tables or lists
tables = soup.find_all('table')
for table in tables:
rows = table.find_all('tr')
for row in rows:
cells = row.find_all(['td', 'th'])
if len(cells) >= 2:
party = cells[0].get_text(strip=True)
vote = cells[1].get_text(strip=True).lower()
if vote in ['voor', 'tegen', 'afwezig']:
voting_results[party] = vote
# Fallback: simulate some voting data for testing
if not voting_results:
parties = ['VVD', 'PVV', 'CDA', 'D66', 'GL', 'SP', 'PvdA', 'CU', 'PvdD', 'FVD', '50PLUS', 'SGP']
import random
for party in parties:
voting_results[party] = random.choice(['voor', 'tegen', 'afwezig'])
return voting_results
def run_scraping_job(self):
"""Main scraping job"""
print("Starting motion scraping...")
motion_urls = self.scrape_motion_list()
print(f"Found {len(motion_urls)} motion URLs")
successful_scrapes = 0
for i, url in enumerate(motion_urls):
print(f"Processing motion {i+1}/{len(motion_urls)}: {url}")
motion_data = self.parse_motion_detail(url)
if motion_data:
if db.insert_motion(motion_data):
successful_scrapes += 1
time.sleep(config.SCRAPING_DELAY)
print(f"Scraping completed. Successfully scraped {successful_scrapes} motions.")
scraper = MotionScraper()

@ -0,0 +1,128 @@
"""Compute summaries and embeddings for a small test batch of motions.
Usage:
# dry-run (no network calls)
python scripts/compute_test_batch.py --limit 20 --dry-run
# run (will call AI provider; requires OPENROUTER_API_KEY)
python scripts/compute_test_batch.py --limit 20
This script is intentionally simple and intended for manual invocation.
It will update motions.layman_explanation and store embeddings via db.store_embedding if available.
"""
from __future__ import annotations
import argparse
import logging
import sys
from typing import List
import duckdb
from config import config
import ai_provider
from database import db
from summarizer import MotionSummarizer
logger = logging.getLogger("compute_test_batch")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
def fetch_motion_candidates(limit: int) -> List[dict]:
conn = duckdb.connect(config.DATABASE_PATH)
try:
# Prefer motions that still lack a layman_explanation so we don't re-process recent ones
rows = conn.execute(
"SELECT id, title, description FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' ORDER BY created_at DESC LIMIT ?",
(limit,),
).fetchall()
return [{"id": r[0], "title": r[1], "description": r[2] or ""} for r in rows]
finally:
conn.close()
def process_batch(limit: int = 20, dry_run: bool = False):
summarizer = MotionSummarizer()
motions = fetch_motion_candidates(limit)
logger.info("Found %d motions to process", len(motions))
conn = duckdb.connect(config.DATABASE_PATH)
try:
for i, m in enumerate(motions, start=1):
mid = m["id"]
title = m["title"]
desc = m["description"]
logger.info(
"[%d/%d] Processing motion id=%s title=%s", i, len(motions), mid, title
)
if dry_run:
logger.info(
"Dry run: would generate summary and embedding for motion %s", mid
)
continue
# Generate summary
summary = summarizer.generate_layman_explanation(title, desc)
# Update DB
try:
conn.execute(
"UPDATE motions SET layman_explanation = ? WHERE id = ?",
(summary, mid),
)
except Exception as e:
logger.exception("Failed to update motion %s: %s", mid, e)
# Compute embedding and store
try:
emb = ai_provider.get_embedding(summary)
store_fn = getattr(db, "store_embedding", None)
if callable(store_fn):
store_fn(mid, "text-embedding-3-small", emb)
logger.info("Stored embedding for motion %s", mid)
else:
logger.warning(
"No store_embedding available on db; skipping storage"
)
except ai_provider.ProviderError as e:
logger.exception(
"Failed to compute/store embedding for motion %s: %s", mid, e
)
finally:
conn.close()
def main(argv=None):
p = argparse.ArgumentParser()
p.add_argument("--limit", type=int, default=20, help="Number of motions to process")
p.add_argument(
"--dry-run",
action="store_true",
help="Do not call external APIs; just show what would run",
)
args = p.parse_args(argv)
if args.dry_run:
logger.info("Running in dry-run mode; no network calls will be made")
# Safety: confirm when not dry-run
if not args.dry_run:
confirm = (
input(
f"This will call the AI provider for {args.limit} motions and may incur cost. Continue? (y/N): "
)
.strip()
.lower()
)
if confirm not in ("y", "yes"):
logger.info("Aborting per user choice")
sys.exit(0)
process_batch(limit=args.limit, dry_run=args.dry_run)
if __name__ == "__main__":
main()

@ -0,0 +1,35 @@
"""Motion-related simple types and JSON helpers.
Decision: MotionId is an alias for str for simplicity.
"""
from dataclasses import dataclass, asdict
from typing import List
import json
MotionId = str
Embedding = List[float]
@dataclass
class SimilarityNeighbor:
motion_id: MotionId
score: float
def to_json(neighbors: List[SimilarityNeighbor]) -> str:
"""Serialize a list of SimilarityNeighbor to a JSON string.
The format is a JSON list of objects with keys 'motion_id' and 'score'.
"""
list_of_dicts = [asdict(n) for n in neighbors]
return json.dumps(list_of_dicts)
def from_json(json_str: str) -> List[SimilarityNeighbor]:
"""Deserialize a JSON string (list of dicts) into SimilarityNeighbor list."""
parsed = json.loads(json_str)
return [
SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"]))
for item in parsed
]
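# Round-trip example:
#   neighbors = [SimilarityNeighbor(motion_id="m-1", score=0.92)]
#   payload = to_json(neighbors)   # '[{"motion_id": "m-1", "score": 0.92}]'
#   assert from_json(payload) == neighbors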

@ -0,0 +1,101 @@
# summarizer.py (refactored to use ai_provider)
from typing import Optional
import logging
import duckdb
from config import config
import ai_provider
from database import db
logger = logging.getLogger(__name__)
class MotionSummarizer:
def __init__(self):
# Stateless; use ai_provider functions directly
pass
def _build_prompt_messages(self, title: str, body_text: str) -> list[dict]:
prompt = f"""
Leg deze Nederlandse parlementaire motie uit in eenvoudige, toegankelijke taal:
Titel: {title}
Tekst: {body_text}
Geef een uitleg van 2-3 zinnen die:
- Gebruik maakt van alledaagse taal
- De praktische impact op burgers uitlegt
- Politiek jargon vermijdt
- Neutraal en feitelijk blijft
Antwoord alleen met de uitleg, geen introductie of extra tekst.
"""
return [
{
"role": "system",
"content": "Je bent een expert in het uitleggen van politieke onderwerpen in eenvoudige taal voor Nederlandse burgers.",
},
{"role": "user", "content": prompt},
]
def generate_layman_explanation(self, title: str, body_text: str) -> str:
"""Generate a layman-friendly explanation via ai_provider.
Returns an empty string on failure (non-fatal).
"""
messages = self._build_prompt_messages(title, body_text or "")
try:
return ai_provider.chat_completion(messages, model=config.QWEN_MODEL)
except ai_provider.ProviderError:
logger.exception("AI provider failed to generate summary")
return ""
def update_motion_summaries(
self,
compute_embeddings: bool = True,
embedding_model: str = "qwen/qwen3-embedding-4b",
):
"""Find motions missing layman_explanation and generate summaries.
Uses body_text when available, falls back to description, then title only.
If compute_embeddings is True and database provides store_embedding, compute and store embeddings.
"""
conn = duckdb.connect(config.DATABASE_PATH)
try:
rows = conn.execute(
"SELECT id, title, description, body_text FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' LIMIT 50"
).fetchall()
for motion_id, title, description, body_text in rows:
input_text = body_text or description or ""
summary = self.generate_layman_explanation(title, input_text)
if summary is None:
summary = ""
conn.execute(
"UPDATE motions SET layman_explanation = ? WHERE id = ?",
(summary, motion_id),
)
logger.info("Updated summary for motion %s", motion_id)
if compute_embeddings and summary:
logger.info(
"Computing embedding for motion %s using model %s",
motion_id,
embedding_model,
)
# compute embedding and try to store via database helper if available
try:
emb = ai_provider.get_embedding(summary, model=embedding_model)
store_fn = getattr(db, "store_embedding", None)
if callable(store_fn):
store_fn(motion_id, embedding_model, emb)
except ai_provider.ProviderError:
logger.exception(
"Failed to compute/store embedding for motion %s", motion_id
)
finally:
conn.close()
summarizer = MotionSummarizer()

@ -0,0 +1,16 @@
# test_single_insert.py
from database import db
test_motion = {
'title': 'Test Motion',
'description': 'This is a test motion',
'date': '2024-01-01',
'policy_area': 'Test',
'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
'winning_margin': 0.5,
'url': 'https://test.com/motion1'
}
success = db.insert_motion(test_motion)
print(f"Insert successful: {success}")

@ -0,0 +1 @@
"""Make the tests directory a package so test helpers can be imported."""

@ -0,0 +1,63 @@
import tempfile
import pytest
# Load test fixtures from the utils package so pytest can discover them.
pytest_plugins = ["tests.utils.migration_fixtures"]
@pytest.fixture
def tmp_duckdb_path(tmp_path):
p = tmp_path / "test.db"
return str(p)
@pytest.fixture
def tmp_duckdb_conn(tmp_duckdb_path):
# Import duckdb lazily so running pytest doesn't fail on machines
# where duckdb is not installed (CI / contributor machines that don't
# need the duckdb-based fixtures). If duckdb is missing, skip this
# fixture at runtime when it's requested.
try:
import duckdb
except Exception:
pytest.skip("duckdb not installed, skipping duckdb fixtures")
conn = duckdb.connect(database=tmp_duckdb_path)
yield conn
try:
conn.close()
except Exception:
pass
@pytest.fixture
def monkeypatch_ai_provider(monkeypatch):
"""Patch ai_provider.get_embedding to return deterministic 16-dim vector."""
import ai_provider
fake = [0.01] * 16
monkeypatch.setattr(ai_provider, "get_embedding", lambda text, model=None: fake)
return fake
@pytest.fixture
def mock_odata_client(monkeypatch):
"""
Patch requests.Session.get for OData calls.
Returns a configurable mock response; override its json.return_value (or replace MockSession.response) in a test to change the canned payload.
"""
import requests
from unittest.mock import MagicMock
mock_response = MagicMock()
mock_response.raise_for_status.return_value = None
mock_response.json.return_value = {"value": []}
class MockSession:
response = mock_response
def get(self, *args, **kwargs):
return self.response
monkeypatch.setattr(requests, "Session", MockSession)
return mock_response
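# Usage sketch: a test can override the canned OData payload before exercising code
# that calls requests.Session().get(...), e.g.
#   def test_something(mock_odata_client):
#       mock_odata_client.json.return_value = {"value": [{"Persoon": {...}}]}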

@ -0,0 +1 @@
"""Fixtures package for tests."""

@ -0,0 +1,40 @@
[
{
"motion_id": 1,
"date": "2024-01-15",
"voting_results": {
"VVD": "voor",
"PvdA": "tegen",
"CDA": "voor",
"D66": "voor",
"Wilders, G.": "voor",
"Yesilgöz-Zegerius, D.": "voor",
"Jetten, R.A.A.": "voor"
}
},
{
"motion_id": 2,
"date": "2024-02-10",
"voting_results": {
"VVD": "tegen",
"PvdA": "voor",
"CDA": "afwezig",
"D66": "voor",
"Wilders, G.": "tegen",
"Yesilgöz-Zegerius, D.": "tegen",
"Ploumen, L.J.": "voor"
}
},
{
"motion_id": 3,
"date": "2024-03-05",
"voting_results": {
"VVD": "voor",
"SP": "tegen",
"GroenLinks": "voor",
"PVV": "voor",
"Van der Plas, C.": "voor",
"Klever, N.C.": "voor"
}
}
]

@ -0,0 +1,87 @@
import json
import os
import numpy as np
import pytest
# duckdb is an optional dependency in some environments; skip test if not available
duckdb = pytest.importorskip("duckdb")
def test_pipeline_end_to_end(tmp_path, monkeypatch):
# ensure determinism for any random embedding generation
np.random.seed(0)
# prepare temp db
db_path = str(tmp_path / "motions.db")
# create the minimal MotionDatabase schema using existing code where possible
from database import MotionDatabase
db = MotionDatabase(db_path)
# create embeddings table (migration would normally do this)
conn = duckdb.connect(db.db_path)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
)
# insert three motions
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t1", "d1", "u1", "ex1"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t2", "d2", "u2", "ex2"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t3", "d3", "u3", "ex3"),
)
# fetch ids
rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
ids = [r[0] for r in rows]
# insert existing embedding for first motion
vec = json.dumps([0.1] * 16)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
(ids[0], "test-model", vec),
)
conn.close()
# monkeypatch ai_provider.get_embedding to deterministic vector
import ai_provider
def fake_get_embedding(text, model=None):
# produce a deterministic vector based on seeded numpy
return list(np.random.rand(16))
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings
stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
db_path=db_path, model="test-model"
)
assert stored == 2
assert skipped_existing == 1
assert skipped_no_text == 0
assert errors == 0
# verify stored vectors length
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
("test-model",),
).fetchall()
conn.close()
assert len(rows) == 3
for r in rows:
v = json.loads(r[0])
assert len(v) == 16

@ -0,0 +1,58 @@
import os
import pathlib
import sqlite3
import re
import pytest
def test_migration_file_exists_and_name():
migrations_dir = pathlib.Path("migrations")
expected_name = "2026-03-22-add-audit-events.sql"
migration_path = migrations_dir / expected_name
# File must exist
assert migration_path.exists(), f"Migration file {migration_path} does not exist"
# Name sanity check
assert migration_path.name == expected_name
def _strip_sql_comments(sql_text: str) -> str:
# Remove SQL single-line comments -- ... and C-style /* ... */
# Use multiline-aware single-line removal for safety.
no_single = re.sub(r"--.*?$", "", sql_text, flags=re.MULTILINE)
no_block = re.sub(r"/\*.*?\*/", "", no_single, flags=re.DOTALL)
return no_block.strip()
def test_optional_apply_sql_if_db_available():
"""
If TEST_DB_URL is provided, attempt to apply the SQL.
For safety this test will skip applying when the SQL is empty or commented out.
Only sqlite URLs (sqlite:///path/to/db) are attempted here to avoid adding
extra dependencies; other URL schemes will cause the test to be skipped.
"""
db_url = os.environ.get("TEST_DB_URL")
if not db_url:
pytest.skip("TEST_DB_URL not set - skipping DB application")
migration_path = pathlib.Path("migrations") / "2026-03-22-add-audit-events.sql"
sql = migration_path.read_text(encoding="utf8")
stripped = _strip_sql_comments(sql)
if not stripped:
pytest.skip("Migration SQL is empty or commented out - skipping application")
# Only handle sqlite URLs here
if db_url.startswith("sqlite:///"):
db_path = db_url.replace("sqlite:///", "", 1)
try:
conn = sqlite3.connect(db_path)
try:
conn.executescript(sql)
finally:
conn.close()
except Exception as e:
pytest.skip(f"Could not apply SQL to sqlite DB: {e}")
else:
pytest.skip(f"TEST_DB_URL set but scheme not supported by this test: {db_url}")

@ -0,0 +1,85 @@
import os
import re
import pathlib
import pytest
# small migration filename/header tests; keep imports minimal
MIGRATION_FILENAME = "2026-03-22-add-similarity-cache.sql"
MIGRATION_PATH = pathlib.Path("migrations") / MIGRATION_FILENAME
def _strip_sql_comments(sql: str) -> str:
"""Remove SQL single-line (-- ...) and C-style (/* ... */) comments.
This is a best-effort stripper sufficient for the test's purpose.
"""
# remove block comments
sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.S)
# remove line comments
sql = re.sub(r"--.*?$", "", sql, flags=re.M)
return sql.strip()
def test_migration_file_exists_and_header():
# file must exist
assert MIGRATION_PATH.exists(), f"Migration file {MIGRATION_PATH} not found"
text = MIGRATION_PATH.read_text(encoding="utf8")
# header should reference the filename and purpose
assert MIGRATION_FILENAME in text.splitlines()[0], (
"First line should include the filename"
)
assert "similarity" in text.lower(), "Header should mention similarity"
def test_optional_apply_migration_safe():
# If TEST_DB_URL is set, try to apply the SQL only if it contains non-comment statements.
db_url = os.environ.get("TEST_DB_URL")
sql = MIGRATION_PATH.read_text(encoding="utf8")
stripped = _strip_sql_comments(sql)
# If there is no DB url, consider this a filename/header validation test only.
if not db_url:
pytest.skip("TEST_DB_URL not set; skipping DB apply step")
# If the SQL is empty (only comments), nothing to apply — test passes.
if not stripped:
pytest.skip("Migration contains no executable SQL; nothing to apply")
# Otherwise attempt to execute the SQL. Be conservative: if drivers are missing or
# connection fails, skip the test rather than failing CI. Only unexpected errors
# during execution should fail the test.
try:
if db_url.startswith("sqlite:"):
import sqlite3
# sqlite URL might be sqlite:///path or sqlite:///:memory:
path = db_url.split("sqlite:", 1)[1]
# normalize prefixes like ///
path = path.lstrip("/") or ":memory:"
conn = sqlite3.connect(path)
try:
conn.executescript(sql)
finally:
conn.close()
elif db_url.startswith("postgresql:") or db_url.startswith("postgres:"):
try:
import psycopg2
except Exception as e: # pragma: no cover - driver may be absent in CI
pytest.skip(f"psycopg2 not available: {e}")
# psycopg2 accepts a DSN; rely on that here.
conn = psycopg2.connect(db_url)
try:
cur = conn.cursor()
cur.execute(sql)
conn.commit()
finally:
conn.close()
else:
pytest.skip(f"DB URL scheme not supported by this test: {db_url}")
except Exception as exc:
# Unexpected error while applying SQL should fail the test.
raise

@ -0,0 +1,29 @@
"""Smoke test for the migration test_db fixture.
This test imports the `test_db` fixture and asserts expected behavior in two
cases:
- If the environment variable TEST_DB_URL is not set, the fixture should yield
None.
- If TEST_DB_URL is set, the fixture should yield a connection-like object
(we check for an object with a `cursor` attribute or the sqlite3 connection
type).
"""
import os
import types
import pytest
def test_migration_fixture_smoke(test_db):
"""Smoke test ensuring the test_db fixture yields expected values."""
url = os.environ.get("TEST_DB_URL")
if not url:
assert test_db is None
else:
# For sqlite we expect a sqlite3.Connection which has a 'cursor'
# method. Be permissive and accept any object with a 'cursor'
# attribute or callable.
assert test_db is not None
assert hasattr(test_db, "cursor") or hasattr(test_db, "execute")

@ -0,0 +1,49 @@
import os
import types
import pytest
import ai_provider
class DummyResponse:
def __init__(self, status_code=200, json_data=None):
self.status_code = status_code
self._json = json_data or {}
def json(self):
return self._json
def test_get_embedding_success(monkeypatch):
fake = DummyResponse(json_data={"data": [{"embedding": [0.1, 0.2, 0.3]}]})
def fake_post(url, json, headers, timeout):
return fake
monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
monkeypatch.setattr("requests.post", fake_post)
emb = ai_provider.get_embedding("hello world")
assert emb == [0.1, 0.2, 0.3]
def test_chat_completion_success(monkeypatch):
fake = DummyResponse(json_data={"choices": [{"message": {"content": "summary"}}]})
def fake_post(url, json, headers, timeout):
return fake
monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
monkeypatch.setattr("requests.post", fake_post)
out = ai_provider.chat_completion([{"role": "user", "content": "hi"}])
assert out == "summary"
def test_missing_api_key_raises(monkeypatch):
# Ensure env var is not set
monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
with pytest.raises(ai_provider.ProviderError):
ai_provider.get_embedding("x")

@ -0,0 +1,74 @@
import json
import duckdb
import logging
from pipeline.extract_mp_votes import extract_mp_votes
from database import MotionDatabase
def test_extract_mp_votes(tmp_path):
db_file = tmp_path / "test.db"
# Initialize database
mdb = MotionDatabase(db_path=str(db_file))
# Load fixture
fixture_path = "tests/fixtures/sample_voting_results.json"
with open(fixture_path, "r") as fh:
fixtures = json.load(fh)
# Insert motions into motions table
conn = duckdb.connect(str(db_file))
try:
for item in fixtures:
motion_id = item.get("motion_id")
date = item.get("date")
voting_results = item.get("voting_results")
conn.execute(
"""
INSERT INTO motions (id, title, description, date, policy_area, voting_results, winning_margin, url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""",
(
motion_id,
f"Test Motion {motion_id}",
"",
date,
"Test",
json.dumps(voting_results),
0.5,
f"http://example/{motion_id}",
),
)
finally:
conn.close()
# Run extraction
res = extract_mp_votes(db_path=str(db_file))
# Expected MP rows: count keys that contain a comma in fixtures
expected_mp_count = 0
for item in fixtures:
for k in item.get("voting_results", {}).keys():
if "," in k:
expected_mp_count += 1
assert res["mp_rows_inserted"] == expected_mp_count
assert res["motions_skipped"] == 0
# Verify mp_votes table contains only rows with comma in mp_name and count matches
conn = duckdb.connect(str(db_file))
try:
rows = conn.execute("SELECT mp_name FROM mp_votes").fetchall()
finally:
conn.close()
assert len(rows) == expected_mp_count
for (mp_name,) in rows:
assert "," in mp_name
# Running again should be idempotent: no new mp rows, motions_skipped > 0
res2 = extract_mp_votes(db_path=str(db_file))
assert res2["mp_rows_inserted"] == 0
assert res2["motions_skipped"] > 0

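The test fixes the observable contract of `extract_mp_votes`: only `voting_results` keys containing a comma count as individual MPs (party totals do not), rows land in `mp_votes`, the return value reports `mp_rows_inserted` and `motions_skipped`, and a second run skips motions that were already processed. A rough sketch under those assumptions; the exact column mapping and vote-value handling are guesses rather than the project's actual code:

```python
import json
import duckdb


def extract_mp_votes(db_path: str) -> dict:
    """Explode motions.voting_results into per-MP rows in mp_votes (idempotent)."""
    conn = duckdb.connect(db_path)
    inserted, skipped = 0, 0
    try:
        motions = conn.execute(
            "SELECT id, date, voting_results FROM motions WHERE voting_results IS NOT NULL"
        ).fetchall()
        for motion_id, date, raw in motions:
            already = conn.execute(
                "SELECT count(*) FROM mp_votes WHERE motion_id = ?", (motion_id,)
            ).fetchone()[0]
            if already:
                skipped += 1
                continue
            for name, vote in json.loads(raw).items():
                # Keys with a comma are individual MPs ("Lastname, Initials");
                # keys without one are party aggregates and are ignored here.
                if "," not in name:
                    continue
                conn.execute(
                    "INSERT INTO mp_votes (motion_id, mp_name, vote, date) VALUES (?, ?, ?, ?)",
                    (motion_id, name, vote, date),
                )
                inserted += 1
    finally:
        conn.close()
    return {"mp_rows_inserted": inserted, "motions_skipped": skipped}
```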
@ -0,0 +1,103 @@
import requests
import pytest
try:
import duckdb
except Exception:
pytest.skip(
"duckdb not installed, skipping fetch_mp_metadata tests",
allow_module_level=True,
)
from pipeline.fetch_mp_metadata import fetch_mp_metadata, normalize_mp_name
class MockResponse:
def __init__(self, data, status_code=200):
self._data = data
self.status_code = status_code
def raise_for_status(self):
if not (200 <= self.status_code < 300):
raise requests.HTTPError(f"status {self.status_code}")
def json(self):
return self._data
class MockSession:
def __init__(self, response):
self._response = response
def get(self, url):
return self._response
def test_fetch_mp_metadata_idempotent(tmp_path, monkeypatch):
# Prepare canned OData response with two FractieZetelPersoon records
data = {
"value": [
{
"Persoon": {
"Achternaam": "Yesilgöz-Zegerius",
"Initialen": "D.",
"Tussenvoegsel": None,
"Id": "guid-1",
},
"FractieZetel": {"Fractie": {"NaamNL": "VVD"}},
"Van": "2023-01-01",
"TotEnMet": None,
},
{
"Persoon": {
"Achternaam": "Plas",
"Initialen": "C.",
"Tussenvoegsel": "van der",
"Id": "guid-2",
},
"FractieZetel": {"Fractie": {"NaamNL": "BBB"}},
"Van": "2023-06-01",
"TotEnMet": "2024-01-01",
},
]
}
mock_resp = MockResponse(data)
mock_session = MockSession(mock_resp)
# Patch requests.Session to return our mock session
monkeypatch.setattr(requests, "Session", lambda: mock_session)
db_path = str(tmp_path / "test.db")
# First run
count = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
assert count == 2
# Verify DB contents
conn = duckdb.connect(db_path)
rows = conn.execute(
"SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata ORDER BY mp_name"
).fetchall()
conn.close()
assert len(rows) == 2
# Check normalized names
assert rows[0][0] == normalize_mp_name("Plas", "C.", "van der")
assert rows[0][1] == "BBB"
assert str(rows[0][2]) == "2023-06-01"
assert str(rows[0][3]) == "2024-01-01"
assert rows[0][4] == "guid-2"
assert rows[1][0] == normalize_mp_name("Yesilgöz-Zegerius", "D.", None)
assert rows[1][1] == "VVD"
assert str(rows[1][2]) == "2023-01-01"
assert rows[1][3] is None
assert rows[1][4] == "guid-1"
# Run again to assert idempotence (no exception and same count processed)
count2 = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
assert count2 == 2

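Each record in the canned OData payload flattens to one `mp_metadata` row: a normalised name, the party from `FractieZetel.Fractie.NaamNL`, the `Van` / `TotEnMet` period, and the Persoon GUID. The test never pins down the exact output of `normalize_mp_name` (it compares against the function itself), so the name format below is only one plausible convention; the record-parsing shape is grounded in the fixture:

```python
def normalize_mp_name(achternaam, initialen, tussenvoegsel=None):
    # One plausible normalisation: surname first, then initials, then an
    # optional tussenvoegsel. The real pipeline may use a different format.
    name = f"{achternaam}, {initialen}"
    if tussenvoegsel:
        name = f"{name} {tussenvoegsel}"
    return name


def parse_fractiezetel_record(record: dict) -> tuple:
    """Flatten one OData FractieZetelPersoon record into an mp_metadata row."""
    persoon = record["Persoon"]
    return (
        normalize_mp_name(
            persoon["Achternaam"], persoon["Initialen"], persoon.get("Tussenvoegsel")
        ),
        record["FractieZetel"]["Fractie"]["NaamNL"],  # party
        record.get("Van"),                            # start of the seat period
        record.get("TotEnMet"),                       # end of the seat period, may be None
        persoon["Id"],                                # persoon_id (GUID)
    )
```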
@ -0,0 +1,79 @@
import json
import duckdb
from database import MotionDatabase
def test_fuse_for_window(tmp_path):
db_path = str(tmp_path / "motions.db")
# Create MotionDatabase (this will initialize schema except embeddings)
db = MotionDatabase(db_path=db_path)
# Create embeddings table (migration not run by MotionDatabase)
conn = duckdb.connect(db_path)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"""
CREATE TABLE IF NOT EXISTS embeddings (
id INTEGER DEFAULT nextval('embeddings_id_seq'),
motion_id INTEGER NOT NULL,
model TEXT NOT NULL,
vector JSON NOT NULL,
created_at TIMESTAMP DEFAULT current_timestamp,
PRIMARY KEY (id)
)
"""
)
conn.close()
# Insert 3 synthetic SVD vectors (k=4)
svd1 = [0.1, 0.2, 0.3, 0.4]
svd2 = [0.2, 0.1, 0.0, -0.1]
svd3 = [0.9, 0.8, 0.7, 0.6]
db.store_svd_vector("2024-Q1", "motion", "1", svd1)
db.store_svd_vector("2024-Q1", "motion", "2", svd2)
db.store_svd_vector("2024-Q1", "motion", "3", svd3)
# Insert text embeddings for motions 1 and 2 (16 dims)
text1 = [float(i) / 100.0 for i in range(16)]
text2 = [float(i) / 50.0 for i in range(16)]
conn = duckdb.connect(db_path)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
(1, "text-model-1", json.dumps(text1)),
)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
(2, "text-model-1", json.dumps(text2)),
)
conn.close()
# Import fuse function here to ensure module available
from pipeline.fusion import fuse_for_window
result = fuse_for_window("2024-Q1", db_path=db_path)
assert result["inserted"] == 2
assert result["skipped_missing_text"] == 1
# Verify fused embeddings stored
conn = duckdb.connect(db_path)
rows = conn.execute(
"SELECT motion_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE window_id = ?",
("2024-Q1",),
).fetchall()
conn.close()
# Expect two rows for motions 1 and 2
assert len(rows) == 2
for motion_id, vector_json, svd_dims, text_dims in rows:
vec = json.loads(vector_json)
assert svd_dims == 4
assert text_dims == 16
assert len(vec) == 20

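The assertions above imply that fusion is a plain concatenation: for each motion-level SVD vector in the window, look up the motion's text embedding, append the two lists (here 4 + 16 = 20 dimensions), and record both lengths; motions without a text embedding are counted as skipped. A minimal sketch of that join-and-concatenate step; any weighting or normalisation the real `fuse_for_window` applies is not captured here:

```python
import json
import duckdb


def fuse_for_window(window_id: str, db_path: str) -> dict:
    """Concatenate per-motion SVD vectors with their text embeddings for one window."""
    conn = duckdb.connect(db_path)
    inserted, skipped_missing_text = 0, 0
    try:
        svd_rows = conn.execute(
            "SELECT entity_id, vector FROM svd_vectors "
            "WHERE window_id = ? AND entity_type = 'motion'",
            (window_id,),
        ).fetchall()
        for entity_id, svd_json in svd_rows:
            text_row = conn.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ? ORDER BY created_at DESC LIMIT 1",
                (int(entity_id),),
            ).fetchone()
            if text_row is None:
                skipped_missing_text += 1
                continue
            svd_vec = json.loads(svd_json)
            text_vec = json.loads(text_row[0])
            fused = svd_vec + text_vec  # simple concatenation: svd_dims + text_dims
            conn.execute(
                "INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) "
                "VALUES (?, ?, ?, ?, ?)",
                (int(entity_id), window_id, json.dumps(fused), len(svd_vec), len(text_vec)),
            )
            inserted += 1
    finally:
        conn.close()
    return {"inserted": inserted, "skipped_missing_text": skipped_missing_text}
```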
@ -0,0 +1,31 @@
import pytest
def test_embeddings_migration_creates_table(tmp_path):
try:
import duckdb
except ImportError:
pytest.skip("duckdb is not installed")
db_file = str(tmp_path / "migrations_test.db")
conn = duckdb.connect(database=db_file)
try:
with open("migrations/2026-03-19-add-embeddings.sql", "r") as fh:
sql = fh.read()
conn.execute(sql)
# Use sequence to set id if present, otherwise provide explicit id
try:
next_id = conn.execute("SELECT nextval('embeddings_id_seq')").fetchone()[0]
except Exception:
next_id = 1
conn.execute(
"INSERT INTO embeddings (id, motion_id, model, vector) VALUES (?, ?, ?, ?)",
(next_id, 1, "m1", "[0.1, 0.2]"),
)
res = conn.execute(
"SELECT motion_id, model FROM embeddings WHERE motion_id = 1"
).fetchall()
assert len(res) == 1
assert res[0][1] == "m1"
finally:
conn.close()

@ -0,0 +1,219 @@
from pathlib import Path
try:
import duckdb
DB_BACKEND = "duckdb"
except Exception:
import sqlite3
DB_BACKEND = "sqlite3"
MIGRATIONS = [
(
"migrations/2026_03_21__create_mp_votes.sql",
"mp_votes",
[
"id",
"motion_id",
"mp_name",
"party",
"vote",
"date",
"created_at",
],
),
(
"migrations/2026_03_21__create_mp_metadata.sql",
"mp_metadata",
[
"mp_name",
"party",
"van",
"tot_en_met",
"persoon_id",
],
),
(
"migrations/2026_03_21__create_svd_vectors.sql",
"svd_vectors",
[
"id",
"window_id",
"entity_type",
"entity_id",
"vector",
"model",
"created_at",
],
),
(
"migrations/2026_03_21__create_fused_embeddings.sql",
"fused_embeddings",
[
"id",
"motion_id",
"window_id",
"vector",
"svd_dims",
"text_dims",
"created_at",
],
),
]
def test_run_migrations_and_tables(tmp_path):
db_path = tmp_path / "test.db"
if DB_BACKEND == "duckdb":
conn = duckdb.connect(str(db_path))
else:
conn = sqlite3.connect(str(db_path))
for sql_path, table_name, expected_cols in MIGRATIONS:
p = Path(sql_path)
assert p.exists(), f"Migration file {sql_path} must exist"
sql = p.read_text()
# If using sqlite3, transform SQL to be sqlite compatible
if DB_BACKEND == "sqlite3":
# remove CREATE SEQUENCE lines
lines = [
l
for l in sql.splitlines()
if not l.strip().upper().startswith("CREATE SEQUENCE")
]
sql2 = "\n".join(lines)
# remove DEFAULT nextval(...) occurrences
import re
sql2 = re.sub(
r"DEFAULT\s+nextval\('[^']+'\)", "", sql2, flags=re.IGNORECASE
)
# replace JSON type with TEXT
sql2 = re.sub(r"\bJSON\b", "TEXT", sql2, flags=re.IGNORECASE)
# execute as script (multiple statements)
conn.executescript(sql2)
else:
# execute migration SQL
conn.execute(sql)
# check columns via pragma
if DB_BACKEND == "duckdb":
rows = conn.execute(f"PRAGMA table_info('{table_name}')").fetchall()
col_names = [r[1] for r in rows]
else:
cur = conn.execute(f"PRAGMA table_info('{table_name}')")
rows = cur.fetchall()
col_names = [r[1] for r in rows]
for col in expected_cols:
assert col in col_names, (
f"Column {col} missing in table {table_name}, got {col_names}"
)
# perform a simple insert + select to validate basic round-trip
if table_name == "mp_votes":
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO mp_votes (motion_id, mp_name, party, vote, date) VALUES (1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
)
res = conn.execute(
"SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE motion_id=1"
).fetchone()
# DuckDB returns datetime.date for DATE columns; normalise to string
assert (
res[:4] == (1, "Jane Doe", "PartyX", "Yea")
and str(res[4]) == "2026-03-21"
)
else:
# sqlite: id has no default after transformation, provide id explicitly
conn.execute(
"INSERT INTO mp_votes (id, motion_id, mp_name, party, vote, date) VALUES (1, 1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
)
res = conn.execute(
"SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE id=1"
).fetchone()
assert res == (1, "Jane Doe", "PartyX", "Yea", "2026-03-21")
elif table_name == "mp_metadata":
conn.execute(
"INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id) VALUES ('Jane Doe', 'PartyX', '2020-01-01', '2024-12-31', 'pid-123')"
)
res = conn.execute(
"SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata WHERE mp_name='Jane Doe'"
).fetchone()
# DuckDB returns datetime.date for DATE columns; normalise to string
assert (
res[0] == "Jane Doe"
and res[1] == "PartyX"
and str(res[2]) == "2020-01-01"
and str(res[3]) == "2024-12-31"
and res[4] == "pid-123"
)
elif table_name == "svd_vectors":
# JSON value as text
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model) VALUES ('w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
)
res = conn.execute(
"SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE window_id='w1'"
).fetchone()
# Note: DuckDB may return the JSON column as string; compare string form
assert (
res[0] == "w1"
and res[1] == "typeA"
and res[2] == "e1"
and (str(res[3]) == "[1,2,3]" or res[3] == "[1,2,3]")
and res[4] == "m1"
)
else:
# sqlite: provide id explicitly
conn.execute(
"INSERT INTO svd_vectors (id, window_id, entity_type, entity_id, vector, model) VALUES (1, 'w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
)
res = conn.execute(
"SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE id=1"
).fetchone()
assert (
res[0] == "w1"
and res[1] == "typeA"
and res[2] == "e1"
and str(res[3]) == "[1,2,3]"
and res[4] == "m1"
)
elif table_name == "fused_embeddings":
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) VALUES (2, 'w2', '[0.1,0.2]', 16, 128)"
)
res = conn.execute(
"SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE motion_id=2"
).fetchone()
assert (
res[0] == 2
and res[1] == "w2"
and (str(res[2]) == "[0.1,0.2]" or res[2] == "[0.1,0.2]")
and res[3] == 16
and res[4] == 128
)
else:
conn.execute(
"INSERT INTO fused_embeddings (id, motion_id, window_id, vector, svd_dims, text_dims) VALUES (1, 2, 'w2', '[0.1,0.2]', 16, 128)"
)
res = conn.execute(
"SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE id=1"
).fetchone()
assert (
res[0] == 2
and res[1] == "w2"
and str(res[2]) == "[0.1,0.2]"
and res[3] == 16
and res[4] == 128
)
conn.close()

@ -0,0 +1,5 @@
def test_scientific_deps_present():
content = open("pyproject.toml").read()
assert "scipy" in content
assert "umap-learn" in content
assert "plotly" in content

@ -0,0 +1,63 @@
import numpy as np
from database import db as motion_db
from pipeline.svd_pipeline import (
_safe_k,
_build_vote_matrix,
_procrustes_align,
run_svd_for_window,
)
def test_safe_k_and_build_and_run(tmp_path):
np.random.seed(0)
# reset DB file for test
db_path = tmp_path / "test.db"
# point the MotionDatabase to this test DB
motion_db.db_path = str(db_path)
motion_db._init_database()
# Create synthetic dataset: 5 MPs x 6 motions
mps = [f"MP_{i}" for i in range(5)]
motions = list(range(100, 106))
dates = ["2020-01-0" + str(i + 1) for i in range(6)]
votes = ["Voor", "Tegen", "Geen stem"]
# insert votes: fill full matrix using MotionDatabase helper
for j, motion_id in enumerate(motions):
for i, mp in enumerate(mps):
vote = votes[(i + j) % len(votes)]
motion_db.insert_mp_vote(motion_id, mp, vote, date=dates[j])
mat, mp_names, motion_ids = _build_vote_matrix(
motion_db, "2020-01-01", "2020-01-10"
)
assert mat.shape == (5, 6)
# _safe_k: with k=10 -> min_dim=5 -> returns 4
assert _safe_k(mat, 10) == 4
assert _safe_k(mat, 3) == 3
# run_svd_for_window with k=10 -> should use k_used=4
res = run_svd_for_window(motion_db, "w1", "2020-01-01", "2020-01-10", k=10)
assert res["k_used"] == 4
assert res["stored_mp"] == 5
assert res["stored_motion"] == 6
def test_procrustes_align():
np.random.seed(0)
# create reference anchors and current anchors rotated + noise
ref = np.random.randn(10, 3)
# create orthogonal rotation
Q, _ = np.linalg.qr(np.random.randn(3, 3))
cur = ref.dot(Q) + 0.1 * np.random.randn(10, 3)
before = np.linalg.norm(cur - ref)
transformed = _procrustes_align(ref, cur)
after = np.linalg.norm(transformed - ref)
assert after < before

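These tests constrain two helpers: `_safe_k` clamps the requested rank to at most `min(matrix.shape) - 1` (a 5×6 matrix with k=10 yields 4), and `_procrustes_align` must bring the current window's anchor coordinates strictly closer to the reference anchors. A compact sketch of both, assuming scipy's `orthogonal_procrustes` provides the rotation; note that its second return value is the sum of singular values, so a least-squares scale factor has to be derived from it rather than applied directly:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def _safe_k(mat: np.ndarray, k: int) -> int:
    # Truncated SVD needs k strictly below the smaller matrix dimension.
    return min(k, min(mat.shape) - 1)


def _procrustes_align(ref: np.ndarray, cur: np.ndarray) -> np.ndarray:
    """Rotate (and uniformly scale) `cur` so it best matches `ref` in least squares."""
    # R minimises ||cur @ R - ref||_F; `sca` is the sum of singular values of
    # cur.T @ ref, so the optimal uniform scale is sca / ||cur||_F**2, not sca itself.
    R, sca = orthogonal_procrustes(cur, ref)
    scale = sca / (np.linalg.norm(cur) ** 2)
    return scale * cur @ R
```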
@ -0,0 +1,80 @@
import pytest
# duckdb is an optional dependency in some environments; skip test if not available
duckdb = pytest.importorskip("duckdb")
from database import MotionDatabase
def test_ensure_text_embeddings_monkeypatch(tmp_path, monkeypatch):
# prepare temp db
db_path = str(tmp_path / "motions.db")
db = MotionDatabase(db_path)
# create embeddings table (migration would normally do this)
conn = duckdb.connect(db.db_path)
# create embeddings table with a sequence-backed id (DuckDB)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
)
# insert three motions
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t1", "d1", "u1", "ex1"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t2", "d2", "u2", "ex2"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t3", "d3", "u3", "ex3"),
)
# fetch ids
rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
ids = [r[0] for r in rows]
# insert existing embedding for first motion
import json as _json
vec = _json.dumps([0.1] * 16)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
(ids[0], "test-model", vec),
)
conn.close()
# monkeypatch ai_provider.get_embedding
def fake_get_embedding(text, model=None):
return [0.1] * 16
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings
stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
db_path=db_path, model="test-model"
)
assert stored == 2
assert skipped_existing == 1
assert skipped_no_text == 0
assert errors == 0
# verify stored vectors length
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
("test-model",),
).fetchall()
conn.close()
assert len(rows) == 3
for r in rows:
v = _json.loads(r[0])
assert len(v) == 16

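This test nails down the shape of `ensure_text_embeddings`: it returns a `(stored, skipped_existing, skipped_no_text, errors)` tuple, skips motions that already have a vector for the given model, and resolves embeddings through the `ai_provider` module attribute, which is why the monkeypatch on `ai_provider.get_embedding` takes effect. A sketch along those lines; exactly which motion fields feed the embedding text is an assumption here:

```python
import json
import duckdb
import ai_provider


def ensure_text_embeddings(db_path: str, model: str):
    """Embed every motion that has text but no embedding for `model` yet."""
    conn = duckdb.connect(db_path)
    stored = skipped_existing = skipped_no_text = errors = 0
    try:
        motions = conn.execute(
            "SELECT id, title, description, layman_explanation FROM motions"
        ).fetchall()
        for motion_id, title, description, layman in motions:
            existing = conn.execute(
                "SELECT count(*) FROM embeddings WHERE motion_id = ? AND model = ?",
                (motion_id, model),
            ).fetchone()[0]
            if existing:
                skipped_existing += 1
                continue
            text = " ".join(part for part in (title, description, layman) if part).strip()
            if not text:
                skipped_no_text += 1
                continue
            try:
                # Looked up on the module so tests can monkeypatch ai_provider.get_embedding.
                vector = ai_provider.get_embedding(text, model=model)
            except Exception:
                errors += 1
                continue
            conn.execute(
                "INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
                (motion_id, model, json.dumps(vector)),
            )
            stored += 1
    finally:
        conn.close()
    return stored, skipped_existing, skipped_no_text, errors
```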
@ -0,0 +1,22 @@
import json
from src.types.motion_types import SimilarityNeighbor, to_json, from_json
def test_similarity_neighbor_json_roundtrip():
neighbors = [
SimilarityNeighbor(motion_id="m1", score=0.9),
SimilarityNeighbor(motion_id="m2", score=0.75),
]
# Serialize to JSON string
json_str = to_json(neighbors)
assert isinstance(json_str, str)
# Ensure it's valid JSON
parsed = json.loads(json_str)
assert isinstance(parsed, list)
# Deserialize back to objects
recovered = from_json(json_str)
assert recovered == neighbors

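The round-trip above needs little more than a dataclass and two helpers; a minimal sketch of what `src/types/motion_types.py` could contain to satisfy it (field names come from the test, the `MotionId` and `Embedding` aliases come from the plan's Task 1.3, everything else is an assumption):

```python
import json
from dataclasses import dataclass, asdict
from typing import List

MotionId = str            # type alias used across similarity code
Embedding = List[float]   # a raw embedding vector


@dataclass
class SimilarityNeighbor:
    motion_id: MotionId
    score: float


def to_json(neighbors: List[SimilarityNeighbor]) -> str:
    # Serialise neighbours as a plain JSON array of objects.
    return json.dumps([asdict(n) for n in neighbors])


def from_json(payload: str) -> List[SimilarityNeighbor]:
    # Dataclass equality makes the round-trip assertion in the test hold.
    return [SimilarityNeighbor(**item) for item in json.loads(payload)]
```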
@ -0,0 +1,66 @@
"""
Test helper fixtures for database migrations.
Provides a pytest fixture `test_db` that inspects the environment variable
`TEST_DB_URL` to decide what to yield:
- If `TEST_DB_URL` is not set, the fixture yields None. This allows tests to
be skipped or operate in a no-database mode in CI or local runs where a
test database is not available.
- If `TEST_DB_URL` is set and starts with "sqlite", an sqlite3 connection is
created via `sqlite3.connect` and yielded. The connection is closed after
the test completes.
Decision: keep this fixture lightweight and focused on sqlite for local
smoke-testing. If other database backends are needed later, expand this
fixture accordingly.
"""
from typing import Optional
import os
import sqlite3
import pytest
@pytest.fixture
def test_db():
"""Yield a test database connection or None.
Behavior:
- If TEST_DB_URL is not set in the environment, yield None.
- If TEST_DB_URL is set and begins with 'sqlite', open an sqlite3
connection and yield it. The connection will be closed when the test
finishes.
"""
url = os.environ.get("TEST_DB_URL")
if not url:
yield None
return
# Only support sqlite URLs in this lightweight fixture.
if url.startswith("sqlite"):
# For sqlite URLs, accept either a bare file path or a sqlite:// /
# sqlite:/// style URL. sqlite3.connect expects a plain file path, so
# strip the URL prefix before connecting.
path = url
if path.startswith("sqlite:///"):
# sqlite:///path => /path
path = path[len("sqlite:///") :]
elif path.startswith("sqlite://"):
path = path[len("sqlite://") :]
conn = sqlite3.connect(path)
try:
yield conn
finally:
try:
conn.close()
except Exception:
# Best-effort close; tests shouldn't fail on close errors.
pass
return
# Unknown or unsupported TEST_DB_URL scheme — yield None to keep tests
# tolerant in environments where the fixture can't create a connection.
yield None

@ -0,0 +1,50 @@
# Session: stemwijzer
Updated: 2026-03-20T00:23:33Z
## Goal
Preserve the minimal session state required to resume work on the stemwijzer project after context clears (success = ledger exists and is kept up-to-date).
## Constraints
- Keep the ledger CONCISE — only essential information
- Focus on WHAT and WHY, not HOW
- Mark uncertain information as UNCONFIRMED
- Include git branch and key file paths
## Progress
### Done
- [x] Create initial continuity ledger file
### In Progress
- [ ] Capture ongoing session context and update ledger after each meaningful change
### Blocked
- None currently
## Key Decisions
- **Session name = "stemwijzer"**: Chosen from repository context (UNCONFIRMED if a different canonical session name is preferred).
- **Do not auto-commit ledger changes**: Commits will only be made when the user explicitly requests it (follows Git Safety Protocol).
## Next Steps
1. Continue updating this ledger when tasks, files, or decisions change
2. Add entries for new branches or major feature work (mark as UNCONFIRMED when unsure)
3. Ask user before creating any git commits that include this ledger
## File Operations
### Read
- `README.md`
- `pyproject.toml`
- `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`
- `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`
### Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (new)
## Critical Context
- Repository branch observed: `main`
- Found project metadata in `pyproject.toml` indicating Python tooling preference
- Existing notes/plans located under `thoughts/shared/` (plans and designs from 2026-03-19)
- No existing continuity ledger was found prior to this creation
## Working Set
- Branch: `main`
- Key files: `README.md`, `pyproject.toml`, `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`, `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`

@ -0,0 +1,98 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB design"
status: draft
---
## Problem Statement
We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.
**Goal:**
- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).
## Constraints
- Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.
## Approach (chosen)
I'll introduce two small layers:
- **ai_provider**: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired.
- **query_dal**: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.
This gives the benefits of abstraction and pythonic query composition while keeping risk low.
## Architecture
High level components (repo root):
- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generate layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)
Additions:
- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)
## Components and responsibilities
- **ai_provider**: choose provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- **database (existing)**: add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk. A sketch of the cosine scan follows this list.
- **query_dal**: use ibis for read queries used by Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- **summarizer**: call ai_provider.chat_completion to get summary; update motions.layman_explanation; optionally compute embedding via ai_provider.get_embedding and store via database.store_embedding.
- **app.py**: replace direct duckdb selects with query_dal functions.
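For MVP data sizes a plain in-Python scan is enough; a sketch of the intended shape, assuming embeddings come back from DuckDB as `(motion_id, vector_json)` rows (names here are illustrative, not the final database.py API):

```python
import json
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_similar(query_vector: list[float], rows: list[tuple[int, str]], top_n: int = 10):
    """Naive in-Python scan over (motion_id, vector_json) rows; fine for MVP sizes."""
    scored = [(motion_id, _cosine(query_vector, json.loads(vec))) for motion_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```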
## Data Flow
1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to motions table. Optionally computes embedding and writes to embeddings table.
3. Query: Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
4. Semantic search (future): query_dal or app can call database.search_similar by providing an embedding computed with ai_provider.get_embedding.
## Error Handling
- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.
## Testing Strategy
- Unit tests for ai_provider using mocks for HTTP/openai responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert DB writes happen.
## Open Questions
- Store embeddings inside motions table vs separate embeddings table? Recommendation: separate embeddings table for clarity and easier upserts.
- Do we want to prefer other providers (Copilot) automatically? This repo currently references OPENROUTER. If user wants Copilot preference, we can add env vars and selection logic later.
## Next steps (short)
1. Add ai_provider.py (adapter) and tests.
2. Add embeddings table and store/search helpers in database.py and tests.
3. Add query_dal.py with ibis reads and tests.
4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
5. Update Streamlit app read paths to use query_dal.
6. Fix housekeeping bugs: reset.py references reset_database(), scraper uses undefined SCRAPING_DELAY — address these small fixes in a separate patch.
I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.

@ -0,0 +1,116 @@
---
date: 2026-03-21
topic: "Reuse motions as a guided policy explorer"
status: draft
---
## Problem Statement
We want to repurpose existing "motions" data so it becomes a lightweight, discovery-driven way for users to explore policy positions and discover related content. This is not a full proposal system; it's a guided exploration and bookmarking flow that leverages our existing ingestion, summarization, embeddings, and session voting work.
**Why now:** We already ingest motions, generate layman explanations, compute embeddings, and store per-session votes. Reusing those building blocks gives high user value with modest effort.
## Constraints
**Non-negotiables and technical limits:**
- Use the existing database schema where possible (motions table, embeddings table, user_sessions). Do not require a new external vector DB for MVP.
- Keep the Streamlit UI model (app.py) and session-based votes intact for the initial rollout.
- Avoid breaking migrations: rely on existing migrations and add new ones when necessary (no forced drops).
- Respect current error-handling posture: network calls can fail; system must degrade gracefully.
## Chosen Approach
I'm choosing a "Guided Policy Explorer" approach because it reuses the highest-value existing pieces (summaries, embeddings, session voting) and delivers a clear UX that fits the current codebase. This gives immediate product value with low risk.
**Core idea:** present curated short sessions and motion detail pages that combine the existing layman explanation, party-match results, and semantic "related motions" powered by stored embeddings.
Alternatives considered:
- "Motion-as-Proposal platform": full lifecycle (draft → comment → vote). Rejected for MVP due to high complexity and data model changes.
- "Motion Digest / Research Assistant": read-only pages and newsletters. Lower effort, but less interactive and reuses fewer of our current session features.
## Architecture
High-level view (existing pieces in bold):
- Ingest: **api_client.py** + **scraper.py** gather motions and create motion records in the DB.
- Persist: **database.py** stores motions, embeddings, and user_sessions.
- Enrichment: **summarizer.py** + **ai_provider.py** generate layman explanations and embeddings.
- Background jobs: **scheduler.py** runs ingest, summarization, and periodic clustering.
- UI: **app.py** current Streamlit session flow — extend with "Explore" and "Motion detail" pages.
- New: small **clusterer / similarity API** to compute and cache related-motion lists per motion.
## Key Components & Responsibilities
- Motion Ingest (existing): keep ingest as-is; add metadata flags (e.g., curated, candidate).
- Motion Store (existing): motions table + embeddings table; add an **events/audit** table for user actions and important state transitions.
- Summarizer / Embedding Worker (existing): scheduled job that ensures motions have layman_explanation and embeddings; add retry/backoff and logging.
- Similarity service (new): computes nearest neighbors using stored vectors in-process for MVP and caches results in a small table. Swap to a vector index later if needed.
- Session & Voting (existing): continue using user_sessions JSON blob for individual sessions; add optional event log entries for each vote.
- UI (update): add "Explore" landing, motion detail view with layman text, party-match snapshot, related motions, and bookmark/flag actions. Reuse Streamlit components.
- Admin tooling (new): migration scripts, a CLI to recompute embeddings/similarity, and an audit query helper.
## Data Flow
1. Ingest job (api_client/scraper) produces motion records and calls db.insert_motion.
2. Summarizer worker picks up motions without layman_explanation or embeddings, calls ai_provider, and writes layman_explanation + embeddings.
3. Clusterer/similarity job computes related-motion lists using stored embeddings and writes them to a cache table.
4. UI "Explore" shows curated motion lists; "Motion detail" reads motion, layman_explanation, party-match snapshot, and cached related motions.
5. User vote actions update user_sessions and also append an event to the audit table for traceability.
6. Background analytics (optional) reuses user_events and embeddings for offline insights.
## Error Handling Strategy
- External calls: add retries with exponential backoff for AI provider and external APIs. Failures set a marker (e.g., summary_missing) and the system continues.
- Missing embeddings: UI gracefully disables "related motions" and offers "compute on demand".
- Idempotency: make insert_motion idempotent by URL/external id check at DB layer; use optimistic handling for duplicates.
- Concurrency: avoid read-modify-write races by writing user events (append-only) and deriving session state from events when race-prone updates are detected.
- Observability: replace prints with structured logging (module-level logger) and add basic metrics for worker errors, API failures, and queue lags.
## Testing Strategy
- Unit tests: DB helpers (insert_motion, store_embedding, similarity cache), summarizer functions (mock ai_provider), and session vote logic.
- Migration tests: follow the existing pattern of applying migration SQL in a temp DB and asserting schema.
- Integration tests: end-to-end ingest → summarize → embedding → similarity → UI-read path in CI (use monkeypatch for AI calls).
- Load tests: simulate a few thousand embeddings search calls against the in-process search to validate performance assumptions for MVP.
- Acceptance: confirm UX flows: Explore session, Motion detail, Vote -> party match, Related motions populated.
## High-level Plan & Estimates
Assumptions: one full-stack engineer (Python + Streamlit) and one part-time reviewer. All estimates are rough.
Milestone 0 — Validate & quick discovery (1 day)
- Locate the user's added markdown plan and extract exact requirements. (I'm assuming the file exists under thoughts/shared; if not, validate its location by searching.)
Milestone 1 — MVP (8–12 engineer days)
- Add similarity cache table and migration.
- Summarizer: make embedding generation robust with retries and store vectors.
- Clusterer job: compute and cache related motions.
- UI: Explore landing, Motion detail page, related motion UI, bookmark/flag button.
- Add event/audit table and write events on user votes and bookmarks.
Milestone 2 — Hardening & instrumentation (3–5 engineer days)
- Replace prints with structured logging across touched modules.
- Add migration tests and CI integration tests (mock AI).
- Add health metrics & basic alerting for worker failures.
Milestone 3 — Polish & UX feedback (3–5 engineer days)
- UX tweaks, performance tuning, compute on-demand fallback for embeddings, documentation, admin CLI.
Total MVP + polish: ~2–3 weeks of focused work.
## Risks & Mitigations
- Risk: Naive in-process embedding search will not scale. Mitigation: cache nearest neighbors per motion and plan a migration path to a vector index.
- Risk: AI provider flakiness. Mitigation: retries, timeouts, and clear UI fallback. Tests must mock provider in CI.
- Risk: Race conditions on session votes. Mitigation: append-only event log and derive authoritative session view from events when needed.
- Risk: Schema drift and missing migrations. Mitigation: add migration tests and document required migrations in repo.
## Open Questions
- Which exact user journeys do we want first (single-session discover vs. persistent account/bookmarking)?
- Do we want bookmarks persisted globally or per-session only? (Privacy implications.)
- What's acceptable latency for "related motions" — precomputed nightly vs. near-real-time?
- Any policy/legal ban on storing full body_text or on long-term retention of user votes?
---
I'm proceeding to create the design doc file at thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md and will spawn the implementation planner next. Interrupt if you want changes to the approach or scope now.

@ -0,0 +1,335 @@
# Guided Policy Explorer — Implementation Plan
**Goal:** Implement the Guided Policy Explorer MVP that reuses existing motions, layman summaries, embeddings and session votes to provide an Explore landing, Motion detail view, cached related motions (similarity cache), and accompanying background jobs and admin tooling.
Design: thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md
---
## Dependency Graph
```
Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4, 1.5 [foundation - migrations, types, migration-tests]
Batch 2 (parallel): 2.1, 2.2, 2.3, 2.4 [core - similarity service, cache repo, audit repo, embeddings worker]
Batch 3 (parallel): 3.1, 3.2, 3.3, 3.4 [components - clusterer worker, CLI, API, Streamlit page]
Batch 4 (parallel): 4.1 [integration tests & docs - depends on 2.x & 3.x]
```
---
## Notes on planning choices
- Design requires a similarity cache and a small in-process nearest-neighbor search for MVP. I'm implementing this as: store precomputed top-N neighbor lists (IDs + scores) in a small SQL table and compute neighbors by scanning embeddings in-memory per batch job. Reason: avoids external vector DB and keeps implementation simple and testable.
- Design requires robust embedding generation. I'll implement exponential-backoff retry logic with a configurable retry count and timeouts in embeddings_worker; tests will monkeypatch the ai_provider to simulate failures.
- Migration tests: the design calls for migration tests, but migration SQL content is omitted per instructions. Tests will assert that migration files exist and follow the naming conventions, and will skip applying SQL unless a TEST_DB_URL env var is provided. This keeps CI safe while still giving test coverage and a path for developer verification.
---
## Batch 1: Foundation (parallel - 5 implementers)
All tasks in this batch have NO dependencies and run simultaneously.
### Task 1.1: Add similarity cache migration (placeholder)
**Title:** Migration: add similarity_cache table
**Description:** Add a migration file to create a similarity cache table that stores precomputed related-motion lists per motion (motion_id, neighbors_json, computed_at). SQL content intentionally left out per instructions; file is a placeholder that CI/tests will detect.
**Files:**
- migrations/2026-03-22-add-similarity-cache.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_similarity_cache.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- Migration file exists at migrations/2026-03-22-add-similarity-cache.sql
- test_migration file runs and passes in default mode (it will only check filename & header). If TEST_DB_URL is set in env, test will attempt to run the SQL and must not error (SQL may be empty; test expects a no-op or valid SQL). Test is marked to skip DB application when TEST_DB_URL is unset.
---
### Task 1.2: Add audit/events migration (placeholder)
**Title:** Migration: add audit_events table
**Description:** Add a migration placeholder to create an audit/events table for append-only user events (vote, bookmark, flag). Actual SQL omitted.
**Files:**
- migrations/2026-03-22-add-audit-events.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_audit_events.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- migrations/2026-03-22-add-audit-events.sql exists
- migration test verifies filename and is safe to run in CI (skips DB apply unless TEST_DB_URL provided).
---
### Task 1.3: Shared types for motions & similarity entries
**Title:** Types: motion and similarity types
**Description:** Add a small types module that centralizes typed dataclasses/interfaces used by similarity and cache modules (MotionId, Embedding vector typed alias, SimilarityNeighbor). This reduces coupling and makes tests easier to write.
**Files:**
- src/types/motion_types.py
**Tests:**
- tests/types/test_motion_types.py
**Estimated:** 1.5h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- src/types/motion_types.py defines MotionId, Embedding, SimilarityNeighbor types and basic helpers (e.g., serialize/deserialize neighbors). Tests validate JSON round-trip of neighbors.
---
### Task 1.4: CI migration test helper
**Title:** Test helper: migration test utils
**Description:** Add a small test helper that other migration tests can use. It provides a pytest fixture that reads TEST_DB_URL and yields a DB connection or None and marks tests appropriately.
**Files:**
- tests/utils/migration_fixtures.py
**Tests:**
- tests/migrations/test_migration_fixtures_smoke.py
**Estimated:** 1.0h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- migration_fixtures.py provides `test_db` fixture. The smoke test asserts fixture yields None when TEST_DB_URL unset and yields a connection-like object when set.
---
### Task 1.5: Add README admin docs for recomputing
**Title:** Docs: admin CLI usage and migration notes
**Description:** Add a short markdown doc describing the admin CLI, migration filenames, and how to run recompute/clusterer jobs locally for dev.
**Files:**
- docs/admin/recompute_similarity.md
**Tests:** none (doc only)
**Estimated:** 0.5h
**Priority:** low
**Depends:** none
**Acceptance criteria:**
- docs/admin/recompute_similarity.md exists and documents commands and env vars: TEST_DB_URL, AI_PROVIDER_MOCK, SIMILARITY_TOP_N.
---
## Batch 2: Core Modules (parallel - 4 implementers)
Depends: Batch 1
### Task 2.1: Similarity service (in-process search + utility)
**Title:** Similarity service implementation
**Description:** New service that, given motion embeddings, computes cosine similarity and returns top-N neighbors. Also exposes a convenience function to compute neighbors for one motion and return a list of (motion_id, score). This is pure Python and testable in-memory; a sketch follows the acceptance criteria below.
**Files:**
- src/services/similarity_service.py
**Tests:**
- tests/services/test_similarity_service.py
**Estimated:** 5.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- similarity_service.py exposes compute_neighbors(embedding: list[float], all_embeddings: Dict[motion_id, embedding], top_n: int) -> List[SimilarityNeighbor]
- Unit tests cover exact small matrices and edge cases (empty, identical embeddings). All tests pass with `pytest tests/services/test_similarity_service.py`.
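A sketch matching the acceptance-criteria signature, reusing the SimilarityNeighbor type from Task 1.3; cosine similarity over a plain dict, with the understanding that the real module may swap in numpy or add tie-breaking rules:

```python
import math
from typing import Dict, List

from src.types.motion_types import SimilarityNeighbor


def compute_neighbors(
    embedding: List[float],
    all_embeddings: Dict[str, List[float]],
    top_n: int,
) -> List[SimilarityNeighbor]:
    """Return the top_n motions most cosine-similar to `embedding`."""

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [
        SimilarityNeighbor(motion_id=mid, score=cosine(embedding, vec))
        for mid, vec in all_embeddings.items()
    ]
    scored.sort(key=lambda n: n.score, reverse=True)
    return scored[:top_n]
```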
---
### Task 2.2: DB repo for similarity cache
**Title:** Repo: similarity_cache read/write
**Description:** Provide a small repository abstraction that reads and writes cached neighbor lists to the DB (serialize neighbors as JSON). Keep DB interactions minimal and testable using sqlite in-memory.
**Files:**
- src/db/similarity_cache_repo.py
**Tests:**
- tests/db/test_similarity_cache_repo.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.1, 1.3
**Acceptance criteria:**
- similarity_cache_repo provides functions: get_cached_neighbors(motion_id) -> Optional[List[SimilarityNeighbor]] and upsert_cached_neighbors(motion_id, neighbors, computed_at)
- Unit tests run against sqlite in-memory and assert correct serialization/deserialization.
---
### Task 2.3: Audit/events repository
**Title:** Repo: audit_events append-only writer
**Description:** Small repo to append audit events (user_id, session_id, motion_id, event_type, payload JSON, created_at). Provides an append_event function used by UI and session logic.
**Files:**
- src/db/audit_repo.py
**Tests:**
- tests/db/test_audit_repo.py
**Estimated:** 3.0h
**Priority:** medium
**Depends:** 1.2
**Acceptance criteria:**
- append_event writes a row to sqlite in-memory in test and read-back verifies fields and created_at presence. Functions are well typed and handle JSON payloads.
---
### Task 2.4: Embeddings worker helper (retries/backoff)
**Title:** Worker: robust embedding generator
**Description:** Add a worker helper that ensures embeddings exist for a motion. It calls ai_provider.get_embedding with retry/backoff and writes embedding via an abstracted DB function (the put function will be dependency-injected in tests). This module contains no long-running loop — it's a single-run helper function used by the scheduler. See the sketch after the acceptance criteria.
**Files:**
- src/ai/embeddings_worker.py
**Tests:**
- tests/ai/test_embeddings_worker.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- embeddings_worker.explain_and_embed(motion_id, text, put_embedding_fn) calls ai_provider and retries on simulated transient errors. Tests monkeypatch ai_provider to simulate 2 failing attempts then success and verify put_embedding_fn called exactly once with a vector-like object.
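A sketch of the retry wrapper described above; the retry count, backoff base, and which exception types count as transient are placeholders to settle during implementation, and the ai_provider import location follows the Assumptions section below:

```python
import time
from typing import Callable, List

from src.ai import ai_provider  # assumed location, per the Assumptions section


def explain_and_embed(
    motion_id: int,
    text: str,
    put_embedding_fn: Callable[[int, List[float]], None],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> None:
    """Fetch an embedding with exponential backoff, then hand it to the injected writer."""
    for attempt in range(1, max_attempts + 1):
        try:
            vector = ai_provider.get_embedding(text)
            break
        except Exception:
            # Treats every failure as transient for the sketch; re-raise on the last attempt.
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    put_embedding_fn(motion_id, vector)
```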
---
## Batch 3: Components (parallel - 4 implementers)
Depends: Batch 2
### Task 3.1: Clusterer scheduled job
**Title:** Worker: clusterer job that computes & writes caches
**Description:** Background job module that loads all embeddings, computes top-N neighbors for each motion using similarity_service, and writes cache rows via similarity_cache_repo. Designed to be runnable from CLI. It should respect a MAX runtime parameter (process batch size) for safe operation in dev. A sketch of the batch loop follows the acceptance criteria.
**Files:**
- src/workers/clusterer.py
**Tests:**
- tests/workers/test_clusterer.py
**Estimated:** 6.0h
**Priority:** high
**Depends:** 2.1, 2.2, 2.4
**Acceptance criteria:**
- clusterer.run_batch(batch_size, top_n, load_embeddings_fn, upsert_cache_fn) exists and can be unit-tested by injecting small in-memory embeddings and verifying upsert_cache_fn called with expected neighbor lists.
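One way the injected-function contract above could be satisfied; the batching order and the integer return value are assumptions:

```python
from typing import Callable, Dict, List

from src.services.similarity_service import compute_neighbors
from src.types.motion_types import SimilarityNeighbor


def run_batch(
    batch_size: int,
    top_n: int,
    load_embeddings_fn: Callable[[], Dict[str, List[float]]],
    upsert_cache_fn: Callable[[str, List[SimilarityNeighbor]], None],
) -> int:
    """Compute and cache top-N neighbours for up to `batch_size` motions."""
    all_embeddings = load_embeddings_fn()
    processed = 0
    for motion_id, vector in all_embeddings.items():
        if processed >= batch_size:
            break
        # Exclude the motion itself from its own neighbour list.
        others = {mid: vec for mid, vec in all_embeddings.items() if mid != motion_id}
        upsert_cache_fn(motion_id, compute_neighbors(vector, others, top_n))
        processed += 1
    return processed
```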
---
### Task 3.2: Admin CLI: recompute-similarity
**Title:** CLI: recompute similarity & options
**Description:** Small CLI script (click or argparse) to trigger the clusterer job (full-run or limited). CLI accepts --top-n, --batch-size, --dry-run flags. Tests will monkeypatch clusterer.run_batch.
**Files:**
- src/cli/recompute_similarity.py
**Tests:**
- tests/cli/test_recompute_similarity.py
**Estimated:** 2.5h
**Priority:** medium
**Depends:** 3.1
**Acceptance criteria:**
- CLI parses flags and calls clusterer.run_batch with parsed args. tests assert proper arguments passed and dry-run does not call run_batch.
---
### Task 3.3: HTTP API endpoint for compute-on-demand / cached
**Title:** API: similarity endpoint
**Description:** Small Flask/FastAPI/WSGI handler module that returns cached related motions for a motion_id; if cache missing and a query param compute=true, it calls the similarity service to compute neighbors on demand (without persisting) and returns them. Keep the handler framework-agnostic so it can be wired into existing web framework; tests will call the handler function directly.
**Files:**
- src/api/similarity_api.py
**Tests:**
- tests/api/test_similarity_api.py
**Estimated:** 3.5h
**Priority:** medium
**Depends:** 2.1, 2.2
**Acceptance criteria:**
- Handler get_related(motion_id, compute=False, load_embedding_fn, load_all_embeddings_fn, cache_repo) returns cached neighbors when present and computes on demand when compute=True. Tests cover both code paths.
---
### Task 3.4: Streamlit UI: Explore landing & Motion detail module
**Title:** UI: explore page and motion detail component
**Description:** Add a Streamlit helper module providing functions to render the Explore landing and Motion detail sections. Avoid modifying existing app.py in this MVP; instead provide a module that app.py can import. The module will expose pure functions where possible to ease testing; tests will verify behavior by calling functions and mocking DB/AI calls.
**Files:**
- src/ui/explore_page.py
**Tests:**
- tests/ui/test_explore_page.py
**Estimated:** 5.0h
**Priority:** medium
**Depends:** 2.2, 2.3, 2.4
**Acceptance criteria:**
- explore_page.render_explore(session, load_curated_fn, load_cached_neighbors_fn) returns a data structure (not direct Streamlit calls) that app.py can choose to render. Tests assert correct payload for a sample session and that missing embeddings gracefully remove related motions.
---
## Batch 4: Integration & Docs (parallel - 2 implementers)
Depends: Batch 2 & 3
### Task 4.1: Integration test: ingest → summarize → embed → cluster → UI read
**Title:** Integration test for the end-to-end path (mvp)
**Description:** Add an integration pytest that simulates: create 3 synthetic motions, call embeddings_worker (monkeypatched AI provider), run clusterer on the in-memory dataset, and assert similarity cache rows exist and explore_page returns related motions. Use sqlite in-memory and monkeypatch ai_provider to return deterministic vectors.
**Files:**
- tests/integration/test_end_to_end_explore_flow.py
**Tests:**
- (this is the test file)
**Estimated:** 8.0h
**Priority:** high
**Depends:** 1.3, 2.1, 2.2, 2.4, 3.1, 3.4
**Acceptance criteria:**
- Running `pytest tests/integration/test_end_to_end_explore_flow.py` passes locally with no external network calls when AI provider is monkeypatched via monkeypatch fixture. The test asserts that at least one neighbor exists for a motion and the explore_page data includes it.
---
## CI / Test instructions
- Run unit tests: pytest tests/unit (or full suite: pytest)
- Run a single module test: pytest tests/services/test_similarity_service.py::test_compute_neighbors_basic
- Integration tests: pytest tests/integration/test_end_to_end_explore_flow.py
Monkeypatching AI provider in CI/local tests:
- Use the `monkeypatch` pytest fixture to patch `src.ai.ai_provider.get_embedding` and `src.ai.ai_provider.summarize` (if used). Example in tests: monkeypatch.setattr('src.ai.ai_provider.get_embedding', fake_get_embedding)
- CI should set env var AI_PROVIDER_MOCK=1 for additional safety; tests will check this var and use mocks if present.
Temp DB setup for tests:
- Unit tests should use sqlite in-memory ("sqlite:///:memory:") via a `test_db` fixture in tests/utils/migration_fixtures.py.
- Migration tests: If TEST_DB_URL env var is set, the migration tests will attempt to apply SQL to that DB; otherwise they will run in dry-run / skip-apply mode and only validate filename and header.
Example pytest commands:
- pytest -q
- pytest -q tests/services/test_similarity_service.py -k compute_neighbors
Notes for CI pipeline:
- Ensure Python dependencies include pytest, pytest-mock and any DB driver required (sqlite built-in is fine). No external AI keys required — tests must mock AI provider.
---
## 3-Sprint Schedule (2-week sprints)
Sprint 1 (Weeks 1–2) — Milestone 1: MVP foundation + core similarity
- Goals: Add migrations, types, similarity service, similarity cache repo, audit repo, embeddings worker helper
- Tasks: 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4
Sprint 2 (Weeks 3–4) — Milestone 1 continued: background job, CLI, API, UI
- Goals: Implement clusterer job, CLI, similarity API, explore_page UI module; initial integration smoke tests
- Tasks: 3.1, 3.2, 3.3, 3.4, initial lightweight integration test scaffolding
Sprint 3 (Weeks 5–6) — Milestone 2 & 3: hardening, integration tests, docs
- Goals: Full integration tests, migration tests, docs, logging hardening, small UX polish
- Tasks: 4.1, docs improvements from 1.5, logging conversion across modules (follow-up small PRs as needed)
Notes:
- Estimates assume 1 full-stack engineer + 1 reviewer. Sprint 1 is AMA-heavy; reviewer will focus on migrations and core algorithms. Sprint 2 focuses on wiring and UI; reviewer focuses on integration and UX. Sprint 3 finishes tests and polish.
---
## Assumptions
- The repository uses Python 3.10+ and pytest for tests. If different, adjust test fixtures accordingly.
- Existing DB access helpers exist (a simple execute/connection helper). If not, tests use sqlite3 directly and repository code will accept a DB connection/cursor via dependency injection.
- The project already has an ai_provider abstraction at src/ai/ai_provider.py with functions `get_embedding(text) -> list[float]` and `summarize(text) -> str` — tests will monkeypatch these. If the names differ, adapt imports when implementing.
- Streamlit app remains `app.py` and can import src/ui/explore_page.py — I deliberately do not modify app.py in this plan to keep the change set minimal.
- We will store embeddings as arrays in an embeddings table; similarity modules will load them via an injected loader function to keep unit tests pure.
---
## Open Questions / Implementation Clarifications
1. Bookmarks persistence: design left bookmarks as open (session vs. persistent). For MVP we will record bookmark events in the audit_events table (append-only) and treat them as per-session by default. If persistent bookmarks required later, a new table/migration will be added.
2. Which web framework to wire the similarity_api into? The plan keeps handler framework-agnostic; we need guidance whether app uses Flask/FastAPI/Starlette to add the route. Implementer should wire into existing HTTP routing pattern.
3. Embedding storage format: assume float arrays stored as JSON or array type in DB. If project uses a binary blob, adjust serialization in similarity_cache_repo and tests accordingly.
4. Acceptable top-N neighbor size for caches. Default SIMILARITY_TOP_N = 10; CLI and worker accept override. If product wants 50, increase later.
---
## How a single implementer should proceed (step-by-step)
1. Start with Batch 1 tasks 1.1–1.4. Create migrations placeholders and types module. Run migration filename tests.
2. Implement similarity_service (2.1) and its unit tests. This is the critical algorithm that must be rock-solid.
3. Implement similarity_cache_repo (2.2) and audit_repo (2.3) using sqlite in-memory for tests. Run unit tests.
4. Implement embeddings_worker helper (2.4) and add tests that mock ai_provider. Ensure CI will not call real AI.
5. Implement clusterer (3.1) and test with in-memory data by injecting loader/upsert functions.
6. Add admin CLI (3.2) to run clusterer; add small doc (1.5) describing how to run it locally.
7. Implement API handler (3.3) and UI helper (3.4). Tests should mock DB and AI as needed.
8. Finish with integration test (4.1) to stitch the pieces together. Iterate on bug fixes and reviewer feedback.
---
## Acceptance criteria for the feature (MVP)
- Explore landing exists and can present curated motions (using existing curated flag). Data payload returned by explore_page includes motion metadata and layman_explanation.
- Motion detail returns layman_explanation, party-match snapshot (existing), and related motions computed from cached neighbor lists when available.
- Background clusterer job can recompute cached neighbor lists and the CLI can trigger it.
- Tests cover core algorithm (similarity computation), cache repo serialization, embedders (mocked), and at least one end-to-end smoke integration test.
---
If anything in this plan should be narrowed further (for a smaller initial PR) I recommend focusing Sprint 1 + clusterer CLI (Tasks 1.x + 2.x + 3.1 + 3.2) and deferring UI wiring until clusterer and cache are validated.

@ -0,0 +1,106 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB implementation plan"
status: draft
---
## Summary
Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive and well-tested.
## High-level approach (chosen)
- Add **ai_provider**: adapter exposing get_embedding(text) and chat_completion(messages) with retries and ProviderError.
- Add **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan).
- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
- Refactor summarizer to call ai_provider and optionally store embeddings.
- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.
## Micro-tasks (11 tasks)
All tasks are intentionally small (file-level changes + tests). Estimates assume one developer full-time; see Risk and Calendar section below.
Batch 1 (foundation, parallelizable)
1. Add tests fixtures for temporary DuckDB (tests/conftest.py) — 2h — low risk
2. Add migration SQL to create embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
3. Add ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
4. Add scraper SCRAPING_DELAY default (src/scraper.py) + tests — 1h — low risk
5. Fix reset script to run migrations (src/reset.py) + tests — 2h — low risk
Batch 2 (core modules)
6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
8. Refactor summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk
Batch 3 (integration)
9. Add CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk
Batch 4 (docs/config)
11. Add .env.example entries for new env vars — 1h — low risk
## PR order (recommended, small focused PRs)
1. PR A — tests/conftest (fixtures)
2. PR B — migration SQL (embeddings table)
3. PR C — ai_provider + tests
4. PR D — database store/search helpers + tests
5. PR E — query_dal + tests
6. PR F — summarizer refactor + tests
7. PR G — cli_search + tests
8. PR H — app read changes + tests
9. PR I — scraper/reset small fixes + tests
10. PR J — .env.example
## Estimates & schedule (one dev, full-time ~8h/day)
- Total estimated task effort: ~40 hours (~5 days) from the per-task estimates above; with review and buffer → ~7 calendar days.
- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batch 3 (1 day), Buffer/Review (1 day).
## DB migration steps
- Add migrations/2026-03-19-add-embeddings.sql (additive).
- Apply on staging first; backup DB, run migration, verify `SELECT count(*) FROM embeddings`.
- No changes to motions table in first iteration.
## Testing strategy
- Unit tests for ai_provider (mock HTTP responses). Use monkeypatch to avoid network.
- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse JSON fields.
- Summarizer tests mock ai_provider to assert DB writes (summary and optional embedding).
## Error handling
- ai_provider: retry/backoff for transient errors; raise ProviderError for terminal failures.
- Summarizer: non-fatal on AI failures — write fallback/empty summary, log, and surface message in UI when interactive.
- DB functions: keep try/except patterns and ensure connections closed on error.
## Risks & mitigations
- ai_provider changes: medium risk — mitigate with retries, clear ProviderError, and thorough unit tests.
- Embedding search: medium (naive scan performance) — mitigate by keeping implementation simple and planning for ANN/FAISS later.
- ibis usage: medium — mitigate with tests and keep query_dal narrow.
## Next actions (what I'll do now)
- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
- I will NOT start applying code changes automatically. If you want, I can:
- (A) Create the first PR patch (tests/conftest.py + migration) and open a draft for review, or
- (B) Start implementing Task 3 (ai_provider) next.
Interrupt if you want changes to the plan or a different PR ordering. Otherwise tell me which task to start and I'll create the first patch.

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""Query Tweede Kamer OData endpoints to locate motion body text.

This script performs the API calls described in the task and prints
structured information about responses (status code, keys, candidate
fields that may contain text or content URLs).

File: tools/query_tk_api.py
"""
import json
import sys
from urllib.parse import quote

try:
    import requests
except Exception:
    print("missing requests library", file=sys.stderr)
    raise

BASE = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
ZAAK_ID = "e6fd62f1-29be-4955-9811-03d46da2fc3a"
def try_get(path):
    url = BASE.rstrip("/") + "/" + path.lstrip("/")
    print("\nGET", url)
    r = requests.get(url, headers={"Accept": "application/json"})
    print("->", r.status_code, r.headers.get("Content-Type"))
    # try to print JSON keys or text length
    ct = r.headers.get("Content-Type", "")
    if "application/json" in ct or r.text.strip().startswith("{"):
        try:
            j = r.json()
            print("JSON keys:", list(j.keys()))
            # pretty-print limited
            print("JSON preview:", json.dumps(j, indent=2)[:4000])
            return j
        except Exception as e:
            print("failed to parse json:", e)
    else:
        print("text length:", len(r.content))
        print("headers:", dict(r.headers))
        print("first 800 bytes:\n", r.content[:800])
    return None
def main():
    # 1. Zaak expand Document
    tried = []
    patterns = [
        f"Zaak({ZAAK_ID})?$expand=Document",
        f"Zaak(guid'{ZAAK_ID}')?$expand=Document",
        f"Zaak('{ZAAK_ID}')?$expand=Document",
    ]
    zaak_json = None
    for p in patterns:
        tried.append(p)
        zaak_json = try_get(p)
        # stop at the first response whose entity (or first collection entry)
        # exposes a Document navigation property
        if zaak_json:
            hit = zaak_json.get("value") or zaak_json
            if isinstance(hit, list):
                hit = hit[0] if hit else {}
            if isinstance(hit, dict) and "Document" in hit:
                break
    # If top-level 'value' exists (collection), try to find first
    if zaak_json and "value" in zaak_json:
        # If API returned a collection, pick first
        val = zaak_json["value"]
        if isinstance(val, list) and val:
            zaak = val[0]
        else:
            zaak = None
    else:
        zaak = zaak_json

    print("\n--- Zaak object (extracted) ---")
    print(json.dumps(zaak, indent=2)[:4000])

    docs = []
    if zaak:
        # Document may be navigation property 'Document' or 'Documents'
        for key in ("Document", "Documents"):
            if key in zaak:
                val = zaak[key]
                if isinstance(val, list):
                    docs.extend(val)
                elif isinstance(val, dict):
                    docs.append(val)

    print("\nFound", len(docs), "Document entries")
    for i, d in enumerate(docs):
        print("\n--- Document", i, "---")
        print(json.dumps(d, indent=2)[:4000])
    # 2. Try DocumentVersie endpoint
    # We'll attempt: DocumentVersie?$filter=DocumentId eq guid'...'
    for d in docs:
        doc_id = d.get("Id") or d.get("DocumentId") or d.get("IdDocument")
        if not doc_id:
            # maybe OData provided @odata.id
            if "@odata.id" in d:
                # extract id from URI - last segment
                seg = d["@odata.id"].rstrip("/").split("/")[-1]
                doc_id = seg
        if not doc_id:
            continue
        print("\nQuerying DocumentVersie for Document id:", doc_id)
        q1 = f"DocumentVersie?$filter=DocumentId%20eq%20guid'{doc_id}'"
        j = try_get(q1)
        # also try expanding from Document
        q2 = f"Document({quote(doc_id)})?$expand=DocumentVersie"
        j2 = try_get(q2)
        # try direct DocumentVersie by key
        q3 = f"DocumentVersie(guid'{doc_id}')"
        j3 = try_get(q3)
        # 3. Try content stream patterns
        candidates = [
            f"Document({quote(doc_id)})/Content",
            f"Document({quote(doc_id)})/$value",
            f"Document({quote(doc_id)})/Inhoud",
            f"Resource('{doc_id}')",
            f"Resource({quote(doc_id)})",
        ]
        for c in candidates:
            try_get(c)


if __name__ == "__main__":
    main()

uv.lock (1246): file diff suppressed because it is too large.

@@ -0,0 +1,9 @@
# Quick inspection helper: print the schema of the motions table.
import duckdb

from config import config

conn = duckdb.connect(config.DATABASE_PATH)
result = conn.execute("PRAGMA table_info('motions')").fetchall()
for row in result:
    print(row)
conn.close()