parent 9c82962d47
commit 2efd7ba3a0
@@ -0,0 +1,34 @@

# Naming & Style Conventions

## Rules

- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Imports order: stdlib, third-party, local; prefer absolute imports, grouped.
- Use black, ruff, isort and mypy as the recommended toolchain; the repository currently lacks config files for them (no black/ruff sections in pyproject).

## Examples

### Function example (from pipeline/run_pipeline.py)

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```
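A minimal sketch of what such a window generator might look like, limited to the quarterly case (a hypothetical implementation for illustration; the real function lives in pipeline/run_pipeline.py and also handles other granularities):

```python
from datetime import date, timedelta
from typing import List, Tuple


def generate_quarterly_windows(start: date, end: date) -> List[Tuple[str, str, str]]:
    """Return (window_id, start_str, end_str) for each quarter overlapping [start, end]."""
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while True:
        q_start = date(year, 3 * (quarter - 1) + 1, 1)
        if q_start > end:
            break
        # First day of the next quarter, minus one day, is the quarter's last day.
        next_start = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
        q_end = next_start - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", q_start.isoformat(), q_end.isoformat()))
        year, quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)
    return windows


print(generate_quarterly_windows(date(2024, 1, 1), date(2024, 6, 30)))
# [('2024-Q1', '2024-01-01', '2024-03-31'), ('2024-Q2', '2024-04-01', '2024-06-30')]
```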

### Class example (from database.py)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns

- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations

- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.
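A possible starting point for those pyproject.toml sections (illustrative values; the line length and rule selection are assumptions, not taken from the repository):

```toml
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle errors, pyflakes, import sorting

[tool.isort]
profile = "black"
```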

## Evidence pointers

- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (database.py, lines 1-400+)
@@ -0,0 +1,74 @@

# Database Schema (DuckDB) — extracted DDL

## Rules

- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ... ADD COLUMN IF NOT EXISTS is used in database.py).
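The additive-migration idea can be sketched in Python against sqlite3 as a stand-in (sqlite3 lacks `ADD COLUMN IF NOT EXISTS`, so the existence guard is explicit; with DuckDB the clause itself is enough — this is not the database.py code):

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, col_type: str) -> bool:
    """Additively add a column; return True if it was added, False if it already existed."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE motions (id INTEGER PRIMARY KEY, title TEXT)")
print(add_column_if_missing(conn, "motions", "body_text", "TEXT"))  # True
print(add_column_if_missing(conn, "motions", "body_text", "TEXT"))  # False
```

Running the migration twice is a no-op, which is the property that keeps re-runs of the pipeline safe.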

## Examples (DDL snippets extracted from database.py)

### motions table

```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table

```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

## Anti-patterns

- Broad try/except around the duckdb import (top of database.py) — acceptable for an optional dependency, but it should log the missing dependency explicitly and document the test behavior.
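One way to keep the fallback while logging it explicitly (a sketch using stdlib logging; this is not the current database.py code):

```python
import importlib
import logging

logger = logging.getLogger(__name__)

try:
    duckdb = importlib.import_module("duckdb")
except ImportError as exc:  # only the expected failure mode, not a bare Exception
    duckdb = None
    logger.info("duckdb not installed (%s); using JSON file fallback", exc)

# Downstream code can branch on this flag instead of re-trying the import.
HAS_DUCKDB = duckdb is not None
```

Catching only ImportError means a broken duckdb installation (e.g., an ABI error raising something else) surfaces loudly instead of silently degrading to JSON.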

## Remediations

- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and JSON-fallback database paths. Evidence: database.py contains the JSON fallback logic (lines ~1-80).

## Evidence pointers

- database.py: DDL strings and sequences (lines ~1-300 and further). See the CREATE TABLE blocks for motions, mp_votes, embeddings and fused_embeddings.
@@ -0,0 +1,22 @@

# Domain Glossary

## Rules

- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms

- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage

- pipeline.run_pipeline._generate_windows produces the window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120

## Evidence pointers

- database.py: motions, mp_votes, embeddings, fused_embeddings tables
- pipeline/run_pipeline.py: window generation and pipeline phases

## Anti-patterns

- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` in database.insert_motion and the pipeline extraction). Prefer canonical names that match DB columns, and use small adapter functions when converting between representations.
@@ -0,0 +1,30 @@

# Code Clusters / Organization

## Rules

- The repository organizes code into the following observed clusters:
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)

```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations

- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers

- pipeline/run_pipeline.py: orchestrator and cluster boundaries
- ai_provider.py: AI adapter for embeddings and chat
- analysis/visualize.py: visualization cluster
@@ -0,0 +1,46 @@

# Design Patterns & Code Patterns

## Rules

- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py runs in phases and uses a ThreadPoolExecutor for parallel SVD computation, with careful DuckDB connection handling (results are collected before writes).

## Examples

### Repository pattern (database.py MotionDatabase)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into the database."""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)

```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
    ...
```
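The retry loop itself can be sketched generically by taking the request as a callable, which keeps the sketch independent of requests (illustrative, not the actual ai_provider.py code; the 429/Retry-After handling would be layered on by inspecting the response before deciding to retry):

```python
import time
from typing import Any, Callable


def call_with_retries(send: Callable[[], Any], retries: int = 3, backoff: float = 0.05) -> Any:
    """Call send(); on a transient ConnectionError, retry with exponential backoff."""
    for attempt in range(retries):
        try:
            return send()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))


attempts = {"n": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(call_with_retries(flaky))  # ok
```

Note the catch is a named exception type, in line with the narrow-except remediation below.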

### Pipeline parallelism pattern (run_pipeline)

```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait, then write sequentially to DuckDB
```
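A self-contained sketch of the same shape — parallel compute, sequential write — with a stand-in work function (hypothetical; the real phase submits compute_svd_for_window and the final loop writes to DuckDB):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compute(window_id: str) -> tuple[str, int]:
    # Stand-in for compute_svd_for_window: returns (window_id, some result).
    return window_id, len(window_id)


windows = ["2024-Q1", "2024-Q2", "2024-Q3"]
results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(compute, w): w for w in windows}
    for fut in as_completed(futures):
        window_id, value = fut.result()
        results[window_id] = value  # collect in memory only

# Only now touch the database, from a single thread, in a stable order:
for window_id in windows:
    print(window_id, results[window_id])
```

Collecting results before writing sidesteps concurrent writers on a single DuckDB file, which is the point of the pattern.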

## Anti-patterns

- Broad excepts in several places (the top-level try/except on the duckdb import in database.py, many generic excepts around DB operations) — these can hide real errors.

## Remediations

- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers

- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (lines ~120-260)
- database.py: MotionDatabase methods
@@ -0,0 +1,24 @@

# Anti-patterns, Issues and Recommended Fixes

## Rules

- Issues flagged in Phase 1 must be remediated with concrete actions.

## Issues

- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production.
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges and no lockfile is present, which reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes

- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on the toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches; re-raise or log with a traceback when the error is unexpected. Example locations: the database.py top-level import, the insert_motion broad except, and the ai_provider fallback blocks.

## Evidence pointers

- pyproject.toml: dependencies list (lines 1-40)
- database.py: multiple broad except blocks (top of file and in methods)
- ai_provider.py: uses requests + env keys
@@ -0,0 +1,117 @@

# Example Extractions

## Rules

- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual    → "2024"
    """
```

2) database.py::append_audit_event

```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```

3) ai_provider.py::get_embedding

```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch

```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter

```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)

1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)

Create tests under tests/ named test_*.py, using fixtures from conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py

```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py

```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```
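For reference, a deterministic local embedding of the kind this test assumes could be built from a hash digest (a hypothetical sketch; the real _local_embedding in ai_provider.py may differ):

```python
import hashlib


def local_embedding(text: str, dim: int = 16) -> list[float]:
    """Deterministic, provider-free embedding: hashed bytes mapped into [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Repeat the 32-byte digest so any dim can be served, then normalize bytes.
    raw = (digest * (dim // len(digest) + 1))[:dim]
    return [b / 255.0 for b in raw]


v = local_embedding("hello world", dim=16)
print(len(v))  # 16
```

Determinism is the useful property here: the same text always yields the same vector, so tests can run without network access or API keys.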

3) tests/test_pipeline_windows.py

```python
from pipeline.run_pipeline import _generate_windows


def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py

```python
def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path):
    # If plotly is missing, the function should raise ImportError with guidance
    import analysis.visualize as vis

    try:
        vis._require_plotly()
    except ImportError:
        assert True
```

## Evidence pointers

- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py create table blocks
@@ -0,0 +1,43 @@

# Stack and Dependencies

## Rules

- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit>=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)

```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes

- pytest is listed under runtime dependencies in pyproject.toml. Move it to dev dependencies to avoid shipping the test runner in production images.
- Many dependencies use permissive ">=" ranges. Pin versions or generate a lockfile (poetry.lock/requirements.txt) and add upper bounds for reproducibility.
- openai is declared but no static imports were found; it is a possible unused dependency (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead).

## Remediations

- Move test-only libs (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai) and remove them or confirm dynamic usage.

## Evidence pointers

- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry
- database.py: duckdb table creation and connection (lines ~1-350)
@@ -0,0 +1,5 @@

# Mindmodel constraints README

Files in .mindmodel/constraints/ are YAML-like constraint documents describing
conventions, patterns and remediation steps. Use these to guide PR reviews and
CI automation.
@@ -1,60 +1,36 @@
 name: stemwijzer
 version: 2
+summary: >-
+  Mindmodel constraints for the Stemwijzer repository (Python + Streamlit +
+  DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns,
+  anti-patterns and example extractions. Generated from Phase 1 analysis.
+main_patterns:
+  - Repository DB wrapper (MotionDatabase)
+  - AI provider adapter with retry/backoff and local fallback
+  - SVD + embedding fusion pipeline with windowed processing
+total_files: 11
 categories:
-  - path: stack.yaml
-    description: Project technology stack (languages, frameworks, runtime)
+  - path: .mindmodel/constraints/99-stack.yaml
+    description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis)
     group: stack
-  - path: dependencies.yaml
-    description: Declared and recommended dependencies grouped by purpose
-    group: stack
-  - path: system.md
-    description: System overview and architecture high-level notes
-    group: architecture
+  - path: .mindmodel/constraints/01-naming.yaml
+    description: Naming, import and style conventions
+    group: conventions
+  - path: .mindmodel/constraints/10-db-schema.yaml
+    description: DuckDB schema DDL extracted from database.py
+    group: database
-  - path: architecture.yaml
-    description: Architectural layers, organization and confidence levels
-    group: architecture
-  - path: conventions.yaml
-    description: Coding conventions cheat-sheet (naming, imports, types)
-    group: style
-  - path: domain-glossary.yaml
-    description: Business domain glossary for the project
+  - path: .mindmodel/constraints/20-domain-glossary.yaml
+    description: Domain glossary and terminology (motions, MP, embeddings, windows)
     group: domain
-  - path: patterns/duckdb_access.yaml
-    description: DuckDB access patterns, examples, and anti-patterns
-    group: patterns
+  - path: .mindmodel/constraints/30-clusters.yaml
+    description: Code clusters and module organization
+    group: architecture
-  - path: patterns/requests_http.yaml
-    description: Requests/HTTP client usage and retry best-practices
+  - path: .mindmodel/constraints/40-patterns.yaml
+    description: Design patterns and coding patterns observed with examples
     group: patterns
-  - path: patterns/embeddings_similarity.yaml
-    description: Embedding, SVD, fusion and similarity pipeline patterns
-    group: patterns
+  - path: .mindmodel/constraints/50-anti-patterns.yaml
+    description: Anti-patterns, issues and recommended remediations
+    group: ops
-  - path: patterns/error_handling.yaml
-    description: Error handling patterns and rules
-    group: patterns
+  - path: .mindmodel/constraints/60-examples.yaml
+    description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs"
+    group: examples
-  - path: patterns/validation.yaml
-    description: Input/domain validation patterns and examples
-    group: patterns
-  - path: patterns/module_singletons.yaml
-    description: Module-level singletons and lifecycle patterns
-    group: patterns
-  - path: anti-patterns.yaml
-    description: Known anti-patterns and remediation steps
-    group: patterns
-  - path: examples/pattern-examples.md
-    description: Consolidated extracted code examples across patterns
-    group: patterns
-  - path: constraints/naming.yaml
-    description: Enforce naming rules (snake_case, PascalCase, constants)
-    group: constraints
-  - path: constraints/imports.yaml
-    description: Enforce import grouping and ordering
-    group: constraints
-  - path: constraints/db_connection.yaml
-    description: Rules for opening/closing DB connections and read-only usage
-    group: constraints
-  - path: constraints/error_handling.yaml
-    description: Error handling style and allowed exception scopes
-    group: constraints
-  - path: constraints/testing.yaml
-    description: Test conventions (pytest, test naming, fixtures)
-    group: constraints
@@ -1,18 +1,14 @@
-# System overview
+# System Overview: Stemwijzer
 
-This project is a Streamlit-based UI and data-processing pipeline that computes embeddings,
-performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes
-a similarity cache for quick lookup in the UI.
+This mindmodel documents constraints, conventions and patterns for the Stemwijzer
+project (Python Streamlit app with DuckDB-backed pipeline for parliamentary
+motions embedding analysis).
 
-Key subsystems:
-- UI: Streamlit pages (Home.py, pages/*). Exposes interactive explorer and quizzes.
-- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData).
-- Processing pipelines: pipeline/* (text embeddings, SVD, fusion).
-- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors.
-- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb.
-- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API).
+Key points:
+- Language: Python >=3.13
+- UI: Streamlit multi-page app (Home.py, pages/)
+- Storage: DuckDB with JSON fallback for tests/dev (database.py)
+- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py)
+- AI: ai_provider adapter uses HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and local fallback
 
-Operational notes:
-- Dockerfile exists; Streamlit default port 8501 exposed.
-- Tests use pytest. CI uses Drone (.drone.yml).
-- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.
+Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding.
@@ -0,0 +1,67 @@
"""Simple manifest loader for mindmodel manifests.

Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`.

Behavior:
- If PyYAML is installed, uses yaml.safe_load to parse the file.
- Otherwise falls back to the stdlib json parser.
- If the top-level document is a list it is normalized to {"constraints": <list>}.
- Raises ManifestLoadError for a missing file or parse errors.
"""

import json
from pathlib import Path
from typing import Any, Dict


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be loaded or parsed."""


try:
    import yaml  # type: ignore
except ImportError:  # YAML not available
    yaml = None  # type: ignore


def _parse_with_yaml(text: str) -> Any:
    # yaml.safe_load may return any Python structure
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:  # pragma: no cover - defensive
        raise ManifestLoadError(f"YAML parse error: {exc}") from exc


def _parse_with_json(text: str) -> Any:
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ManifestLoadError(f"JSON parse error: {exc}") from exc


def load_manifest(path: str) -> Dict[str, Any]:
    """Load a manifest from the given file path and normalize it to a dict.

    If the top-level document is a list, it is returned as {"constraints": list}.
    Raises ManifestLoadError if the file does not exist or if parsing fails.
    """
    p = Path(path)
    if not p.exists():
        raise ManifestLoadError(f"Manifest file not found: {path}")

    text = p.read_text(encoding="utf-8")

    if yaml is not None:
        data = _parse_with_yaml(text)
    else:
        data = _parse_with_json(text)

    # Normalize
    if isinstance(data, list):
        return {"constraints": data}

    if isinstance(data, dict):
        return data

    # Unexpected top-level type, wrap it
    return {"manifest": data}
@@ -0,0 +1,21 @@
import json

import pytest

from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")

    loaded = loader.load_manifest(str(p))

    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")
@ -0,0 +1,73 @@ |
|||||||
|
--- |
||||||
|
date: 2026-03-24 |
||||||
|
topic: "mindmodel-generation" |
||||||
|
status: draft |
||||||
|
--- |
||||||
|
|
||||||
|
## Problem Statement |
||||||
|
|
||||||
|
We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions). |
||||||
|
|
||||||
|
## Constraints |
||||||
|
|
||||||
|
**Non-negotiables:** |
||||||
|
- Keep the generated .mindmodel/ files read-only until validated. |
||||||
|
- Do not make behavioral changes to production code in the same change as model metadata updates. |
||||||
|
- Avoid committing secrets or lockfiles without explicit review. |
||||||
|
|
||||||
|
**Limitations:** |
||||||
|
- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate. |
||||||
|
- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative. |
||||||
|
|
||||||
|
## Approach

I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) make fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass.

Alternatives considered:

- Accept-and-commit everything immediately (faster): rejected because of truncated reads and potentially wrong pointers.
- Manual rewrite of constraints by hand (accurate): rejected due to time cost; validation plus targeted fixes gives the best ROI.
## Architecture

This is a documentation/metadata integration task, not a runtime service. Components:

- **.mindmodel/**: constraint files and manifest produced by the orchestrator. Source of truth for conventions and inferred patterns.
- **Validator job (CI)**: a lightweight script/CI step that verifies manifest consistency, that required files exist, and that key evidence pointers resolve.
- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, black/ruff/isort configs, pre-commit) that enable future automation.

## Components

- Constraint Validator: verifies that every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets.
- Staging branch: holds small remediation commits; each commit is limited to one class of change (dev/prod dependency moves, linters, CI YAML).
- CI pipeline changes: add a validation job and a docs check that ensures the .mindmodel/ manifest is up to date.
## Data Flow

1. Orchestrator output (.mindmodel/) exists in the working tree.
2. Validator runs locally or in CI to check pointers and file existence.
3. Developer reviews validator report and accepts/edits constraint files.
4. Remediation PRs are opened for low-risk fixes.
5. CI runs tests + validator; on green we merge and enable scheduled checks.
## Error Handling

- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing to the reviewer.
## Testing Strategy

- Unit: small pytest tests that assert README/pyproject presence and that the manifest YAML parses.
- Integration: a CI job that runs the Constraint Validator and fails on missing files or secrets.
- Manual: a reviewer inspects a sample of constraint files (3-5) for accuracy before merging.
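The unit layer could be sketched as below; the `manifest_parses` helper name is hypothetical, the manifest location is an assumption, and PyYAML is assumed available in the dev environment:

```python
# Hypothetical unit-test sketch for tests/scripts/mindmodel/; the helper
# name and manifest path are assumptions, and PyYAML is assumed installed.
from pathlib import Path

import yaml


def manifest_parses(path: str) -> bool:
    """True if the manifest exists and parses to a YAML mapping."""
    p = Path(path)
    if not p.is_file():
        return False
    return isinstance(yaml.safe_load(p.read_text(encoding="utf-8")), dict)


def test_manifest_parses(tmp_path):
    good = tmp_path / "manifest.yaml"
    good.write_text("constraints:\n  - id: c1\n", encoding="utf-8")
    assert manifest_parses(str(good))
    assert not manifest_parses(str(tmp_path / "missing.yaml"))
```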

## Open Questions

- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review plus a CI check, not a protected branch yet.

## Next Steps (what I'll do now)

1. Create this design doc (done).
2. Commit the design doc to the repo (doing now).
3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).
@@ -0,0 +1,76 @@
---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---

# Implementation Plan: mindmodel-generation

Goal: Implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot, plus small CI and config artifacts, to validate and integrate the manifest incrementally and safely.

Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md

---

## Overview

This plan breaks the work into four batches: Foundation, Core, Components, and Integration/Configs. Each micro-task is small and independently testable. Tests accompany core modules. The validator intentionally avoids reading repository secret files and only scans manifest text and evidence snippets.
## Batch 1: Foundation (parallel)

- Task 1.1: Manifest loader
  - Path: scripts/mindmodel/loader.py
  - Test: tests/scripts/mindmodel/test_loader.py
  - Behavior: load a YAML or JSON manifest, normalize it to a dict, raise ManifestLoadError on failure
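One possible shape for this loader, sketched under the assumption that PyYAML is available for the YAML case (the JSON path needs only the stdlib):

```python
# Hypothetical sketch of scripts/mindmodel/loader.py; names follow the plan.
import json
from pathlib import Path

try:
    import yaml  # PyYAML, assumed available; JSON-only fallback otherwise
except ImportError:
    yaml = None


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be read or parsed."""


def load_manifest(path: str) -> dict:
    p = Path(path)
    if not p.is_file():
        raise ManifestLoadError(f"manifest not found: {path}")
    text = p.read_text(encoding="utf-8")
    try:
        if p.suffix in (".yaml", ".yml") and yaml is not None:
            data = yaml.safe_load(text)
        else:
            data = json.loads(text)
    except Exception as exc:
        raise ManifestLoadError(f"cannot parse {path}: {exc}") from exc
    if not isinstance(data, dict):
        raise ManifestLoadError(f"manifest root must be a mapping: {path}")
    return data
```

Normalizing both formats to a plain dict keeps downstream checks format-agnostic.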

- Task 1.2: Low-level checks
  - Path: scripts/mindmodel/checks.py
  - Test: tests/scripts/mindmodel/test_checks.py
  - Behavior: file existence (without opening), truncated-snippet heuristics, manifest-text secret heuristics
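A sketch of what these checks might look like; the truncation threshold and the secret regex are illustrative guesses, not the repository's actual heuristics:

```python
# Hypothetical sketch of scripts/mindmodel/checks.py.
import os
import re

# Crude, illustrative secret pattern; a real validator would use a vetted list.
SECRET_RE = re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*\S+")


def file_exists(path: str) -> bool:
    """Existence check only; never opens the file."""
    return os.path.isfile(path)


def looks_truncated(snippet: str) -> bool:
    """Heuristic: snippets cut mid-read often end in an ellipsis or run long."""
    s = snippet.rstrip()
    return s.endswith(("...", "…")) or len(s) >= 2000


def find_secret_like(text: str) -> list[str]:
    """Return manifest-text lines matching the crude secret pattern."""
    return [line for line in text.splitlines() if SECRET_RE.search(line)]
```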

## Batch 2: Core Modules (depends on Batch 1)

- Task 2.1: Constraint Validator (core)
  - Path: scripts/mindmodel/validator.py
  - Test: tests/scripts/mindmodel/test_validator.py
  - Behavior: load manifest, scan for secrets, verify referenced files exist, detect truncated snippets, produce a machine-readable report and exit codes: 0 ok, 1 warnings, 2 critical
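A minimal sketch of the core check loop and the exit-code mapping; the manifest field names (`constraints`, `evidence_path`, `evidence_snippet`) are assumptions about the orchestrator's output shape:

```python
# Hypothetical sketch of scripts/mindmodel/validator.py.
import os

# Exit-code contract from the plan: 0 ok, 1 warnings, 2 critical.
EXIT_OK, EXIT_WARN, EXIT_CRITICAL = 0, 1, 2


def validate(manifest: dict) -> dict:
    """Produce a machine-readable report; field names here are assumptions."""
    warnings, critical = [], []
    for c in manifest.get("constraints", []):
        cid = c.get("id", "<unknown>")
        path = c.get("evidence_path")
        if path and not os.path.isfile(path):
            critical.append(f"{cid}: referenced file missing: {path}")
        snippet = c.get("evidence_snippet", "")
        if snippet.rstrip().endswith("..."):
            warnings.append(f"{cid}: evidence snippet looks truncated")
    code = EXIT_CRITICAL if critical else (EXIT_WARN if warnings else EXIT_OK)
    return {"critical": critical, "warnings": warnings, "exit_code": code}
```

Critical findings (missing files) outrank warnings (truncated snippets), so CI can gate on exit code 2 while merely reporting 1.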

## Batch 3: Components (depends on Batch 2)

- Task 3.1: CLI wrapper for CI and local runs
  - Path: scripts/mindmodel/cli.py
  - Test: tests/scripts/mindmodel/test_cli.py
  - Behavior: simple wrapper delegating to validator; callable as python -m scripts.mindmodel.cli
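One way the wrapper could look; this sketch only shows the argparse shell and the skip-when-missing behavior, with the delegation to the (hypothetical) loader and validator modules left as a comment:

```python
# Hypothetical sketch of scripts/mindmodel/cli.py.
import argparse
import os
import sys


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="mindmodel-validate")
    parser.add_argument("manifest", nargs="?", default=".mindmodel/manifest.yaml")
    args = parser.parse_args(argv)
    if not os.path.isfile(args.manifest):
        # Per the Verification section, a missing manifest is a skip, not a failure.
        print(f"no manifest at {args.manifest}; skipping", file=sys.stderr)
        return 0
    # A real implementation would call loader.load_manifest(args.manifest),
    # pass the result to validator.validate, and return the report's exit code.
    return 0
```

A `python -m scripts.mindmodel.cli` entry point would just call `raise SystemExit(main())` under the usual `__main__` guard.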

## Batch 4: Integration / Configs / Docs (parallel)

- Task 4.1: CI workflow to run validator on PRs and scheduled checks
  - Path: .github/workflows/mindmodel-validate.yml
  - Behavior: run tests, then run validator against .mindmodel/manifest.yaml if present
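A sketch of that workflow under assumed action versions, Python version, and a weekly schedule; adjust the install step to match the repository's actual dependency setup:

```yaml
# Illustrative sketch of .github/workflows/mindmodel-validate.yml.
name: mindmodel-validate
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"  # weekly scheduled check (illustrative)
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # assumed; match the repo's version
      - run: pip install pytest pyyaml
      - run: pytest -q
      - name: Run mindmodel validator if manifest present
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            python -m scripts.mindmodel.cli .mindmodel/manifest.yaml
          else
            echo "no manifest; skipping"
          fi
```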

- Task 4.2: .mindmodel/ README describing read-only policy
  - Path: .mindmodel/README.md
- Task 4.3: Add a minimal pre-commit config (trailing whitespace, eof fixer, check-yaml)
  - Path: .pre-commit-config.yaml
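A minimal config matching the three hooks named above; the pinned `rev` is illustrative and should be refreshed with `pre-commit autoupdate`:

```yaml
# Illustrative sketch of .pre-commit-config.yaml.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # illustrative pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```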

## Verification

- Each unit has a focused pytest test to validate behavior.
- CI will run the validator and the tests; the validator should skip if no manifest is present.
## Implementation Checklist

- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
- [ ] Add .github/workflows/mindmodel-validate.yml
- [ ] Add .mindmodel/README.md
- [ ] Add .pre-commit-config.yaml
## Next steps

1. Create the files above in small commits (one micro-task per commit).
2. Run the unit tests for each new module as it is added.
3. Open a small PR with the validator, CI workflow, and docs; ask reviewers to run the validator locally.