parent 9c82962d47
commit 2efd7ba3a0
@@ -0,0 +1,34 @@
# Naming & Style Conventions

## Rules

- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Import order: stdlib, third-party, local; prefer absolute imports, grouped with a blank line between groups.
- Use black, ruff, isort, and mypy as the recommended toolchain; the repository currently lacks config for them (no black/ruff sections in pyproject.toml and no dedicated config files).

## Examples

### Function example (from pipeline/run_pipeline.py)

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```

### Class example (from database.py)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns

- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations

- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.

## Evidence pointers

- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (lines 1-400+)
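A minimal sketch of the recommended pyproject.toml tool sections. The line length, rule selection, and target version shown here are assumptions, not settings taken from the repository:

```toml
[tool.black]
line-length = 100
target-version = ["py313"]

[tool.ruff]
line-length = 100

[tool.ruff.lint]
# E = pycodestyle errors, F = pyflakes, I = import sorting (isort-compatible)
select = ["E", "F", "I"]

[tool.isort]
profile = "black"

[tool.mypy]
python_version = "3.13"
ignore_missing_imports = true
```

With ruff's `I` rules enabled, a separate isort run becomes optional; the `[tool.isort]` section is kept only for editors that invoke isort directly.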
@@ -0,0 +1,74 @@
# Database Schema (DuckDB) — extracted DDL

## Rules

- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS is used in database.py).

## Examples (DDL snippets extracted from database.py)

### motions table

```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table

```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
```

## Anti-patterns

- Broad try/except around the duckdb import (top of database.py). Acceptable for an optional dependency, but it should log the missing dependency explicitly and document the fallback behavior for tests.

## Remediations

- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and JSON-fallback database paths. Evidence: database.py contains the JSON fallback logic (lines ~1-80).

## Evidence pointers

- database.py: DDL strings and sequences (lines ~1-300 and beyond). See the CREATE TABLE blocks for motions, mp_votes, embeddings, fused_embeddings.
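The additive-migration rule above can be sketched as a helper that emits idempotent DDL instead of rewriting tables. This is a hypothetical illustration, not code from database.py; the column names are invented:

```python
# Sketch of the additive-migration rule: generate idempotent ALTER TABLE
# statements for new columns rather than recreating the table.
def additive_migration_ddl(table: str, new_columns: dict[str, str]) -> list[str]:
    """Return one ALTER TABLE ... ADD COLUMN IF NOT EXISTS statement per column."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {sql_type}"
        for name, sql_type in new_columns.items()
    ]

# Example: extend motions with two hypothetical columns.
ddl = additive_migration_ddl("motions", {"summary_nl": "TEXT", "vote_count": "INTEGER"})
```

Because every statement carries IF NOT EXISTS, re-running the migration against an already-updated database is a no-op.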
@@ -0,0 +1,22 @@
# Domain Glossary

## Rules

- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms

- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage

- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120

## Evidence pointers

- database.py: motions, mp_votes, embeddings, fused_embeddings tables
- pipeline/run_pipeline.py: window generation and pipeline phases

## Anti-patterns

- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` between database.insert_motion and the pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when converting between representations.
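The "small adapter functions" recommendation can be sketched as follows. The input keys here are hypothetical scraper field names; only the output keys are taken from the mp_votes DDL:

```python
def to_mp_vote_row(raw: dict) -> dict:
    """Adapt a raw vote record to the canonical mp_votes column names.

    `raw` uses hypothetical upstream keys; the returned dict matches the
    mp_votes columns (mp_name, party, vote, date).
    """
    return {
        "mp_name": raw.get("name") or raw.get("mp_name"),
        "party": raw.get("party"),
        "vote": raw["vote"],  # required field: fail loudly if absent
        "date": raw.get("date"),
    }

row = to_mp_vote_row({"name": "J. Jansen", "party": "X", "vote": "Voor"})
```

Keeping the mapping in one function means the canonical names appear in exactly one place per source representation.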
@@ -0,0 +1,30 @@
# Code Clusters / Organization

## Rules

- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in the DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)

```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations

- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers

- pipeline/run_pipeline.py: orchestrator and cluster boundaries
- ai_provider.py: AI adapter for embeddings and chat
- analysis/visualize.py: visualization cluster
@@ -0,0 +1,46 @@
# Design Patterns & Code Patterns

## Rules

- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py runs in phases and uses a ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (results are collected before writes).

## Examples

### Repository pattern (database.py MotionDatabase)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into database"""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)

```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
    ...
```

### Pipeline parallelism pattern (run_pipeline)

```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait, then write sequentially to DuckDB
```

## Anti-patterns

- Broad excepts in several places (database.py top-level try/except on the duckdb import, many generic excepts around DB operations) can hide real errors.

## Remediations

- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers

- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (lines ~120-260)
- database.py: MotionDatabase methods
@@ -0,0 +1,24 @@
# Anti-patterns, Issues and Recommended Fixes

## Rules

- Flagged issues discovered in Phase 1 must be remediated with concrete actions.

## Issues

- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and env keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges and no lockfile is present, which reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended: add config and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes

- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on the toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches; re-raise or log with traceback when the error is unexpected. Example locations: database.py top-level import, insert_motion's broad except, ai_provider fallback blocks.

## Evidence pointers

- pyproject.toml: dependencies list (lines 1-40)
- database.py: multiple broad except blocks (top of file and in methods)
- ai_provider.py: uses requests + env keys
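The recommended narrowing for the optional-dependency import can be sketched as a small helper. The module name and log message are illustrative; only the pattern (catch ImportError, log, return None) is the point:

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(name: str):
    """Import an optional dependency, logging (not silencing) its absence.

    Catches only ImportError, so real bugs inside the imported module's
    top-level code are not swallowed the way a bare `except Exception` would.
    """
    try:
        return importlib.import_module(name)
    except ImportError:  # narrow: only the expected failure mode
        logger.info("%s not installed; falling back to JSON file storage", name)
        return None
```

Call sites then read `duckdb = optional_import("duckdb")` and branch on `None`, making the fallback explicit at the import site.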
@@ -0,0 +1,117 @@
# Example Extractions

## Rules

- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual → "2024"
    """
```

2) database.py::append_audit_event

```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```

3) ai_provider.py::get_embedding

```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch

```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter

```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)

1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)

Create tests under tests/ named test_*.py using fixtures in conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py

```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py

```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```

3) tests/test_pipeline_windows.py

```python
from pipeline.run_pipeline import _generate_windows


def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py

```python
def test_plot_umap_scatter_no_plotly():
    # If plotly is missing, _require_plotly should raise ImportError with guidance
    import analysis.visualize as vis

    try:
        vis._require_plotly()
    except ImportError as exc:
        assert "plotly" in str(exc).lower()
```

## Evidence pointers

- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py CREATE TABLE blocks
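The _generate_windows signature and docstring above can be fleshed out into a runnable sketch. This is a plausible reimplementation for the quarterly case only, not the actual code from pipeline/run_pipeline.py, so it is given a distinct name:

```python
from datetime import date, timedelta
from typing import List, Tuple


def generate_quarterly_windows(start: date, end: date) -> List[Tuple[str, str, str]]:
    """Sketch: return (window_id, start_str, end_str) per quarter overlapping [start, end]."""
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while date(year, 3 * quarter - 2, 1) <= end:
        q_start = date(year, 3 * quarter - 2, 1)
        # Last day of the quarter: first day of the next quarter minus one day.
        if quarter == 4:
            next_start = date(year + 1, 1, 1)
        else:
            next_start = date(year, 3 * quarter + 1, 1)
        q_end = next_start - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", q_start.isoformat(), q_end.isoformat()))
        quarter += 1
        if quarter > 4:
            quarter, year = 1, year + 1
    return windows
```

Window ids match the "2024-Q1" format documented in the docstring, so the same ids can key rows in svd_vectors and fused_embeddings.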
@@ -0,0 +1,43 @@
# Stack and Dependencies

## Rules

- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)

```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes

- pytest is listed under runtime dependencies in pyproject.toml. Move pytest to dev dependencies to avoid shipping the test runner in production images. Evidence: pyproject.toml
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating a lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai is declared but no static imports were found; it is possibly unused (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai).

## Remediations

- Move test-only libraries (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai) and remove them or confirm dynamic usage.

## Evidence pointers

- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry
- database.py: duckdb table creation and connection (lines ~1-350)
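The "CI step that checks for pinned dependencies" could start as a small filter over the requirement strings. A sketch; a real CI job would read the list from pyproject.toml via tomllib rather than hard-coding it:

```python
def permissive_deps(dependencies: list[str]) -> list[str]:
    """Return requirement strings that use '>=' without any upper bound or exact pin."""
    return [
        dep for dep in dependencies
        if ">=" in dep and "<" not in dep and "==" not in dep
    ]


# Illustrative inputs: one unbounded range, one bounded range, one exact pin.
flagged = permissive_deps(["duckdb>=1.3.2", "requests>=2.32.4,<3", "streamlit==1.48.0"])
```

A CI step would then fail when `flagged` is non-empty and print the offending requirements.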
@@ -0,0 +1,5 @@
# Mindmodel constraints README

Files in .mindmodel/constraints/ are YAML-like constraint documents describing
conventions, patterns and remediation steps. Use these to guide PR reviews and
CI automation.
@@ -1,60 +1,36 @@
name: stemwijzer
version: 2
summary: >-
  Mindmodel constraints for the Stemwijzer repository (Python + Streamlit +
  DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns,
  anti-patterns and example extractions. Generated from Phase 1 analysis.
main_patterns:
  - Repository DB wrapper (MotionDatabase)
  - AI provider adapter with retry/backoff and local fallback
  - SVD + embedding fusion pipeline with windowed processing
total_files: 11
categories:
  - path: stack.yaml
    description: Project technology stack (languages, frameworks, runtime)
  - path: .mindmodel/constraints/99-stack.yaml
    description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis)
    group: stack
  - path: dependencies.yaml
    description: Declared and recommended dependencies grouped by purpose
    group: stack
  - path: system.md
    description: System overview and architecture high-level notes
    group: architecture
  - path: architecture.yaml
    description: Architectural layers, organization and confidence levels
    group: architecture
  - path: conventions.yaml
    description: Coding conventions cheat-sheet (naming, imports, types)
    group: style
  - path: domain-glossary.yaml
    description: Business domain glossary for the project
  - path: .mindmodel/constraints/01-naming.yaml
    description: Naming, import and style conventions
    group: conventions
  - path: .mindmodel/constraints/10-db-schema.yaml
    description: DuckDB schema DDL extracted from database.py
    group: database
  - path: .mindmodel/constraints/20-domain-glossary.yaml
    description: Domain glossary and terminology (motions, MP, embeddings, windows)
    group: domain
  - path: patterns/duckdb_access.yaml
    description: DuckDB access patterns, examples, and anti-patterns
    group: patterns
  - path: patterns/requests_http.yaml
    description: Requests/HTTP client usage and retry best-practices
    group: patterns
  - path: patterns/embeddings_similarity.yaml
    description: Embedding, SVD, fusion and similarity pipeline patterns
    group: patterns
  - path: patterns/error_handling.yaml
    description: Error handling patterns and rules
    group: patterns
  - path: patterns/validation.yaml
    description: Input/domain validation patterns and examples
    group: patterns
  - path: patterns/module_singletons.yaml
    description: Module-level singletons and lifecycle patterns
    group: patterns
  - path: anti-patterns.yaml
    description: Known anti-patterns and remediation steps
    group: patterns
  - path: examples/pattern-examples.md
    description: Consolidated extracted code examples across patterns
    group: patterns
  - path: constraints/naming.yaml
    description: Enforce naming rules (snake_case, PascalCase, constants)
    group: constraints
  - path: constraints/imports.yaml
    description: Enforce import grouping and ordering
    group: constraints
  - path: constraints/db_connection.yaml
    description: Rules for opening/closing DB connections and read-only usage
    group: constraints
  - path: constraints/error_handling.yaml
    description: Error handling style and allowed exception scopes
    group: constraints
  - path: constraints/testing.yaml
    description: Test conventions (pytest, test naming, fixtures)
    group: constraints
  - path: .mindmodel/constraints/30-clusters.yaml
    description: Code clusters and module organization
    group: architecture
  - path: .mindmodel/constraints/40-patterns.yaml
    description: Design patterns and coding patterns observed with examples
    group: patterns
  - path: .mindmodel/constraints/50-anti-patterns.yaml
    description: Anti-patterns, issues and recommended remediations
    group: ops
  - path: .mindmodel/constraints/60-examples.yaml
    description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs"
    group: examples
@@ -1,18 +1,14 @@
# System overview
# System Overview: Stemwijzer

This project is a Streamlit-based UI and data-processing pipeline that computes embeddings,
performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes
a similarity cache for quick lookup in the UI.
This mindmodel documents constraints, conventions and patterns for the Stemwijzer
project (Python Streamlit app with DuckDB-backed pipeline for parliamentary
motions embedding analysis).

Key subsystems:
- UI: Streamlit pages (Home.py, pages/*). Exposes interactive explorer and quizzes.
- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData).
- Processing pipelines: pipeline/* (text embeddings, SVD, fusion).
- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors.
- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb.
- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API).
Key points:
- Language: Python >=3.13
- UI: Streamlit multi-page app (Home.py, pages/)
- Storage: DuckDB with JSON fallback for tests/dev (database.py)
- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py)
- AI: ai_provider adapter uses HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and local fallback

Operational notes:
- Dockerfile exists; Streamlit default port 8501 exposed.
- Tests use pytest. CI uses Drone (.drone.yml).
- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.
Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding.
@@ -0,0 +1,67 @@
"""Simple manifest loader for mindmodel manifests.

Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`.

Behavior:
- If PyYAML is installed, uses yaml.safe_load to parse the file.
- Otherwise falls back to the stdlib json parser.
- If the top-level document is a list it will be normalized to {"constraints": <list>}.
- Raises ManifestLoadError for missing file or parse errors.
"""

from typing import Any, Dict
import json
from pathlib import Path


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be loaded or parsed."""


try:
    import yaml  # type: ignore
except ImportError:  # PyYAML not available
    yaml = None  # type: ignore


def _parse_with_yaml(text: str) -> Any:
    # yaml.safe_load may return any Python structure
    try:
        return yaml.safe_load(text)
    except Exception as exc:  # pragma: no cover - defensive
        raise ManifestLoadError(f"YAML parse error: {exc}") from exc


def _parse_with_json(text: str) -> Any:
    try:
        return json.loads(text)
    except Exception as exc:
        raise ManifestLoadError(f"JSON parse error: {exc}") from exc


def load_manifest(path: str) -> Dict[str, Any]:
    """Load a manifest from the given file path and normalize it to a dict.

    If the top-level document is a list, it will be returned as {"constraints": list}.
    Raises ManifestLoadError if the file does not exist or if parsing fails.
    """
    p = Path(path)
    if not p.exists():
        raise ManifestLoadError(f"Manifest file not found: {path}")

    text = p.read_text(encoding="utf-8")

    if yaml is not None:
        data = _parse_with_yaml(text)
    else:
        data = _parse_with_json(text)

    # Normalize
    if isinstance(data, list):
        return {"constraints": data}

    if isinstance(data, dict):
        return data

    # Unexpected top-level type, wrap it
    return {"manifest": data}
@@ -0,0 +1,21 @@
import json

import pytest

from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")

    loaded = loader.load_manifest(str(p))

    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")
@ -0,0 +1,73 @@ |
||||
--- |
||||
date: 2026-03-24 |
||||
topic: "mindmodel-generation" |
||||
status: draft |
||||
--- |
||||
|
||||
## Problem Statement |
||||
|
||||
We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions). |
||||
|
||||
## Constraints |
||||
|
||||
**Non-negotiables:** |
||||
- Keep the generated .mindmodel/ files read-only until validated. |
||||
- Do not make behavioral changes to production code in the same change as model metadata updates. |
||||
- Avoid committing secrets or lockfiles without explicit review. |
||||
|
||||
**Limitations:** |
||||
- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate. |
||||
- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative. |
||||
|
||||
## Approach |
||||
|
||||
I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) mark fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass. |
||||
|
||||
Alternatives considered: |
||||
- Accept-and-commit everything immediately (faster) — rejected because of truncated reads and potential wrong pointers. |
||||
- Manual rewrite of constraints by hand (accurate) — rejected due to time cost; validation + targeted fixes gives best ROI. |
||||
|
||||
## Architecture |
||||
|
||||
This is a documentation/metadata integration task, not a runtime service. Components: |
||||
|
||||
- **.mindmodel/**: constraint files and manifest produced by orchestrator. Source of truth for conventions and inferred patterns. |
||||
- **Validator job (CI)**: lightweight script/CI step that verifies manifest consistency, required files exist, and key evidence pointers resolve. |
||||
- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, add black/ruff/isort configs, pre-commit) that enable future automation. |
||||
|
||||
## Components |
||||
|
||||
- Constraint Validator: verifies every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets. |
||||
- Staging branch: holds small remediation commits; each commit is limited to one class of change (deps dev/prod move, linters, CI yaml). |
||||
- CI pipeline changes: add a validation job and a docs check that ensures .mindmodel/ manifest is up to date. |
||||
|
||||
## Data Flow |
||||
|
||||
1. Orchestrator output (.mindmodel/) exists in the working tree. |
||||
2. Validator runs locally or in CI to check pointers and file existence. |
||||
3. Developer reviews validator report and accepts/edits constraint files. |
||||
4. Remediation PRs are opened for low-risk fixes. |
||||
5. CI runs tests + validator; on green we merge and enable scheduled checks. |
||||
|
||||
## Error Handling

- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing the reviewer at the gap.
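
As a sketch of the "needs-review" convention, with illustrative field names (the actual manifest schema may differ), a constraint whose evidence file has moved might be marked like:

```yaml
- id: naming-snake-case
  status: needs-review   # evidence file moved or deleted; re-point before relying on this
  evidence:
    path: pipeline/run_pipeline.py
    note: "path did not resolve at validation time; see validator report"
```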
|
## Testing Strategy

- Unit: small pytest tests that assert README/pyproject presence and that the manifest YAML parses.
- Integration: a CI job that runs the Constraint Validator and fails on missing files or secrets.
- Manual: a reviewer inspects a sample of constraint files (3-5) for accuracy before merging.

## Open Questions

- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review plus a CI check, not a protected branch yet.

## Next Steps (what I'll do now)

1. Create this design doc (done).
2. Commit the design doc to the repo (doing now).
3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).

---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---

# Implementation Plan: mindmodel-generation

Goal: implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot, plus small CI and config artifacts, so the manifest can be validated and integrated incrementally and safely.

Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md

---

## Overview

This plan breaks the work into four batches: Foundation, Core, Components, and Integration/Configs. Each micro-task is small and independently testable, and tests accompany the core modules. The validator intentionally avoids reading repository secret files and scans only manifest text and evidence snippets.

## Batch 1: Foundation (parallel)

- Task 1.1: Manifest loader
  - Path: scripts/mindmodel/loader.py
  - Test: tests/scripts/mindmodel/test_loader.py
  - Behavior: load a YAML or JSON manifest, normalize it to a dict, and raise ManifestLoadError on failure
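
A minimal sketch of the Task 1.1 behavior. `ManifestLoadError` comes from the task description; the JSON branch is shown for brevity, and the real loader would also accept YAML (e.g. via PyYAML's `yaml.safe_load`):

```python
import json
from pathlib import Path


class ManifestLoadError(Exception):
    """Raised when the manifest cannot be read or parsed into a mapping."""


def load_manifest(path: str) -> dict:
    """Load a JSON manifest and normalize it to a plain dict.

    Sketch only: the real Task 1.1 loader would also try YAML before
    giving up. Any read or parse failure becomes ManifestLoadError.
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise ManifestLoadError(f"{path}: {exc}") from exc
    if not isinstance(data, dict):
        raise ManifestLoadError(f"{path}: expected a mapping at top level")
    return data
```

Normalizing to a plain dict keeps the downstream checks independent of the on-disk format.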
|
- Task 1.2: Low-level checks
  - Path: scripts/mindmodel/checks.py
  - Test: tests/scripts/mindmodel/test_checks.py
  - Behavior: file-existence checks (without opening files), truncated-snippet heuristics, and manifest-text secret heuristics
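
The Task 1.2 checks are heuristics, not guarantees. A sketch with illustrative patterns and truncation markers (the real regexes and thresholds are open to tuning):

```python
import re
from pathlib import Path

# Illustrative heuristics only; tune before relying on them.
_SECRET_RE = re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+")
_TRUNCATION_MARKERS = ("...", "\u2026", "<truncated>")


def file_exists(path: str) -> bool:
    """Check existence via stat only, without opening the file."""
    return Path(path).exists()


def looks_truncated(snippet: str) -> bool:
    """Flag evidence snippets that end in a truncation marker."""
    return snippet.rstrip().endswith(_TRUNCATION_MARKERS)


def looks_like_secret(text: str) -> bool:
    """Flag manifest text that appears to embed a credential."""
    return bool(_SECRET_RE.search(text))
```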
|
## Batch 2: Core Modules (depends on Batch 1)

- Task 2.1: Constraint Validator (core)
  - Path: scripts/mindmodel/validator.py
  - Test: tests/scripts/mindmodel/test_validator.py
  - Behavior: load the manifest, scan for secrets, verify that referenced files exist, detect truncated snippets, and produce a machine-readable report with exit codes: 0 ok, 1 warnings, 2 critical
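
A sketch of the report and exit-code contract. The manifest shape here (`constraints` entries with `evidence.path` and `evidence.snippet`) is an assumption for illustration, and the Batch 1 checks are folded inline:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Report:
    warnings: list = field(default_factory=list)
    criticals: list = field(default_factory=list)

    @property
    def exit_code(self) -> int:
        # Contract from Task 2.1: 0 ok, 1 warnings only, 2 any critical.
        if self.criticals:
            return 2
        return 1 if self.warnings else 0


def validate(manifest: dict) -> Report:
    """Check each constraint's evidence pointer; assumes an already-loaded dict."""
    report = Report()
    for constraint in manifest.get("constraints", []):
        evidence = constraint.get("evidence", {})
        path = evidence.get("path")
        if path and not Path(path).exists():
            # Missing evidence file: the constraint cannot be trusted.
            report.criticals.append(f"{constraint.get('id')}: missing file {path}")
        snippet = evidence.get("snippet", "")
        if snippet.rstrip().endswith(("...", "\u2026")):
            # Truncated read: flag for human review, not failure.
            report.warnings.append(f"{constraint.get('id')}: snippet looks truncated")
    return report
```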
|
## Batch 3: Components (depends on Batch 2)

- Task 3.1: CLI wrapper for CI and local runs
  - Path: scripts/mindmodel/cli.py
  - Test: tests/scripts/mindmodel/test_cli.py
  - Behavior: a simple wrapper delegating to the validator, callable as python -m scripts.mindmodel.cli
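
A sketch of the wrapper showing the skip-if-absent behavior the CI job relies on; the delegation to the loader and validator modules is indicated in comments:

```python
import argparse
import sys
from pathlib import Path


def main(argv=None) -> int:
    """Parse args, delegate to the validator, and return its exit code."""
    parser = argparse.ArgumentParser(prog="python -m scripts.mindmodel.cli")
    parser.add_argument("manifest", nargs="?", default=".mindmodel/manifest.yaml")
    args = parser.parse_args(argv)
    if not Path(args.manifest).exists():
        # CI contract: skip cleanly (exit 0) when no manifest is present.
        print(f"no manifest at {args.manifest}; skipping")
        return 0
    # Sketch only; the real module would delegate:
    #   manifest = loader.load_manifest(args.manifest)
    #   return validator.validate(manifest).exit_code
    return 0


if __name__ == "__main__":
    sys.exit(main())
```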
|
## Batch 4: Integration / Configs / Docs (parallel)

- Task 4.1: CI workflow to run the validator on PRs and scheduled checks
  - Path: .github/workflows/mindmodel-validate.yml
  - Behavior: run tests, then run the validator against .mindmodel/manifest.yaml if present
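
A sketch of the workflow, assuming GitHub Actions; the action pins, Python version, and requirements path are placeholders to adjust for this repo:

```yaml
name: mindmodel-validate
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"   # weekly scheduled check
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt pytest
      - run: pytest tests/scripts/mindmodel
      - run: python -m scripts.mindmodel.cli .mindmodel/manifest.yaml
```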
|
- Task 4.2: .mindmodel/ README describing the read-only policy
  - Path: .mindmodel/README.md
|
- Task 4.3: Add a minimal pre-commit config (trailing whitespace, EOF fixer, check-yaml)
  - Path: .pre-commit-config.yaml
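
A minimal config matching Task 4.3; the `rev` pin is illustrative, so pin whatever version is current when adding it:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```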
|
## Verification

- Each unit has a focused pytest test that validates its behavior.
- CI will run the validator and the tests; the validator should skip when no manifest is present.

## Implementation Checklist

- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
- [ ] Add .github/workflows/mindmodel-validate.yml
- [ ] Add .mindmodel/README.md
- [ ] Add .pre-commit-config.yaml

## Next steps

1. Create the files above in small commits (one micro-task per commit).
2. Run unit tests for each new module as it is added.
3. Open a small PR with the validator, CI workflow, and docs; ask reviewers to run the validator locally.