feat(mindmodel): add manifest loader and tests

main
Sven Geboers 1 month ago
parent 9c82962d47
commit 2efd7ba3a0
  1. .mindmodel/constraints/01-naming.yaml (34 lines changed)
  2. .mindmodel/constraints/10-db-schema.yaml (74 lines changed)
  3. .mindmodel/constraints/20-domain-glossary.yaml (22 lines changed)
  4. .mindmodel/constraints/30-clusters.yaml (30 lines changed)
  5. .mindmodel/constraints/40-patterns.yaml (46 lines changed)
  6. .mindmodel/constraints/50-anti-patterns.yaml (24 lines changed)
  7. .mindmodel/constraints/60-examples.yaml (117 lines changed)
  8. .mindmodel/constraints/99-stack.yaml (43 lines changed)
  9. .mindmodel/constraints/README.md (5 lines changed)
  10. .mindmodel/manifest.yaml (86 lines changed)
  11. .mindmodel/system.md (26 lines changed)
  12. ARCHITECTURE.md (93 lines changed)
  13. scripts/mindmodel/loader.py (67 lines changed)
  14. tests/scripts/mindmodel/test_loader.py (21 lines changed)
  15. thoughts/ledgers/audit_events.json (93 lines changed)
  16. thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md (73 lines changed)
  17. thoughts/shared/plans/2026-03-24-mindmodel-generation.md (76 lines changed)

@@ -0,0 +1,34 @@
# Naming & Style Conventions
## Rules
- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Imports order: stdlib, third-party, local; prefer absolute imports and grouped.
- Use black, ruff, isort and mypy as the recommended toolchain; the repository currently lacks configuration for them (no black/ruff config files or pyproject tool sections).
## Examples
### Function example (from pipeline/run_pipeline.py)
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```
### Class example (from database.py)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```
## Anti-patterns
- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.
## Remediations
- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run ruff/black CI lint step.
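The missing tool configuration could live in pyproject.toml. A minimal sketch — the section names follow each tool's documented pyproject support, but the specific line length, rule selection, and version targets here are assumptions, not repository settings:

```toml
[tool.black]
line-length = 100
target-version = ["py313"]

[tool.isort]
profile = "black"

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]

[tool.mypy]
python_version = "3.13"
ignore_missing_imports = true
```

Setting `profile = "black"` keeps isort and black from fighting over import formatting, and ruff's "I" rules can eventually replace isort entirely if preferred.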
## Evidence pointers
- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (file database.py lines 1-400+)

@@ -0,0 +1,74 @@
# Database Schema (DuckDB) — extracted DDL
## Rules
- Use DuckDB for persistent storage when available; fallback to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS used in database.py).
## Examples (DDL snippets extracted from database.py)
### motions table
```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```
### mp_votes table
```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```
### embeddings / fused_embeddings
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```
## Anti-patterns
- Broad try/except around the duckdb import (top of database.py) — acceptable for an optional dependency, but it should log the missing dependency explicitly and document the test behavior.
## Remediations
- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80).
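The schema_version idea above can be sketched as a minimal migration runner. This is an illustrative design, not code from the repository; it uses sqlite3 from the stdlib as a stand-in so the sketch is self-contained, but duckdb's `connect(...).execute(...)` API has the same shape.

```python
import sqlite3

# Ordered, additive-only migrations (illustrative DDL, not the repository's schema)
MIGRATIONS = [
    (1, "CREATE TABLE IF NOT EXISTS motions (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"),
    (2, "ALTER TABLE motions ADD COLUMN controversy_score FLOAT"),
]


def migrate(con) -> int:
    """Apply pending migrations in order and return the resulting schema version."""
    con.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = con.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, ddl in MIGRATIONS:
        if version > current:
            con.execute(ddl)
            con.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            current = version
    return current
```

Because each migration is recorded, re-running `migrate` on an up-to-date database is a no-op, which makes application deterministic.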
## Evidence pointers
- database.py: DDL strings and sequences (file: database.py lines ~1-300 and further). See create table blocks for motions, mp_votes, embeddings, fused_embeddings.

@@ -0,0 +1,22 @@
# Domain Glossary
## Rules
- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.
## Terms
- Motion: parliamentary motion stored in `motions` table. Evidence: database.py CREATE TABLE motions (file: database.py lines ~40-110)
- MP (Member of Parliament): individual with votes stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in `embeddings` table; fused vectors in `fused_embeddings`.
- SVD vector: reduced-dimensional vectors stored in `svd_vectors` table.
- Window: time window identifier (e.g., "2024-Q1") used across SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score
## Examples / Usage
- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120
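The quarterly window-id scheme can be sketched independently of the repository code. This is an illustrative reimplementation of just the id enumeration (an assumption — the real `_generate_windows` also returns start/end date strings per window):

```python
from datetime import date


def quarterly_window_ids(start: date, end: date) -> list[str]:
    """Enumerate quarterly window ids like "2024-Q1" covering [start, end]."""
    ids = []
    y, q = start.year, (start.month - 1) // 3 + 1
    end_y, end_q = end.year, (end.month - 1) // 3 + 1
    while (y, q) <= (end_y, end_q):
        ids.append(f"{y}-Q{q}")
        q += 1
        if q > 4:
            y, q = y + 1, 1
    return ids
```

For example, a range spanning January through June 2024 yields the ids "2024-Q1" and "2024-Q2".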
## Evidence pointers
- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py)
- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py)
## Anti-patterns
- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations.
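The adapter-function idea above can be sketched as a small pure function. The `{party: vote}` shape for `mp_vote_parties` is an assumption for illustration, not verified against the repository:

```python
def to_mp_votes(motion_id: int, mp_vote_parties: dict[str, str]) -> list[dict]:
    """Adapt a {party: vote} mapping to rows matching the mp_votes columns."""
    return [
        {"motion_id": motion_id, "party": party, "vote": vote}
        for party, vote in mp_vote_parties.items()
    ]
```

Keeping the conversion in one place means the canonical DB column names appear exactly once, instead of being re-spelled at each call site.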

@@ -0,0 +1,30 @@
# Code Clusters / Organization
## Rules
- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers
## Examples
### Pipeline orchestrator (cluster: CLI & pipeline)
```python
from database import MotionDatabase
db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```
## Remediations
- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about optional duckdb dependency and JSON fallback for tests.
## Evidence pointers
- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py)
- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py)
- analysis/visualize.py: visualization cluster (file: analysis/visualize.py)

@@ -0,0 +1,46 @@
# Design Patterns & Code Patterns
## Rules
- Use repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and local fallback.
- Pipeline orchestration: run_pipeline.py uses phases, ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (collect results before writes).
## Examples
### Repository pattern (database.py MotionDatabase)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into database"""
        # uses duckdb.connect and parameterized queries
```
### Provider adapter with retries (ai_provider.py)
```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
```
### Pipeline parallelism pattern (run_pipeline)
```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait, then write sequentially to DuckDB
```
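The fenced fragment above is partial; a self-contained sketch of the same "compute in parallel, write sequentially" shape follows. The compute function here is a dummy stand-in for `compute_svd_for_window`, and the names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_compute(window_id: str) -> tuple[str, int]:
    # Stand-in for compute_svd_for_window: any pure computation per window
    return window_id, len(window_id)


def run_windows(windows: list[str], max_workers: int = 4) -> dict[str, int]:
    """Compute per-window results in a pool, then collect them in one thread."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_compute, w): w for w in windows}
        for fut in as_completed(futures):
            window_id, value = fut.result()
            # In the real pipeline this is where the single-writer DuckDB
            # write would happen, after the parallel compute has finished.
            results[window_id] = value
    return results
```

Collecting results on the main thread keeps all database writes on a single connection, which is the careful-connection-handling point the rule makes.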
## Anti-patterns
- Broad excepts used in several places (database.py top-level try/except on duckdb import, many generic excepts around DB operations) — can hide real errors.
## Remediations
- Replace broad except Exception with targeted exceptions and explicit logging. Where fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with clear message and include guidance in CONTRIBUTING.md.
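The intended-fallback case can be made explicit with a tiny helper. This is an illustrative sketch, not the repository's code:

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(name: str):
    """Import an optional dependency, logging clearly when it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Narrow catch: only a missing module triggers the fallback path
        logger.info("Optional dependency %r not installed; using fallback.", name)
        return None
```

Catching only ImportError means a genuine bug inside the dependency's own import-time code still surfaces instead of being silently swallowed.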
## Evidence pointers
- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260)
- database.py: MotionDatabase methods (file: database.py)

@@ -0,0 +1,24 @@
# Anti-patterns, Issues and Recommended Fixes
## Rules
- Flagged issues discovered in Phase 1 must be remediated with concrete actions.
## Issues
- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and env keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges; no lockfile present. This reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps.
- Broad except Exception used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.
## Remediations / Recommended fixes
- Move pytest from runtime dependencies to dev-dependencies in pyproject.toml.
  - Suggested patch: put it under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on toolchain.
- Audit `openai` usage. If unused, remove from pyproject.toml. If dynamically imported in runtime, add a small shim or explicit lazy import with documented env var.
- Pin critical dependencies or add upper bounds; generate lockfile (poetry.lock or pip-tools requirements.txt). Add CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add CI lint stage.
- Replace broad except Exception with narrower catches and re-raise or log with traceback when unexpected. Example locations: database.py top import, insert_motion broad except, ai_provider fallback blocks.
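One possible shape for the narrower handling (illustrative; the actual exception types to catch depend on the call site — duckdb, for example, raises its own `duckdb.Error` hierarchy):

```python
import logging

logger = logging.getLogger(__name__)


def safe_insert(write, known_errors=(ValueError, KeyError)) -> bool:
    """Catch only known failure modes with a traceback; let the rest propagate."""
    try:
        write()
    except known_errors as exc:
        logger.warning("insert failed: %s", exc, exc_info=True)
        return False
    return True
```

Anything outside `known_errors` (for instance a programming error) propagates to the caller, so unexpected bugs fail loudly instead of returning False.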
## Evidence pointers
- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40)
- database.py: multiple broad except blocks (file: database.py top and methods)
- ai_provider.py: uses requests + env keys (file: ai_provider.py)

@@ -0,0 +1,117 @@
# Example Extractions
## Rules
- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.
## (a) Function signatures with docstrings (5 examples)
1) pipeline/run_pipeline.py::_generate_windows
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
      quarterly → "2024-Q1", "2024-Q2", …
      annual → "2024"
    """
```
2) database.py::append_audit_event
```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```
3) ai_provider.py::get_embedding
```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```
4) ai_provider.py::get_embeddings_batch
```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```
5) analysis/visualize.py::plot_umap_scatter
```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```
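A batched API like (4) above typically chunks its input before issuing calls; a minimal batching-helper sketch (an assumption — the repository's internal chunking is not shown here):

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive chunks of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each chunk would then be sent as one provider request, so a list of 120 texts with batch_size=50 becomes three calls of 50, 50 and 20.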
## (b) SQL / DDL snippets (3 examples inferred from database.py)
1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)
2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes
3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings
## (c) Pytest stubs (4 sample tests matching conventions)
Create tests under tests/ named test_*.py using fixtures in conftest.py. Examples below are stubs to add.
1) tests/test_database_basic.py
```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase
    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```
2) tests/test_ai_provider.py
```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding
    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```
3) tests/test_pipeline_windows.py
```python
from pipeline.run_pipeline import _generate_windows

def test_generate_quarterly_windows():
    from datetime import date
    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```
4) tests/test_visualize_plot.py
```python
def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path):
    # If plotly is missing, the function should raise ImportError with guidance
    import analysis.visualize as vis
    try:
        vis._require_plotly()
    except ImportError:
        assert True
```
## Evidence pointers
- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py create table blocks

@@ -0,0 +1,43 @@
# Stack and Dependencies
## Rules
- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/
## Examples
### pyproject dependencies (evidence: pyproject.toml)
```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```
## Anti-patterns / Notes
- pytest is listed under runtime dependencies in pyproject.toml (line: dependencies). Move pytest to dev-dependencies to avoid shipping test runner in production images. Evidence: pyproject.toml
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai appears declared but no static imports were found; it is a possible unused dependency (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of the openai package).
## Remediations
- Move test-only libs (pytest) to dev-dependencies in pyproject.toml.
- Add lockfile and CI step to check for pinned dependencies.
- Audit declared but unused packages (openai) and remove or confirm dynamic usage.
## Evidence pointers
- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry (file: Home.py)
- database.py: duckdb table creation and connection (file: database.py lines ~1-350)

@@ -0,0 +1,5 @@
# Mindmodel constraints README
Files in .mindmodel/constraints/ are YAML-like constraint documents describing
conventions, patterns and remediation steps. Use these to guide PR reviews and
CI automation.

@@ -1,60 +1,36 @@
name: stemwijzer
version: 2
summary: >-
  Mindmodel constraints for the Stemwijzer repository (Python + Streamlit +
  DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns,
  anti-patterns and example extractions. Generated from Phase 1 analysis.
main_patterns:
- Repository DB wrapper (MotionDatabase)
- AI provider adapter with retry/backoff and local fallback
- SVD + embedding fusion pipeline with windowed processing
total_files: 11
categories:
  - path: .mindmodel/constraints/99-stack.yaml
    description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis)
    group: stack
  - path: .mindmodel/constraints/01-naming.yaml
    description: Naming, import and style conventions
    group: conventions
  - path: .mindmodel/constraints/10-db-schema.yaml
    description: DuckDB schema DDL extracted from database.py
    group: database
  - path: .mindmodel/constraints/20-domain-glossary.yaml
    description: Domain glossary and terminology (motions, MP, embeddings, windows)
    group: domain
  - path: .mindmodel/constraints/30-clusters.yaml
    description: Code clusters and module organization
    group: architecture
  - path: .mindmodel/constraints/40-patterns.yaml
    description: Design patterns and coding patterns observed with examples
    group: patterns
  - path: .mindmodel/constraints/50-anti-patterns.yaml
    description: Anti-patterns, issues and recommended remediations
    group: ops
  - path: .mindmodel/constraints/60-examples.yaml
    description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs"
    group: examples

@@ -1,18 +1,14 @@
# System Overview: Stemwijzer
This mindmodel documents constraints, conventions and patterns for the Stemwijzer
project (a Python Streamlit app with a DuckDB-backed pipeline for parliamentary
motion embedding analysis).
Key points:
- Language: Python >=3.13
- UI: Streamlit multi-page app (Home.py, pages/)
- Storage: DuckDB with JSON fallback for tests/dev (database.py)
- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py)
- AI: the ai_provider adapter uses an HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and a local fallback
Operational notes:
- Dockerfile exists; Streamlit default port 8501 exposed.
- Tests use pytest. CI uses Drone (.drone.yml).
- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.
Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding.

@@ -1,14 +1,11 @@
# ARCHITECTURE

## Overview
- Small Python project that collects, stores and presents Dutch parliamentary motions (Tweede Kamer). It ingests votes (OData API or HTML scraping), stores motions in a DuckDB file, generates short human summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results.
## Tech stack
- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db), ibis used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
@@ -18,9 +15,10 @@ Tech stack
- LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config)
- Packaging: pyproject.toml present
## Top-level layout (annotated)
./
- app.py — Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py — minimal CLI entry / small script
- database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
@@ -37,50 +35,41 @@ Top-level layout (annotated)
- pyproject.toml — project metadata / dependencies
- .env — environment variables (not printed here)
## Core components
- Streamlit UI (app.py)
- Presents the voting UI, reads filtered motions from database, creates sessions, writes user votes
- Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(), database.calculate_party_matches(), summarizer.update_motion_summaries()
- Storage (database.py)
- MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
- Exposes a module-level instance `db = MotionDatabase()` used across the codebase
- Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches
- Ingestion (api_client.py + scraper.py)
- api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
- scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
- Both provide structured motion dicts consumed by database.insert_motion()
- Summarization (summarizer.py)
- Wraps an OpenAI-compatible client to produce short layman explanations and persists them to DB
- Reads motions without layman_explanation and updates rows
- Orchestration (scheduler.py)
- Runs initial historical ingestion and schedules periodic updates (using schedule)
- Calls API client and summarizer and writes to the database
## Data flow (high level)
1. Ingestion
- scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
- Each produced motion dict is passed to MotionDatabase.insert_motion()
- insert_motion writes to DuckDB (data/motions.db)
2. Enrichment
- summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client (openai.OpenAI) and writes summary text back to the DB
3. Presentation / Interaction
- app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
- Users vote; app.py writes votes into the database via db.update_user_vote()
- app.py calls db.calculate_party_matches() to compute match percentages for parties
## External integrations & dependencies
- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
@@ -89,8 +78,8 @@ External integrations & dependencies
- Streamlit for UI
- OpenAI-compatible LLM client (summarizer.py) — configured with environment variables in config.py
## Configuration
- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
- config.DATABASE_PATH (default "data/motions.db")
- OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
@@ -99,26 +88,24 @@ Configuration
- .env file present at repo root (do not commit secrets). See .env.example if present (none observed).
- Packaging metadata: pyproject.toml
## Build, run & development notes
- Install dependencies via the project's Python packaging (pyproject.toml). No Dockerfile or CI workflows were detected in the repository.
- Use uv add and uv run to manage dependencies in this directory and to run scripts.
- Streamlit app: run `uv run streamlit run app.py` from the project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: run scheduler.run_once() (script or import) or run scheduler.run_scheduler() for periodic ingestion.
## Tests
- There is no test suite using pytest / unittest. One ad-hoc script `test.py` exists for manual insert verification.
## Notes / caveats
- Project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py, scraper.py). Logging is not centralized (print statements used).
## Where to look first (for contributors)
- app.py — follow the UI flow and see how votes & sessions are used
- database.py — core data model and calculations
- api_client.py — OData ingestion logic

@@ -0,0 +1,67 @@
"""Simple manifest loader for mindmodel manifests.

Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`.

Behavior:
- If PyYAML is installed, uses yaml.safe_load to parse the file.
- Otherwise falls back to the stdlib json parser.
- If the top-level document is a list it will be normalized to {"constraints": <list>}.
- Raises ManifestLoadError for missing file or parse errors.
"""
from typing import Any, Dict
import json
from pathlib import Path


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be loaded or parsed."""


try:
    import yaml  # type: ignore
except Exception:  # YAML not available
    yaml = None  # type: ignore


def _parse_with_yaml(text: str) -> Any:
    # yaml.safe_load may return any Python structure
    try:
        return yaml.safe_load(text)
    except Exception as exc:  # pragma: no cover - defensive
        raise ManifestLoadError(f"YAML parse error: {exc}") from exc


def _parse_with_json(text: str) -> Any:
    try:
        return json.loads(text)
    except Exception as exc:
        raise ManifestLoadError(f"JSON parse error: {exc}") from exc


def load_manifest(path: str) -> Dict[str, Any]:
    """Load a manifest from the given file path and normalize it to a dict.

    If the top-level document is a list, it will be returned as {"constraints": list}.
    Raises ManifestLoadError if the file does not exist or if parsing fails.
    """
    p = Path(path)
    if not p.exists():
        raise ManifestLoadError(f"Manifest file not found: {path}")
    text = p.read_text(encoding="utf-8")
    if yaml is not None:
        data = _parse_with_yaml(text)
    else:
        data = _parse_with_json(text)
    # Normalize
    if isinstance(data, list):
        return {"constraints": data}
    if isinstance(data, dict):
        return data
    # Unexpected top-level type, wrap it
    return {"manifest": data}

@@ -0,0 +1,21 @@
import json
import pytest
from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")
    loaded = loader.load_manifest(str(p))
    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")

@ -545,5 +545,98 @@
"target_id": null,
"metadata": {},
"created_at": "2026-03-23T22:52:47.836920Z"
},
{
"id": "de3394a0-8c8e-4282-8369-f53aa957fd46",
"actor_id": null,
"action": "embedding_failed",
"target_type": "motion",
"target_id": "99",
"metadata": {
"error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")"
},
"created_at": "2026-03-24T19:08:06.647810Z"
},
{
"id": "8491ed90-9314-41a9-9d02-092a5d0bebd5",
"actor_id": null,
"action": "test_action",
"target_type": "unit",
"target_id": "u1",
"metadata": {
"k": 1
},
"created_at": "2026-03-24T19:08:08.085618Z"
},
{
"id": "ae7c88e5-ba28-4012-8991-c58fea9c0778",
"actor_id": null,
"action": "another_action",
"target_type": "motion",
"target_id": null,
"metadata": {},
"created_at": "2026-03-24T19:08:08.131631Z"
},
{
"id": "b73e6bf8-2b66-43bf-ad9c-e92d34ae38db",
"actor_id": null,
"action": "embedding_failed",
"target_type": "motion",
"target_id": "99",
"metadata": {
"error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")"
},
"created_at": "2026-03-24T19:18:02.854710Z"
},
{
"id": "3a6bf0e0-9f07-477d-9079-715d8c0f39c4",
"actor_id": null,
"action": "test_action",
"target_type": "unit",
"target_id": "u1",
"metadata": {
"k": 1
},
"created_at": "2026-03-24T19:18:05.512388Z"
},
{
"id": "75d9e229-78e6-439e-8095-c01ba7830de9",
"actor_id": null,
"action": "another_action",
"target_type": "motion",
"target_id": null,
"metadata": {},
"created_at": "2026-03-24T19:18:05.557773Z"
},
{
"id": "d45fc116-47be-4486-ba5c-ab2edd7f7e76",
"actor_id": null,
"action": "embedding_failed",
"target_type": "motion",
"target_id": "99",
"metadata": {
"error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")"
},
"created_at": "2026-03-24T19:28:43.867346Z"
},
{
"id": "b4ead1cd-58b1-4ff6-aa73-c77ab09ba063",
"actor_id": null,
"action": "test_action",
"target_type": "unit",
"target_id": "u1",
"metadata": {
"k": 1
},
"created_at": "2026-03-24T19:28:45.051895Z"
},
{
"id": "463bfa1b-59fe-4fd3-a8dd-b39674948656",
"actor_id": null,
"action": "another_action",
"target_type": "motion",
"target_id": null,
"metadata": {},
"created_at": "2026-03-24T19:28:45.097703Z"
}
]

@ -0,0 +1,73 @@
---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---
## Problem Statement
We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions).
## Constraints
**Non-negotiables:**
- Keep the generated .mindmodel/ files read-only until validated.
- Do not make behavioral changes to production code in the same change as model metadata updates.
- Avoid committing secrets or lockfiles without explicit review.
**Limitations:**
- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate.
- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative.
## Approach
I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) make fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass.
Alternatives considered:
- Accept-and-commit everything immediately (faster) — rejected because of truncated reads and potential wrong pointers.
- Manual rewrite of constraints by hand (accurate) — rejected due to time cost; validation + targeted fixes gives best ROI.
## Architecture
This is a documentation/metadata integration task, not a runtime service. Components:
- **.mindmodel/**: constraint files and manifest produced by orchestrator. Source of truth for conventions and inferred patterns.
- **Validator job (CI)**: lightweight script/CI step that verifies manifest consistency, required files exist, and key evidence pointers resolve.
- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, add black/ruff/isort configs, pre-commit) that enable future automation.
## Components
- Constraint Validator: verifies every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets.
- Staging branch: holds small remediation commits; each commit is limited to one class of change (deps dev/prod move, linters, CI yaml).
- CI pipeline changes: add a validation job and a docs check that ensures .mindmodel/ manifest is up to date.
## Data Flow
1. Orchestrator output (.mindmodel/) exists in the working tree.
2. Validator runs locally or in CI to check pointers and file existence.
3. Developer reviews validator report and accepts/edits constraint files.
4. Remediation PRs are opened for low-risk fixes.
5. CI runs tests + validator; on green we merge and enable scheduled checks.
## Error Handling
- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing to the reviewer.
## Testing Strategy
- Unit: small pytest tests that assert README/pyproject presence and that manifest YAML parses.
- Integration: CI job that runs the Constraint Validator and fails on missing files or secrets.
- Manual: reviewer inspects a sample of constraint files (3-5) for accuracy before merging.
## Open Questions
- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review + CI check, not protected branch yet.
## Next Steps (what I'll do now)
1. Create this design doc (done).
2. Commit the design doc to the repo (doing now).
3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).

@ -0,0 +1,76 @@
---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---
# Implementation Plan: mindmodel-generation
Goal: Implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot plus small CI / config artifacts to validate and integrate the manifest incrementally and safely.
Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md
---
## Overview
This plan breaks work into four batches: Foundation, Core, Components, Integration/Configs. Each micro-task is small and independently testable. Tests accompany core modules. The validator intentionally avoids reading repository secret files and only scans manifest text and evidence snippets.
## Batch 1: Foundation (parallel)
- Task 1.1: Manifest loader
- Path: scripts/mindmodel/loader.py
- Test: tests/scripts/mindmodel/test_loader.py
- Behavior: load YAML or JSON manifest, normalize to dict, raise ManifestLoadError on failure
- Task 1.2: Low-level checks
- Path: scripts/mindmodel/checks.py
- Test: tests/scripts/mindmodel/test_checks.py
- Behavior: file existence (without opening), truncated-snippet heuristics, manifest-text secret heuristics
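
The Task 1.2 checks could look roughly like the sketch below. The function names, truncation markers, and secret patterns are all assumptions for illustration, not the final checks.py API:

```python
import re
from pathlib import Path

# Illustrative helpers for Task 1.2; names and heuristics are assumptions,
# not the final checks.py API.

TRUNCATION_MARKERS = ("...", "…", "<truncated>")
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]")


def file_exists(path: str, repo_root: str = ".") -> bool:
    # Existence check only -- the plan forbids opening referenced files.
    return (Path(repo_root) / path).exists()


def looks_truncated(snippet: str) -> bool:
    # Heuristic: evidence snippets ending in a truncation marker need review.
    return snippet.rstrip().endswith(TRUNCATION_MARKERS)


def may_contain_secret(text: str) -> bool:
    # Heuristic scan of the manifest text only, never of repository files.
    return bool(SECRET_PATTERN.search(text))
```

Keeping these heuristics report-only matches the design doc's preference for safety over auto-fixing.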
## Batch 2: Core Modules (depends on Batch 1)
- Task 2.1: Constraint Validator (core)
- Path: scripts/mindmodel/validator.py
- Test: tests/scripts/mindmodel/test_validator.py
  - Behavior: load manifest, scan for secrets, verify referenced files exist, detect truncated snippets, produce a machine-readable report; exit codes: 0 ok, 1 warnings, 2 critical
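
The exit-code policy for Task 2.1 can be sketched as follows; the report field names ("critical", "warnings") are assumptions about the eventual report shape:

```python
# Sketch of the validator's exit-code mapping; field names are illustrative.
EXIT_OK, EXIT_WARNINGS, EXIT_CRITICAL = 0, 1, 2


def exit_code(report: dict) -> int:
    # Critical findings (secrets, missing referenced files) outrank warnings
    # (e.g. truncated snippets); an empty report is a pass.
    if report.get("critical"):
        return EXIT_CRITICAL
    if report.get("warnings"):
        return EXIT_WARNINGS
    return EXIT_OK


report = {
    "manifest": ".mindmodel/manifest.yaml",
    "critical": [],
    "warnings": [{"check": "truncated_snippet", "constraint": "c1"}],
}
```

With this report, `exit_code(report)` returns 1, which CI can treat as non-blocking per the design's error-handling policy.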
## Batch 3: Components (depends on Batch 2)
- Task 3.1: CLI wrapper for CI and local runs
- Path: scripts/mindmodel/cli.py
- Test: tests/scripts/mindmodel/test_cli.py
- Behavior: simple wrapper delegating to validator; callable as python -m scripts.mindmodel.cli
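
A minimal shape for the Task 3.1 wrapper might be the following; `run_validator` here is a stand-in for the real delegation into scripts/mindmodel/validator.py, which this sketch does not implement:

```python
import argparse

# Hypothetical CLI skeleton for Task 3.1; run_validator is a placeholder
# for the real call into the validator module.


def run_validator(manifest_path: str) -> int:
    # Placeholder: the real wrapper would run the validator and propagate
    # its 0/1/2 exit code.
    print(f"validating {manifest_path}")
    return 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="mindmodel-cli")
    parser.add_argument(
        "manifest",
        nargs="?",
        default=".mindmodel/manifest.yaml",
        help="manifest path (defaults to .mindmodel/manifest.yaml)",
    )
    args = parser.parse_args(argv)
    return run_validator(args.manifest)
```

Once packaged, this would be invoked as `python -m scripts.mindmodel.cli`, matching the behavior line above.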
## Batch 4: Integration / Configs / Docs (parallel)
- Task 4.1: CI workflow to run validator on PRs and scheduled checks
- Path: .github/workflows/mindmodel-validate.yml
- Behavior: run tests, then run validator against .mindmodel/manifest.yaml if present
- Task 4.2: .mindmodel/ README describing read-only policy
- Path: .mindmodel/README.md
- Task 4.3: Add a minimal pre-commit config (trailing whitespace, eof fixer, check-yaml)
- Path: .pre-commit-config.yaml
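
For Task 4.1, the workflow might look something like this sketch; the trigger schedule, Python version, and step layout are assumptions to be adjusted during review:

```yaml
# Hypothetical sketch of .github/workflows/mindmodel-validate.yml
name: mindmodel-validate
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"  # weekly scheduled check (placeholder cadence)

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest pyyaml
      - run: pytest tests/scripts/mindmodel
      - name: Run validator (skips when no manifest is present)
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            python -m scripts.mindmodel.cli .mindmodel/manifest.yaml
          else
            echo "No manifest found; skipping"
          fi
```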
## Verification
- Each unit has a focused pytest test to validate behavior.
- CI will run the validator and tests; the validator should skip if no manifest is present.
## Implementation Checklist
- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
- [ ] Add .github/workflows/mindmodel-validate.yml
- [ ] Add .mindmodel/README.md
- [ ] Add .pre-commit-config.yaml
## Next steps
1. Create the files above in small commits (one micro-task per commit).
2. Run unit tests for each new module as added.
3. Open a small PR with the validator + CI + docs; request reviewers to run the validator locally.