diff --git a/.mindmodel/constraints/01-naming.yaml b/.mindmodel/constraints/01-naming.yaml new file mode 100644 index 0000000..ffd301f --- /dev/null +++ b/.mindmodel/constraints/01-naming.yaml @@ -0,0 +1,34 @@ +# Naming & Style Conventions + +## Rules +- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py +- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py) +- Classes: PascalCase. Evidence: MotionDatabase (database.py) +- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred) +- Imports order: stdlib, third-party, local; prefer absolute imports and grouped. +- Use black, ruff, isort, mypy as the recommended toolchain; repository lacks config files (black, ruff, pyproject sections). + +## Examples + +### Function example (from pipeline/run_pipeline.py) +```python +def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]: + """Return list of (window_id, start_str, end_str) tuples.""" +``` + +### Class example (from database.py) +```python +class MotionDatabase: + def __init__(self, db_path: str = config.DATABASE_PATH): + ... +``` + +## Anti-patterns +- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files. + +## Remediations +- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run ruff/black CI lint step. 
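The naming and import-order rules above can be illustrated together in one short sketch. The names below (`VOTE_MAP_EXAMPLE`, `format_window_id`, `WindowFormatter`) are hypothetical illustrations, not identifiers from the repository; third-party and local imports are shown as comments so the sketch stays self-contained:

```python
# stdlib imports first
import json
from datetime import date

# third-party imports second (commented out to keep the sketch runnable)
# import duckdb
# import requests

# local imports last; prefer absolute over relative
# from database import MotionDatabase

VOTE_MAP_EXAMPLE = {"voor": 1, "tegen": -1}  # constants: UPPER_SNAKE_CASE


def format_window_id(window_start: date) -> str:
    """Functions and variables: snake_case."""
    return window_start.isoformat()


class WindowFormatter:
    """Classes: PascalCase."""

    def render(self, window_start: date) -> str:
        return json.dumps({"window": format_window_id(window_start)})
```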
+ +## Evidence pointers +- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120) +- database.py: MotionDatabase class and methods (file database.py lines 1-400+) diff --git a/.mindmodel/constraints/10-db-schema.yaml b/.mindmodel/constraints/10-db-schema.yaml new file mode 100644 index 0000000..535dd48 --- /dev/null +++ b/.mindmodel/constraints/10-db-schema.yaml @@ -0,0 +1,74 @@ +# Database Schema (DuckDB) — extracted DDL + +## Rules +- Use DuckDB for persistent storage when available; fallback to JSON files when duckdb is not installed (database.py). +- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS used in database.py). + +## Examples (DDL snippets extracted from database.py) + +### motions table +```sql +CREATE TABLE IF NOT EXISTS motions ( + id INTEGER DEFAULT nextval('motions_id_seq'), + title TEXT NOT NULL, + description TEXT, + date DATE, + policy_area TEXT, + voting_results JSON, + winning_margin FLOAT, + controversy_score FLOAT, + layman_explanation TEXT, + externe_identifier TEXT, + body_text TEXT, + url TEXT UNIQUE, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY (id) +) +``` + +### mp_votes table +```sql +CREATE TABLE IF NOT EXISTS mp_votes ( + id INTEGER DEFAULT nextval('mp_votes_id_seq'), + motion_id INTEGER NOT NULL, + mp_name TEXT NOT NULL, + party TEXT, + vote TEXT NOT NULL, + date DATE, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY (id) +) +``` + +### embeddings / fused_embeddings +```sql +CREATE TABLE IF NOT EXISTS embeddings ( + id INTEGER DEFAULT nextval('embeddings_id_seq'), + motion_id INTEGER NOT NULL, + model TEXT, + vector JSON NOT NULL, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY (id) +) + +CREATE TABLE IF NOT EXISTS fused_embeddings ( + id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), + motion_id INTEGER NOT NULL, + window_id TEXT NOT NULL, + vector JSON NOT NULL, + svd_dims INTEGER NOT NULL, + text_dims INTEGER NOT NULL, + created_at 
TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY (id) +) +``` + +## Anti-patterns +- Broad try/except around duckdb import (database.py top) — acceptable for optional dependency but should log explicitly the missing dependency and document test behavior. + +## Remediations +- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically. +- Add tests that exercise both duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80). + +## Evidence pointers +- database.py: DDL strings and sequences (file: database.py lines ~1-300 and further). See create table blocks for motions, mp_votes, embeddings, fused_embeddings. diff --git a/.mindmodel/constraints/20-domain-glossary.yaml b/.mindmodel/constraints/20-domain-glossary.yaml new file mode 100644 index 0000000..43bbc7e --- /dev/null +++ b/.mindmodel/constraints/20-domain-glossary.yaml @@ -0,0 +1,22 @@ +# Domain Glossary + +## Rules +- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id. + +## Terms +- Motion: parliamentary motion stored in `motions` table. Evidence: database.py CREATE TABLE motions (file: database.py lines ~40-110) +- MP (Member of Parliament): individual with votes stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes +- Embedding: text embedding stored in `embeddings` table; fused vectors in `fused_embeddings`. +- SVD vector: reduced-dimensional vectors stored in `svd_vectors` table. +- Window: time window identifier (e.g., "2024-Q1") used across SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows +- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score + +## Examples / Usage +- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. 
Evidence: pipeline/run_pipeline.py lines ~1-120 + +## Evidence pointers +- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py) +- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py) + +## Anti-patterns +- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations. diff --git a/.mindmodel/constraints/30-clusters.yaml b/.mindmodel/constraints/30-clusters.yaml new file mode 100644 index 0000000..c12c29f --- /dev/null +++ b/.mindmodel/constraints/30-clusters.yaml @@ -0,0 +1,30 @@ +# Code Clusters / Organization + +## Rules +- The repository organizes code into the following clusters (observed): + - UI / Streamlit: Home.py, pages/, app.py, explorer.py + - Database & persistence: database.py, config.py + - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion) + - AI provider & summarization: ai_provider.py, pipeline/..., analysis/ + - Similarity & caching: similarity/*, similarity_cache table in DB + - API client & scraping: api_client.py, pipeline/fetch_mp_metadata + - Analysis & visualization: analysis/visualize.py, explorer.py + - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py + - Tests & migrations: tests/ (pytest) and database reset helpers + +## Examples + +### Pipeline orchestrator (cluster: CLI & pipeline) +```python +from database import MotionDatabase +db = MotionDatabase(db_path) +# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window +``` + +## Remediations +- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about optional duckdb dependency and JSON fallback for tests. 
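The orchestrator example above runs a fixed sequence of phases. A minimal sketch of that phase-sequencing idea follows; the helper `run_phases` and the lambda bodies are illustrative only, not the actual run_pipeline.py API:

```python
from typing import Callable, Dict, List


def run_phases(phases: Dict[str, Callable[[], None]], selected: List[str]) -> List[str]:
    """Run the selected phases in registration order and return which ones ran."""
    executed: List[str] = []
    for name, phase in phases.items():  # dicts preserve insertion order
        if name in selected:
            phase()
            executed.append(name)
    return executed


log: List[str] = []
demo_phases = {
    "fetch_mp_metadata": lambda: log.append("metadata"),
    "extract_mp_votes": lambda: log.append("votes"),
    "compute_svd": lambda: log.append("svd"),
}
print(run_phases(demo_phases, ["fetch_mp_metadata", "compute_svd"]))
# → ['fetch_mp_metadata', 'compute_svd']
```

Keeping phase selection separate from phase implementation is what lets the real pipeline expose CLI flags for running individual stages.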
+ +## Evidence pointers +- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py) +- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py) +- analysis/visualize.py: visualization cluster (file: analysis/visualize.py) diff --git a/.mindmodel/constraints/40-patterns.yaml b/.mindmodel/constraints/40-patterns.yaml new file mode 100644 index 0000000..eaeed6f --- /dev/null +++ b/.mindmodel/constraints/40-patterns.yaml @@ -0,0 +1,46 @@ +# Design Patterns & Code Patterns + +## Rules +- Use repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management. +- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and local fallback. +- Pipeline orchestration: run_pipeline.py uses phases, ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (collect results before writes). + +## Examples + +### Repository pattern (database.py MotionDatabase) +```python +class MotionDatabase: + def __init__(self, db_path: str = config.DATABASE_PATH): + self.db_path = db_path + self._init_database() + + def insert_motion(self, motion_data: Dict) -> bool: + """Insert a new motion into database""" + # uses duckdb.connect and parameterized queries +``` + +### Provider adapter with retries (ai_provider.py) +```python +def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response: + # Implements retries/backoff, handles 429 with Retry-After and 5xx responses +``` + +### Pipeline parallelism pattern (run_pipeline) +```python +with ThreadPoolExecutor(max_workers=max_workers) as pool: + for window_id, w_start, w_end in windows: + fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k) + futures[fut] = window_id +# wait then write sequentially to DuckDB +``` + +## Anti-patterns +- Broad excepts used in several places (database.py top-level try/except on duckdb 
import, many generic excepts around DB operations) — can hide real errors. + +## Remediations +- Replace broad except Exception with targeted exceptions and explicit logging. Where fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with clear message and include guidance in CONTRIBUTING.md. + +## Evidence pointers +- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300) +- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260) +- database.py: MotionDatabase methods (file: database.py) diff --git a/.mindmodel/constraints/50-anti-patterns.yaml b/.mindmodel/constraints/50-anti-patterns.yaml new file mode 100644 index 0000000..00b5182 --- /dev/null +++ b/.mindmodel/constraints/50-anti-patterns.yaml @@ -0,0 +1,24 @@ +# Anti-patterns, Issues and Recommended Fixes + +## Rules +- Flagged issues discovered in Phase 1 must be remediated with concrete actions. + +## Issues +- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml +- openai is declared but static imports not found; may be unused. Evidence: pyproject.toml, ai_provider.py uses requests and env keys instead of openai imports. +- Many dependencies use permissive ">=" version ranges; no lockfile present. This reduces reproducibility. +- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps. +- Broad except Exception used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging. + +## Remediations / Recommended fixes +- Move pytest from runtime dependencies to dev-dependencies in pyproject.toml. + - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies] depending on toolchain. +- Audit `openai` usage. 
If unused, remove from pyproject.toml. If dynamically imported in runtime, add a small shim or explicit lazy import with documented env var. +- Pin critical dependencies or add upper bounds; generate lockfile (poetry.lock or pip-tools requirements.txt). Add CI job that fails on permissive ranges. +- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add CI lint stage. +- Replace broad except Exception with narrower catches and re-raise or log with traceback when unexpected. Example locations: database.py top import, insert_motion broad except, ai_provider fallback blocks. + +## Evidence pointers +- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40) +- database.py: multiple broad except blocks (file: database.py top and methods) +- ai_provider.py: uses requests + env keys (file: ai_provider.py) diff --git a/.mindmodel/constraints/60-examples.yaml b/.mindmodel/constraints/60-examples.yaml new file mode 100644 index 0000000..d1f7027 --- /dev/null +++ b/.mindmodel/constraints/60-examples.yaml @@ -0,0 +1,117 @@ +# Example Extractions + +## Rules +- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions. + +## (a) Function signatures with docstrings (5 examples) +1) pipeline/run_pipeline.py::_generate_windows +```python +def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]: + """Return list of (window_id, start_str, end_str) tuples. + + window_id format: + quarterly → "2024-Q1", "2024-Q2", … + annual → "2024" + """ +``` + +2) database.py::append_audit_event +```python +def append_audit_event( + self, + actor_id: Optional[str], + action: str, + target_type: Optional[str] = None, + target_id: Optional[str] = None, + metadata: Optional[Dict] = None, +) -> bool: + """Record an audit event. 
Tries DB then falls back to ledger file.""" +``` + +3) ai_provider.py::get_embedding +```python +def get_embedding(text: str, model: str | None = None) -> list[float]: + """Return an embedding vector for `text` using the configured provider. + + Raises ProviderError for configuration or provider-side failures. + """ +``` + +4) ai_provider.py::get_embeddings_batch +```python +def get_embeddings_batch( + texts: list[str], model: str | None = None, batch_size: int = 50 +) -> list[list[float]]: + """Return embedding vectors for multiple texts using batched API calls.""" +``` + +5) analysis/visualize.py::plot_umap_scatter +```python +def plot_umap_scatter( + motion_ids: List[int], + coords: List[List[float]], + labels: Optional[List[int]] = None, + window_id: Optional[str] = None, + output_path: str = "analysis_umap.html", +) -> str: + """Produce a 2D scatter plot of UMAP-reduced fused embeddings.""" +``` + +## (b) SQL / DDL snippets (3 examples inferred from database.py) +1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110) + +2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes + +3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings + +## (c) Pytest stubs (4 sample tests matching conventions) +Create tests under tests/ named test_*.py using fixtures in conftest.py. Examples below are stubs to add. 
+ +1) tests/test_database_basic.py +```python +def test_init_database_creates_tables(tmp_path): + db_path = str(tmp_path / "motions.db") + from database import MotionDatabase + + db = MotionDatabase(db_path=db_path) + # If duckdb not available, JSON fallback should create .embeddings.json + assert db is not None +``` + +2) tests/test_ai_provider.py +```python +def test_local_embedding_fallback(): + from ai_provider import _local_embedding + + v = _local_embedding("hello world", dim=16) + assert isinstance(v, list) and len(v) == 16 +``` + +3) tests/test_pipeline_windows.py +```python +from pipeline.run_pipeline import _generate_windows + +def test_generate_quarterly_windows(): + from datetime import date + + start = date(2024, 1, 1) + end = date(2024, 3, 31) + windows = _generate_windows(start, end, "quarterly") + assert any(w[0].endswith("Q1") for w in windows) +``` + +4) tests/test_visualize_plot.py +```python +def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path): + # If plotly missing, function should raise ImportError with guidance + import analysis.visualize as vis + + try: + vis._require_plotly() + except ImportError: + assert True +``` + +## Evidence pointers +- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py +- DDL: database.py create table blocks diff --git a/.mindmodel/constraints/99-stack.yaml b/.mindmodel/constraints/99-stack.yaml new file mode 100644 index 0000000..034f664 --- /dev/null +++ b/.mindmodel/constraints/99-stack.yaml @@ -0,0 +1,43 @@ +# Stack and Dependencies + +## Rules +- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13") +- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile +- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py +- ML: scikit-learn, umap-learn, scipy. 
Evidence: pyproject.toml, pipeline/svd.py, analysis/ + +## Examples + +### pyproject dependencies (evidence: pyproject.toml) +```toml +dependencies = [ + "duckdb>=1.3.2", + "ibis-framework[duckdb]>=10.8.0", + "openai>=1.99.7", + "scipy>=1.11", + "umap-learn>=0.5", + "plotly>=5.0", + "pytest>=9.0.2", + "requests>=2.32.4", + "schedule>=1.2.2", + "streamlit>=1.48.0", + "scikit-learn>=1.8.0", + "beautifulsoup4>=4.14.3", + "lxml>=6.0.2", +] +``` + +## Anti-patterns / Notes +- pytest is listed under runtime dependencies in pyproject.toml (line: dependencies). Move pytest to dev-dependencies to avoid shipping test runner in production images. Evidence: pyproject.toml +- Many dependencies use permissive ">=" ranges. Recommend pinning or generating lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility. +- openai appears declared but static imports not found; possible unused dependency (evidence: pyproject.toml, ai_provider.py uses requests and environment keys instead of openai). + +## Remediations +- Move test-only libs (pytest) to dev-dependencies in pyproject.toml. +- Add lockfile and CI step to check for pinned dependencies. +- Audit declared but unused packages (openai) and remove or confirm dynamic usage. + +## Evidence pointers +- pyproject.toml: full dependency list (lines 1-40) +- Home.py: streamlit usage and app entry (file: Home.py) +- database.py: duckdb table creation and connection (file: database.py lines ~1-350) diff --git a/.mindmodel/constraints/README.md b/.mindmodel/constraints/README.md new file mode 100644 index 0000000..7d63103 --- /dev/null +++ b/.mindmodel/constraints/README.md @@ -0,0 +1,5 @@ +# Mindmodel constraints README + +Files in .mindmodel/constraints/ are YAML-like constraint documents describing +conventions, patterns and remediation steps. Use these to guide PR reviews and +CI automation. 
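To illustrate how CI automation might consume the constraint documents described in the README above, here is a minimal sketch that extracts the `## ` section headings from a document body. The helper name `list_section_headings` is hypothetical, not part of the repository:

```python
from typing import List


def list_section_headings(doc_text: str) -> List[str]:
    """Return the '## ' section headings of a constraint document, in order."""
    return [
        line[3:].strip()
        for line in doc_text.splitlines()
        if line.startswith("## ")
    ]


sample = "# Naming & Style Conventions\n\n## Rules\n- ...\n\n## Examples\n\n## Remediations\n"
print(list_section_headings(sample))  # → ['Rules', 'Examples', 'Remediations']
```

A CI step could use such a helper to assert that every constraint document carries the expected Rules / Examples / Remediations sections.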
diff --git a/.mindmodel/manifest.yaml b/.mindmodel/manifest.yaml index 071febb..254591f 100644 --- a/.mindmodel/manifest.yaml +++ b/.mindmodel/manifest.yaml @@ -1,60 +1,36 @@ name: stemwijzer version: 2 +summary: >- + Mindmodel constraints for the Stemwijzer repository (Python + Streamlit + + DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns, + anti-patterns and example extractions. Generated from Phase 1 analysis. +main_patterns: + - Repository DB wrapper (MotionDatabase) + - AI provider adapter with retry/backoff and local fallback + - SVD + embedding fusion pipeline with windowed processing +total_files: 11 categories: - - path: stack.yaml - description: Project technology stack (languages, frameworks, runtime) + - path: .mindmodel/constraints/99-stack.yaml + description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis) group: stack - - path: dependencies.yaml - description: Declared and recommended dependencies grouped by purpose - group: stack - - path: system.md - description: System overview and architecture high-level notes - group: architecture - - path: architecture.yaml - description: Architectural layers, organization and confidence levels - group: architecture - - path: conventions.yaml - description: Coding conventions cheat-sheet (naming, imports, types) - group: style - - path: domain-glossary.yaml - description: Business domain glossary for the project + - path: .mindmodel/constraints/01-naming.yaml + description: Naming, import and style conventions + group: conventions + - path: .mindmodel/constraints/10-db-schema.yaml + description: DuckDB schema DDL extracted from database.py + group: database + - path: .mindmodel/constraints/20-domain-glossary.yaml + description: Domain glossary and terminology (motions, MP, embeddings, windows) group: domain - - path: patterns/duckdb_access.yaml - description: DuckDB access patterns, examples, and anti-patterns - group: patterns - - path: 
patterns/requests_http.yaml - description: Requests/HTTP client usage and retry best-practices - group: patterns - - path: patterns/embeddings_similarity.yaml - description: Embedding, SVD, fusion and similarity pipeline patterns - group: patterns - - path: patterns/error_handling.yaml - description: Error handling patterns and rules - group: patterns - - path: patterns/validation.yaml - description: Input/domain validation patterns and examples - group: patterns - - path: patterns/module_singletons.yaml - description: Module-level singletons and lifecycle patterns - group: patterns - - path: anti-patterns.yaml - description: Known anti-patterns and remediation steps - group: patterns - - path: examples/pattern-examples.md - description: Consolidated extracted code examples across patterns - group: patterns - - path: constraints/naming.yaml - description: Enforce naming rules (snake_case, PascalCase, constants) - group: constraints - - path: constraints/imports.yaml - description: Enforce import grouping and ordering - group: constraints - - path: constraints/db_connection.yaml - description: Rules for opening/closing DB connections and read-only usage - group: constraints - - path: constraints/error_handling.yaml - description: Error handling style and allowed exception scopes - group: constraints - - path: constraints/testing.yaml - description: Test conventions (pytest, test naming, fixtures) - group: constraints + - path: .mindmodel/constraints/30-clusters.yaml + description: Code clusters and module organization + group: architecture + - path: .mindmodel/constraints/40-patterns.yaml + description: Design patterns and coding patterns observed with examples + group: patterns + - path: .mindmodel/constraints/50-anti-patterns.yaml + description: Anti-patterns, issues and recommended remediations + group: ops + - path: .mindmodel/constraints/60-examples.yaml + description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs" + group: examples 
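The manifest above maps each constraint document to a `path`. A CI check could verify that every listed path exists on disk; a sketch follows, assuming the manifest has already been parsed into a dict (the helper `missing_category_paths` is hypothetical):

```python
from pathlib import Path
from typing import Any, Dict, List


def missing_category_paths(manifest: Dict[str, Any], repo_root: str = ".") -> List[str]:
    """Return manifest category paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [
        category["path"]
        for category in manifest.get("categories", [])
        if not (root / category["path"]).exists()
    ]


demo_manifest = {"categories": [{"path": ".mindmodel/constraints/99-stack.yaml"}]}
# Against a root that lacks the file, the path is reported as missing:
print(missing_category_paths(demo_manifest, repo_root="/nonexistent"))
# → ['.mindmodel/constraints/99-stack.yaml']
```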
diff --git a/.mindmodel/system.md b/.mindmodel/system.md index c90657d..7e7fe3e 100644 --- a/.mindmodel/system.md +++ b/.mindmodel/system.md @@ -1,18 +1,14 @@ -# System overview +# System Overview: Stemwijzer -This project is a Streamlit-based UI and data-processing pipeline that computes embeddings, -performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes -a similarity cache for quick lookup in the UI. +This mindmodel documents constraints, conventions and patterns for the Stemwijzer +project (Python Streamlit app with DuckDB-backed pipeline for parliamentary +motions embedding analysis). -Key subsystems: -- UI: Streamlit pages (Home.py, pages/*). Exposes interactive explorer and quizzes. -- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData). -- Processing pipelines: pipeline/* (text embeddings, SVD, fusion). -- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors. -- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb. -- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API). +Key points: +- Language: Python >=3.13 +- UI: Streamlit multi-page app (Home.py, pages/) +- Storage: DuckDB with JSON fallback for tests/dev (database.py) +- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py) +- AI: ai_provider adapter uses HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and local fallback -Operational notes: -- Dockerfile exists; Streamlit default port 8501 exposed. -- Tests use pytest. CI uses Drone (.drone.yml). -- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs. +Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding. 
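The retry/backoff behavior the system overview attributes to the AI adapter can be sketched generically. The exception type, function names, and delays below are illustrative; the real logic lives in `ai_provider._post_with_retries` and handles HTTP 429 with a Retry-After header plus 5xx responses:

```python
import time
from typing import Callable


class TransientError(Exception):
    """Illustrative retryable failure (stands in for HTTP 429/5xx)."""

    def __init__(self, retry_after: "float | None" = None):
        super().__init__("transient failure")
        self.retry_after = retry_after


def call_with_retries(fn: Callable[[], str], retries: int = 3, base_delay: float = 0.01) -> str:
    """Retry fn with exponential backoff, honoring an explicit retry-after hint."""
    for attempt in range(retries):
        try:
            return fn()
        except TransientError as exc:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            delay = exc.retry_after if exc.retry_after is not None else base_delay * (2 ** attempt)
            time.sleep(delay)
    raise RuntimeError("unreachable")


attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError()
    return "ok"

print(call_with_retries(flaky))  # → ok (after two simulated transient failures)
```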
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 19f1926..c161a73 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,14 +1,11 @@ -ARCHITECTURE -============ +# ARCHITECTURE -Overview -------- -- Small Python project that collects, stores and presents Dutch parliamentary motions (Tweede Kamer). It - ingests votes (OData API or HTML scraping), stores motions in a DuckDB file, generates short human - summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results. +## Overview + +- Small Python project that collects, stores and presents Dutch parliamentary motions (Tweede Kamer). It ingests votes (OData API or HTML scraping), stores motions in a DuckDB file, generates short human summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results. + +## Tech stack -Tech stack ---------- - Language: Python (single-project repository) - Data: DuckDB (file: data/motions.db), ibis used in a small utility (read.py) - Web / UI: Streamlit (app.py) @@ -18,9 +15,10 @@ Tech stack - LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config) - Packaging: pyproject.toml present -Top-level layout (annotated) ---------------------------- +## Top-level layout (annotated) + ./ + - app.py — Streamlit UI, main UI flow and session handling (entrypoint for web) - main.py — minimal CLI entry / small script - database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations @@ -37,50 +35,41 @@ Top-level layout (annotated) - pyproject.toml — project metadata / dependencies - .env — environment variables (not printed here) -Core components --------------- +## Core components + - Streamlit UI (app.py) - Presents the voting UI, reads filtered motions from database, creates sessions, writes user votes - - Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(), - database.calculate_party_matches(), 
summarizer.update_motion_summaries() - + - Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(), database.calculate_party_matches(), summarizer.update_motion_summaries() - Storage (database.py) - MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions - Exposes a module-level instance `db = MotionDatabase()` used across the codebase - - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, - calculate_party_matches - + - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches - Ingestion (api_client.py + scraper.py) - api_client.py fetches votes via Tweede Kamer OData API and groups records into motions - scraper.py is an HTML fallback that scrapes motion pages and extracts vote info - Both provide structured motion dicts consumed by database.insert_motion() - - Summarization (summarizer.py) - Wraps an OpenAI-compatible client to produce short layman explanations and persists them to DB - Reads motions without layman_explanation and updates rows - - Orchestration (scheduler.py) - Runs initial historical ingestion and schedules periodic updates (using schedule) - Calls API client and summarizer and writes to the database -Data flow (high level) ---------------------- -1. Ingestion - - scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job() - - Each produced motion dict is passed to MotionDatabase.insert_motion() - - insert_motion writes to DuckDB (data/motions.db) +## Data flow (high level) -2. Enrichment - - summarizer.update_motion_summaries() reads motions lacking layman_explanation, - calls the LLM client (openai.OpenAI) and writes summary text back to the DB +1. Ingestion + - scheduler / manual run triggers TweedeKamerAPI.get_motions(...) 
or MotionScraper.run_scraping_job() + - Each produced motion dict is passed to MotionDatabase.insert_motion() + - insert_motion writes to DuckDB (data/motions.db) +2. Enrichment + - summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client (openai.OpenAI) and writes summary text back to the DB +3. Presentation / Interaction + - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them + - Users vote; app.py writes votes into the database via db.update_user_vote() + - app.py calls db.calculate_party_matches() to compute match percentages for parties -3. Presentation / Interaction - - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them - - Users vote; app.py writes votes into the database via db.update_user_vote() - - app.py calls db.calculate_party_matches() to compute match percentages for parties +## External integrations & dependencies -External integrations & dependencies ------------------------------------ - Tweede Kamer OData API (api_client.py) - HTTP (requests) - HTML parsing (BeautifulSoup) used by scraper.py @@ -89,8 +78,8 @@ External integrations & dependencies - Streamlit for UI - OpenAI-compatible LLM client (summarizer.py) — configured with environment variables in config.py -Configuration ------------- +## Configuration + - config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include: - config.DATABASE_PATH (default "data/motions.db") - OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py @@ -99,26 +88,24 @@ Configuration - .env file present at repo root (do not commit secrets). See .env.example if present (none observed). - Packaging metadata: pyproject.toml -Build, run & development notes ------------------------------ -- Install dependencies via the project's Python packaging (pyproject.toml). There is no Dockerfile or CI - workflows detected in the repository. 
-- Streamlit app: run `streamlit run app.py` from project root to start the UI (app.py is the intended web entrypoint). +## Build, run & development notes + +- Install dependencies via the project's Python packaging (pyproject.toml). There is no Dockerfile or CI workflows detected in the repository. +- Use `uv add` and `uv run` to manage dependencies in this directory and run scripts. +- Streamlit app: run `uv run streamlit run app.py` from project root to start the UI (app.py is the intended web entrypoint). - Scheduler: run scheduler.run_once() (script or import) or run scheduler.run_scheduler() for periodic ingestion. -Tests ------ +## Tests + - There is no test suite using pytest / unittest. One ad-hoc script `test.py` exists for manual insert verification. -Notes / caveats ---------------- +## Notes / caveats + +- Project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`). +- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py, scraper.py). Logging is not centralized (print statements used). 
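The module-level singleton caveat noted in the architecture document (`db = MotionDatabase()` runs side effects at import time) can be mitigated with a lazy accessor. A sketch using a stand-in class follows; `MotionDatabaseStub` and `get_db` are hypothetical names, and the real MotionDatabase opens a DuckDB connection in `__init__`, which this pattern defers until first use:

```python
from functools import lru_cache


class MotionDatabaseStub:
    """Stand-in for database.MotionDatabase; the real init opens a DuckDB connection."""

    def __init__(self, db_path: str = "data/motions.db"):
        self.db_path = db_path  # no connection is opened in this sketch


@lru_cache(maxsize=1)
def get_db() -> MotionDatabaseStub:
    """Create the shared instance on first call instead of at module import."""
    return MotionDatabaseStub()


print(get_db() is get_db())  # → True: every caller sees the same instance
```

Callers swap `from database import db` for `get_db()`, which also makes tests easier to isolate (clear the cache between tests).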
+ +## Where to look first (for contributors) -Where to look first (for contributors) ------------------------------------- - app.py — follow the UI flow and see how votes & sessions are used - database.py — core data model and calculations - api_client.py — OData ingestion logic diff --git a/scripts/mindmodel/loader.py b/scripts/mindmodel/loader.py new file mode 100644 index 0000000..088a688 --- /dev/null +++ b/scripts/mindmodel/loader.py @@ -0,0 +1,67 @@ +"""Simple manifest loader for mindmodel manifests. + +Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`. + +Behavior: +- If PyYAML is installed, uses yaml.safe_load to parse the file. +- Otherwise falls back to the stdlib json parser. +- If the top-level document is a list it will be normalized to {"constraints": list}. +- Raises ManifestLoadError for missing file or parse errors. +""" + +from typing import Any, Dict +import json +from pathlib import Path + + +class ManifestLoadError(Exception): + """Raised when a manifest cannot be loaded or parsed.""" + + +try: + import yaml  # type: ignore +except ImportError:  # PyYAML not available + yaml = None  # type: ignore + + +def _parse_with_yaml(text: str) -> Any: + # yaml.safe_load may return any Python structure + try: + return yaml.safe_load(text) + except Exception as exc:  # pragma: no cover - defensive + raise ManifestLoadError(f"YAML parse error: {exc}") from exc + + +def _parse_with_json(text: str) -> Any: + try: + return json.loads(text) + except Exception as exc: + raise ManifestLoadError(f"JSON parse error: {exc}") from exc + + +def load_manifest(path: str) -> Dict[str, Any]: + """Load a manifest from the given file path and normalize it to a dict. + + If the top-level document is a list, it will be returned as {"constraints": list}. + Raises ManifestLoadError if the file does not exist or if parsing fails. 
+ """ + p = Path(path) + if not p.exists(): + raise ManifestLoadError(f"Manifest file not found: {path}") + + text = p.read_text(encoding="utf-8") + + if yaml is not None: + data = _parse_with_yaml(text) + else: + data = _parse_with_json(text) + + # Normalize + if isinstance(data, list): + return {"constraints": data} + + if isinstance(data, dict): + return data + + # Unexpected top-level type, wrap it + return {"manifest": data} diff --git a/tests/scripts/mindmodel/test_loader.py b/tests/scripts/mindmodel/test_loader.py new file mode 100644 index 0000000..b4a3429 --- /dev/null +++ b/tests/scripts/mindmodel/test_loader.py @@ -0,0 +1,21 @@ +import json +import pytest + +from scripts.mindmodel import loader + + +def test_load_json_manifest(tmp_path): + data = [{"id": "c1", "description": "a constraint"}] + p = tmp_path / "manifest.json" + p.write_text(json.dumps(data), encoding="utf-8") + + loaded = loader.load_manifest(str(p)) + + assert isinstance(loaded, dict) + assert "constraints" in loaded + assert any(c.get("id") == "c1" for c in loaded["constraints"]) + + +def test_missing_manifest_raises(): + with pytest.raises(loader.ManifestLoadError): + loader.load_manifest("nonexistent-file-manifest.json") diff --git a/thoughts/ledgers/audit_events.json b/thoughts/ledgers/audit_events.json index fbc2561..da1f368 100644 --- a/thoughts/ledgers/audit_events.json +++ b/thoughts/ledgers/audit_events.json @@ -545,5 +545,98 @@ "target_id": null, "metadata": {}, "created_at": "2026-03-23T22:52:47.836920Z" + }, + { + "id": "de3394a0-8c8e-4282-8369-f53aa957fd46", + "actor_id": null, + "action": "embedding_failed", + "target_type": "motion", + "target_id": "99", + "metadata": { + "error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")" + }, + "created_at": "2026-03-24T19:08:06.647810Z" + }, + { + "id": "8491ed90-9314-41a9-9d02-092a5d0bebd5", + "actor_id": null, + "action": "test_action", + "target_type": "unit", + "target_id": "u1", + "metadata": { + 
"k": 1 + }, + "created_at": "2026-03-24T19:08:08.085618Z" + }, + { + "id": "ae7c88e5-ba28-4012-8991-c58fea9c0778", + "actor_id": null, + "action": "another_action", + "target_type": "motion", + "target_id": null, + "metadata": {}, + "created_at": "2026-03-24T19:08:08.131631Z" + }, + { + "id": "b73e6bf8-2b66-43bf-ad9c-e92d34ae38db", + "actor_id": null, + "action": "embedding_failed", + "target_type": "motion", + "target_id": "99", + "metadata": { + "error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")" + }, + "created_at": "2026-03-24T19:18:02.854710Z" + }, + { + "id": "3a6bf0e0-9f07-477d-9079-715d8c0f39c4", + "actor_id": null, + "action": "test_action", + "target_type": "unit", + "target_id": "u1", + "metadata": { + "k": 1 + }, + "created_at": "2026-03-24T19:18:05.512388Z" + }, + { + "id": "75d9e229-78e6-439e-8095-c01ba7830de9", + "actor_id": null, + "action": "another_action", + "target_type": "motion", + "target_id": null, + "metadata": {}, + "created_at": "2026-03-24T19:18:05.557773Z" + }, + { + "id": "d45fc116-47be-4486-ba5c-ab2edd7f7e76", + "actor_id": null, + "action": "embedding_failed", + "target_type": "motion", + "target_id": "99", + "metadata": { + "error": "RuntimeError(\"Simulated embedding failure for index 0: 'failing motion'\")" + }, + "created_at": "2026-03-24T19:28:43.867346Z" + }, + { + "id": "b4ead1cd-58b1-4ff6-aa73-c77ab09ba063", + "actor_id": null, + "action": "test_action", + "target_type": "unit", + "target_id": "u1", + "metadata": { + "k": 1 + }, + "created_at": "2026-03-24T19:28:45.051895Z" + }, + { + "id": "463bfa1b-59fe-4fd3-a8dd-b39674948656", + "actor_id": null, + "action": "another_action", + "target_type": "motion", + "target_id": null, + "metadata": {}, + "created_at": "2026-03-24T19:28:45.097703Z" } ] \ No newline at end of file diff --git a/thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md b/thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md new file mode 100644 index 
0000000..6207b13 --- /dev/null +++ b/thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md @@ -0,0 +1,73 @@ +--- +date: 2026-03-24 +topic: "mindmodel-generation" +status: draft +--- + +## Problem Statement + +We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions). + +## Constraints + +**Non-negotiables:** +- Keep the generated .mindmodel/ files read-only until validated. +- Do not make behavioral changes to production code in the same change as model metadata updates. +- Avoid committing secrets or lockfiles without explicit review. + +**Limitations:** +- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate. +- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative. + +## Approach + +I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) mark fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass. + +Alternatives considered: +- Accept-and-commit everything immediately (faster) — rejected because of truncated reads and potential wrong pointers. +- Manual rewrite of constraints by hand (accurate) — rejected due to time cost; validation + targeted fixes gives best ROI. + +## Architecture + +This is a documentation/metadata integration task, not a runtime service. 
+Components:
+
+- **.mindmodel/**: constraint files and manifest produced by the orchestrator. Source of truth for conventions and inferred patterns.
+- **Validator job (CI)**: lightweight script/CI step that verifies manifest consistency, required files exist, and key evidence pointers resolve.
+- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, add black/ruff/isort configs, pre-commit) that enable future automation.
+
+## Components
+
+- Constraint Validator: verifies every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets.
+- Staging branch: holds small remediation commits; each commit is limited to one class of change (deps dev/prod move, linters, CI yaml).
+- CI pipeline changes: add a validation job and a docs check that ensures the .mindmodel/ manifest is up to date.
+
+## Data Flow
+
+1. Orchestrator output (.mindmodel/) exists in the working tree.
+2. Validator runs locally or in CI to check pointers and file existence.
+3. Developer reviews the validator report and accepts/edits constraint files.
+4. Remediation PRs are opened for low-risk fixes.
+5. CI runs tests + validator; on green we merge and enable scheduled checks.
+
+## Error Handling
+
+- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
+- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
+- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing to the reviewer.
+
+## Testing Strategy
+
+- Unit: small pytest tests that assert README/pyproject presence and that the manifest YAML parses.
+- Integration: CI job that runs the Constraint Validator and fails on missing files or secrets.
+- Manual: reviewer inspects a sample of constraint files (3-5) for accuracy before merging.
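The evidence-pointer check that the Constraint Validator performs could be sketched roughly as below. This is an illustrative sketch only: the function name `check_evidence_paths` and the assumed constraint shape (a dict with an optional `evidence` list of repo-relative paths) are assumptions, not the repository's actual API.

```python
from pathlib import Path
from typing import Dict, List


def check_evidence_paths(constraints: List[Dict], repo_root: str = ".") -> List[str]:
    """Flag constraints whose evidence files do not exist under repo_root."""
    problems = []
    root = Path(repo_root)
    for constraint in constraints:
        # "evidence" is an assumed field: a list of repo-relative file paths.
        for rel_path in constraint.get("evidence", []):
            if not (root / rel_path).exists():
                problems.append(
                    f"{constraint.get('id', '<unknown>')}: missing evidence file {rel_path}"
                )
    return problems


# A constraint pointing at a nonexistent file is reported.
report = check_evidence_paths([{"id": "c-naming", "evidence": ["no/such/file.py"]}])
```

Reporting paths rather than raising keeps the check usable for a "needs-review" workflow, where a missing file marks the constraint instead of failing the run.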
+
+## Open Questions
+
+- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
+- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review + a CI check, not a protected branch yet.
+
+## Next Steps (what I'll do now)
+
+1. Create this design doc (done).
+2. Commit the design doc to the repo (doing now).
+3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).
diff --git a/thoughts/shared/plans/2026-03-24-mindmodel-generation.md b/thoughts/shared/plans/2026-03-24-mindmodel-generation.md
new file mode 100644
index 0000000..f971ddf
--- /dev/null
+++ b/thoughts/shared/plans/2026-03-24-mindmodel-generation.md
@@ -0,0 +1,76 @@
+---
+date: 2026-03-24
+topic: "mindmodel-generation"
+status: draft
+---
+
+# Implementation Plan: mindmodel-generation
+
+Goal: implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot, plus small CI/config artifacts to validate and integrate the manifest incrementally and safely.
+
+Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md
+
+---
+
+## Overview
+
+This plan breaks the work into four batches: Foundation, Core, Components, and Integration/Configs. Each micro-task is small and independently testable. Tests accompany the core modules. The validator intentionally avoids reading repository secret files and only scans manifest text and evidence snippets.
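The manifest-text secret scan described in the Overview could look roughly like this. The patterns and the function name `scan_manifest_text` are illustrative assumptions, not a vetted rule set or the planned module's API; the key point is that only the manifest text is inspected, never other repository files.

```python
import re

# Illustrative patterns only; a production scanner would use a vetted rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]


def scan_manifest_text(text: str) -> list:
    """Scan manifest text for secret-like strings; never opens other files."""
    hits = []
    for pattern in SECRET_PATTERNS:
        hits.extend(pattern.findall(text))
    return hits


hits = scan_manifest_text("password = hunter2\nnothing suspicious after that")
```

Heuristics like these produce false positives, which is another argument for the report-only stance discussed in the design's Open Questions.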
+
+## Batch 1: Foundation (parallel)
+
+- Task 1.1: Manifest loader
+  - Path: scripts/mindmodel/loader.py
+  - Test: tests/scripts/mindmodel/test_loader.py
+  - Behavior: load a YAML or JSON manifest, normalize it to a dict, raise ManifestLoadError on failure
+
+- Task 1.2: Low-level checks
+  - Path: scripts/mindmodel/checks.py
+  - Test: tests/scripts/mindmodel/test_checks.py
+  - Behavior: file existence (without opening), truncated-snippet heuristics, manifest-text secret heuristics
+
+## Batch 2: Core Modules (depends on Batch 1)
+
+- Task 2.1: Constraint Validator (core)
+  - Path: scripts/mindmodel/validator.py
+  - Test: tests/scripts/mindmodel/test_validator.py
+  - Behavior: load the manifest, scan for secrets, verify referenced files exist, detect truncated snippets, and produce a machine-readable report with exit codes: 0 ok, 1 warnings, 2 critical
+
+## Batch 3: Components (depends on Batch 2)
+
+- Task 3.1: CLI wrapper for CI and local runs
+  - Path: scripts/mindmodel/cli.py
+  - Test: tests/scripts/mindmodel/test_cli.py
+  - Behavior: simple wrapper delegating to the validator; callable as `python -m scripts.mindmodel.cli`
+
+## Batch 4: Integration / Configs / Docs (parallel)
+
+- Task 4.1: CI workflow to run the validator on PRs and scheduled checks
+  - Path: .github/workflows/mindmodel-validate.yml
+  - Behavior: run tests, then run the validator against .mindmodel/manifest.yaml if present
+
+- Task 4.2: .mindmodel/ README describing the read-only policy
+  - Path: .mindmodel/README.md
+
+- Task 4.3: Add a minimal pre-commit config (trailing whitespace, eof fixer, check-yaml)
+  - Path: .pre-commit-config.yaml
+
+## Verification
+
+- Each unit has a focused pytest test to validate behavior.
+- CI will run the validator and tests; the validator should skip if no manifest is present.
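The exit-code contract from Task 2.1 (0 ok, 1 warnings, 2 critical) can be sketched as follows. `ValidationReport` and its fields are hypothetical names for illustration, not the planned validator module's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidationReport:
    """Hypothetical report shape; the real validator's report may differ."""
    warnings: List[str] = field(default_factory=list)
    criticals: List[str] = field(default_factory=list)

    def exit_code(self) -> int:
        # Map findings to the plan's contract: 0 ok, 1 warnings, 2 critical.
        if self.criticals:
            return 2
        if self.warnings:
            return 1
        return 0


code = ValidationReport(warnings=["truncated snippet in 01-naming.yaml"]).exit_code()
```

Letting criticals dominate warnings means CI can fail hard on exit code 2 while treating 1 as a soft, reviewable signal.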
+
+## Implementation Checklist
+
+- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
+- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
+- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
+- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
+- [ ] Add .github/workflows/mindmodel-validate.yml
+- [ ] Add .mindmodel/README.md
+- [ ] Add .pre-commit-config.yaml
+
+## Next steps
+
+1. Create the files above in small commits (one micro-task per commit).
+2. Run unit tests for each new module as added.
+3. Open a small PR with the validator + CI + docs; request reviewers to run the validator locally.