diff --git a/.mindmodel/anti-patterns/anti-patterns.md b/.mindmodel/anti-patterns/anti-patterns.md new file mode 100644 index 0000000..65cb59e --- /dev/null +++ b/.mindmodel/anti-patterns/anti-patterns.md @@ -0,0 +1,127 @@ +--- +title: Anti-Patterns in Stemwijzer +category: anti-patterns +severity: critical +--- + +# Anti-Patterns + +> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details. + +## CRITICAL: print() Instead of Logging + +**File**: `api_client.py` +**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)` + +**Broken code**: +```python +def get_motions(self, ...): + try: + # ... + print(f"Fetched {len(voting_records)} voting records from API") # BAD + print(f"Processed into {len(motions)} unique motions") # BAD + except Exception as e: + print(f"Error fetching motions from API: {e}") # BAD - no traceback +``` + +**Fix**: +```python +import logging + +_logger = logging.getLogger(__name__) + +def get_motions(self, ...): + try: + _logger.info("Fetched %d voting records from API", len(voting_records)) + _logger.info("Processed into %d unique motions", len(motions)) + except Exception as e: + _logger.exception("Error fetching motions from API: %s", e) + return [] +``` + +--- + +## CRITICAL: Global `_DummySt` Replacement + +**File**: `explorer.py` +**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement + +**Problem**: Creates a module-level variable `st` that shadows `streamlit` module, causing subtle bugs. + +**Fix**: Use conditional flags instead of global replacement: +```python +# GOOD: Use conditional logic +try: + import plotly.express as px + import plotly.graph_objects as go + HAS_PLOTLY = True +except ImportError: + HAS_PLOTLY = False + px = None + go = None + +def render_chart(data): + if not HAS_PLOTLY: + _logger.warning("Plotly not available") + return + # ... 
rest of chart logic +``` + +--- + +## WARNING: Logger Naming Inconsistency + +**Evidence**: 16 files use `logger`, 17 files use `_logger` + +**Files with `logger`** (without underscore): +- api_client.py, ai_provider.py, pipeline files, analysis files + +**Files with `_logger`** (with underscore): +- database.py, explorer.py, explorer_helpers.py + +**Recommendation**: Standardize on `_logger` for module-level loggers. + +--- + +## WARNING: Bare except with pass + +**File**: `database.py`, line 47 + +```python +# BAD - catches KeyboardInterrupt, SystemExit, MemoryError +try: + conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") +except: # bare except + pass +``` + +**Fix**: +```python +try: + conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") +except Exception as exc: + _logger.debug("Sequence creation skipped: %s", exc) +``` + +--- + +## INVESTIGATED: Entity-ID / Party-Name Mismatch + +**Status**: INVALID - investigated and resolved + +**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists. + +--- + +## Pattern: Three Separate Party Alias Dictionaries + +**Problem**: Party name variations exist in 3+ places with no canonical alias mapping. + +**Fix**: Create one `PARTY_ALIASES` dict in `config.py`: +```python +PARTY_ALIASES = { + "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"], + "PVV": ["Partij voor de Vrijheid"], + # ... +} +``` diff --git a/.mindmodel/anti-patterns/anti-patterns.yaml b/.mindmodel/anti-patterns/anti-patterns.yaml deleted file mode 100644 index eea2166..0000000 --- a/.mindmodel/anti-patterns/anti-patterns.yaml +++ /dev/null @@ -1,146 +0,0 @@ -# Anti-Patterns - -> ⚠️ **NOTE**: Section 1 below was **investigated and resolved** — it is NOT a bug (see §1 for details). - ---- - -## 1. 
~~CRITICAL: Entity-ID / Party-Name Mismatch in `compute_party_coords`~~ → **INVALID — INVESTIGATED & RESOLVED**
-
-**Investigation Date**: 2026-03-31
-
-**Investigation Summary**: After thorough analysis of the database schema and code, this anti-pattern is **INVALID**. The original concern was based on a false assumption about `svd_vectors.entity_id` containing party names.
-
-**Investigation Findings**:
-1. **`svd_vectors` table has NO rows with `entity_type='party'`** — only `mp` and `motion` entity types exist in practice.
-2. **`entity_id` values in `svd_vectors` are always MP names** (e.g., `"Van Dijk, I."`), never party names. The party centroids are correctly computed via `mp_metadata` lookups.
-3. **The trajectories plot WORKS correctly** — no production bug exists. The code path for party-level visualization does not rely on `svd_vectors.entity_id` containing party names.
-
-**Conclusion**: The original anti-pattern was a false positive caused by incorrect assumptions about data contents. The `party_map` reverse-lookup (`mp_name → party_name`) works correctly because `entity_id` values are always MP names, not party names.
-
----
-
-## 2. Bare `except: pass`
-
-**File**: `database.py`, line 47
-
-**Problem**: Catches **all** exceptions including `KeyboardInterrupt`, `SystemExit`, `MemoryError`.
-Silently swallows errors — no logging, no fallback.
-
-**Broken code**:
-```python
-try:
-    self.conn.execute(sql)
-except:  # ← bare except
-    pass
-```
-
-**Fix**:
-```python
-try:
-    self.conn.execute(sql)
-except duckdb.Error as e:  # narrow catch for DuckDB-side failures
-    st.warning(f"Query failed: {e}")
-    raise  # or return a default
-```
-
----
-
-## 3. Nested Exception Handling
-
-**File**: `explorer.py`, lines 244–261
-
-**Problem**: Try/except inside try/except creates opaque error paths. Inner exception silently swallows outer intent.
-
-**Broken code**:
-```python
-try:
-    result = compute_svd(motions)
-    # ...
-except Exception: - try: - # Try fallback approach - result = fallback_compute(motions) - except Exception: - pass # ← both exceptions silently dropped -``` - -**Fix**: Flatten — handle each case explicitly, or use a decorator. - ---- - -## 4. Catch-All `Exception` Used Everywhere - -**Problem**: `except Exception:` catches 50+ exception types including `ValueError`, `TypeError`, `KeyError`. -Overly broad — masks real bugs. - -**Occurrence**: 850+ instances of bare/generic exception handlers across codebase. - -**Fix**: Catch specific exceptions. If you must catch multiple, chain them: -```python -except (KeyError, ValueError) as e: - logger.warning(f"Missing field: {e}") -``` - ---- - -## 5. No `entity_id` Format Validation - -**Problem**: `svd_vectors.entity_id` can be either: -- An MP name (e.g., `"Van Dijk, I."`) for individual-level SVD -- A party name (e.g., `"GroenLinks-PvdA"`) for party-level SVD - -No validation distinguishes which is which. Code must infer from context. (Note: In practice `svd_vectors.entity_id` only contains MP names — see §1 for investigation findings.) - -**Fix**: Add explicit format marker or separate columns: -```python -# Option A: separate columns -svd_vectors = pd.DataFrame({ - 'mp_name': [...], # nullable - 'party_name': [...], # nullable - 'window': [...], - 'vector_2d': [...] -}) - -# Option B: format prefix -# "mp:Van Dijk, I." or "party:GroenLinks-PvdA" -``` - ---- - -## 6. Silent Fallback When Party Centroids Fail - -**Problem**: If `party_map` lookup fails (entity is a party, not MP), the code silently produces -`party_map_count: 0` and empty `parties_with_centroid_counts`. No warning is raised. - -**Fix**: Add validation and warning: -```python -if party_map_count == 0: - st.warning(f"No party mappings found for {len(svd_df)} entities in window '{window}'") -``` - ---- - -## 7. 
Three Separate Party Alias Dictionaries (No Single Source of Truth) - -**Problem**: Party name variations exist in 3+ places: -- `PARTY_COLOURS` keys -- `party_map` values (from `mp_party_history`) -- Raw data column values - -No canonical alias mapping. Spelling mismatches cause silent failures. - -**Fix**: Create one `PARTY_ALIASES` dict in `config.py`: -```python -PARTY_ALIASES = { - "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"], - "PVV": ["Partij voor de Vrijheid"], - ... -} - -def resolve_party(name: str) -> str: - """Normalize any party name variant to canonical form.""" - for canonical, aliases in PARTY_ALIASES.items(): - if name in aliases or name == canonical: - return canonical - return name # no alias found -``` diff --git a/.mindmodel/constraints/01-naming.yaml b/.mindmodel/constraints/01-naming.yaml deleted file mode 100644 index ffd301f..0000000 --- a/.mindmodel/constraints/01-naming.yaml +++ /dev/null @@ -1,34 +0,0 @@ -# Naming & Style Conventions - -## Rules -- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py -- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py) -- Classes: PascalCase. Evidence: MotionDatabase (database.py) -- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred) -- Imports order: stdlib, third-party, local; prefer absolute imports and grouped. -- Use black, ruff, isort, mypy as the recommended toolchain; repository lacks config files (black, ruff, pyproject sections). - -## Examples - -### Function example (from pipeline/run_pipeline.py) -```python -def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]: - """Return list of (window_id, start_str, end_str) tuples.""" -``` - -### Class example (from database.py) -```python -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - ... 
-``` - -## Anti-patterns -- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files. - -## Remediations -- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run ruff/black CI lint step. - -## Evidence pointers -- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120) -- database.py: MotionDatabase class and methods (file database.py lines 1-400+) diff --git a/.mindmodel/constraints/10-db-schema.yaml b/.mindmodel/constraints/10-db-schema.yaml deleted file mode 100644 index 535dd48..0000000 --- a/.mindmodel/constraints/10-db-schema.yaml +++ /dev/null @@ -1,74 +0,0 @@ -# Database Schema (DuckDB) — extracted DDL - -## Rules -- Use DuckDB for persistent storage when available; fallback to JSON files when duckdb is not installed (database.py). -- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS used in database.py). - -## Examples (DDL snippets extracted from database.py) - -### motions table -```sql -CREATE TABLE IF NOT EXISTS motions ( - id INTEGER DEFAULT nextval('motions_id_seq'), - title TEXT NOT NULL, - description TEXT, - date DATE, - policy_area TEXT, - voting_results JSON, - winning_margin FLOAT, - controversy_score FLOAT, - layman_explanation TEXT, - externe_identifier TEXT, - body_text TEXT, - url TEXT UNIQUE, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) -) -``` - -### mp_votes table -```sql -CREATE TABLE IF NOT EXISTS mp_votes ( - id INTEGER DEFAULT nextval('mp_votes_id_seq'), - motion_id INTEGER NOT NULL, - mp_name TEXT NOT NULL, - party TEXT, - vote TEXT NOT NULL, - date DATE, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) -) -``` - -### embeddings / fused_embeddings -```sql -CREATE TABLE IF NOT EXISTS embeddings ( - id INTEGER DEFAULT nextval('embeddings_id_seq'), - motion_id INTEGER NOT NULL, - model TEXT, - vector JSON NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) -) - 
-CREATE TABLE IF NOT EXISTS fused_embeddings ( - id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), - motion_id INTEGER NOT NULL, - window_id TEXT NOT NULL, - vector JSON NOT NULL, - svd_dims INTEGER NOT NULL, - text_dims INTEGER NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) -) -``` - -## Anti-patterns -- Broad try/except around duckdb import (database.py top) — acceptable for optional dependency but should log explicitly the missing dependency and document test behavior. - -## Remediations -- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically. -- Add tests that exercise both duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80). - -## Evidence pointers -- database.py: DDL strings and sequences (file: database.py lines ~1-300 and further). See create table blocks for motions, mp_votes, embeddings, fused_embeddings. diff --git a/.mindmodel/constraints/20-domain-glossary.yaml b/.mindmodel/constraints/20-domain-glossary.yaml deleted file mode 100644 index 43bbc7e..0000000 --- a/.mindmodel/constraints/20-domain-glossary.yaml +++ /dev/null @@ -1,22 +0,0 @@ -# Domain Glossary - -## Rules -- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id. - -## Terms -- Motion: parliamentary motion stored in `motions` table. Evidence: database.py CREATE TABLE motions (file: database.py lines ~40-110) -- MP (Member of Parliament): individual with votes stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes -- Embedding: text embedding stored in `embeddings` table; fused vectors in `fused_embeddings`. -- SVD vector: reduced-dimensional vectors stored in `svd_vectors` table. -- Window: time window identifier (e.g., "2024-Q1") used across SVD/fusion pipelines. 
Evidence: pipeline/run_pipeline.py _generate_windows -- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score - -## Examples / Usage -- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120 - -## Evidence pointers -- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py) -- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py) - -## Anti-patterns -- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations. diff --git a/.mindmodel/constraints/30-clusters.yaml b/.mindmodel/constraints/30-clusters.yaml deleted file mode 100644 index c12c29f..0000000 --- a/.mindmodel/constraints/30-clusters.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Code Clusters / Organization - -## Rules -- The repository organizes code into the following clusters (observed): - - UI / Streamlit: Home.py, pages/, app.py, explorer.py - - Database & persistence: database.py, config.py - - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion) - - AI provider & summarization: ai_provider.py, pipeline/..., analysis/ - - Similarity & caching: similarity/*, similarity_cache table in DB - - API client & scraping: api_client.py, pipeline/fetch_mp_metadata - - Analysis & visualization: analysis/visualize.py, explorer.py - - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py - - Tests & migrations: tests/ (pytest) and database reset helpers - -## Examples - -### Pipeline orchestrator (cluster: CLI & pipeline) -```python -from database import MotionDatabase -db = MotionDatabase(db_path) -# then phases: fetch_mp_metadata, 
extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window -``` - -## Remediations -- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about optional duckdb dependency and JSON fallback for tests. - -## Evidence pointers -- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py) -- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py) -- analysis/visualize.py: visualization cluster (file: analysis/visualize.py) diff --git a/.mindmodel/constraints/40-patterns.yaml b/.mindmodel/constraints/40-patterns.yaml deleted file mode 100644 index eaeed6f..0000000 --- a/.mindmodel/constraints/40-patterns.yaml +++ /dev/null @@ -1,46 +0,0 @@ -# Design Patterns & Code Patterns - -## Rules -- Use repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management. -- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and local fallback. -- Pipeline orchestration: run_pipeline.py uses phases, ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (collect results before writes). 
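
The retry/backoff rule above can be sketched as follows. This is an illustrative, generic helper, not the actual `_post_with_retries` from ai_provider.py: the name `post_with_retries`, its parameters, and the exponential-jitter schedule are assumptions.

```python
import logging
import random
import time

_logger = logging.getLogger(__name__)


def post_with_retries(send, retries: int = 3, base_delay: float = 1.0):
    """Call `send()` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return send()
        except ConnectionError as exc:
            if attempt == retries:
                raise
            # Double the delay each attempt and add a little jitter.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            _logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)
```

Per the example signature shown below, the real `_post_with_retries` additionally handles HTTP 429 with `Retry-After` and 5xx responses; this sketch shows only the generic backoff skeleton.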
- -## Examples - -### Repository pattern (database.py MotionDatabase) -```python -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._init_database() - - def insert_motion(self, motion_data: Dict) -> bool: - """Insert a new motion into database""" - # uses duckdb.connect and parameterized queries -``` - -### Provider adapter with retries (ai_provider.py) -```python -def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response: - # Implements retries/backoff, handles 429 with Retry-After and 5xx responses -``` - -### Pipeline parallelism pattern (run_pipeline) -```python -with ThreadPoolExecutor(max_workers=max_workers) as pool: - for window_id, w_start, w_end in windows: - fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k) - futures[fut] = window_id -# wait then write sequentially to DuckDB -``` - -## Anti-patterns -- Broad excepts used in several places (database.py top-level try/except on duckdb import, many generic excepts around DB operations) — can hide real errors. - -## Remediations -- Replace broad except Exception with targeted exceptions and explicit logging. Where fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with clear message and include guidance in CONTRIBUTING.md. 
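
The optional-duckdb remediation above can be sketched like this; `HAS_DUCKDB` is an illustrative flag name, not an existing symbol in database.py:

```python
import logging

_logger = logging.getLogger(__name__)

# Targeted except around the optional import, with an explicit log message
# instead of a silent broad `except Exception`.
try:
    import duckdb
    HAS_DUCKDB = True
except ImportError as exc:
    duckdb = None
    HAS_DUCKDB = False
    _logger.info("duckdb not installed; using JSON fallback storage: %s", exc)
```

Call sites can then branch on `HAS_DUCKDB` rather than re-importing or catching `ImportError` at each use.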
- -## Evidence pointers -- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300) -- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260) -- database.py: MotionDatabase methods (file: database.py) diff --git a/.mindmodel/constraints/50-anti-patterns.yaml b/.mindmodel/constraints/50-anti-patterns.yaml deleted file mode 100644 index 00b5182..0000000 --- a/.mindmodel/constraints/50-anti-patterns.yaml +++ /dev/null @@ -1,24 +0,0 @@ -# Anti-patterns, Issues and Recommended Fixes - -## Rules -- Flagged issues discovered in Phase 1 must be remediated with concrete actions. - -## Issues -- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml -- openai is declared but static imports not found; may be unused. Evidence: pyproject.toml, ai_provider.py uses requests and env keys instead of openai imports. -- Many dependencies use permissive ">=" version ranges; no lockfile present. This reduces reproducibility. -- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps. -- Broad except Exception used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging. - -## Remediations / Recommended fixes -- Move pytest from runtime dependencies to dev-dependencies in pyproject.toml. - - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies] depending on toolchain. -- Audit `openai` usage. If unused, remove from pyproject.toml. If dynamically imported in runtime, add a small shim or explicit lazy import with documented env var. -- Pin critical dependencies or add upper bounds; generate lockfile (poetry.lock or pip-tools requirements.txt). Add CI job that fails on permissive ranges. 
-- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add CI lint stage. -- Replace broad except Exception with narrower catches and re-raise or log with traceback when unexpected. Example locations: database.py top import, insert_motion broad except, ai_provider fallback blocks. - -## Evidence pointers -- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40) -- database.py: multiple broad except blocks (file: database.py top and methods) -- ai_provider.py: uses requests + env keys (file: ai_provider.py) diff --git a/.mindmodel/constraints/60-examples.yaml b/.mindmodel/constraints/60-examples.yaml deleted file mode 100644 index d1f7027..0000000 --- a/.mindmodel/constraints/60-examples.yaml +++ /dev/null @@ -1,117 +0,0 @@ -# Example Extractions - -## Rules -- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions. - -## (a) Function signatures with docstrings (5 examples) -1) pipeline/run_pipeline.py::_generate_windows -```python -def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]: - """Return list of (window_id, start_str, end_str) tuples. - - window_id format: - quarterly → "2024-Q1", "2024-Q2", … - annual → "2024" - """ -``` - -2) database.py::append_audit_event -```python -def append_audit_event( - self, - actor_id: Optional[str], - action: str, - target_type: Optional[str] = None, - target_id: Optional[str] = None, - metadata: Optional[Dict] = None, -) -> bool: - """Record an audit event. Tries DB then falls back to ledger file.""" -``` - -3) ai_provider.py::get_embedding -```python -def get_embedding(text: str, model: str | None = None) -> list[float]: - """Return an embedding vector for `text` using the configured provider. - - Raises ProviderError for configuration or provider-side failures. 
- """ -``` - -4) ai_provider.py::get_embeddings_batch -```python -def get_embeddings_batch( - texts: list[str], model: str | None = None, batch_size: int = 50 -) -> list[list[float]]: - """Return embedding vectors for multiple texts using batched API calls.""" -``` - -5) analysis/visualize.py::plot_umap_scatter -```python -def plot_umap_scatter( - motion_ids: List[int], - coords: List[List[float]], - labels: Optional[List[int]] = None, - window_id: Optional[str] = None, - output_path: str = "analysis_umap.html", -) -> str: - """Produce a 2D scatter plot of UMAP-reduced fused embeddings.""" -``` - -## (b) SQL / DDL snippets (3 examples inferred from database.py) -1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110) - -2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes - -3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings - -## (c) Pytest stubs (4 sample tests matching conventions) -Create tests under tests/ named test_*.py using fixtures in conftest.py. Examples below are stubs to add. 
- -1) tests/test_database_basic.py -```python -def test_init_database_creates_tables(tmp_path): - db_path = str(tmp_path / "motions.db") - from database import MotionDatabase - - db = MotionDatabase(db_path=db_path) - # If duckdb not available, JSON fallback should create .embeddings.json - assert db is not None -``` - -2) tests/test_ai_provider.py -```python -def test_local_embedding_fallback(): - from ai_provider import _local_embedding - - v = _local_embedding("hello world", dim=16) - assert isinstance(v, list) and len(v) == 16 -``` - -3) tests/test_pipeline_windows.py -```python -from pipeline.run_pipeline import _generate_windows - -def test_generate_quarterly_windows(): - from datetime import date - - start = date(2024, 1, 1) - end = date(2024, 3, 31) - windows = _generate_windows(start, end, "quarterly") - assert any(w[0].endswith("Q1") for w in windows) -``` - -4) tests/test_visualize_plot.py -```python -def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path): - # If plotly missing, function should raise ImportError with guidance - import analysis.visualize as vis - - try: - vis._require_plotly() - except ImportError: - assert True -``` - -## Evidence pointers -- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py -- DDL: database.py create table blocks diff --git a/.mindmodel/constraints/99-stack.yaml b/.mindmodel/constraints/99-stack.yaml deleted file mode 100644 index 034f664..0000000 --- a/.mindmodel/constraints/99-stack.yaml +++ /dev/null @@ -1,43 +0,0 @@ -# Stack and Dependencies - -## Rules -- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13") -- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile -- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py -- ML: scikit-learn, umap-learn, scipy. 
Evidence: pyproject.toml, pipeline/svd.py, analysis/ - -## Examples - -### pyproject dependencies (evidence: pyproject.toml) -```toml -dependencies = [ - "duckdb>=1.3.2", - "ibis-framework[duckdb]>=10.8.0", - "openai>=1.99.7", - "scipy>=1.11", - "umap-learn>=0.5", - "plotly>=5.0", - "pytest>=9.0.2", - "requests>=2.32.4", - "schedule>=1.2.2", - "streamlit>=1.48.0", - "scikit-learn>=1.8.0", - "beautifulsoup4>=4.14.3", - "lxml>=6.0.2", -] -``` - -## Anti-patterns / Notes -- pytest is listed under runtime dependencies in pyproject.toml (line: dependencies). Move pytest to dev-dependencies to avoid shipping test runner in production images. Evidence: pyproject.toml -- Many dependencies use permissive ">=" ranges. Recommend pinning or generating lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility. -- openai appears declared but static imports not found; possible unused dependency (evidence: pyproject.toml, ai_provider.py uses requests and environment keys instead of openai). - -## Remediations -- Move test-only libs (pytest) to dev-dependencies in pyproject.toml. -- Add lockfile and CI step to check for pinned dependencies. -- Audit declared but unused packages (openai) and remove or confirm dynamic usage. - -## Evidence pointers -- pyproject.toml: full dependency list (lines 1-40) -- Home.py: streamlit usage and app entry (file: Home.py) -- database.py: duckdb table creation and connection (file: database.py lines ~1-350) diff --git a/.mindmodel/constraints/db_connection.yaml b/.mindmodel/constraints/db_connection.yaml deleted file mode 100644 index 52ed6a5..0000000 --- a/.mindmodel/constraints/db_connection.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# DB connection handling constraints - -rules: - - name: use_context_managers_for_connections - rule: "Prefer using 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible." - rationale: "Ensures proper resource cleanup and avoids connection leaks." 
- - - name: read_only_for_compute - rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)." - rationale: "Allows safe parallel workers and reduces write contention." - - - name: short_lived_writes - rule: "When performing database writes, open short-lived connections, commit quickly and close." - rationale: "Avoids long-lived transactions and reduces lock windows." - -examples: - - path: pipeline/svd_pipeline.py - snippet: | - conn = duckdb.connect(db_path, read_only=True) - try: - rows = conn.execute(...).fetchall() - finally: - conn.close() - -anti_patterns_and_remediations: - - bad: "Creating a global connection at import that performs migrations." - remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time." - - bad: "Not closing connections on exceptions." - remediation: "Wrap connects in `with` or finally: conn.close() blocks." diff --git a/.mindmodel/constraints/error-handling.md b/.mindmodel/constraints/error-handling.md new file mode 100644 index 0000000..9d0c75d --- /dev/null +++ b/.mindmodel/constraints/error-handling.md @@ -0,0 +1,143 @@ +--- +title: Error Handling Patterns +category: constraints +severity: high +--- + +# Error Handling Patterns + +## Core Rules + +1. **Catch `Exception`, return safe fallbacks** (False/[]/None) +2. **Log exceptions with traceback** using `_logger.exception()` +3. **Never swallow exceptions silently** - always log or return sensible default +4. **Avoid nested try/except blocks** - flatten exception handling + +## Pattern: Try/Except Safe Fallback + +This is the dominant pattern in the codebase (219+ instances). + +```python +# Standard pattern from database.py, api_client.py, etc. 
+try: + result = risky_operation() + return process(result) +except Exception as exc: + _logger.warning("Operation failed: %s", exc) + return safe_fallback # False, [], None, {} +``` + +### Examples from Codebase + +**database.py** - DuckDB operations: +```python +def get_svd_vectors(self, window: str): + try: + conn = duckdb.connect(self.db_path, read_only=True) + try: + result = conn.execute(query, (window,)).fetchall() + return self._parse_vectors(result) + finally: + conn.close() + except Exception as exc: + _logger.warning("Failed to get SVD vectors: %s", exc) + return [] +``` + +**ai_provider.py** - HTTP retries: +```python +try: + resp = requests.post(url, json=json, headers=headers, timeout=10) + resp.raise_for_status() + return resp.json() +except requests.ConnectionError as exc: + if attempt == retries: + raise ProviderError(f"Connection error: {exc}") from exc + # ... retry logic +``` + +## Pattern: Optional Dependency Fallback + +Gracefully degrade when optional packages are unavailable. + +```python +# UMAP fallback in explorer_helpers.py +try: + import umap + HAS_UMAP = True +except ImportError: + HAS_UMAP = False + _logger.debug("UMAP not available, using SVD vectors directly") + +def project_to_2d(vectors): + if HAS_UMAP: + return umap.UMAP().fit_transform(vectors) + return vectors[:, :2] # Fallback: first 2 SVD dimensions +``` + +## Anti-Patterns + +### 1. Bare except with pass (CRITICAL) +**File**: `database.py`, line 47 + +```python +# BAD - catches KeyboardInterrupt, SystemExit, MemoryError +try: + conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") +except: # bare except + pass +``` + +**Fix**: Catch specific exception or log and continue: +```python +try: + conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") +except Exception as exc: + _logger.debug("Sequence creation skipped (may already exist): %s", exc) +``` + +### 2. 
Nested Exception Handling
+**File**: `explorer.py`, lines 244-261
+
+```python
+# BAD - opaque error paths
+try:
+    result = compute_svd(motions)
+except Exception:
+    try:
+        result = fallback_compute(motions)
+    except Exception:
+        pass  # Both exceptions silently dropped
+```
+
+**Fix**: Flatten and handle each case explicitly:
+```python
+# GOOD - flattened: no try block nested inside another handler
+result = None
+try:
+    result = compute_svd(motions)
+except Exception as exc:
+    _logger.warning("SVD failed, trying fallback: %s", exc)
+
+if result is None:
+    try:
+        result = fallback_compute(motions)
+    except Exception as fallback_exc:
+        _logger.error("Fallback compute also failed: %s", fallback_exc)
+        raise
+```
+
+## Rule Summary
+
+| Pattern | When to Use | Return Value |
+|---------|-------------|--------------|
+| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
+| Re-raise | Critical operations that must succeed | raise |
+| Log and continue | Optional steps in pipeline | (continue) |
+| Graceful degradation | Optional dependencies | Default behavior |
+
+## When to Log vs Return
+
+| Scenario | Action |
+|----------|--------|
+| User action fails | Log warning, return safe default |
+| Internal error (corrupt data) | Log error, return safe default |
+| Transient failure (network) | Log warning, retry if appropriate |
+| Configuration error | Log error, raise with clear message |
diff --git a/.mindmodel/constraints/error-handling.yaml b/.mindmodel/constraints/error-handling.yaml
deleted file mode 100644
index 747a3c9..0000000
--- a/.mindmodel/constraints/error-handling.yaml
+++ /dev/null
@@ -1,184 +0,0 @@
-# Error Handling Constraints
-
-## Core Rule
-
-**Catch `Exception`, return safe fallbacks (False/[]/None)**
-
-Never let exceptions propagate to user-facing code. Always provide a safe default. 
- -## Patterns - -### For Not-Found Operations - -Return `None` or falsy value when item not found: - -```python -# GOOD: Return None on not found -def get_motion_by_id(self, motion_id: int) -> Optional[Dict]: - try: - conn = duckdb.connect(self.db_path) - result = conn.execute( - "SELECT * FROM motions WHERE id = ?", (motion_id,) - ).fetchone() - conn.close() - return result - except Exception: - conn.close() - return None -``` - -### For Collection Operations - -Return empty list when no results: - -```python -# GOOD: Return empty list on failure -def get_filtered_motions(self, **kwargs) -> List[Dict]: - try: - conn = duckdb.connect(self.db_path) - rows = conn.execute(query, params).fetchall() - conn.close() - return rows - except Exception: - conn.close() - return [] -``` - -### For Boolean Operations - -Return `False` for failed boolean checks: - -```python -# GOOD: Return False on failure -def motion_exists(self, motion_id: int) -> bool: - try: - conn = duckdb.connect(self.db_path) - count = conn.execute( - "SELECT COUNT(*) FROM motions WHERE id = ?", (motion_id,) - ).fetchone()[0] - conn.close() - return count > 0 - except Exception: - return False -``` - -### For Creation Operations - -Return `False` or empty string on failure: - -```python -# GOOD: Return empty string on failure -def generate_summary(self, title: str, body: str) -> str: - try: - return ai_provider.chat_completion(messages) - except ai_provider.ProviderError: - logger.exception("AI provider failed") - return "" -``` - -## Anti-Patterns to Avoid - -### Don't Catch Specific Exceptions Only -```python -# BAD: Catches only FileNotFoundError, misses other issues -try: - with open(path) as f: - return json.load(f) -except FileNotFoundError: - return None -``` - -### Don't Re-raise Without Context -```python -# BAD: Loses information -try: - process(data) -except Exception: - raise # No context added -``` - -### Don't Swallow Exceptions Silently -```python -# BAD: No logging, no fallback -try: - 
return risky_operation() -except Exception: - pass # What happened? -``` - -## Nested Exception Handling - -When calling code that has its own error handling, wrap only if needed: - -```python -# Accept result from wrapped function (it handles errors) -def fetch_motions(self, start_date): - # ai_provider_wrapper handles retries internally - embeddings = get_embeddings_with_retry(texts) - - # Only wrap if wrapper doesn't handle errors - if all(e is None for e in embeddings): - logger.error("All embeddings failed") - return [] - - return process(embeddings) -``` - -## Context Managers - -Use `try/finally` for cleanup: - -```python -def process_with_temp_file(self): - temp = NamedTemporaryFile(delete=False) - try: - temp.write(data) - temp.close() - return process_file(temp.name) - finally: - os.unlink(temp.name) - temp.close() -``` - -## When to Log vs Return - -| Scenario | Action | -|----------|--------| -| User action fails | Log warning, return safe default | -| Internal error (corrupt data) | Log error, return safe default | -| Transient failure (network) | Log warning, retry if appropriate | -| Configuration error | Log error, raise with clear message | - -## Exception Propagation - -Only raise exceptions for: -1. Configuration/setup errors (missing required env vars) -2. Programming errors (invalid arguments) -3. 
Fatal system errors (database corruption) - -```python -# GOOD: Raise for configuration errors -def _get_api_key(self) -> str: - key = os.environ.get("OPENROUTER_API_KEY") - if not key: - raise ProviderError( - "OPENROUTER_API_KEY environment variable is required" - ) - return key -``` - -## Logging Errors - -Always include context: - -```python -# GOOD: Include relevant context -_logger.error( - "Failed to fetch motion %d: %s", - motion_id, - exc -) - -# BAD: No context -_logger.error("Failed to fetch") -``` diff --git a/.mindmodel/constraints/error_handling.yaml b/.mindmodel/constraints/error_handling.yaml deleted file mode 100644 index 2f95936..0000000 --- a/.mindmodel/constraints/error_handling.yaml +++ /dev/null @@ -1,36 +0,0 @@ -# Error handling style rules (YAML constraint example) - -rules: - - name: explicit_exceptions - rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values." - examples: - - good: | - if not isinstance(text, str): - raise ProviderError('text must be a string') - - bad: | - if not isinstance(text, str): - return [] - - - name: avoid_broad_except - rule: "Avoid 'except Exception:' that swallows errors. If broad except is used for best-effort, log the exception with logger.exception and re-raise or convert." - examples: - - bad: | - try: - do_work() - except Exception: - return [] - - remediation: | - try: - do_work() - except SpecificError as exc: - logger.warning('Handled error: %s', exc) - raise - - - name: logging_over_print - rule: "Prefer logger.* over print() for messages and errors." - examples: - - bad: "print('Error fetching motions from API: %s' % e)" - - good: "logger.exception('Error fetching motions from API')" - -enforcement_examples: - - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception." 
diff --git a/.mindmodel/constraints/logging.yaml b/.mindmodel/constraints/logging.md similarity index 51% rename from .mindmodel/constraints/logging.yaml rename to .mindmodel/constraints/logging.md index f8c901c..adc7dd6 100644 --- a/.mindmodel/constraints/logging.yaml +++ b/.mindmodel/constraints/logging.md @@ -1,8 +1,47 @@ +--- +title: Logging Constraints +category: constraints +severity: critical +--- + # Logging Constraints ## Core Rule -**Use `logging.getLogger(__name__)` - never use `print()`** +Use `logging.getLogger(__name__)` - never use `print()` + +**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances). + +## CRITICAL Anti-Pattern: print() Instead of Logging + +**File**: `api_client.py` +**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)` + +**Broken code**: +```python +def get_motions(self, ...): + try: + # ... + print(f"Fetched {len(voting_records)} voting records from API") # BAD + print(f"Processed into {len(motions)} unique motions") # BAD + except Exception as e: + print(f"Error fetching motions from API: {e}") # BAD - no traceback +``` + +**Fix**: +```python +import logging + +_logger = logging.getLogger(__name__) + +def get_motions(self, ...): + try: + _logger.info("Fetched %d voting records from API", len(voting_records)) + _logger.info("Processed into %d unique motions", len(motions)) + except Exception as e: + _logger.exception("Error fetching motions from API: %s", e) + return [] +``` ## Logger Initialization @@ -31,6 +70,10 @@ _logger = logging.getLogger(__name__) _logger = logging.getLogger(__name__) ``` +**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention. + +**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish from class-level loggers. 
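One way to act on this recommendation is a quick audit script. The sketch below is hypothetical (the helper `audit_logger_names` is not part of the codebase); it groups modules by which logger name they bind at module level, so files still using `logger` can be found and renamed:

```python
import re
from pathlib import Path

# Matches module-level logger assignments, with or without the underscore prefix.
LOGGER_RE = re.compile(
    r"^(_?logger)\s*=\s*logging\.getLogger\(__name__\)", re.MULTILINE
)

def audit_logger_names(root: str) -> dict[str, list[str]]:
    """Group .py files under ``root`` by the logger name they bind."""
    found: dict[str, list[str]] = {"logger": [], "_logger": []}
    for path in Path(root).rglob("*.py"):
        for match in LOGGER_RE.finditer(path.read_text(encoding="utf-8")):
            found[match.group(1)].append(path.name)
    return found
```

Files listed under `"logger"` are the rename candidates; once the list is empty, the convention is enforced.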
+ ## Log Levels | Level | When to Use | @@ -41,30 +84,6 @@ _logger = logging.getLogger(__name__) | ERROR | Operation failed, may need attention | | CRITICAL | Fatal error, program may crash | -## Examples - -### Good Logging Practice -```python -_logger.info("Pipeline run: %s → %s (%s windows)", start, end, count) -_logger.debug("Batch embedding attempt %d failed: %s", attempt, exc) -_logger.warning("Fallback used for motion %d: %s", motion_id, reason) -_logger.error("Query failed: %s", exc) -``` - -### Bad: Using print() -```python -# BAD - don't use print -print(f"Fetched {len(voting_records)} voting records from API") -print(f"Error fetching motions from API: {e}") -``` - -### Good: Using logger -```python -# GOOD - use logger -_logger.info("Fetched %d voting records from API", len(voting_records)) -_logger.error("Error fetching motions from API: %s", e) -``` - ## Exception Logging Use `_logger.exception()` for caught exceptions (includes traceback): @@ -77,30 +96,6 @@ except Exception as exc: return fallback_value ``` -Use `_logger.error()` with explicit exception for controlled errors: - -```python -try: - result = risky_operation() -except Exception as exc: - _logger.error("Operation failed: %s", exc) - return fallback_value -``` - -## Configuration - -Ensure logging is configured in entry points: - -```python -# pipeline/run_pipeline.py -def run(args): - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s %(levelname)s %(name)s: %(message)s", - ) - # ... 
rest of pipeline -``` - ## Anti-Patterns ### Debug Prints in Production Code @@ -117,22 +112,6 @@ _logger.debug("Processing window %s", wid) # BAD - mixing _logger and logger _logger = logging.getLogger(__name__) logger = logging.getLogger("other") # Inconsistent - -# GOOD - use single consistent pattern -_logger = logging.getLogger(__name__) -``` - -### Missing Logger Initialization -```python -# BAD - no logger defined -def some_function(): - logging.getLogger(__name__).info("...") # Redundant calls - -# GOOD - define once at module level -_logger = logging.getLogger(__name__) - -def some_function(): - _logger.info("...") ``` ## Sensitive Data @@ -150,18 +129,3 @@ _logger.info("User %s voted %s", user_id, vote) # GOOD - log aggregates, not individual votes _logger.info("Vote recorded for session %s", session_id[:8]) ``` - -## Structured Logging - -For complex data, use structured logging: - -```python -_logger.info( - "Motion processed", - extra={ - "motion_id": motion_id, - "policy_area": policy_area, - "processing_time_ms": elapsed_ms, - } -) -``` diff --git a/.mindmodel/dependencies/dependencies.md b/.mindmodel/dependencies/dependencies.md new file mode 100644 index 0000000..49c7ba9 --- /dev/null +++ b/.mindmodel/dependencies/dependencies.md @@ -0,0 +1,92 @@ +--- +title: Dependencies and Library Usage +category: dependencies +--- + +# Dependencies and Library Usage + +## Core Dependencies + +### duckdb +- **Required**: Yes +- **Fallback**: None (core functionality) +- **Usage**: SQL database for motions, embeddings, SVD vectors +- **Files**: database.py, analysis/*.py, pipeline/*.py + +### streamlit +- **Required**: Yes +- **Fallback**: None +- **Usage**: Web UI framework +- **Files**: app.py, pages/*.py, explorer.py + +### requests +- **Required**: Yes +- **Fallback**: None +- **Usage**: HTTP client for API calls +- **Files**: api_client.py, ai_provider.py + +### plotly +- **Required**: Yes +- **Fallback**: None (raises ImportError) +- **Usage**: Interactive 
charts for explorer +- **Files**: explorer.py, explorer_helpers.py + +## Optional Dependencies + +### umap-learn +- **Required**: No +- **Fallback**: Use raw SVD vectors (first 2 dimensions) +- **Usage**: Dimensionality reduction for visualization +- **Files**: analysis/clustering.py + +### matplotlib +- **Required**: No +- **Fallback**: Plotly or raw output +- **Usage**: Static charting +- **Files**: Various analysis scripts + +## ML Dependencies + +### sklearn +- **Required**: Yes +- **Usage**: KMeans clustering, cosine_similarity, StandardScaler +- **Files**: analysis/clustering.py, similarity/compute.py + +### scipy +- **Required**: Yes +- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment +- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py + +### numpy +- **Required**: Yes +- **Usage**: Array operations, linear algebra +- **Files**: Throughout codebase + +## Key Imports by File + +### explorer.py +- `import streamlit as st` +- `from database import db` +- `from explorer_helpers import *` + +### explorer_helpers.py +- `import pandas as pd` +- `import plotly.graph_objects as go` +- `from database import db` (optional, for type hints) + +### database.py +- `import ibis` +- `import duckdb` +- `from config import config, PARTY_COLOURS` + +### config.py +- `from dataclasses import dataclass, field` +- `import streamlit as st` (optional, for warnings) + +## Singleton Instances + +| Module | Instance | Type | +|--------|----------|------| +| `database.py` | `db` | `MotionDatabase` | +| `config.py` | `config` | `Config` (dataclass) | +| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | diff --git a/.mindmodel/dependencies/dependencies.yaml b/.mindmodel/dependencies/dependencies.yaml deleted file mode 100644 index ff0af26..0000000 --- a/.mindmodel/dependencies/dependencies.yaml +++ /dev/null @@ -1,78 +0,0 @@ -# Dependencies - -## Core Library Wiring - -### Database Layer -``` -ibis → DuckDB → MotionDatabase singleton (database.py) - ↑ - 
sqlglot (ibis dependency) -``` - -### Data Processing -``` -pandas → (used throughout for DataFrame operations) -numpy → (used by sklearn, scipy, umap) -scipy → spatial.procrustes for window alignment -``` - -### ML Pipeline -``` -sklearn.cluster → KMeans, Procrustes -sklearn.preprocessing → StandardScaler -umap → UMAP (optional, graceful fallback) -``` - -### Visualization -``` -plotly → explorer_helpers.py chart builders -st.plotly_chart → explorer.py rendering -``` - -### Streamlit -``` -streamlit → all pages, @st.cache_data decorators -``` - -## Optional Dependencies -| Package | Required | Fallback | -|---------|----------|----------| -| `umap` | No | Use raw SVD vectors (first 2 dims) | -| `plotly` | Yes | Raises ImportError | -| `duckdb` | Yes | — | -| `ibis` | Yes | — | -| `sklearn` | Yes | — | - -## Singleton Instances -| Module | Instance | Type | -|--------|----------|------| -| `database.py` | `db` | `MotionDatabase` | -| `config.py` | `config` | `Config` (dataclass) | -| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | - -## Key Imports by File -``` -explorer.py: - - import streamlit as st - - from database import db - - from explorer_helpers import * - -explorer_helpers.py: - - import pandas as pd - - import plotly.graph_objects as go - - from database import db (optional, for type hints) - -database.py: - - import ibis - - import duckdb - - from config import config, PARTY_COLOURS - -config.py: - - from dataclasses import dataclass, field - - import streamlit as st (optional, for warnings) -``` - -## Environment -- Python ≥3.13 -- Environment variables via `.env` (DB path, API keys) -- No `.env` values in constraint files (security) diff --git a/.mindmodel/domain/domain-glossary.md b/.mindmodel/domain/domain-glossary.md new file mode 100644 index 0000000..9da8f9b --- /dev/null +++ b/.mindmodel/domain/domain-glossary.md @@ -0,0 +1,146 @@ +--- +title: Domain Glossary +category: domain +--- + +# Domain Glossary - Dutch Political Terms + +## CRITICAL 
INVARIANTS + +> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes +> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT +> - Individual right-wing parties may vary slightly from the centroid +> - This is non-negotiable for any compass/axis visualization + +> **Rule 2**: SVD labels are empirically derived from voting data +> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion +> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative) +> - See SVD Label Derivation section below + +--- + +## SVD Label Derivation + +### The Process + +SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis: + +1. **Identify outliers**: Find the two MPs with most extreme positions on that axis +2. **Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes) +3. **Interpret theme**: Read the motion titles to derive what the axis represents +4. 
**Assign label**: Label describes the empirical theme, could be: + - Left-Right + - Coalition-Opposition + - Progressive-Conservative + - EU-National sovereignty + - Populist-Establishment + - Or whatever the voting patterns show + +### Example + +| Step | Description | +|------|-------------| +| Outlier A | Wilders (PVV) - extreme positive on Dim 1 | +| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 | +| 20 Motions | Immigration, integration, law & order themes dominate | +| Label | "Links-Rechts" (Left-Right) | + +### Labeling Rules + +- **Never use party names in labels** (e.g., not "PVV-SP axis") +- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show) +- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy") +- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2" + +--- + +## Core Entities + +### Motion / Motie +- Parliamentary motion submitted by MPs +- Fields: `id`, `title`, `date`, `category` +- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** + +### MP / Kamerlid +- Member of Parliament (Tweede Kamerlid) +- Identified by full name (e.g., "Van Dijk, I.") +- Has voting record, party affiliation, SVD position vector + +### Party / Fractie +- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") +- Party centroids: average SVD position of all MPs in party + +### Vote / Stemming +- Individual MP's vote on a motion: +1, 0, -1 +- Aggregated to compute SVD vectors + +--- + +## Time & Analysis Concepts + +### Window / Tijdsvenster +- Time period for analysis (annual or quarterly) +- Values: "2023", "2023-Q1", "2024", etc. 
+- SVD vectors computed per window + +### Trajectory +- MP's position change across multiple windows +- Computed from `svd_vectors` + window ordering + +--- + +## Mathematical / Algorithmic Terms + +### SVD Vector +- 2D vector from Singular Value Decomposition of MP × Motion vote matrix +- Represents MP's position in political space + +### SVD Label +- Empirically derived axis label based on outlier MPs and representative motions +- Describes the theme of disagreement on that axis +- NOT based on party ideology or semantic labels + +### Political Compass +- 2D visualization with SVD axes mapped to compass quadrants +- X-axis: First SVD dimension (labeled from voting data) +- Y-axis: Second SVD dimension (labeled from voting data) + +### Procrustes Alignment +- Algorithm to align SVD vectors across time windows +- Ensures comparable positions across years/quarters + +### UMAP +- Uniform Manifold Approximation and Projection +- Dimensionality reduction for visualization +- Optional dependency with graceful SVD fallback + +--- + +## Database Table Reference + +| Table | Key Fields | +|-------|-----------| +| `motions` | id, title, date, category | +| `mp_votes` | mp_id, motion_id, vote | +| `svd_vectors` | entity_id, window, vector_2d (list[2]) | +| `mp_party_history` | mp_id, party, start_date, end_date | +| `windows` | window_id, start_date, end_date, period_type | +| `mp_trajectories` | mp_id, window, trajectory_vector | + +--- + +## Dutch Political Parties + +### Canonical Right-Wing (centroid on RIGHT of axes) +- PVV (Partij voor de Vrijheid) +- FVD (Forum voor Democratie) +- JA21 +- SGP (Staatkundig Gereformeerde Partij) + +### Other Major Parties +- VVD (Volkspartij voor Vrijheid en Democratie) +- GL-PvdA (GroenLinks-PvdA) +- NSC (Nieuw Sociaal Contract) +- BBB (BoerBurgerBeweging) +- SP (Socialistische Partij) +- D66 (Democraten 66) diff --git a/.mindmodel/domain/domain-glossary.yaml b/.mindmodel/domain/domain-glossary.yaml deleted file mode 100644 index 
dc17896..0000000 --- a/.mindmodel/domain/domain-glossary.yaml +++ /dev/null @@ -1,107 +0,0 @@ -# Domain Glossary - Dutch Political Terms - -## Core Entities - -### Motion / Motie -- Parliamentary motion submitted by MPs -- Fields: `id`, `title`, `date`, `category` -- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** - -### MP / Kamerlid -- Member of Parliament (Tweede Kamerlid) -- Identified by full name (e.g., "Van Dijk, I.") -- Has voting record, party affiliation, SVD position vector -- Historical: `mp_party_history` tracks party changes over time - -### Party / Fractie -- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") -- Party centroids: average SVD position of all MPs in party -- Aliases: multiple spelling variants exist (see anti-patterns.yaml) - -### Vote / Stemming -- Individual MP's vote on a motion: +1, 0, -1 -- Aggregated to compute SVD vectors - ---- - -## Time & Analysis Concepts - -### Window / Tijdsvenster -- Time period for analysis (annual or quarterly) -- Values: "2023", "2023-Q1", "2024", etc. 
-- SVD vectors computed per window -- Windows can be aligned across time using Procrustes - -### Trajectory -- MP's position change across multiple windows -- Computed from `svd_vectors` + window ordering -- Used for trend analysis in Evolution tab - ---- - -## Mathematical / Algorithmic Terms - -### SVD Vector -- 2D vector from Singular Value Decomposition of MP × Motion vote matrix -- Represents MP's position in political space -- `entity_id` in `svd_vectors`: either MP name (when individual MPs) or party name (when party-level) - -### Political Compass -- 2D visualization: X-axis = Left↔Right, Y-axis = Progressive↔Conservative -- SVD vectors mapped to compass quadrants -- UMAP used for projection - -### Procrustes Alignment -- Algorithm to align SVD vectors across time windows -- Ensures comparable positions across years/quarters -- Implemented via `scipy.spatial.procrustes` or scikit-learn - -### Centroid -- Geometric center of a set of points -- Party centroid = average SVD position of all MPs in that party -- Computed from `svd_vectors` filtered by party - -### UMAP -- Uniform Manifold Approximation and Projection -- Dimensionality reduction for visualization -- Optional dependency — graceful fallback if unavailable - ---- - -## Visualization - -### PARTY_COLOURS -- Dict mapping party names to hex color codes -- Used in all Plotly charts for consistent party coloring -- Source: `config.py` → `PARTY_COLOURS` constant -- **Issue**: 3 separate alias dictionaries exist (no single source of truth) - ---- - -## Application Pages - -### Home -- Landing page with app overview - -### Stemwijzer (Quiz) -- User answers questions → matched to parties -- Thin wrapper around quiz module - -### Explorer (4 tabs) -- **Motion tab**: SVD positions colored by vote on selected motion -- **MP tab**: Individual MP trajectories across windows -- **Party tab**: Party centroids with members as scatter -- **Evolution tab**: How positions change over time - ---- - -## Database Table 
Reference -| Table | Key Fields | -|-------|-----------| -| `motions` | id, title, date, category | -| `mp_votes` | mp_id, motion_id, vote | -| `svd_vectors` | entity_id, window, vector_2d (list[2]) | -| `party_centroids` | party, window, centroid_2d | -| `mp_party_history` | mp_id, party, start_date, end_date | -| `windows` | window_id, start_date, end_date, period_type | -| `mp_trajectories` | mp_id, window, trajectory_vector | diff --git a/.mindmodel/manifest.yaml b/.mindmodel/manifest.yaml index 72df5a4..eb061e9 100644 --- a/.mindmodel/manifest.yaml +++ b/.mindmodel/manifest.yaml @@ -1,3 +1,7 @@ +# stemwijzer Mind Model - Manifest +# Generated: 2026-04-12 +# Phase: 2 - Assembly from Phase 1 Analysis + name: stemwijzer version: 2 description: Dutch political voting compass (Stemwijzer) - Mind Model constraints @@ -7,39 +11,54 @@ categories: - path: system.md description: System overview and architecture summary group: docs - - path: tech-stack.yaml + - path: stack/stack.md description: Technology stack with versions and purposes - group: docs - - path: conventions.yaml - description: Coding conventions and style guide - group: docs - - path: domain.yaml - description: Domain entities, terms, and relationships - group: docs - + group: stack + - path: domain/domain-glossary.md + description: Domain entities, terms, relationships, and CRITICAL INVARIANTS + group: domain + # Design patterns - - path: patterns/architecture.yaml - description: Repository, Facade, Pipeline architectural patterns + - path: patterns/patterns.yaml + description: Code patterns (Singleton, Repository, Pipeline, etc.) 
group: patterns - - path: patterns/python.yaml - description: Python-specific patterns (Singleton, dataclass, context manager) + - path: patterns/streamlit.yaml + description: Streamlit-specific patterns (session state, cache) + group: patterns + - path: patterns/api.yaml + description: API client patterns with retry and pagination group: patterns - path: patterns/database.yaml - description: DuckDB connection patterns and ORM usage + description: DuckDB patterns and connection management group: patterns - - path: patterns/api.yaml - description: API client patterns with retry logic and pagination + - path: patterns/python.yaml + description: Python-specific patterns (dataclass, typing) group: patterns - - path: patterns/streamlit.yaml - description: Streamlit session state and page patterns + - path: patterns/duckdb-access.md + description: DuckDB connection patterns and best practices group: patterns - + - path: patterns/embeddings-similarity.md + description: Embeddings and similarity computation patterns + group: patterns + - path: patterns/error-handling.md + description: Error handling and exception patterns + group: patterns + - path: patterns/module-singletons.md + description: Module-level singleton patterns + group: patterns + - path: patterns/requests-http.md + description: HTTP client patterns with retry + group: patterns + - path: patterns/validation.md + description: Input validation patterns + group: patterns + # Coding constraints - - path: constraints/error-handling.yaml + - path: constraints/error-handling.md description: Error handling patterns with safe fallbacks group: constraints - - path: constraints/logging.yaml - description: Logging conventions and best practices + - path: constraints/logging.md + description: Logging conventions group: constraints - path: constraints/naming.yaml description: File, class, function naming rules @@ -50,25 +69,40 @@ categories: - path: constraints/types.yaml description: Type hint conventions group: 
constraints - + - path: constraints/testing.yaml + description: Testing conventions + group: constraints + + # Anti-patterns + - path: anti-patterns/anti-patterns.md + description: Known anti-patterns with evidence and fixes + group: anti-patterns + + # Dependencies + - path: dependencies/dependencies.md + description: Library usage and singleton instances + group: dependencies + # Code examples - path: examples/database-example.py - description: MotionDatabase usage example + description: MotionDatabase usage examples group: examples - path: examples/api-client-example.py - description: TweedeKamerAPI usage + description: TweedeKamerAPI usage examples group: examples - path: examples/pipeline-example.py - description: Pipeline phase example + description: Pipeline orchestration examples group: examples - path: examples/streamlit-page-example.py - description: Streamlit page pattern + description: Streamlit page patterns + group: examples + - path: examples/pattern-examples.md + description: Consolidated pattern examples group: examples - - # Anti-patterns and workflows - - path: anti-patterns.yaml - description: Known anti-patterns to avoid - group: meta - - path: workflows.yaml - description: Key workflows (VotingSession, DataIngestion, EmbeddingGeneration) - group: meta + +# Phase 1 findings summary: +# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN) +# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline +# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except +# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons +# - 3 groups: stdlib, 3rd party, local imports diff --git a/.mindmodel/patterns/duckdb-access.md b/.mindmodel/patterns/duckdb-access.md new file mode 100644 index 0000000..ec00d89 --- /dev/null +++ b/.mindmodel/patterns/duckdb-access.md @@ -0,0 +1,79 @@ +--- +title: DuckDB Access Pattern +category: patterns +--- +# DuckDB Access Pattern + +## 
Rules + +- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. +- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. +- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. +- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). + +## Examples + +### database.py - Explicit connect/close for schema init + +```python +conn = duckdb.connect(self.db_path) +... +conn.execute(""" + CREATE TABLE IF NOT EXISTS fused_embeddings ( + id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), + motion_id INTEGER NOT NULL, + window_id TEXT NOT NULL, + vector JSON NOT NULL, + svd_dims INTEGER NOT NULL, + text_dims INTEGER NOT NULL, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY (id) + ) +""") +conn.close() +``` + +### pipeline/svd_pipeline.py - Read-only connection + +```python +conn = duckdb.connect(db_path, read_only=True) +try: + rows = conn.execute( + "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", + (start_date, end_date), + ).fetchall() +finally: + conn.close() +``` + +### similarity/compute.py - Preferred 'with' context + +```python +try: + import duckdb +except Exception: + logger.exception("duckdb import failed; cannot load vectors") + return 0 + +with duckdb.connect(db.db_path) as conn: + rows = conn.execute(query, params).fetchall() +``` + +## Anti-Patterns + +### Bad: Connection without closure + +```python +# BAD: connection may leak if exception occurs before explicit close +conn = duckdb.connect(db_path) +rows = conn.execute("SELECT ...").fetchall() +# missing finally/close +``` + +**Remediation**: Use "with" context or ensure conn.close() in finally block. 
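The remediation can be captured once in a small helper instead of repeating try/finally at every call site. A minimal sketch (the name `scoped_connection` is hypothetical, not in the codebase) that works with any DB-API-style `connect` callable, DuckDB included, and guarantees `close()` even when the body raises:

```python
from contextlib import contextmanager

@contextmanager
def scoped_connection(connect_fn, *args, **kwargs):
    """Open a connection via ``connect_fn`` and always close it on exit."""
    conn = connect_fn(*args, **kwargs)
    try:
        yield conn
    finally:
        conn.close()
```

With DuckDB this reads `with scoped_connection(duckdb.connect, db_path, read_only=True) as conn: ...` — equivalent to the `with duckdb.connect(...)` form above, but usable anywhere a connect function is passed around as a parameter.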
+ +### Bad: Parallel write connections + +**Problem**: Opening write connections from many parallel workers without coordination. + +**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. diff --git a/.mindmodel/patterns/duckdb_access.yaml b/.mindmodel/patterns/duckdb_access.yaml deleted file mode 100644 index 63204a5..0000000 --- a/.mindmodel/patterns/duckdb_access.yaml +++ /dev/null @@ -1,70 +0,0 @@ -name: duckdb_access - -rules: - - Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. - - Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. - - If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. - - Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). - -examples: - - path: database.py - excerpt: | - ```python - conn = duckdb.connect(self.db_path) - ... - conn.execute(""" - CREATE TABLE IF NOT EXISTS fused_embeddings ( - id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), - motion_id INTEGER NOT NULL, - window_id TEXT NOT NULL, - vector JSON NOT NULL, - svd_dims INTEGER NOT NULL, - text_dims INTEGER NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) - ) - """) - conn.close() - ``` - note: explicit connect/close used when initializing schema - - - path: pipeline/svd_pipeline.py - excerpt: | - ```python - conn = duckdb.connect(db_path, read_only=True) - try: - rows = conn.execute( - "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? 
AND ?", - (start_date, end_date), - ).fetchall() - finally: - conn.close() - ``` - note: read_only connection used for compute-heavy worker - - - path: similarity/compute.py - excerpt: | - ```python - try: - import duckdb - except Exception: - logger.exception("duckdb import failed; cannot load vectors") - return 0 - - with duckdb.connect(db.db_path) as conn: - rows = conn.execute(query, params).fetchall() - ``` - note: preferred 'with' context for automatic close - -anti_patterns: - - Bad: creating a connection without closure in a long-running process - remediation: use "with" context or ensure conn.close() in finally block - example: | - ```python - # BAD: connection may leak if exception occurs before explicit close - conn = duckdb.connect(db_path) - rows = conn.execute("SELECT ...").fetchall() - # missing finally/close - ``` - - Bad: Opening write connections from many parallel workers without coordination - remediation: open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. diff --git a/.mindmodel/patterns/embeddings-similarity.md b/.mindmodel/patterns/embeddings-similarity.md new file mode 100644 index 0000000..5b41d32 --- /dev/null +++ b/.mindmodel/patterns/embeddings-similarity.md @@ -0,0 +1,74 @@ +--- +title: Embeddings Similarity Pipeline +category: patterns +--- +# Embeddings Similarity Pipeline + +## Rules + +- Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure. +- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. +- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. +- Use read_only DuckDB connections in compute workers to allow parallel runs. 
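The padding and normalized-cosine steps named in the rules above can be sketched as follows. This is illustrative only — the function names are hypothetical and the production logic lives in `similarity/compute.py`:

```python
import numpy as np

def pad_to_common_length(vectors):
    """Zero-pad ragged vectors so they share one dimensionality."""
    width = max(len(v) for v in vectors)
    matrix = np.zeros((len(vectors), width))
    for i, v in enumerate(vectors):
        matrix[i, : len(v)] = v
    return matrix

def cosine_similarity_matrix(matrix):
    """Row-normalized cosine similarity; zero rows keep norm 1 to avoid division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    normalized = matrix / norms
    return normalized @ normalized.T
```

Top-k neighbors for the `similarity_cache` would then come from sorting each row of the resulting matrix, excluding the diagonal.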
+ +## Examples + +### pipeline/ai_provider_wrapper.py - Batched embed + fallback + +```python +for start in range(0, len(texts), batch_size): + chunk = texts[start : start + batch_size] + resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) +... +for j in range(i, end): + t = texts[j] + single, single_exc = _attempt_batch([t], j) + if single: + results[j] = single[0] +``` + +### pipeline/fusion.py - Concatenation and storage + +```python +try: + svd_vec = json.loads(svd_json) +except Exception: + _logger.exception("Invalid SVD vector JSON for entity %s", entity_id) + skipped_missing_svd += 1 + continue +... +fused = list(svd_vec) + list(text_vec) +res = db.store_fused_embedding( + int(entity_id), + window_id, + fused, + svd_dims=len(svd_vec), + text_dims=len(text_vec), +) +``` + +### similarity/compute.py - Normalized cosine similarity + +```python +# Normalize rows +norms = np.linalg.norm(matrix, axis=1, keepdims=True) +norms[norms == 0] = 1.0 +normalized = matrix / norms +sim = normalized @ normalized.T +... +# pick top-k neighbors and write to similarity_cache +``` + +## Anti-Patterns + +### Bad: Assuming consistent vector length + +**Problem**: Assuming consistent vector length without checks leads to shape errors. + +**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). + +### Bad: Inline heavy computation in UI + +**Problem**: Recomputing heavy pipelines inline in UI requests. + +**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI. diff --git a/.mindmodel/patterns/embeddings_similarity.yaml b/.mindmodel/patterns/embeddings_similarity.yaml deleted file mode 100644 index 40a3149..0000000 --- a/.mindmodel/patterns/embeddings_similarity.yaml +++ /dev/null @@ -1,63 +0,0 @@ -name: embeddings_similarity_pipeline - -rules: - - Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure. 
- - Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. - - Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. - - Use read_only DuckDB connections in compute workers to allow parallel runs. - -examples: - - path: pipeline/ai_provider_wrapper.py - excerpt: | - ```python - for start in range(0, len(texts), batch_size): - chunk = texts[start : start + batch_size] - resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) - ... - for j in range(i, end): - t = texts[j] - single, single_exc = _attempt_batch([t], j) - if single: - results[j] = single[0] - ``` - note: batched embed + fallback per-item retry - - - path: pipeline/fusion.py - excerpt: | - ```python - try: - svd_vec = json.loads(svd_json) - except Exception: - _logger.exception("Invalid SVD vector JSON for entity %s", entity_id) - skipped_missing_svd += 1 - continue - ... - fused = list(svd_vec) + list(text_vec) - res = db.store_fused_embedding( - int(entity_id), - window_id, - fused, - svd_dims=len(svd_vec), - text_dims=len(text_vec), - ) - ``` - note: concatenation of vectors and storage via MotionDatabase - - - path: similarity/compute.py - excerpt: | - ```python - # Normalize rows - norms = np.linalg.norm(matrix, axis=1, keepdims=True) - norms[norms == 0] = 1.0 - normalized = matrix / norms - sim = normalized @ normalized.T - ... - # pick top-k neighbors and write to similarity_cache - ``` - note: numeric pipeline and padding to consistent dimensionality - -anti_patterns: - - Bad: Assuming consistent vector length without checks (leads to shape errors). - remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). - - Bad: Recomputing heavy pipelines inline in UI requests. - remediation: schedule heavy work in scripts/subprocesses and read precomputed results in UI. 
diff --git a/.mindmodel/patterns/error-handling.md b/.mindmodel/patterns/error-handling.md
new file mode 100644
index 0000000..f0e5881
--- /dev/null
+++ b/.mindmodel/patterns/error-handling.md
@@ -0,0 +1,63 @@
+---
+title: Error Handling Pattern
+category: patterns
+---
+# Error Handling Pattern
+
+## Rules
+
+- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
+- Prefer logging.exception when catching an exception where a stack trace is useful.
+- Avoid broad except: clauses that swallow exceptions; if a broad except is used for "best-effort" fallback, log at warning level and include the original exception context.
+- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented.
+
+## Examples
+
+### ai_provider.py - Network error to ProviderError
+
+```python
+except requests.ConnectionError as exc:
+    if attempt == retries:
+        raise ProviderError(
+            f"Connection error when calling provider: {exc}"
+        ) from exc
+    ...
+```
+
+### pipeline/ai_provider_wrapper.py - Best-effort with logging
+
+```python
+except Exception:
+    _logger.exception("Failed to append audit event for embedding failure")
+results[j] = None
+```
+
+### similarity/compute.py - Defensive import handling
+
+```python
+try:
+    import duckdb
+except Exception:
+    logger.exception("duckdb import failed; cannot load vectors")
+    return 0
+```
+
+## Anti-Patterns
+
+### Bad: Silent exception swallowing
+
+```python
+try:
+    do_work()
+except Exception:
+    return []
+# BAD: hides the root cause and returns an ambiguous default
+```
+
+**Remediation**: Narrow the exception types, or at minimum call logger.exception() and then re-raise or convert to a domain error if the failure is truly handled.
+
+### Bad: Mixing print() and logging
+
+**Problem**: Mixing print() and logging for errors scatters diagnostics between stdout and the log handlers, and print() loses tracebacks.
+
+**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration.
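A minimal sketch of the structured-logging remediation above; the format string, log level, and `load_records` helper are illustrative assumptions, not the project's actual configuration:

```python
import logging

# One-time configuration at application entry point, not at import of library modules
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
_logger = logging.getLogger(__name__)

def load_records(records):
    """Hypothetical stand-in for an API fetch wrapper."""
    try:
        _logger.info("Fetched %d voting records", len(records))
        return records
    except Exception:
        _logger.exception("Error fetching voting records")  # logs the full traceback
        return []
```

Unlike print(), `_logger.exception()` records the traceback and routes through whatever handlers the application configures.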
diff --git a/.mindmodel/patterns/error_handling.yaml b/.mindmodel/patterns/error_handling.yaml deleted file mode 100644 index d6344cc..0000000 --- a/.mindmodel/patterns/error_handling.yaml +++ /dev/null @@ -1,54 +0,0 @@ -name: error_handling - -rules: - - Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError). - - Prefer logging.exception when catching an exception where stack trace is useful. - - Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context. - - For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented. - -examples: - - path: ai_provider.py - excerpt: | - ```python - except requests.ConnectionError as exc: - if attempt == retries: - raise ProviderError( - f"Connection error when calling provider: {exc}" - ) from exc - ... - ``` - note: mapping network error to ProviderError with re-raise chaining - - - path: pipeline/ai_provider_wrapper.py - excerpt: | - ```python - except Exception: - _logger.exception("Failed to append audit event for embedding failure") - results[j] = None - ``` - note: logs and assigns None for failure; fallback behavior documented earlier in wrapper rule - - - path: similarity/compute.py - excerpt: | - ```python - try: - import duckdb - except Exception: - logger.exception("duckdb import failed; cannot load vectors") - return 0 - ``` - note: defensive import handling and early return on failure - -anti_patterns: - - Bad: Broad except without logging and without re-raising (silently hides bugs) - remediation: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled. 
- example: | - ```python - try: - do_work() - except Exception: - return [] - # BAD: hides the root cause and returns an ambiguous default - ``` - - Bad: Mixing print() and logging for errors - remediation: Replace print() calls with logger.* calls; use structured logging configuration. diff --git a/.mindmodel/patterns/module-singletons.md b/.mindmodel/patterns/module-singletons.md new file mode 100644 index 0000000..f6c80be --- /dev/null +++ b/.mindmodel/patterns/module-singletons.md @@ -0,0 +1,41 @@ +--- +title: Module Singletons Pattern +category: patterns +--- +# Module Singletons Pattern + +## Rules + +- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: + - Avoid expensive initialization at import time. + - Provide a way to construct with a test DB path or to reinitialize in tests. +- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit. + +## Examples + +### database.py - Safe class initialization + +```python +class MotionDatabase: + def __init__(self, db_path: str = config.DATABASE_PATH): + self.db_path = db_path + # If duckdb is not available, operate in lightweight file-backed mode + self._file_mode = duckdb is None + self._init_database() +``` + +### similarity/lookup.py - Local instances + +```python +db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase() +if hasattr(db, "get_cached_similarities"): + rows = db.get_cached_similarities(...) +``` + +## Anti-Patterns + +### Bad: Heavy initialization at import time + +**Problem**: Creating connections and performing heavy schema migrations during import. + +**Remediation**: Move heavy init to an explicit initialize() method and keep import fast. 
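The explicit-initialize() remediation above can be sketched as a lazily constructed singleton. `MotionDatabase` here is a simplified stand-in for the real class, and `get_db` is a hypothetical accessor, not existing project code:

```python
class MotionDatabase:
    """Simplified stand-in: no connections or migrations at import time."""

    def __init__(self, db_path: str = "motions.duckdb"):
        self.db_path = db_path
        self.initialized = False

    def initialize(self) -> None:
        """Explicit heavy init: open connections, run schema migrations."""
        if self.initialized:
            return
        # ... connect and migrate here ...
        self.initialized = True

_db = None

def get_db(db_path: str = "motions.duckdb") -> MotionDatabase:
    """Construct the module singleton lazily; tests can pass a throwaway path."""
    global _db
    if _db is None:
        _db = MotionDatabase(db_path)
        _db.initialize()
    return _db
```

Importing the module stays cheap, and tests can reset `_db` or call `MotionDatabase(tmp_path)` directly.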
diff --git a/.mindmodel/patterns/module_singletons.yaml b/.mindmodel/patterns/module_singletons.yaml deleted file mode 100644 index 7ce7d96..0000000 --- a/.mindmodel/patterns/module_singletons.yaml +++ /dev/null @@ -1,33 +0,0 @@ -name: module_singletons - -rules: - - Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: - - Avoid expensive initialization at import time. - - Provide a way to construct with a test DB path or to reinitialize in tests. - - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit. - -examples: - - path: database.py - excerpt: | - ```python - class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - # If duckdb is not available, operate in lightweight file-backed mode - self._file_mode = duckdb is None - self._init_database() - ``` - note: class is safe to instantiate and creates DB at init; consider lazy init if heavy - - - path: similarity/lookup.py - excerpt: | - ```python - db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase() - if hasattr(db, "get_cached_similarities"): - rows = db.get_cached_similarities(...) - ``` - note: consumers create local MotionDatabase instances, not relying on a single global - -anti_patterns: - - Bad: Creating connections and performing heavy schema migrations during import - remediation: Move heavy init to an explicit initialize() method and keep import fast. diff --git a/.mindmodel/patterns/requests-http.md b/.mindmodel/patterns/requests-http.md new file mode 100644 index 0000000..0930fb6 --- /dev/null +++ b/.mindmodel/patterns/requests-http.md @@ -0,0 +1,77 @@ +--- +title: Requests HTTP Pattern +category: patterns +--- +# Requests HTTP Pattern + +## Rules + +- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling. +- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429. 
+- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
+- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase.
+
+## Examples
+
+### ai_provider.py - 429 handling with Retry-After
+
+```python
+resp = requests.post(url, json=json, headers=headers, timeout=10)
+...
+if getattr(resp, "status_code", 0) == 429:
+    if attempt == retries:
+        raise ProviderError(f"Provider returned HTTP {resp.status_code}")
+    retry_after = None
+    raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
+    if raw:
+        try:
+            retry_after = int(raw)
+        except Exception:
+            ...
+    if retry_after is not None:
+        time.sleep(retry_after)
+    continue
+```
+
+### api_client.py - Session + raise_for_status
+
+```python
+response = self.session.get(
+    base_url, params=params, timeout=config.API_TIMEOUT
+)
+response.raise_for_status()
+data = response.json()
+```
+
+### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper
+
+```python
+def _attempt_batch(chunk_texts, start_index):
+    backoff = 0.5
+    for attempt in range(1, retries + 1):
+        try:
+            emb_chunk = _embedder(
+                chunk_texts, model=model, batch_size=len(chunk_texts)
+            )
+            return emb_chunk, None
+        except Exception as exc:
+            if attempt == retries:
+                break
+            sleep = backoff * (2 ** (attempt - 1))
+            time.sleep(sleep)
+            continue
+```
+
+## Anti-Patterns
+
+### Bad: Swallowing all requests exceptions
+
+**Problem**: Blindly catching every requests exception and returning an empty response.
+
+**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log the details.
+
+### Bad: Using print() for errors
+
+**Problem**: Using print() for network errors instead of structured logging.
+
+**Remediation**: Use `_logger.exception()` instead (api_client.py currently uses print() and still needs this fix).
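The retryable-vs-terminal mapping in the remediation above might be sketched as a status classifier. `classify_status` is a hypothetical helper; `ProviderError` and the 429/5xx-vs-4xx split follow the rules at the top of this file:

```python
class ProviderError(Exception):
    """Terminal provider failure; retrying will not help."""

def classify_status(status_code: int) -> str:
    """Return 'ok' or 'retry', or raise ProviderError for terminal errors."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429 or status_code >= 500:
        return "retry"  # transient per the rules above: back off and try again
    raise ProviderError(f"Provider returned HTTP {status_code}")
```

Network exceptions (ConnectionError, Timeout) would map to the 'retry' branch; redirects and informational codes are out of scope for this sketch.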
diff --git a/.mindmodel/patterns/requests_http.yaml b/.mindmodel/patterns/requests_http.yaml deleted file mode 100644 index 135287c..0000000 --- a/.mindmodel/patterns/requests_http.yaml +++ /dev/null @@ -1,65 +0,0 @@ -name: requests_http - -rules: - - Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling. - - Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429. - - Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429). - - Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase. - -examples: - - path: ai_provider.py - excerpt: | - ```python - resp = requests.post(url, json=json, headers=headers, timeout=10) - ... - if getattr(resp, "status_code", 0) == 429: - if attempt == retries: - raise ProviderError(f"Provider returned HTTP {resp.status_code}") - retry_after = None - raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None - if raw: - try: - retry_after = int(raw) - except Exception: - ... 
- if retry_after is not None: - time.sleep(retry_after) - continue - ``` - note: explicit handling of 429 and Retry-After - - - path: api_client.py - excerpt: | - ```python - response = self.session.get( - base_url, params=params, timeout=config.API_TIMEOUT - ) - response.raise_for_status() - data = response.json() - ``` - note: uses session + raise_for_status() to surface HTTP errors - - - path: pipeline/ai_provider_wrapper.py - excerpt: | - ```python - def _attempt_batch(chunk_texts, start_index): - backoff = 0.5 - for attempt in range(1, retries + 1): - try: - emb_chunk = _embedder( - chunk_texts, model=model, batch_size=len(chunk_texts) - ) - return emb_chunk, None - except Exception as exc: - if attempt == retries: - break - sleep = backoff * (2 ** (attempt - 1)) - time.sleep(sleep) - continue - ``` - note: wrapper adds retry/backoff and per-item fallback - -anti_patterns: - - Bad: Blindly catching all requests exceptions and returning empty response - remediation: map network exceptions to retryable vs terminal (ProviderError) and log details. - - Bad: Using print() for network errors instead of structured logging (see api_client.py where print() is used; prefer logging). diff --git a/.mindmodel/patterns/validation.md b/.mindmodel/patterns/validation.md new file mode 100644 index 0000000..a8fab16 --- /dev/null +++ b/.mindmodel/patterns/validation.md @@ -0,0 +1,37 @@ +--- +title: Validation Pattern +category: patterns +--- +# Validation Pattern + +## Rules + +- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs. +- Tests should assert that invalid inputs raise the expected exceptions. +- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding). 
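The type-and-shape rule above can be sketched at a public API boundary. The isinstance check mirrors the documented ai_provider.py example; the non-empty check and the `validate_vector` helper are illustrative additions:

```python
class ProviderError(Exception):
    """Raised when a caller violates the provider contract."""

def validate_embed_input(text: object) -> str:
    """Fail fast before the network call, as ai_provider.py does."""
    if not isinstance(text, str):
        raise ProviderError("text must be a string")
    if not text.strip():
        raise ProviderError("text must be non-empty")
    return text

def validate_vector(vec, expected_dims: int) -> list:
    """Shape check on a public API; expected_dims is caller-supplied."""
    if len(vec) != expected_dims:
        raise ValueError(f"expected {expected_dims} dims, got {len(vec)}")
    return list(vec)
```

Unit tests should assert both the happy path and that each invalid input raises the expected exception type.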
+ +## Examples + +### ai_provider.py - Type validation + +```python +if not isinstance(text, str): + raise ProviderError("text must be a string") +``` + +### pipeline/ai_provider_wrapper.py - Defensive empty handling + +```python +if not texts: + return [] +if motion_ids is None: + motion_ids = [None for _ in texts] +``` + +## Anti-Patterns + +### Bad: Invalid values into computation + +**Problem**: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline). + +**Remediation**: Fail fast with a typed exception and add unit tests to cover validations. diff --git a/.mindmodel/patterns/validation.yaml b/.mindmodel/patterns/validation.yaml deleted file mode 100644 index 5b68808..0000000 --- a/.mindmodel/patterns/validation.yaml +++ /dev/null @@ -1,29 +0,0 @@ -name: validation - -rules: - - Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs. - - Tests should assert that invalid inputs raise the expected exceptions. - - Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding). - -examples: - - path: ai_provider.py - excerpt: | - ```python - if not isinstance(text, str): - raise ProviderError("text must be a string") - ``` - note: explicit type validation before network call - - - path: pipeline/ai_provider_wrapper.py - excerpt: | - ```python - if not texts: - return [] - if motion_ids is None: - motion_ids = [None for _ in texts] - ``` - note: defensive handling of empty inputs - -anti_patterns: - - Bad: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline). - remediation: Fail fast with a typed exception and add unit tests to cover validations. 
diff --git a/.mindmodel/stack/stack.md b/.mindmodel/stack/stack.md new file mode 100644 index 0000000..a2ea27d --- /dev/null +++ b/.mindmodel/stack/stack.md @@ -0,0 +1,67 @@ +--- +title: Tech Stack +category: stack +--- + +# Tech Stack + +## Runtime & Language +- **Python >=3.13** + +## Web Framework +- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages + +## Data Layer +- **DuckDB** - Embedded OLAP database + - Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata +- **ibis** - ORM (referenced but DuckDB-native implementation used) + +## AI / LLM +- **OpenRouter** - API abstraction for AI providers +- **QWEN** - Primary model + - Embeddings: `qwen/qwen3-embedding-4b` + - Chat: `qwen/qwen-2.5-72b-instruct` +- **requests** - HTTP client (not raw openai) + +## ML / Analytics +- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler +- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes +- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD) +- **numpy** - Numerical computing + +## Visualization +- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback) +- **matplotlib** - Static plotting (optional) + +## HTTP & Parsing +- **requests** - Session pooling, retry with backoff +- **beautifulsoup4** - HTML parsing +- **lxml** - XML/HTML processing + +## Key Source Files + +| File | Purpose | +|------|---------| +| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema | +| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) | +| `explorer_helpers.py` | Pure helper functions, Plotly chart builders | +| `analysis/` | SVD pipeline, UMAP projection, clustering | +| `pipeline/` | Data fetch, transform, store pipeline | +| `pages/1_Stemwijzer.py` | Quiz page | +| `pages/2_Explorer.py` | Explorer page | +| `config.py` | Dataclass Config pattern | +| `ai_provider.py` | OpenRouter API wrapper with retry | +| 
`api_client.py` | TweedeKamer OData API client | + +## Singleton Instances + +| Module | Instance | Type | +|--------|----------|------| +| `database.py` | `db` | `MotionDatabase` | +| `config.py` | `config` | `Config` (dataclass) | +| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | + +## Environment +- Python >=3.13 +- Environment variables via `.env` (DB path, API keys) +- No `.env` values in constraint files (security) diff --git a/.mindmodel/stack/stack.yaml b/.mindmodel/stack/stack.yaml deleted file mode 100644 index 7b09f1e..0000000 --- a/.mindmodel/stack/stack.yaml +++ /dev/null @@ -1,41 +0,0 @@ -# Tech Stack - -## Runtime & Language -- **Python ≥3.13** (type: runtime) -- Streamlit (type: web framework) - multi-page app: Home, Stemwijzer, Explorer (4 tabs) - -## Data Layer -- **DuckDB** (type: database) - 9 tables: motions, mp_votes, svd_vectors, mp_party_history, etc. -- **ibis** (type: ORM) - DuckDB backend for Pythonic SQL -- Query mode: duckdb:// path or :memory: (see database.py:50-51) - -## ML / Analytics -- **scikit-learn** (type: ML) - clustering, Procrustes alignment -- **UMAP** (type: dimensionality reduction) - 2D political compass projection -- **scipy** (type: scientific computing) - spatial/alignment algorithms -- **numpy** (type: numerical computing) - array operations - -## Visualization -- **Plotly** (type: charting) - dual-layer interactive charts (scatter + annotations) - -## Key Source Files -| File | Purpose | -|------|---------| -| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema | -| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) | -| `explorer_helpers.py` | Pure helper functions, Plotly chart builders, coordinate computation | -| `analysis/` | SVD pipeline, UMAP projection, clustering algorithms | -| `pipeline/` | Data fetch → transform → store pipeline | -| `pages/1_🗳️_Stemwijzer.py` | Quiz page (thin wrapper) | -| `pages/2_🔍_Explorer.py` | Explorer page (thin wrapper) | -| 
`config.py` | Dataclass Config pattern | - -## Database Tables -- `motions` - parliamentary motions with id, title, date, category -- `mp_votes` - individual MP votes on motions (1/0/-1) -- `svd_vectors` - SVD-computed political positions (entity_id, window, vector_2d) -- `mp_party_history` - MP-to-party mappings over time -- `party_centroids` - aggregated party positions -- `windows` - time period definitions -- `mp_trajectories` - MP position changes across windows -- Plus 2 additional tables (exact names vary) diff --git a/.mindmodel/system.md b/.mindmodel/system.md index f50775b..f4de3e5 100644 --- a/.mindmodel/system.md +++ b/.mindmodel/system.md @@ -21,7 +21,7 @@ TweedeKamer OData API ├── text_pipeline # AI embeddings via OpenRouter └── fusion # Combine SVD + text vectors ↓ - Streamlit Web App (app.py, pages/) + Streamlit Web App (Home.py, pages/) ├── Home.py # Landing page ├── 1_Stemwijzer.py # Voting quiz └── 2_Explorer.py # Political compass explorer @@ -36,34 +36,53 @@ TweedeKamer OData API | **AI Provider** | OpenRouter API for embeddings/summaries | `ai_provider.py` | | **Pipeline** | Orchestrated data processing | `pipeline/run_pipeline.py` | | **Analysis** | SVD, clustering, trajectory computation | `analysis/*.py` | -| **Similarity** | Motion similarity search | `similarity/*.py` | -| **Web App** | Streamlit UI | `app.py`, `pages/*.py` | - -### Data Models - -**Core Entities**: -- `Motion`: Parliamentary motion with voting results -- `MP` / `MPMetadata`: Member of Parliament with party/tenure -- `MPVote`: Individual vote record (Voor/Tegen/Onthouden/Geen stem/Afwezig) -- `Party`: Political party -- `UserSession` / `UserVote`: Voting session tracking -- `SVDVector`: Dimensionality-reduced vote vectors -- `FusedEmbedding`: Combined SVD + text embedding -- `SimilarityCache`: Pre-computed motion similarities - -### Technical Decisions - -1. **DuckDB over SQLite**: Chosen for OLAP performance with complex analytical queries -2. 
**ibis ORM**: Database-agnostic query building (currently using DuckDB backend) -3. **SVD + Procrustes**: Aligns voting vectors across time windows -4. **UMAP for visualization**: Non-linear dimensionality reduction for compass display -5. **OpenRouter API**: Abstraction layer for AI embeddings (currently using Qwen) -6. **Module-level singletons**: `db = MotionDatabase()` pattern for shared state - -### Key Conventions - -- **DuckDB connections**: Short-lived per method, always close -- **Error handling**: Catch `Exception`, return safe fallbacks (False/[]/None) -- **Logging**: Use `logging.getLogger(__name__)` - avoid print() -- **Type hints**: Required on public functions with typing module imports -- **Config**: Dataclass `Config` in `config.py`, accessed as `from config import config` +| **Explorer Helpers** | Pure functions, chart builders | `explorer_helpers.py` | +| **Web App** | Streamlit UI | `Home.py`, `pages/*.py` | + +### Tech Stack + +- **Language**: Python 3.13+ +- **Web Framework**: Streamlit (multi-page app) +- **Database**: DuckDB with ibis ORM (DuckDB-native implementation) +- **ML/Analytics**: scipy (SVD, Procrustes), scikit-learn (KMeans, cosine_similarity), umap-learn (optional) +- **AI/LLM**: OpenRouter-compatible API (QWEN embeddings + chat) +- **Visualization**: Plotly (interactive charts), matplotlib (optional) +- **HTTP**: requests with Session pooling and retry +- **Parsing**: beautifulsoup4, lxml + +### Key Patterns + +1. **Module-Level Singletons**: `db = MotionDatabase()`, `config = Config()` +2. **Repository Pattern**: MotionDatabase class with method-per-query +3. **Service Layer**: TweedeKamerAPI, ai_provider with retry/backoff +4. **Pipeline Orchestration**: ThreadPoolExecutor for parallel SVD +5. **Short-Lived Connections**: DuckDB connections in try/finally blocks +6. **Graceful Degradation**: try/except around optional dependencies + +### Domain Invariants + +⚠️ **CRITICAL RULES** (from AGENTS.md): + +1. 
**Right-wing parties on RIGHT**: PVV, FVD, JA21, SGP must appear on RIGHT side of all axes in visualizations +2. **SVD labels = voting patterns**: SVD labels reflect voting patterns, NOT semantic content + +### Database Tables + +| Table | Purpose | +|-------|---------| +| `motions` | Parliamentary motions with id, title, date, category | +| `mp_votes` | Individual MP votes on motions (Voor/Tegen/Onthouden) | +| `mp_metadata` | MP names, parties, tenure info | +| `svd_vectors` | 2D SVD-computed political positions per entity | +| `fused_embeddings` | Combined SVD + text embeddings | +| `embeddings` | Text embeddings for motions | +| `user_sessions` | Voting session tracking | +| `party_results` | Party match results per session | + +### Conventions + +- **Error Handling**: Catch `Exception`, return safe fallbacks (False/[]/None) +- **Logging**: Use `logging.getLogger(__name__)` — **never use print()** +- **Imports**: stdlib → 3rd party → local (3 groups) +- **Type Hints**: Required on public functions with typing module imports +- **DuckDB**: Short-lived connections with try/finally conn.close()
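The short-lived-connection convention above can be sketched as follows, using sqlite3 as a stand-in (the real code calls duckdb.connect, but the try/finally shape is identical); `fetch_rows` is an illustrative helper, not project code:

```python
import sqlite3  # stand-in for duckdb in this sketch

def fetch_rows(db_path: str, query: str, params=()):
    """Open a short-lived connection, query, and always close it."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(query, params).fetchall()
    finally:
        conn.close()  # runs even if execute() raises
```

Keeping connections scoped to one call avoids leaking handles across Streamlit reruns and lets read-only workers run in parallel.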