From 07dd393533458a84ccda816270d4afa7089b3cd0 Mon Sep 17 00:00:00 2001 From: Sven Geboers Date: Fri, 1 May 2026 12:11:06 +0200 Subject: [PATCH] cleanup: remove stale .mindmodel, old venvs, orphaned code, and transient artifacts Removes: - .mindmodel/ directory and related CI workflows (mindmodel-schedule.yml, mindmodel-validation.yml) - scripts/mindmodel/ and scripts/validate_mindmodel.py - src/types/ and src/validators/ (orphaned type modules, only used by mindmodel) - tests/ci/, tests/scripts/mindmodel/, tests/types/, tests/validators/ (mindmodel-only tests) - thoughts/ledgers/ and thoughts/shared/ (stale transient directories) - .venv_axis and .venv_plotly (orphaned virtual environments, ~1.1 GB) - outputs/blog-charts/ (stale generated HTML files) - data/*.json sidecars (empty cache artifacts) - __pycache__ and *.pyc files across repo Updates: - .gitignore: remove thoughts/shared/analyses/ entry Space reclaimed: ~1.1 GB+ --- .github/workflows/mindmodel-schedule.yml | 37 -- .github/workflows/mindmodel-validation.yml | 47 --- .gitignore | 1 - .mindmodel/README.md | 11 - .mindmodel/anti-patterns/anti-patterns.md | 127 ------- .mindmodel/architecture/architecture.yaml | 55 --- .mindmodel/constraints/README.md | 51 --- .mindmodel/constraints/error-handling.md | 143 -------- .mindmodel/constraints/imports.yaml | 205 ------------ .mindmodel/constraints/logging.md | 131 -------- .mindmodel/constraints/naming.yaml | 141 -------- .mindmodel/constraints/testing.yaml | 26 -- .mindmodel/constraints/types.yaml | 233 ------------- .mindmodel/conventions/conventions.yaml | 124 ------- .mindmodel/dependencies/dependencies.md | 92 ----- .mindmodel/domain/domain-glossary.md | 146 -------- .mindmodel/examples/api-client-example.py | 196 ----------- .mindmodel/examples/database-example.py | 191 ----------- .mindmodel/examples/pattern-examples.md | 116 ------- .mindmodel/examples/pipeline-example.py | 217 ------------ .mindmodel/examples/streamlit-page-example.py | 316 
------------------ .mindmodel/manifest.yaml | 108 ------ .mindmodel/patterns/api.yaml | 265 --------------- .mindmodel/patterns/architecture.yaml | 230 ------------- .mindmodel/patterns/database.yaml | 239 ------------- .mindmodel/patterns/duckdb-access.md | 79 ----- .mindmodel/patterns/embeddings-similarity.md | 74 ---- .mindmodel/patterns/error-handling.md | 63 ---- .mindmodel/patterns/module-singletons.md | 41 --- .mindmodel/patterns/patterns.yaml | 228 ------------- .mindmodel/patterns/python.yaml | 196 ----------- .mindmodel/patterns/requests-http.md | 77 ----- .mindmodel/patterns/streamlit.yaml | 225 ------------- .mindmodel/patterns/validation.md | 37 -- .mindmodel/stack/stack.md | 67 ---- .mindmodel/system.md | 88 ----- analysis/explorer_data.py | 13 +- scripts/mindmodel/checks.py | 72 ---- scripts/mindmodel/cli.py | 32 -- scripts/mindmodel/loader.py | 67 ---- scripts/mindmodel/validator.py | 108 ------ scripts/validate_mindmodel.py | 56 ---- src/types/motion_types.py | 35 -- src/validators/mindmodel_validator.py | 142 -------- tests/ci/test_schedule_exists.py | 11 - tests/ci/test_workflow_exists.py | 26 -- tests/scripts/mindmodel/test_checks.py | 43 --- tests/scripts/mindmodel/test_cli.py | 14 - tests/scripts/mindmodel/test_loader.py | 21 -- tests/scripts/mindmodel/test_validator.py | 70 ---- tests/scripts/test_validate_cli.py | 52 --- tests/types/test_motion_types.py | 22 -- tests/validators/test_mindmodel_validator.py | 45 --- tests/validators/test_types.py | 24 -- tests/validators/test_validator_edgecases.py | 56 ---- ...26-03-28-ansible-package-implementation.md | 40 --- .../changes/2026-03-28-env-removal-report.md | 36 -- .../2026-03-28-secrets-rotation-checklist.md | 25 -- 58 files changed, 11 insertions(+), 5622 deletions(-) delete mode 100644 .github/workflows/mindmodel-schedule.yml delete mode 100644 .github/workflows/mindmodel-validation.yml delete mode 100644 .mindmodel/README.md delete mode 100644 .mindmodel/anti-patterns/anti-patterns.md 
delete mode 100644 .mindmodel/architecture/architecture.yaml delete mode 100644 .mindmodel/constraints/README.md delete mode 100644 .mindmodel/constraints/error-handling.md delete mode 100644 .mindmodel/constraints/imports.yaml delete mode 100644 .mindmodel/constraints/logging.md delete mode 100644 .mindmodel/constraints/naming.yaml delete mode 100644 .mindmodel/constraints/testing.yaml delete mode 100644 .mindmodel/constraints/types.yaml delete mode 100644 .mindmodel/conventions/conventions.yaml delete mode 100644 .mindmodel/dependencies/dependencies.md delete mode 100644 .mindmodel/domain/domain-glossary.md delete mode 100644 .mindmodel/examples/api-client-example.py delete mode 100644 .mindmodel/examples/database-example.py delete mode 100644 .mindmodel/examples/pattern-examples.md delete mode 100644 .mindmodel/examples/pipeline-example.py delete mode 100644 .mindmodel/examples/streamlit-page-example.py delete mode 100644 .mindmodel/manifest.yaml delete mode 100644 .mindmodel/patterns/api.yaml delete mode 100644 .mindmodel/patterns/architecture.yaml delete mode 100644 .mindmodel/patterns/database.yaml delete mode 100644 .mindmodel/patterns/duckdb-access.md delete mode 100644 .mindmodel/patterns/embeddings-similarity.md delete mode 100644 .mindmodel/patterns/error-handling.md delete mode 100644 .mindmodel/patterns/module-singletons.md delete mode 100644 .mindmodel/patterns/patterns.yaml delete mode 100644 .mindmodel/patterns/python.yaml delete mode 100644 .mindmodel/patterns/requests-http.md delete mode 100644 .mindmodel/patterns/streamlit.yaml delete mode 100644 .mindmodel/patterns/validation.md delete mode 100644 .mindmodel/stack/stack.md delete mode 100644 .mindmodel/system.md delete mode 100644 scripts/mindmodel/checks.py delete mode 100644 scripts/mindmodel/cli.py delete mode 100644 scripts/mindmodel/loader.py delete mode 100644 scripts/mindmodel/validator.py delete mode 100644 scripts/validate_mindmodel.py delete mode 100644 src/types/motion_types.py delete 
mode 100644 src/validators/mindmodel_validator.py delete mode 100644 tests/ci/test_schedule_exists.py delete mode 100644 tests/ci/test_workflow_exists.py delete mode 100644 tests/scripts/mindmodel/test_checks.py delete mode 100644 tests/scripts/mindmodel/test_cli.py delete mode 100644 tests/scripts/mindmodel/test_loader.py delete mode 100644 tests/scripts/mindmodel/test_validator.py delete mode 100644 tests/scripts/test_validate_cli.py delete mode 100644 tests/types/test_motion_types.py delete mode 100644 tests/validators/test_mindmodel_validator.py delete mode 100644 tests/validators/test_types.py delete mode 100644 tests/validators/test_validator_edgecases.py delete mode 100644 thoughts/shared/changes/2026-03-28-ansible-package-implementation.md delete mode 100644 thoughts/shared/changes/2026-03-28-env-removal-report.md delete mode 100644 thoughts/shared/changes/2026-03-28-secrets-rotation-checklist.md diff --git a/.github/workflows/mindmodel-schedule.yml b/.github/workflows/mindmodel-schedule.yml deleted file mode 100644 index 7f4fcbc..0000000 --- a/.github/workflows/mindmodel-schedule.yml +++ /dev/null @@ -1,37 +0,0 @@ -name: mindmodel scheduled validate - -on: - schedule: - - cron: '0 0 * * 0' # weekly - -jobs: - validate: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Install uv - uses: astral-sh/setup-uv@v5 - with: - version: "0.6.x" - - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: "3.13" - - - name: Install dependencies - run: uv sync --locked - - - name: Run tests - run: uv run pytest tests/ -q - - - name: Run mindmodel validator if manifest exists - if: ${{ always() }} - run: | - if [ -f .mindmodel/manifest.yaml ]; then - uv run python -m scripts.mindmodel.cli || true - else - echo "No .mindmodel/manifest.yaml present β€” skipping validator" - fi diff --git a/.github/workflows/mindmodel-validation.yml b/.github/workflows/mindmodel-validation.yml deleted file mode 100644 index 
fc5d6a1..0000000 --- a/.github/workflows/mindmodel-validation.yml +++ /dev/null @@ -1,47 +0,0 @@ -name: mindmodel validation - -on: - push: - branches: [ main ] - pull_request: - branches: [ main ] - -jobs: - validate: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: '3.x' - - - name: Install development dependencies (if present) - run: | - python -m pip install --upgrade pip - if [ -f requirements-dev.txt ]; then - pip install -r requirements-dev.txt - else - echo "requirements-dev.txt not found, skipping" - fi - - - name: Run mindmodel validator (report-only) - if: ${{ always() }} - run: | - # Make this step report-only: run the validator but always exit 0 so PRs are not blocked - set +e - if [ -f .mindmodel/manifest.yaml ]; then - python scripts/validate_mindmodel.py --manifest .mindmodel/manifest.yaml --report reports/out.json || true - else - echo "No .mindmodel/manifest.yaml present β€” skipping validator" - fi - exit 0 - - - name: Upload mindmodel reports - if: ${{ always() }} - uses: actions/upload-artifact@v4 - with: - name: mindmodel-reports - path: reports/mindmodel-report-*.json diff --git a/.gitignore b/.gitignore index 103c620..955aff4 100644 --- a/.gitignore +++ b/.gitignore @@ -29,7 +29,6 @@ dummy # Generated analysis files thoughts/explorer/*.json thoughts/explorer/*_report.md -thoughts/shared/analyses/ # Compound Engineering local config .compound-engineering/*.local.yaml diff --git a/.mindmodel/README.md b/.mindmodel/README.md deleted file mode 100644 index 4f4b84d..0000000 --- a/.mindmodel/README.md +++ /dev/null @@ -1,11 +0,0 @@ -# .mindmodel - -This directory contains a generated, read-only snapshot of the repository's "mind model" β€” structured metadata and evidence used by tooling to reason about repository intent, patterns, and decisions. - -Guidelines -- Read-only: Treat files in this directory as generated artifacts. 
Local tooling or CI may regenerate or validate them; avoid manual edits unless you are intentionally updating the generator. -- No secrets: Do not place any credentials, tokens, or sensitive data here. The validator that consumes this folder is designed to detect common secret patterns and will fail if secrets are found. -- Safe to read: Tools and CI may read these files. They must avoid opening or parsing arbitrary repository secrets and should operate in read-only mode. -- Validation: CI workflows will run a validator against this folder (if present) to ensure manifest shape, evidence snippets, and referenced files meet project rules. - -If you need to propose a change to the mind model, open a PR describing the intent and the generator changes. The CI validator will validate the submitted artifact before merge. diff --git a/.mindmodel/anti-patterns/anti-patterns.md b/.mindmodel/anti-patterns/anti-patterns.md deleted file mode 100644 index 65cb59e..0000000 --- a/.mindmodel/anti-patterns/anti-patterns.md +++ /dev/null @@ -1,127 +0,0 @@ ---- -title: Anti-Patterns in Stemwijzer -category: anti-patterns -severity: critical ---- - -# Anti-Patterns - -> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details. - -## CRITICAL: print() Instead of Logging - -**File**: `api_client.py` -**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)` - -**Broken code**: -```python -def get_motions(self, ...): - try: - # ... 
- print(f"Fetched {len(voting_records)} voting records from API") # BAD - print(f"Processed into {len(motions)} unique motions") # BAD - except Exception as e: - print(f"Error fetching motions from API: {e}") # BAD - no traceback -``` - -**Fix**: -```python -import logging - -_logger = logging.getLogger(__name__) - -def get_motions(self, ...): - try: - _logger.info("Fetched %d voting records from API", len(voting_records)) - _logger.info("Processed into %d unique motions", len(motions)) - except Exception as e: - _logger.exception("Error fetching motions from API: %s", e) - return [] -``` - ---- - -## CRITICAL: Global `_DummySt` Replacement - -**File**: `explorer.py` -**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement - -**Problem**: Creates a module-level variable `st` that shadows `streamlit` module, causing subtle bugs. - -**Fix**: Use conditional flags instead of global replacement: -```python -# GOOD: Use conditional logic -try: - import plotly.express as px - import plotly.graph_objects as go - HAS_PLOTLY = True -except ImportError: - HAS_PLOTLY = False - px = None - go = None - -def render_chart(data): - if not HAS_PLOTLY: - _logger.warning("Plotly not available") - return - # ... rest of chart logic -``` - ---- - -## WARNING: Logger Naming Inconsistency - -**Evidence**: 16 files use `logger`, 17 files use `_logger` - -**Files with `logger`** (without underscore): -- api_client.py, ai_provider.py, pipeline files, analysis files - -**Files with `_logger`** (with underscore): -- database.py, explorer.py, explorer_helpers.py - -**Recommendation**: Standardize on `_logger` for module-level loggers. 
- ---- - -## WARNING: Bare except with pass - -**File**: `database.py`, line 47 - -```python -# BAD - catches KeyboardInterrupt, SystemExit, MemoryError -try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") -except: # bare except - pass -``` - -**Fix**: -```python -try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") -except Exception as exc: - _logger.debug("Sequence creation skipped: %s", exc) -``` - ---- - -## INVESTIGATED: Entity-ID / Party-Name Mismatch - -**Status**: INVALID - investigated and resolved - -**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists. - ---- - -## Pattern: Three Separate Party Alias Dictionaries - -**Problem**: Party name variations exist in 3+ places with no canonical alias mapping. - -**Fix**: Create one `PARTY_ALIASES` dict in `config.py`: -```python -PARTY_ALIASES = { - "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"], - "PVV": ["Partij voor de Vrijheid"], - # ... 
-} -``` diff --git a/.mindmodel/architecture/architecture.yaml b/.mindmodel/architecture/architecture.yaml deleted file mode 100644 index dd49623..0000000 --- a/.mindmodel/architecture/architecture.yaml +++ /dev/null @@ -1,55 +0,0 @@ -# Architecture - -## Page Routing -- `Home.py` β†’ thin wrapper, minimal logic -- `pages/1_πŸ—³οΈ_Stemwijzer.py` β†’ thin wrapper delegating to quiz module -- `pages/2_πŸ”_Explorer.py` β†’ thin wrapper delegating to `explorer.py` -- **Pattern**: thin Streamlit page files that import and call into core modules - -## Core Modules -``` -database.py β†’ MotionDatabase singleton (shared across all pages) -explorer.py β†’ Explorer page logic, tab routing -explorer_helpers.py β†’ Pure functions, chart builders, coordinate computation -analysis/ β†’ SVD, UMAP, clustering algorithms -pipeline/ β†’ Data ingestion pipeline -config.py β†’ Dataclass Config, PARTY_COLOURS dict -``` - -## Data Flow -``` -DuckDB β†’ MotionDatabase (singleton) - ↓ - st.cache_data loaders - ↓ - explorer_helpers (pure functions) - ↓ - Plotly charts β†’ Streamlit -``` - -## Key Patterns -1. **Singleton per module**: `database.py` exports one `db` instance; `config.py` exports config + PARTY_COLOURS -2. **Graceful degradation**: try/except around optional dependencies (UMAP, Plotly) -3. **Pipeline**: fetch β†’ transform β†’ store (see `pipeline/` directory) -4. **API client**: with retry/backoff for external data sources -5. **Dummy fallbacks**: if optional dep unavailable, use dummy stub - -## Database Schema (key relationships) -``` -motions (id, title, date, category) - ↓ -mp_votes (mp_id, motion_id, vote: -1/0/1) - ↓ -svd_vectors (entity_id, window, vector_2d) ← entity_id = mp_name OR party_name - ↓ -party_centroids (party, window, centroid_2d) - ↓ -mp_party_history (mp_id, party, start_date, end_date) -``` - -## SVD Computation Pipeline -1. Build MP Γ— Motion vote matrix from `mp_votes` -2. Run SVD to get 2D embeddings per MP -3. 
Optionally aggregate to party centroids -4. Align across windows using Procrustes -5. Store in `svd_vectors` table diff --git a/.mindmodel/constraints/README.md b/.mindmodel/constraints/README.md deleted file mode 100644 index 163ba5c..0000000 --- a/.mindmodel/constraints/README.md +++ /dev/null @@ -1,51 +0,0 @@ -# Constraint Files Index - -This directory contains all constraint files for the Stemwijzer codebase. - -## Quick Navigation - -| Category | File | Purpose | -|----------|------|---------| -| **Stack** | `../stack/stack.yaml` | Tech stack overview | -| **Architecture** | `../architecture/architecture.yaml` | Data flow, page routing, component relationships | -| **Conventions** | `../conventions/conventions.yaml` | Naming, error handling, code organization | -| **Domain** | `../domain/domain-glossary.yaml` | Dutch political terms, algorithm concepts | -| **Patterns** | `../patterns/patterns.yaml` | 10 code patterns (page wrapper, pipeline, etc.) | -| **Anti-Patterns** | `../anti-patterns/anti-patterns.yaml` | ⚠️ 7 issues including CRITICAL BUG | -| **Dependencies** | `../dependencies/dependencies.yaml` | Library wiring, singletons, imports | - -## How to Use - -1. **Before writing code**: Check `patterns/patterns.yaml` for how similar features are implemented -2. **When naming things**: Follow `conventions/conventions.yaml` (snake_case functions, PascalCase classes) -3. **When handling errors**: Avoid patterns in `anti-patterns/anti-patterns.yaml` -4. **When working with domain terms**: Reference `domain/domain-glossary.yaml` -5. 
**When connecting components**: See `dependencies/dependencies.yaml` for wiring - -## Key Conventions Summary - -- **Files**: snake_case (`explorer_helpers.py`) -- **Functions**: snake_case (`compute_party_coords`) -- **Classes**: PascalCase (`MotionDatabase`) -- **Constants**: UPPER_SNAKE_CASE (`PARTY_COLOURS`) -- **No bare `except:`** β€” always specify exception type -- **Pure functions** in helpers β€” no IO, no Streamlit calls -- **One singleton per module** β€” `db`, `config`, `PARTY_COLOURS` - -## ⚠️ Critical Bug - -**Read `../anti-patterns/anti-patterns.yaml` first.** Section 1 documents a critical bug in -`explorer_helpers.py:compute_party_coords` where party names in `svd_vectors` entity_id are -not recognized because `party_map` only contains MP-name keys. - -## Files Generated - -- `manifest.yaml` β€” lists all constraint files with group mappings -- `stack/stack.yaml` β€” tech stack -- `architecture/architecture.yaml` β€” data flow & components -- `conventions/conventions.yaml` β€” coding conventions -- `domain/domain-glossary.yaml` β€” domain terminology -- `patterns/patterns.yaml` β€” 10 code patterns with examples -- `anti-patterns/anti-patterns.yaml` β€” 7 anti-patterns including CRITICAL BUG -- `dependencies/dependencies.yaml` β€” library wiring -- `README.md` β€” this index diff --git a/.mindmodel/constraints/error-handling.md b/.mindmodel/constraints/error-handling.md deleted file mode 100644 index 9d0c75d..0000000 --- a/.mindmodel/constraints/error-handling.md +++ /dev/null @@ -1,143 +0,0 @@ ---- -title: Error Handling Patterns -category: constraints -severity: high ---- - -# Error Handling Patterns - -## Core Rules - -1. **Catch `Exception`, return safe fallbacks** (False/[]/None) -2. **Log exceptions with traceback** using `_logger.exception()` -3. **Never swallow exceptions silently** - always log or return sensible default -4. 
**Avoid nested try/except blocks** - flatten exception handling - -## Pattern: Try/Except Safe Fallback - -This is the dominant pattern in the codebase (219+ instances). - -```python -# Standard pattern from database.py, api_client.py, etc. -try: - result = risky_operation() - return process(result) -except Exception as exc: - _logger.warning("Operation failed: %s", exc) - return safe_fallback # False, [], None, {} -``` - -### Examples from Codebase - -**database.py** - DuckDB operations: -```python -def get_svd_vectors(self, window: str): - try: - conn = duckdb.connect(self.db_path, read_only=True) - try: - result = conn.execute(query, (window,)).fetchall() - return self._parse_vectors(result) - finally: - conn.close() - except Exception as exc: - _logger.warning("Failed to get SVD vectors: %s", exc) - return [] -``` - -**ai_provider.py** - HTTP retries: -```python -try: - resp = requests.post(url, json=json, headers=headers, timeout=10) - resp.raise_for_status() - return resp.json() -except requests.ConnectionError as exc: - if attempt == retries: - raise ProviderError(f"Connection error: {exc}") from exc - # ... retry logic -``` - -## Pattern: Optional Dependency Fallback - -Gracefully degrade when optional packages are unavailable. - -```python -# UMAP fallback in explorer_helpers.py -try: - import umap - HAS_UMAP = True -except ImportError: - HAS_UMAP = False - _logger.debug("UMAP not available, using SVD vectors directly") - -def project_to_2d(vectors): - if HAS_UMAP: - return umap.UMAP().fit_transform(vectors) - return vectors[:, :2] # Fallback: first 2 SVD dimensions -``` - -## Anti-Patterns - -### 1. 
Bare except with pass (CRITICAL) -**File**: `database.py`, line 47 - -```python -# BAD - catches KeyboardInterrupt, SystemExit, MemoryError -try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") -except: # bare except - pass -``` - -**Fix**: Catch specific exception or log and continue: -```python -try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") -except Exception as exc: - _logger.debug("Sequence creation skipped (may already exist): %s", exc) -``` - -### 2. Nested Exception Handling -**File**: `explorer.py`, lines 244-261 - -```python -# BAD - opaque error paths -try: - result = compute_svd(motions) -except Exception: - try: - result = fallback_compute(motions) - except Exception: - pass # Both exceptions silently dropped -``` - -**Fix**: Flatten and handle each case explicitly: -```python -# GOOD - explicit handling -try: - result = compute_svd(motions) -except Exception as exc: - _logger.warning("SVD failed, trying fallback: %s", exc) - try: - result = fallback_compute(motions) - except Exception as fallback_exc: - _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc) - raise -``` - -## Rule Summary - -| Pattern | When to Use | Return Value | -|---------|-------------|--------------| -| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` | -| Re-raise | Critical operations that must succeed | raise | -| Log and continue | Optional steps in pipeline | (continue) | -| Graceful degradation | Optional dependencies | Default behavior | - -## When to Log vs Return - -| Scenario | Action | -|----------|--------| -| User action fails | Log warning, return safe default | -| Internal error (corrupt data) | Log error, return safe default | -| Transient failure (network) | Log warning, retry if appropriate | -| Configuration error | Log error, raise with clear message | diff --git a/.mindmodel/constraints/imports.yaml b/.mindmodel/constraints/imports.yaml deleted file mode 100644 index 
86ed296..0000000 --- a/.mindmodel/constraints/imports.yaml +++ /dev/null @@ -1,205 +0,0 @@ -# Import Organization Constraints - -## Standard Order - -Organize imports in three groups with blank lines between: - -```python -# 1. Standard library imports (alphabetical within group) -import json -import logging -import os -from datetime import datetime, timedelta -from typing import Dict, List, Optional, Tuple - -# 2. Third-party packages (alphabetical within group) -import duckdb -import requests -from config import config - -# 3. Local application modules (can use relative imports) -from database import db -from summarizer import summarizer -``` - -## Alphabetical Ordering - -Within each group, sort imports alphabetically: - -```python -# GOOD - alphabetical -import json -import logging -from datetime import datetime -from typing import Dict, List, Optional - -# BAD - random order -from typing import Optional -import json -from datetime import datetime -import logging -from typing import Dict, List -``` - -## Grouping Rules - -### Standard Library -- `json`, `logging`, `os`, `sys`, `time` -- `datetime`, `timedelta` from `datetime` -- `Dict`, `List`, `Optional`, etc. 
from `typing` -- `argparse`, `pathlib`, `re`, `uuid` - -### Third-Party -- `duckdb`, `requests`, `streamlit` -- `numpy`, `scipy`, `sklearn` -- `plotly`, `beautifulsoup4` -- `pytest` - -### Local Application -- Modules from same package -- Relative imports when appropriate - -## When to Use `from X import Y` - -### Prefer `from module import specific_items` for: -- Constants and config -- Single classes or functions used frequently -- Type annotations - -```python -# GOOD - clear about what we're using -from config import config -from database import db - -# GOOD - type hints -from typing import Dict, List, Optional -``` - -### Use `import module` when: -- You need multiple items from the module -- Using module.namespace is clearer - -```python -# GOOD - duckdb used for types and module access -import duckdb - -conn = duckdb.connect(...) -result = conn.execute(...) - -# Also acceptable for types -from typing import Dict -``` - -## Relative Imports - -In package modules, prefer relative imports: - -```python -# pipeline/svd_pipeline.py -from ..database import MotionDatabase # relative import -from .text_pipeline import process_text # relative import -``` - -## Circular Imports - -Avoid circular imports by: -1. Moving shared code to a third module -2. Using TYPE_CHECKING for type hints only - -```python -# types.py - shared type definitions -from typing import TypedDict - -class MotionDict(TypedDict): - id: int - title: str - ... - -# module_a.py -from .types import MotionDict - -# module_b.py - if needed here too -from .types import MotionDict -``` - -## Import Patterns to Avoid - -### Wildcard Imports -```python -# BAD -from database import * - -# GOOD -from database import db, MotionDatabase -``` - -### Import in Function Scope (unless necessary) -```python -# AVOID - delays import, makes dependencies unclear -def some_function(): - import pandas as pd # Late import - return pd.DataFrame(...) 
- -# PREFER - import at module level -import pandas as pd - -def some_function(): - return pd.DataFrame(...) -``` - -### Reassigning Imported Names -```python -# BAD - confusing -from module import process -process = something_else # Reassigning - -# GOOD - clear naming -from module import process as process_data -``` - -## Type Checking Imports - -For type hints only, use TYPE_CHECKING: - -```python -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from .models import Motion - -def get_motion(motion_id: int) -> "Motion": # String quote for forward ref - ... -``` - -## Optional Dependency Imports - -Handle optional dependencies gracefully: - -```python -try: - import duckdb -except Exception: - duckdb = None # Will be checked later - -class MotionDatabase: - def __init__(self): - if duckdb is None: - self._file_mode = True # Fallback mode -``` - -## Example: Complete Import Block - -```python -# Complete example from database.py -import json -import logging -import uuid -from datetime import datetime, timedelta -from typing import Dict, List, Optional, Tuple - -import duckdb - -from config import config - -from database import db -``` diff --git a/.mindmodel/constraints/logging.md b/.mindmodel/constraints/logging.md deleted file mode 100644 index adc7dd6..0000000 --- a/.mindmodel/constraints/logging.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -title: Logging Constraints -category: constraints -severity: critical ---- - -# Logging Constraints - -## Core Rule - -Use `logging.getLogger(__name__)` - never use `print()` - -**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances). - -## CRITICAL Anti-Pattern: print() Instead of Logging - -**File**: `api_client.py` -**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)` - -**Broken code**: -```python -def get_motions(self, ...): - try: - # ... 
- print(f"Fetched {len(voting_records)} voting records from API") # BAD - print(f"Processed into {len(motions)} unique motions") # BAD - except Exception as e: - print(f"Error fetching motions from API: {e}") # BAD - no traceback -``` - -**Fix**: -```python -import logging - -_logger = logging.getLogger(__name__) - -def get_motions(self, ...): - try: - _logger.info("Fetched %d voting records from API", len(voting_records)) - _logger.info("Processed into %d unique motions", len(motions)) - except Exception as e: - _logger.exception("Error fetching motions from API: %s", e) - return [] -``` - -## Logger Initialization - -Get logger at module level: - -```python -# GOOD: Use logging.getLogger(__name__) -import logging - -_logger = logging.getLogger(__name__) - -def some_function(): - _logger.info("Processing started") - _logger.debug("Detail: %s", detail) -``` - -## Logger Naming - -Use `__name__` for automatic module path: - -```python -# In database.py - logger will be "database" -_logger = logging.getLogger(__name__) - -# In pipeline/svd_pipeline.py - logger will be "pipeline.svd_pipeline" -_logger = logging.getLogger(__name__) -``` - -**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention. - -**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish from class-level loggers. 
- -## Log Levels - -| Level | When to Use | -|-------|-------------| -| DEBUG | Detailed diagnostic info (dev only) | -| INFO | Normal operation milestones | -| WARNING | Unexpected but handled (fallbacks) | -| ERROR | Operation failed, may need attention | -| CRITICAL | Fatal error, program may crash | - -## Exception Logging - -Use `_logger.exception()` for caught exceptions (includes traceback): - -```python -try: - result = risky_operation() -except Exception as exc: - _logger.exception("Operation failed: %s", exc) - return fallback_value -``` - -## Anti-Patterns - -### Debug Prints in Production Code -```python -# BAD -print(f"[TRAJ DEBUG] processing window {wid}") - -# GOOD -_logger.debug("Processing window %s", wid) -``` - -### Inconsistent Logger Names -```python -# BAD - mixing _logger and logger -_logger = logging.getLogger(__name__) -logger = logging.getLogger("other") # Inconsistent -``` - -## Sensitive Data - -Never log sensitive information: -- API keys -- User votes -- Session IDs (if tied to user data) -- Personal information - -```python -# BAD -_logger.info("User %s voted %s", user_id, vote) - -# GOOD - log aggregates, not individual votes -_logger.info("Vote recorded for session %s", session_id[:8]) -``` diff --git a/.mindmodel/constraints/naming.yaml b/.mindmodel/constraints/naming.yaml deleted file mode 100644 index bccbf7c..0000000 --- a/.mindmodel/constraints/naming.yaml +++ /dev/null @@ -1,141 +0,0 @@ -# Naming Constraints - -## File Names - -### Python Modules -- **Convention**: `snake_case.py` -- **Examples**: `motion_database.py`, `api_client.py`, `text_pipeline.py` - -### Test Files -- **Convention**: `test_.py` -- **Examples**: `test_database.py`, `test_api_client.py` - -### Config Files -- **Convention**: `snake_case` -- **Examples**: `config.py`, `.env.example`, `pyproject.toml` - -### Directories -- **Convention**: `snake_case/` -- **Examples**: `pipeline/`, `tests/integration/`, `src/validators/` - -## Class Names - -- 
**Convention**: `PascalCase` -- **Examples**: `MotionDatabase`, `TweedeKamerAPI`, `MotionSummarizer` - -### Naming Patterns -| Pattern | Example | -|---------|---------| -| Database wrapper | `MotionDatabase` | -| API client | `TweedeKamerAPI` | -| Service/Helpers | `MotionScraper`, `MotionAnalyzer` | -| Exceptions | `ProviderError` | - -## Function Names - -- **Convention**: `snake_case` -- **Examples**: `get_motions`, `compute_similarity`, `process_voting_records` - -### Private Methods -- **Convention**: `_snake_case` (single underscore prefix) -- **Examples**: `_get_voting_records`, `_parse_response` - -## Variable Names - -### Regular Variables -- **Convention**: `snake_case` -- **Examples**: `motion_id`, `party_name`, `voting_results` - -### Constants (Module-Level) -- **Convention**: `UPPER_SNAKE_CASE` -- **Examples**: `DATABASE_PATH`, `API_TIMEOUT`, `MAX_RETRIES` - -### Config Variables (in dataclass) -- **Convention**: `UPPER_SNAKE_CASE` -- **Examples**: `QWEN_MODEL`, `POLICY_AREAS` - -### Booleans -- **Convention**: `is_`, `has_`, `can_` prefixes or `_flag` suffix -- **Examples**: `is_active`, `has_votes`, `skip_extract` - -### Private Variables -- **Convention**: `_underscore_prefix` -- **Examples**: `_conn`, `_cache`, `_session` - -## Singleton Instances - -- **Convention**: `lower_snake_case` at module level -- **Examples**: `db = MotionDatabase()`, `summarizer = MotionSummarizer()` - -```python -# database.py -class MotionDatabase: - ... 
- -# Singleton instance -db = MotionDatabase() - -# Usage -from database import db -motions = db.get_motions() -``` - -## Type Variables - -- **Convention**: `PascalCase` -- **Examples**: `T = TypeVar('T')`, `MotionDict = Dict[str, Any]` - -## Anti-Patterns - -### Inconsistent Naming -```python -# BAD - mixing styles -get_motions() # snake_case -GetMotionById() # PascalCase -processData() # camelCase - -# GOOD - consistent snake_case -get_motions() -get_motion_by_id() -process_voting_data() -``` - -### Abbreviations -```python -# AVOID - unclear abbreviations -calc_similarity() # calculate_* -proc_votes() # process_* -get_mp_data() # get_mp_metadata() - -# PREFER - full words -calculate_similarity() -process_votes() -get_mp_metadata() -``` - -### Hungarian Notation -```python -# BAD - Hungarian notation -str_title = "..." -int_count = 0 -b_is_active = True - -# GOOD - clear types via naming -title = "..." -count = 0 -is_active = True -``` - -## Special Cases - -### Window IDs -- **Format**: `"YYYY-QN"` or `"YYYY"` -- **Examples**: `"2024-Q1"`, `"2024-Q2"`, `"2024"` - -### Policy Areas -- **Convention**: PascalCase with spaces -- **Examples**: `"Economie"`, `"Sociale Zaken"`, `"Klimaat"` - -### Vote Values -- **Convention**: PascalCase Dutch terms -- **Values**: `"Voor"`, `"Tegen"`, `"Onthouden"`, `"Geen stem"`, `"Afwezig"` diff --git a/.mindmodel/constraints/testing.yaml b/.mindmodel/constraints/testing.yaml deleted file mode 100644 index f6095d9..0000000 --- a/.mindmodel/constraints/testing.yaml +++ /dev/null @@ -1,26 +0,0 @@ -# Testing conventions constraint (YAML) - -rules: - - name: test_naming - rule: "Use pytest and name tests test_*.py and test_* functions." - examples: - - good: "tests/test_text_pipeline.py" - - bad: "tests/text_pipeline_test.py" - - - name: fixtures_and_conftest - rule: "Place shared fixtures in tests/conftest.py or tests/fixtures/ for reuse." 
- examples: - - good: "use fixtures declared in tests/conftest.py" - - - name: assert_raises - rule: "Explicitly assert expected exceptions with pytest.raises for invalid input." - examples: - - good: | - import pytest - - def test_invalid_input(): - with pytest.raises(ValueError): - function_under_test('bad') - -enforcement_examples: - - "Run pytest in CI; fail if tests don't run or if there are regressions." diff --git a/.mindmodel/constraints/types.yaml b/.mindmodel/constraints/types.yaml deleted file mode 100644 index 53d63d5..0000000 --- a/.mindmodel/constraints/types.yaml +++ /dev/null @@ -1,233 +0,0 @@ -# Type Hint Constraints - -## Core Rule - -**Use type hints on all public functions and methods** - -## Function Type Hints - -### Required on Public APIs - -```python -# GOOD - complete type hints -def get_motion(self, motion_id: int) -> Optional[Dict]: - ... - -def get_filtered_motions( - self, - policy_area: str = "Alle", - limit: int = 10 -) -> List[Dict]: - ... - -def calculate_similarity(self, motion_a: int, motion_b: int) -> float: - ... -``` - -### Optional Parameters - -Use `Optional[X]` or `X | None`: - -```python -# Both forms are acceptable -def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]: - ... - -def get_motion(self, motion_id: int | None = None) -> dict | None: - ... -``` - -### Multiple Return Types - -Use `Union[X, Y]` or `|` operator: - -```python -# Acceptable forms -def parse_value(self, value: str) -> Union[bool, str, None]: - ... - -def parse_value(self, value: str) -> bool | str | None: - ... -``` - -### Generic Types - -Use `List[X]`, `Dict[K, V]`, `Tuple[X, Y]`: - -```python -from typing import Dict, List, Optional, Tuple - -def get_motions(self, ids: List[int]) -> Dict[int, Dict]: - """Map motion_id -> motion data.""" - ... - -def process_batch(self, items: List[str]) -> Tuple[List[str], List[str]]: - """Returns (successes, failures).""" - ... 
-``` - -## Collection Types - -Prefer specific types over bare `list`/`dict`: - -```python -# GOOD - specific types -def get_votes(self) -> List[str]: - ... - -def get_metadata(self) -> Dict[str, Any]: - ... - -# ACCEPTABLE - for truly generic collections -def merge_dicts(*dicts: dict) -> dict: - ... -``` - -## DuckDB Result Types - -DuckDB returns tuples/lists - document expected structure: - -```python -def get_motion(self, motion_id: int) -> Optional[Tuple]: - """Returns (id, title, description, date, ...) or None.""" - conn = duckdb.connect(self.db_path) - try: - result = conn.execute( - "SELECT * FROM motions WHERE id = ?", (motion_id,) - ).fetchone() - return result - finally: - conn.close() - -# Or use Dict for clarity -def get_motion_as_dict(self, motion_id: int) -> Optional[Dict]: - """Returns motion dict or None.""" - conn = duckdb.connect(self.db_path) - try: - row = conn.execute( - "SELECT * FROM motions WHERE id = ?", (motion_id,) - ).fetchone() - if row: - return { - "id": row[0], - "title": row[1], - "description": row[2], - ... - } - return None - finally: - conn.close() -``` - -## Class/Instance Types - -Use `Self` for methods returning instance type: - -```python -from typing import Self - -class MotionDatabase: - def with_connection(self, path: str) -> Self: - """Return new instance with different path.""" - return MotionDatabase(db_path=path) -``` - -## Callback/Function Types - -Use `Callable` for function parameters: - -```python -from typing import Callable - -def process_motions( - motions: List[Dict], - processor: Callable[[Dict], Any] -) -> List[Any]: - return [processor(m) for m in motions] -``` - -## Type Aliases - -Define clear type aliases for domain concepts: - -```python -from typing import Dict, List, TypedDict, Literal - -# Vote values -VoteValue = Literal["Voor", "Tegen", "Onthouden", "Geen stem", "Afwezig"] - -# Policy areas -PolicyArea = Literal["Alle", "Economie", "Klimaat", "Immigratie", ...] 
- -# Motion dict -class MotionDict(TypedDict): - id: int - title: str - description: Optional[str] - date: Optional[str] - policy_area: Optional[str] - voting_results: Optional[str] # JSON string - winning_margin: Optional[float] - -def get_motion(self, motion_id: int) -> Optional[MotionDict]: - ... -``` - -## Avoid `Any` - -Use `Any` sparingly - prefer specific types: - -```python -# AVOID - too vague -def process(data: Any) -> Any: - ... - -# PREFER - specific types -def process(motion: MotionDict) -> Optional[SimilarityResult]: - ... -``` - -## Inline Type Hints - -For simple cases, inline hints are fine: - -```python -def get_count(self) -> int: - ... - -def is_empty(self) -> bool: - ... -``` - -## Docstring Type Hints - -For complex types, include in docstrings: - -```python -def get_party_positions(self, window_id: str) -> Dict[str, List[float]]: - """Get party positions in political space. - - Args: - window_id: Time window (e.g., "2024-Q1") - - Returns: - Dict mapping party_name -> [x, y] coordinates - - Example: - >>> positions = db.get_party_positions("2024-Q1") - >>> positions["VVD"] - [0.5, -0.3] - """ - ... 
-``` - -## Type Checking - -Type hints are not enforced at runtime; where enforcement matters, check explicitly with `isinstance`: - -```python -def set_count(self, count: int) -> None: - if not isinstance(count, int): - raise TypeError(f"Expected int, got {type(count).__name__}") - self._count = count -``` diff --git a/.mindmodel/conventions/conventions.yaml b/.mindmodel/conventions/conventions.yaml deleted file mode 100644 index 7e7391c..0000000 --- a/.mindmodel/conventions/conventions.yaml +++ /dev/null @@ -1,124 +0,0 @@ -# Naming Conventions - -## Files -- **snake_case** for all Python files: `database.py`, `explorer_helpers.py`, `motion_cache.py` -- **PascalCase** is not used for files - -## Functions -- **snake_case**: `get_svd_vectors()`, `compute_party_coords()`, `build_scatter_trace()` -- Private helpers prefixed with `_`: `_get_window_data()` - -## Classes -- **PascalCase**: `MotionDatabase`, `Config` -- **Dataclass pattern** for Config: `@dataclass` decorator with typed fields - -## Variables -- **snake_case**: `party_map`, `mp_name`, `svd_vectors`, `party_centroids` -- **UPPER_SNAKE_CASE** for module-level constants: `PARTY_COLOURS`, `DEFAULT_WINDOW` - -## Module-Level Exports -- **Singleton instance**: `db = MotionDatabase()` at module bottom (not class-level) -- **Config instance**: `config = Config(...)` at module bottom -- **Dicts**: `PARTY_COLOURS` exported from `config.py` - ---- - -# Error Handling - -## Known Patterns -1. **Bare except with pass** (ANTI-PATTERN - see anti-patterns.md) - ```python - except: - pass # database.py:47 - ``` - -2. **Graceful degradation**: catch specific exceptions, fall back to default - ```python - try: - result = compute_svd() - except ImportError: - result = DEFAULT_SVD - ``` - -3. **Optional dependency fallbacks**: - ```python - try: - import umap - use_umap = True - except ImportError: - use_umap = False - ``` - -4. **Nested exception handling** (ANTI-PATTERN - see anti-patterns.md): - ```python - try: - ... - except Exception: - try: - ... 
- except Exception: - pass - ``` - -## Rules - -- Never use bare `except:`; always specify exception type -- Never swallow exceptions silently; log or return a sensible default -- For optional deps, use `ImportError` or `ModuleNotFoundError` explicitly -- Avoid nested try/except blocks - ---- - -# Code Organization - -## Singleton Pattern -Each module owns one shared instance: -```python -# database.py -db = MotionDatabase() - -# config.py -config = Config(...) -PARTY_COLOURS = {...} -``` - -## Pure Functions in Helpers -`explorer_helpers.py` contains only pure functions (no IO, no Streamlit calls): -```python -def compute_party_coords(svd_vectors, party_map): - """Pure: no side effects, no imports from this module""" - ... - -def build_scatter_trace(df, color_col): - """Pure: returns Plotly trace dict""" - ... -``` - -## Cached Data Loaders -Use `@st.cache_data` for expensive data loading: -```python -@st.cache_data -def load_svd_vectors(window: str) -> pd.DataFrame: - return db.get_svd_vectors(window) -``` - -## Dataclass Config -```python -@dataclass -class Config: - db_path: str = "data/stemwijzer.duckdb" - default_window: str = "2023" - party_colours: dict = field(default_factory=lambda: PARTY_COLOURS) -``` - ---- - -# Imports - -## Ordering (convention) -1. Standard library -2. Third-party (streamlit, ibis, plotly, sklearn, umap) -3. 
Local/relative imports - -## Avoid -- Wildcard imports (`from module import *`) -- Circular imports (ensure dependency direction: helpers → database → config) diff --git a/.mindmodel/dependencies/dependencies.md b/.mindmodel/dependencies/dependencies.md deleted file mode 100644 index 49c7ba9..0000000 --- a/.mindmodel/dependencies/dependencies.md +++ /dev/null @@ -1,92 +0,0 @@ ---- -title: Dependencies and Library Usage -category: dependencies ---- - -# Dependencies and Library Usage - -## Core Dependencies - -### duckdb -- **Required**: Yes -- **Fallback**: None (core functionality) -- **Usage**: SQL database for motions, embeddings, SVD vectors -- **Files**: database.py, analysis/*.py, pipeline/*.py - -### streamlit -- **Required**: Yes -- **Fallback**: None -- **Usage**: Web UI framework -- **Files**: app.py, pages/*.py, explorer.py - -### requests -- **Required**: Yes -- **Fallback**: None -- **Usage**: HTTP client for API calls -- **Files**: api_client.py, ai_provider.py - -### plotly -- **Required**: Yes -- **Fallback**: None (raises ImportError) -- **Usage**: Interactive charts for explorer -- **Files**: explorer.py, explorer_helpers.py - -## Optional Dependencies - -### umap-learn -- **Required**: No -- **Fallback**: Use raw SVD vectors (first 2 dimensions) -- **Usage**: Dimensionality reduction for visualization -- **Files**: analysis/clustering.py - -### matplotlib -- **Required**: No -- **Fallback**: Plotly or raw output -- **Usage**: Static charting -- **Files**: Various analysis scripts - -## ML Dependencies - -### sklearn -- **Required**: Yes -- **Usage**: KMeans clustering, cosine_similarity, StandardScaler -- **Files**: analysis/clustering.py, similarity/compute.py - -### scipy -- **Required**: Yes -- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment -- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py - -### numpy -- **Required**: Yes -- **Usage**: Array operations, linear algebra -- **Files**: Throughout codebase - 
-## Key Imports by File - -### explorer.py -- `import streamlit as st` -- `from database import db` -- `from explorer_helpers import *` - -### explorer_helpers.py -- `import pandas as pd` -- `import plotly.graph_objects as go` -- `from database import db` (optional, for type hints) - -### database.py -- `import ibis` -- `import duckdb` -- `from config import config, PARTY_COLOURS` - -### config.py -- `from dataclasses import dataclass, field` -- `import streamlit as st` (optional, for warnings) - -## Singleton Instances - -| Module | Instance | Type | -|--------|----------|------| -| `database.py` | `db` | `MotionDatabase` | -| `config.py` | `config` | `Config` (dataclass) | -| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | diff --git a/.mindmodel/domain/domain-glossary.md b/.mindmodel/domain/domain-glossary.md deleted file mode 100644 index 9da8f9b..0000000 --- a/.mindmodel/domain/domain-glossary.md +++ /dev/null @@ -1,146 +0,0 @@ ---- -title: Domain Glossary -category: domain ---- - -# Domain Glossary - Dutch Political Terms - -## CRITICAL INVARIANTS - -> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes -> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT -> - Individual right-wing parties may vary slightly from the centroid -> - This is non-negotiable for any compass/axis visualization - -> **Rule 2**: SVD labels are empirically derived from voting data -> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion -> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative) -> - See SVD Label Derivation section below - ---- - -## SVD Label Derivation - -### The Process - -SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis: - -1. **Identify outliers**: Find the two MPs with most extreme positions on that axis -2. 
**Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes) -3. **Interpret theme**: Read the motion titles to derive what the axis represents -4. **Assign label**: Label describes the empirical theme, could be: - - Left-Right - - Coalition-Opposition - - Progressive-Conservative - - EU-National sovereignty - - Populist-Establishment - - Or whatever the voting patterns show - -### Example - -| Step | Description | -|------|-------------| -| Outlier A | Wilders (PVV) - extreme positive on Dim 1 | -| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 | -| 20 Motions | Immigration, integration, law & order themes dominate | -| Label | "Links-Rechts" (Left-Right) | - -### Labeling Rules - -- **Never use party names in labels** (e.g., not "PVV-SP axis") -- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show) -- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy") -- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2" - ---- - -## Core Entities - -### Motion / Motie -- Parliamentary motion submitted by MPs -- Fields: `id`, `title`, `date`, `category` -- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** - -### MP / Kamerlid -- Member of Parliament (Tweede Kamerlid) -- Identified by full name (e.g., "Van Dijk, I.") -- Has voting record, party affiliation, SVD position vector - -### Party / Fractie -- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") -- Party centroids: average SVD position of all MPs in party - -### Vote / Stemming -- Individual MP's vote on a motion: +1, 0, -1 -- Aggregated to compute SVD vectors - ---- - -## Time & Analysis Concepts - -### Window / Tijdsvenster -- Time period for analysis (annual or quarterly) -- Values: "2023", "2023-Q1", "2024", etc. 
-- SVD vectors computed per window - -### Trajectory -- MP's position change across multiple windows -- Computed from `svd_vectors` + window ordering - ---- - -## Mathematical / Algorithmic Terms - -### SVD Vector -- 2D vector from Singular Value Decomposition of MP × Motion vote matrix -- Represents MP's position in political space - -### SVD Label -- Empirically derived axis label based on outlier MPs and representative motions -- Describes the theme of disagreement on that axis -- NOT based on party ideology or semantic labels - -### Political Compass -- 2D visualization with SVD axes mapped to compass quadrants -- X-axis: First SVD dimension (labeled from voting data) -- Y-axis: Second SVD dimension (labeled from voting data) - -### Procrustes Alignment -- Algorithm to align SVD vectors across time windows -- Ensures comparable positions across years/quarters - -### UMAP -- Uniform Manifold Approximation and Projection -- Dimensionality reduction for visualization -- Optional dependency with graceful SVD fallback - ---- - -## Database Table Reference - -| Table | Key Fields | -|-------|-----------| -| `motions` | id, title, date, category | -| `mp_votes` | mp_id, motion_id, vote | -| `svd_vectors` | entity_id, window, vector_2d (list[2]) | -| `mp_party_history` | mp_id, party, start_date, end_date | -| `windows` | window_id, start_date, end_date, period_type | -| `mp_trajectories` | mp_id, window, trajectory_vector | - ---- - -## Dutch Political Parties - -### Canonical Right-Wing (centroid on RIGHT of axes) -- PVV (Partij voor de Vrijheid) -- FVD (Forum voor Democratie) -- JA21 -- SGP (Staatkundig Gereformeerde Partij) - -### Other Major Parties -- VVD (Volkspartij voor Vrijheid en Democratie) -- GL-PvdA (GroenLinks-PvdA) -- NSC (Nieuw Sociaal Contract) -- BBB (BoerBurgerBeweging) -- SP (Socialistische Partij) -- D66 (Democraten 66) diff --git a/.mindmodel/examples/api-client-example.py b/.mindmodel/examples/api-client-example.py deleted file mode 100644 
index 5e7bfa6..0000000 --- a/.mindmodel/examples/api-client-example.py +++ /dev/null @@ -1,196 +0,0 @@ -"""Example: TweedeKamerAPI usage - from api_client.py and actual codebase.""" - -from datetime import datetime, timedelta -from typing import Dict, List - -# Import the API client -from api_client import TweedeKamerAPI - - -# ============================================================================= -# Example 1: Basic API usage -# ============================================================================= - - -def example_fetch_motions(): - """Fetch recent parliamentary motions from TweedeKamer API.""" - - api = TweedeKamerAPI() - - # Fetch motions from last 30 days - start_date = datetime.now() - timedelta(days=30) - - try: - motions = api.get_motions(start_date=start_date, limit=100) - - print(f"Fetched {len(motions)} motions") - - for motion in motions[:5]: # Show first 5 - print(f" - {motion.get('title', 'N/A')}") - - return motions - finally: - api.close() - - -# ============================================================================= -# Example 2: Fetching with date range -# ============================================================================= - - -def example_date_range(): - """Fetch motions from a specific date range.""" - - api = TweedeKamerAPI() - - start = datetime(2024, 1, 1) - end = datetime(2024, 3, 31) # Q1 2024 - - try: - motions = api.get_motions(start_date=start, end_date=end, limit=500) - - # Group by policy area - by_area = {} - for m in motions: - area = m.get("policy_area", "Onbekend") - by_area.setdefault(area, []).append(m) - - for area, area_motions in sorted(by_area.items()): - print(f"{area}: {len(area_motions)} motions") - - return motions - finally: - api.close() - - -# ============================================================================= -# Example 3: Context manager usage -# ============================================================================= - - -def example_context_manager(): - """Use API client 
as context manager.""" - - with TweedeKamerAPI() as api: - motions = api.get_motions( - start_date=datetime.now() - timedelta(days=7), limit=50 - ) - - print(f"Fetched {len(motions)} motions this week") - - return motions - - -# ============================================================================= -# Example 4: Processing voting records -# ============================================================================= - - -def example_process_votes(): - """Process individual voting records from API.""" - - api = TweedeKamerAPI() - - start_date = datetime.now() - timedelta(days=7) - - try: - # Get voting records directly - voting_records, besluit_meta = api._get_voting_records( - start_date=start_date, limit=1000 - ) - - print(f"Fetched {len(voting_records)} voting records") - print(f"From {len(besluit_meta)} unique decisions") - - # Count votes by party - party_votes = {} - for record in voting_records: - party = record.get("Fractie", "Onbekend") - vote = record.get("Soort", "Onbekend") - party_votes.setdefault(party, {})[vote] = ( - party_votes.get(party, {}).get(vote, 0) + 1 - ) - - for party, votes in sorted(party_votes.items()): - total = sum(votes.values()) - voor = votes.get("Voor", 0) - print(f"{party}: {total} votes ({voor} voor)") - - return voting_records - finally: - api.close() - - -# ============================================================================= -# Example 5: Safe API call with fallback -# ============================================================================= - - -def example_safe_call(): - """Make API call with safe fallback on failure.""" - - api = TweedeKamerAPI() - - try: - # This will return [] on any error - motions = api.get_motions( - start_date=datetime.now() - timedelta(days=30), limit=100 - ) - - if not motions: - print("No motions returned - using cached data") - # Fallback to cached/local data - from database import db - - return db.get_filtered_motions(limit=10) - - return motions - finally: - api.close() - - 
-# ============================================================================= -# Example 6: Pagination handling -# ============================================================================= - - -def example_pagination(): - """Understand how pagination works in the API.""" - - api = TweedeKamerAPI() - - start_date = datetime.now() - timedelta(days=365) - - # Simulate pagination - page_size = 250 - total_limit = 500 - - all_motions = [] - skip = 0 - - while len(all_motions) < total_limit: - print(f"Fetching page with skip={skip}...") - - # In real usage, get_motions handles pagination internally - # This demonstrates what's happening under the hood - page_motions = api._fetch_page(start_date=start_date, skip=skip, top=page_size) - - if not page_motions: - break - - all_motions.extend(page_motions) - skip += page_size - - if len(page_motions) < page_size: - break # Last page - - print(f"Total fetched: {len(all_motions)} motions") - return all_motions - - -if __name__ == "__main__": - print("=== Basic Fetch ===") - example_fetch_motions() - - print("\n=== Process Votes ===") - example_process_votes() diff --git a/.mindmodel/examples/database-example.py b/.mindmodel/examples/database-example.py deleted file mode 100644 index fd21fbc..0000000 --- a/.mindmodel/examples/database-example.py +++ /dev/null @@ -1,191 +0,0 @@ -"""Example: MotionDatabase usage - from database.py and actual codebase.""" - -from typing import Dict, List, Optional -import duckdb -import json -from config import config - -# Import the singleton instance -from database import db - - -# ============================================================================= -# Example 1: Getting filtered motions -# ============================================================================= - - -def example_get_filtered_motions(): - """Get controversial motions from a specific policy area.""" - - motions = db.get_filtered_motions( - policy_area="Klimaat", - min_margin=0.0, - max_margin=0.3, # 
Controversial: close margin - limit=10, - ) - - for motion in motions: - print(f"{motion['title']}: {motion['winning_margin']:.1%} margin") - - return motions - - -# ============================================================================= -# Example 2: Creating a voting session -# ============================================================================= - - -def example_voting_session(): - """Create a new user session and record votes.""" - - # Create session for 10 motions - session_id = db.create_session(total_motions=10) - print(f"Created session: {session_id}") - - # Get motions for the session - motions = db.get_filtered_motions(policy_area="Alle", limit=10) - - # Record votes - for motion in motions: - # In real app, user would choose vote - vote = "Voor" # Example vote - db.record_vote(session_id=session_id, motion_id=motion["id"], vote=vote) - - # Get results - results = db.get_party_results(session_id) - - for party, result in sorted(results.items(), key=lambda x: -x[1]["agreement"]): - print(f"{party}: {result['agreement']:.1%} agreement") - - return results - - -# ============================================================================= -# Example 3: Working with DuckDB connections directly -# ============================================================================= - - -def example_direct_duckdb(): - """Example of proper DuckDB connection handling.""" - - conn = duckdb.connect(config.DATABASE_PATH) - try: - # Get motion with votes - result = conn.execute( - """ - SELECT m.*, - JSON_EXTRACT(voting_results, '$.total_votes') as total_votes - FROM motions m - WHERE m.id = ? 
- """, - (123,), - ).fetchone() - - if result: - print(f"Motion: {result[1]}") # title is index 1 - - return result - finally: - conn.close() - - -# ============================================================================= -# Example 4: Bulk operations -# ============================================================================= - - -def example_bulk_insert(): - """Example of bulk inserting motions.""" - - # Sample data - motions = [ - { - "title": "Motion about climate policy", - "description": "Proposal to reduce emissions", - "date": "2024-01-15", - "policy_area": "Klimaat", - "voting_results": json.dumps({"Voor": 75, "Tegen": 65}), - "winning_margin": 0.07, - "controversy_score": 0.85, - }, - { - "title": "Motion about healthcare", - "description": "Increase healthcare budget", - "date": "2024-01-20", - "policy_area": "Zorg", - "voting_results": json.dumps({"Voor": 90, "Tegen": 50}), - "winning_margin": 0.29, - "controversy_score": 0.42, - }, - ] - - conn = duckdb.connect(config.DATABASE_PATH) - try: - for motion in motions: - conn.execute( - """ - INSERT INTO motions - (title, description, date, policy_area, voting_results, - winning_margin, controversy_score) - VALUES (?, ?, ?, ?, ?, ?, ?) 
- """, - ( - motion["title"], - motion["description"], - motion["date"], - motion["policy_area"], - motion["voting_results"], - motion["winning_margin"], - motion["controversy_score"], - ), - ) - conn.close() - print(f"Inserted {len(motions)} motions") - except Exception as e: - conn.close() - print(f"Error inserting motions: {e}") - - -# ============================================================================= -# Example 5: Query with aggregation -# ============================================================================= - - -def example_aggregation(): - """Example of aggregate queries.""" - - conn = duckdb.connect(config.DATABASE_PATH) - try: - # Get statistics by policy area - results = conn.execute(""" - SELECT - policy_area, - COUNT(*) as motion_count, - AVG(winning_margin) as avg_margin, - AVG(controversy_score) as avg_controversy - FROM motions - WHERE policy_area IS NOT NULL - GROUP BY policy_area - ORDER BY motion_count DESC - """).fetchall() - - for row in results: - print( - f"{row[0]}: {row[1]} motions, " - f"avg margin {row[2]:.1%}, " - f"controversy {row[3]:.2f}" - ) - - conn.close() - return results - except Exception as e: - conn.close() - return [] - - -if __name__ == "__main__": - print("=== Filtered Motions ===") - example_get_filtered_motions() - - print("\n=== Aggregation ===") - example_aggregation() diff --git a/.mindmodel/examples/pattern-examples.md b/.mindmodel/examples/pattern-examples.md deleted file mode 100644 index aecdef6..0000000 --- a/.mindmodel/examples/pattern-examples.md +++ /dev/null @@ -1,116 +0,0 @@ -# Extracted pattern examples (representative snippets) - -Note: snippets are verbatim extracts from repository files (Phase 1). Paths shown. 
- -## DuckDB connect + schema init (database.py) -```python -conn = duckdb.connect(self.db_path) - -# Create sequence for auto-incrementing IDs -try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") -except: - pass - -# Create tables with proper ID handling -conn.execute(""" - CREATE TABLE IF NOT EXISTS motions ( - id INTEGER DEFAULT nextval('motions_id_seq'), - title TEXT NOT NULL, - description TEXT, - date DATE, - policy_area TEXT, - voting_results JSON, - winning_margin FLOAT, - controversy_score FLOAT, - layman_explanation TEXT, - externe_identifier TEXT, - body_text TEXT, - url TEXT UNIQUE, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) - ) -""") -conn.close() -``` - -## Read-only compute worker (svd_pipeline.py) -```python -conn = duckdb.connect(db_path, read_only=True) -try: - rows = conn.execute( - "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", - (start_date, end_date), - ).fetchall() -finally: - conn.close() -``` - -## Requests with retry/backoff (ai_provider.py) -```python -resp = requests.post(url, json=json, headers=headers, timeout=10) -... 
-if getattr(resp, "status_code", 0) == 429: - if attempt == retries: - raise ProviderError(f"Provider returned HTTP {resp.status_code}") - retry_after = None - raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None - if raw: - try: - retry_after = int(raw) - except Exception: - try: - dt = parsedate_to_datetime(raw) - now = datetime.now(tz=dt.tzinfo or timezone.utc) - secs = (dt - now).total_seconds() - retry_after = max(0, int(secs)) - except Exception: - retry_after = None - - if retry_after is not None: - time.sleep(retry_after) - continue -``` - -## Embedding batch + per-item fallback (pipeline/ai_provider_wrapper.py) -```python -for start in range(0, len(texts), batch_size): - chunk = texts[i:end] - emb_chunk, emb_exc = _attempt_batch(chunk, i) - if emb_chunk is not None: - for j, emb in enumerate(emb_chunk): - results[i + j] = emb - i = end - continue - - # batch failed -> fallback to per-item attempts - for j in range(i, end): - t = texts[j] - single, single_exc = _attempt_batch([t], j) - if single: - results[j] = single[0] - continue - results[j] = None -``` - -## Similarity compute (similarity/compute.py) -```python -# Ensure consistent dimensionality: pad shorter vectors with zeros -lengths = [len(v) for v in vecs] -max_dim = max(lengths) -if len(set(lengths)) != 1: - logger.warning( - "Inconsistent vector dimensions detected (max=%d). 
Padding shorter vectors with zeros.", - max_dim, - ) - -matrix = np.zeros((len(vecs), max_dim), dtype=np.float32) -for i, v in enumerate(vecs): - matrix[i, : len(v)] = v - -# Normalize rows and compute cosine similarity -norms = np.linalg.norm(matrix, axis=1, keepdims=True) -norms[norms == 0] = 1.0 -normalized = matrix / norms -sim = normalized @ normalized.T -``` diff --git a/.mindmodel/examples/pipeline-example.py b/.mindmodel/examples/pipeline-example.py deleted file mode 100644 index fafed63..0000000 --- a/.mindmodel/examples/pipeline-example.py +++ /dev/null @@ -1,217 +0,0 @@ -"""Example: Pipeline phase execution - from pipeline/run_pipeline.py and actual codebase.""" - -import argparse -from datetime import date, timedelta -from typing import List, Tuple - -# Import pipeline modules -from pipeline.fetch_mp_metadata import fetch_mp_metadata -from pipeline.extract_mp_votes import extract_mp_votes -from pipeline.svd_pipeline import run_svd_pipeline -from pipeline.text_pipeline import run_text_pipeline -from pipeline.fusion import run_fusion - -from database import MotionDatabase - - -# ============================================================================= -# Example 1: Running full pipeline -# ============================================================================= - - -def example_full_pipeline(): - """Run the complete data ingestion pipeline.""" - - # Parse arguments like CLI would - parser = argparse.ArgumentParser(description="Pipeline runner") - parser.add_argument("--db-path", default="data/motions.db") - parser.add_argument("--start-date", default=None) - parser.add_argument("--end-date", default=None) - parser.add_argument( - "--window-size", choices=["quarterly", "annual"], default="quarterly" - ) - parser.add_argument("--svd-k", type=int, default=50) - - args = parser.parse_args([]) - - # Resolve dates - end_date = date.fromisoformat(args.end_date) if args.end_date else date.today() - start_date = ( - date.fromisoformat(args.start_date) - 
if args.start_date - else end_date - timedelta(days=730) - ) - - print(f"Running pipeline: {start_date} → {end_date}") - print(f"Window size: {args.window_size}") - print(f"DB path: {args.db_path}") - - # Initialize database - db = MotionDatabase(args.db_path) - - # Phase 1: Fetch MP metadata - print("\n=== Phase 1: MP Metadata ===") - n_mp = fetch_mp_metadata(db_path=args.db_path) - print(f"Processed {n_mp} MPs") - - # Phase 2: Extract MP votes - print("\n=== Phase 2: Extract Votes ===") - n_votes = extract_mp_votes(db_path=args.db_path) - print(f"Extracted {n_votes} vote records") - - # Phase 3: Generate time windows - print("\n=== Phase 3: SVD Pipeline ===") - windows = generate_windows(start_date, end_date, args.window_size) - print(f"Generated {len(windows)} windows: {windows}") - - # Phase 4: SVD per window - run_svd_pipeline(db, windows, args.svd_k) - print(f"Computed SVD for {len(windows)} windows") - - # Phase 5: Text embeddings - print("\n=== Phase 4: Text Embeddings ===") - run_text_pipeline(args.db_path, batch_size=50) - print("Text embeddings completed") - - # Phase 6: Fusion - print("\n=== Phase 5: Fusion ===") - run_fusion(args.db_path, windows) - print("Fusion completed") - - print("\n=== Pipeline Complete ===") - - -# ============================================================================= -# Example 2: Generate time windows -# ============================================================================= - - -def generate_windows( - start: date, end: date, granularity: str -) -> List[Tuple[str, str, str]]: - """Generate time windows for pipeline processing.""" - - windows = [] - cursor = date(start.year, start.month, 1) - - if granularity == "annual": - cursor = date(start.year, 1, 1) - while cursor <= end: - year_end = date(cursor.year, 12, 31) - w_end = min(year_end, end) - windows.append((str(cursor.year), cursor.isoformat(), w_end.isoformat())) - cursor = date(cursor.year + 1, 1, 1) - else: - # quarterly - quarter_starts = {1: 1, 2: 4, 
3: 7, 4: 10} - quarter_ends = {1: 3, 2: 6, 3: 9, 4: 12} - - q = (cursor.month - 1) // 3 + 1 - cursor = date(cursor.year, quarter_starts[q], 1) - - while cursor <= end: - q = (cursor.month - 1) // 3 + 1 - import calendar - - q_end_month = quarter_ends[q] - last_day = calendar.monthrange(cursor.year, q_end_month)[1] - q_end = date(cursor.year, q_end_month, last_day) - w_end = min(q_end, end) - window_id = f"{cursor.year}-Q{q}" - windows.append((window_id, cursor.isoformat(), w_end.isoformat())) - cursor = q_end + timedelta(days=1) - - return windows - - -def example_window_generation(): - """Example of window generation.""" - - start = date(2023, 1, 1) - end = date(2024, 6, 30) - - print("Quarterly windows:") - quarterly = generate_windows(start, end, "quarterly") - for wid, s, e in quarterly: - print(f" {wid}: {s} to {e}") - - print("\nAnnual windows:") - annual = generate_windows(start, end, "annual") - for wid, s, e in annual: - print(f" {wid}: {s} to {e}") - - -# ============================================================================= -# Example 3: Running individual phases -# ============================================================================= - - -def example_individual_phases(): - """Run pipeline phases individually for debugging.""" - - db_path = "data/motions.db" - db = MotionDatabase(db_path) - - # Only run MP metadata fetch - print("Fetching MP metadata...") - n = fetch_mp_metadata(db_path=db_path) - print(f" {n} MPs processed") - - # Only run vote extraction - print("Extracting votes...") - n = extract_mp_votes(db_path=db_path) - print(f" {n} votes extracted") - - # Only run SVD for specific window - print("Computing SVD...") - windows = [("2024-Q1", "2024-01-01", "2024-03-31")] - run_svd_pipeline(db, windows, k=50) - print(" SVD computed") - - # Only run text embeddings - print("Computing embeddings...") - run_text_pipeline(db_path, batch_size=25) # Smaller batch for testing - print(" Embeddings computed") - - -# 
============================================================================= -# Example 4: Dry run -# ============================================================================= - - -def example_dry_run(): - """Show what pipeline would do without making changes.""" - - print("DRY RUN - no writes will be made") - - start_date = date(2024, 1, 1) - end_date = date(2024, 6, 30) - - # Generate and show windows - windows = generate_windows(start_date, end_date, "quarterly") - - print(f"Would process {len(windows)} windows:") - for wid, s, e in windows: - print(f" {wid}: {s} to {e}") - - print("\nWould run phases:") - print(" 1. fetch_mp_metadata") - print(" 2. extract_mp_votes") - print(" 3. svd_pipeline") - print(" 4. text_pipeline") - print(" 5. fusion") - - -if __name__ == "__main__": - import logging - - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s %(levelname)s %(name)s: %(message)s", - ) - - print("=== Window Generation ===") - example_window_generation() - - print("\n=== Dry Run ===") - example_dry_run() diff --git a/.mindmodel/examples/streamlit-page-example.py b/.mindmodel/examples/streamlit-page-example.py deleted file mode 100644 index f7502eb..0000000 --- a/.mindmodel/examples/streamlit-page-example.py +++ /dev/null @@ -1,316 +0,0 @@ -"""Example: Streamlit page patterns - from actual pages/ files.""" - -import streamlit as st - - -# ============================================================================= -# Example 1: Home page (Home.py) -# ============================================================================= - - -def render_home_page(): - """Simplified version of Home.py.""" - - st.set_page_config( - page_title="Motief: de stematlas", - page_icon="🗺️", - layout="centered", - initial_sidebar_state="expanded", - ) - - st.title("🗺️ Motief: de stematlas") - st.markdown( - "**Motief** brengt de Nederlandse Tweede Kamer in kaart op basis van " - "echte stemmingen over moties. 
Gebruik de Stemwijzer om te ontdekken welke " - "partij het beste bij jouw standpunten past, of verken de politieke ruimte " - "zelf in de Explorer." - ) - - st.divider() - - col1, col2 = st.columns(2) - - with col1: - st.subheader("🗳️ Stemwijzer") - st.markdown( - "Stem op echte Tweede Kamer moties en zie welke partij het " - "dichtst bij jouw keuzes staat." - ) - st.page_link("pages/1_Stemwijzer.py", label="Open Stemwijzer", icon="🗳️") - - with col2: - st.subheader("🔭 Politiek Explorer") - st.markdown( - "Verken het politieke kompas, partijtrajecten door de tijd, " - "en zoek vergelijkbare moties op in het archief." - ) - st.page_link("pages/2_Explorer.py", label="Open Explorer", icon="🔭") - - st.divider() - st.caption("Data: Tweede Kamer API · Embeddings: QWEN (via OpenRouter)") - - -# ============================================================================= -# Example 2: Thin page wrapper (pages/1_Stemwijzer.py) -# ============================================================================= - - -def render_stemwijzer_page(): - """Pattern: thin page that delegates to module function.""" - - st.set_page_config( - page_title="Stemwijzer", - page_icon="🗳️", - layout="centered", - ) - - # Delegate to main module - from explorer import build_mp_quiz_tab - - build_mp_quiz_tab("data/motions.db") - - -# ============================================================================= -# Example 3: Session state initialization -# ============================================================================= - - -def init_session_state(): - """Pattern: Initialize all session state at start.""" - - defaults = { - "session_id": None, - "current_motion_index": 0, - "motions": [], - "show_results": False, - "user_votes": {}, - } - - for key, default in defaults.items(): - if key not in st.session_state: - st.session_state[key] = default - - -# ============================================================================= -# Example 4: Sidebar 
configuration -# ============================================================================= - - -def render_sidebar(): - """Pattern: Sidebar for configuration.""" - - with st.sidebar: - st.header("Instellingen") - - motion_count = st.slider( - "Aantal moties", - min_value=5, - max_value=25, - value=10, - help="Hoeveel moties wilt u beantwoorden?", - ) - - policy_area = st.selectbox( - "Beleidsgebied", - [ - "Alle", - "Economie", - "Klimaat", - "Immigratie", - "Zorg", - "Onderwijs", - "Defensie", - "Sociale Zaken", - "Algemeen", - ], - ) - - margin_range = st.slider( - "Controversiële moties (%)", - min_value=0, - max_value=100, - value=(0, 100), - help="Filter op hoe omstreden de moties zijn", - ) - - st.divider() - - if st.button("Start Nieuwe Sessie", type="primary"): - return { - "motion_count": motion_count, - "policy_area": policy_area, - "margin_range": margin_range, - } - - return None - - -# ============================================================================= -# Example 5: Motion voting interface -# ============================================================================= - - -def render_motion_vote(motion: dict, index: int, total: int): - """Pattern: Display motion and voting buttons.""" - - st.subheader(f"Motie {index + 1} van {total}") - - # Motion content - st.markdown(f"### {motion['title']}") - - col1, col2 = st.columns([3, 1]) - with col1: - if motion.get("layman_explanation"): - st.info(motion["layman_explanation"]) - - with st.expander("Meer details"): - st.markdown(f"**Datum:** {motion.get('date', 'Onbekend')}") - st.markdown(f"**Beleidsgebied:** {motion.get('policy_area', 'Onbekend')}") - - if motion.get("description"): - st.markdown(f"**Beschrijving:** {motion['description']}") - - with col2: - st.metric( - label="Winstmarge", - value=f"{motion.get('winning_margin', 0):.0%}", - delta="Omstreden" if motion.get("controversy_score", 0) > 0.5 else "Helder", - ) - - st.divider() - - # Voting buttons - col1, col2, col3 = st.columns(3) 
- - with col1: - st.button( - "👍 **Voor**", - on_click=on_vote, - args=(motion["id"], "Voor"), - use_container_width=True, - ) - - with col2: - st.button( - "👎 **Tegen**", - on_click=on_vote, - args=(motion["id"], "Tegen"), - use_container_width=True, - ) - - with col3: - st.button( - "🤔 **Onthouden**", - on_click=on_vote, - args=(motion["id"], "Onthouden"), - use_container_width=True, - ) - - -def on_vote(motion_id: int, vote: str): - """Callback when user votes.""" - - # Record vote - from database import db - - db.record_vote( - session_id=st.session_state.session_id, motion_id=motion_id, vote=vote - ) - - # Update session state - st.session_state.user_votes[motion_id] = vote - - # Move to next or show results - if st.session_state.current_motion_index < len(st.session_state.motions) - 1: - st.session_state.current_motion_index += 1 - else: - st.session_state.show_results = True - - st.rerun() - - -# ============================================================================= -# Example 6: Results display -# ============================================================================= - - -def render_results(): - """Pattern: Display voting results.""" - - from database import db - - st.header("📊 Uw Resultaten") - - # Get party results - results = db.get_party_results(st.session_state.session_id) - - if not results: - st.warning("Geen resultaten beschikbaar") - return - - # Sort by agreement - sorted_results = sorted( - results.items(), key=lambda x: x[1].get("agreement_percentage", 0), reverse=True - ) - - # Display top match - if sorted_results: - top_party, top_data = sorted_results[0] - st.success( - f"**Uw beste match:** {top_party} ({top_data.get('agreement_percentage', 0):.0%} overeenstemming)" - ) - - st.divider() - - # Show all parties - for party, data in sorted_results: - agreement = data.get("agreement_percentage", 0) - - col1, col2 = st.columns([3, 1]) - with col1: - st.markdown(f"**{party}**") - st.progress(agreement, 
text=f"{agreement:.0%}") - - with col2: - st.metric("Overeenstemming", f"{agreement:.0%}") - - # Detailed breakdown - with st.expander("Details per motie"): - for motion in st.session_state.motions: - user_vote = st.session_state.user_votes.get(motion["id"], "?") - st.markdown(f"- **{motion['title']}**: U={user_vote}") - - -# ============================================================================= -# Example 7: Tabs layout -# ============================================================================= - - -def render_tabs_example(): - """Pattern: Use tabs for organizing content.""" - - tab1, tab2, tab3 = st.tabs(["Compass", "Trajectories", "Zoeken"]) - - with tab1: - st.subheader("Politiek Kompas") - st.write("Visualiseer partijposities in 2D ruimte") - # Add compass chart... - - with tab2: - st.subheader("Partij Trajectories") - st.write("Bekijk hoe partijen door de tijd bewegen") - # Add trajectory chart... - - with tab3: - st.subheader("Zoek Moties") - - query = st.text_input("Zoekterm") - if query: - # Search functionality... 
- st.write(f"Zoeken naar: {query}") - - -if __name__ == "__main__": - # Demo rendering - init_session_state() - st.write("Streamlit page structure example") diff --git a/.mindmodel/manifest.yaml b/.mindmodel/manifest.yaml deleted file mode 100644 index eb061e9..0000000 --- a/.mindmodel/manifest.yaml +++ /dev/null @@ -1,108 +0,0 @@ -# stemwijzer Mind Model - Manifest -# Generated: 2026-04-12 -# Phase: 2 - Assembly from Phase 1 Analysis - -name: stemwijzer -version: 2 -description: Dutch political voting compass (Stemwijzer) - Mind Model constraints - -categories: - # Core documentation - - path: system.md - description: System overview and architecture summary - group: docs - - path: stack/stack.md - description: Technology stack with versions and purposes - group: stack - - path: domain/domain-glossary.md - description: Domain entities, terms, relationships, and CRITICAL INVARIANTS - group: domain - - # Design patterns - - path: patterns/patterns.yaml - description: Code patterns (Singleton, Repository, Pipeline, etc.) 
- group: patterns - - path: patterns/streamlit.yaml - description: Streamlit-specific patterns (session state, cache) - group: patterns - - path: patterns/api.yaml - description: API client patterns with retry and pagination - group: patterns - - path: patterns/database.yaml - description: DuckDB patterns and connection management - group: patterns - - path: patterns/python.yaml - description: Python-specific patterns (dataclass, typing) - group: patterns - - path: patterns/duckdb-access.md - description: DuckDB connection patterns and best practices - group: patterns - - path: patterns/embeddings-similarity.md - description: Embeddings and similarity computation patterns - group: patterns - - path: patterns/error-handling.md - description: Error handling and exception patterns - group: patterns - - path: patterns/module-singletons.md - description: Module-level singleton patterns - group: patterns - - path: patterns/requests-http.md - description: HTTP client patterns with retry - group: patterns - - path: patterns/validation.md - description: Input validation patterns - group: patterns - - # Coding constraints - - path: constraints/error-handling.md - description: Error handling patterns with safe fallbacks - group: constraints - - path: constraints/logging.md - description: Logging conventions - group: constraints - - path: constraints/naming.yaml - description: File, class, function naming rules - group: constraints - - path: constraints/imports.yaml - description: Import organization and module structure - group: constraints - - path: constraints/types.yaml - description: Type hint conventions - group: constraints - - path: constraints/testing.yaml - description: Testing conventions - group: constraints - - # Anti-patterns - - path: anti-patterns/anti-patterns.md - description: Known anti-patterns with evidence and fixes - group: anti-patterns - - # Dependencies - - path: dependencies/dependencies.md - description: Library usage and singleton instances - 
group: dependencies - - # Code examples - - path: examples/database-example.py - description: MotionDatabase usage examples - group: examples - - path: examples/api-client-example.py - description: TweedeKamerAPI usage examples - group: examples - - path: examples/pipeline-example.py - description: Pipeline orchestration examples - group: examples - - path: examples/streamlit-page-example.py - description: Streamlit page patterns - group: examples - - path: examples/pattern-examples.md - description: Consolidated pattern examples - group: examples - -# Phase 1 findings summary: -# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN) -# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline -# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except -# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons -# - 3 groups: stdlib, 3rd party, local imports diff --git a/.mindmodel/patterns/api.yaml b/.mindmodel/patterns/api.yaml deleted file mode 100644 index 310c193..0000000 --- a/.mindmodel/patterns/api.yaml +++ /dev/null @@ -1,265 +0,0 @@ -# API Client Patterns - -## Base API Client Pattern - -Using requests.Session for connection pooling: - -```python -# api_client.py -import requests -from typing import Dict, List, Optional -from config import config - -class TweedeKamerAPI: - def __init__(self): - self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0" - self.session = requests.Session() - self.session.headers.update({ - "Accept": "application/json", - "User-Agent": "Dutch-Political-Compass-Tool/1.0", - }) - - def get_motions( - self, - start_date: datetime = None, - end_date: datetime = None, - limit: int = 500, - ) -> List[Dict]: - """Get motions with voting results using OData API.""" - if not start_date: - start_date = datetime.now() - timedelta(days=730) - - try: - voting_records, besluit_meta = self._get_voting_records( - start_date, end_date, 
limit - ) - return self._process_voting_records(voting_records, besluit_meta) - except Exception as e: - print(f"Error fetching motions from API: {e}") - return [] -``` - -## OData Pagination Pattern - -Handle server-side pagination with $skip: - -```python -def _get_voting_records( - self, - start_date: datetime, - end_date: datetime = None, - limit: int = 50000 -) -> tuple: - """Fetch with automatic pagination.""" - - filter_query = ( - f"GewijzigdOp ge {start_date.strftime('%Y-%m-%d')}T00:00:00Z" - " and StemmingsSoort ne null" - " and Verwijderd eq false" - ) - - page_size = 250 # API caps $top at 250 - base_url = f"{self.odata_base_url}/Besluit" - base_params = { - "$filter": filter_query, - "$top": page_size, - "$expand": "Stemming", - "$orderby": "GewijzigdOp desc", - } - - all_records = [] - skip = 0 - - while len(all_records) < limit: - params = {**base_params, "$skip": skip} - response = self.session.get( - base_url, - params=params, - timeout=config.API_TIMEOUT - ) - response.raise_for_status() - data = response.json() - - besluit_page = data.get("value", []) - if not besluit_page: - break - - # Process page - for besluit in besluit_page: - all_records.extend(self._extract_votes(besluit)) - - skip += page_size - - return all_records -``` - -## Retry with Backoff Pattern - -For transient failures: - -```python -# ai_provider.py -import time -import random -from requests.exceptions import ConnectionError - -def _post_with_retries( - path: str, - json: dict, - retries: int = 3 -) -> requests.Response: - """POST with exponential backoff retry.""" - - backoff = 0.5 - for attempt in range(1, retries + 1): - try: - resp = requests.post(url, json=json, headers=headers, timeout=10) - - # Handle rate limiting - if resp.status_code == 429: - if attempt == retries: - raise ProviderError("Rate limited") - - retry_after = resp.headers.get("Retry-After") - if retry_after: - time.sleep(int(retry_after)) - else: - sleep = backoff * (2 ** (attempt - 1)) - sleep += 
random.uniform(0, sleep * 0.1) - time.sleep(sleep) - continue - - # Handle server errors - if 500 <= resp.status_code < 600: - if attempt == retries: - raise ProviderError(f"Server error: {resp.status_code}") - time.sleep(backoff * (2 ** (attempt - 1))) - continue - - return resp - - except ConnectionError as exc: - if attempt == retries: - raise ProviderError(f"Connection error: {exc}") - time.sleep(backoff * (2 ** (attempt - 1))) - - raise ProviderError("Failed after retries") -``` - -## Batch Processing Pattern - -Process items in batches to manage API limits: - -```python -def get_embeddings_with_retry( - texts: List[str], - batch_size: int = 50, - retries: int = 3, -) -> List[Optional[List[float]]]: - """Process embeddings in batches with fallback to single items.""" - - results = [None] * len(texts) - - i = 0 - while i < len(texts): - end = min(len(texts), i + batch_size) - chunk = texts[i:end] - - # Try batch first - try: - emb_chunk = get_embeddings_batch(chunk) - for j, emb in enumerate(emb_chunk): - results[i + j] = emb - i = end - continue - except Exception: - pass - - # Fallback: single items - for j, text in enumerate(chunk): - try: - results[i + j] = get_embedding(text) - except Exception: - results[i + j] = None - - i = end - - return results -``` - -## Response Validation Pattern - -Validate API responses before processing: - -```python -def _process_response(self, response: requests.Response) -> Dict: - """Validate and parse API response.""" - - response.raise_for_status() - data = response.json() - - if "value" not in data: - raise ValueError("Unexpected response format: missing 'value' key") - - return data - -def _validate_besluit(self, besluit: Dict) -> bool: - """Check required fields exist.""" - required = ["Id", "GewijzigdOp"] - return all(field in besluit for field in required) -``` - -## Error Handling Patterns - -Always provide safe fallbacks: - -```python -def safe_api_call(self, endpoint: str, params: Dict = None) -> List[Dict]: - 
"""Call API with error handling and fallback.""" - try: - response = self.session.get( - endpoint, - params=params, - timeout=config.API_TIMEOUT - ) - response.raise_for_status() - data = response.json() - return data.get("value", []) - except requests.Timeout: - _logger.warning(f"API timeout for {endpoint}") - return [] - except requests.HTTPError as e: - _logger.error(f"HTTP error: {e}") - return [] - except Exception as e: - _logger.error(f"API call failed: {e}") - return [] -``` - -## Session Management - -Reuse session for connection pooling: - -```python -class TweedeKamerAPI: - def __init__(self): - self.session = requests.Session() - self.session.headers.update({ - "Accept": "application/json", - "User-Agent": "Dutch-Political-Compass-Tool/1.0", - }) - - def close(self): - """Clean up session when done.""" - self.session.close() - - def __enter__(self): - return self - - def __exit__(self, *args): - self.close() - -# Usage -with TweedeKamerAPI() as api: - motions = api.get_motions(start_date) -``` diff --git a/.mindmodel/patterns/architecture.yaml b/.mindmodel/patterns/architecture.yaml deleted file mode 100644 index bda8a1e..0000000 --- a/.mindmodel/patterns/architecture.yaml +++ /dev/null @@ -1,230 +0,0 @@ -# Architectural Patterns - -## Repository Pattern - -The `MotionDatabase` class acts as a repository, encapsulating all database operations behind a clean interface. 
- -```python -# database.py -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._init_database() - - def get_motion(self, motion_id: int) -> Optional[Dict]: - """Get a single motion by ID.""" - conn = duckdb.connect(self.db_path) - try: - result = conn.execute( - "SELECT * FROM motions WHERE id = ?", (motion_id,) - ).fetchone() - return result - finally: - conn.close() - - def get_filtered_motions( - self, - policy_area: str = "Alle", - min_margin: float = 0.0, - max_margin: float = 1.0, - limit: int = 10 - ) -> List[Dict]: - """Get filtered list of motions.""" - ... -``` - -**Usage**: Import the singleton instance for all DB operations. -```python -from database import db - -motions = db.get_filtered_motions(policy_area="Klimaat", limit=20) -``` - -## Facade Pattern - -Simplified interfaces over complex subsystems. - -### MotionDatabase Facade -```python -# Single entry point for all database operations -db = MotionDatabase() # Singleton instance - -# Operations are abstracted: -db.create_session(total_motions) -db.record_vote(session_id, motion_id, vote) -db.get_party_results(session_id) -``` - -### API Client Facade -```python -# api_client.py -class TweedeKamerAPI: - def __init__(self): - self.session = requests.Session() # Connection pooling - - def get_motions(self, start_date, end_date) -> List[Dict]: - """Simple interface hiding OData pagination details.""" - voting_records, besluit_meta = self._get_voting_records(start_date, end_date) - return self._process_voting_records(voting_records, besluit_meta) -``` - -### MotionScraper Facade -```python -# scraper.py (if used) -class MotionScraper: - def get_motion_content(self, url: str) -> Optional[str]: - """Extract body text from official website.""" - ... 
-``` - -## Pipeline Pattern - -Sequential phases with explicit dependencies: - -``` -pipeline/run_pipeline.py -├── Phase 1: fetch_mp_metadata -│ └── pipeline/fetch_mp_metadata.py -├── Phase 2: extract_mp_votes -│ └── pipeline/extract_mp_votes.py -├── Phase 3: svd_pipeline -│ └── pipeline/svd_pipeline.py -├── Phase 4: text_pipeline (gap-fill) -│ └── pipeline/text_pipeline.py -└── Phase 5: fusion (combine SVD + text) - └── pipeline/fusion.py -``` - -### Phase Orchestration -```python -# pipeline/run_pipeline.py -def run(args: argparse.Namespace) -> int: - db = MotionDatabase(args.db_path) - - # Phase 1: MP metadata - if not args.skip_metadata: - from pipeline.fetch_mp_metadata import fetch_mp_metadata - fetch_mp_metadata(db_path=db.db_path) - - # Phase 2: Extract votes - if not args.skip_extract: - from pipeline.extract_mp_votes import extract_mp_votes - extract_mp_votes(db_path=db.db_path) - - # Phase 3: SVD per window - if not args.skip_svd: - from pipeline.svd_pipeline import run_svd_pipeline - run_svd_pipeline(db, windows, args.svd_k) - - # ... additional phases -``` - -## Strategy Pattern - -Interchangeable algorithms for axis computation: - -```python -# analysis/political_axis.py -def compute_political_axis( - vectors: Dict[str, np.ndarray], - method: str = "pca" # or "anchor" -) -> Tuple[np.ndarray, np.ndarray]: - """Compute political axis using specified method. - - Methods: - - 'pca': Use first principal component - - 'anchor': Use predefined anchor motions - """ - if method == "pca": - return _compute_pca_axis(vectors) - elif method == "anchor": - return _compute_anchor_axis(vectors) -``` - -## Visitor Pattern - -External operations on data structures: - -```python -# analysis/trajectory.py -def _procrustes_align_windows( - window_vecs: Dict[str, Dict[str, np.ndarray]], - min_overlap: int = 5, -) -> Dict[str, Dict[str, np.ndarray]]: - """Align SVD vectors across windows using Procrustes rotations. 
- - Takes the first window as reference and aligns each subsequent window - to it via orthogonal Procrustes on the set of common entities. - """ -``` - -## Builder Pattern - -Configuration via method chaining: - -```python -# CLI argument parsing -parser = argparse.ArgumentParser(description="Pipeline runner") -parser.add_argument("--db-path", default="data/motions.db") -parser.add_argument("--start-date", default=None) -parser.add_argument("--end-date", default=None) -parser.add_argument("--window-size", choices=["quarterly", "annual"], default="quarterly") -parser.add_argument("--svd-k", type=int, default=50) -``` - -## Decorator Pattern - -Retry logic for transient failures: - -```python -# pipeline/ai_provider_wrapper.py -def get_embeddings_with_retry( - texts: List[str], - retries: int = 3, - batch_size: int = 50, -) -> List[Optional[List[float]]]: - """Return embeddings with automatic retry on failure.""" - for attempt in range(1, retries + 1): - try: - return _embedder(texts, batch_size=len(texts)) - except Exception as exc: - if attempt == retries: - break - time.sleep(backoff * (2 ** (attempt - 1))) - return [None] * len(texts) # Safe fallback -``` - -## Data Patterns - -### Batch Processing -Process items in chunks to manage memory and API limits: -```python -for i in range(0, len(items), batch_size): - chunk = items[i:i + batch_size] - process_batch(chunk) -``` - -### Caching -Pre-compute and store expensive results: -```python -# SimilarityCache table stores computed similarities -db.get_similarity(motion_a, motion_b) -``` - -### Lazy Loading -Load data only when needed: -```python -class MotionDatabase: - @property - def _connection(self): - if self._conn is None: - self._conn = duckdb.connect(self.db_path) - return self._conn -``` - -### Vectorization -Use numpy for batch operations: -```python -vectors = np.array([v for v in entity_vectors.values()]) -normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True) -``` diff --git 
a/.mindmodel/patterns/database.yaml b/.mindmodel/patterns/database.yaml deleted file mode 100644 index b221ec4..0000000 --- a/.mindmodel/patterns/database.yaml +++ /dev/null @@ -1,239 +0,0 @@ -# DuckDB Database Patterns - -## Connection Management - -### Pattern 1: Short-lived per Method (Most Common) - -Always create a new connection, use try/finally for cleanup: - -```python -# database.py -class MotionDatabase: - def get_motion(self, motion_id: int) -> Optional[Dict]: - conn = duckdb.connect(self.db_path) - try: - result = conn.execute( - "SELECT * FROM motions WHERE id = ?", - (motion_id,) - ).fetchone() - conn.close() - return result - except Exception: - conn.close() - return None - - def get_filtered_motions( - self, - policy_area: str = "Alle", - min_margin: float = 0.0, - max_margin: float = 1.0, - limit: int = 10 - ) -> List[Dict]: - conn = duckdb.connect(self.db_path) - try: - query = """ - SELECT * FROM motions - WHERE (? = 'Alle' OR policy_area = ?) - AND winning_margin BETWEEN ? AND ? - ORDER BY RANDOM() - LIMIT ? 
- """ - rows = conn.execute(query, (policy_area, policy_area, min_margin, max_margin, limit)).fetchall() - conn.close() - return rows - except Exception: - conn.close() - return [] -``` - -### Pattern 2: With Statement (Cleaner) - -```python -def execute_query(self, query: str, params: tuple = ()): - with duckdb.connect(self.db_path) as conn: - return conn.execute(query, params).fetchall() -``` - -### Pattern 3: Lazy Connection Caching - -For frequently accessed connections: - -```python -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._conn = None - - @property - def connection(self): - if self._conn is None: - self._conn = duckdb.connect(self.db_path) - return self._conn - - def close(self): - if self._conn: - self._conn.close() - self._conn = None -``` - -## Table Initialization - -Create tables with proper constraints and sequences: - -```python -def _init_database(self): - conn = duckdb.connect(self.db_path) - - # Create sequence for auto-incrementing IDs - try: - conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") - except: - pass - - # Create tables - conn.execute(""" - CREATE TABLE IF NOT EXISTS motions ( - id INTEGER DEFAULT nextval('motions_id_seq'), - title TEXT NOT NULL, - description TEXT, - date DATE, - policy_area TEXT, - voting_results JSON, - winning_margin FLOAT, - controversy_score FLOAT, - layman_explanation TEXT, - externe_identifier TEXT, - body_text TEXT, - url TEXT UNIQUE, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) - ) - """) - - # Add columns to existing tables safely - try: - conn.execute("ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text TEXT") - except Exception: - pass # Column may already exist - - conn.close() -``` - -## JSON Column Handling - -Store and retrieve JSON data: - -```python -# Insert JSON -def store_motion(self, motion: Dict): - conn = duckdb.connect(self.db_path) - try: - conn.execute( - "INSERT INTO motions 
(title, voting_results) VALUES (?, ?)", - (motion["title"], json.dumps(motion["voting_results"])) - ) - conn.close() - except Exception: - conn.close() - -# Query JSON -def get_motions_with_votes(self, party: str) -> List[Dict]: - conn = duckdb.connect(self.db_path) - try: - rows = conn.execute(""" - SELECT title, voting_results - FROM motions - WHERE JSON_EXTRACT(voting_results, '$.party') = ? - """, (party,)).fetchall() - conn.close() - return rows - except Exception: - conn.close() - return [] -``` - -## Query Patterns - -### Parameterized Queries (Always!) -```python -# SAFE - uses parameterized query -conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)) - -# AVOID - SQL injection risk -# conn.execute(f"SELECT * FROM motions WHERE id = {motion_id}") # BAD! -``` - -### Batch Inserts -```python -def bulk_insert_motions(self, motions: List[Dict]): - conn = duckdb.connect(self.db_path) - try: - for motion in motions: - conn.execute( - """INSERT OR IGNORE INTO motions - (title, date, policy_area) VALUES (?, ?, ?)""", - (motion["title"], motion["date"], motion["policy_area"]) - ) - conn.close() - except Exception: - conn.close() -``` - -### Aggregation Queries -```python -def get_party_vote_stats(self, party: str) -> Dict: - conn = duckdb.connect(self.db_path) - try: - result = conn.execute(""" - SELECT - COUNT(*) as total_votes, - SUM(CASE WHEN vote = 'Voor' THEN 1 ELSE 0 END) as voor, - SUM(CASE WHEN vote = 'Tegen' THEN 1 ELSE 0 END) as tegen - FROM mp_votes - WHERE party = ? 
- """, (party,)).fetchone() - conn.close() - return {"total": result[0], "voor": result[1], "tegen": result[2]} - except Exception: - conn.close() - return {"total": 0, "voor": 0, "tegen": 0} -``` - -## Error Handling - -Always close connections in finally block or with context manager: - -```python -def safe_query(self, query: str, params: tuple = ()): - conn = None - try: - conn = duckdb.connect(self.db_path) - result = conn.execute(query, params).fetchall() - return result - except Exception as e: - _logger.error(f"Query failed: {e}") - return [] - finally: - if conn: - conn.close() -``` - -## Testing with Mock - -For unit tests without DuckDB: - -```python -# In MotionDatabase.__init__ -def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._file_mode = duckdb is None - - if duckdb is None: - # Create JSON fallback files - for p in (f"{db_path}.embeddings.json", f"{db_path}.similarity_cache.json"): - if not os.path.exists(p): - with open(p, "w") as fh: - fh.write("[]") - else: - self._init_database() -``` diff --git a/.mindmodel/patterns/duckdb-access.md b/.mindmodel/patterns/duckdb-access.md deleted file mode 100644 index ec00d89..0000000 --- a/.mindmodel/patterns/duckdb-access.md +++ /dev/null @@ -1,79 +0,0 @@ ---- -title: DuckDB Access Pattern -category: patterns ---- -# DuckDB Access Pattern - -## Rules - -- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. -- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. -- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. -- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). - -## Examples - -### database.py - Explicit connect/close for schema init - -```python -conn = duckdb.connect(self.db_path) -... 
-conn.execute(""" - CREATE TABLE IF NOT EXISTS fused_embeddings ( - id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), - motion_id INTEGER NOT NULL, - window_id TEXT NOT NULL, - vector JSON NOT NULL, - svd_dims INTEGER NOT NULL, - text_dims INTEGER NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (id) - ) -""") -conn.close() -``` - -### pipeline/svd_pipeline.py - Read-only connection - -```python -conn = duckdb.connect(db_path, read_only=True) -try: - rows = conn.execute( - "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", - (start_date, end_date), - ).fetchall() -finally: - conn.close() -``` - -### similarity/compute.py - Preferred 'with' context - -```python -try: - import duckdb -except Exception: - logger.exception("duckdb import failed; cannot load vectors") - return 0 - -with duckdb.connect(db.db_path) as conn: - rows = conn.execute(query, params).fetchall() -``` - -## Anti-Patterns - -### Bad: Connection without closure - -```python -# BAD: connection may leak if exception occurs before explicit close -conn = duckdb.connect(db_path) -rows = conn.execute("SELECT ...").fetchall() -# missing finally/close -``` - -**Remediation**: Use "with" context or ensure conn.close() in finally block. - -### Bad: Parallel write connections - -**Problem**: Opening write connections from many parallel workers without coordination. - -**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. diff --git a/.mindmodel/patterns/embeddings-similarity.md b/.mindmodel/patterns/embeddings-similarity.md deleted file mode 100644 index 5b41d32..0000000 --- a/.mindmodel/patterns/embeddings-similarity.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -title: Embeddings Similarity Pipeline -category: patterns ---- -# Embeddings Similarity Pipeline - -## Rules - -- Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure. 
-- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. -- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. -- Use read_only DuckDB connections in compute workers to allow parallel runs. - -## Examples - -### pipeline/ai_provider_wrapper.py - Batched embed + fallback - -```python -for start in range(0, len(texts), batch_size): - chunk = texts[start : start + batch_size] - resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) -... -for j in range(i, end): - t = texts[j] - single, single_exc = _attempt_batch([t], j) - if single: - results[j] = single[0] -``` - -### pipeline/fusion.py - Concatenation and storage - -```python -try: - svd_vec = json.loads(svd_json) -except Exception: - _logger.exception("Invalid SVD vector JSON for entity %s", entity_id) - skipped_missing_svd += 1 - continue -... -fused = list(svd_vec) + list(text_vec) -res = db.store_fused_embedding( - int(entity_id), - window_id, - fused, - svd_dims=len(svd_vec), - text_dims=len(text_vec), -) -``` - -### similarity/compute.py - Normalized cosine similarity - -```python -# Normalize rows -norms = np.linalg.norm(matrix, axis=1, keepdims=True) -norms[norms == 0] = 1.0 -normalized = matrix / norms -sim = normalized @ normalized.T -... -# pick top-k neighbors and write to similarity_cache -``` - -## Anti-Patterns - -### Bad: Assuming consistent vector length - -**Problem**: Assuming consistent vector length without checks leads to shape errors. - -**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). - -### Bad: Inline heavy computation in UI - -**Problem**: Recomputing heavy pipelines inline in UI requests. - -**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI. 
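Note on the deleted embeddings-similarity.md above: its rules (pad vectors of unequal length with zeros, normalize rows, guard against zero norms, keep the top-k neighbors) can be sketched as follows. This is a minimal illustration, not the code from similarity/compute.py. That file uses NumPy, whereas this sketch uses only the standard library, and the names `pad`, `normalize`, and `top_k_neighbors` are hypothetical.

```python
import math


def pad(vectors):
    """Right-pad every vector with zeros to the longest length (hypothetical helper)."""
    width = max((len(v) for v in vectors), default=0)
    return [list(v) + [0.0] * (width - len(v)) for v in vectors]


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        norm = 1.0  # same guard as `norms[norms == 0] = 1.0` in the NumPy version
    return [x / norm for x in vec]


def top_k_neighbors(vectors, k=2):
    """Return, for each row, the k most similar other rows as (index, cosine) pairs."""
    rows = [normalize(v) for v in pad(vectors)]
    result = []
    for i, a in enumerate(rows):
        scored = [
            (j, sum(x * y for x, y in zip(a, b)))
            for j, b in enumerate(rows)
            if j != i  # a row is never its own neighbor
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result.append(scored[:k])
    return result
```

Because the rows are normalized first, the dot product is the cosine similarity, so no further division is needed when ranking neighbors.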
diff --git a/.mindmodel/patterns/error-handling.md b/.mindmodel/patterns/error-handling.md deleted file mode 100644 index f0e5881..0000000 --- a/.mindmodel/patterns/error-handling.md +++ /dev/null @@ -1,63 +0,0 @@ ---- -title: Error Handling Pattern -category: patterns ---- -# Error Handling Pattern - -## Rules - -- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError). -- Prefer logging.exception when catching an exception where stack trace is useful. -- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context. -- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented. - -## Examples - -### ai_provider.py - Network error to ProviderError - -```python -except requests.ConnectionError as exc: - if attempt == retries: - raise ProviderError( - f"Connection error when calling provider: {exc}" - ) from exc - ... -``` - -### pipeline/ai_provider_wrapper.py - Best-effort with logging - -```python -except Exception: - _logger.exception("Failed to append audit event for embedding failure") -results[j] = None -``` - -### similarity/compute.py - Defensive import handling - -```python -try: - import duckdb -except Exception: - logger.exception("duckdb import failed; cannot load vectors") - return 0 -``` - -## Anti-Patterns - -### Bad: Silent exception swallowing - -```python -try: - do_work() -except Exception: - return [] -# BAD: hides the root cause and returns an ambiguous default -``` - -**Remediation**: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled. - -### Bad: Mixing print() and logging - -**Problem**: Mixing print() and logging for errors. - -**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration.
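Note on the deleted error-handling.md above: the remediation it recommends (catch a narrow exception type, log it, and convert it to a domain error with `raise ... from`) might look like the sketch below. The `ProviderError` class here is a hypothetical stand-in, since the real definition in ai_provider.py is not part of this patch. The built-in `ConnectionError` stands in for `requests.ConnectionError`, and `call_with_retries` is an illustrative name.

```python
import logging

_logger = logging.getLogger(__name__)


class ProviderError(Exception):
    """Hypothetical stand-in for the domain error named in the pattern doc."""


def call_with_retries(fetch, retries=3):
    """Call fetch(); retry on ConnectionError, then raise a typed ProviderError."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except ConnectionError as exc:  # narrow: anything else propagates untouched
            if attempt == retries:
                # Chain the original exception so the root cause stays visible.
                raise ProviderError(
                    f"Connection error when calling provider: {exc}"
                ) from exc
            _logger.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
```

Unlike the "return []" anti-pattern, a caller can tell a provider outage apart from an empty result, and `exc.__cause__` preserves the original traceback.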
diff --git a/.mindmodel/patterns/module-singletons.md b/.mindmodel/patterns/module-singletons.md deleted file mode 100644 index f6c80be..0000000 --- a/.mindmodel/patterns/module-singletons.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -title: Module Singletons Pattern -category: patterns ---- -# Module Singletons Pattern - -## Rules - -- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: - - Avoid expensive initialization at import time. - - Provide a way to construct with a test DB path or to reinitialize in tests. -- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit. - -## Examples - -### database.py - Safe class initialization - -```python -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - # If duckdb is not available, operate in lightweight file-backed mode - self._file_mode = duckdb is None - self._init_database() -``` - -### similarity/lookup.py - Local instances - -```python -db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase() -if hasattr(db, "get_cached_similarities"): - rows = db.get_cached_similarities(...) -``` - -## Anti-Patterns - -### Bad: Heavy initialization at import time - -**Problem**: Creating connections and performing heavy schema migrations during import. - -**Remediation**: Move heavy init to an explicit initialize() method and keep import fast. diff --git a/.mindmodel/patterns/patterns.yaml b/.mindmodel/patterns/patterns.yaml deleted file mode 100644 index 3c57d0a..0000000 --- a/.mindmodel/patterns/patterns.yaml +++ /dev/null @@ -1,228 +0,0 @@ -# Code Patterns - -## 1. Page Wrapper Pattern -Thin Streamlit page files delegate to core modules. Pages contain only route logic, not business logic. - -**Example** (pages/1_🗳️_Stemwijzer.py): -```python -import streamlit as st -from quiz_module import render_quiz_page - -st.set_page_config(...)
-render_quiz_page() -``` - -**Example** (pages/2_🔍_Explorer.py): -```python -import streamlit as st -from explorer import render_explorer - -st.set_page_config(...) -render_explorer() -``` - -**Rule**: Pages should have <20 lines of logic. All complexity lives in modules. - ---- - -## 2. Pipeline Pattern -Data flows: fetch → transform → store - -**Location**: `pipeline/` directory - -**Pattern**: -```python -def run_pipeline(): - raw_data = fetch_from_source() - transformed = transform(raw_data) - store(transformed) - -def fetch_from_source(): - # API call or DB query - ... - -def transform(raw): - # Clean, normalize, compute derived fields - ... -``` - -**Usage**: SVD computation pipeline, data ingestion, motion processing - ---- - -## 3. API Client Pattern -HTTP client with retry/backoff for external data sources. - -**Pattern**: -```python -import time -import requests - -def fetch_with_retry(url, max_retries=3): - for attempt in range(max_retries): - try: - response = requests.get(url) - response.raise_for_status() - return response.json() - except requests.RequestException: - if attempt < max_retries - 1: - time.sleep(2 ** attempt) # exponential backoff - else: - raise -``` - ---- - -## 4. Pure Helper Functions -Functions in `explorer_helpers.py` have no side effects, no IO. - -**Pattern**: -```python -def compute_party_coords(svd_df, party_map, window): - """Pure function: same inputs → same outputs, no side effects.""" - # Filter, compute, return - return result_df - -def build_scatter_trace(df, color_col, marker_size=8): - """Pure: returns Plotly trace dict, no rendering.""" - trace = go.Scatter(x=df.x, y=df.y, mode='markers', ...) - return trace -``` - -**Rule**: No `import streamlit` in helper modules. No file I/O. No global state. - ---- - -## 5. Dummy Fallbacks for Optional Dependencies -Gracefully degrade when optional packages are unavailable.
- -**Pattern**: -```python -try: - import umap - HAS_UMAP = True -except ImportError: - HAS_UMAP = False - # or provide dummy stub - -def project_to_2d(vectors): - if HAS_UMAP: - return umap.UMAP().fit_transform(vectors) - else: - return vectors[:, :2] # fallback: just take first 2 dims -``` - -**Used for**: UMAP, Plotly (with fallback to altair or text-only) - ---- - -## 6. Cached Data Loaders -Expensive DB queries wrapped with `@st.cache_data`. - -**Pattern**: -```python -@st.cache_data -def load_svd_vectors(window: str) -> pd.DataFrame: - return db.query("SELECT * FROM svd_vectors WHERE window = ?", window) - -@st.cache_data -def load_party_centroids(window: str) -> pd.DataFrame: - return db.query("SELECT * FROM party_centroids WHERE window = ?", window) - -# Clear cache when data updates -@st.cache_data -def load_motions(category: str | None = None) -> pd.DataFrame: - ... -``` - -**Rule**: Use `ttl=3600` for large datasets. Use `show_spinner=False` where appropriate. - ---- - -## 7. Plotly Dual-Layer Charts -Charts built with two traces: scatter points + text annotations. - -**Pattern**: -```python -def build_dual_layer_chart(df, x_col, y_col, label_col): - # Layer 1: markers - scatter = go.Scatter( - x=df[x_col], y=df[y_col], - mode='markers', - marker=dict(size=10, color=df['color']), - name='Parties' - ) - # Layer 2: labels (smaller, non-hoverable) - labels = go.Scatter( - x=df[x_col], y=df[y_col], - mode='text', - text=df[label_col], - textposition='top center', - showlegend=False - ) - return [scatter, labels] -``` - -**Used in**: Explorer tab charts, party position plots - ---- - -## 8. Singleton Module Instances -One shared instance per module, created at import time. 
- -**Pattern**: -```python -# database.py -class MotionDatabase: - def __init__(self, db_path=None): - self.conn = ibis.duckdb.connect(db_path) - self._load_schema() - -_db = None -def get_db(): - global _db - if _db is None: - _db = MotionDatabase() - return _db - -# At module bottom: -db = MotionDatabase() # singleton instance -``` - -**Also used in**: `config.py` exports `config` and `PARTY_COLOURS` - ---- - -## 9. Dataclass Config Pattern -Configuration centralized in a `@dataclass`. - -**Pattern**: -```python -from dataclasses import dataclass, field - -@dataclass -class Config: - db_path: str = "data/stemwijzer.duckdb" - default_window: str = "2023" - cache_ttl: int = 3600 - party_colours: dict = field(default_factory=lambda: PARTY_COLOURS) - - def __post_init__(self): - if not Path(self.db_path).exists(): - raise FileNotFoundError(f"Database not found: {self.db_path}") -``` - ---- - -## 10. Graceful Degradation with try/except -Core pattern throughout: attempt operation, fall back gracefully. - -**Pattern**: -```python -def get_political_position(mp_name, window): - try: - vectors = load_svd_vectors(window) - return vectors[vectors['mp_name'] == mp_name]['vector_2d'].iloc[0] - except (KeyError, IndexError): - return [0.0, 0.0] # neutral fallback -``` diff --git a/.mindmodel/patterns/python.yaml b/.mindmodel/patterns/python.yaml deleted file mode 100644 index 8b8b027..0000000 --- a/.mindmodel/patterns/python.yaml +++ /dev/null @@ -1,196 +0,0 @@ -# Python-Specific Patterns - -## Singleton Pattern - -Use module-level instances for shared resources: - -```python -# database.py -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._init_database() - - def _init_database(self): - # Initialize tables on first instantiation - ... 
- -# Bottom of file - the singleton -db = MotionDatabase() -``` - -**Usage across the codebase:** -```python -# In other modules -from database import db - -def some_function(): - motions = db.get_filtered_motions(limit=10) - return motions -``` - -Similarly for other singletons: -```python -# summarizer.py -class MotionSummarizer: - def __init__(self): - pass # Stateless - - def generate_layman_explanation(self, title: str, body: str) -> str: - ... - -summarizer = MotionSummarizer() -``` - -## Dataclass Config Pattern - -Use dataclass for configuration with environment variable support: - -```python -# config.py -from dataclasses import dataclass -from typing import List -import os - -@dataclass -class Config: - # Database settings - DATABASE_PATH = "data/motions.db" - - # API settings - TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0" - API_TIMEOUT = 30 - API_BATCH_SIZE = 250 - - # AI settings - OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") - OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1" - QWEN_MODEL = "qwen/qwen-2.5-72b-instruct" - - # App settings - DEFAULT_MOTION_COUNT = 10 - SESSION_TIMEOUT_DAYS = 30 - - # Policy areas - POLICY_AREAS: List[str] = None - def __post_init__(self): - self.POLICY_AREAS = [ - "Alle", "Economie", "Klimaat", "Immigratie", - "Zorg", "Onderwijs", "Defensie", "Sociale Zaken", "Algemeen" - ] - -config = Config() -``` - -**Usage:** -```python -from config import config - -# Access as attributes -timeout = config.API_TIMEOUT -areas = config.POLICY_AREAS -``` - -## DuckDB Connection Pattern - -Short-lived connections with explicit cleanup: - -```python -class MotionDatabase: - def get_motion(self, motion_id: int) -> Optional[Dict]: - conn = duckdb.connect(self.db_path) - try: - result = conn.execute( - "SELECT * FROM motions WHERE id = ?", - (motion_id,) - ).fetchone() - return result - finally: - conn.close() - - def get_filtered_motions(self, **kwargs) -> List[Dict]: - conn = 
duckdb.connect(self.db_path) - try: - rows = conn.execute(query, params).fetchall() - return rows - except Exception: - return [] # Safe fallback - finally: - conn.close() -``` - -**Context manager alternative (preferred when applicable):** -```python -def some_operation(self): - with duckdb.connect(self.db_path) as conn: - result = conn.execute("SELECT ...").fetchall() - return result -``` - -## Try/Except with Fallback Pattern - -Always provide safe fallbacks: - -```python -def get_motion_or_default(self, motion_id: int) -> Dict: - try: - conn = duckdb.connect(self.db_path) - result = conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)).fetchone() - conn.close() - return result if result else {} - except Exception: - return {} -``` - -## Optional Import Pattern - -Handle optional dependencies gracefully: - -```python -try: - import duckdb -except Exception: # pragma: no cover - duckdb = None - -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self._file_mode = duckdb is None - ... -``` - -## Property Pattern - -Lazy initialization of expensive resources: - -```python -class MotionDatabase: - def __init__(self, db_path: str = config.DATABASE_PATH): - self.db_path = db_path - self._session_cache = None - - @property - def session(self): - """Lazy-load expensive resources.""" - if self._session_cache is None: - self._session_cache = self._create_session() - return self._session_cache -``` - -## Type Annotation Patterns - -```python -from typing import Dict, List, Optional, Tuple, Any - -# Optional with None default -def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]: - ... - -# Multiple return types -def parse_vote(self, vote_str: str) -> Tuple[bool, str]: - """Returns (success, error_message)""" - ... - -# Generic types -def get_batch(self, ids: List[int]) -> Dict[str, Any]: - ... 
-``` diff --git a/.mindmodel/patterns/requests-http.md b/.mindmodel/patterns/requests-http.md deleted file mode 100644 index 0930fb6..0000000 --- a/.mindmodel/patterns/requests-http.md +++ /dev/null @@ -1,77 +0,0 @@ ---- -title: Requests HTTP Pattern -category: patterns ---- -# Requests HTTP Pattern - -## Rules - -- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling. -- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429. -- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429). -- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase. - -## Examples - -### ai_provider.py - 429 handling with Retry-After - -```python -resp = requests.post(url, json=json, headers=headers, timeout=10) -... -if getattr(resp, "status_code", 0) == 429: - if attempt == retries: - raise ProviderError(f"Provider returned HTTP {resp.status_code}") - retry_after = None - raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None - if raw: - try: - retry_after = int(raw) - except Exception: - ... 
- if retry_after is not None: - time.sleep(retry_after) - continue -``` - -### api_client.py - Session + raise_for_status - -```python -response = self.session.get( - base_url, params=params, timeout=config.API_TIMEOUT -) -response.raise_for_status() -data = response.json() -``` - -### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper - -```python -def _attempt_batch(chunk_texts, start_index): - backoff = 0.5 - for attempt in range(1, retries + 1): - try: - emb_chunk = _embedder( - chunk_texts, model=model, batch_size=len(chunk_texts) - ) - return emb_chunk, None - except Exception as exc: - if attempt == retries: - break - sleep = backoff * (2 ** (attempt - 1)) - time.sleep(sleep) - continue -``` - -## Anti-Patterns - -### Bad: Silent exception swallowing - -**Problem**: Blindly catching all requests exceptions and returning empty response. - -**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log details. - -### Bad: Using print() for errors - -**Problem**: Using print() for network errors instead of structured logging. - -**Remediation**: Use `_logger.exception()` instead (see api_client.py needs fixing). diff --git a/.mindmodel/patterns/streamlit.yaml b/.mindmodel/patterns/streamlit.yaml deleted file mode 100644 index a4742fe..0000000 --- a/.mindmodel/patterns/streamlit.yaml +++ /dev/null @@ -1,225 +0,0 @@ -# Streamlit Patterns - -## Session State Initialization - -Always initialize session state at the start of the main function: - -```python -# app.py -import streamlit as st - -def main(): - # Initialize all session state variables - if "session_id" not in st.session_state: - st.session_state.session_id = None - if "current_motion_index" not in st.session_state: - st.session_state.current_motion_index = 0 - if "motions" not in st.session_state: - st.session_state.motions = [] - if "show_results" not in st.session_state: - st.session_state.show_results = False - - # Rest of app... 
-``` - -## Page Configuration - -Set page config at the top of each page file: - -```python -# pages/1_Stemwijzer.py -import streamlit as st - -st.set_page_config( - page_title="Stemwijzer", - page_icon="🗳️", - layout="centered", -) - -from explorer import build_mp_quiz_tab -build_mp_quiz_tab("data/motions.db") -``` - -## Thin Page Wrapper Pattern - -Pages delegate to shared functions in main modules: - -```python -# pages/2_Explorer.py -import streamlit as st - -st.set_page_config( - page_title="Explorer", - page_icon="🔭", - layout="wide", -) - -from explorer import build_explorer_tab -build_explorer_tab() -``` - -```python -# explorer.py -def build_explorer_tab(): - st.header("🔭 Politiek Explorer") - - tab1, tab2, tab3 = st.tabs([ - "Compass", - "Trajectories", - "Zoeken" - ]) - - with tab1: - render_compass() - with tab2: - render_trajectories() - with tab3: - render_search() -``` - -## Sidebar Pattern - -Use sidebar for configuration and navigation: - -```python -# app.py -def main(): - with st.sidebar: - st.header("Instellingen") - - motion_count = st.slider( - "Aantal moties", - min_value=5, - max_value=25, - value=10, - ) - - policy_area = st.selectbox("Beleidsgebied", config.POLICY_AREAS) - - if st.button("Start Nieuwe Sessie"): - start_new_session(motion_count, policy_area) -``` - -## Callback Pattern for State Updates - -Use callbacks to handle user interactions: - -```python -def on_motion_vote(motion_id: int, vote: str): - """Callback when user votes on a motion.""" - st.session_state.user_votes[motion_id] = vote - - # Move to next motion - if st.session_state.current_motion_index < len(st.session_state.motions) - 1: - st.session_state.current_motion_index += 1 - else: - st.session_state.show_results = True - - st.rerun() - -# In UI -col1, col2, col3 = st.columns(3) -with col1: - st.button("👍 Voor", on_click=on_motion_vote, args=(motion_id, "Voor")) -with col2: - st.button("👎 Tegen", on_click=on_motion_vote, args=(motion_id, "Tegen"))
-with col3: - st.button("❓ Onthouden", on_click=on_motion_vote, args=(motion_id, "Onthouden")) -``` - -## Container Pattern for Dynamic Content - -Use containers for dynamic rendering: - -```python -def show_motion_interface(): - if not st.session_state.motions: - st.warning("Geen moties geladen") - return - - current_idx = st.session_state.current_motion_index - motion = st.session_state.motions[current_idx] - - with st.container(): - st.subheader(f"Motie {current_idx + 1} van {len(st.session_state.motions)}") - st.markdown(f"**{motion['title']}**") - st.caption(f"📅 {motion['date']} | 🏷️ {motion['policy_area']}") - - if motion.get("layman_explanation"): - st.info(motion["layman_explanation"]) - - # Voting buttons... -``` - -## Expander Pattern for Details - -Use expanders for collapsible content: - -```python -with st.expander("Meer details"): - st.markdown(f"**Beschrijving:** {motion.get('description', 'N/A')}") - - if motion.get("voting_results"): - results = json.loads(motion["voting_results"]) - st.json(results) -``` - -## Form Pattern for Batch Updates - -Use forms for multiple related inputs: - -```python -with st.form("session_settings"): - st.subheader("Sessie Instellingen") - - col1, col2 = st.columns(2) - with col1: - count = st.number_input("Aantal moties", min_value=5, max_value=25) - with col2: - area = st.selectbox("Beleidsgebied", config.POLICY_AREAS) - - submitted = st.form_submit_button("Start Sessie") - if submitted: - start_session(count, area) -``` - -## Caching Pattern - -Cache expensive computations: - -```python -@st.cache_data(ttl=3600) # Cache for 1 hour -def load_party_positions(window_id: str) -> Dict: - """Load party positions from database.""" - return db.get_party_positions(window_id) - -@st.cache_resource -def init_database(): - """Initialize database connection.""" - return MotionDatabase(config.DATABASE_PATH) -``` - -## Home Page Pattern - -Landing page with navigation: - -```python -# Home.py -import streamlit as st -
-st.set_page_config( - page_title="Motief: de stematlas", - page_icon="🗺️", - layout="centered", -) - -def main(): - st.title("🗺️ Motief: de stematlas") - st.markdown("**Motief** brengt de Nederlandse Tweede Kamer in kaart...") - - col1, col2 = st.columns(2) - with col1: - st.page_link("pages/1_Stemwijzer.py", label="Open Stemwijzer", icon="🗳️") - with col2: - st.page_link("pages/2_Explorer.py", label="Open Explorer", icon="🔭") -``` diff --git a/.mindmodel/patterns/validation.md b/.mindmodel/patterns/validation.md deleted file mode 100644 index a8fab16..0000000 --- a/.mindmodel/patterns/validation.md +++ /dev/null @@ -1,37 +0,0 @@ ---- -title: Validation Pattern -category: patterns ---- -# Validation Pattern - -## Rules - -- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs. -- Tests should assert that invalid inputs raise the expected exceptions. -- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding). - -## Examples - -### ai_provider.py - Type validation - -```python -if not isinstance(text, str): - raise ProviderError("text must be a string") -``` - -### pipeline/ai_provider_wrapper.py - Defensive empty handling - -```python -if not texts: - return [] -if motion_ids is None: - motion_ids = [None for _ in texts] -``` - -## Anti-Patterns - -### Bad: Invalid values into computation - -**Problem**: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline). - -**Remediation**: Fail fast with a typed exception and add unit tests to cover validations.
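Note on the deleted validation.md above: its fail-fast rule (check types and shapes before any expensive embedding call) could be sketched as below. This is an illustration under assumptions. `validate_embedding_inputs` is a hypothetical helper, and it raises the built-in `ValueError` rather than the project's `ProviderError`, whose definition is not shown in this patch.

```python
def validate_embedding_inputs(texts, motion_ids=None):
    """Validate inputs up front; return (texts, motion_ids) as aligned lists."""
    if not texts:
        return [], []  # defensive empty handling, mirroring the wrapper example
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            raise ValueError(
                f"texts[{index}] must be a string, got {type(text).__name__}"
            )
    if motion_ids is None:
        motion_ids = [None for _ in texts]
    elif len(motion_ids) != len(texts):
        raise ValueError(
            f"motion_ids has {len(motion_ids)} entries but texts has {len(texts)}"
        )
    return list(texts), list(motion_ids)
```

A unit test for this helper would assert both the happy path and that a non-string entry raises, which is what the "Tests should assert that invalid inputs raise the expected exceptions" rule asks for.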
diff --git a/.mindmodel/stack/stack.md b/.mindmodel/stack/stack.md deleted file mode 100644 index a2ea27d..0000000 --- a/.mindmodel/stack/stack.md +++ /dev/null @@ -1,67 +0,0 @@ ---- -title: Tech Stack -category: stack ---- - -# Tech Stack - -## Runtime & Language -- **Python >=3.13** - -## Web Framework -- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages - -## Data Layer -- **DuckDB** - Embedded OLAP database - - Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata -- **ibis** - ORM (referenced but DuckDB-native implementation used) - -## AI / LLM -- **OpenRouter** - API abstraction for AI providers -- **QWEN** - Primary model - - Embeddings: `qwen/qwen3-embedding-4b` - - Chat: `qwen/qwen-2.5-72b-instruct` -- **requests** - HTTP client (not raw openai) - -## ML / Analytics -- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler -- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes -- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD) -- **numpy** - Numerical computing - -## Visualization -- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback) -- **matplotlib** - Static plotting (optional) - -## HTTP & Parsing -- **requests** - Session pooling, retry with backoff -- **beautifulsoup4** - HTML parsing -- **lxml** - XML/HTML processing - -## Key Source Files - -| File | Purpose | -|------|---------| -| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema | -| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) | -| `explorer_helpers.py` | Pure helper functions, Plotly chart builders | -| `analysis/` | SVD pipeline, UMAP projection, clustering | -| `pipeline/` | Data fetch, transform, store pipeline | -| `pages/1_Stemwijzer.py` | Quiz page | -| `pages/2_Explorer.py` | Explorer page | -| `config.py` | Dataclass Config pattern | -| `ai_provider.py` | OpenRouter API wrapper with retry | 
-| `api_client.py` | TweedeKamer OData API client | - -## Singleton Instances - -| Module | Instance | Type | -|--------|----------|------| -| `database.py` | `db` | `MotionDatabase` | -| `config.py` | `config` | `Config` (dataclass) | -| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | - -## Environment -- Python >=3.13 -- Environment variables via `.env` (DB path, API keys) -- No `.env` values in constraint files (security) diff --git a/.mindmodel/system.md b/.mindmodel/system.md deleted file mode 100644 index f4de3e5..0000000 --- a/.mindmodel/system.md +++ /dev/null @@ -1,88 +0,0 @@ -# System Overview - -## Project: Stemwijzer (Dutch Political Voting Compass) - -**Purpose**: A web application that maps the Dutch Tweede Kamer (House of Representatives) based on real parliamentary votes, helping citizens discover which political party aligns best with their views. - -## Architecture Summary - -### Data Flow -``` -TweedeKamer OData API - ↓ - API Client (api_client.py) - ↓ - DuckDB Database (database.py) - ↓ - Pipeline Processing (pipeline/) - β”œβ”€β”€ fetch_mp_metadata # MP party + tenure - β”œβ”€β”€ extract_mp_votes # voting_results β†’ mp_votes - β”œβ”€β”€ svd_pipeline # SVD on vote matrix + Procrustes - β”œβ”€β”€ text_pipeline # AI embeddings via OpenRouter - └── fusion # Combine SVD + text vectors - ↓ - Streamlit Web App (Home.py, pages/) - β”œβ”€β”€ Home.py # Landing page - β”œβ”€β”€ 1_Stemwijzer.py # Voting quiz - └── 2_Explorer.py # Political compass explorer -``` - -### Key Components - -| Component | Purpose | File(s) | -|-----------|---------|---------| -| **Database** | Motion storage, MP votes, embeddings | `database.py` | -| **API Client** | TweedeKamer OData API integration | `api_client.py` | -| **AI Provider** | OpenRouter API for embeddings/summaries | `ai_provider.py` | -| **Pipeline** | Orchestrated data processing | `pipeline/run_pipeline.py` | -| **Analysis** | SVD, clustering, trajectory computation | `analysis/*.py` | -| **Explorer 
Helpers** | Pure functions, chart builders | `explorer_helpers.py` | -| **Web App** | Streamlit UI | `Home.py`, `pages/*.py` | - -### Tech Stack - -- **Language**: Python 3.13+ -- **Web Framework**: Streamlit (multi-page app) -- **Database**: DuckDB with ibis ORM (DuckDB-native implementation) -- **ML/Analytics**: scipy (SVD, Procrustes), scikit-learn (KMeans, cosine_similarity), umap-learn (optional) -- **AI/LLM**: OpenRouter-compatible API (QWEN embeddings + chat) -- **Visualization**: Plotly (interactive charts), matplotlib (optional) -- **HTTP**: requests with Session pooling and retry -- **Parsing**: beautifulsoup4, lxml - -### Key Patterns - -1. **Module-Level Singletons**: `db = MotionDatabase()`, `config = Config()` -2. **Repository Pattern**: MotionDatabase class with method-per-query -3. **Service Layer**: TweedeKamerAPI, ai_provider with retry/backoff -4. **Pipeline Orchestration**: ThreadPoolExecutor for parallel SVD -5. **Short-Lived Connections**: DuckDB connections in try/finally blocks -6. **Graceful Degradation**: try/except around optional dependencies - -### Domain Invariants - -⚠️ **CRITICAL RULES** (from AGENTS.md): - -1. **Right-wing parties on RIGHT**: PVV, FVD, JA21, SGP must appear on RIGHT side of all axes in visualizations -2. 
**SVD labels = voting patterns**: SVD labels reflect voting patterns, NOT semantic content - -### Database Tables - -| Table | Purpose | -|-------|---------| -| `motions` | Parliamentary motions with id, title, date, category | -| `mp_votes` | Individual MP votes on motions (Voor/Tegen/Onthouden) | -| `mp_metadata` | MP names, parties, tenure info | -| `svd_vectors` | 2D SVD-computed political positions per entity | -| `fused_embeddings` | Combined SVD + text embeddings | -| `embeddings` | Text embeddings for motions | -| `user_sessions` | Voting session tracking | -| `party_results` | Party match results per session | - -### Conventions - -- **Error Handling**: Catch `Exception`, return safe fallbacks (False/[]/None) -- **Logging**: Use `logging.getLogger(__name__)` — **never use print()** -- **Imports**: stdlib → 3rd party → local (3 groups) -- **Type Hints**: Required on public functions with typing module imports -- **DuckDB**: Short-lived connections with try/finally conn.close() diff --git a/analysis/explorer_data.py b/analysis/explorer_data.py index 55da83f..32af634 100644 --- a/analysis/explorer_data.py +++ b/analysis/explorer_data.py @@ -346,7 +346,12 @@ def load_party_mp_vectors(db_path: str) -> Dict[str, List[np.ndarray]]: def load_scree_data(db_path: str) -> List[float]: - """Load scree plot data (explained variance) for current_parliament.""" + """Load scree plot data (explained variance) for current_parliament. + + First tries to read the cached metadata row from svd_vectors. + Falls back to on-the-fly computation via compute_svd_spectrum for + backward compatibility with databases that haven't stored it yet.
+ """ try: con = duckdb.connect(database=db_path, read_only=True) row = con.execute( @@ -364,7 +369,11 @@ def load_scree_data(db_path: str) -> List[float]: import json return json.loads(row[0]) - return [] + + # Fallback: compute dynamically for backward compatibility + from analysis.political_axis import compute_svd_spectrum + + return compute_svd_spectrum(db_path) except Exception: logger.exception("Failed to load scree data") return [] diff --git a/scripts/mindmodel/checks.py b/scripts/mindmodel/checks.py deleted file mode 100644 index b0bdd1a..0000000 --- a/scripts/mindmodel/checks.py +++ /dev/null @@ -1,72 +0,0 @@ -import os -import re -from typing import List - - -def file_exists(base_dir: str, path: str) -> bool: - """Check whether a path exists under base_dir without opening the file. - - This resolves the path relative to base_dir and returns True if the - resolved path exists on the filesystem (file or directory). - """ - if not base_dir: - base = "" - else: - base = base_dir - full = os.path.join(base, path) - return os.path.exists(full) - - -def detect_truncated(snippet: str) -> bool: - """Heuristic detection whether a snippet is truncated. - - Returns True if the snippet ends with an ellipsis '...' (after - trimming whitespace) or contains a common truncation marker like - the substring 'truncat' (case-insensitive). - """ - if snippet is None: - return False - s = snippet.strip() - if s.endswith("..."): - return True - if "truncat" in s.lower(): - return True - return False - - -def find_potential_secrets(text: str) -> List[str]: - """Scan the provided text and return a list of potential secret-like - strings. This uses a few common heuristics and regex patterns and only - scans the provided text (no external resources). - - The function returns a list of found token strings (values when - capture groups are available, otherwise the matched substring). 
- """ - if not text: - return [] - - candidates: List[str] = [] - - # AWS access key id pattern (common): AKIA followed by 16 alphanumeric - aws_pattern = re.compile(r"AKIA[0-9A-Z]{16}") - candidates.extend(aws_pattern.findall(text)) - - # Common key/value patterns like api_key = "..." or "api-key: ..." - # allow shorter secret values (down to 4 chars) to catch short test values - kv_pattern = re.compile( - r"(?i)(?:api[_-]?key|secret[_-]?key|access[_-]?token|access[_-]?key|token|password|passwd|pwd)\s*[=:]+\s*['\"]?([A-Za-z0-9\-_=+/\.]{4,128})['\"]?" - ) - candidates.extend(m.group(1) for m in kv_pattern.finditer(text)) - - # Generic long hex or base64-like strings (heuristic) - long_hex = re.compile(r"\b([a-f0-9]{32,128})\b", re.IGNORECASE) - candidates.extend(long_hex.findall(text)) - - # Deduplicate while preserving order - seen = set() - result: List[str] = [] - for c in candidates: - if c and c not in seen: - seen.add(c) - result.append(c) - return result diff --git a/scripts/mindmodel/cli.py b/scripts/mindmodel/cli.py deleted file mode 100644 index 6555aad..0000000 --- a/scripts/mindmodel/cli.py +++ /dev/null @@ -1,32 +0,0 @@ -from typing import List, Optional - - -def main(argv: Optional[List[str]] = None) -> int: - """CLI wrapper that delegates to scripts.mindmodel.validator.main. - - Returns the integer exit code from the delegated main. If the - validator module is not available or raises, return a non-zero - exit code. 
- """ - try: - # Import here to avoid side-effects on module import - from scripts.mindmodel import validator - - # Call the validator.main if present - if hasattr(validator, "main"): - result = validator.main(argv) - # Ensure we return an int - try: - return int(result) # type: ignore - except Exception: - return 1 - else: - return 2 - except Exception: - # Import error or runtime error — return non-zero so callers - # can detect failure (tests expect non-zero on missing manifest) - return 2 - - -if __name__ == "__main__": - raise SystemExit(main()) diff --git a/scripts/mindmodel/loader.py b/scripts/mindmodel/loader.py deleted file mode 100644 index 088a688..0000000 --- a/scripts/mindmodel/loader.py +++ /dev/null @@ -1,67 +0,0 @@ -"""Simple manifest loader for mindmodel manifests. - -Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`. - -Behavior: -- If PyYAML is installed, uses yaml.safe_load to parse the file. -- Otherwise falls back to the stdlib json parser. -- If the top-level document is a list it will be normalized to {"constraints": }. -- Raises ManifestLoadError for missing file or parse errors. -""" - -from typing import Any, Dict -import json -from pathlib import Path - - -class ManifestLoadError(Exception): - """Raised when a manifest cannot be loaded or parsed.""" - - -try: - import yaml # type: ignore -except Exception: # YAML not available - yaml = None # type: ignore - - -def _parse_with_yaml(text: str) -> Any: - # yamlsafe_load may return any Python structure - try: - return yaml.safe_load(text) - except Exception as exc: # pragma: no cover - defensive - raise ManifestLoadError(f"YAML parse error: {exc}") from exc - - -def _parse_with_json(text: str) -> Any: - try: - return json.loads(text) - except Exception as exc: - raise ManifestLoadError(f"JSON parse error: {exc}") from exc - - -def load_manifest(path: str) -> Dict[str, Any]: - """Load a manifest from the given file path and normalize it to a dict.
- - If the top-level document is a list, it will be returned as {"constraints": list}. - Raises ManifestLoadError if the file does not exist or if parsing fails. - """ - p = Path(path) - if not p.exists(): - raise ManifestLoadError(f"Manifest file not found: {path}") - - text = p.read_text(encoding="utf-8") - - if yaml is not None: - data = _parse_with_yaml(text) - else: - data = _parse_with_json(text) - - # Normalize - if isinstance(data, list): - return {"constraints": data} - - if isinstance(data, dict): - return data - - # Unexpected top-level type, wrap it - return {"manifest": data} diff --git a/scripts/mindmodel/validator.py b/scripts/mindmodel/validator.py deleted file mode 100644 index 245ceb7..0000000 --- a/scripts/mindmodel/validator.py +++ /dev/null @@ -1,108 +0,0 @@ -from typing import Dict, Tuple, List, Any -import json -from pathlib import Path - -from scripts.mindmodel import loader -from scripts.mindmodel import checks - - -def validate_manifest(path: str, base_dir: str = None) -> Tuple[int, Dict[str, Any]]: - """Validate a manifest file at `path`. - - Returns a tuple (exit_code, report). 
- - exit codes: - 0 - ok (no issues) - 1 - warnings (only truncated snippets found) - 2 - critical (missing files, secrets, or parse error) - """ - report: Dict[str, Any] = { - "path": path, - "secrets": [], - "missing_files": [], - "truncated": 0, - "constraints": [], - } - - p = Path(path) - try: - raw_text = p.read_text(encoding="utf-8") - except Exception as exc: - report["load_error"] = f"Manifest file not readable: {exc}" - return 2, report - - # scan for secrets in the manifest text - secrets = checks.find_potential_secrets(raw_text) - report["secrets"] = secrets - - try: - manifest = loader.load_manifest(path) - except loader.ManifestLoadError as exc: - report["load_error"] = str(exc) - # treat parse/load errors as critical - return 2, report - - constraints = manifest.get("constraints") or [] - - for constraint in constraints: - c_rep: Dict[str, Any] = {"constraint": constraint, "evidence": []} - for ev in ( - constraint.get("evidence", []) - if isinstance(constraint.get("evidence", []), list) - else [] - ): - text = ev.get("text") if isinstance(ev, dict) else None - file_ref = ev.get("file") if isinstance(ev, dict) else None - - exists = True - if file_ref: - if not checks.file_exists(base_dir or "", file_ref): - exists = False - report["missing_files"].append(file_ref) - - truncated = False - if text: - truncated = checks.detect_truncated(text) - if truncated: - report["truncated"] += 1 - - c_rep["evidence"].append( - { - "text": text, - "file": file_ref, - "exists": exists, - "truncated": truncated, - } - ) - - report["constraints"].append(c_rep) - - # decide exit code - if report["secrets"]: - return 2, report - - if report["missing_files"]: - return 2, report - - if report["truncated"] > 0: - return 1, report - - return 0, report - - -def main(argv: List[str]) -> int: - import sys - - if len(argv) < 2: - print(json.dumps({"error": "manifest path required"})) - return 2 - - path = argv[1] - base_dir = argv[2] if len(argv) > 2 else None - - code, report 
= validate_manifest(path, base_dir=base_dir) - print(json.dumps(report)) - return code - - -# no execution at import time diff --git a/scripts/validate_mindmodel.py b/scripts/validate_mindmodel.py deleted file mode 100644 index d6deead..0000000 --- a/scripts/validate_mindmodel.py +++ /dev/null @@ -1,56 +0,0 @@ -"""Command-line wrapper around src.validators.mindmodel_validator.validate_manifest - -This tiny CLI loads a manifest and writes a structured JSON report to stdout -and optionally to a file path. It is report-only: it never raises an error or -changes exit code based on findings. -""" - -from __future__ import annotations - -import argparse -import json -import os -from pathlib import Path -from typing import Any - - -def _write_report(report: dict[str, Any], path: Path | None) -> None: - text = json.dumps(report, indent=2, ensure_ascii=False) - print(text) - if path: - path.parent.mkdir(parents=True, exist_ok=True) - path.write_text(text, encoding="utf-8") - - -def main(argv: list[str] | None = None) -> int: - parser = argparse.ArgumentParser("validate_mindmodel") - parser.add_argument("manifest", nargs="?", help="path to manifest file") - parser.add_argument("--manifest", dest="manifest_opt", help="path to manifest file") - parser.add_argument("--report", help="optional output report path") - args = parser.parse_args(argv) - - manifest = args.manifest_opt or args.manifest - if not manifest: - parser.error("manifest path is required (positional or --manifest)") - - # import here to keep CLI tiny when unused - try: - from src.validators.mindmodel_validator import validate_manifest - except Exception as e: # pragma: no cover - defensive - print(f"Failed to import validator: {e}") - return 0 - - try: - report = validate_manifest(manifest, report_only=True) - except Exception as e: # never fail the process - report = {"error": str(e)} - - report_path = Path(args.report) if args.report else None - _write_report(report, report_path) - - # always exit zero for 
report-only operation - return 0 - - -if __name__ == "__main__": - raise SystemExit(main()) diff --git a/src/types/motion_types.py b/src/types/motion_types.py deleted file mode 100644 index bf9d290..0000000 --- a/src/types/motion_types.py +++ /dev/null @@ -1,35 +0,0 @@ -"""Motion-related simple types and JSON helpers. - -Decision: MotionId is an alias for str for simplicity. -""" - -from dataclasses import dataclass, asdict -from typing import List -import json - -MotionId = str -Embedding = List[float] - - -@dataclass -class SimilarityNeighbor: - motion_id: MotionId - score: float - - -def to_json(neighbors: List[SimilarityNeighbor]) -> str: - """Serialize a list of SimilarityNeighbor to a JSON string. - - The format is a JSON list of objects with keys 'motion_id' and 'score'. - """ - list_of_dicts = [asdict(n) for n in neighbors] - return json.dumps(list_of_dicts) - - -def from_json(json_str: str) -> List[SimilarityNeighbor]: - """Deserialize a JSON string (list of dicts) into SimilarityNeighbor list.""" - parsed = json.loads(json_str) - return [ - SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"])) - for item in parsed - ] diff --git a/src/validators/mindmodel_validator.py b/src/validators/mindmodel_validator.py deleted file mode 100644 index e415465..0000000 --- a/src/validators/mindmodel_validator.py +++ /dev/null @@ -1,142 +0,0 @@ -"""Conservative, report-only mindmodel/manifest validator. - -This module provides a small validator that reads a manifest (YAML if -PyYAML is available, otherwise a tiny fallback parser) and reports -potential issues without making changes. 
- -The returned report contains the keys: -- missing_files: list of file paths referenced in the manifest that don't exist -- truncated_evidence: list of items (dicts) where evidence_excerpt appears truncated -- potential_secrets: list of items (dicts) where evidence_excerpt looks like it may contain secrets - -The manifest is expected to contain a top-level `files` list with -entries that are mappings and have at least a `path` (or `file_path`) -and optionally `evidence_excerpt`. -""" - -from __future__ import annotations - -import os -from typing import List, Dict, Any - - -def _load_yaml_native(path: str) -> Dict[str, Any]: - try: - import yaml # type: ignore - - with open(path, "r", encoding="utf-8") as f: - return yaml.safe_load(f) or {} - except Exception: - raise - - -def _load_yaml_fallback(path: str) -> Dict[str, Any]: - """Tiny YAML-ish fallback parser that understands a minimal manifest. - - It only supports a top-level `files:` key and a sequence of simple - mappings with `-` list items and `key: value` pairs indented. - This is intentionally conservative and fragile; it's only used when - PyYAML is not available. 
- """ - result: Dict[str, Any] = {} - files: List[Dict[str, Any]] = [] - current: Dict[str, Any] | None = None - - with open(path, "r", encoding="utf-8") as f: - for raw in f: - line = raw.rstrip("\n") - stripped = line.lstrip() - if not stripped or stripped.startswith("#"): - continue - if stripped.startswith("files:") and line.startswith(stripped): - # top-level marker, skip - continue - if stripped.startswith("- "): - # start new item - if current is not None: - files.append(current) - current = {} - # possible inline key: - path: something - rest = stripped[2:].strip() - if rest: - if ":" in rest: - k, v = rest.split(":", 1) - current[k.strip()] = v.strip() - continue - # key: value lines (indented) - if ":" in stripped and current is not None: - k, v = stripped.split(":", 1) - current[k.strip()] = v.strip() - - if current is not None: - files.append(current) - if files: - result["files"] = files - return result - - -def _normalize_entry(entry: Any) -> Dict[str, Any]: - if not isinstance(entry, dict): - return {"path": str(entry)} - # prefer path or file_path - if "file_path" in entry and "path" not in entry: - entry = dict(entry) - entry["path"] = entry.pop("file_path") - return entry - - -def validate_manifest(manifest_path: str, report_only: bool = True) -> dict: - """Validate a minimal mindmodel manifest and return a report. 
- - Parameters - - manifest_path: path to the YAML manifest file - - report_only: unused flag for now; kept to emphasise this is report-only - - Returns a dict with keys: missing_files, truncated_evidence, potential_secrets - """ - if not os.path.exists(manifest_path): - raise FileNotFoundError(manifest_path) - - # attempt to use PyYAML if available, otherwise fallback - try: - manifest = _load_yaml_native(manifest_path) - except Exception: - manifest = _load_yaml_fallback(manifest_path) - - files = manifest.get("files") or [] - report = {"missing_files": [], "truncated_evidence": [], "potential_secrets": []} - - def _strip_surrounding_quotes(s: str) -> str: - s = s.strip() - if len(s) >= 2 and s[0] == s[-1] and s[0] in ('"', "'"): - return s[1:-1] - return s - - for raw in files: - entry = _normalize_entry(raw) - path = entry.get("path") - evidence = entry.get("evidence_excerpt") or entry.get("evidence") or "" - # Remove surrounding quotes if the fallback YAML parser left them in place - if isinstance(evidence, str): - evidence = _strip_surrounding_quotes(evidence) - - # missing files - if path: - if not os.path.exists(path): - report["missing_files"].append(path) - - # truncated evidence heuristics - if isinstance(evidence, str): - if len(evidence) > 1000 or evidence.strip().endswith("..."): - report["truncated_evidence"].append( - {"path": path, "evidence_excerpt": evidence} - ) - - # potential secrets heuristics - up = evidence.upper() - if "PASSWORD" in up or "SECRET" in up or "BEGIN PRIVATE KEY" in evidence: - report["potential_secrets"].append( - {"path": path, "evidence_excerpt": evidence} - ) - - return report diff --git a/tests/ci/test_schedule_exists.py b/tests/ci/test_schedule_exists.py deleted file mode 100644 index 122e038..0000000 --- a/tests/ci/test_schedule_exists.py +++ /dev/null @@ -1,11 +0,0 @@ -import pathlib - - -def test_schedule_workflow_exists(): - path = pathlib.Path(".github/workflows/mindmodel-schedule.yml") - assert path.exists(), 
f"Expected {path} to exist" - - text = path.read_text(encoding="utf-8") - # ensure the file is a GitHub Actions workflow that declares a schedule - assert "on:" in text - assert "schedule" in text diff --git a/tests/ci/test_workflow_exists.py b/tests/ci/test_workflow_exists.py deleted file mode 100644 index 9deaa8b..0000000 --- a/tests/ci/test_workflow_exists.py +++ /dev/null @@ -1,26 +0,0 @@ -import os - -try: - import yaml - - _HAS_YAML = True -except Exception: - _HAS_YAML = False - - -def test_mindmodel_workflow_exists_and_parses(): - path = os.path.join(".github", "workflows", "mindmodel-validation.yml") - assert os.path.exists(path), f"Workflow file {path} does not exist" - - # Minimal parse: if PyYAML is available, try safe_load; otherwise do a token check - with open(path, "r", encoding="utf-8") as f: - content = f.read() - - if _HAS_YAML: - data = yaml.safe_load(content) - assert data is not None and isinstance(data, dict) - assert "on" in data or "name" in data - else: - # fall back to simple checks to avoid introducing new deps - assert "name:" in content - assert "on:" in content diff --git a/tests/scripts/mindmodel/test_checks.py b/tests/scripts/mindmodel/test_checks.py deleted file mode 100644 index e5ece9f..0000000 --- a/tests/scripts/mindmodel/test_checks.py +++ /dev/null @@ -1,43 +0,0 @@ -import os -import tempfile - -from scripts.mindmodel import checks - - -def test_file_exists(tmp_path): - # create a file under tmp_path - base = str(tmp_path) - p = tmp_path / "subdir" - p.mkdir() - f = p / "file.txt" - f.write_text("hello") - - # path relative to base - assert checks.file_exists(base, "subdir/file.txt") - # non-existing - assert not checks.file_exists(base, "subdir/missing.txt") - - -def test_detect_truncated(): - assert checks.detect_truncated("This is a truncated snippet...") - assert checks.detect_truncated("Truncation marker: [truncated]") - assert checks.detect_truncated("contains truncatED word") - assert not checks.detect_truncated("This 
is complete") - assert not checks.detect_truncated("") - - -def test_find_potential_secrets(): - text = """ - api_key = "abcdEFGH1234ijklMNOP" - password: 'hunter2' - aws = AKIA1234567890ABCD12 - random_hex = deadbeefdeadbeefdeadbeefdeadbeef - not_a_secret = short - """ - - found = checks.find_potential_secrets(text) - # should find api_key value, password, aws and long hex - assert "abcdEFGH1234ijklMNOP" in found - assert "hunter2" in found - assert any(item.startswith("AKIA") for item in found) - assert any("deadbeef" in item for item in found) diff --git a/tests/scripts/mindmodel/test_cli.py b/tests/scripts/mindmodel/test_cli.py deleted file mode 100644 index 2333f24..0000000 --- a/tests/scripts/mindmodel/test_cli.py +++ /dev/null @@ -1,14 +0,0 @@ -import os - - -def test_cli_with_nonexistent_manifest(): - """Calling cli.main with a non-existent manifest should return non-zero.""" - from scripts.mindmodel import cli - - # Provide a path that is extremely unlikely to exist - fake_manifest = "/this/path/does/not/exist/manifest.json" - - code = cli.main([fake_manifest]) - - assert isinstance(code, int) - assert code != 0 diff --git a/tests/scripts/mindmodel/test_loader.py b/tests/scripts/mindmodel/test_loader.py deleted file mode 100644 index b4a3429..0000000 --- a/tests/scripts/mindmodel/test_loader.py +++ /dev/null @@ -1,21 +0,0 @@ -import json -import pytest - -from scripts.mindmodel import loader - - -def test_load_json_manifest(tmp_path): - data = [{"id": "c1", "description": "a constraint"}] - p = tmp_path / "manifest.json" - p.write_text(json.dumps(data), encoding="utf-8") - - loaded = loader.load_manifest(str(p)) - - assert isinstance(loaded, dict) - assert "constraints" in loaded - assert any(c.get("id") == "c1" for c in loaded["constraints"]) - - -def test_missing_manifest_raises(): - with pytest.raises(loader.ManifestLoadError): - loader.load_manifest("nonexistent-file-manifest.json") diff --git a/tests/scripts/mindmodel/test_validator.py 
b/tests/scripts/mindmodel/test_validator.py deleted file mode 100644 index 803a582..0000000 --- a/tests/scripts/mindmodel/test_validator.py +++ /dev/null @@ -1,70 +0,0 @@ -import json -import os - -from scripts.mindmodel import validator - - -def write_manifest(path, data: str): - p = path - p.write_text(data, encoding="utf-8") - return str(p) - - -def test_validate_ok(tmp_path): - # manifest with one constraint and evidence pointing to an existing file - evidence_file = tmp_path / "file.txt" - evidence_file.write_text("hello") - - manifest = { - "constraints": [ - {"id": "c1", "evidence": [{"file": "file.txt", "text": "complete content"}]} - ] - } - - manifest_path = tmp_path / "manifest.json" - manifest_path.write_text(json.dumps(manifest)) - - code, report = validator.validate_manifest( - str(manifest_path), base_dir=str(tmp_path) - ) - assert code == 0 - assert report["missing_files"] == [] - assert report["secrets"] == [] - - -def test_missing_file_flags_failure(tmp_path): - # manifest refers to missing file - manifest = { - "constraints": [{"id": "c2", "evidence": [{"file": "nope.txt", "text": "foo"}]}] - } - manifest_path = tmp_path / "manifest.json" - manifest_path.write_text(json.dumps(manifest)) - - code, report = validator.validate_manifest( - str(manifest_path), base_dir=str(tmp_path) - ) - assert code == 2 - assert "nope.txt" in report["missing_files"] - - -def test_truncated_produces_warning(tmp_path): - # evidence text is truncated -> warning - f = tmp_path / "manifest.json" - manifest = { - "constraints": [{"id": "c3", "evidence": [{"text": "This is truncated..."}]}] - } - f.write_text(json.dumps(manifest)) - - code, report = validator.validate_manifest(str(f), base_dir=str(tmp_path)) - assert code == 1 - assert report["truncated"] >= 1 - - -def test_manifest_scanned_for_secrets(tmp_path): - # manifest text contains an api_key pattern - f = tmp_path / "manifest.json" - f.write_text('api_key = "secretVALUE1234"') - - code, report = 
validator.validate_manifest(str(f), base_dir=str(tmp_path)) - assert code == 2 - assert any("secretVALUE1234" in s for s in report["secrets"]) or report["secrets"] diff --git a/tests/scripts/test_validate_cli.py b/tests/scripts/test_validate_cli.py deleted file mode 100644 index ebd1de4..0000000 --- a/tests/scripts/test_validate_cli.py +++ /dev/null @@ -1,52 +0,0 @@ -import json -import subprocess -import sys -from pathlib import Path - - -def test_cli_runs(tmp_path): - manifest = Path(".mindmodel/manifest.yaml") - assert manifest.exists(), "expected .mindmodel/manifest.yaml to exist in repo" - - report_path = tmp_path / "report.json" - - # Try module mode first, fallback to direct script invocation - cmds = [ - [ - sys.executable, - "-m", - "scripts.validate_mindmodel", - str(manifest), - "--report", - str(report_path), - ], - [ - sys.executable, - "scripts/validate_mindmodel.py", - str(manifest), - "--report", - str(report_path), - ], - ] - - result = None - for cmd in cmds: - try: - result = subprocess.run(cmd, check=False, capture_output=True, text=True) - # if process ran (any exit code), break and use this result - break - except FileNotFoundError: - continue - - assert result is not None, "Failed to run script (no suitable invocation)" - # CLI should exit with 0 (report-only) - assert result.returncode == 0, ( - f"CLI exited non-zero: {result.returncode}\nstderr: {result.stderr}" - ) - - assert report_path.exists(), f"Report file was not created at {report_path}" - - data = json.loads(report_path.read_text(encoding="utf-8")) - # top-level keys expected from validator - for key in ("missing_files", "truncated_evidence", "potential_secrets"): - assert key in data, f"Report JSON missing key: {key}" diff --git a/tests/types/test_motion_types.py b/tests/types/test_motion_types.py deleted file mode 100644 index 1b20634..0000000 --- a/tests/types/test_motion_types.py +++ /dev/null @@ -1,22 +0,0 @@ -import json - -from src.types.motion_types import 
SimilarityNeighbor, to_json, from_json - - -def test_similarity_neighbor_json_roundtrip(): - neighbors = [ - SimilarityNeighbor(motion_id="m1", score=0.9), - SimilarityNeighbor(motion_id="m2", score=0.75), - ] - - # Serialize to JSON string - json_str = to_json(neighbors) - assert isinstance(json_str, str) - - # Ensure it's valid JSON - parsed = json.loads(json_str) - assert isinstance(parsed, list) - - # Deserialize back to objects - recovered = from_json(json_str) - assert recovered == neighbors diff --git a/tests/validators/test_mindmodel_validator.py b/tests/validators/test_mindmodel_validator.py deleted file mode 100644 index e75a8a8..0000000 --- a/tests/validators/test_mindmodel_validator.py +++ /dev/null @@ -1,45 +0,0 @@ -import os -import tempfile -from pathlib import Path - -import pytest - -from src.validators.mindmodel_validator import validate_manifest - - -def _write_temp_manifest(contents: str) -> str: - fd, path = tempfile.mkstemp(prefix="manifest_", suffix=".yaml") - os.close(fd) - with open(path, "w", encoding="utf-8") as f: - f.write(contents) - return path - - -def test_validator_reports_missing_file(tmp_path): - # manifest referencing a non-existent file - missing = str(tmp_path / "no_such_file.txt") - manifest = f""" -files: - - path: {missing} -""" - mpath = _write_temp_manifest(manifest) - try: - report = validate_manifest(mpath) - assert "missing_files" in report - assert missing in report["missing_files"] - finally: - Path(mpath).unlink() - - -def test_validator_detects_potential_secret(tmp_path): - # manifest with evidence_excerpt containing PASSWORD - evidence = "This shows a PASSWORD=hunter2 in the output" - manifest = f'files:\n - path: some_file.txt\n evidence_excerpt: "{evidence}"\n' - mpath = _write_temp_manifest(manifest) - try: - report = validate_manifest(mpath) - assert "potential_secrets" in report - items = report["potential_secrets"] - assert any(evidence in (item.get("evidence_excerpt") or "") for item in items) - finally: - 
Path(mpath).unlink() diff --git a/tests/validators/test_types.py b/tests/validators/test_types.py deleted file mode 100644 index 0de0bea..0000000 --- a/tests/validators/test_types.py +++ /dev/null @@ -1,24 +0,0 @@ -import os -from pathlib import Path - -import pytest - -from src.validators.types import parse_manifest, Manifest - - -def test_manifest_model_parses_sample(tmp_path: Path): - sample = """ -files: - - path: data/file1.txt - evidence_excerpt: "some evidence" - - file_path: data/file2.txt - evidence_excerpt: "other evidence" -""" - p = tmp_path / "manifest.yaml" - p.write_text(sample, encoding="utf-8") - - manifest = parse_manifest(str(p)) - assert isinstance(manifest, Manifest) - assert len(manifest.files) == 2 - assert manifest.files[0]["path"] == "data/file1.txt" - assert manifest.files[1]["path"] == "data/file2.txt" diff --git a/tests/validators/test_validator_edgecases.py b/tests/validators/test_validator_edgecases.py deleted file mode 100644 index e01e9dd..0000000 --- a/tests/validators/test_validator_edgecases.py +++ /dev/null @@ -1,56 +0,0 @@ -import os -from pathlib import Path - -from src.validators.mindmodel_validator import validate_manifest - - -def test_missing_files_reported(tmp_path): - # create two paths that do not exist - p1 = str(tmp_path / "missing_one.txt") - p2 = str(tmp_path / "missing_two.txt") - - manifest = f""" -files: - - path: {p1} - - path: {p2} -""" - - mpath = tmp_path / "manifest_missing.yaml" - mpath.write_text(manifest, encoding="utf-8") - - report = validate_manifest(str(mpath)) - assert "missing_files" in report - # both missing paths should be reported - assert p1 in report["missing_files"] - assert p2 in report["missing_files"] - - -def test_truncated_evidence_and_secrets_reported(tmp_path): - # entry with truncated evidence (ends with ...) - trunc_path = str(tmp_path / "trunc.txt") - trunc_evidence = "This output was cut off..." 
- - # entry with potential secret (contains PASSWORD) - secret_path = str(tmp_path / "secret.txt") - secret_evidence = "Found PASSWORD=sekret123 in the logs" - - manifest = f""" -files: - - path: {trunc_path} - evidence_excerpt: "{trunc_evidence}" - - path: {secret_path} - evidence_excerpt: "{secret_evidence}" -""" - - mpath = tmp_path / "manifest_edgecases.yaml" - mpath.write_text(manifest, encoding="utf-8") - - report = validate_manifest(str(mpath)) - - # truncated evidence should report the trunc_path - assert "truncated_evidence" in report - assert any(item.get("path") == trunc_path for item in report["truncated_evidence"]) - - # potential secrets should report the secret_path - assert "potential_secrets" in report - assert any(item.get("path") == secret_path for item in report["potential_secrets"]) diff --git a/thoughts/shared/changes/2026-03-28-ansible-package-implementation.md b/thoughts/shared/changes/2026-03-28-ansible-package-implementation.md deleted file mode 100644 index 47d4498..0000000 --- a/thoughts/shared/changes/2026-03-28-ansible-package-implementation.md +++ /dev/null @@ -1,40 +0,0 @@ -# 2026-03-28 Ansible package implementation - -Summary of changes added to repository: - -- packages/@ansible/example/ - - package.json (scoped package @ansible/example) - - README.md - - src/index.js - - tests/ (test_package_json.js, test_pack_inspect.js, _pack_helpers.js, run.js) -- .github/workflows/publish-ansible-example.yml -- .github/workflows/deploy-motief.yml -- docs/deployment/ansible-package-deploy.md -- docs/embeddings.md -- README.md (top-level) -- thoughts/shared/changes/2026-03-28-ansible-package-implementation.md (this file) - -Verification commands (run from repo root): - -1. Run package tests: - cd packages/@ansible/example && npm test - -2. Run pack inspection: - cd packages/@ansible/example && node tests/test_pack_inspect.js - -3. Simulate pack locally: - cd packages/@ansible/example && npm pack && tar -tzf | head -n 20 - -4. 
Check workflows syntax locally (optional): - - Use `act` or `nektos/act` to run workflow_dispatch triggers in a container; ensure secrets are not printed. - -5. Verify docs updated for embeddings and deployment: open docs/embeddings.md and docs/deployment/ansible-package-deploy.md - -Notes: -- Do NOT add secrets to repo. Secrets: NPM_TOKEN, DEPLOY_SSH_KEY, DEPLOY_HOST, DEPLOY_USER, DEPLOY_SSH_PORT, OPENROUTER_API_KEY - -Contact: Sven Geboers - -End of changelog. - -Write the file with neutral tone and concise steps for verification. diff --git a/thoughts/shared/changes/2026-03-28-env-removal-report.md b/thoughts/shared/changes/2026-03-28-env-removal-report.md deleted file mode 100644 index 70a7b70..0000000 --- a/thoughts/shared/changes/2026-03-28-env-removal-report.md +++ /dev/null @@ -1,36 +0,0 @@ ---- -date: 2026-03-28 -title: "Remove .env from tracking — report" ---- - -Summary -------- - -I removed `.env` from the repository index and added it to `.gitignore` to prevent accidental future commits. This was a non-destructive, forward-facing change — the repository history still contains prior commits that touched `.env`. - -What I ran ------------ - -- git rm --cached .env -- ensured `.gitignore` contains `.env` -- committed the change: chore(secrets): stop tracking .env and add to .gitignore - -Commits that referenced .env ----------------------------- - -These commits touched `.env` in the repository history (from git log --all -- .env): - -- 35f4667 2026-03-28 Sven Geboers chore(secrets): stop tracking .env and add to .gitignore -- 3551a82 2026-03-21 Sven Geboers feat(analysis): add 2D political compass and 2D trajectories - -Notes ------ - -- The `.env` file was removed from the index but remains in historical commits. If you need to remove it from history, we can perform a history rewrite (git-filter-repo or BFG) and force-push; this is destructive and requires coordination.
-- I created a CI guard to fail builds if a `.env` file is present in the repository root (see .github/workflows/forbid-env.yml). This prevents accidental re-adding via pushes/PRs. - -Next steps (recommended) ------------------------- - -1. Rotate secrets that might have been in `.env` (see the secrets-rotation checklist next). This is mandatory if those keys were used anywhere publicly or in shared CI. -2. If you require history purge, reply confirming and I'll prepare a filter-repo run and the exact force-push sequence. diff --git a/thoughts/shared/changes/2026-03-28-secrets-rotation-checklist.md b/thoughts/shared/changes/2026-03-28-secrets-rotation-checklist.md deleted file mode 100644 index 24612de..0000000 --- a/thoughts/shared/changes/2026-03-28-secrets-rotation-checklist.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -date: 2026-03-28 -title: "Secrets rotation checklist" ---- - -Rotate these secrets if they were stored in `.env` or otherwise exposed: - -- OPENROUTER_API_KEY / OPENAI_API_KEY -- NPM_TOKEN -- DEPLOY SSH keys or passwords (DEPLOY_SSH_KEY, DEPLOY_PASSWORD) -- Any database credentials, API keys, or third-party service tokens - -Steps ------ - -1. Revoke the current tokens in each provider's dashboard. -2. Create new tokens/keys and store them in the repository secrets (GitHub Settings → Secrets). -3. Update any running services / CI variables to use the new tokens. -4. If you used SSH keys and replaced them, update the authorized_keys on the VPS and remove the old key. - -Verification ------------- - -- Use CI dry-run jobs that check connectivity and token validity. -- Run local commands that use the new tokens.