chore: convert mindmodel from YAML to markdown and clean up

Delete 17 malformed YAML constraint files and 10 stale numbered
constraint files. Convert domain glossary, patterns, stack, and
anti-patterns to markdown format. Update manifest.yaml to reference
new markdown files.
main
Sven Geboers 3 weeks ago
parent 910ef0dc3b
commit 88595c869b
  1. .mindmodel/anti-patterns/anti-patterns.md (127)
  2. .mindmodel/anti-patterns/anti-patterns.yaml (146)
  3. .mindmodel/constraints/01-naming.yaml (34)
  4. .mindmodel/constraints/10-db-schema.yaml (74)
  5. .mindmodel/constraints/20-domain-glossary.yaml (22)
  6. .mindmodel/constraints/30-clusters.yaml (30)
  7. .mindmodel/constraints/40-patterns.yaml (46)
  8. .mindmodel/constraints/50-anti-patterns.yaml (24)
  9. .mindmodel/constraints/60-examples.yaml (117)
  10. .mindmodel/constraints/99-stack.yaml (43)
  11. .mindmodel/constraints/db_connection.yaml (29)
  12. .mindmodel/constraints/error-handling.md (143)
  13. .mindmodel/constraints/error-handling.yaml (184)
  14. .mindmodel/constraints/error_handling.yaml (36)
  15. .mindmodel/constraints/logging.md (124)
  16. .mindmodel/dependencies/dependencies.md (92)
  17. .mindmodel/dependencies/dependencies.yaml (78)
  18. .mindmodel/domain/domain-glossary.md (146)
  19. .mindmodel/domain/domain-glossary.yaml (107)
  20. .mindmodel/manifest.yaml (96)
  21. .mindmodel/patterns/duckdb-access.md (79)
  22. .mindmodel/patterns/duckdb_access.yaml (70)
  23. .mindmodel/patterns/embeddings-similarity.md (74)
  24. .mindmodel/patterns/embeddings_similarity.yaml (63)
  25. .mindmodel/patterns/error-handling.md (63)
  26. .mindmodel/patterns/error_handling.yaml (54)
  27. .mindmodel/patterns/module-singletons.md (41)
  28. .mindmodel/patterns/module_singletons.yaml (33)
  29. .mindmodel/patterns/requests-http.md (77)
  30. .mindmodel/patterns/requests_http.yaml (65)
  31. .mindmodel/patterns/validation.md (37)
  32. .mindmodel/patterns/validation.yaml (29)
  33. .mindmodel/stack/stack.md (67)
  34. .mindmodel/stack/stack.yaml (41)
  35. .mindmodel/system.md (83)

@ -0,0 +1,127 @@
---
title: Anti-Patterns in Stemwijzer
category: anti-patterns
severity: critical
---
# Anti-Patterns
> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details.
## CRITICAL: print() Instead of Logging
**File**: `api_client.py`
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)`
**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```
**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception as e:
        _logger.exception("Error fetching motions from API: %s", e)
        return []
```
---
## CRITICAL: Global `_DummySt` Replacement
**File**: `explorer.py`
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement
**Problem**: Creates a module-level variable `st` that shadows `streamlit` module, causing subtle bugs.
**Fix**: Use conditional flags instead of global replacement:
```python
# GOOD: Use conditional logic
try:
    import plotly.express as px
    import plotly.graph_objects as go
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    px = None
    go = None

def render_chart(data):
    if not HAS_PLOTLY:
        _logger.warning("Plotly not available")
        return
    # ... rest of chart logic
```
---
## WARNING: Logger Naming Inconsistency
**Evidence**: 16 files use `logger`, 17 files use `_logger`
**Files with `logger`** (without underscore):
- api_client.py, ai_provider.py, pipeline files, analysis files
**Files with `_logger`** (with underscore):
- database.py, explorer.py, explorer_helpers.py
**Recommendation**: Standardize on `_logger` for module-level loggers.
---
## WARNING: Bare except with pass
**File**: `database.py`, line 47
```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```
**Fix**:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped: %s", exc)
```
---
## INVESTIGATED: Entity-ID / Party-Name Mismatch
**Status**: INVALID - investigated and resolved
**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists.
---
## Pattern: Three Separate Party Alias Dictionaries
**Problem**: Party name variations exist in 3+ places with no canonical alias mapping.
**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    # ...
}
```

@ -1,146 +0,0 @@
# Anti-Patterns
> ⚠ **NOTE**: Section 1 below was **investigated and resolved** — it is NOT a bug (see §1 for details).
---
## 1. ~~CRITICAL: Entity-ID / Party-Name Mismatch in `compute_party_coords`~~ → **INVALID — INVESTIGATED & RESOLVED**
**Investigation Date**: 2026-03-31
**Investigation Summary**: After thorough analysis of the database schema and code, this anti-pattern is **INVALID**. The original concern was based on a false assumption about `svd_vectors.entity_id` containing party names.
**Investigation Findings**:
1. **`svd_vectors` table has NO rows with `entity_type='party'`** — only `mp` and `motion` entity types exist in practice.
2. **`entity_id` values in `svd_vectors` are always MP names** (e.g., `"Van Dijk, I."`), never party names. The party centroids are correctly computed via `mp_metadata` lookups.
3. **The trajectories plot WORKS correctly** — no production bug exists. The code path for party-level visualization does not rely on `svd_vectors.entity_id` containing party names.
**Conclusion**: The original anti-pattern was a false positive caused by incorrect assumptions about data contents. The `party_map` reverse-lookup (`mp_name → party_name`) works correctly because `entity_id` values are always MP names, not party names.
---
## 2. Bare `except: pass`
**File**: `database.py`, line 47
**Problem**: Catches **all** exceptions including `KeyboardInterrupt`, `SystemExit`, `MemoryError`.
Silently swallows errors — no logging, no fallback.
**Broken code**:
```python
try:
    self.conn.execute(sql)
except:  # ← bare except
    pass
```
**Fix**:
```python
try:
    self.conn.execute(sql)
except ibis.errors.IbisError as e:
    st.warning(f"Query failed: {e}")
    raise  # or return a default
```
---
## 3. Nested Exception Handling
**File**: `explorer.py`, lines 244–261
**Problem**: Try/except inside try/except creates opaque error paths. Inner exception silently swallows outer intent.
**Broken code**:
```python
try:
    result = compute_svd(motions)
    # ...
except Exception:
    try:
        # Try fallback approach
        result = fallback_compute(motions)
    except Exception:
        pass  # ← both exceptions silently dropped
```
**Fix**: Flatten — handle each case explicitly, or use a decorator.
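The decorator option mentioned above is not shown elsewhere in these docs; a minimal sketch of such a helper (assuming the module-level `_logger` convention used in this repo, and with `compute_svd` standing in for any wrapped function) could look like:

```python
import functools
import logging

_logger = logging.getLogger(__name__)

def with_fallback(default):
    """Return a decorator that logs any exception and returns `default` instead."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                _logger.warning("%s failed: %s", func.__name__, exc)
                return default
        return wrapper
    return decorator

@with_fallback(default=None)
def compute_svd_safe(motions):
    return compute_svd(motions)  # compute_svd is the existing (undecorated) routine
```

This keeps the fallback behaviour in one place instead of repeating nested try/except blocks at each call site.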
---
## 4. Catch-All `Exception` Used Everywhere
**Problem**: `except Exception:` catches 50+ exception types including `ValueError`, `TypeError`, `KeyError`.
Overly broad — masks real bugs.
**Occurrence**: 850+ instances of bare/generic exception handlers across codebase.
**Fix**: Catch specific exceptions. If you must catch multiple, chain them:
```python
except (KeyError, ValueError) as e:
    logger.warning(f"Missing field: {e}")
```
---
## 5. No `entity_id` Format Validation
**Problem**: `svd_vectors.entity_id` can be either:
- An MP name (e.g., `"Van Dijk, I."`) for individual-level SVD
- A party name (e.g., `"GroenLinks-PvdA"`) for party-level SVD
No validation distinguishes which is which. Code must infer from context. (Note: In practice `svd_vectors.entity_id` only contains MP names — see §1 for investigation findings.)
**Fix**: Add explicit format marker or separate columns:
```python
# Option A: separate columns
svd_vectors = pd.DataFrame({
    'mp_name': [...],     # nullable
    'party_name': [...],  # nullable
    'window': [...],
    'vector_2d': [...]
})

# Option B: format prefix
# "mp:Van Dijk, I." or "party:GroenLinks-PvdA"
```
---
## 6. Silent Fallback When Party Centroids Fail
**Problem**: If `party_map` lookup fails (entity is a party, not MP), the code silently produces
`party_map_count: 0` and empty `parties_with_centroid_counts`. No warning is raised.
**Fix**: Add validation and warning:
```python
if party_map_count == 0:
    st.warning(f"No party mappings found for {len(svd_df)} entities in window '{window}'")
```
---
## 7. Three Separate Party Alias Dictionaries (No Single Source of Truth)
**Problem**: Party name variations exist in 3+ places:
- `PARTY_COLOURS` keys
- `party_map` values (from `mp_party_history`)
- Raw data column values
No canonical alias mapping. Spelling mismatches cause silent failures.
**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    ...
}

def resolve_party(name: str) -> str:
    """Normalize any party name variant to canonical form."""
    for canonical, aliases in PARTY_ALIASES.items():
        if name in aliases or name == canonical:
            return canonical
    return name  # no alias found
```

@ -1,34 +0,0 @@
# Naming & Style Conventions
## Rules
- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Import order: stdlib, third-party, local; prefer absolute imports, grouped by origin.
- Use black, ruff, isort, mypy as the recommended toolchain; repository lacks config files (black, ruff, pyproject sections).
## Examples
### Function example (from pipeline/run_pipeline.py)
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```
### Class example (from database.py)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```
## Anti-patterns
- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.
## Remediations
- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run ruff/black CI lint step.
## Evidence pointers
- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (file database.py lines 1-400+)

@ -1,74 +0,0 @@
# Database Schema (DuckDB) — extracted DDL
## Rules
- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS used in database.py).
## Examples (DDL snippets extracted from database.py)
### motions table
```sql
CREATE TABLE IF NOT EXISTS motions (
id INTEGER DEFAULT nextval('motions_id_seq'),
title TEXT NOT NULL,
description TEXT,
date DATE,
policy_area TEXT,
voting_results JSON,
winning_margin FLOAT,
controversy_score FLOAT,
layman_explanation TEXT,
externe_identifier TEXT,
body_text TEXT,
url TEXT UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
```
### mp_votes table
```sql
CREATE TABLE IF NOT EXISTS mp_votes (
id INTEGER DEFAULT nextval('mp_votes_id_seq'),
motion_id INTEGER NOT NULL,
mp_name TEXT NOT NULL,
party TEXT,
vote TEXT NOT NULL,
date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
```
### embeddings / fused_embeddings
```sql
CREATE TABLE IF NOT EXISTS embeddings (
id INTEGER DEFAULT nextval('embeddings_id_seq'),
motion_id INTEGER NOT NULL,
model TEXT,
vector JSON NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
```
## Anti-patterns
- Broad try/except around the duckdb import (database.py top) is acceptable for an optional dependency, but it should explicitly log the missing dependency and document test behavior.
## Remediations
- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically (a minimal sketch follows this list).
- Add tests that exercise both duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80).
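A minimal sketch of the schema_version remediation, assuming DuckDB is available (the `migrations` mapping and function name are illustrative, not existing repo code):

```python
import duckdb

def ensure_schema_version(db_path: str, target_version: int, migrations: dict[int, str]) -> None:
    """Apply numbered SQL migrations until the stored schema version reaches target_version."""
    with duckdb.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
        row = conn.execute("SELECT max(version) FROM schema_version").fetchone()
        current = row[0] or 0
        for version in range(current + 1, target_version + 1):
            conn.execute(migrations[version])  # e.g. {2: "ALTER TABLE motions ADD COLUMN IF NOT EXISTS url TEXT"}
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
```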
## Evidence pointers
- database.py: DDL strings and sequences (file: database.py lines ~1-300 and further). See create table blocks for motions, mp_votes, embeddings, fused_embeddings.

@ -1,22 +0,0 @@
# Domain Glossary
## Rules
- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.
## Terms
- Motion: parliamentary motion stored in `motions` table. Evidence: database.py CREATE TABLE motions (file: database.py lines ~40-110)
- MP (Member of Parliament): individual with votes stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in `embeddings` table; fused vectors in `fused_embeddings`.
- SVD vector: reduced-dimensional vectors stored in `svd_vectors` table.
- Window: time window identifier (e.g., "2024-Q1") used across SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score
## Examples / Usage
- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120
## Evidence pointers
- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py)
- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py)
## Anti-patterns
- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations.

@ -1,30 +0,0 @@
# Code Clusters / Organization
## Rules
- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers
## Examples
### Pipeline orchestrator (cluster: CLI & pipeline)
```python
from database import MotionDatabase
db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```
## Remediations
- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about optional duckdb dependency and JSON fallback for tests.
## Evidence pointers
- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py)
- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py)
- analysis/visualize.py: visualization cluster (file: analysis/visualize.py)

@ -1,46 +0,0 @@
# Design Patterns & Code Patterns
## Rules
- Use repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and local fallback.
- Pipeline orchestration: run_pipeline.py uses phases, ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (collect results before writes).
## Examples
### Repository pattern (database.py MotionDatabase)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into database"""
        # uses duckdb.connect and parameterized queries
```
### Provider adapter with retries (ai_provider.py)
```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
```
### Pipeline parallelism pattern (run_pipeline)
```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
    # wait then write sequentially to DuckDB
```
## Anti-patterns
- Broad excepts used in several places (database.py top-level try/except on duckdb import, many generic excepts around DB operations) — can hide real errors.
## Remediations
- Replace broad except Exception with targeted exceptions and explicit logging. Where fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with clear message and include guidance in CONTRIBUTING.md.
## Evidence pointers
- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260)
- database.py: MotionDatabase methods (file: database.py)

@ -1,24 +0,0 @@
# Anti-patterns, Issues and Recommended Fixes
## Rules
- Flagged issues discovered in Phase 1 must be remediated with concrete actions.
## Issues
- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml
- openai is declared but static imports not found; may be unused. Evidence: pyproject.toml, ai_provider.py uses requests and env keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges; no lockfile present. This reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps.
- Broad except Exception used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.
## Remediations / Recommended fixes
- Move pytest from runtime dependencies to dev-dependencies in pyproject.toml.
- Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies] depending on toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var (see the sketch after this list).
- Pin critical dependencies or add upper bounds; generate lockfile (poetry.lock or pip-tools requirements.txt). Add CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add CI lint stage.
- Replace broad except Exception with narrower catches and re-raise or log with traceback when unexpected. Example locations: database.py top import, insert_motion broad except, ai_provider fallback blocks.
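A lazy-import shim along the lines described could look like this (illustrative only; whether the project actually needs `openai`, and which env var it should read, must be confirmed first):

```python
import importlib
import os

def get_openai_client():
    """Import openai only when an API key is configured; fail with a clear message otherwise."""
    api_key = os.environ.get("OPENAI_API_KEY")  # env var name is an assumption
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; the openai provider is disabled")
    openai = importlib.import_module("openai")  # imported lazily so the dependency stays optional
    return openai.OpenAI(api_key=api_key)
```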
## Evidence pointers
- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40)
- database.py: multiple broad except blocks (file: database.py top and methods)
- ai_provider.py: uses requests + env keys (file: ai_provider.py)

@ -1,117 +0,0 @@
# Example Extractions
## Rules
- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.
## (a) Function signatures with docstrings (5 examples)
1) pipeline/run_pipeline.py::_generate_windows
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual    → "2024"
    """
```
2) database.py::append_audit_event
```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```
3) ai_provider.py::get_embedding
```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```
4) ai_provider.py::get_embeddings_batch
```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```
5) analysis/visualize.py::plot_umap_scatter
```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```
## (b) SQL / DDL snippets (3 examples inferred from database.py)
1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)
2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes
3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings
## (c) Pytest stubs (4 sample tests matching conventions)
Create tests under tests/ named test_*.py using fixtures in conftest.py. Examples below are stubs to add.
1) tests/test_database_basic.py
```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase
    db = MotionDatabase(db_path=db_path)
    # If duckdb not available, JSON fallback should create .embeddings.json
    assert db is not None
```
2) tests/test_ai_provider.py
```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding
    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```
3) tests/test_pipeline_windows.py
```python
from pipeline.run_pipeline import _generate_windows

def test_generate_quarterly_windows():
    from datetime import date
    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```
4) tests/test_visualize_plot.py
```python
def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path):
    # If plotly missing, function should raise ImportError with guidance
    import analysis.visualize as vis
    try:
        vis._require_plotly()
    except ImportError:
        assert True
```
## Evidence pointers
- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py create table blocks

@ -1,43 +0,0 @@
# Stack and Dependencies
## Rules
- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/
## Examples
### pyproject dependencies (evidence: pyproject.toml)
```toml
dependencies = [
"duckdb>=1.3.2",
"ibis-framework[duckdb]>=10.8.0",
"openai>=1.99.7",
"scipy>=1.11",
"umap-learn>=0.5",
"plotly>=5.0",
"pytest>=9.0.2",
"requests>=2.32.4",
"schedule>=1.2.2",
"streamlit>=1.48.0",
"scikit-learn>=1.8.0",
"beautifulsoup4>=4.14.3",
"lxml>=6.0.2",
]
```
## Anti-patterns / Notes
- pytest is listed under runtime dependencies in pyproject.toml (line: dependencies). Move pytest to dev-dependencies to avoid shipping test runner in production images. Evidence: pyproject.toml
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai appears declared but static imports not found; possible unused dependency (evidence: pyproject.toml, ai_provider.py uses requests and environment keys instead of openai).
## Remediations
- Move test-only libs (pytest) to dev-dependencies in pyproject.toml.
- Add lockfile and CI step to check for pinned dependencies.
- Audit declared but unused packages (openai) and remove or confirm dynamic usage.
## Evidence pointers
- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry (file: Home.py)
- database.py: duckdb table creation and connection (file: database.py lines ~1-350)

@ -1,29 +0,0 @@
# DB connection handling constraints
rules:
  - name: use_context_managers_for_connections
    rule: "Prefer using 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible."
    rationale: "Ensures proper resource cleanup and avoids connection leaks."
  - name: read_only_for_compute
    rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)."
    rationale: "Allows safe parallel workers and reduces write contention."
  - name: short_lived_writes
    rule: "When performing database writes, open short-lived connections, commit quickly and close."
    rationale: "Avoids long-lived transactions and reduces lock windows."
examples:
  - path: pipeline/svd_pipeline.py
    snippet: |
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(...).fetchall()
      finally:
          conn.close()
anti_patterns_and_remediations:
  - bad: "Creating a global connection at import that performs migrations."
    remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time."
  - bad: "Not closing connections on exceptions."
    remediation: "Wrap connects in `with` or finally: conn.close() blocks."

@ -0,0 +1,143 @@
---
title: Error Handling Patterns
category: constraints
severity: high
---
# Error Handling Patterns
## Core Rules
1. **Catch `Exception`, return safe fallbacks** (False/[]/None)
2. **Log exceptions with traceback** using `_logger.exception()`
3. **Never swallow exceptions silently** - always log or return a sensible default
4. **Avoid nested try/except blocks** - flatten exception handling
## Pattern: Try/Except Safe Fallback
This is the dominant pattern in the codebase (219+ instances).
```python
# Standard pattern from database.py, api_client.py, etc.
try:
    result = risky_operation()
    return process(result)
except Exception as exc:
    _logger.warning("Operation failed: %s", exc)
    return safe_fallback  # False, [], None, {}
```
### Examples from Codebase
**database.py** - DuckDB operations:
```python
def get_svd_vectors(self, window: str):
    try:
        conn = duckdb.connect(self.db_path, read_only=True)
        try:
            result = conn.execute(query, (window,)).fetchall()
            return self._parse_vectors(result)
        finally:
            conn.close()
    except Exception as exc:
        _logger.warning("Failed to get SVD vectors: %s", exc)
        return []
```
**ai_provider.py** - HTTP retries:
```python
try:
    resp = requests.post(url, json=json, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(f"Connection error: {exc}") from exc
    # ... retry logic
```
## Pattern: Optional Dependency Fallback
Gracefully degrade when optional packages are unavailable.
```python
# UMAP fallback in explorer_helpers.py
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False
    _logger.debug("UMAP not available, using SVD vectors directly")

def project_to_2d(vectors):
    if HAS_UMAP:
        return umap.UMAP().fit_transform(vectors)
    return vectors[:, :2]  # Fallback: first 2 SVD dimensions
```
## Anti-Patterns
### 1. Bare except with pass (CRITICAL)
**File**: `database.py`, line 47
```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```
**Fix**: Catch specific exception or log and continue:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped (may already exist): %s", exc)
```
### 2. Nested Exception Handling
**File**: `explorer.py`, lines 244-261
```python
# BAD - opaque error paths
try:
    result = compute_svd(motions)
except Exception:
    try:
        result = fallback_compute(motions)
    except Exception:
        pass  # Both exceptions silently dropped
```
**Fix**: Flatten and handle each case explicitly:
```python
# GOOD - explicit handling
try:
    result = compute_svd(motions)
except Exception as exc:
    _logger.warning("SVD failed, trying fallback: %s", exc)
    try:
        result = fallback_compute(motions)
    except Exception as fallback_exc:
        _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc)
        raise
```
## Rule Summary
| Pattern | When to Use | Return Value |
|---------|-------------|--------------|
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
| Re-raise | Critical operations that must succeed | raise |
| Log and continue | Optional steps in pipeline | (continue) |
| Graceful degradation | Optional dependencies | Default behavior |
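The "Re-raise" row above is the only pattern not illustrated in this file; a minimal sketch of re-raising with added context (the function name and wrapping exception type are illustrative, not existing repo code) could look like:

```python
def load_required_vectors(db, window: str):
    """Critical path: propagate failures instead of returning a fallback."""
    try:
        return db.get_svd_vectors(window)
    except Exception as exc:
        _logger.exception("Loading SVD vectors for window %s failed", window)
        raise RuntimeError(f"SVD vectors unavailable for window {window}") from exc
```

Chaining with `from exc` keeps the original traceback attached to the new exception.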
## When to Log vs Return
| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |

@ -1,184 +0,0 @@
# Error Handling Constraints
## Core Rule
**Catch `Exception`, return safe fallbacks (False/[]/None)**
Never let exceptions propagate to user-facing code. Always provide a safe default.
## Patterns
### For Not-Found Operations
Return `None` or falsy value when item not found:
```python
# GOOD: Return None on not found
def get_motion_by_id(self, motion_id: int) -> Optional[Dict]:
    try:
        conn = duckdb.connect(self.db_path)
        result = conn.execute(
            "SELECT * FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()
        conn.close()
        return result
    except Exception:
        conn.close()
        return None
```
### For Collection Operations
Return empty list when no results:
```python
# GOOD: Return empty list on failure
def get_filtered_motions(self, **kwargs) -> List[Dict]:
    try:
        conn = duckdb.connect(self.db_path)
        rows = conn.execute(query, params).fetchall()
        conn.close()
        return rows
    except Exception:
        conn.close()
        return []
```
### For Boolean Operations
Return `False` for failed boolean checks:
```python
# GOOD: Return False on failure
def motion_exists(self, motion_id: int) -> bool:
    try:
        conn = duckdb.connect(self.db_path)
        count = conn.execute(
            "SELECT COUNT(*) FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()[0]
        conn.close()
        return count > 0
    except Exception:
        return False
```
### For Creation Operations
Return `False` or empty string on failure:
```python
# GOOD: Return empty string on failure
def generate_summary(self, title: str, body: str) -> str:
    try:
        return ai_provider.chat_completion(messages)
    except ai_provider.ProviderError:
        logger.exception("AI provider failed")
        return ""
```
## Anti-Patterns to Avoid
### Don't Catch Specific Exceptions Only
```python
# BAD: Catches only FileNotFoundError, misses other issues
try:
    with open(path) as f:
        return json.load(f)
except FileNotFoundError:
    return None
```
### Don't Re-raise Without Context
```python
# BAD: Loses information
try:
    process(data)
except Exception:
    raise  # No context added
```
### Don't Swallow Exceptions Silently
```python
# BAD: No logging, no fallback
try:
    return risky_operation()
except Exception:
    pass  # What happened?
```
## Nested Exception Handling
When calling code that has its own error handling, wrap only if needed:
```python
# Accept result from wrapped function (it handles errors)
def fetch_motions(self, start_date):
    # ai_provider_wrapper handles retries internally
    embeddings = get_embeddings_with_retry(texts)
    # Only wrap if wrapper doesn't handle errors
    if all(e is None for e in embeddings):
        logger.error("All embeddings failed")
        return []
    return process(embeddings)
```
## Context Managers
Use `try/finally` for cleanup:
```python
def process_with_temp_file(self):
    temp = NamedTemporaryFile(delete=False)
    try:
        temp.write(data)
        temp.close()
        return process_file(temp.name)
    finally:
        temp.close()  # safe even if already closed
        os.unlink(temp.name)
```
## When to Log vs Return
| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |
## Exception Propagation
Only raise exceptions for:
1. Configuration/setup errors (missing required env vars)
2. Programming errors (invalid arguments)
3. Fatal system errors (database corruption)
```python
# GOOD: Raise for configuration errors
def _get_api_key(self) -> str:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ProviderError(
            "OPENROUTER_API_KEY environment variable is required"
        )
    return key
```
## Logging Errors
Always include context:
```python
# GOOD: Include relevant context
_logger.error(
    "Failed to fetch motion %d: %s",
    motion_id,
    exc,
)

# BAD: No context
_logger.error("Failed to fetch")
```

@ -1,36 +0,0 @@
# Error handling style rules (YAML constraint example)
rules:
  - name: explicit_exceptions
    rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values."
    examples:
      - good: |
          if not isinstance(text, str):
              raise ProviderError('text must be a string')
      - bad: |
          if not isinstance(text, str):
              return []
  - name: avoid_broad_except
    rule: "Avoid 'except Exception:' that swallows errors. If broad except is used for best-effort, log the exception with logger.exception and re-raise or convert."
    examples:
      - bad: |
          try:
              do_work()
          except Exception:
              return []
      - remediation: |
          try:
              do_work()
          except SpecificError as exc:
              logger.warning('Handled error: %s', exc)
              raise
  - name: logging_over_print
    rule: "Prefer logger.* over print() for messages and errors."
    examples:
      - bad: "print('Error fetching motions from API: %s' % e)"
      - good: "logger.exception('Error fetching motions from API')"
enforcement_examples:
  - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception."

@ -1,8 +1,47 @@
---
title: Logging Constraints
category: constraints
severity: critical
---
# Logging Constraints
## Core Rule
**Use `logging.getLogger(__name__)` - never use `print()`**
**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances).
## CRITICAL Anti-Pattern: print() Instead of Logging
**File**: `api_client.py`
**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)`
**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```
**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception as e:
        _logger.exception("Error fetching motions from API: %s", e)
        return []
```
## Logger Initialization
@ -31,6 +70,10 @@ _logger = logging.getLogger(__name__)
_logger = logging.getLogger(__name__)
```
**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention.
**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish from class-level loggers.
## Log Levels
| Level | When to Use |
@ -41,30 +84,6 @@ _logger = logging.getLogger(__name__)
| ERROR | Operation failed, may need attention |
| CRITICAL | Fatal error, program may crash |
## Examples
### Good Logging Practice
```python
_logger.info("Pipeline run: %s → %s (%s windows)", start, end, count)
_logger.debug("Batch embedding attempt %d failed: %s", attempt, exc)
_logger.warning("Fallback used for motion %d: %s", motion_id, reason)
_logger.error("Query failed: %s", exc)
```
### Bad: Using print()
```python
# BAD - don't use print
print(f"Fetched {len(voting_records)} voting records from API")
print(f"Error fetching motions from API: {e}")
```
### Good: Using logger
```python
# GOOD - use logger
_logger.info("Fetched %d voting records from API", len(voting_records))
_logger.error("Error fetching motions from API: %s", e)
```
## Exception Logging
Use `_logger.exception()` for caught exceptions (includes traceback):
@ -77,30 +96,6 @@ except Exception as exc:
return fallback_value
```
Use `_logger.error()` with explicit exception for controlled errors:
```python
try:
    result = risky_operation()
except Exception as exc:
    _logger.error("Operation failed: %s", exc)
    return fallback_value
```
## Configuration
Ensure logging is configured in entry points:
```python
# pipeline/run_pipeline.py
def run(args):
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # ... rest of pipeline
```
## Anti-Patterns
### Debug Prints in Production Code
@ -117,22 +112,6 @@ _logger.debug("Processing window %s", wid)
# BAD - mixing _logger and logger
_logger = logging.getLogger(__name__)
logger = logging.getLogger("other") # Inconsistent
# GOOD - use single consistent pattern
_logger = logging.getLogger(__name__)
```
### Missing Logger Initialization
```python
# BAD - no logger defined
def some_function():
    logging.getLogger(__name__).info("...")  # Redundant calls

# GOOD - define once at module level
_logger = logging.getLogger(__name__)

def some_function():
    _logger.info("...")
```
## Sensitive Data
@ -150,18 +129,3 @@ _logger.info("User %s voted %s", user_id, vote)
# GOOD - log aggregates, not individual votes
_logger.info("Vote recorded for session %s", session_id[:8])
```
## Structured Logging
For complex data, use structured logging:
```python
_logger.info(
    "Motion processed",
    extra={
        "motion_id": motion_id,
        "policy_area": policy_area,
        "processing_time_ms": elapsed_ms,
    },
)
```

@ -0,0 +1,92 @@
---
title: Dependencies and Library Usage
category: dependencies
---
# Dependencies and Library Usage
## Core Dependencies
### duckdb
- **Required**: Yes
- **Fallback**: None (core functionality)
- **Usage**: SQL database for motions, embeddings, SVD vectors
- **Files**: database.py, analysis/*.py, pipeline/*.py
### streamlit
- **Required**: Yes
- **Fallback**: None
- **Usage**: Web UI framework
- **Files**: app.py, pages/*.py, explorer.py
### requests
- **Required**: Yes
- **Fallback**: None
- **Usage**: HTTP client for API calls
- **Files**: api_client.py, ai_provider.py
### plotly
- **Required**: Yes
- **Fallback**: None (raises ImportError)
- **Usage**: Interactive charts for explorer
- **Files**: explorer.py, explorer_helpers.py
## Optional Dependencies
### umap-learn
- **Required**: No
- **Fallback**: Use raw SVD vectors (first 2 dimensions)
- **Usage**: Dimensionality reduction for visualization
- **Files**: analysis/clustering.py
### matplotlib
- **Required**: No
- **Fallback**: Plotly or raw output
- **Usage**: Static charting
- **Files**: Various analysis scripts
## ML Dependencies
### sklearn
- **Required**: Yes
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler
- **Files**: analysis/clustering.py, similarity/compute.py
### scipy
- **Required**: Yes
- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py
### numpy
- **Required**: Yes
- **Usage**: Array operations, linear algebra
- **Files**: Throughout codebase
## Key Imports by File
### explorer.py
- `import streamlit as st`
- `from database import db`
- `from explorer_helpers import *`
### explorer_helpers.py
- `import pandas as pd`
- `import plotly.graph_objects as go`
- `from database import db` (optional, for type hints)
### database.py
- `import ibis`
- `import duckdb`
- `from config import config, PARTY_COLOURS`
### config.py
- `from dataclasses import dataclass, field`
- `import streamlit as st` (optional, for warnings)
## Singleton Instances
| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |
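The singletons above are created at module import time; a minimal sketch of that wiring (the table's names are kept, the bodies and values are assumptions, not the actual repo code):

```python
# config.py (sketch)
from dataclasses import dataclass

@dataclass
class Config:
    database_path: str = "motions.duckdb"  # illustrative default

config = Config()                      # module-level singleton
PARTY_COLOURS = {"PVV": "#003366"}     # illustrative entry only

# database.py (sketch)
class MotionDatabase:
    def __init__(self, db_path: str):
        self.db_path = db_path

db = MotionDatabase(config.database_path)  # imported elsewhere as `from database import db`
```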

@ -1,78 +0,0 @@
# Dependencies
## Core Library Wiring
### Database Layer
```
ibis → DuckDB → MotionDatabase singleton (database.py)
sqlglot (ibis dependency)
```
### Data Processing
```
pandas → (used throughout for DataFrame operations)
numpy → (used by sklearn, scipy, umap)
scipy → spatial.procrustes for window alignment
```
### ML Pipeline
```
sklearn.cluster → KMeans, Procrustes
sklearn.preprocessing → StandardScaler
umap → UMAP (optional, graceful fallback)
```
### Visualization
```
plotly → explorer_helpers.py chart builders
st.plotly_chart → explorer.py rendering
```
### Streamlit
```
streamlit → all pages, @st.cache_data decorators
```
## Optional Dependencies
| Package | Required | Fallback |
|---------|----------|----------|
| `umap` | No | Use raw SVD vectors (first 2 dims) |
| `plotly` | Yes | Raises ImportError |
| `duckdb` | Yes | — |
| `ibis` | Yes | — |
| `sklearn` | Yes | — |
## Singleton Instances
| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |
## Key Imports by File
```
explorer.py:
- import streamlit as st
- from database import db
- from explorer_helpers import *
explorer_helpers.py:
- import pandas as pd
- import plotly.graph_objects as go
- from database import db (optional, for type hints)
database.py:
- import ibis
- import duckdb
- from config import config, PARTY_COLOURS
config.py:
- from dataclasses import dataclass, field
- import streamlit as st (optional, for warnings)
```
## Environment
- Python ≥3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)

@ -0,0 +1,146 @@
---
title: Domain Glossary
category: domain
---
# Domain Glossary - Dutch Political Terms
## CRITICAL INVARIANTS
> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT
> - Individual right-wing parties may vary slightly from the centroid
> - This is non-negotiable for any compass/axis visualization
> **Rule 2**: SVD labels are empirically derived from voting data
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative)
> - See SVD Label Derivation section below
---
## SVD Label Derivation
### The Process
SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis:
1. **Identify outliers**: Find the two MPs with most extreme positions on that axis
2. **Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes)
3. **Interpret theme**: Read the motion titles to derive what the axis represents
4. **Assign label**: the label describes the empirical theme; it could be:
   - Left-Right
   - Coalition-Opposition
   - Progressive-Conservative
   - EU-National sovereignty
   - Populist-Establishment
   - Or whatever the voting patterns show
### Example
| Step | Description |
|------|-------------|
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 |
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 |
| 20 Motions | Immigration, integration, law & order themes dominate |
| Label | "Links-Rechts" (Left-Right) |
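A rough sketch of the outlier and motion-selection steps on a vote matrix (a hypothetical helper, not the pipeline's actual implementation; it shows only the disagreement half of the 10+10 selection described above):

```python
import numpy as np

def representative_motions(vote_matrix: np.ndarray, axis_scores: np.ndarray,
                           motion_ids: list[int], n: int = 20) -> list[int]:
    """Pick the motions on which the two most extreme MPs on an axis disagree most.

    vote_matrix: MPs x motions matrix of votes (+1 / 0 / -1); axis_scores: per-MP SVD coordinate.
    """
    outlier_pos = int(np.argmax(axis_scores))   # most extreme positive MP
    outlier_neg = int(np.argmin(axis_scores))   # most extreme negative MP
    disagreement = np.abs(vote_matrix[outlier_pos] - vote_matrix[outlier_neg])
    top = np.argsort(disagreement)[::-1][:n]    # motions with the sharpest disagreement
    return [motion_ids[i] for i in top]
```

The returned motion titles are then read manually to interpret the axis theme.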
### Labeling Rules
- **Never use party names in labels** (e.g., not "PVV-SP axis")
- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show)
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy")
- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2"
---
## Core Entities
### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**
### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector
### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party
### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors
---
## Time & Analysis Concepts
### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window
### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering
---
## Mathematical / Algorithmic Terms
### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space
### SVD Label
- Empirically derived axis label based on outlier MPs and representative motions
- Describes the theme of disagreement on that axis
- NOT based on party ideology or semantic labels
### Political Compass
- 2D visualization with SVD axes mapped to compass quadrants
- X-axis: First SVD dimension (labeled from voting data)
- Y-axis: Second SVD dimension (labeled from voting data)
### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters
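For reference, the scipy routine can be applied per window roughly like this (a sketch under the assumption that both windows contain the same entities in the same row order; the pipeline's actual alignment code may differ):

```python
import numpy as np
from scipy.spatial import procrustes

def align_windows(reference: np.ndarray, current: np.ndarray) -> np.ndarray:
    """Align the current window's (n_entities, 2) SVD positions to a reference window."""
    _, aligned_current, disparity = procrustes(reference, current)
    return aligned_current  # standardized and rotated to best match the reference
```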
### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency with graceful SVD fallback
---
## Database Table Reference
| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |
---
## Dutch Political Parties
### Canonical Right-Wing (centroid on RIGHT of axes)
- PVV (Partij voor de Vrijheid)
- FVD (Forum voor Democratie)
- JA21
- SGP (Staatkundig Gereformeerde Partij)
### Other Major Parties
- VVD (Volkspartij voor Vrijheid en Democratie)
- GL-PvdA (GroenLinks-PvdA)
- NSC (Nieuw Sociaal Contract)
- BBB (BoerBurgerBeweging)
- SP (Socialistische Partij)
- D66 (Democraten 66)

@ -1,107 +0,0 @@
# Domain Glossary - Dutch Political Terms
## Core Entities
### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**
### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector
- Historical: `mp_party_history` tracks party changes over time
### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party
- Aliases: multiple spelling variants exist (see anti-patterns.yaml)
### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors
---
## Time & Analysis Concepts
### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window
- Windows can be aligned across time using Procrustes
### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering
- Used for trend analysis in Evolution tab
---
## Mathematical / Algorithmic Terms
### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space
- `entity_id` in `svd_vectors`: either MP name (when individual MPs) or party name (when party-level)
### Political Compass
- 2D visualization: X-axis = Left↔Right, Y-axis = Progressive↔Conservative
- SVD vectors mapped to compass quadrants
- UMAP used for projection
### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters
- Implemented via `scipy.spatial.procrustes` or scikit-learn
### Centroid
- Geometric center of a set of points
- Party centroid = average SVD position of all MPs in that party
- Computed from `svd_vectors` filtered by party
### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency — graceful fallback if unavailable
---
## Visualization
### PARTY_COLOURS
- Dict mapping party names to hex color codes
- Used in all Plotly charts for consistent party coloring
- Source: `config.py` → `PARTY_COLOURS` constant
- **Issue**: 3 separate alias dictionaries exist (no single source of truth)
---
## Application Pages
### Home
- Landing page with app overview
### Stemwijzer (Quiz)
- User answers questions → matched to parties
- Thin wrapper around quiz module
### Explorer (4 tabs)
- **Motion tab**: SVD positions colored by vote on selected motion
- **MP tab**: Individual MP trajectories across windows
- **Party tab**: Party centroids with members as scatter
- **Evolution tab**: How positions change over time
---
## Database Table Reference
| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `party_centroids` | party, window, centroid_2d |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |

@ -1,3 +1,7 @@
# stemwijzer Mind Model - Manifest
# Generated: 2026-04-12
# Phase: 2 - Assembly from Phase 1 Analysis
name: stemwijzer
version: 2
description: Dutch political voting compass (Stemwijzer) - Mind Model constraints
@ -7,39 +11,54 @@ categories:
- path: system.md
description: System overview and architecture summary
group: docs
- path: tech-stack.yaml
- path: stack/stack.md
description: Technology stack with versions and purposes
group: docs
- path: conventions.yaml
description: Coding conventions and style guide
group: docs
- path: domain.yaml
description: Domain entities, terms, and relationships
group: docs
group: stack
- path: domain/domain-glossary.md
description: Domain entities, terms, relationships, and CRITICAL INVARIANTS
group: domain
# Design patterns
- path: patterns/architecture.yaml
description: Repository, Facade, Pipeline architectural patterns
- path: patterns/patterns.yaml
description: Code patterns (Singleton, Repository, Pipeline, etc.)
group: patterns
- path: patterns/python.yaml
description: Python-specific patterns (Singleton, dataclass, context manager)
- path: patterns/streamlit.yaml
description: Streamlit-specific patterns (session state, cache)
group: patterns
- path: patterns/api.yaml
description: API client patterns with retry and pagination
group: patterns
- path: patterns/database.yaml
description: DuckDB connection patterns and ORM usage
description: DuckDB patterns and connection management
group: patterns
- path: patterns/api.yaml
description: API client patterns with retry logic and pagination
- path: patterns/python.yaml
description: Python-specific patterns (dataclass, typing)
group: patterns
- path: patterns/streamlit.yaml
description: Streamlit session state and page patterns
- path: patterns/duckdb-access.md
description: DuckDB connection patterns and best practices
group: patterns
- path: patterns/embeddings-similarity.md
description: Embeddings and similarity computation patterns
group: patterns
- path: patterns/error-handling.md
description: Error handling and exception patterns
group: patterns
- path: patterns/module-singletons.md
description: Module-level singleton patterns
group: patterns
- path: patterns/requests-http.md
description: HTTP client patterns with retry
group: patterns
- path: patterns/validation.md
description: Input validation patterns
group: patterns
# Coding constraints
- path: constraints/error-handling.yaml
- path: constraints/error-handling.md
description: Error handling patterns with safe fallbacks
group: constraints
- path: constraints/logging.yaml
description: Logging conventions and best practices
- path: constraints/logging.md
description: Logging conventions
group: constraints
- path: constraints/naming.yaml
description: File, class, function naming rules
@ -50,25 +69,40 @@ categories:
- path: constraints/types.yaml
description: Type hint conventions
group: constraints
- path: constraints/testing.yaml
description: Testing conventions
group: constraints
# Anti-patterns
- path: anti-patterns/anti-patterns.md
description: Known anti-patterns with evidence and fixes
group: anti-patterns
# Dependencies
- path: dependencies/dependencies.md
description: Library usage and singleton instances
group: dependencies
# Code examples
- path: examples/database-example.py
description: MotionDatabase usage example
description: MotionDatabase usage examples
group: examples
- path: examples/api-client-example.py
description: TweedeKamerAPI usage
description: TweedeKamerAPI usage examples
group: examples
- path: examples/pipeline-example.py
description: Pipeline phase example
description: Pipeline orchestration examples
group: examples
- path: examples/streamlit-page-example.py
description: Streamlit page pattern
description: Streamlit page patterns
group: examples
- path: examples/pattern-examples.md
description: Consolidated pattern examples
group: examples
# Anti-patterns and workflows
- path: anti-patterns.yaml
description: Known anti-patterns to avoid
group: meta
- path: workflows.yaml
description: Key workflows (VotingSession, DataIngestion, EmbeddingGeneration)
group: meta
# Phase 1 findings summary:
# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN)
# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline
# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except
# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons
# - 3 groups: stdlib, 3rd party, local imports

@ -0,0 +1,79 @@
---
title: DuckDB Access Pattern
category: patterns
---
# DuckDB Access Pattern
## Rules
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle.
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).
## Examples
### database.py - Explicit connect/close for schema init
```python
conn = duckdb.connect(self.db_path)
...
conn.execute("""
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.close()
```
### pipeline/svd_pipeline.py - Read-only connection
```python
conn = duckdb.connect(db_path, read_only=True)
try:
rows = conn.execute(
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
(start_date, end_date),
).fetchall()
finally:
conn.close()
```
### similarity/compute.py - Preferred 'with' context
```python
try:
import duckdb
except Exception:
logger.exception("duckdb import failed; cannot load vectors")
return 0
with duckdb.connect(db.db_path) as conn:
rows = conn.execute(query, params).fetchall()
```
## Anti-Patterns
### Bad: Connection without closure
```python
# BAD: connection may leak if exception occurs before explicit close
conn = duckdb.connect(db_path)
rows = conn.execute("SELECT ...").fetchall()
# missing finally/close
```
**Remediation**: Use "with" context or ensure conn.close() in finally block.
### Bad: Parallel write connections
**Problem**: Opening write connections from many parallel workers without coordination.
**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.
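A minimal sketch of that remediation, not code from the repository: worker and results-table names are hypothetical, workers open read-only connections, and only the parent process writes through one short-lived connection.
```python
import duckdb
from concurrent.futures import ProcessPoolExecutor


def compute_window(args: tuple[str, str, str]) -> list[tuple]:
    """Read-only worker; several of these can run concurrently against the same file."""
    db_path, start_date, end_date = args
    with duckdb.connect(db_path, read_only=True) as conn:
        return conn.execute(
            "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
            (start_date, end_date),
        ).fetchall()


def run_windows(db_path: str, windows: list[tuple[str, str]]) -> None:
    with ProcessPoolExecutor() as pool:
        all_rows = list(pool.map(compute_window, [(db_path, s, e) for s, e in windows]))
    # Single writer: only the parent process opens a write connection, and only briefly.
    with duckdb.connect(db_path) as conn:
        for rows in all_rows:
            if rows:
                conn.executemany(
                    "INSERT INTO votes_by_window VALUES (?, ?, ?)",  # hypothetical results table
                    rows,
                )
```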

@ -1,70 +0,0 @@
name: duckdb_access
rules:
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle.
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).
examples:
- path: database.py
excerpt: |
```python
conn = duckdb.connect(self.db_path)
...
conn.execute("""
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.close()
```
note: explicit connect/close used when initializing schema
- path: pipeline/svd_pipeline.py
excerpt: |
```python
conn = duckdb.connect(db_path, read_only=True)
try:
rows = conn.execute(
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
(start_date, end_date),
).fetchall()
finally:
conn.close()
```
note: read_only connection used for compute-heavy worker
- path: similarity/compute.py
excerpt: |
```python
try:
import duckdb
except Exception:
logger.exception("duckdb import failed; cannot load vectors")
return 0
with duckdb.connect(db.db_path) as conn:
rows = conn.execute(query, params).fetchall()
```
note: preferred 'with' context for automatic close
anti_patterns:
- Bad: creating a connection without closure in a long-running process
remediation: use "with" context or ensure conn.close() in finally block
example: |
```python
# BAD: connection may leak if exception occurs before explicit close
conn = duckdb.connect(db_path)
rows = conn.execute("SELECT ...").fetchall()
# missing finally/close
```
- Bad: Opening write connections from many parallel workers without coordination
remediation: open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.

@ -0,0 +1,74 @@
---
title: Embeddings Similarity Pipeline
category: patterns
---
# Embeddings Similarity Pipeline
## Rules
- Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure.
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically the concatenation [svd + text].
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
- Use read_only DuckDB connections in compute workers to allow parallel runs.
## Examples
### pipeline/ai_provider_wrapper.py - Batched embed + fallback
```python
for start in range(0, len(texts), batch_size):
chunk = texts[start : start + batch_size]
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
...
for j in range(i, end):
t = texts[j]
single, single_exc = _attempt_batch([t], j)
if single:
results[j] = single[0]
```
### pipeline/fusion.py - Concatenation and storage
```python
try:
svd_vec = json.loads(svd_json)
except Exception:
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
skipped_missing_svd += 1
continue
...
fused = list(svd_vec) + list(text_vec)
res = db.store_fused_embedding(
int(entity_id),
window_id,
fused,
svd_dims=len(svd_vec),
text_dims=len(text_vec),
)
```
### similarity/compute.py - Normalized cosine similarity
```python
# Normalize rows
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
sim = normalized @ normalized.T
...
# pick top-k neighbors and write to similarity_cache
```
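The elided top-k step can look like the following sketch; it is illustrative, not the actual compute.py code, and assumes the `sim` matrix and motion-id ordering from the excerpt above.
```python
import numpy as np


def top_k_neighbors(sim: np.ndarray, motion_ids: list[int], k: int = 10) -> dict[int, list[tuple[int, float]]]:
    """For each row of the similarity matrix, keep the k most similar other rows (illustrative)."""
    neighbors: dict[int, list[tuple[int, float]]] = {}
    for i, motion_id in enumerate(motion_ids):
        order = np.argsort(sim[i])[::-1]           # most similar first
        order = [j for j in order if j != i][:k]   # drop self-similarity, keep top-k
        neighbors[motion_id] = [(motion_ids[j], float(sim[i, j])) for j in order]
    return neighbors
```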
## Anti-Patterns
### Bad: Assuming consistent vector length
**Problem**: Assuming consistent vector length without checks leads to shape errors.
**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
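A minimal sketch of that padding remediation follows; the function name and signature are illustrative rather than taken from compute.py.
```python
import logging

import numpy as np

_logger = logging.getLogger(__name__)


def pad_to_matrix(vectors: list[list[float]]) -> np.ndarray:
    """Zero-pad vectors of unequal length so they can be stacked into one matrix (illustrative)."""
    max_dim = max(len(v) for v in vectors)
    if any(len(v) != max_dim for v in vectors):
        _logger.warning("Inconsistent vector lengths; padding all vectors to %d dims", max_dim)
    matrix = np.zeros((len(vectors), max_dim))
    for i, vec in enumerate(vectors):
        matrix[i, : len(vec)] = vec
    return matrix
```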
### Bad: Inline heavy computation in UI
**Problem**: Recomputing heavy pipelines inline in UI requests.
**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI.

@ -1,63 +0,0 @@
name: embeddings_similarity_pipeline
rules:
- Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure.
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text].
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
- Use read_only DuckDB connections in compute workers to allow parallel runs.
examples:
- path: pipeline/ai_provider_wrapper.py
excerpt: |
```python
for start in range(0, len(texts), batch_size):
chunk = texts[start : start + batch_size]
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
...
for j in range(i, end):
t = texts[j]
single, single_exc = _attempt_batch([t], j)
if single:
results[j] = single[0]
```
note: batched embed + fallback per-item retry
- path: pipeline/fusion.py
excerpt: |
```python
try:
svd_vec = json.loads(svd_json)
except Exception:
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
skipped_missing_svd += 1
continue
...
fused = list(svd_vec) + list(text_vec)
res = db.store_fused_embedding(
int(entity_id),
window_id,
fused,
svd_dims=len(svd_vec),
text_dims=len(text_vec),
)
```
note: concatenation of vectors and storage via MotionDatabase
- path: similarity/compute.py
excerpt: |
```python
# Normalize rows
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
sim = normalized @ normalized.T
...
# pick top-k neighbors and write to similarity_cache
```
note: numeric pipeline and padding to consistent dimensionality
anti_patterns:
- Bad: Assuming consistent vector length without checks (leads to shape errors).
remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
- Bad: Recomputing heavy pipelines inline in UI requests.
remediation: schedule heavy work in scripts/subprocesses and read precomputed results in UI.

@ -0,0 +1,63 @@
---
title: Error Handling Pattern
category: patterns
---
# Error Handling Pattern
## Rules
- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
- Prefer logging.exception when catching an exception where stack trace is useful.
- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context.
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented.
## Examples
### ai_provider.py - Network error to ProviderError
```python
except requests.ConnectionError as exc:
if attempt == retries:
raise ProviderError(
f"Connection error when calling provider: {exc}"
) from exc
...
```
### pipeline/ai_provider_wrapper.py - Best-effort with logging
```python
except Exception:
_logger.exception("Failed to append audit event for embedding failure")
results[j] = None
```
### similarity/compute.py - Defensive import handling
```python
try:
import duckdb
except Exception:
logger.exception("duckdb import failed; cannot load vectors")
return 0
```
## Anti-Patterns
### Bad: Silent exception swallowing
```python
try:
do_work()
except Exception:
return []
# BAD: hides the root cause and returns an ambiguous default
```
**Remediation**: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled.
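A hedged sketch of the remediated version, reusing the `do_work()` placeholder from the bad example and assuming `ProviderError` is importable from `ai_provider.py`:
```python
import logging

from ai_provider import ProviderError  # assumed import path

_logger = logging.getLogger(__name__)


def fetch_items() -> list:
    try:
        return do_work()
    except ValueError:
        # Expected, narrow failure: log with traceback and return the documented safe default.
        _logger.exception("do_work() rejected its input; returning empty list")
        return []
    except Exception as exc:
        # Unexpected failure: convert to a domain error instead of hiding it.
        raise ProviderError(f"do_work() failed: {exc}") from exc
```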
### Bad: Mixing print() and logging
**Problem**: Mixing print() and logging for errors.
**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration.
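A minimal logging-setup sketch; names and the format string are illustrative, and the project's actual conventions live in `constraints/logging.md`:
```python
import logging


def configure_logging(level: int = logging.INFO) -> None:
    """One-time setup at the application entry point; modules then use logging.getLogger(__name__)."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )


_logger = logging.getLogger(__name__)


def report_progress(count: int) -> None:
    # Instead of: print(f"Fetched {count} voting records")
    _logger.info("Fetched %d voting records", count)
```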

@ -1,54 +0,0 @@
name: error_handling
rules:
- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
- Prefer logging.exception when catching an exception where stack trace is useful.
- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context.
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented.
examples:
- path: ai_provider.py
excerpt: |
```python
except requests.ConnectionError as exc:
if attempt == retries:
raise ProviderError(
f"Connection error when calling provider: {exc}"
) from exc
...
```
note: mapping network error to ProviderError with re-raise chaining
- path: pipeline/ai_provider_wrapper.py
excerpt: |
```python
except Exception:
_logger.exception("Failed to append audit event for embedding failure")
results[j] = None
```
note: logs and assigns None for failure; fallback behavior documented earlier in wrapper rule
- path: similarity/compute.py
excerpt: |
```python
try:
import duckdb
except Exception:
logger.exception("duckdb import failed; cannot load vectors")
return 0
```
note: defensive import handling and early return on failure
anti_patterns:
- Bad: Broad except without logging and without re-raising (silently hides bugs)
remediation: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled.
example: |
```python
try:
do_work()
except Exception:
return []
# BAD: hides the root cause and returns an ambiguous default
```
- Bad: Mixing print() and logging for errors
remediation: Replace print() calls with logger.* calls; use structured logging configuration.

@ -0,0 +1,41 @@
---
title: Module Singletons Pattern
category: patterns
---
# Module Singletons Pattern
## Rules
- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
  - Avoid expensive initialization at import time.
  - Provide a way to construct with a test DB path or to reinitialize in tests.
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.
## Examples
### database.py - Safe class initialization
```python
class MotionDatabase:
def __init__(self, db_path: str = config.DATABASE_PATH):
self.db_path = db_path
# If duckdb is not available, operate in lightweight file-backed mode
self._file_mode = duckdb is None
self._init_database()
```
### similarity/lookup.py - Local instances
```python
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
if hasattr(db, "get_cached_similarities"):
rows = db.get_cached_similarities(...)
```
## Anti-Patterns
### Bad: Heavy initialization at import time
**Problem**: Creating connections and performing heavy schema migrations during import.
**Remediation**: Move heavy init to an explicit initialize() method and keep import fast.
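A hedged sketch of that lazy-initialization shape; method names are illustrative and MotionDatabase's real constructor differs:
```python
class MotionDatabase:
    def __init__(self, db_path: str = "motions.duckdb"):
        # Cheap at import time: only record configuration.
        self.db_path = db_path
        self._initialized = False

    def initialize(self) -> None:
        # Expensive work (connections, schema creation, migrations) happens only on demand.
        if self._initialized:
            return
        self._init_schema()
        self._initialized = True

    def _init_schema(self) -> None:
        ...  # create tables, run migrations


# The module-level singleton stays cheap to import; callers run db.initialize() explicitly.
db = MotionDatabase()
```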

@ -1,33 +0,0 @@
name: module_singletons
rules:
- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
- Avoid expensive initialization at import time.
- Provide a way to construct with a test DB path or to reinitialize in tests.
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.
examples:
- path: database.py
excerpt: |
```python
class MotionDatabase:
def __init__(self, db_path: str = config.DATABASE_PATH):
self.db_path = db_path
# If duckdb is not available, operate in lightweight file-backed mode
self._file_mode = duckdb is None
self._init_database()
```
note: class is safe to instantiate and creates DB at init; consider lazy init if heavy
- path: similarity/lookup.py
excerpt: |
```python
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
if hasattr(db, "get_cached_similarities"):
rows = db.get_cached_similarities(...)
```
note: consumers create local MotionDatabase instances, not relying on a single global
anti_patterns:
- Bad: Creating connections and performing heavy schema migrations during import
remediation: Move heavy init to an explicit initialize() method and keep import fast.

@ -0,0 +1,77 @@
---
title: Requests HTTP Pattern
category: patterns
---
# Requests HTTP Pattern
## Rules
- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase.
## Examples
### ai_provider.py - 429 handling with Retry-After
```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
if attempt == retries:
raise ProviderError(f"Provider returned HTTP {resp.status_code}")
retry_after = None
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
if raw:
try:
retry_after = int(raw)
except Exception:
...
if retry_after is not None:
time.sleep(retry_after)
continue
```
### api_client.py - Session + raise_for_status
```python
response = self.session.get(
base_url, params=params, timeout=config.API_TIMEOUT
)
response.raise_for_status()
data = response.json()
```
### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper
```python
def _attempt_batch(chunk_texts, start_index):
backoff = 0.5
for attempt in range(1, retries + 1):
try:
emb_chunk = _embedder(
chunk_texts, model=model, batch_size=len(chunk_texts)
)
return emb_chunk, None
except Exception as exc:
if attempt == retries:
break
sleep = backoff * (2 ** (attempt - 1))
time.sleep(sleep)
continue
```
## Anti-Patterns
### Bad: Silent exception swallowing
**Problem**: Blindly catching all requests exceptions and returning an empty response.
**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log details.
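A minimal sketch of that retryable-versus-terminal split; the import path for ProviderError is an assumption:
```python
import logging

import requests

from ai_provider import ProviderError  # assumed import path

_logger = logging.getLogger(__name__)


def is_retryable(exc: Exception) -> bool:
    """Return True for transient failures, raise ProviderError for terminal ones."""
    if isinstance(exc, (requests.ConnectionError, requests.Timeout)):
        _logger.warning("Transient network failure: %s", exc)
        return True
    if isinstance(exc, requests.HTTPError):
        status = exc.response.status_code if exc.response is not None else 0
        if status == 429 or status >= 500:
            return True
        raise ProviderError(f"Provider returned HTTP {status}") from exc
    raise ProviderError(f"Unexpected HTTP failure: {exc}") from exc
```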
### Bad: Using print() for errors
**Problem**: Using print() for network errors instead of structured logging.
**Remediation**: Use `_logger.exception()` instead (api_client.py still uses print() and needs this fix).

@ -1,65 +0,0 @@
name: requests_http
rules:
- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase.
examples:
- path: ai_provider.py
excerpt: |
```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
if attempt == retries:
raise ProviderError(f"Provider returned HTTP {resp.status_code}")
retry_after = None
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
if raw:
try:
retry_after = int(raw)
except Exception:
...
if retry_after is not None:
time.sleep(retry_after)
continue
```
note: explicit handling of 429 and Retry-After
- path: api_client.py
excerpt: |
```python
response = self.session.get(
base_url, params=params, timeout=config.API_TIMEOUT
)
response.raise_for_status()
data = response.json()
```
note: uses session + raise_for_status() to surface HTTP errors
- path: pipeline/ai_provider_wrapper.py
excerpt: |
```python
def _attempt_batch(chunk_texts, start_index):
backoff = 0.5
for attempt in range(1, retries + 1):
try:
emb_chunk = _embedder(
chunk_texts, model=model, batch_size=len(chunk_texts)
)
return emb_chunk, None
except Exception as exc:
if attempt == retries:
break
sleep = backoff * (2 ** (attempt - 1))
time.sleep(sleep)
continue
```
note: wrapper adds retry/backoff and per-item fallback
anti_patterns:
- Bad: Blindly catching all requests exceptions and returning empty response
remediation: map network exceptions to retryable vs terminal (ProviderError) and log details.
- Bad: Using print() for network errors instead of structured logging (see api_client.py where print() is used; prefer logging).

@ -0,0 +1,37 @@
---
title: Validation Pattern
category: patterns
---
# Validation Pattern
## Rules
- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
- Tests should assert that invalid inputs raise the expected exceptions.
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).
## Examples
### ai_provider.py - Type validation
```python
if not isinstance(text, str):
raise ProviderError("text must be a string")
```
### pipeline/ai_provider_wrapper.py - Defensive empty handling
```python
if not texts:
return []
if motion_ids is None:
motion_ids = [None for _ in texts]
```
## Anti-Patterns
### Bad: Invalid values propagating into computation
**Problem**: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline).
**Remediation**: Fail fast with a typed exception and add unit tests to cover validations.
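A hedged test sketch using pytest; the function under test and the import path are assumptions, since the excerpt above does not show the function name:
```python
import pytest

from ai_provider import ProviderError, get_embedding  # assumed names and import path


def test_embedding_rejects_non_string_input():
    # Must fail fast with the typed exception, never reach the network.
    with pytest.raises(ProviderError):
        get_embedding(12345)
```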

@ -1,29 +0,0 @@
name: validation
rules:
- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
- Tests should assert that invalid inputs raise the expected exceptions.
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).
examples:
- path: ai_provider.py
excerpt: |
```python
if not isinstance(text, str):
raise ProviderError("text must be a string")
```
note: explicit type validation before network call
- path: pipeline/ai_provider_wrapper.py
excerpt: |
```python
if not texts:
return []
if motion_ids is None:
motion_ids = [None for _ in texts]
```
note: defensive handling of empty inputs
anti_patterns:
- Bad: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline).
remediation: Fail fast with a typed exception and add unit tests to cover validations.

@ -0,0 +1,67 @@
---
title: Tech Stack
category: stack
---
# Tech Stack
## Runtime & Language
- **Python >=3.13**
## Web Framework
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages
## Data Layer
- **DuckDB** - Embedded OLAP database
- Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata
- **ibis** - ORM (listed as a dependency, but the DuckDB-native API is used in practice)
## AI / LLM
- **OpenRouter** - API abstraction for AI providers
- **QWEN** - Primary model
- Embeddings: `qwen/qwen3-embedding-4b`
- Chat: `qwen/qwen-2.5-72b-instruct`
- **requests** - HTTP client (not raw openai)
## ML / Analytics
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD)
- **numpy** - Numerical computing
## Visualization
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback)
- **matplotlib** - Static plotting (optional)
## HTTP & Parsing
- **requests** - Session pooling, retry with backoff
- **beautifulsoup4** - HTML parsing
- **lxml** - XML/HTML processing
## Key Source Files
| File | Purpose |
|------|---------|
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema |
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) |
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders |
| `analysis/` | SVD pipeline, UMAP projection, clustering |
| `pipeline/` | Data fetch, transform, store pipeline |
| `pages/1_Stemwijzer.py` | Quiz page |
| `pages/2_Explorer.py` | Explorer page |
| `config.py` | Dataclass Config pattern |
| `ai_provider.py` | OpenRouter API wrapper with retry |
| `api_client.py` | TweedeKamer OData API client |
## Singleton Instances
| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |
## Environment
- Python >=3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)
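A minimal sketch of the dataclass-config idea above, assuming `.env` has already been loaded (for example via python-dotenv); field names and defaults are illustrative, not the project's actual `config.py`:
```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Config:
    database_path: str = field(default_factory=lambda: os.getenv("DATABASE_PATH", "motions.duckdb"))
    api_timeout: int = field(default_factory=lambda: int(os.getenv("API_TIMEOUT", "30")))
    openrouter_api_key: str = field(default_factory=lambda: os.getenv("OPENROUTER_API_KEY", ""))


config = Config()  # module-level singleton, mirroring `from config import config`
```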

@ -1,41 +0,0 @@
# Tech Stack
## Runtime & Language
- **Python ≥3.13** (type: runtime)
- Streamlit (type: web framework) - multi-page app: Home, Stemwijzer, Explorer (4 tabs)
## Data Layer
- **DuckDB** (type: database) - 9 tables: motions, mp_votes, svd_vectors, mp_party_history, etc.
- **ibis** (type: ORM) - DuckDB backend for Pythonic SQL
- Query mode: duckdb:// path or :memory: (see database.py:50-51)
## ML / Analytics
- **scikit-learn** (type: ML) - clustering, Procrustes alignment
- **UMAP** (type: dimensionality reduction) - 2D political compass projection
- **scipy** (type: scientific computing) - spatial/alignment algorithms
- **numpy** (type: numerical computing) - array operations
## Visualization
- **Plotly** (type: charting) - dual-layer interactive charts (scatter + annotations)
## Key Source Files
| File | Purpose |
|------|---------|
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema |
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) |
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders, coordinate computation |
| `analysis/` | SVD pipeline, UMAP projection, clustering algorithms |
| `pipeline/` | Data fetch → transform → store pipeline |
| `pages/1_🗳_Stemwijzer.py` | Quiz page (thin wrapper) |
| `pages/2_🔍_Explorer.py` | Explorer page (thin wrapper) |
| `config.py` | Dataclass Config pattern |
## Database Tables
- `motions` - parliamentary motions with id, title, date, category
- `mp_votes` - individual MP votes on motions (1/0/-1)
- `svd_vectors` - SVD-computed political positions (entity_id, window, vector_2d)
- `mp_party_history` - MP-to-party mappings over time
- `party_centroids` - aggregated party positions
- `windows` - time period definitions
- `mp_trajectories` - MP position changes across windows
- Plus 2 additional tables (exact names vary)

@ -21,7 +21,7 @@ TweedeKamer OData API
├── text_pipeline # AI embeddings via OpenRouter
└── fusion # Combine SVD + text vectors
Streamlit Web App (app.py, pages/)
Streamlit Web App (Home.py, pages/)
├── Home.py # Landing page
├── 1_Stemwijzer.py # Voting quiz
└── 2_Explorer.py # Political compass explorer
@ -36,34 +36,53 @@ TweedeKamer OData API
| **AI Provider** | OpenRouter API for embeddings/summaries | `ai_provider.py` |
| **Pipeline** | Orchestrated data processing | `pipeline/run_pipeline.py` |
| **Analysis** | SVD, clustering, trajectory computation | `analysis/*.py` |
| **Similarity** | Motion similarity search | `similarity/*.py` |
| **Web App** | Streamlit UI | `app.py`, `pages/*.py` |
### Data Models
**Core Entities**:
- `Motion`: Parliamentary motion with voting results
- `MP` / `MPMetadata`: Member of Parliament with party/tenure
- `MPVote`: Individual vote record (Voor/Tegen/Onthouden/Geen stem/Afwezig)
- `Party`: Political party
- `UserSession` / `UserVote`: Voting session tracking
- `SVDVector`: Dimensionality-reduced vote vectors
- `FusedEmbedding`: Combined SVD + text embedding
- `SimilarityCache`: Pre-computed motion similarities
### Technical Decisions
1. **DuckDB over SQLite**: Chosen for OLAP performance with complex analytical queries
2. **ibis ORM**: Database-agnostic query building (currently using DuckDB backend)
3. **SVD + Procrustes**: Aligns voting vectors across time windows
4. **UMAP for visualization**: Non-linear dimensionality reduction for compass display
5. **OpenRouter API**: Abstraction layer for AI embeddings (currently using Qwen)
6. **Module-level singletons**: `db = MotionDatabase()` pattern for shared state
### Key Conventions
- **DuckDB connections**: Short-lived per method, always close
- **Error handling**: Catch `Exception`, return safe fallbacks (False/[]/None)
- **Logging**: Use `logging.getLogger(__name__)` - avoid print()
- **Type hints**: Required on public functions with typing module imports
- **Config**: Dataclass `Config` in `config.py`, accessed as `from config import config`
| **Explorer Helpers** | Pure functions, chart builders | `explorer_helpers.py` |
| **Web App** | Streamlit UI | `Home.py`, `pages/*.py` |
### Tech Stack
- **Language**: Python 3.13+
- **Web Framework**: Streamlit (multi-page app)
- **Database**: DuckDB with ibis ORM (DuckDB-native implementation)
- **ML/Analytics**: scipy (SVD, Procrustes), scikit-learn (KMeans, cosine_similarity), umap-learn (optional)
- **AI/LLM**: OpenRouter-compatible API (QWEN embeddings + chat)
- **Visualization**: Plotly (interactive charts), matplotlib (optional)
- **HTTP**: requests with Session pooling and retry
- **Parsing**: beautifulsoup4, lxml
### Key Patterns
1. **Module-Level Singletons**: `db = MotionDatabase()`, `config = Config()`
2. **Repository Pattern**: MotionDatabase class with method-per-query
3. **Service Layer**: TweedeKamerAPI, ai_provider with retry/backoff
4. **Pipeline Orchestration**: ThreadPoolExecutor for parallel SVD
5. **Short-Lived Connections**: DuckDB connections in try/finally blocks
6. **Graceful Degradation**: try/except around optional dependencies (see the sketch below)
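A minimal sketch of that graceful-degradation pattern for one optional dependency (umap-learn); the fallback projection is illustrative:
```python
import logging

import numpy as np

_logger = logging.getLogger(__name__)

try:
    import umap  # optional dependency
except Exception:
    umap = None
    _logger.warning("umap-learn not available; falling back to SVD projection")


def project_2d(matrix: np.ndarray) -> np.ndarray:
    """Project rows to 2D, preferring UMAP and degrading to plain SVD."""
    if umap is not None:
        return umap.UMAP(n_components=2).fit_transform(matrix)
    u, s, _ = np.linalg.svd(matrix - matrix.mean(axis=0), full_matrices=False)
    return u[:, :2] * s[:2]
```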
### Domain Invariants
**CRITICAL RULES** (from AGENTS.md):
1. **Right-wing parties on RIGHT**: PVV, FVD, JA21, SGP must appear on RIGHT side of all axes in visualizations
2. **SVD labels = voting patterns**: SVD labels reflect voting patterns, NOT semantic content
### Database Tables
| Table | Purpose |
|-------|---------|
| `motions` | Parliamentary motions with id, title, date, category |
| `mp_votes` | Individual MP votes on motions (Voor/Tegen/Onthouden) |
| `mp_metadata` | MP names, parties, tenure info |
| `svd_vectors` | 2D SVD-computed political positions per entity |
| `fused_embeddings` | Combined SVD + text embeddings |
| `embeddings` | Text embeddings for motions |
| `user_sessions` | Voting session tracking |
| `party_results` | Party match results per session |
### Conventions
- **Error Handling**: Catch `Exception`, return safe fallbacks (False/[]/None)
- **Logging**: Use `logging.getLogger(__name__)` — **never use print()**
- **Imports**: stdlib → 3rd party → local (3 groups)
- **Type Hints**: Required on public functions with typing module imports
- **DuckDB**: Short-lived connections with try/finally conn.close()
