Compare commits
56 Commits
ee8ffea6e2
...
be4375b303
@@ -1,43 +0,0 @@

# Known anti-patterns and recommended remediation (Phase 1 findings)

anti_patterns:
  - id: broad_except_swallows_errors
    description: "Wide except: clauses that swallow exceptions without logging or re-raising."
    examples:
      - path: multiple
        note: "Observed in various pipeline and ingestion spots where except Exception: returns a default without context."
    remediation:
      - "Replace broad except with specific exceptions."
      - "When a broad except is absolutely needed, call logger.exception(...) and re-raise or convert to a typed domain error."
      - "Add unit tests to ensure critical errors are visible in CI logs."

  - id: mixed_print_and_logging
    description: "Mixing print() and the logging module for errors and info messages."
    examples:
      - path: api_client.py
        excerpt: |
          ```python
          print(f"Fetched {len(voting_records)} voting records from API")
          ...
          except Exception as e:
              print(f"Error fetching motions from API: {e}")
          ```
    remediation:
      - "Use logging.getLogger(__name__) and logger.info/warning/exception consistently."
      - "Add a top-level logging configuration for Streamlit and scripts."

  - id: no_lockfile
    description: "No lockfile present -> unreproducible installs and CI unpredictability."
    remediation:
      - "Add a lockfile (poetry.lock, or a requirements.txt produced by pip-tools) and pin versions in CI."
      - "Make CI use the lockfile for reproducible builds."

  - id: declared_but_unused_dependency
    description: "Dependency declared but unused (openai in pyproject)."
    remediation:
      - "Either remove the dependency or add clear adapter code/tests that exercise it. Keep pyproject tidy."

  - id: brittle_identity_heuristics
    description: "Heuristics for MP identity (comma-based parsing) are brittle."
    remediation:
      - "Add robust parsing rules and unit tests; prefer canonical identifiers (persoon_id) where available."
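The broad-except remediation above (log with traceback, then convert to a typed domain error) can be sketched as follows; `IngestionError` and `fetch_records` are hypothetical names for illustration, not from the codebase:

```python
import logging

logger = logging.getLogger(__name__)

class IngestionError(Exception):
    """Typed domain error; the name is a hypothetical example."""

def fetch_records(source):
    try:
        return source()
    except (ValueError, KeyError) as exc:
        # Log with full traceback, then convert to the domain error
        logger.exception("Failed to fetch records")
        raise IngestionError(str(exc)) from exc
```

Callers now see one documented failure type instead of a silent default, and the traceback still lands in the logs.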
@@ -0,0 +1,127 @@

---
title: Anti-Patterns in Stemwijzer
category: anti-patterns
severity: critical
---

# Anti-Patterns

> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details.

## CRITICAL: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        # ...
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception:
        _logger.exception("Error fetching motions from API")
        return []
```

---

## CRITICAL: Global `_DummySt` Replacement

**File**: `explorer.py`
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement

**Problem**: Creates a module-level variable `st` that shadows the `streamlit` module, causing subtle bugs.

**Fix**: Use conditional flags instead of global replacement:
```python
# GOOD: Use conditional logic
try:
    import plotly.express as px
    import plotly.graph_objects as go
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    px = None
    go = None

def render_chart(data):
    if not HAS_PLOTLY:
        _logger.warning("Plotly not available")
        return
    # ... rest of chart logic
```

---

## WARNING: Logger Naming Inconsistency

**Evidence**: 16 files use `logger`, 17 files use `_logger`

**Files with `logger`** (no underscore):
- api_client.py, ai_provider.py, pipeline files, analysis files

**Files with `_logger`** (underscore prefix):
- database.py, explorer.py, explorer_helpers.py

**Recommendation**: Standardize on `_logger` for module-level loggers.
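A minimal sketch of the standardized form. The `basicConfig` settings are illustrative; configuration belongs at the application entrypoint, never in library modules:

```python
import logging

# One-time setup at the entrypoint (e.g. Home.py or a script)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Per-module convention: a single underscore-prefixed module-level logger
_logger = logging.getLogger(__name__)

def report_fetch(records):
    # Lazy %-formatting defers interpolation until the record is emitted
    _logger.info("Fetched %d voting records", len(records))
    return len(records)
```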

---

## WARNING: Bare except with pass

**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped: %s", exc)
```

---

## INVESTIGATED: Entity-ID / Party-Name Mismatch

**Status**: INVALID (investigated and resolved)

**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists.

---

## Pattern: Three Separate Party Alias Dictionaries

**Problem**: Party name variations exist in 3+ places with no canonical alias mapping.

**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    # ...
}
```
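A lookup helper built on that dict could then replace the scattered comparisons; `canonical_party` is a hypothetical name, and the mapping below repeats the two entries shown above:

```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
}

# Invert once at import time: alias -> canonical name
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in PARTY_ALIASES.items()
    for alias in aliases
}

def canonical_party(name: str) -> str:
    """Return the canonical party name; unknown names pass through unchanged."""
    return _ALIAS_TO_CANONICAL.get(name, name)
```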
@ -1,35 +0,0 @@ |
||||
# Architecture overview and confidence levels |
||||
|
||||
layers: |
||||
- name: ui |
||||
description: "Streamlit pages and app entrypoints (Home.py, pages/*)." |
||||
confidence: high |
||||
- name: ingestion |
||||
description: "API client and scrapers (api_client.py, scraper.py)." |
||||
confidence: high |
||||
- name: processing |
||||
description: "Pipelines for embeddings, SVD, fusion (pipeline/*, similarity/*)." |
||||
confidence: high |
||||
- name: storage |
||||
description: "DuckDB primary store; JSON fallback used in tests when duckdb missing." |
||||
confidence: high |
||||
- name: ai_provider |
||||
description: "Lightweight HTTP wrapper around OpenRouter/OpenAI-style backends in ai_provider.py." |
||||
confidence: medium |
||||
- name: orchestration |
||||
description: "Script-based orchestration (scripts/*.py), rerun_embeddings, scheduler." |
||||
confidence: medium |
||||
|
||||
organization: |
||||
- Keep UI code separated from heavy compute — Streamlit runs should avoid heavy compute inline (use subprocess or schedule). |
||||
- Pipelines are implemented as re-entrant functions returning summary dicts to facilitate testing and subprocess usage (seen in svd_pipeline.compute_svd_for_window). |
||||
- DB access is centralised via MotionDatabase helper (database.py) with convenience methods (store_fused_embedding, append_audit_event). |
||||
|
||||
design_decisions: |
||||
- Use DuckDB for local fast analytics storage; read_only connections used in compute stages to allow parallel workers. |
||||
- Embeddings and similarity cache are stored as JSON in DuckDB tables (vector columns). |
||||
- The ai_provider uses requests with retry/backoff rather than a heavy SDK to keep testing simple. |
||||
|
||||
confidence_summary: |
||||
overall_confidence: high |
||||
notes: "Phase 1 input inspected files across the repo; design mapping is consistent with code samples." |
||||
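The re-entrant, summary-dict phase shape described above can be sketched like this; the phase body and `_process` are illustrative stand-ins, not the real compute_svd_for_window:

```python
def _process(item):
    # Hypothetical per-item work; rejects empty items
    if item is None:
        raise ValueError("empty item")

def run_phase(window_id: str, items: list) -> dict:
    """Re-entrant phase: no global state, returns a summary dict."""
    processed = 0
    errors = 0
    for item in items:
        try:
            _process(item)
            processed += 1
        except ValueError:
            errors += 1
    # The summary dict makes the phase easy to assert on in tests and
    # easy to serialize when run in a subprocess.
    return {"window_id": window_id, "processed": processed, "errors": errors}
```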
@ -1,34 +0,0 @@ |
||||
# Naming & Style Conventions |
||||
|
||||
## Rules |
||||
- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py |
||||
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py) |
||||
- Classes: PascalCase. Evidence: MotionDatabase (database.py) |
||||
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred) |
||||
- Imports order: stdlib, third-party, local; prefer absolute imports and grouped. |
||||
- Use black, ruff, isort, mypy as the recommended toolchain; repository lacks config files (black, ruff, pyproject sections). |
||||
|
||||
## Examples |
||||
|
||||
### Function example (from pipeline/run_pipeline.py) |
||||
```python |
||||
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]: |
||||
"""Return list of (window_id, start_str, end_str) tuples.""" |
||||
``` |
||||
|
||||
### Class example (from database.py) |
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
... |
||||
``` |
||||
|
||||
## Anti-patterns |
||||
- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files. |
||||
|
||||
## Remediations |
||||
- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run ruff/black CI lint step. |
||||
|
||||
## Evidence pointers |
||||
- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120) |
||||
- database.py: MotionDatabase class and methods (file database.py lines 1-400+) |
||||
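The naming rules above combined in one sketch; the VOTE_MAP labels and values are assumptions for illustration, not taken from config.py:

```python
# UPPER_SNAKE_CASE constant; values are illustrative assumptions
VOTE_MAP = {"Voor": 1, "Tegen": -1, "Onthouding": 0}

# snake_case function
def encode_vote(vote: str) -> int:
    """Map a raw vote label to a numeric value (0 for unknown labels)."""
    return VOTE_MAP.get(vote, 0)

# PascalCase class
class MotionRecord:
    def __init__(self, title: str):
        self.title = title
```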
@ -1,74 +0,0 @@ |
||||
# Database Schema (DuckDB) — extracted DDL |
||||
|
||||
## Rules |
||||
- Use DuckDB for persistent storage when available; fallback to JSON files when duckdb is not installed (database.py). |
||||
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS used in database.py). |
||||
|
||||
## Examples (DDL snippets extracted from database.py) |
||||
|
||||
### motions table |
||||
```sql |
||||
CREATE TABLE IF NOT EXISTS motions ( |
||||
id INTEGER DEFAULT nextval('motions_id_seq'), |
||||
title TEXT NOT NULL, |
||||
description TEXT, |
||||
date DATE, |
||||
policy_area TEXT, |
||||
voting_results JSON, |
||||
winning_margin FLOAT, |
||||
controversy_score FLOAT, |
||||
layman_explanation TEXT, |
||||
externe_identifier TEXT, |
||||
body_text TEXT, |
||||
url TEXT UNIQUE, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
``` |
||||
|
||||
### mp_votes table |
||||
```sql |
||||
CREATE TABLE IF NOT EXISTS mp_votes ( |
||||
id INTEGER DEFAULT nextval('mp_votes_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
mp_name TEXT NOT NULL, |
||||
party TEXT, |
||||
vote TEXT NOT NULL, |
||||
date DATE, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
``` |
||||
|
||||
### embeddings / fused_embeddings |
||||
```sql |
||||
CREATE TABLE IF NOT EXISTS embeddings ( |
||||
id INTEGER DEFAULT nextval('embeddings_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
model TEXT, |
||||
vector JSON NOT NULL, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
|
||||
CREATE TABLE IF NOT EXISTS fused_embeddings ( |
||||
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
window_id TEXT NOT NULL, |
||||
vector JSON NOT NULL, |
||||
svd_dims INTEGER NOT NULL, |
||||
text_dims INTEGER NOT NULL, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
``` |
||||
|
||||
## Anti-patterns |
||||
- Broad try/except around duckdb import (database.py top) — acceptable for optional dependency but should log explicitly the missing dependency and document test behavior. |
||||
|
||||
## Remediations |
||||
- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically. |
||||
- Add tests that exercise both duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80). |
||||
|
||||
## Evidence pointers |
||||
- database.py: DDL strings and sequences (file: database.py lines ~1-300 and further). See create table blocks for motions, mp_votes, embeddings, fused_embeddings. |
||||
@ -1,22 +0,0 @@ |
||||
# Domain Glossary |
||||
|
||||
## Rules |
||||
- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id. |
||||
|
||||
## Terms |
||||
- Motion: parliamentary motion stored in `motions` table. Evidence: database.py CREATE TABLE motions (file: database.py lines ~40-110) |
||||
- MP (Member of Parliament): individual with votes stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes |
||||
- Embedding: text embedding stored in `embeddings` table; fused vectors in `fused_embeddings`. |
||||
- SVD vector: reduced-dimensional vectors stored in `svd_vectors` table. |
||||
- Window: time window identifier (e.g., "2024-Q1") used across SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows |
||||
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score |
||||
|
||||
## Examples / Usage |
||||
- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120 |
||||
|
||||
## Evidence pointers |
||||
- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py) |
||||
- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py) |
||||
|
||||
## Anti-patterns |
||||
- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations. |
||||
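A small adapter of the kind recommended above might look like this; the ingestion-side field names (`name`, `fractie`) are assumptions for illustration, while the output keys match the mp_votes columns:

```python
def to_mp_vote_row(raw: dict) -> dict:
    """Normalise an ingestion-side record into canonical mp_votes column names."""
    return {
        "mp_name": raw["name"],
        "party": raw.get("fractie") or raw.get("party"),
        "vote": raw["vote"],
    }
```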
@ -1,30 +0,0 @@ |
||||
# Code Clusters / Organization |
||||
|
||||
## Rules |
||||
- The repository organizes code into the following clusters (observed): |
||||
- UI / Streamlit: Home.py, pages/, app.py, explorer.py |
||||
- Database & persistence: database.py, config.py |
||||
- ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion) |
||||
- AI provider & summarization: ai_provider.py, pipeline/..., analysis/ |
||||
- Similarity & caching: similarity/*, similarity_cache table in DB |
||||
- API client & scraping: api_client.py, pipeline/fetch_mp_metadata |
||||
- Analysis & visualization: analysis/visualize.py, explorer.py |
||||
- CLI & scheduler: scheduler.py, pipeline/run_pipeline.py |
||||
- Tests & migrations: tests/ (pytest) and database reset helpers |
||||
|
||||
## Examples |
||||
|
||||
### Pipeline orchestrator (cluster: CLI & pipeline) |
||||
```python |
||||
from database import MotionDatabase |
||||
db = MotionDatabase(db_path) |
||||
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window |
||||
``` |
||||
|
||||
## Remediations |
||||
- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about optional duckdb dependency and JSON fallback for tests. |
||||
|
||||
## Evidence pointers |
||||
- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py) |
||||
- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py) |
||||
- analysis/visualize.py: visualization cluster (file: analysis/visualize.py) |
||||
@ -1,46 +0,0 @@ |
||||
# Design Patterns & Code Patterns |
||||
|
||||
## Rules |
||||
- Use repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management. |
||||
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and local fallback. |
||||
- Pipeline orchestration: run_pipeline.py uses phases, ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (collect results before writes). |
||||
|
||||
## Examples |
||||
|
||||
### Repository pattern (database.py MotionDatabase) |
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._init_database() |
||||
|
||||
def insert_motion(self, motion_data: Dict) -> bool: |
||||
"""Insert a new motion into database""" |
||||
# uses duckdb.connect and parameterized queries |
||||
``` |
||||
|
||||
### Provider adapter with retries (ai_provider.py) |
||||
```python |
||||
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response: |
||||
# Implements retries/backoff, handles 429 with Retry-After and 5xx responses |
||||
``` |
||||
|
||||
### Pipeline parallelism pattern (run_pipeline) |
||||
```python |
||||
with ThreadPoolExecutor(max_workers=max_workers) as pool: |
||||
for window_id, w_start, w_end in windows: |
||||
fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k) |
||||
futures[fut] = window_id |
||||
# wait then write sequentially to DuckDB |
||||
``` |
||||
|
||||
## Anti-patterns |
||||
- Broad excepts used in several places (database.py top-level try/except on duckdb import, many generic excepts around DB operations) — can hide real errors. |
||||
|
||||
## Remediations |
||||
- Replace broad except Exception with targeted exceptions and explicit logging. Where fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with clear message and include guidance in CONTRIBUTING.md. |
||||
|
||||
## Evidence pointers |
||||
- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300) |
||||
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260) |
||||
- database.py: MotionDatabase methods (file: database.py) |
||||
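The collect-before-write constraint can be demonstrated with a self-contained sketch: workers do pure computation, all results are gathered, and only then does the single sequential "write" step run. Here `compute` is a trivial stand-in for compute_svd_for_window and the write step is simulated by building a dict:

```python
from concurrent.futures import ThreadPoolExecutor

def compute(window_id: str) -> tuple:
    # Stand-in for compute_svd_for_window: pure computation, no DB writes
    return (window_id, len(window_id))

def run_windows(windows, max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(compute, w) for w in windows]
        for fut in futures:
            results.append(fut.result())  # collect all results first
    # Only now perform the writes sequentially (DuckDB allows one writer)
    return dict(results)
```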
@ -1,24 +0,0 @@ |
||||
# Anti-patterns, Issues and Recommended Fixes |
||||
|
||||
## Rules |
||||
- Flagged issues discovered in Phase 1 must be remediated with concrete actions. |
||||
|
||||
## Issues |
||||
- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml |
||||
- openai is declared but static imports not found; may be unused. Evidence: pyproject.toml, ai_provider.py uses requests and env keys instead of openai imports. |
||||
- Many dependencies use permissive ">=" version ranges; no lockfile present. This reduces reproducibility. |
||||
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps. |
||||
- Broad except Exception used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging. |
||||
|
||||
## Remediations / Recommended fixes |
||||
- Move pytest from runtime dependencies to dev-dependencies in pyproject.toml. |
||||
- Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies] depending on toolchain. |
||||
- Audit `openai` usage. If unused, remove from pyproject.toml. If dynamically imported in runtime, add a small shim or explicit lazy import with documented env var. |
||||
- Pin critical dependencies or add upper bounds; generate lockfile (poetry.lock or pip-tools requirements.txt). Add CI job that fails on permissive ranges. |
||||
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add CI lint stage. |
||||
- Replace broad except Exception with narrower catches and re-raise or log with traceback when unexpected. Example locations: database.py top import, insert_motion broad except, ai_provider fallback blocks. |
||||
|
||||
## Evidence pointers |
||||
- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40) |
||||
- database.py: multiple broad except blocks (file: database.py top and methods) |
||||
- ai_provider.py: uses requests + env keys (file: ai_provider.py) |
||||
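One way the pytest move could look, sketched with PEP 735 dependency groups; if the project uses Poetry, the equivalent table is [tool.poetry.group.dev.dependencies]. The two runtime entries shown are copied from the documented dependency list:

```toml
[project]
dependencies = [
    # pytest removed from the runtime list
    "duckdb>=1.3.2",
    "streamlit>=1.48.0",
]

[dependency-groups]
dev = [
    "pytest>=9.0.2",
]
```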
@@ -1,117 +0,0 @@

# Example Extractions

## Rules
- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
      quarterly → "2024-Q1", "2024-Q2", …
      annual    → "2024"
    """
```

2) database.py::append_audit_event
```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries the DB, then falls back to a ledger file."""
```

3) ai_provider.py::get_embedding
```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch
```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter
```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)
1) motions table (see constraints/10-db-schema.yaml); evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml); evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml); evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)
Create tests under tests/ named test_*.py using fixtures in conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py
```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py
```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```

3) tests/test_pipeline_windows.py
```python
from pipeline.run_pipeline import _generate_windows

def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py
```python
import pytest

def test_plot_umap_scatter_no_plotly():
    # Only meaningful in an environment without plotly installed:
    # the helper should raise ImportError with installation guidance.
    import analysis.visualize as vis

    with pytest.raises(ImportError):
        vis._require_plotly()
```

## Evidence pointers
- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py CREATE TABLE blocks
@ -1,43 +0,0 @@ |
||||
# Stack and Dependencies |
||||
|
||||
## Rules |
||||
- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13") |
||||
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile |
||||
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py |
||||
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/ |
||||
|
||||
## Examples |
||||
|
||||
### pyproject dependencies (evidence: pyproject.toml) |
||||
```toml |
||||
dependencies = [ |
||||
"duckdb>=1.3.2", |
||||
"ibis-framework[duckdb]>=10.8.0", |
||||
"openai>=1.99.7", |
||||
"scipy>=1.11", |
||||
"umap-learn>=0.5", |
||||
"plotly>=5.0", |
||||
"pytest>=9.0.2", |
||||
"requests>=2.32.4", |
||||
"schedule>=1.2.2", |
||||
"streamlit>=1.48.0", |
||||
"scikit-learn>=1.8.0", |
||||
"beautifulsoup4>=4.14.3", |
||||
"lxml>=6.0.2", |
||||
] |
||||
``` |
||||
|
||||
## Anti-patterns / Notes |
||||
- pytest is listed under runtime dependencies in pyproject.toml (line: dependencies). Move pytest to dev-dependencies to avoid shipping test runner in production images. Evidence: pyproject.toml |
||||
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility. |
||||
- openai appears declared but static imports not found; possible unused dependency (evidence: pyproject.toml, ai_provider.py uses requests and environment keys instead of openai). |
||||
|
||||
## Remediations |
||||
- Move test-only libs (pytest) to dev-dependencies in pyproject.toml. |
||||
- Add lockfile and CI step to check for pinned dependencies. |
||||
- Audit declared but unused packages (openai) and remove or confirm dynamic usage. |
||||
|
||||
## Evidence pointers |
||||
- pyproject.toml: full dependency list (lines 1-40) |
||||
- Home.py: streamlit usage and app entry (file: Home.py) |
||||
- database.py: duckdb table creation and connection (file: database.py lines ~1-350) |
||||
@ -1,29 +0,0 @@ |
||||
# DB connection handling constraints |
||||
|
||||
rules: |
||||
- name: use_context_managers_for_connections |
||||
rule: "Prefer using 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible." |
||||
rationale: "Ensures proper resource cleanup and avoids connection leaks." |
||||
|
||||
- name: read_only_for_compute |
||||
rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)." |
||||
rationale: "Allows safe parallel workers and reduces write contention." |
||||
|
||||
- name: short_lived_writes |
||||
rule: "When performing database writes, open short-lived connections, commit quickly and close." |
||||
rationale: "Avoids long-lived transactions and reduces lock windows." |
||||
|
||||
examples: |
||||
- path: pipeline/svd_pipeline.py |
||||
snippet: | |
||||
conn = duckdb.connect(db_path, read_only=True) |
||||
try: |
||||
rows = conn.execute(...).fetchall() |
||||
finally: |
||||
conn.close() |
||||
|
||||
anti_patterns_and_remediations: |
||||
- bad: "Creating a global connection at import that performs migrations." |
||||
remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time." |
||||
- bad: "Not closing connections on exceptions." |
||||
remediation: "Wrap connects in `with` or finally: conn.close() blocks." |
||||
@@ -0,0 +1,143 @@

---
title: Error Handling Patterns
category: constraints
severity: high
---

# Error Handling Patterns

## Core Rules

1. **Catch `Exception`, return safe fallbacks** (False/[]/None)
2. **Log exceptions with traceback** using `_logger.exception()`
3. **Never swallow exceptions silently**: always log or return a sensible default
4. **Avoid nested try/except blocks**: flatten exception handling

## Pattern: Try/Except Safe Fallback

This is the dominant pattern in the codebase (219+ instances).

```python
# Standard pattern from database.py, api_client.py, etc.
try:
    result = risky_operation()
    return process(result)
except Exception as exc:
    _logger.warning("Operation failed: %s", exc)
    return safe_fallback  # False, [], None, {}
```

### Examples from Codebase

**database.py** (DuckDB operations):
```python
def get_svd_vectors(self, window: str):
    try:
        conn = duckdb.connect(self.db_path, read_only=True)
        try:
            result = conn.execute(query, (window,)).fetchall()
            return self._parse_vectors(result)
        finally:
            conn.close()
    except Exception as exc:
        _logger.warning("Failed to get SVD vectors: %s", exc)
        return []
```

**ai_provider.py** (HTTP retries):
```python
try:
    resp = requests.post(url, json=json, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(f"Connection error: {exc}") from exc
    # ... retry logic
```
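The elided retry logic above follows this general shape. The sketch abstracts the HTTP call into a `send` callable so it is self-contained, and the backoff constants are illustrative:

```python
import time

class ProviderError(Exception):
    """Raised when all retry attempts are exhausted."""

def post_with_retries(send, retries: int = 3, backoff: float = 0.01):
    for attempt in range(1, retries + 1):
        try:
            # `send` stands in for requests.post(...) + raise_for_status()
            return send()
        except ConnectionError as exc:
            if attempt == retries:
                raise ProviderError(f"Connection error: {exc}") from exc
            # Exponential backoff before the next attempt
            time.sleep(backoff * 2 ** (attempt - 1))
```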
|
||||
## Pattern: Optional Dependency Fallback |
||||
|
||||
Gracefully degrade when optional packages are unavailable. |
||||
|
||||
```python |
||||
# UMAP fallback in explorer_helpers.py |
||||
try: |
||||
import umap |
||||
HAS_UMAP = True |
||||
except ImportError: |
||||
HAS_UMAP = False |
||||
_logger.debug("UMAP not available, using SVD vectors directly") |
||||
|
||||
def project_to_2d(vectors): |
||||
if HAS_UMAP: |
||||
return umap.UMAP().fit_transform(vectors) |
||||
return vectors[:, :2] # Fallback: first 2 SVD dimensions |
||||
``` |
||||
|
||||
## Anti-Patterns

### 1. Bare except with pass (CRITICAL)
**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**: Catch a specific exception, or log and continue:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped (may already exist): %s", exc)
```

### 2. Nested Exception Handling
**File**: `explorer.py`, lines 244-261

```python
# BAD - opaque error paths
try:
    result = compute_svd(motions)
except Exception:
    try:
        result = fallback_compute(motions)
    except Exception:
        pass  # Both exceptions silently dropped
```

**Fix**: Flatten and handle each case explicitly:
```python
# GOOD - explicit handling
try:
    result = compute_svd(motions)
except Exception as exc:
    _logger.warning("SVD failed, trying fallback: %s", exc)
    try:
        result = fallback_compute(motions)
    except Exception as fallback_exc:
        _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc)
        raise
```

## Rule Summary

| Pattern | When to Use | Return Value |
|---------|-------------|--------------|
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
| Re-raise | Critical operations that must succeed | `raise` |
| Log and continue | Optional steps in pipeline | (continue) |
| Graceful degradation | Optional dependencies | Default behavior |

## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |
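
The two rows that are easiest to confuse are "return a safe default" versus "raise with a clear message". A hedged sketch of both (`fetch_votes` and `load_config` are illustrative names, not functions from this codebase):

```python
import logging

_logger = logging.getLogger(__name__)

def fetch_votes(session):
    """User action failure: log a warning, return a safe default."""
    try:
        return session.get_votes()
    except LookupError as exc:
        _logger.warning("Could not fetch votes: %s", exc)
        return []  # safe fallback

def load_config(path):
    """Configuration error: log, then raise with a clear message."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError as exc:
        _logger.error("Config file unreadable: %s", path)
        raise ValueError(f"Invalid configuration path: {path}") from exc
```

The asymmetry is deliberate: a user can retry a failed action, but a bad configuration should stop the program with an actionable message.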

@ -1,36 +0,0 @@
# Error handling style rules (YAML constraint example)

rules:
  - name: explicit_exceptions
    rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values."
    examples:
      - good: |
          if not isinstance(text, str):
              raise ProviderError('text must be a string')
      - bad: |
          if not isinstance(text, str):
              return []

  - name: avoid_broad_except
    rule: "Avoid 'except Exception:' that swallows errors. If a broad except is genuinely needed for best-effort work, log the exception with logger.exception and re-raise or convert it to a typed error."
    examples:
      - bad: |
          try:
              do_work()
          except Exception:
              return []
      - remediation: |
          try:
              do_work()
          except SpecificError as exc:
              logger.warning('Handled error: %s', exc)
              raise

  - name: logging_over_print
    rule: "Prefer logger.* over print() for messages and errors."
    examples:
      - bad: "print('Error fetching motions from API: %s' % e)"
      - good: "logger.exception('Error fetching motions from API')"

enforcement_examples:
  - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception."
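
A minimal sketch of such a check, suitable for a pre-commit hook or CI step. The script name, scope, and regexes are assumptions, not part of this project; the regexes are deliberately rough heuristics (they will also match inside strings and comments):

```python
# check_style.py - illustrative static check, not a project file.
import re
import sys
from pathlib import Path

PRINT_RE = re.compile(r"\bprint\(")        # flags print() calls
BARE_EXCEPT_RE = re.compile(r"except\s*:")  # flags bare 'except:' only

def find_violations(root: str) -> list[str]:
    """Return 'path:lineno: message' entries for print() and bare except."""
    violations = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if PRINT_RE.search(line):
                violations.append(f"{path}:{lineno}: print() call; use logging")
            if BARE_EXCEPT_RE.search(line):
                violations.append(f"{path}:{lineno}: bare except")
    return violations

if __name__ == "__main__":
    found = find_violations(sys.argv[1] if len(sys.argv) > 1 else ".")
    # print() is acceptable here: this is a simple script, per the rule above.
    print("\n".join(found))
    sys.exit(1 if found else 0)
```

Note that `except\s*:` does not match `except Exception:`, so the two checks stay independent.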

@ -1,24 +1,205 @@
# Import grouping and ordering constraints

rules:
  - name: grouping
    rule: "Group imports in three sections separated by a single blank line: stdlib, third-party, local."
    examples:
      - good: |
          import json
          import logging

          import duckdb
          import requests

          from .pipeline import text_pipeline
      - bad: |
          import duckdb
          import json
          from pipeline import text_pipeline

  - name: from_imports
    rule: "Prefer 'from x import y' only when it improves clarity or avoids a circular import; otherwise import the module and reference attributes."

enforcement_examples:
  - "Run isort or ruff's import sorting in pre-commit or CI to enforce ordering."
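
As a hedged illustration of the ruff route (section names follow current ruff releases; older versions put `select` directly under `[tool.ruff]`):

```toml
# pyproject.toml - illustrative; "I" enables ruff's isort-compatible
# import-sorting rules, so `ruff check --fix` reorders imports in CI.
[tool.ruff.lint]
select = ["E", "F", "I"]
```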

# Import Organization Constraints

## Standard Order

Organize imports in three groups with blank lines between:

```python
# 1. Standard library imports (alphabetical within group)
import json
import logging
import os
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# 2. Third-party packages (alphabetical within group)
import duckdb
import requests

# 3. Local application modules (can use relative imports)
from config import config
from database import db
from summarizer import summarizer
```

## Alphabetical Ordering

Within each group, sort imports alphabetically:

```python
# GOOD - alphabetical
import json
import logging
from datetime import datetime
from typing import Dict, List, Optional

# BAD - random order
from typing import Optional
import json
from datetime import datetime
import logging
from typing import Dict, List
```

## Grouping Rules

### Standard Library
- `json`, `logging`, `os`, `sys`, `time`
- `datetime`, `timedelta` from `datetime`
- `Dict`, `List`, `Optional`, etc. from `typing`
- `argparse`, `pathlib`, `re`, `uuid`

### Third-Party
- `duckdb`, `requests`, `streamlit`
- `numpy`, `scipy`, `sklearn`
- `plotly`, `beautifulsoup4`
- `pytest`

### Local Application
- Modules from the same package
- Relative imports when appropriate

## When to Use `from X import Y`

### Prefer `from module import specific_items` for:
- Constants and config
- Single classes or functions used frequently
- Type annotations

```python
# GOOD - clear about what we're using
from config import config
from database import db

# GOOD - type hints
from typing import Dict, List, Optional
```

### Use `import module` when:
- You need multiple items from the module
- Using the module namespace is clearer

```python
# GOOD - duckdb used for types and module access
import duckdb

conn = duckdb.connect(...)
result = conn.execute(...)

# Also acceptable for types
from typing import Dict
```

## Relative Imports

In package modules, prefer relative imports:

```python
# pipeline/svd_pipeline.py
from ..database import MotionDatabase  # relative import
from .text_pipeline import process_text  # relative import
```

## Circular Imports

Avoid circular imports by:
1. Moving shared code to a third module
2. Using TYPE_CHECKING for type-hint-only imports

```python
# types.py - shared type definitions
from typing import TypedDict

class MotionDict(TypedDict):
    id: int
    title: str
    ...

# module_a.py
from .types import MotionDict

# module_b.py - if needed here too
from .types import MotionDict
```

## Import Patterns to Avoid

### Wildcard Imports
```python
# BAD
from database import *

# GOOD
from database import db, MotionDatabase
```

### Import in Function Scope (unless necessary)
```python
# AVOID - delays import, makes dependencies unclear
def some_function():
    import pandas as pd  # Late import
    return pd.DataFrame(...)

# PREFER - import at module level
import pandas as pd

def some_function():
    return pd.DataFrame(...)
```

### Reassigning Imported Names
```python
# BAD - confusing
from module import process
process = something_else  # Reassigning

# GOOD - clear naming
from module import process as process_data
```

## Type Checking Imports

For type hints only, use TYPE_CHECKING:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .models import Motion

def get_motion(motion_id: int) -> "Motion":  # String quotes for forward ref
    ...
```

## Optional Dependency Imports

Handle optional dependencies gracefully:

```python
try:
    import duckdb
except ImportError:  # catch the import failure specifically
    duckdb = None  # Will be checked later

class MotionDatabase:
    def __init__(self):
        if duckdb is None:
            self._file_mode = True  # Fallback mode
```

## Example: Complete Import Block

```python
# Complete example from database.py
import json
import logging
import uuid
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

import duckdb

from config import config
from database import db
```

@ -0,0 +1,131 @@
---
title: Logging Constraints
category: constraints
severity: critical
---

# Logging Constraints

## Core Rule

Use `logging.getLogger(__name__)` - never use `print()`.

**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances).

## CRITICAL Anti-Pattern: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        # ...
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception:
        _logger.exception("Error fetching motions from API")
        return []
```

## Logger Initialization

Get the logger at module level:

```python
# GOOD: Use logging.getLogger(__name__)
import logging

_logger = logging.getLogger(__name__)

def some_function():
    _logger.info("Processing started")
    _logger.debug("Detail: %s", detail)
```

## Logger Naming

Use `__name__` for the automatic module path:

```python
# In database.py - logger will be "database"
_logger = logging.getLogger(__name__)

# In pipeline/svd_pipeline.py - logger will be "pipeline.svd_pipeline"
_logger = logging.getLogger(__name__)
```

**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention.

**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish them from class-level loggers.

## Log Levels

| Level | When to Use |
|-------|-------------|
| DEBUG | Detailed diagnostic info (dev only) |
| INFO | Normal operation milestones |
| WARNING | Unexpected but handled (fallbacks) |
| ERROR | Operation failed, may need attention |
| CRITICAL | Fatal error, program may crash |
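
A hedged sketch mapping the table onto calls in one function (`process_window` is an illustrative name, not a function from this codebase):

```python
import logging

_logger = logging.getLogger(__name__)

def process_window(window_id, vectors):
    _logger.debug("Raw vectors for %s: %r", window_id, vectors)  # DEBUG: dev-only detail
    _logger.info("Processing window %s", window_id)  # INFO: normal milestone
    if not vectors:
        # WARNING: unexpected but handled via a fallback
        _logger.warning("No vectors for %s, returning empty result", window_id)
        return []
    return vectors
```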

## Exception Logging

Use `_logger.exception()` for caught exceptions (it includes the traceback):

```python
try:
    result = risky_operation()
except Exception as exc:
    _logger.exception("Operation failed: %s", exc)
    return fallback_value
```

## Anti-Patterns

### Debug Prints in Production Code
```python
# BAD
print(f"[TRAJ DEBUG] processing window {wid}")

# GOOD
_logger.debug("Processing window %s", wid)
```

### Inconsistent Logger Names
```python
# BAD - mixing _logger and logger
_logger = logging.getLogger(__name__)
logger = logging.getLogger("other")  # Inconsistent
```

## Sensitive Data

Never log sensitive information:
- API keys
- User votes
- Session IDs (if tied to user data)
- Personal information

```python
# BAD
_logger.info("User %s voted %s", user_id, vote)

# GOOD - log aggregates, not individual votes
_logger.info("Vote recorded for session %s", session_id[:8])
```

@ -1,30 +1,141 @@
# Naming constraint rules (example constraint file)

rules:
  - name: module_file_names
    rule: "Use snake_case for Python module filenames (e.g., text_pipeline.py, ai_provider.py)."
    examples:
      - good: "text_pipeline.py"
      - bad: "TextPipeline.py"

  - name: function_names
    rule: "Use snake_case for functions and methods."
    examples:
      - good: "def compute_similarities(...):"
      - bad: "def ComputeSimilarities(...):"

  - name: class_names
    rule: "Use PascalCase for classes."
    examples:
      - good: "class MotionDatabase:"
      - bad: "class motion_database:"

  - name: constants
    rule: "Constants use UPPER_SNAKE_CASE."
    examples:
      - good: "VOTE_MAP = { ... }"
      - bad: "vote_map = { ... }"

enforcement_examples:
  - "Add a linter rule in CI: ruff or a flake8 naming plugin to detect violations."
  - "Run `python -m pip install ruff` and `ruff check` as part of CI."

# Naming Constraints

## File Names

### Python Modules
- **Convention**: `snake_case.py`
- **Examples**: `motion_database.py`, `api_client.py`, `text_pipeline.py`

### Test Files
- **Convention**: `test_<module_name>.py`
- **Examples**: `test_database.py`, `test_api_client.py`

### Config Files
- **Convention**: `snake_case`
- **Examples**: `config.py`, `.env.example`, `pyproject.toml`

### Directories
- **Convention**: `snake_case/`
- **Examples**: `pipeline/`, `tests/integration/`, `src/validators/`

## Class Names

- **Convention**: `PascalCase`
- **Examples**: `MotionDatabase`, `TweedeKamerAPI`, `MotionSummarizer`

### Naming Patterns
| Pattern | Example |
|---------|---------|
| Database wrapper | `MotionDatabase` |
| API client | `TweedeKamerAPI` |
| Service/Helpers | `MotionScraper`, `MotionAnalyzer` |
| Exceptions | `ProviderError` |

## Function Names

- **Convention**: `snake_case`
- **Examples**: `get_motions`, `compute_similarity`, `process_voting_records`

### Private Methods
- **Convention**: `_snake_case` (single underscore prefix)
- **Examples**: `_get_voting_records`, `_parse_response`

## Variable Names

### Regular Variables
- **Convention**: `snake_case`
- **Examples**: `motion_id`, `party_name`, `voting_results`

### Constants (Module-Level)
- **Convention**: `UPPER_SNAKE_CASE`
- **Examples**: `DATABASE_PATH`, `API_TIMEOUT`, `MAX_RETRIES`

### Config Variables (in dataclass)
- **Convention**: `UPPER_SNAKE_CASE`
- **Examples**: `QWEN_MODEL`, `POLICY_AREAS`

### Booleans
- **Convention**: `is_`, `has_`, `can_` prefixes or `_flag` suffix
- **Examples**: `is_active`, `has_votes`, `skip_extract`

### Private Variables
- **Convention**: `_underscore_prefix`
- **Examples**: `_conn`, `_cache`, `_session`

## Singleton Instances

- **Convention**: `lower_snake_case` at module level
- **Examples**: `db = MotionDatabase()`, `summarizer = MotionSummarizer()`

```python
# database.py
class MotionDatabase:
    ...

# Singleton instance
db = MotionDatabase()

# Usage
from database import db
motions = db.get_motions()
```

## Type Variables

- **Convention**: `PascalCase`
- **Examples**: `T = TypeVar('T')`, `MotionDict = Dict[str, Any]`
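
A minimal sketch of these conventions in use (`first_or_default` is an illustrative helper, not part of this codebase):

```python
from typing import Any, Dict, List, TypeVar

T = TypeVar("T")  # type variable: PascalCase (single letter is conventional)
MotionDict = Dict[str, Any]  # type alias: PascalCase

def first_or_default(items: List[T], default: T) -> T:
    """Generic helper: the TypeVar ties element and return types together."""
    return items[0] if items else default
```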

## Anti-Patterns

### Inconsistent Naming
```python
# BAD - mixing styles
get_motions()  # snake_case
GetMotionById()  # PascalCase
processData()  # camelCase

# GOOD - consistent snake_case
get_motions()
get_motion_by_id()
process_voting_data()
```

### Abbreviations
```python
# AVOID - unclear abbreviations
calc_similarity()  # calculate_*
proc_votes()  # process_*
get_mp_data()  # get_mp_metadata()

# PREFER - full words
calculate_similarity()
process_votes()
get_mp_metadata()
```

### Hungarian Notation
```python
# BAD - Hungarian notation
str_title = "..."
int_count = 0
b_is_active = True

# GOOD - clear types via naming
title = "..."
count = 0
is_active = True
```

## Special Cases

### Window IDs
- **Format**: `"YYYY-QN"` or `"YYYY"`
- **Examples**: `"2024-Q1"`, `"2024-Q2"`, `"2024"`

### Policy Areas
- **Convention**: capitalized Dutch terms, spaces allowed
- **Examples**: `"Economie"`, `"Sociale Zaken"`, `"Klimaat"`

### Vote Values
- **Convention**: capitalized Dutch terms
- **Values**: `"Voor"`, `"Tegen"`, `"Onthouden"`, `"Geen stem"`, `"Afwezig"`

@ -0,0 +1,233 @@
# Type Hint Constraints

## Core Rule

**Use type hints on all public functions and methods.**

## Function Type Hints

### Required on Public APIs

```python
# GOOD - complete type hints
def get_motion(self, motion_id: int) -> Optional[Dict]:
    ...

def get_filtered_motions(
    self,
    policy_area: str = "Alle",
    limit: int = 10
) -> List[Dict]:
    ...

def calculate_similarity(self, motion_a: int, motion_b: int) -> float:
    ...
```

### Optional Parameters

Use `Optional[X]` or `X | None`:

```python
# Both forms are acceptable
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]:
    ...

def get_motion(self, motion_id: int | None = None) -> dict | None:
    ...
```

### Multiple Return Types

Use `Union[X, Y]` or the `|` operator:

```python
# Acceptable forms
def parse_value(self, value: str) -> Union[bool, str, None]:
    ...

def parse_value(self, value: str) -> bool | str | None:
    ...
```

### Generic Types

Use `List[X]`, `Dict[K, V]`, `Tuple[X, Y]`:

```python
from typing import Dict, List, Optional, Tuple

def get_motions(self, ids: List[int]) -> Dict[int, Dict]:
    """Map motion_id -> motion data."""
    ...

def process_batch(self, items: List[str]) -> Tuple[List[str], List[str]]:
    """Returns (successes, failures)."""
    ...
```

## Collection Types

Prefer specific types over bare `list`/`dict`:

```python
# GOOD - specific types
def get_votes(self) -> List[str]:
    ...

def get_metadata(self) -> Dict[str, Any]:
    ...

# ACCEPTABLE - for truly generic collections
def merge_dicts(*dicts: dict) -> dict:
    ...
```

## DuckDB Result Types

DuckDB returns tuples/lists - document the expected structure:

```python
def get_motion(self, motion_id: int) -> Optional[Tuple]:
    """Returns (id, title, description, date, ...) or None."""
    conn = duckdb.connect(self.db_path)
    try:
        result = conn.execute(
            "SELECT * FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()
        return result
    finally:
        conn.close()

# Or use Dict for clarity
def get_motion_as_dict(self, motion_id: int) -> Optional[Dict]:
    """Returns motion dict or None."""
    conn = duckdb.connect(self.db_path)
    try:
        row = conn.execute(
            "SELECT * FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()
        if row:
            return {
                "id": row[0],
                "title": row[1],
                "description": row[2],
                ...
            }
        return None
    finally:
        conn.close()
```

## Class/Instance Types

Use `Self` for methods returning the instance type:

```python
from typing import Self

class MotionDatabase:
    def with_connection(self, path: str) -> Self:
        """Return a new instance with a different path."""
        return MotionDatabase(db_path=path)
```

## Callback/Function Types

Use `Callable` for function parameters:

```python
from typing import Callable

def process_motions(
    motions: List[Dict],
    processor: Callable[[Dict], Any]
) -> List[Any]:
    return [processor(m) for m in motions]
```

## Type Aliases

Define clear type aliases for domain concepts:

```python
from typing import Dict, List, Literal, Optional, TypedDict

# Vote values
VoteValue = Literal["Voor", "Tegen", "Onthouden", "Geen stem", "Afwezig"]

# Policy areas
PolicyArea = Literal["Alle", "Economie", "Klimaat", "Immigratie", ...]

# Motion dict
class MotionDict(TypedDict):
    id: int
    title: str
    description: Optional[str]
    date: Optional[str]
    policy_area: Optional[str]
    voting_results: Optional[str]  # JSON string
    winning_margin: Optional[float]

def get_motion(self, motion_id: int) -> Optional[MotionDict]:
    ...
```

## Avoid `Any`

Use `Any` sparingly - prefer specific types:

```python
# AVOID - too vague
def process(data: Any) -> Any:
    ...

# PREFER - specific types
def process(motion: MotionDict) -> Optional[SimilarityResult]:
    ...
```

## Inline Type Hints

For simple cases, inline hints are fine:

```python
def get_count(self) -> int:
    ...

def is_empty(self) -> bool:
    ...
```

## Docstring Type Hints

For complex types, describe the structure in the docstring as well:

```python
def get_party_positions(self, window_id: str) -> Dict[str, List[float]]:
    """Get party positions in political space.

    Args:
        window_id: Time window (e.g., "2024-Q1")

    Returns:
        Dict mapping party_name -> [x, y] coordinates

    Example:
        >>> positions = db.get_party_positions("2024-Q1")
        >>> positions["VVD"]
        [0.5, -0.3]
    """
    ...
```

## Type Checking

Type hints are not enforced at runtime; add explicit checks where invalid input must fail fast:

```python
def set_count(self, count: int) -> None:
    if not isinstance(count, int):
        raise TypeError(f"Expected int, got {type(count).__name__}")
    self._count = count
```

@ -1,32 +0,0 @@
# Coding conventions cheat-sheet (extracted from Phase 1)

naming:
  module_files: snake_case (e.g., text_pipeline.py, ai_provider.py)
  functions: snake_case
  classes: PascalCase
  constants: UPPER_SNAKE_CASE
  module_singletons: module-level instances, named lower_snake (e.g., db = MotionDatabase())

imports:
  order:
    - stdlib
    - third-party
    - local application imports
  style:
    - group imports with a blank line between groups
    - prefer "from x import y" only when needed to avoid circular imports

types_and_dataclasses:
  - Use type hints broadly (functions, public APIs)
  - config should be a dataclass in config.py
  - Module-level singletons are allowed (but follow lifecycle rules in db_connection constraints)

tests:
  - pytest
  - tests/ directory, files named test_*.py
  - Use fixtures in tests/fixtures and conftest.py
  - Tests expect raises(...) for invalid input or ProviderError

error_handling:
  - Prefer explicit exceptions (ValueError, ProviderError)
  - Avoid overly-broad except: clauses (see anti-patterns)

@ -0,0 +1,124 @@
# Naming Conventions

## Files
- **snake_case** for all Python files: `database.py`, `explorer_helpers.py`, `motion_cache.py`
- **PascalCase** is NOT used for files

## Functions
- **snake_case**: `get_svd_vectors()`, `compute_party_coords()`, `build_scatter_trace()`
- Private helpers prefixed with `_`: `_get_window_data()`

## Classes
- **PascalCase**: `MotionDatabase`, `Config`
- **Dataclass pattern** for Config: `@dataclass` decorator with typed fields

## Variables
- **snake_case**: `party_map`, `mp_name`, `svd_vectors`, `party_centroids`
- **CONSTANT_SNAKE_CASE** for module-level constants: `PARTY_COLOURS`, `DEFAULT_WINDOW`

## Module-Level Exports
- **Singleton instance**: `db = MotionDatabase()` at module bottom (not class-level)
- **Config instance**: `config = Config(...)` at module bottom
- **Dicts**: `PARTY_COLOURS` exported from `config.py`

---

# Error Handling

## Known Patterns
1. **Bare except with pass** (ANTI-PATTERN - see anti-patterns.yaml)
   ```python
   except:
       pass  # database.py:47
   ```

2. **Graceful degradation**: catch specific exceptions, fall back to a default
   ```python
   try:
       result = compute_svd()
   except ImportError:
       result = DEFAULT_SVD
   ```

3. **Optional dependency fallbacks**:
   ```python
   try:
       import umap
       use_umap = True
   except ImportError:
       use_umap = False
   ```

4. **Nested exception handling** (ANTI-PATTERN - see anti-patterns.yaml):
   ```python
   try:
       ...
   except Exception:
       try:
           ...
       except Exception:
           pass
   ```

## Rules
- Never use bare `except:` — always specify the exception type
- Never swallow exceptions silently — log or return a sensible default
- For optional deps, catch `ImportError` or `ModuleNotFoundError` explicitly
- Avoid nested try/except blocks

---

# Code Organization

## Singleton Pattern
Each module owns one shared instance:
```python
# database.py
db = MotionDatabase()

# config.py
config = Config(...)
PARTY_COLOURS = {...}
```

## Pure Functions in Helpers
`explorer_helpers.py` contains only pure functions (no IO, no Streamlit calls):
```python
def compute_party_coords(svd_vectors, party_map):
    """Pure: no side effects, no module-level state."""
    ...

def build_scatter_trace(df, color_col):
    """Pure: returns a Plotly trace dict."""
    ...
```

## Cached Data Loaders
Use `@st.cache_data` for expensive data loading:
```python
@st.cache_data
def load_svd_vectors(window: str) -> pd.DataFrame:
    return db.get_svd_vectors(window)
```

## Dataclass Config
```python
@dataclass
class Config:
    db_path: str = "data/stemwijzer.duckdb"
    default_window: str = "2023"
    party_colours: dict = field(default_factory=lambda: PARTY_COLOURS)
```

---

# Imports

## Ordering (convention)
1. Standard library
2. Third-party (streamlit, ibis, plotly, sklearn, umap)
3. Local/relative imports

## Avoid
- Wildcard imports (`from module import *`)
- Circular imports (keep the dependency direction: helpers → database → config)

@ -1,55 +0,0 @@
# Dependencies map and recommended extras (Phase 1 authoritative)
declared:
  - streamlit
  - duckdb
  - ibis-framework[duckdb]
  - plotly
  - scikit-learn
  - scipy
  - umap-learn
  - openai  # note: declared but not observed imported; review usage
  - requests

observed:
  - requests
  - duckdb (used but sometimes import guarded)
  - numpy
  - pytest

grouped:
  core:
    - python >=3.13
    - streamlit
    - duckdb
    - ibis-framework[duckdb]
    - requests
  ml:
    - scikit-learn
    - scipy
    - umap-learn
    - numpy
  viz:
    - plotly
  testing:
    - pytest

recommended_extras:
  reproducibility:
    - poetry (poetry.lock) or pip-tools (requirements.txt + requirements.in)
    - pipx or virtualenv usage documented
  linting_and_formatting:
    - black
    - ruff
    - isort
    - mypy
  logging_and_monitoring:
    - structlog (optional)
  containerization:
    - docker (already used)
  heavy_analytics (optional):
    - pandas
    - altair
    - dash (if more interactive dashboards are needed)
notes:
  - Because no lockfile was present during Phase 1, adding one is a high priority for reproducible CI builds.
  - openai is declared but not imported anywhere in Phase 1 files; either remove it or add explicit adapter usage and tests.
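
As a hedged illustration of the pip-tools route (the package list mirrors the declared dependencies above; whether to keep openai is the open question noted):

```
# requirements.in - input file for pip-tools (illustrative).
# Compile to a pinned requirements.txt with: pip-compile requirements.in
streamlit
duckdb
ibis-framework[duckdb]
plotly
scikit-learn
scipy
umap-learn
requests
```

CI then installs with `pip install -r requirements.txt`, so every build resolves the same pinned set.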
||||
@ -0,0 +1,92 @@ |
||||
--- |
||||
title: Dependencies and Library Usage |
||||
category: dependencies |
||||
--- |
||||
|
||||
# Dependencies and Library Usage |
||||
|
||||
## Core Dependencies |
||||
|
||||
### duckdb |
||||
- **Required**: Yes |
||||
- **Fallback**: None (core functionality) |
||||
- **Usage**: SQL database for motions, embeddings, SVD vectors |
||||
- **Files**: database.py, analysis/*.py, pipeline/*.py |
||||
|
||||
### streamlit |
||||
- **Required**: Yes |
||||
- **Fallback**: None |
||||
- **Usage**: Web UI framework |
||||
- **Files**: app.py, pages/*.py, explorer.py |
||||
|
||||
### requests |
||||
- **Required**: Yes |
||||
- **Fallback**: None |
||||
- **Usage**: HTTP client for API calls |
||||
- **Files**: api_client.py, ai_provider.py |
||||
|
||||
### plotly |
||||
- **Required**: Yes |
||||
- **Fallback**: None (raises ImportError) |
||||
- **Usage**: Interactive charts for explorer |
||||
- **Files**: explorer.py, explorer_helpers.py |
||||
|
||||
## Optional Dependencies |
||||
|
||||
### umap-learn |
||||
- **Required**: No |
||||
- **Fallback**: Use raw SVD vectors (first 2 dimensions) |
||||
- **Usage**: Dimensionality reduction for visualization |
||||
- **Files**: analysis/clustering.py |
||||
|
||||
### matplotlib |
||||
- **Required**: No |
||||
- **Fallback**: Plotly or raw output |
||||
- **Usage**: Static charting |
||||
- **Files**: Various analysis scripts |
||||
|
||||
## ML Dependencies |
||||
|
||||
### sklearn |
||||
- **Required**: Yes |
||||
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler |
||||
- **Files**: analysis/clustering.py, similarity/compute.py |
||||
|
||||
### scipy |
||||
- **Required**: Yes |
||||
- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment |
||||
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py |
||||
|
||||
### numpy |
||||
- **Required**: Yes |
||||
- **Usage**: Array operations, linear algebra |
||||
- **Files**: Throughout codebase |
||||
|
||||
## Key Imports by File |
||||
|
||||
### explorer.py |
||||
- `import streamlit as st` |
||||
- `from database import db` |
||||
- `from explorer_helpers import *` |
||||
|
||||
### explorer_helpers.py |
||||
- `import pandas as pd` |
||||
- `import plotly.graph_objects as go` |
||||
- `from database import db` (optional, for type hints) |
||||
|
||||
### database.py |
||||
- `import ibis` |
||||
- `import duckdb` |
||||
- `from config import config, PARTY_COLOURS` |
||||
|
||||
### config.py |
||||
- `from dataclasses import dataclass, field` |
||||
- `import streamlit as st` (optional, for warnings) |
||||
|
||||
## Singleton Instances |
||||
|
||||
| Module | Instance | Type | |
||||
|--------|----------|------| |
||||
| `database.py` | `db` | `MotionDatabase` | |
||||
| `config.py` | `config` | `Config` (dataclass) | |
||||
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | |
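The singletons above follow the module-level instance pattern. A simplified sketch (the real `MotionDatabase` body is much larger; the default path is illustrative):

```python
# Simplified sketch of the module-level singleton pattern used by database.py.
class MotionDatabase:
    def __init__(self, path: str = "data/motions.db"):
        self.path = path


# Module level: every `from database import db` shares this one instance.
db = MotionDatabase()
```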
||||
@ -1,37 +0,0 @@ |
||||
# Domain glossary (core concepts from Phase 1) |
||||
|
||||
terms: |
||||
Motion: |
||||
short: "A parliamentary motion/decision" |
||||
keys: [id, title, description, date, body_text, url] |
||||
motie: |
||||
short: "Dutch: motion (motie). Equivalent to Motion in code comments and UI." |
||||
MP: |
||||
short: "Member of Parliament (kamerlid)" |
||||
keys: [mp_name, party, van, tot_en_met, persoon_id] |
||||
mp_votes: |
||||
short: "Raw voting rows: motion_id, mp_name, vote, date" |
||||
mp_metadata: |
||||
short: "Per-MP metadata table and fields" |
||||
user_sessions: |
||||
short: "Streamlit user quiz session state (session_id, user_votes, completed_motions...)" |
||||
embeddings: |
||||
short: "Raw text embeddings stored per motion (embeddings table)" |
||||
svd_vectors: |
||||
short: "SVD-derived vectors from the vote matrix (svd_vectors table)" |
||||
fused_embeddings: |
||||
short: "Concatenation of SVD and text embeddings (fused_embeddings table)" |
||||
similarity_cache: |
||||
short: "Precomputed nearest neighbors for each motion" |
||||
window_id: |
||||
short: "Processing window identifier used for SVD/fusion runs" |
||||
controversy_score: |
||||
short: "Numeric measure stored in motions table" |
||||
winning_margin: |
||||
short: "Numeric field indicating margin of win in a vote" |
||||
Politiek_Kompas: |
||||
short: "Political compass; also appears in UI features" |
||||
MP_quiz: |
||||
short: "Interactive quiz derived from motions and mp_votes" |
||||
notes: |
||||
- Use these canonical terms in docs, tests, variable names and DB schemas. |
||||
@ -0,0 +1,146 @@ |
||||
--- |
||||
title: Domain Glossary |
||||
category: domain |
||||
--- |
||||
|
||||
# Domain Glossary - Dutch Political Terms |
||||
|
||||
## CRITICAL INVARIANTS |
||||
|
||||
> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes |
||||
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT |
||||
> - Individual right-wing parties may vary slightly from the centroid |
||||
> - This is non-negotiable for any compass/axis visualization |
||||
|
||||
> **Rule 2**: SVD labels are empirically derived from voting data |
||||
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion |
||||
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative) |
||||
> - See SVD Label Derivation section below |
||||
|
||||
--- |
||||
|
||||
## SVD Label Derivation |
||||
|
||||
### The Process |
||||
|
||||
SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis: |
||||
|
||||
1. **Identify outliers**: Find the two MPs with most extreme positions on that axis |
||||
2. **Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes) |
||||
3. **Interpret theme**: Read the motion titles to derive what the axis represents |
||||
4. **Assign label**: the label describes the empirical theme; examples include: |
||||
- Left-Right |
||||
- Coalition-Opposition |
||||
- Progressive-Conservative |
||||
- EU-National sovereignty |
||||
- Populist-Establishment |
||||
- Or whatever the voting patterns show |
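Step 1 of the process above can be sketched as follows (a minimal sketch; `positions` is assumed to map MP name to coordinate on one axis):

```python
def axis_outliers(positions):
    """Return the MPs with the most extreme positive and negative
    coordinates on a single SVD axis."""
    hi = max(positions, key=positions.get)
    lo = min(positions, key=positions.get)
    return hi, lo
```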
||||
|
||||
### Example |
||||
|
||||
| Step | Description | |
||||
|------|-------------| |
||||
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 | |
||||
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 | |
||||
| 20 Motions | Immigration, integration, law & order themes dominate | |
||||
| Label | "Links-Rechts" (Left-Right) | |
||||
|
||||
### Labeling Rules |
||||
|
||||
- **Never use party names in labels** (e.g., not "PVV-SP axis") |
||||
- **Avoid purely ideological labels** (e.g., "progressive-conservative") unless the representative motions clearly support them |
||||
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy") |
||||
- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2" |
||||
|
||||
--- |
||||
|
||||
## Core Entities |
||||
|
||||
### Motion / Motie |
||||
- Parliamentary motion submitted by MPs |
||||
- Fields: `id`, `title`, `date`, `category` |
||||
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** |
||||
|
||||
### MP / Kamerlid |
||||
- Member of Parliament (Tweede Kamerlid) |
||||
- Identified by full name (e.g., "Van Dijk, I.") |
||||
- Has voting record, party affiliation, SVD position vector |
||||
|
||||
### Party / Fractie |
||||
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") |
||||
- Party centroids: average SVD position of all MPs in party |
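The centroid computation is a plain per-dimension average (a sketch; the real code operates on the `svd_vectors` table):

```python
from statistics import fmean


def party_centroid(mp_positions):
    """Average 2D SVD position over a party's MPs."""
    return (fmean(p[0] for p in mp_positions),
            fmean(p[1] for p in mp_positions))
```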
||||
|
||||
### Vote / Stemming |
||||
- Individual MP's vote on a motion: +1, 0, -1 |
||||
- Aggregated to compute SVD vectors |
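The numeric encoding can be sketched as a lookup table (the Dutch strings are assumptions based on the glossary; the actual API values may differ):

```python
# Assumed mapping from Dutch vote strings to vote-matrix values.
VOTE_VALUES = {"Voor": 1, "Tegen": -1, "Onthouding": 0}


def encode_vote(soort):
    """Numeric vote value; absent/unknown votes map to None."""
    return VOTE_VALUES.get(soort)
```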
||||
|
||||
--- |
||||
|
||||
## Time & Analysis Concepts |
||||
|
||||
### Window / Tijdsvenster |
||||
- Time period for analysis (annual or quarterly) |
||||
- Values: "2023", "2023-Q1", "2024", etc. |
||||
- SVD vectors computed per window |
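A window id parser inferred from the example values above (a sketch; the real pipeline may store windows differently):

```python
def parse_window_id(window_id):
    """Split '2024-Q1' into (2024, 1) and '2023' into (2023, None)."""
    if "-Q" in window_id:
        year, quarter = window_id.split("-Q")
        return int(year), int(quarter)
    return int(window_id), None
```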
||||
|
||||
### Trajectory |
||||
- MP's position change across multiple windows |
||||
- Computed from `svd_vectors` + window ordering |
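Ordering a trajectory can be sketched as a sort on window id, since ids like "2023-Q1" < "2023-Q2" sort correctly as strings (illustrative helper, not the real trajectory.py code):

```python
def trajectory(positions_by_window):
    """Chronologically ordered positions for one MP,
    keyed by window id ('2023-Q1', '2023-Q2', ...)."""
    return [pos for _, pos in sorted(positions_by_window.items())]
```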
||||
|
||||
--- |
||||
|
||||
## Mathematical / Algorithmic Terms |
||||
|
||||
### SVD Vector |
||||
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix |
||||
- Represents MP's position in political space |
||||
|
||||
### SVD Label |
||||
- Empirically derived axis label based on outlier MPs and representative motions |
||||
- Describes the theme of disagreement on that axis |
||||
- NOT based on party ideology or semantic labels |
||||
|
||||
### Political Compass |
||||
- 2D visualization with SVD axes mapped to compass quadrants |
||||
- X-axis: First SVD dimension (labeled from voting data) |
||||
- Y-axis: Second SVD dimension (labeled from voting data) |
||||
|
||||
### Procrustes Alignment |
||||
- Algorithm to align SVD vectors across time windows |
||||
- Ensures comparable positions across years/quarters |
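A minimal sketch of aligning two windows with `scipy.spatial.procrustes`. The rows must be the same MPs in the same order; the coordinates here are toy data:

```python
import numpy as np
from scipy.spatial import procrustes

# Positions of the same three MPs in two consecutive windows (toy data)
prev = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
curr = np.array([[0.1, 0.0], [1.1, 0.1], [0.0, 1.1]])

# procrustes standardises both matrices, then rotates/scales the second
# onto the first; `disparity` is the residual sum of squared differences.
aligned_prev, aligned_curr, disparity = procrustes(prev, curr)
```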
||||
|
||||
### UMAP |
||||
- Uniform Manifold Approximation and Projection |
||||
- Dimensionality reduction for visualization |
||||
- Optional dependency with graceful SVD fallback |
||||
|
||||
--- |
||||
|
||||
## Database Table Reference |
||||
|
||||
| Table | Key Fields | |
||||
|-------|-----------| |
||||
| `motions` | id, title, date, category | |
||||
| `mp_votes` | mp_id, motion_id, vote | |
||||
| `svd_vectors` | entity_id, window, vector_2d (list[2]) | |
||||
| `mp_party_history` | mp_id, party, start_date, end_date | |
||||
| `windows` | window_id, start_date, end_date, period_type | |
||||
| `mp_trajectories` | mp_id, window, trajectory_vector | |
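As an illustration of how these tables relate, a hypothetical query helper (column names are taken from the reference table above, not verified against the real schema):

```python
def votes_query(mp_id):
    """SQL plus parameters to fetch one MP's votes with motion dates."""
    sql = (
        "SELECT v.motion_id, v.vote, m.date "
        "FROM mp_votes v JOIN motions m ON m.id = v.motion_id "
        "WHERE v.mp_id = ?"
    )
    return sql, (mp_id,)
```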
||||
|
||||
--- |
||||
|
||||
## Dutch Political Parties |
||||
|
||||
### Canonical Right-Wing (centroid on RIGHT of axes) |
||||
- PVV (Partij voor de Vrijheid) |
||||
- FVD (Forum voor Democratie) |
||||
- JA21 |
||||
- SGP (Staatkundig Gereformeerde Partij) |
||||
|
||||
### Other Major Parties |
||||
- VVD (Volkspartij voor Vrijheid en Democratie) |
||||
- GL-PvdA (GroenLinks-PvdA) |
||||
- NSC (Nieuw Sociaal Contract) |
||||
- BBB (BoerBurgerBeweging) |
||||
- SP (Socialistische Partij) |
||||
- D66 (Democraten 66) |
||||
@ -0,0 +1,196 @@ |
||||
"""Example: TweedeKamerAPI usage - from api_client.py and actual codebase.""" |
||||
|
||||
from datetime import datetime, timedelta |
||||
from typing import Dict, List |
||||
|
||||
# Import the API client |
||||
from api_client import TweedeKamerAPI |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Basic API usage |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_fetch_motions(): |
||||
"""Fetch recent parliamentary motions from TweedeKamer API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
# Fetch motions from last 30 days |
||||
start_date = datetime.now() - timedelta(days=30) |
||||
|
||||
try: |
||||
motions = api.get_motions(start_date=start_date, limit=100) |
||||
|
||||
print(f"Fetched {len(motions)} motions") |
||||
|
||||
for motion in motions[:5]: # Show first 5 |
||||
print(f" - {motion.get('title', 'N/A')}") |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Fetching with date range |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_date_range(): |
||||
"""Fetch motions from a specific date range.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start = datetime(2024, 1, 1) |
||||
end = datetime(2024, 3, 31) # Q1 2024 |
||||
|
||||
try: |
||||
motions = api.get_motions(start_date=start, end_date=end, limit=500) |
||||
|
||||
# Group by policy area |
||||
by_area = {} |
||||
for m in motions: |
||||
area = m.get("policy_area", "Onbekend") |
||||
by_area.setdefault(area, []).append(m) |
||||
|
||||
for area, area_motions in sorted(by_area.items()): |
||||
print(f"{area}: {len(area_motions)} motions") |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Context manager usage |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_context_manager(): |
||||
"""Use API client as context manager.""" |
||||
|
||||
with TweedeKamerAPI() as api: |
||||
motions = api.get_motions( |
||||
start_date=datetime.now() - timedelta(days=7), limit=50 |
||||
) |
||||
|
||||
print(f"Fetched {len(motions)} motions this week") |
||||
|
||||
return motions |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Processing voting records |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_process_votes(): |
||||
"""Process individual voting records from API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start_date = datetime.now() - timedelta(days=7) |
||||
|
||||
try: |
||||
# Get voting records directly |
||||
voting_records, besluit_meta = api._get_voting_records( |
||||
start_date=start_date, limit=1000 |
||||
) |
||||
|
||||
print(f"Fetched {len(voting_records)} voting records") |
||||
print(f"From {len(besluit_meta)} unique decisions") |
||||
|
||||
# Count votes by party |
||||
party_votes = {} |
||||
for record in voting_records: |
||||
party = record.get("Fractie", "Onbekend") |
||||
vote = record.get("Soort", "Onbekend") |
||||
counts = party_votes.setdefault(party, {}) |
||||
counts[vote] = counts.get(vote, 0) + 1 |
||||
|
||||
for party, votes in sorted(party_votes.items()): |
||||
total = sum(votes.values()) |
||||
voor = votes.get("Voor", 0) |
||||
print(f"{party}: {total} votes ({voor} voor)") |
||||
|
||||
return voting_records |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 5: Safe API call with fallback |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_safe_call(): |
||||
"""Make API call with safe fallback on failure.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
try: |
||||
# This will return [] on any error |
||||
motions = api.get_motions( |
||||
start_date=datetime.now() - timedelta(days=30), limit=100 |
||||
) |
||||
|
||||
if not motions: |
||||
print("No motions returned - using cached data") |
||||
# Fallback to cached/local data |
||||
from database import db |
||||
|
||||
return db.get_filtered_motions(limit=10) |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 6: Pagination handling |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_pagination(): |
||||
"""Understand how pagination works in the API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start_date = datetime.now() - timedelta(days=365) |
||||
|
||||
# Simulate pagination |
||||
page_size = 250 |
||||
total_limit = 500 |
||||
|
||||
all_motions = [] |
||||
skip = 0 |
||||
|
||||
while len(all_motions) < total_limit: |
||||
print(f"Fetching page with skip={skip}...") |
||||
|
||||
# In real usage, get_motions handles pagination internally |
||||
# This demonstrates what's happening under the hood |
||||
page_motions = api._fetch_page(start_date=start_date, skip=skip, top=page_size) |
||||
|
||||
if not page_motions: |
||||
break |
||||
|
||||
all_motions.extend(page_motions) |
||||
skip += page_size |
||||
|
||||
if len(page_motions) < page_size: |
||||
break # Last page |
||||
|
||||
print(f"Total fetched: {len(all_motions)} motions") |
||||
api.close() |
||||
return all_motions |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
print("=== Basic Fetch ===") |
||||
example_fetch_motions() |
||||
|
||||
print("\n=== Process Votes ===") |
||||
example_process_votes() |
||||
@ -0,0 +1,191 @@ |
||||
"""Example: MotionDatabase usage - from database.py and actual codebase.""" |
||||
|
||||
from typing import Dict, List, Optional |
||||
import duckdb |
||||
import json |
||||
from config import config |
||||
|
||||
# Import the singleton instance |
||||
from database import db |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Getting filtered motions |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_get_filtered_motions(): |
||||
"""Get controversial motions from a specific policy area.""" |
||||
|
||||
motions = db.get_filtered_motions( |
||||
policy_area="Klimaat", |
||||
min_margin=0.0, |
||||
max_margin=0.3, # Controversial: close margin |
||||
limit=10, |
||||
) |
||||
|
||||
for motion in motions: |
||||
print(f"{motion['title']}: {motion['winning_margin']:.1%} margin") |
||||
|
||||
return motions |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Creating a voting session |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_voting_session(): |
||||
"""Create a new user session and record votes.""" |
||||
|
||||
# Create session for 10 motions |
||||
session_id = db.create_session(total_motions=10) |
||||
print(f"Created session: {session_id}") |
||||
|
||||
# Get motions for the session |
||||
motions = db.get_filtered_motions(policy_area="Alle", limit=10) |
||||
|
||||
# Record votes |
||||
for motion in motions: |
||||
# In real app, user would choose vote |
||||
vote = "Voor" # Example vote |
||||
db.record_vote(session_id=session_id, motion_id=motion["id"], vote=vote) |
||||
|
||||
# Get results |
||||
results = db.get_party_results(session_id) |
||||
|
||||
for party, result in sorted(results.items(), key=lambda x: -x[1]["agreement"]): |
||||
print(f"{party}: {result['agreement']:.1%} agreement") |
||||
|
||||
return results |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Working with DuckDB connections directly |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_direct_duckdb(): |
||||
"""Example of proper DuckDB connection handling.""" |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
# Get motion with votes |
||||
result = conn.execute( |
||||
""" |
||||
SELECT m.*, |
||||
JSON_EXTRACT(voting_results, '$.total_votes') as total_votes |
||||
FROM motions m |
||||
WHERE m.id = ? |
||||
""", |
||||
(123,), |
||||
).fetchone() |
||||
|
||||
if result: |
||||
print(f"Motion: {result[1]}") # title is index 1 |
||||
|
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Bulk operations |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_bulk_insert(): |
||||
"""Example of bulk inserting motions.""" |
||||
|
||||
# Sample data |
||||
motions = [ |
||||
{ |
||||
"title": "Motion about climate policy", |
||||
"description": "Proposal to reduce emissions", |
||||
"date": "2024-01-15", |
||||
"policy_area": "Klimaat", |
||||
"voting_results": json.dumps({"Voor": 75, "Tegen": 65}), |
||||
"winning_margin": 0.07, |
||||
"controversy_score": 0.85, |
||||
}, |
||||
{ |
||||
"title": "Motion about healthcare", |
||||
"description": "Increase healthcare budget", |
||||
"date": "2024-01-20", |
||||
"policy_area": "Zorg", |
||||
"voting_results": json.dumps({"Voor": 90, "Tegen": 50}), |
||||
"winning_margin": 0.29, |
||||
"controversy_score": 0.42, |
||||
}, |
||||
] |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
for motion in motions: |
||||
conn.execute( |
||||
""" |
||||
INSERT INTO motions |
||||
(title, description, date, policy_area, voting_results, |
||||
winning_margin, controversy_score) |
||||
VALUES (?, ?, ?, ?, ?, ?, ?) |
||||
""", |
||||
( |
||||
motion["title"], |
||||
motion["description"], |
||||
motion["date"], |
||||
motion["policy_area"], |
||||
motion["voting_results"], |
||||
motion["winning_margin"], |
||||
motion["controversy_score"], |
||||
), |
||||
) |
||||
print(f"Inserted {len(motions)} motions") |
||||
finally: |
||||
conn.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 5: Query with aggregation |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_aggregation(): |
||||
"""Example of aggregate queries.""" |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
# Get statistics by policy area |
||||
results = conn.execute(""" |
||||
SELECT |
||||
policy_area, |
||||
COUNT(*) as motion_count, |
||||
AVG(winning_margin) as avg_margin, |
||||
AVG(controversy_score) as avg_controversy |
||||
FROM motions |
||||
WHERE policy_area IS NOT NULL |
||||
GROUP BY policy_area |
||||
ORDER BY motion_count DESC |
||||
""").fetchall() |
||||
|
||||
for row in results: |
||||
print( |
||||
f"{row[0]}: {row[1]} motions, " |
||||
f"avg margin {row[2]:.1%}, " |
||||
f"controversy {row[3]:.2f}" |
||||
) |
||||
|
||||
return results |
||||
finally: |
||||
conn.close() |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
print("=== Filtered Motions ===") |
||||
example_get_filtered_motions() |
||||
|
||||
print("\n=== Aggregation ===") |
||||
example_aggregation() |
||||
@ -0,0 +1,217 @@ |
||||
"""Example: Pipeline phase execution - from pipeline/run_pipeline.py and actual codebase.""" |
||||
|
||||
import argparse |
||||
from datetime import date, timedelta |
||||
from typing import List, Tuple |
||||
|
||||
# Import pipeline modules |
||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
||||
from pipeline.extract_mp_votes import extract_mp_votes |
||||
from pipeline.svd_pipeline import run_svd_pipeline |
||||
from pipeline.text_pipeline import run_text_pipeline |
||||
from pipeline.fusion import run_fusion |
||||
|
||||
from database import MotionDatabase |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Running full pipeline |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_full_pipeline(): |
||||
"""Run the complete data ingestion pipeline.""" |
||||
|
||||
# Parse arguments like CLI would |
||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
||||
parser.add_argument("--db-path", default="data/motions.db") |
||||
parser.add_argument("--start-date", default=None) |
||||
parser.add_argument("--end-date", default=None) |
||||
parser.add_argument( |
||||
"--window-size", choices=["quarterly", "annual"], default="quarterly" |
||||
) |
||||
parser.add_argument("--svd-k", type=int, default=50) |
||||
|
||||
args = parser.parse_args([]) |
||||
|
||||
# Resolve dates |
||||
end_date = date.fromisoformat(args.end_date) if args.end_date else date.today() |
||||
start_date = ( |
||||
date.fromisoformat(args.start_date) |
||||
if args.start_date |
||||
else end_date - timedelta(days=730) |
||||
) |
||||
|
||||
print(f"Running pipeline: {start_date} → {end_date}") |
||||
print(f"Window size: {args.window_size}") |
||||
print(f"DB path: {args.db_path}") |
||||
|
||||
# Initialize database |
||||
db = MotionDatabase(args.db_path) |
||||
|
||||
# Phase 1: Fetch MP metadata |
||||
print("\n=== Phase 1: MP Metadata ===") |
||||
n_mp = fetch_mp_metadata(db_path=args.db_path) |
||||
print(f"Processed {n_mp} MPs") |
||||
|
||||
# Phase 2: Extract MP votes |
||||
print("\n=== Phase 2: Extract Votes ===") |
||||
n_votes = extract_mp_votes(db_path=args.db_path) |
||||
print(f"Extracted {n_votes} vote records") |
||||
|
||||
# Phase 3: Generate time windows and compute SVD per window |
||||
print("\n=== Phase 3: SVD Pipeline ===") |
||||
windows = generate_windows(start_date, end_date, args.window_size) |
||||
print(f"Generated {len(windows)} windows: {windows}") |
||||
run_svd_pipeline(db, windows, args.svd_k) |
||||
print(f"Computed SVD for {len(windows)} windows") |
||||
|
||||
# Phase 4: Text embeddings |
||||
print("\n=== Phase 4: Text Embeddings ===") |
||||
run_text_pipeline(args.db_path, batch_size=50) |
||||
print("Text embeddings completed") |
||||
|
||||
# Phase 5: Fusion |
||||
print("\n=== Phase 5: Fusion ===") |
||||
run_fusion(args.db_path, windows) |
||||
print("Fusion completed") |
||||
|
||||
print("\n=== Pipeline Complete ===") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Generate time windows |
||||
# ============================================================================= |
||||
|
||||
|
||||
def generate_windows( |
||||
start: date, end: date, granularity: str |
||||
) -> List[Tuple[str, str, str]]: |
||||
"""Generate time windows for pipeline processing.""" |
||||
|
||||
windows = [] |
||||
cursor = date(start.year, start.month, 1) |
||||
|
||||
if granularity == "annual": |
||||
cursor = date(start.year, 1, 1) |
||||
while cursor <= end: |
||||
year_end = date(cursor.year, 12, 31) |
||||
w_end = min(year_end, end) |
||||
windows.append((str(cursor.year), cursor.isoformat(), w_end.isoformat())) |
||||
cursor = date(cursor.year + 1, 1, 1) |
||||
else: |
||||
# quarterly |
||||
quarter_starts = {1: 1, 2: 4, 3: 7, 4: 10} |
||||
quarter_ends = {1: 3, 2: 6, 3: 9, 4: 12} |
||||
|
||||
import calendar |
||||
 |
||||
q = (cursor.month - 1) // 3 + 1 |
||||
cursor = date(cursor.year, quarter_starts[q], 1) |
||||
 |
||||
while cursor <= end: |
||||
q = (cursor.month - 1) // 3 + 1 |
||||
q_end_month = quarter_ends[q] |
||||
last_day = calendar.monthrange(cursor.year, q_end_month)[1] |
||||
q_end = date(cursor.year, q_end_month, last_day) |
||||
w_end = min(q_end, end) |
||||
window_id = f"{cursor.year}-Q{q}" |
||||
windows.append((window_id, cursor.isoformat(), w_end.isoformat())) |
||||
cursor = q_end + timedelta(days=1) |
||||
|
||||
return windows |
||||
|
||||
|
||||
def example_window_generation(): |
||||
"""Example of window generation.""" |
||||
|
||||
start = date(2023, 1, 1) |
||||
end = date(2024, 6, 30) |
||||
|
||||
print("Quarterly windows:") |
||||
quarterly = generate_windows(start, end, "quarterly") |
||||
for wid, s, e in quarterly: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
print("\nAnnual windows:") |
||||
annual = generate_windows(start, end, "annual") |
||||
for wid, s, e in annual: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Running individual phases |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_individual_phases(): |
||||
"""Run pipeline phases individually for debugging.""" |
||||
|
||||
db_path = "data/motions.db" |
||||
db = MotionDatabase(db_path) |
||||
|
||||
# Only run MP metadata fetch |
||||
print("Fetching MP metadata...") |
||||
n = fetch_mp_metadata(db_path=db_path) |
||||
print(f" {n} MPs processed") |
||||
|
||||
# Only run vote extraction |
||||
print("Extracting votes...") |
||||
n = extract_mp_votes(db_path=db_path) |
||||
print(f" {n} votes extracted") |
||||
|
||||
# Only run SVD for specific window |
||||
print("Computing SVD...") |
||||
windows = [("2024-Q1", "2024-01-01", "2024-03-31")] |
||||
run_svd_pipeline(db, windows, k=50) |
||||
print(" SVD computed") |
||||
|
||||
# Only run text embeddings |
||||
print("Computing embeddings...") |
||||
run_text_pipeline(db_path, batch_size=25) # Smaller batch for testing |
||||
print(" Embeddings computed") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Dry run |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_dry_run(): |
||||
"""Show what pipeline would do without making changes.""" |
||||
|
||||
print("DRY RUN - no writes will be made") |
||||
|
||||
start_date = date(2024, 1, 1) |
||||
end_date = date(2024, 6, 30) |
||||
|
||||
# Generate and show windows |
||||
windows = generate_windows(start_date, end_date, "quarterly") |
||||
|
||||
print(f"Would process {len(windows)} windows:") |
||||
for wid, s, e in windows: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
print("\nWould run phases:") |
||||
print(" 1. fetch_mp_metadata") |
||||
print(" 2. extract_mp_votes") |
||||
print(" 3. svd_pipeline") |
||||
print(" 4. text_pipeline") |
||||
print(" 5. fusion") |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
import logging |
||||
|
||||
logging.basicConfig( |
||||
level=logging.INFO, |
||||
format="%(asctime)s %(levelname)s %(name)s: %(message)s", |
||||
) |
||||
|
||||
print("=== Window Generation ===") |
||||
example_window_generation() |
||||
|
||||
print("\n=== Dry Run ===") |
||||
example_dry_run() |
||||
@ -1,15 +1,108 @@ |
||||
# DO NOT EDIT - read-only until validated |
||||
# Sanitized manifest: contains non-sensitive sample excerpts only |
||||
files: |
||||
- path: src/lib/schema.ts |
||||
evidence_excerpt: "Defines schema for user input validation" |
||||
flags: |
||||
needs_review: true |
||||
- path: src/api/handler.ts |
||||
evidence_excerpt: "Handles API requests and routing" |
||||
flags: |
||||
needs_review: false |
||||
- path: README.md |
||||
evidence_excerpt: "Project overview and setup instructions" |
||||
flags: |
||||
needs_review: true |
||||
# stemwijzer Mind Model - Manifest |
||||
# Generated: 2026-04-12 |
||||
# Phase: 2 - Assembly from Phase 1 Analysis |
||||
|
||||
name: stemwijzer |
||||
version: 2 |
||||
description: Dutch political voting compass (Stemwijzer) - Mind Model constraints |
||||
|
||||
categories: |
||||
# Core documentation |
||||
- path: system.md |
||||
description: System overview and architecture summary |
||||
group: docs |
||||
- path: stack/stack.md |
||||
description: Technology stack with versions and purposes |
||||
group: stack |
||||
- path: domain/domain-glossary.md |
||||
description: Domain entities, terms, relationships, and CRITICAL INVARIANTS |
||||
group: domain |
||||
|
||||
# Design patterns |
||||
- path: patterns/patterns.yaml |
||||
description: Code patterns (Singleton, Repository, Pipeline, etc.) |
||||
group: patterns |
||||
- path: patterns/streamlit.yaml |
||||
description: Streamlit-specific patterns (session state, cache) |
||||
group: patterns |
||||
- path: patterns/api.yaml |
||||
description: API client patterns with retry and pagination |
||||
group: patterns |
||||
- path: patterns/database.yaml |
||||
description: DuckDB patterns and connection management |
||||
group: patterns |
||||
- path: patterns/python.yaml |
||||
description: Python-specific patterns (dataclass, typing) |
||||
group: patterns |
||||
- path: patterns/duckdb-access.md |
||||
description: DuckDB connection patterns and best practices |
||||
group: patterns |
||||
- path: patterns/embeddings-similarity.md |
||||
description: Embeddings and similarity computation patterns |
||||
group: patterns |
||||
- path: patterns/error-handling.md |
||||
description: Error handling and exception patterns |
||||
group: patterns |
||||
- path: patterns/module-singletons.md |
||||
description: Module-level singleton patterns |
||||
group: patterns |
||||
- path: patterns/requests-http.md |
||||
description: HTTP client patterns with retry |
||||
group: patterns |
||||
- path: patterns/validation.md |
||||
description: Input validation patterns |
||||
group: patterns |
||||
|
||||
# Coding constraints |
||||
- path: constraints/error-handling.md |
||||
description: Error handling patterns with safe fallbacks |
||||
group: constraints |
||||
- path: constraints/logging.md |
||||
description: Logging conventions |
||||
group: constraints |
||||
- path: constraints/naming.yaml |
||||
description: File, class, function naming rules |
||||
group: constraints |
||||
- path: constraints/imports.yaml |
||||
description: Import organization and module structure |
||||
group: constraints |
||||
- path: constraints/types.yaml |
||||
description: Type hint conventions |
||||
group: constraints |
||||
- path: constraints/testing.yaml |
||||
description: Testing conventions |
||||
group: constraints |
||||
|
||||
# Anti-patterns |
||||
- path: anti-patterns/anti-patterns.md |
||||
description: Known anti-patterns with evidence and fixes |
||||
group: anti-patterns |
||||
|
||||
# Dependencies |
||||
- path: dependencies/dependencies.md |
||||
description: Library usage and singleton instances |
||||
group: dependencies |
||||
|
||||
# Code examples |
||||
- path: examples/database-example.py |
||||
description: MotionDatabase usage examples |
||||
group: examples |
||||
- path: examples/api-client-example.py |
||||
description: TweedeKamerAPI usage examples |
||||
group: examples |
||||
- path: examples/pipeline-example.py |
||||
description: Pipeline orchestration examples |
||||
group: examples |
||||
- path: examples/streamlit-page-example.py |
||||
description: Streamlit page patterns |
||||
group: examples |
||||
- path: examples/pattern-examples.md |
||||
description: Consolidated pattern examples |
||||
group: examples |
||||
|
||||
# Phase 1 findings summary: |
||||
# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN) |
||||
# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline |
||||
# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except |
||||
# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons |
||||
# - 3 groups: stdlib, 3rd party, local imports |
||||
|
||||
@ -0,0 +1,265 @@ |
||||
# API Client Patterns |
||||
|
||||
## Base API Client Pattern |
||||
|
||||
Using requests.Session for connection pooling: |
||||
|
||||
```python
# api_client.py
import logging
from datetime import datetime, timedelta
from typing import Dict, List, Optional

import requests

from config import config

logger = logging.getLogger(__name__)


class TweedeKamerAPI:
    def __init__(self):
        self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
        self.session = requests.Session()
        self.session.headers.update({
            "Accept": "application/json",
            "User-Agent": "Dutch-Political-Compass-Tool/1.0",
        })

    def get_motions(
        self,
        start_date: Optional[datetime] = None,
        end_date: Optional[datetime] = None,
        limit: int = 500,
    ) -> List[Dict]:
        """Get motions with voting results using OData API."""
        if not start_date:
            start_date = datetime.now() - timedelta(days=730)

        try:
            voting_records, besluit_meta = self._get_voting_records(
                start_date, end_date, limit
            )
            return self._process_voting_records(voting_records, besluit_meta)
        except Exception:
            logger.exception("Error fetching motions from API")
            return []
```
||||
|
||||
## OData Pagination Pattern |
||||
|
||||
Handle server-side pagination with $skip: |
||||
|
||||
```python
def _get_voting_records(
    self,
    start_date: datetime,
    end_date: Optional[datetime] = None,
    limit: int = 50000,
) -> tuple:
    """Fetch voting records (plus per-besluit metadata) with automatic pagination."""

    filter_query = (
        f"GewijzigdOp ge {start_date.strftime('%Y-%m-%d')}T00:00:00Z"
        " and StemmingsSoort ne null"
        " and Verwijderd eq false"
    )
    if end_date:
        filter_query += f" and GewijzigdOp le {end_date.strftime('%Y-%m-%d')}T23:59:59Z"

    page_size = 250  # API caps $top at 250
    base_url = f"{self.odata_base_url}/Besluit"
    base_params = {
        "$filter": filter_query,
        "$top": page_size,
        "$expand": "Stemming",
        "$orderby": "GewijzigdOp desc",
    }

    all_records = []
    besluit_meta = {}  # per-besluit metadata; collection elided in this excerpt
    skip = 0

    while len(all_records) < limit:
        params = {**base_params, "$skip": skip}
        response = self.session.get(
            base_url,
            params=params,
            timeout=config.API_TIMEOUT,
        )
        response.raise_for_status()
        data = response.json()

        besluit_page = data.get("value", [])
        if not besluit_page:
            break

        # Process page
        for besluit in besluit_page:
            all_records.extend(self._extract_votes(besluit))

        skip += page_size

    return all_records[:limit], besluit_meta
```
||||
|
||||
## Retry with Backoff Pattern |
||||
|
||||
For transient failures: |
||||
|
||||
```python
# ai_provider.py
import random
import time

import requests
from requests.exceptions import ConnectionError


def _post_with_retries(
    path: str,
    json: dict,
    retries: int = 3,
) -> requests.Response:
    """POST with exponential backoff retry.

    BASE_URL and HEADERS are module-level constants (elided here).
    """
    url = f"{BASE_URL}{path}"
    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(url, json=json, headers=HEADERS, timeout=10)

            # Handle rate limiting
            if resp.status_code == 429:
                if attempt == retries:
                    raise ProviderError("Rate limited")

                retry_after = resp.headers.get("Retry-After")
                if retry_after:
                    time.sleep(int(retry_after))
                else:
                    sleep = backoff * (2 ** (attempt - 1))
                    sleep += random.uniform(0, sleep * 0.1)  # jitter
                    time.sleep(sleep)
                continue

            # Handle server errors
            if 500 <= resp.status_code < 600:
                if attempt == retries:
                    raise ProviderError(f"Server error: {resp.status_code}")
                time.sleep(backoff * (2 ** (attempt - 1)))
                continue

            return resp

        except ConnectionError as exc:
            if attempt == retries:
                raise ProviderError(f"Connection error: {exc}") from exc
            time.sleep(backoff * (2 ** (attempt - 1)))

    raise ProviderError("Failed after retries")
```
||||
|
||||
## Batch Processing Pattern |
||||
|
||||
Process items in batches to manage API limits: |
||||
|
||||
```python |
||||
def get_embeddings_with_retry( |
||||
texts: List[str], |
||||
batch_size: int = 50, |
||||
retries: int = 3, |
||||
) -> List[Optional[List[float]]]: |
||||
"""Process embeddings in batches with fallback to single items.""" |
||||
|
||||
results = [None] * len(texts) |
||||
|
||||
i = 0 |
||||
while i < len(texts): |
||||
end = min(len(texts), i + batch_size) |
||||
chunk = texts[i:end] |
||||
|
||||
# Try batch first |
||||
try: |
||||
emb_chunk = get_embeddings_batch(chunk) |
||||
for j, emb in enumerate(emb_chunk): |
||||
results[i + j] = emb |
||||
i = end |
||||
continue |
||||
        except Exception:
            pass  # batch failed; fall back to per-item requests below
||||
|
||||
# Fallback: single items |
||||
for j, text in enumerate(chunk): |
||||
try: |
||||
results[i + j] = get_embedding(text) |
||||
except Exception: |
||||
results[i + j] = None |
||||
|
||||
i = end |
||||
|
||||
return results |
||||
``` |
||||
|
||||
## Response Validation Pattern |
||||
|
||||
Validate API responses before processing: |
||||
|
||||
```python |
||||
def _process_response(self, response: requests.Response) -> Dict: |
||||
"""Validate and parse API response.""" |
||||
|
||||
response.raise_for_status() |
||||
data = response.json() |
||||
|
||||
if "value" not in data: |
||||
raise ValueError("Unexpected response format: missing 'value' key") |
||||
|
||||
return data |
||||
|
||||
def _validate_besluit(self, besluit: Dict) -> bool: |
||||
"""Check required fields exist.""" |
||||
required = ["Id", "GewijzigdOp"] |
||||
return all(field in besluit for field in required) |
||||
``` |
||||
|
||||
## Error Handling Patterns |
||||
|
||||
Always provide safe fallbacks: |
||||
|
||||
```python |
||||
def safe_api_call(self, endpoint: str, params: Dict = None) -> List[Dict]: |
||||
"""Call API with error handling and fallback.""" |
||||
try: |
||||
response = self.session.get( |
||||
endpoint, |
||||
params=params, |
||||
timeout=config.API_TIMEOUT |
||||
) |
||||
response.raise_for_status() |
||||
data = response.json() |
||||
return data.get("value", []) |
||||
except requests.Timeout: |
||||
_logger.warning(f"API timeout for {endpoint}") |
||||
return [] |
||||
except requests.HTTPError as e: |
||||
_logger.error(f"HTTP error: {e}") |
||||
return [] |
||||
except Exception as e: |
||||
_logger.error(f"API call failed: {e}") |
||||
return [] |
||||
``` |
||||
|
||||
## Session Management |
||||
|
||||
Reuse session for connection pooling: |
||||
|
||||
```python |
||||
class TweedeKamerAPI: |
||||
def __init__(self): |
||||
self.session = requests.Session() |
||||
self.session.headers.update({ |
||||
"Accept": "application/json", |
||||
"User-Agent": "Dutch-Political-Compass-Tool/1.0", |
||||
}) |
||||
|
||||
def close(self): |
||||
"""Clean up session when done.""" |
||||
self.session.close() |
||||
|
||||
def __enter__(self): |
||||
return self |
||||
|
||||
def __exit__(self, *args): |
||||
self.close() |
||||
|
||||
# Usage
start_date = datetime.now() - timedelta(days=365)
with TweedeKamerAPI() as api:
    motions = api.get_motions(start_date)
||||
``` |
||||
@ -0,0 +1,230 @@ |
||||
# Architectural Patterns |
||||
|
||||
## Repository Pattern |
||||
|
||||
The `MotionDatabase` class acts as a repository, encapsulating all database operations behind a clean interface. |
||||
|
||||
```python |
||||
# database.py |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._init_database() |
||||
|
||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
||||
"""Get a single motion by ID.""" |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone() |
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
def get_filtered_motions( |
||||
self, |
||||
policy_area: str = "Alle", |
||||
min_margin: float = 0.0, |
||||
max_margin: float = 1.0, |
||||
limit: int = 10 |
||||
) -> List[Dict]: |
||||
"""Get filtered list of motions.""" |
||||
... |
||||
``` |
||||
|
||||
**Usage**: Import the singleton instance for all DB operations. |
||||
```python |
||||
from database import db |
||||
|
||||
motions = db.get_filtered_motions(policy_area="Klimaat", limit=20) |
||||
``` |
||||
|
||||
## Facade Pattern |
||||
|
||||
Simplified interfaces over complex subsystems. |
||||
|
||||
### MotionDatabase Facade |
||||
```python |
||||
# Single entry point for all database operations |
||||
db = MotionDatabase() # Singleton instance |
||||
|
||||
# Operations are abstracted: |
||||
db.create_session(total_motions) |
||||
db.record_vote(session_id, motion_id, vote) |
||||
db.get_party_results(session_id) |
||||
``` |
||||
|
||||
### API Client Facade |
||||
```python |
||||
# api_client.py |
||||
class TweedeKamerAPI: |
||||
def __init__(self): |
||||
self.session = requests.Session() # Connection pooling |
||||
|
||||
def get_motions(self, start_date, end_date) -> List[Dict]: |
||||
"""Simple interface hiding OData pagination details.""" |
||||
voting_records, besluit_meta = self._get_voting_records(start_date, end_date) |
||||
return self._process_voting_records(voting_records, besluit_meta) |
||||
``` |
||||
|
||||
### MotionScraper Facade |
||||
```python |
||||
# scraper.py (if used) |
||||
class MotionScraper: |
||||
def get_motion_content(self, url: str) -> Optional[str]: |
||||
"""Extract body text from official website.""" |
||||
... |
||||
``` |
||||
|
||||
## Pipeline Pattern |
||||
|
||||
Sequential phases with explicit dependencies: |
||||
|
||||
``` |
||||
pipeline/run_pipeline.py |
||||
├── Phase 1: fetch_mp_metadata |
||||
│ └── pipeline/fetch_mp_metadata.py |
||||
├── Phase 2: extract_mp_votes |
||||
│ └── pipeline/extract_mp_votes.py |
||||
├── Phase 3: svd_pipeline |
||||
│ └── pipeline/svd_pipeline.py |
||||
├── Phase 4: text_pipeline (gap-fill) |
||||
│ └── pipeline/text_pipeline.py |
||||
└── Phase 5: fusion (combine SVD + text) |
||||
└── pipeline/fusion.py |
||||
``` |
||||
|
||||
### Phase Orchestration |
||||
```python |
||||
# pipeline/run_pipeline.py |
||||
def run(args: argparse.Namespace) -> int: |
||||
db = MotionDatabase(args.db_path) |
||||
|
||||
# Phase 1: MP metadata |
||||
if not args.skip_metadata: |
||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
||||
fetch_mp_metadata(db_path=db.db_path) |
||||
|
||||
# Phase 2: Extract votes |
||||
if not args.skip_extract: |
||||
from pipeline.extract_mp_votes import extract_mp_votes |
||||
extract_mp_votes(db_path=db.db_path) |
||||
|
||||
# Phase 3: SVD per window |
||||
if not args.skip_svd: |
||||
from pipeline.svd_pipeline import run_svd_pipeline |
||||
run_svd_pipeline(db, windows, args.svd_k) |
||||
|
||||
# ... additional phases |
||||
``` |
||||
|
||||
## Strategy Pattern |
||||
|
||||
Interchangeable algorithms for axis computation: |
||||
|
||||
```python |
||||
# analysis/political_axis.py |
||||
def compute_political_axis( |
||||
vectors: Dict[str, np.ndarray], |
||||
method: str = "pca" # or "anchor" |
||||
) -> Tuple[np.ndarray, np.ndarray]: |
||||
"""Compute political axis using specified method. |
||||
|
||||
Methods: |
||||
- 'pca': Use first principal component |
||||
- 'anchor': Use predefined anchor motions |
||||
""" |
||||
    if method == "pca":
        return _compute_pca_axis(vectors)
    elif method == "anchor":
        return _compute_anchor_axis(vectors)
    raise ValueError(f"Unknown axis method: {method!r}")
```
||||
|
||||
## Visitor Pattern |
||||
|
||||
External operations on data structures: |
||||
|
||||
```python |
||||
# analysis/trajectory.py |
||||
def _procrustes_align_windows( |
||||
window_vecs: Dict[str, Dict[str, np.ndarray]], |
||||
min_overlap: int = 5, |
||||
) -> Dict[str, Dict[str, np.ndarray]]: |
||||
"""Align SVD vectors across windows using Procrustes rotations. |
||||
|
||||
Takes the first window as reference and aligns each subsequent window |
||||
to it via orthogonal Procrustes on the set of common entities. |
||||
""" |
||||
``` |
||||
|
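The signature above carries only the docstring; a minimal numpy-only sketch of the per-window alignment step (`procrustes_align` and the exact dict shapes are illustrative, not the project's `_procrustes_align_windows`):

```python
import numpy as np
from typing import Dict


def procrustes_align(
    reference: Dict[str, np.ndarray],
    window: Dict[str, np.ndarray],
    min_overlap: int = 5,
) -> Dict[str, np.ndarray]:
    """Rotate `window` vectors onto `reference` via orthogonal Procrustes."""
    common = sorted(set(reference) & set(window))
    if len(common) < min_overlap:
        return window  # not enough shared entities to estimate a rotation
    A = np.stack([reference[k] for k in common])  # target configuration
    B = np.stack([window[k] for k in common])     # source configuration
    # R = argmin_R ||B R - A||_F over orthogonal R, via SVD of B^T A
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return {k: v @ R for k, v in window.items()}
```

Because R is estimated only from the common entities, the same rotation can then be applied to every vector in the window.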
||||
## Builder Pattern |
||||
|
||||
Configuration via method chaining: |
||||
|
||||
```python |
||||
# CLI argument parsing |
||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
||||
parser.add_argument("--db-path", default="data/motions.db") |
||||
parser.add_argument("--start-date", default=None) |
||||
parser.add_argument("--end-date", default=None) |
||||
parser.add_argument("--window-size", choices=["quarterly", "annual"], default="quarterly") |
||||
parser.add_argument("--svd-k", type=int, default=50) |
||||
``` |
||||
|
||||
## Decorator Pattern |
||||
|
||||
Retry logic for transient failures: |
||||
|
||||
```python
# pipeline/ai_provider_wrapper.py
def get_embeddings_with_retry(
    texts: List[str],
    retries: int = 3,
    batch_size: int = 50,
) -> List[Optional[List[float]]]:
    """Return embeddings with automatic retry on failure."""
    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            return _embedder(texts, batch_size=batch_size)
        except Exception:
            if attempt == retries:
                break
            time.sleep(backoff * (2 ** (attempt - 1)))
    return [None] * len(texts)  # Safe fallback
```
||||
|
||||
## Data Patterns |
||||
|
||||
### Batch Processing |
||||
Process items in chunks to manage memory and API limits: |
||||
```python |
||||
for i in range(0, len(items), batch_size): |
||||
chunk = items[i:i + batch_size] |
||||
process_batch(chunk) |
||||
``` |
||||
|
||||
### Caching |
||||
Pre-compute and store expensive results: |
||||
```python |
||||
# SimilarityCache table stores computed similarities |
||||
db.get_similarity(motion_a, motion_b) |
||||
``` |
||||
|
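As a sketch of the caching idea (this in-memory class is hypothetical; the project persists results in the SimilarityCache table):

```python
class SimilarityCache:
    """In-memory sketch of the SimilarityCache idea; `compute` is any
    pairwise similarity callable."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}

    def get(self, a, b):
        key = (min(a, b), max(a, b))  # cosine similarity is symmetric
        if key not in self._store:
            self._store[key] = self._compute(*key)
        return self._store[key]
```

Normalizing the key to a sorted pair means `(a, b)` and `(b, a)` hit the same cache entry.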
||||
### Lazy Loading |
||||
Load data only when needed: |
||||
```python |
||||
class MotionDatabase: |
||||
@property |
||||
def _connection(self): |
||||
if self._conn is None: |
||||
self._conn = duckdb.connect(self.db_path) |
||||
return self._conn |
||||
``` |
||||
|
||||
### Vectorization |
||||
Use numpy for batch operations: |
||||
```python
vectors = np.array(list(entity_vectors.values()))
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
normalized = vectors / norms
```
||||
@ -0,0 +1,239 @@ |
||||
# DuckDB Database Patterns |
||||
|
||||
## Connection Management |
||||
|
||||
### Pattern 1: Short-lived per Method (Most Common) |
||||
|
||||
Always create a new connection, use try/finally for cleanup: |
||||
|
||||
```python
# database.py
class MotionDatabase:
    def get_motion(self, motion_id: int) -> Optional[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            return conn.execute(
                "SELECT * FROM motions WHERE id = ?",
                (motion_id,),
            ).fetchone()
        finally:
            conn.close()

    def get_filtered_motions(
        self,
        policy_area: str = "Alle",
        min_margin: float = 0.0,
        max_margin: float = 1.0,
        limit: int = 10,
    ) -> List[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            query = """
                SELECT * FROM motions
                WHERE (? = 'Alle' OR policy_area = ?)
                  AND winning_margin BETWEEN ? AND ?
                ORDER BY RANDOM()
                LIMIT ?
            """
            return conn.execute(
                query, (policy_area, policy_area, min_margin, max_margin, limit)
            ).fetchall()
        finally:
            conn.close()
```
||||
|
||||
### Pattern 2: With Statement (Cleaner) |
||||
|
||||
```python |
||||
def execute_query(self, query: str, params: tuple = ()): |
||||
with duckdb.connect(self.db_path) as conn: |
||||
return conn.execute(query, params).fetchall() |
||||
``` |
||||
|
||||
### Pattern 3: Lazy Connection Caching |
||||
|
||||
For frequently accessed connections: |
||||
|
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._conn = None |
||||
|
||||
@property |
||||
def connection(self): |
||||
if self._conn is None: |
||||
self._conn = duckdb.connect(self.db_path) |
||||
return self._conn |
||||
|
||||
def close(self): |
||||
if self._conn: |
||||
self._conn.close() |
||||
self._conn = None |
||||
``` |
||||
|
||||
## Table Initialization |
||||
|
||||
Create tables with proper constraints and sequences: |
||||
|
||||
```python |
||||
def _init_database(self): |
||||
conn = duckdb.connect(self.db_path) |
||||
|
||||
# Create sequence for auto-incrementing IDs |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
    except Exception:
        pass  # sequence may already exist
||||
|
||||
# Create tables |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS motions ( |
||||
id INTEGER DEFAULT nextval('motions_id_seq'), |
||||
title TEXT NOT NULL, |
||||
description TEXT, |
||||
date DATE, |
||||
policy_area TEXT, |
||||
voting_results JSON, |
||||
winning_margin FLOAT, |
||||
controversy_score FLOAT, |
||||
layman_explanation TEXT, |
||||
externe_identifier TEXT, |
||||
body_text TEXT, |
||||
url TEXT UNIQUE, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
|
||||
# Add columns to existing tables safely |
||||
try: |
||||
conn.execute("ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text TEXT") |
||||
except Exception: |
||||
pass # Column may already exist |
||||
|
||||
conn.close() |
||||
``` |
||||
|
||||
## JSON Column Handling |
||||
|
||||
Store and retrieve JSON data: |
||||
|
||||
```python
import json

# Insert JSON
def store_motion(self, motion: Dict):
    conn = duckdb.connect(self.db_path)
    try:
        conn.execute(
            "INSERT INTO motions (title, voting_results) VALUES (?, ?)",
            (motion["title"], json.dumps(motion["voting_results"])),
        )
    finally:
        conn.close()

# Query JSON
def get_motions_with_votes(self, party: str) -> List[Dict]:
    conn = duckdb.connect(self.db_path)
    try:
        return conn.execute("""
            SELECT title, voting_results
            FROM motions
            WHERE JSON_EXTRACT(voting_results, '$.party') = ?
        """, (party,)).fetchall()
    finally:
        conn.close()
```
||||
|
||||
## Query Patterns |
||||
|
||||
### Parameterized Queries (Always!) |
||||
```python |
||||
# SAFE - uses parameterized query |
||||
conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)) |
||||
|
||||
# AVOID - SQL injection risk |
||||
# conn.execute(f"SELECT * FROM motions WHERE id = {motion_id}") # BAD! |
||||
``` |
||||
|
||||
### Batch Inserts |
||||
```python
def bulk_insert_motions(self, motions: List[Dict]):
    conn = duckdb.connect(self.db_path)
    try:
        conn.executemany(
            """INSERT OR IGNORE INTO motions
               (title, date, policy_area) VALUES (?, ?, ?)""",
            [(m["title"], m["date"], m["policy_area"]) for m in motions],
        )
    finally:
        conn.close()
```
||||
|
||||
### Aggregation Queries |
||||
```python
def get_party_vote_stats(self, party: str) -> Dict:
    conn = duckdb.connect(self.db_path)
    try:
        result = conn.execute("""
            SELECT
                COUNT(*) AS total_votes,
                SUM(CASE WHEN vote = 'Voor' THEN 1 ELSE 0 END) AS voor,
                SUM(CASE WHEN vote = 'Tegen' THEN 1 ELSE 0 END) AS tegen
            FROM mp_votes
            WHERE party = ?
        """, (party,)).fetchone()
        return {"total": result[0], "voor": result[1], "tegen": result[2]}
    except Exception as e:
        _logger.error(f"Vote stats query failed: {e}")
        return {"total": 0, "voor": 0, "tegen": 0}
    finally:
        conn.close()
```
||||
|
||||
## Error Handling |
||||
|
||||
Always close connections in finally block or with context manager: |
||||
|
||||
```python |
||||
def safe_query(self, query: str, params: tuple = ()): |
||||
conn = None |
||||
try: |
||||
conn = duckdb.connect(self.db_path) |
||||
result = conn.execute(query, params).fetchall() |
||||
return result |
||||
except Exception as e: |
||||
_logger.error(f"Query failed: {e}") |
||||
return [] |
||||
finally: |
||||
if conn: |
||||
conn.close() |
||||
``` |
||||
|
||||
## Testing with Mock |
||||
|
||||
For unit tests without DuckDB: |
||||
|
||||
```python |
||||
# In MotionDatabase.__init__ |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._file_mode = duckdb is None |
||||
|
||||
if duckdb is None: |
||||
# Create JSON fallback files |
||||
for p in (f"{db_path}.embeddings.json", f"{db_path}.similarity_cache.json"): |
||||
if not os.path.exists(p): |
||||
with open(p, "w") as fh: |
||||
fh.write("[]") |
||||
else: |
||||
self._init_database() |
||||
``` |
||||
@ -0,0 +1,79 @@ |
||||
--- |
||||
title: DuckDB Access Pattern |
||||
category: patterns |
||||
--- |
||||
# DuckDB Access Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. |
||||
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. |
||||
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. |
||||
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). |
||||
|
||||
## Examples |
||||
|
||||
### database.py - Explicit connect/close for schema init |
||||
|
||||
```python |
||||
conn = duckdb.connect(self.db_path) |
||||
... |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS fused_embeddings ( |
||||
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
window_id TEXT NOT NULL, |
||||
vector JSON NOT NULL, |
||||
svd_dims INTEGER NOT NULL, |
||||
text_dims INTEGER NOT NULL, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
conn.close() |
||||
``` |
||||
|
||||
### pipeline/svd_pipeline.py - Read-only connection |
||||
|
||||
```python |
||||
conn = duckdb.connect(db_path, read_only=True) |
||||
try: |
||||
rows = conn.execute( |
||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
||||
(start_date, end_date), |
||||
).fetchall() |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
|
||||
### similarity/compute.py - Preferred 'with' context |
||||
|
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: |
||||
logger.exception("duckdb import failed; cannot load vectors") |
||||
return 0 |
||||
|
||||
with duckdb.connect(db.db_path) as conn: |
||||
rows = conn.execute(query, params).fetchall() |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Connection without closure |
||||
|
||||
```python |
||||
# BAD: connection may leak if exception occurs before explicit close |
||||
conn = duckdb.connect(db_path) |
||||
rows = conn.execute("SELECT ...").fetchall() |
||||
# missing finally/close |
||||
``` |
||||
|
||||
**Remediation**: Use "with" context or ensure conn.close() in finally block. |
||||
|
||||
### Bad: Parallel write connections |
||||
|
||||
**Problem**: Opening write connections from many parallel workers without coordination. |
||||
|
||||
**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. |
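A minimal sketch of the single-writer remediation, assuming each write can be handed to a callable (the class and names are illustrative; `apply_write` would typically open a short-lived duckdb connection):

```python
import queue
import threading


class SingleWriter:
    """Serialize all writes through one worker thread.

    `apply_write` performs one write (e.g. via a short-lived duckdb
    connection); compute workers elsewhere stay read_only.
    """

    def __init__(self, apply_write):
        self._q = queue.Queue()
        self._apply = apply_write
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # sentinel: shut down
                break
            self._apply(item)

    def submit(self, item):
        self._q.put(item)

    def close(self):
        self._q.put(None)
        self._worker.join()
```

Parallel workers call `submit()`; only the worker thread ever touches the database for writes, so no write-lock contention occurs.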
||||
@ -1,70 +0,0 @@ |
||||
name: duckdb_access |
||||
|
||||
rules: |
||||
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. |
||||
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. |
||||
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. |
||||
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). |
||||
|
||||
examples: |
||||
- path: database.py |
||||
excerpt: | |
||||
```python |
||||
conn = duckdb.connect(self.db_path) |
||||
... |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS fused_embeddings ( |
||||
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
window_id TEXT NOT NULL, |
||||
vector JSON NOT NULL, |
||||
svd_dims INTEGER NOT NULL, |
||||
text_dims INTEGER NOT NULL, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
conn.close() |
||||
``` |
||||
note: explicit connect/close used when initializing schema |
||||
|
||||
- path: pipeline/svd_pipeline.py |
||||
excerpt: | |
||||
```python |
||||
conn = duckdb.connect(db_path, read_only=True) |
||||
try: |
||||
rows = conn.execute( |
||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
||||
(start_date, end_date), |
||||
).fetchall() |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
note: read_only connection used for compute-heavy worker |
||||
|
||||
- path: similarity/compute.py |
||||
excerpt: | |
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: |
||||
logger.exception("duckdb import failed; cannot load vectors") |
||||
return 0 |
||||
|
||||
with duckdb.connect(db.db_path) as conn: |
||||
rows = conn.execute(query, params).fetchall() |
||||
``` |
||||
note: preferred 'with' context for automatic close |
||||
|
||||
anti_patterns: |
||||
- Bad: creating a connection without closure in a long-running process |
||||
remediation: use "with" context or ensure conn.close() in finally block |
||||
example: | |
||||
```python |
||||
# BAD: connection may leak if exception occurs before explicit close |
||||
conn = duckdb.connect(db_path) |
||||
rows = conn.execute("SELECT ...").fetchall() |
||||
# missing finally/close |
||||
``` |
||||
- Bad: Opening write connections from many parallel workers without coordination |
||||
remediation: open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. |
||||
@ -0,0 +1,74 @@ |
||||
--- |
||||
title: Embeddings Similarity Pipeline |
||||
category: patterns |
||||
--- |
||||
# Embeddings Similarity Pipeline |
||||
|
||||
## Rules |
||||
|
||||
- Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure.
||||
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. |
||||
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. |
||||
- Use read_only DuckDB connections in compute workers to allow parallel runs. |
||||
|
||||
## Examples |
||||
|
||||
### pipeline/ai_provider_wrapper.py - Batched embed + fallback |
||||
|
||||
```python |
||||
for start in range(0, len(texts), batch_size): |
||||
chunk = texts[start : start + batch_size] |
||||
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) |
||||
... |
||||
for j in range(i, end): |
||||
t = texts[j] |
||||
single, single_exc = _attempt_batch([t], j) |
||||
if single: |
||||
results[j] = single[0] |
||||
``` |
||||
|
||||
### pipeline/fusion.py - Concatenation and storage |
||||
|
||||
```python |
||||
try: |
||||
svd_vec = json.loads(svd_json) |
||||
except Exception: |
||||
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id) |
||||
skipped_missing_svd += 1 |
||||
continue |
||||
... |
||||
fused = list(svd_vec) + list(text_vec) |
||||
res = db.store_fused_embedding( |
||||
int(entity_id), |
||||
window_id, |
||||
fused, |
||||
svd_dims=len(svd_vec), |
||||
text_dims=len(text_vec), |
||||
) |
||||
``` |
||||
|
||||
### similarity/compute.py - Normalized cosine similarity |
||||
|
||||
```python |
||||
# Normalize rows |
||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
||||
norms[norms == 0] = 1.0 |
||||
normalized = matrix / norms |
||||
sim = normalized @ normalized.T |
||||
... |
||||
# pick top-k neighbors and write to similarity_cache |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Assuming consistent vector length |
||||
|
||||
**Problem**: Assuming consistent vector length without checks leads to shape errors. |
||||
|
||||
**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). |
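A sketch of that remediation (the function name is illustrative; compute.py's exact handling may differ):

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def pad_to_common_length(vectors):
    """Zero-pad ragged vectors to the longest length, logging a warning."""
    target = max(len(v) for v in vectors)
    if any(len(v) != target for v in vectors):
        logger.warning("Inconsistent vector lengths; padding to %d dims", target)
    out = np.zeros((len(vectors), target))
    for i, vec in enumerate(vectors):
        out[i, : len(vec)] = vec
    return out
```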
||||
|
||||
### Bad: Inline heavy computation in UI |
||||
|
||||
**Problem**: Recomputing heavy pipelines inline in UI requests. |
||||
|
||||
**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI. |
||||
@ -1,63 +0,0 @@ |
||||
name: embeddings_similarity_pipeline |
||||
|
||||
rules: |
||||
- Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure.
||||
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. |
||||
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. |
||||
- Use read_only DuckDB connections in compute workers to allow parallel runs. |
||||
|
||||
examples: |
||||
- path: pipeline/ai_provider_wrapper.py |
||||
excerpt: | |
||||
```python |
||||
for start in range(0, len(texts), batch_size): |
||||
chunk = texts[start : start + batch_size] |
||||
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) |
||||
... |
||||
for j in range(i, end): |
||||
t = texts[j] |
||||
single, single_exc = _attempt_batch([t], j) |
||||
if single: |
||||
results[j] = single[0] |
||||
``` |
||||
note: batched embed + fallback per-item retry |
||||
|
||||
- path: pipeline/fusion.py |
||||
excerpt: | |
||||
```python |
||||
try: |
||||
svd_vec = json.loads(svd_json) |
||||
except Exception: |
||||
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id) |
||||
skipped_missing_svd += 1 |
||||
continue |
||||
... |
||||
fused = list(svd_vec) + list(text_vec) |
||||
res = db.store_fused_embedding( |
||||
int(entity_id), |
||||
window_id, |
||||
fused, |
||||
svd_dims=len(svd_vec), |
||||
text_dims=len(text_vec), |
||||
) |
||||
``` |
||||
note: concatenation of vectors and storage via MotionDatabase |
||||
|
||||
- path: similarity/compute.py |
||||
excerpt: | |
||||
```python |
||||
# Normalize rows |
||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
||||
norms[norms == 0] = 1.0 |
||||
normalized = matrix / norms |
||||
sim = normalized @ normalized.T |
||||
... |
||||
# pick top-k neighbors and write to similarity_cache |
||||
``` |
||||
note: numeric pipeline and padding to consistent dimensionality |
||||
|
||||
anti_patterns: |
||||
- Bad: Assuming consistent vector length without checks (leads to shape errors). |
||||
remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). |
||||
- Bad: Recomputing heavy pipelines inline in UI requests. |
||||
remediation: schedule heavy work in scripts/subprocesses and read precomputed results in UI. |
@@ -0,0 +1,63 @@
---
title: Error Handling Pattern
category: patterns
---
# Error Handling Pattern

## Rules

- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
- Prefer logging.exception when catching an exception where a stack trace is useful.
- Avoid broad except: clauses that swallow exceptions; if a broad except is used for a "best-effort" fallback, log at warning level and include the original exception context.
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False); only return safe defaults where documented.

## Examples

### ai_provider.py - Network error to ProviderError

```python
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(
            f"Connection error when calling provider: {exc}"
        ) from exc
...
```

### pipeline/ai_provider_wrapper.py - Best-effort with logging

```python
except Exception:
    _logger.exception("Failed to append audit event for embedding failure")
    results[j] = None
```

### similarity/compute.py - Defensive import handling

```python
try:
    import duckdb
except Exception:
    logger.exception("duckdb import failed; cannot load vectors")
    return 0
```

## Anti-Patterns

### Bad: Silent exception swallowing

```python
try:
    do_work()
except Exception:
    return []
# BAD: hides the root cause and returns an ambiguous default
```

**Remediation**: Narrow the exception types, or at minimum call logger.exception() and re-raise or convert to a domain error if truly handled.
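The remediation above can be sketched like this. The `fetch_rows` helper and `DatabaseError` are hypothetical stand-ins for a project-specific domain exception, not actual project code:

```python
import logging

logger = logging.getLogger(__name__)

class DatabaseError(Exception):
    """Domain error raised when a query cannot be completed."""

def fetch_rows(conn, query):
    try:
        return conn.execute(query).fetchall()
    except (ValueError, RuntimeError) as exc:  # narrow, known failure modes
        # Log with the stack trace, then convert to a typed domain error
        logger.exception("Query failed: %s", query)
        raise DatabaseError(f"Query failed: {query}") from exc
```

Callers can now catch `DatabaseError` deliberately, and the original cause survives via exception chaining.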

### Bad: Mixing print() and logging

**Problem**: Mixing print() and logging for errors.

**Remediation**: Replace print() calls with logger.* calls; use a structured logging configuration.
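A minimal top-level logging setup for scripts, as a sketch of this remediation; the names and format string are illustrative rather than the project's actual configuration:

```python
import logging

def configure_logging(level=logging.INFO):
    """One-time, top-level logging configuration for scripts."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # reconfigure even if a handler was already attached
    )

configure_logging()
logger = logging.getLogger(__name__)
logger.info("Fetched %d voting records from API", 42)  # instead of print()
```

For Streamlit, the same call can run once at app start; all modules then share the configuration through `logging.getLogger(__name__)`.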
@@ -1,54 +0,0 @@
name: error_handling

rules:
  - Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
  - Prefer logging.exception when catching an exception where a stack trace is useful.
  - Avoid broad except: clauses that swallow exceptions; if a broad except is used for a "best-effort" fallback, log at warning level and include the original exception context.
  - For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False); only return safe defaults where documented.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      except requests.ConnectionError as exc:
          if attempt == retries:
              raise ProviderError(
                  f"Connection error when calling provider: {exc}"
              ) from exc
      ...
      ```
    note: mapping a network error to ProviderError with re-raise chaining

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      except Exception:
          _logger.exception("Failed to append audit event for embedding failure")
          results[j] = None
      ```
    note: logs and assigns None on failure; fallback behavior documented earlier in the wrapper rule

  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0
      ```
    note: defensive import handling and early return on failure

anti_patterns:
  - Bad: Broad except without logging and without re-raising (silently hides bugs)
    remediation: Narrow the exception types, or at minimum call logger.exception() and re-raise or convert to a domain error if truly handled.
    example: |
      ```python
      try:
          do_work()
      except Exception:
          return []
      # BAD: hides the root cause and returns an ambiguous default
      ```
  - Bad: Mixing print() and logging for errors
    remediation: Replace print() calls with logger.* calls; use a structured logging configuration.
@@ -0,0 +1,41 @@
---
title: Module Singletons Pattern
category: patterns
---
# Module Singletons Pattern

## Rules

- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
  - Avoid expensive initialization at import time.
  - Provide a way to construct with a test DB path or to reinitialize in tests.
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

## Examples

### database.py - Safe class initialization

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        # If duckdb is not available, operate in lightweight file-backed mode
        self._file_mode = duckdb is None
        self._init_database()
```

### similarity/lookup.py - Local instances

```python
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
if hasattr(db, "get_cached_similarities"):
    rows = db.get_cached_similarities(...)
```

## Anti-Patterns

### Bad: Heavy initialization at import time

**Problem**: Creating connections and performing heavy schema migrations during import.

**Remediation**: Move heavy init to an explicit initialize() method and keep import fast.
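The remediation can be sketched as follows. `Database` is a hypothetical class, not the project's MotionDatabase, and the connection object is a stand-in:

```python
import atexit

class Database:
    """Cheap to import; heavy work deferred to initialize()."""

    def __init__(self, db_path: str = "data/app.db"):
        self.db_path = db_path
        self._conn = None  # no connection opened at import time

    def initialize(self):
        """Open connections / run migrations explicitly, not at import."""
        if self._conn is None:
            self._conn = object()  # stand-in for a real connection
            atexit.register(self.close)  # safe shutdown on program exit
        return self

    def close(self):
        self._conn = None

# The module-level singleton stays cheap: nothing heavy happens until
# some entry point calls db.initialize().
db = Database()
```

Tests can construct `Database(db_path=tmp_path)` directly, which also satisfies the "reinitialize in tests" rule above.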
@@ -1,33 +0,0 @@
name: module_singletons

rules:
  - Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
    - Avoid expensive initialization at import time.
    - Provide a way to construct with a test DB path or to reinitialize in tests.
  - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

examples:
  - path: database.py
    excerpt: |
      ```python
      class MotionDatabase:
          def __init__(self, db_path: str = config.DATABASE_PATH):
              self.db_path = db_path
              # If duckdb is not available, operate in lightweight file-backed mode
              self._file_mode = duckdb is None
              self._init_database()
      ```
    note: class is safe to instantiate and creates the DB at init; consider lazy init if heavy

  - path: similarity/lookup.py
    excerpt: |
      ```python
      db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
      if hasattr(db, "get_cached_similarities"):
          rows = db.get_cached_similarities(...)
      ```
    note: consumers create local MotionDatabase instances, not relying on a single global

anti_patterns:
  - Bad: Creating connections and performing heavy schema migrations during import
    remediation: Move heavy init to an explicit initialize() method and keep import fast.
@@ -0,0 +1,196 @@
# Python-Specific Patterns

## Singleton Pattern

Use module-level instances for shared resources:

```python
# database.py
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def _init_database(self):
        # Initialize tables on first instantiation
        ...

# Bottom of file - the singleton
db = MotionDatabase()
```

**Usage across the codebase:**
```python
# In other modules
from database import db

def some_function():
    motions = db.get_filtered_motions(limit=10)
    return motions
```

Similarly for other singletons:
```python
# summarizer.py
class MotionSummarizer:
    def __init__(self):
        pass  # Stateless

    def generate_layman_explanation(self, title: str, body: str) -> str:
        ...

summarizer = MotionSummarizer()
```

## Dataclass Config Pattern

Use a dataclass for configuration with environment variable support:

```python
# config.py
from dataclasses import dataclass
from typing import List, Optional
import os

@dataclass
class Config:
    # Database settings
    DATABASE_PATH = "data/motions.db"

    # API settings
    TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
    API_TIMEOUT = 30
    API_BATCH_SIZE = 250

    # AI settings
    OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
    OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
    QWEN_MODEL = "qwen/qwen-2.5-72b-instruct"

    # App settings
    DEFAULT_MOTION_COUNT = 10
    SESSION_TIMEOUT_DAYS = 30

    # Policy areas (annotated, so it is a dataclass field; filled in __post_init__)
    POLICY_AREAS: Optional[List[str]] = None

    def __post_init__(self):
        self.POLICY_AREAS = [
            "Alle", "Economie", "Klimaat", "Immigratie",
            "Zorg", "Onderwijs", "Defensie", "Sociale Zaken", "Algemeen"
        ]

config = Config()
```

**Usage:**
```python
from config import config

# Access as attributes
timeout = config.API_TIMEOUT
areas = config.POLICY_AREAS
```

## DuckDB Connection Pattern

Short-lived connections with explicit cleanup:

```python
class MotionDatabase:
    def get_motion(self, motion_id: int) -> Optional[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            result = conn.execute(
                "SELECT * FROM motions WHERE id = ?",
                (motion_id,)
            ).fetchone()
            return result
        finally:
            conn.close()

    def get_filtered_motions(self, **kwargs) -> List[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            rows = conn.execute(query, params).fetchall()
            return rows
        except Exception:
            logger.exception("Filtered motion query failed")
            return []  # Safe fallback, but log the cause first
        finally:
            conn.close()
```

**Context manager alternative (preferred when applicable):**
```python
def some_operation(self):
    with duckdb.connect(self.db_path) as conn:
        result = conn.execute("SELECT ...").fetchall()
        return result
```

## Try/Except with Fallback Pattern

Always provide safe fallbacks, and make sure the connection is closed even when the query raises:

```python
def get_motion_or_default(self, motion_id: int) -> Dict:
    try:
        conn = duckdb.connect(self.db_path)
        try:
            result = conn.execute(
                "SELECT * FROM motions WHERE id = ?", (motion_id,)
            ).fetchone()
        finally:
            conn.close()  # close even if the query raises
        return result if result else {}
    except Exception:
        logger.exception("get_motion_or_default failed for id %s", motion_id)
        return {}
```

## Optional Import Pattern

Handle optional dependencies gracefully:

```python
try:
    import duckdb
except Exception:  # pragma: no cover
    duckdb = None

class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self._file_mode = duckdb is None
        ...
```

## Property Pattern

Lazy initialization of expensive resources:

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._session_cache = None

    @property
    def session(self):
        """Lazy-load expensive resources."""
        if self._session_cache is None:
            self._session_cache = self._create_session()
        return self._session_cache
```

## Type Annotation Patterns

```python
from typing import Dict, List, Optional, Tuple, Any

# Optional with None default
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]:
    ...

# Multiple return types
def parse_vote(self, vote_str: str) -> Tuple[bool, str]:
    """Returns (success, error_message)"""
    ...

# Generic types
def get_batch(self, ids: List[int]) -> Dict[str, Any]:
    ...
```
@@ -0,0 +1,77 @@
---
title: Requests HTTP Pattern
category: patterns
---
# Requests HTTP Pattern

## Rules

- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
- Raise or wrap non-OK responses into a domain ProviderError to make behavior consistent across the codebase.

## Examples

### ai_provider.py - 429 handling with Retry-After

```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
    if attempt == retries:
        raise ProviderError(f"Provider returned HTTP {resp.status_code}")
    retry_after = None
    raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
    if raw:
        try:
            retry_after = int(raw)
        except Exception:
            ...
    if retry_after is not None:
        time.sleep(retry_after)
        continue
```

### api_client.py - Session + raise_for_status

```python
response = self.session.get(
    base_url, params=params, timeout=config.API_TIMEOUT
)
response.raise_for_status()
data = response.json()
```

### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper

```python
def _attempt_batch(chunk_texts, start_index):
    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            emb_chunk = _embedder(
                chunk_texts, model=model, batch_size=len(chunk_texts)
            )
            return emb_chunk, None
        except Exception as exc:
            if attempt == retries:
                break
            sleep = backoff * (2 ** (attempt - 1))
            time.sleep(sleep)
            continue
```

## Anti-Patterns

### Bad: Silent exception swallowing

**Problem**: Blindly catching all requests exceptions and returning an empty response.

**Remediation**: Map network exceptions to retryable vs. terminal (ProviderError) and log details.

### Bad: Using print() for errors

**Problem**: Using print() for network errors instead of structured logging.

**Remediation**: Use `_logger.exception()` instead (api_client.py still needs this fix).
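The retry rules above can also be pushed into the session itself via `requests.adapters.HTTPAdapter` and urllib3's `Retry`. A minimal sketch; the retry count, backoff factor, and status list are assumptions, not the project's settings:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Session with pooled connections and automatic retry on 429/5xx."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                    # 0.5s, 1s, 2s, ...
        status_forcelist=(429, 500, 502, 503, 504),
        respect_retry_after_header=True,           # honor Retry-After on 429
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

With this, per-call code stays simple (`session.get(...)` plus `raise_for_status()`), while transient failures are retried transparently. Mapping terminal failures to ProviderError still belongs in the caller.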
@@ -1,65 +0,0 @@
name: requests_http

rules:
  - Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
  - Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
  - Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
  - Raise or wrap non-OK responses into a domain ProviderError to make behavior consistent across the codebase.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      resp = requests.post(url, json=json, headers=headers, timeout=10)
      ...
      if getattr(resp, "status_code", 0) == 429:
          if attempt == retries:
              raise ProviderError(f"Provider returned HTTP {resp.status_code}")
          retry_after = None
          raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
          if raw:
              try:
                  retry_after = int(raw)
              except Exception:
                  ...
          if retry_after is not None:
              time.sleep(retry_after)
              continue
      ```
    note: explicit handling of 429 and Retry-After

  - path: api_client.py
    excerpt: |
      ```python
      response = self.session.get(
          base_url, params=params, timeout=config.API_TIMEOUT
      )
      response.raise_for_status()
      data = response.json()
      ```
    note: uses a session + raise_for_status() to surface HTTP errors

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      def _attempt_batch(chunk_texts, start_index):
          backoff = 0.5
          for attempt in range(1, retries + 1):
              try:
                  emb_chunk = _embedder(
                      chunk_texts, model=model, batch_size=len(chunk_texts)
                  )
                  return emb_chunk, None
              except Exception as exc:
                  if attempt == retries:
                      break
                  sleep = backoff * (2 ** (attempt - 1))
                  time.sleep(sleep)
                  continue
      ```
    note: wrapper adds retry/backoff and per-item fallback

anti_patterns:
  - Bad: Blindly catching all requests exceptions and returning an empty response
    remediation: Map network exceptions to retryable vs. terminal (ProviderError) and log details.
  - Bad: Using print() for network errors instead of structured logging (see api_client.py, where print() is used)
    remediation: Replace print() calls with logger.* calls.
@@ -0,0 +1,37 @@
---
title: Validation Pattern
category: patterns
---
# Validation Pattern

## Rules

- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
- Tests should assert that invalid inputs raise the expected exceptions.
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is a str before embedding).

## Examples

### ai_provider.py - Type validation

```python
if not isinstance(text, str):
    raise ProviderError("text must be a string")
```

### pipeline/ai_provider_wrapper.py - Defensive empty handling

```python
if not texts:
    return []
if motion_ids is None:
    motion_ids = [None for _ in texts]
```

## Anti-Patterns

### Bad: Invalid values into computation

**Problem**: Allowing invalid values to propagate into heavy computation (e.g., a non-string into the embedding pipeline).

**Remediation**: Fail fast with a typed exception and add unit tests to cover validations.
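The fail-fast rule and its unit test can be sketched together. `embed_text` is a hypothetical helper and this `ProviderError` is a local stand-in for the project's exception:

```python
class ProviderError(Exception):
    """Raised for invalid provider inputs."""

def embed_text(text):
    # Fail fast: reject non-string input before any heavy work runs
    if not isinstance(text, str):
        raise ProviderError("text must be a string")
    return [float(len(text))]  # stand-in for a real embedding call

# Matching unit test (pytest style):
# with pytest.raises(ProviderError):
#     embed_text(42)
```

The test pins the contract down, so a later refactor that silently starts accepting bad input fails CI instead of producing garbage embeddings.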
@@ -1,29 +0,0 @@
name: validation

rules:
  - Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
  - Tests should assert that invalid inputs raise the expected exceptions.
  - Use explicit checks for types and shapes on public APIs (e.g., ensure text is a str before embedding).

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      if not isinstance(text, str):
          raise ProviderError("text must be a string")
      ```
    note: explicit type validation before the network call

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      if not texts:
          return []
      if motion_ids is None:
          motion_ids = [None for _ in texts]
      ```
    note: defensive handling of empty inputs

anti_patterns:
  - Bad: Allowing invalid values to propagate into heavy computation (e.g., a non-string into the embedding pipeline).
    remediation: Fail fast with a typed exception and add unit tests to cover validations.
@@ -1,33 +0,0 @@
# Tech stack (Phase 1 authoritative)

language:
  name: python
  version: ">=3.13"

frameworks:
  - streamlit: ">=1.48.0"  # UI: Home.py, pages/..., app.py

database:
  primary: duckdb
  orm_or_adapter: ibis-framework[duckdb]  # used for some parts

visualization:
  - plotly

ml:
  - scikit-learn
  - scipy
  - umap-learn

ai:
  declared_dependency: openai  # declared in pyproject but not observed imported; ai_provider uses requests
  runtime_adapter: custom requests-based wrapper (ai_provider.py)

container:
  - docker: Dockerfile FROM python:3.13-slim, EXPOSE 8501, CMD streamlit run Home.py

testing:
  - pytest

ci:
  - drone: .drone.yml present
@@ -0,0 +1,67 @@
---
title: Tech Stack
category: stack
---

# Tech Stack

## Runtime & Language
- **Python >=3.13**

## Web Framework
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages

## Data Layer
- **DuckDB** - Embedded OLAP database
  - Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata
- **ibis** - Declared as an ORM-style adapter, but the DuckDB-native implementation is what is actually used

## AI / LLM
- **OpenRouter** - API abstraction for AI providers
- **QWEN** - Primary model
  - Embeddings: `qwen/qwen3-embedding-4b`
  - Chat: `qwen/qwen-2.5-72b-instruct`
- **requests** - HTTP client (the openai package is declared but not used directly)

## ML / Analytics
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD)
- **numpy** - Numerical computing

## Visualization
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback)
- **matplotlib** - Static plotting (optional)

## HTTP & Parsing
- **requests** - Session pooling, retry with backoff
- **beautifulsoup4** - HTML parsing
- **lxml** - XML/HTML processing

## Key Source Files

| File | Purpose |
|------|---------|
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema |
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) |
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders |
| `analysis/` | SVD pipeline, UMAP projection, clustering |
| `pipeline/` | Data fetch, transform, store pipeline |
| `pages/1_Stemwijzer.py` | Quiz page |
| `pages/2_Explorer.py` | Explorer page |
| `config.py` | Dataclass Config pattern |
| `ai_provider.py` | OpenRouter API wrapper with retry |
| `api_client.py` | TweedeKamer OData API client |

## Singleton Instances

| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |

## Environment
- Python >=3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)
@@ -0,0 +1,10 @@
# Agents

## Documented Solutions

`docs/solutions/` contains documented solutions to past problems (bugs, best practices, workflow patterns), organized by category with YAML frontmatter (`module`, `tags`, `problem_type`). Relevant when implementing or debugging in documented areas.

## Project Conventions

- Right-wing parties (PVV, FVD, JA21, SGP) must appear on the RIGHT side of all axes in visualizations
- SVD labels should reflect voting patterns, not semantic content; see `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md`
@@ -0,0 +1,269 @@
"""Configuration constants for the parliamentary explorer.

This module contains all constant definitions used across the explorer.
It is intentionally free of Streamlit and DuckDB dependencies.
"""

from __future__ import annotations

from typing import Dict

__all__ = [
    "PARTY_COLOURS",
    "SVD_THEMES",
    "KNOWN_MAJOR_PARTIES",
    "CURRENT_PARLIAMENT_PARTIES",
    "_PARTY_NORMALIZE",
    "CANONICAL_RIGHT",
    "CANONICAL_LEFT",
]

CANONICAL_RIGHT: frozenset[str] = frozenset(
    {
        "PVV",
        "FVD",
        "JA21",
        "SGP",
    }
)

CANONICAL_LEFT: frozenset[str] = frozenset(
    {
        "SP",
        "PvdA",
        "GL",
        "GroenLinks",
        "GroenLinks-PvdA",
        "DENK",
        "PvdD",
        "Volt",
    }
)

PARTY_COLOURS: Dict[str, str] = {
    "VVD": "#1E73BE",
    "PVV": "#002366",
    "D66": "#00A36C",
    "CDA": "#4CAF50",
    "SP": "#E53935",
    "PvdA": "#D32F2F",
    "GroenLinks": "#388E3C",
    "GroenLinks-PvdA": "#2E7D32",
    "CU": "#0288D1",
    "SGP": "#F4511E",
    "PvdD": "#43A047",
    "FVD": "#6A1B9A",
    "JA21": "#7B1FA2",
    "BBB": "#8D6E63",
    "NSC": "#FF8F00",
    "Nieuw Sociaal Contract": "#FF8F00",
    "DENK": "#00897B",
    "50PLUS": "#7E57C2",
    "Volt": "#572AB7",
    "ChristenUnie": "#0288D1",
    "Unknown": "#9E9E9E",
}

SVD_THEMES: dict[int, dict[str, str | bool]] = {
    1: {
        "label": "Fiscaal-economisch beleid versus sociaal welzijn en internationale rechten",
        "explanation": (
            "Deze as scheidt fiscaal-economisch beleid van sociaal welzijn en internationale solidariteit. "
            "Aan de positieve kant staan moties over dijkvervanging, medische bijscholing, gaswinning op land, "
            "landbouwsubsidies en fiscale verlichting. "
            "Aan de negatieve kant staan moties over huurprijsbeheersing, boycot van defensiebedrijven, "
            "beëindiging van militaire verdragen, antipersoneelslandmijnen en zorgbuurthuizen. "
            "Deze as weerspiegelt de spanning tussen financieel-economische prioriteiten en sociaal-internationaal beleid."
        ),
        "positive_pole": "Fiscaal-economisch: dijkvervanging, landbouwsubsidies, gaswinning, fiscale verlichting",
        "negative_pole": "Sociaal welzijn en internationale rechten: huurbeheersing, defensieboycot, zorg, landmijnverbod",
        "flip": False,
    },
    2: {
        "label": "Nationalistische versus multilateralistische oriëntatie",
        "explanation": (
            "Deze as meet een onafhankelijke culturele dimensie: nationalistisch-populistisch "
            "tegenover kosmopolitisch-mainstream. Aan de positieve kant staan PVV en FVD. "
            "Aan de negatieve kant staan Volt, GroenLinks-PvdA, DENK en SP. "
            "Deze as is onafhankelijk van links-rechts (as 1) en scheidt partijen "
            "op hun houding tegenover nationale identiteit, EU-samenwerking en de "
            "etnisch-culturele dimensie."
        ),
        "positive_pole": "Nationalistisch/populistisch — PVV, FVD: nationale identiteit en soevereiniteit",
        "negative_pole": "Kosmopolitisch/mainstream — Volt, GL-PvdA, DENK, SP: EU en internationale samenwerking",
        "flip": False,
    },
    3: {
        "label": "Verzorgingsstaat versus defensie en nationale veiligheid",
        "explanation": (
            "Deze as weerspiegelt de spanning tussen staatsingrijpen en marktliberalisme, "
            "aangescherpt door de kabinetscrisis van 2025. Aan de positieve kant staan moties "
            "die bezuinigingen op zorg en het gemeentefonds willen terugdraaien, winstuitkeringen "
            "in de zorg verbieden en publieke controle over ziekenhuisfusies eisen. SP, PvdD, "
            "GroenLinks-PvdA stemmen hier gelijk — ondanks hun tegengestelde PC1-posities. "
            "Aan de negatieve kant staan moties "
            "over marktwerking in de zorg, fiscale bedrijfsopvolgingsfaciliteiten (VVD), "
            "doorgaan met besturen ondanks de kabinetscrisis (VVD/BBB) en defensie-"
            "uitgaven van 3,5% bbp."
        ),
        "positive_pole": "Pro-verzorgingsstaat: SP, PvdD, GroenLinks-PvdA (anti-bezuinigingen)",
        "negative_pole": "Marktliberaal en fiscaal conservatief: VVD, D66, CDA, SGP, BBB",
        "flip": True,
    },
    4: {
        "label": "Actieve internationale betrokkenheid versus terughoudendheid",
        "explanation": (
            "Deze as scheidt actieve internationale betrokkenheid van terughoudendheid of terugtrekking. "
            "Aan de positieve kant staan moties over bilaterale en Europese samenwerking: partnerschappen met Australië, "
            "actieve vaderbetrokkenheid, kennisuitwisseling en coördinatie via internationale gremia. "
            "Aan de negatieve kant staan moties over verlaten van de WHO, beperking van migratiesaldo, "
            "gezinsbeleid en asielrestricties. "
            "Deze as is indicatief — de spreiding van partijen is breed."
        ),
        "positive_pole": "Actieve internationale betrokkenheid: bilaterale samenwerking, kennisuitwisseling, multilaterale coördinatie",
        "negative_pole": "Terughoudendheid en restricties: WHO-verlating, migratielimieten, binnenlands gericht beleid",
        "flip": False,
    },
    5: {
        "label": "Pragmatische financiële ondersteuning versus progressieve individuele rechten",
        "explanation": (
            "Deze as scheidt pragmatische financiële en structurele ondersteuning van progressieve individuele rechten. "
            "Aan de positieve kant staan moties over een vrijgesteld minimumbudget voor infrastructurele werken, "
            "maatschappelijke diensttijd voor kwetsbare jongeren, verkorting van de WW alleen met concrete "
            "ondersteuningsmaatregelen, en vrijwaring van kindertoeslagen. "
            "Aan de negatieve kant staan moties over erkenning van meerouderschap, "
            "wettelijke kwaliteitseisen aan zwemlessen, een nationaal coördinator tegen buitenlandse beïnvloeding, "
            "en vastlegging van abortusrecht in het EU-Handvest. "
            "Deze as weerspiegelt de spanning tussen financiële prikkels en individuele rechtenbescherming."
        ),
        "positive_pole": "Pragmatische financiële ondersteuning: budgetvrijwaring, diensttijd, WW-hervorming, kindertoeslagen",
        "negative_pole": "Progressieve individuele rechten: meerouderschap, abortusrecht, zwemveiligheid, buitenlandse beïnvloeding",
        "flip": False,
    },
    6: {
        "label": "Fossiele brandstoffen en financiële prikkels versus klimaatbeleid en internationale rechten",
        "explanation": (
            "Deze as scheidt fossiele brandstoffen en financiële marktprikkels van klimaatbeleid en internationale rechten. "
            "Aan de positieve kant staan moties over lng-capaciteit als alternatief voor gaswinning, "
            "kernenergie als volwaardig onderdeel van energiebeleid, vermogenswinstbelasting en beperkte "
            "overheidsuitgaven. "
            "Aan de negatieve kant staan moties over het uitsluiten van de fossiele industrie van klimaatconferenties, "
            "veroordeling van aanvallen op Libanon, sancties tegen internationale conflicten, "
            "en structureel overleg met moslimgemeenschappen. "
            "Deze as weerspiegelt de spanning tussen economisch-fiscale prioriteiten en klimaat/internationale solidariteit."
        ),
        "positive_pole": "Fossiel en financieel: lng-capaciteit, kernenergie, vermogenswinstbelasting, bezuinigingen",
        "negative_pole": "Klimaat en internationale rechten: fossiele industrie uitsluiten, sancties, Libanon, gemeenschappen",
        "flip": False,
    },
    7: {
        "label": "Praktisch-bestuurlijk versus idealistisch-proceduraal",
        "explanation": (
            "Een residuele as die overwegend beleidsdossiers uit 2024 (vorige parlementaire "
            "periode) omvat. De scores zijn smal (max ~11 punten) en de partijcombinaties "
            "ideologisch divers — dit label is indicatief. Aan de positieve kant staan "
            "pragmatische bestuursmoties: een compleet kostenoverzicht van producten van eigen "
            "bodem, papieren schoolboeken voor basisvaardigheden, een invoeringstoets voor het "
            "minimumloon en de A2-snelwegplanning. ChristenUnie, Volt, DENK en SP scoren "
            "positief. Aan de negatieve kant staan meer ideologisch geladen moties: een "
            "landelijk stookverbod (PvdD), het strafbaar stellen van verbranding van religieuze "
            "geschriften (DENK), chroom-6 schadevergoedingen en tegenhouden van nieuwe "
            "gaswinning. GroenLinks-PvdA, VVD, FVD en JA21 scoren negatief."
        ),
        "positive_pole": "Praktisch-bestuurlijk: ChristenUnie, Volt, SGP, DENK, SP",
        "negative_pole": "Ideologisch-principieel: GroenLinks-PvdA, VVD, FVD, JA21",
        "flip": True,
    },
    8: {
        "label": "Europese defensiesamenwerking versus binnenlands sociaaleconomisch beleid",
        "explanation": (
            "Deze as scheidt Europese defensiesamenwerking van binnenlands sociaaleconomisch beleid. "
            "Aan de positieve kant staan moties over militaire mobiliteit in EU- en NAVO-verband, "
            "een Europees onderzoeksinstituut voor defensie, en concrete stappen voor 3,5% defensie-uitgaven. "
            "Aan de negatieve kant staan moties over toeslagenaffaire-herstel, ontslagrecht, "
            "coronastrategie en bestuurlijke instructieregels. "
            "Deze as is indicatief — de spreiding van partijen is breed en de thematische diversiteit is groot."
        ),
        "positive_pole": "Europese defensiesamenwerking: NAVO-militaire mobiliteit, Europees defensie-instituut, defensie-uitgaven",
        "negative_pole": "Binnenlands beleid: toeslagen, ontslagrecht, coronastrategie, administratieve lasten",
        "flip": False,
    },
    9: {
        "label": "Concreet-bestuurlijke versus systemische hervorming",
||||
"explanation": ( |
||||
"Deze as scheidt concreet-bestuurlijke oplossingen van systemische hervorming. " |
||||
"Aan de positieve kant staan moties over naleving van financiële verhoudingswetten voor gemeenten, " |
||||
"beperking van arbeidsmigratie, een nieuwe tandartsopleiding in Rotterdam, " |
||||
"en oplossingen voor milieuproblemen op Bonaire. " |
||||
"Aan de negatieve kant staan moties over een moratorium op geitenstallen, " |
||||
"een verbod op gokadvertenties, gronden voor voorlopige hechtenis, " |
||||
"een leegstandbelasting en end-to-end-encryptie. " |
||||
"Deze as is indicatief — de scores zijn smal en ideologisch divers." |
||||
), |
||||
"positive_pole": "Concreet-bestuurlijk: financiële verhoudingswet, arbeidsmigratie, tandartsopleiding, Bonaire", |
||||
"negative_pole": "Systemische hervorming: geitenstallen-moratorium, gokverbod, leegstandbelasting, encryptie", |
||||
"flip": False, |
||||
}, |
||||
10: { |
||||
"label": "Bescherming van burgers versus overheidsregulering", |
||||
"explanation": ( |
||||
"Deze as scheidt bescherming van burgers van overheidsregulering en handhaving. " |
||||
"Aan de positieve kant staan moties over minder tijdsintensieve schoolinspecties, " |
||||
"het recht van toeslagenouders op hun persoonlijk dossier, behoud van tegemoetkomingen " |
||||
"voor arbeidsongeschikten, integratie die geldt voor nieuwkomers (niet voor Nederlanders), " |
||||
"en verlaging van de leeftijdsdrempel voor kindgesprekken. " |
||||
"Aan de negatieve kant staan moties over een aangifteplicht voor scholen bij " |
||||
"veiligheidsincidenten, rookverboden in auto's met kinderen, " |
||||
"braakliggende landbouwgrond en verhoogd beloningsgeld voor tipgevers. " |
||||
"Deze as is indicatief — de scores zijn smal en de partijcombinaties divers." |
||||
), |
||||
"positive_pole": "Bescherming van burgers: minder inspecties, toegang tot dossiers, behoud toeslagen, kindleeftijd", |
||||
"negative_pole": "Overheidsregulering: aangifteplicht scholen, rookverbod, braakliggende grond, tipgeversbeloning", |
||||
"flip": True, |
||||
}, |
||||
} |
||||
|
||||
KNOWN_MAJOR_PARTIES = [ |
||||
"VVD", |
||||
"PVV", |
||||
"D66", |
||||
"GroenLinks-PvdA", |
||||
"GroenLinks", |
||||
"PvdA", |
||||
"CDA", |
||||
"SP", |
||||
"NSC", |
||||
"CU", |
||||
"BBB", |
||||
] |
||||
|
||||
CURRENT_PARLIAMENT_PARTIES: frozenset[str] = frozenset( |
||||
{ |
||||
"PVV", |
||||
"VVD", |
||||
"NSC", |
||||
"BBB", |
||||
"D66", |
||||
"GroenLinks-PvdA", |
||||
"CDA", |
||||
"SP", |
||||
"ChristenUnie", |
||||
"SGP", |
||||
"Volt", |
||||
"DENK", |
||||
"PvdD", |
||||
"JA21", |
||||
"FVD", |
||||
} |
||||
) |
||||
|
||||
_PARTY_NORMALIZE: dict[str, str] = { |
||||
"Nieuw Sociaal Contract": "NSC", |
||||
"CU": "ChristenUnie", |
||||
"GL": "GroenLinks-PvdA", |
||||
"GroenLinks": "GroenLinks-PvdA", |
||||
"PvdA": "GroenLinks-PvdA", |
||||
"Gündoğan": "Volt", |
||||
"Lid Keijzer": "BBB", |
||||
"Groep Markuszower": "PVV", |
||||
} |
||||
@@ -0,0 +1,568 @@
"""Data loading functions for the parliamentary explorer.

This module contains all data loading functions extracted from explorer.py.
It is intentionally free of Streamlit side-effects to be easy to unit test.
"""

from __future__ import annotations

import logging
from typing import Dict, List, Set

try:
    import duckdb
except Exception:  # pragma: no cover - allow lightweight import without duckdb installed
    duckdb = None  # type: ignore

import numpy as np
import pandas as pd

from analysis.config import CURRENT_PARLIAMENT_PARTIES, _PARTY_NORMALIZE

__all__ = [
    "get_available_windows",
    "get_uniform_dim_windows",
    "load_party_map",
    "load_active_mps",
    "load_mp_vectors_by_window",
    "load_mp_vectors_by_party",
    "load_mp_vectors_by_party_for_window",
    "load_party_axis_scores",
    "load_party_axis_scores_for_window",
    "load_party_scores_all_windows",
    "load_party_scores_all_windows_aligned",
    "load_party_mp_vectors",
    "build_window_party_scores",
    "load_scree_data",
    "load_motions_df",
    "query_similar",
    "compute_party_axis_scores",
]

logger = logging.getLogger(__name__)

_WINDOW_SQL = """
SELECT DISTINCT window_id FROM svd_vectors ORDER BY window_id
"""

_UNIFORM_DIM_SQL = """
WITH vec_dims AS (
    SELECT window_id, json_array_length(vector) AS dim
    FROM svd_vectors
    WHERE entity_type = 'mp'
),
window_dim_counts AS (
    SELECT window_id, dim, COUNT(*) AS cnt
    FROM vec_dims
    GROUP BY window_id, dim
),
dominant AS (
    SELECT DISTINCT ON (window_id) window_id, dim, cnt
    FROM window_dim_counts
    ORDER BY window_id, cnt DESC, dim DESC
)
SELECT window_id
FROM dominant
WHERE dim >= 25 AND cnt >= 10
ORDER BY window_id
"""


def get_available_windows(db_path: str) -> List[str]:
    """Return a sorted list of distinct window_ids from svd_vectors."""
    if duckdb is None:  # pragma: no cover - matches the import fallback above
        logger.warning("duckdb is not installed; returning no windows")
        return []
    con = duckdb.connect(database=db_path, read_only=True)
    try:
        rows = con.execute(_WINDOW_SQL).fetchall()
        return [r[0] for r in rows]
    except Exception:
        logger.exception("Failed to query available windows")
        return []
    finally:
        con.close()


def get_uniform_dim_windows(db_path: str) -> List[str]:
    """Return only windows whose dominant MP-vector dimension is >= 25.

    Some windows contain a mix of vector lengths due to multiple pipeline runs
    (e.g. 2016 has both dim=1 and dim=50 rows). We find the most common dimension
    per window and include only windows where that dominant dim >= 25.
    Windows with too few dim-25+ entities (< 10) are also excluded to avoid
    degenerate PCA inputs.
    """
    if duckdb is None:  # pragma: no cover - matches the import fallback above
        logger.warning("duckdb is not installed; returning no windows")
        return []
    con = duckdb.connect(database=db_path, read_only=True)
    try:
        rows = con.execute(_UNIFORM_DIM_SQL).fetchall()
        return [r[0] for r in rows]
    except Exception:
        logger.exception("Failed to query uniform-dim windows")
        return []
    finally:
        con.close()

def load_party_map(db_path: str) -> Dict[str, str]:
    """Return {mp_name: party} mapping, with party names normalised to abbreviations."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            "SELECT mp_name, party FROM mp_metadata WHERE party IS NOT NULL"
        ).fetchall()
        con.close()
        return {
            mp: _PARTY_NORMALIZE.get(party, party) for mp, party in rows if mp and party
        }
    except Exception:
        logger.exception("Failed to load party map")
        return {}


def load_active_mps(db_path: str) -> Set[str]:
    """Return the set of mp_name values that are currently seated in parliament.

    An MP is considered active if their mp_metadata row has tot_en_met IS NULL,
    meaning they have no recorded end date for their current seat.
    """
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            "SELECT mp_name FROM mp_metadata WHERE tot_en_met IS NULL"
        ).fetchall()
        con.close()
        return {r[0] for r in rows if r[0]}
    except Exception:
        logger.exception("Failed to load active MPs")
        return set()


def load_party_axis_scores(db_path: str) -> Dict[str, List[float]]:
    """Return party scores for all windows (non-aligned).

    Returns a dict mapping party_abbrev -> flat list of axis scores,
    two values ([x, y]) per window.
    """
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT party_abbrev, window_id, x_axis, y_axis
            FROM party_axis_scores
            ORDER BY party_abbrev, window_id
            """
        ).fetchall()
        con.close()

        scores: Dict[str, List[float]] = {}
        for party, window, x, y in rows:
            if party not in scores:
                scores[party] = []
            if x is not None and y is not None:
                scores[party].extend([x, y])
        return scores
    except Exception:
        logger.exception("Failed to load party axis scores")
        return {}


def load_party_axis_scores_for_window(
    db_path: str, window: str
) -> Dict[str, List[float]]:
    """Return party scores for a specific window (aligned)."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT party_abbrev, x_axis, y_axis
            FROM party_axis_scores
            WHERE window_id = ?
            ORDER BY party_abbrev
            """,
            [window],
        ).fetchall()
        con.close()

        return {party: [x or 0.0, y or 0.0] for party, x, y in rows}
    except Exception:
        logger.exception("Failed to load party axis scores for window %s", window)
        return {}

def load_party_scores_all_windows(db_path: str) -> Dict[str, List[List[float]]]:
    """Return party scores across all windows (non-aligned)."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT party_abbrev, window_id, x_axis, y_axis
            FROM party_axis_scores
            ORDER BY party_abbrev, window_id
            """
        ).fetchall()
        con.close()

        scores: Dict[str, List[List[float]]] = {}
        current_party = None
        for party, window, x, y in rows:
            if party != current_party:
                scores[party] = []
                current_party = party
            if x is not None and y is not None:
                scores[party].append([x, y])
            else:
                scores[party].append([0.0, 0.0])
        return scores
    except Exception:
        logger.exception("Failed to load party scores all windows")
        return {}


def load_party_scores_all_windows_aligned(
    db_path: str,
) -> Dict[str, List[List[float]]]:
    """Return party scores across all windows (Procrustes-aligned)."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT party_abbrev, window_id, x_axis_aligned, y_axis_aligned
            FROM party_axis_scores
            ORDER BY party_abbrev, window_id
            """
        ).fetchall()
        con.close()

        scores: Dict[str, List[List[float]]] = {}
        current_party = None
        for party, window, x, y in rows:
            if party != current_party:
                scores[party] = []
                current_party = party
            if x is not None and y is not None:
                scores[party].append([x, y])
            else:
                scores[party].append([0.0, 0.0])
        return scores
    except Exception:
        logger.exception("Failed to load aligned party scores all windows")
        return {}


def build_window_party_scores(
    scores_by_party: Dict[str, List[List[float]]],
    window_idx: int,
) -> Dict[str, List[float]]:
    """Extract scores for one window as {party: [x, y]} for compute_flip_direction.

    Args:
        scores_by_party: Output of load_party_scores_all_windows_aligned —
            {party: [[x, y], [x, y], ...]} per window.
        window_idx: Zero-based index of the window to extract.

    Returns:
        {party: [x, y]} for the given window. Returns empty dict if
        window_idx is out of range.
    """
    if window_idx < 0:
        return {}
    result: Dict[str, List[float]] = {}
    for party, window_scores in scores_by_party.items():
        if window_idx < len(window_scores):
            result[party] = window_scores[window_idx]
    return result
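The indexing behaviour of `build_window_party_scores` is easy to check without a database. A minimal, self-contained sketch (the helper is re-declared locally so the snippet runs on its own):

```python
from typing import Dict, List


def build_window_party_scores(
    scores_by_party: Dict[str, List[List[float]]],
    window_idx: int,
) -> Dict[str, List[float]]:
    # Same logic as the module-level helper: pick the [x, y] pair for one window,
    # silently dropping parties that have no score for that window.
    if window_idx < 0:
        return {}
    return {
        party: window_scores[window_idx]
        for party, window_scores in scores_by_party.items()
        if window_idx < len(window_scores)
    }


scores = {"VVD": [[0.1, 0.2], [0.3, 0.4]], "SP": [[-0.5, 0.0]]}
print(build_window_party_scores(scores, 1))  # {'VVD': [0.3, 0.4]}; SP has no window 1
```

Note the out-of-range behaviour: individual parties are dropped rather than the whole call failing, so downstream plotting code should tolerate missing parties.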


def load_party_mp_vectors(db_path: str) -> Dict[str, List[np.ndarray]]:
    """Load individual MP SVD vectors grouped by party.

    Returns {party_name: [np.ndarray(50,), ...]} — one array per MP.
    """
    import json as _json

    con = duckdb.connect(database=db_path, read_only=True)
    try:
        meta_rows = con.execute(
            "SELECT mp_name, party FROM mp_metadata "
            "WHERE van >= '2023-11-22' OR tot_en_met IS NULL OR tot_en_met >= '2023-11-22' "
            "ORDER BY van ASC"
        ).fetchall()
        mp_party: Dict[str, str] = {}
        for mp_name, party in meta_rows:
            if mp_name and party:
                mp_party[mp_name] = _PARTY_NORMALIZE.get(party, party)

        rows = con.execute(
            "SELECT entity_id, vector FROM svd_vectors "
            "WHERE entity_type = 'mp' AND window_id = 'current_parliament'"
        ).fetchall()

        vectors_by_party: Dict[str, List[np.ndarray]] = {}
        for entity_id, vector_json in rows:
            if entity_id in mp_party:
                party = mp_party[entity_id]
                # Vectors may arrive as JSON strings; decode before building arrays.
                vec = (
                    _json.loads(vector_json)
                    if isinstance(vector_json, str)
                    else vector_json
                )
                vectors_by_party.setdefault(party, []).append(np.array(vec, dtype=float))

        return vectors_by_party
    except Exception:
        logger.exception("Failed to load party MP vectors")
        return {}
    finally:
        con.close()


def load_scree_data(db_path: str) -> List[float]:
    """Load scree plot data (explained variance) for current_parliament."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        row = con.execute(
            """
            SELECT sv_metadata FROM svd_vectors
            WHERE window_id = 'current_parliament' AND entity_type = 'singular_values'
            LIMIT 1
            """
        ).fetchone()
        con.close()

        if row and row[0]:
            import json

            return json.loads(row[0])
        return []
    except Exception:
        logger.exception("Failed to load scree data")
        return []


def load_motions_df(db_path: str) -> pd.DataFrame:
    """Load the full motions table as a pandas DataFrame (read-only)."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        df = con.execute(
            """
            SELECT id, title, description, date, policy_area,
                   voting_results, layman_explanation,
                   winning_margin, controversy_score, url
            FROM motions
            """
        ).fetchdf()
        con.close()
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
        df["year"] = df["date"].dt.year
        return df
    except Exception:
        logger.exception("Failed to load motions DataFrame")
        return pd.DataFrame()


def load_mp_vectors_by_window(db_path: str, window: str) -> Dict[str, np.ndarray]:
    """Load individual MP SVD vectors for a specific window.

    Args:
        db_path: Path to DuckDB database
        window: Window ID (e.g., "2015", "current_parliament")

    Returns:
        {mp_name: np.ndarray(50,)} — one vector per MP
    """
    import json as _json

    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT entity_id, vector FROM svd_vectors
            WHERE entity_type = 'mp' AND window_id = ?
            """,
            [window],
        ).fetchall()
        con.close()

        mp_vecs: Dict[str, np.ndarray] = {}
        for entity_id, raw_vec in rows:
            if isinstance(raw_vec, str):
                vec = _json.loads(raw_vec)
            elif isinstance(raw_vec, (bytes, bytearray)):
                vec = _json.loads(raw_vec.decode())
            elif isinstance(raw_vec, list):
                vec = raw_vec
            else:
                try:
                    vec = list(raw_vec)
                except Exception:
                    continue
            fvec = np.array([float(v) if v is not None else 0.0 for v in vec])
            mp_vecs[entity_id] = fvec

        return mp_vecs
    except Exception:
        logger.exception("Failed to load MP vectors for window %s", window)
        return {}


def query_similar(
    db_path: str,
    source_motion_id: int,
    vector_type: str = "fused",
    top_k: int = 10,
) -> pd.DataFrame:
    """Return top-k similar motions from similarity_cache (read-only)."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute(
            """
            SELECT sc.target_motion_id, sc.score, sc.window_id,
                   m.title, m.date, m.policy_area
            FROM similarity_cache sc
            JOIN motions m ON m.id = sc.target_motion_id
            WHERE sc.source_motion_id = ?
              AND sc.vector_type = ?
            ORDER BY sc.score DESC
            LIMIT ?
            """,
            [source_motion_id, vector_type, top_k],
        ).fetchdf()
        con.close()
        return rows
    except Exception:
        logger.exception(
            "Failed to query similarity cache for motion %s", source_motion_id
        )
        return pd.DataFrame()


def load_mp_vectors_by_party(db_path: str) -> Dict[str, List[np.ndarray]]:
    """Load individual MP SVD vectors grouped by party for current_parliament.

    Returns:
        {party_name: [np.ndarray(50,), ...]} — one array per MP.
    """
    import json as _json

    try:
        con = duckdb.connect(database=db_path, read_only=True)
        meta_rows = con.execute(
            "SELECT mp_name, party FROM mp_metadata "
            "WHERE van >= '2023-11-22' OR tot_en_met IS NULL OR tot_en_met >= '2023-11-22' "
            "ORDER BY van ASC"
        ).fetchall()
        mp_party: Dict[str, str] = {}
        for mp_name, party in meta_rows:
            if mp_name and party:
                mp_party[mp_name] = _PARTY_NORMALIZE.get(party, party)

        rows = con.execute(
            "SELECT entity_id, vector FROM svd_vectors "
            "WHERE entity_type='mp' AND window_id='current_parliament'"
        ).fetchall()
        con.close()

        party_vecs: Dict[str, List[np.ndarray]] = {}
        for entity_id, raw_vec in rows:
            party = mp_party.get(entity_id)
            if party is None or party not in CURRENT_PARLIAMENT_PARTIES:
                continue
            if isinstance(raw_vec, str):
                vec = _json.loads(raw_vec)
            elif isinstance(raw_vec, (bytes, bytearray)):
                vec = _json.loads(raw_vec.decode())
            elif isinstance(raw_vec, list):
                vec = raw_vec
            else:
                try:
                    vec = list(raw_vec)
                except Exception:
                    continue
            fvec = np.array([float(v) if v is not None else 0.0 for v in vec])
            party_vecs.setdefault(party, []).append(fvec)
        return party_vecs
    except Exception:
        logger.exception("Failed to load MP vectors by party")
        return {}


def load_mp_vectors_by_party_for_window(
    db_path: str, window: str
) -> Dict[str, List[np.ndarray]]:
    """Load individual MP SVD vectors grouped by party for a specific window.

    For historical windows, uses the MP→party mapping from that time period.

    Returns:
        {party_name: [np.ndarray(50,), ...]} — one array per MP.
    """
    import json as _json

    try:
        con = duckdb.connect(database=db_path, read_only=True)
        is_current = window == "current_parliament"

        if is_current:
            meta_rows = con.execute(
                "SELECT mp_name, party FROM mp_metadata "
                "WHERE van >= '2023-11-22' OR tot_en_met IS NULL OR tot_en_met >= '2023-11-22' "
                "ORDER BY van ASC"
            ).fetchall()
        else:
            try:
                year = int(window.split("-")[0])
            except ValueError:
                year = 2023
            meta_rows = con.execute(
                "SELECT mp_name, party FROM mp_metadata "
                "WHERE van <= ? AND (tot_en_met IS NULL OR tot_en_met >= ?) "
                "ORDER BY van ASC",
                [f"{year}-12-31", f"{year}-01-01"],
            ).fetchall()

        mp_party: Dict[str, str] = {}
        for mp_name, party in meta_rows:
            if mp_name and party:
                mp_party[mp_name] = _PARTY_NORMALIZE.get(party, party)

        rows = con.execute(
            "SELECT entity_id, vector FROM svd_vectors "
            "WHERE entity_type='mp' AND window_id=?",
            [window],
        ).fetchall()
        con.close()

        party_vecs: Dict[str, List[np.ndarray]] = {}
        for entity_id, raw_vec in rows:
            party = mp_party.get(entity_id)
            if party is None:
                continue
            if is_current and party not in CURRENT_PARLIAMENT_PARTIES:
                continue
            if isinstance(raw_vec, str):
                vec = _json.loads(raw_vec)
            elif isinstance(raw_vec, (bytes, bytearray)):
                vec = _json.loads(raw_vec.decode())
            elif isinstance(raw_vec, list):
                vec = raw_vec
            else:
                try:
                    vec = list(raw_vec)
                except Exception:
                    continue
            fvec = np.array([float(v) if v is not None else 0.0 for v in vec])
            party_vecs.setdefault(party, []).append(fvec)
        return party_vecs
    except Exception:
        logger.exception("Failed to load MP vectors by party for window %s", window)
        return {}


def compute_party_axis_scores(
    party_vecs: Dict[str, List[np.ndarray]],
) -> Dict[str, List[float]]:
    """Compute per-party axis scores as mean of MP vectors.

    Returns:
        {party_name: [float * k]} — k = 50, mean over all MPs in that party.
    """
    try:
        return {
            party: np.array(vecs).mean(axis=0).tolist()
            for party, vecs in party_vecs.items()
        }
    except Exception:
        logger.exception("Failed to compute party axis scores")
        return {}
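Since the reduction in `compute_party_axis_scores` is just a per-dimension mean, it can be sanity-checked in isolation. A self-contained sketch with toy two-dimensional vectors (the party name and values are illustrative, not real scores):

```python
import numpy as np


def compute_party_axis_scores(party_vecs):
    # Per-party mean over MP vectors, dimension by dimension (same reduction
    # as the module-level function, minus the logging wrapper).
    return {
        party: np.array(vecs).mean(axis=0).tolist()
        for party, vecs in party_vecs.items()
    }


toy = {"D66": [np.array([1.0, 3.0]), np.array([3.0, 5.0])]}
print(compute_party_axis_scores(toy))  # {'D66': [2.0, 4.0]}
```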
@@ -0,0 +1,128 @@
"""SVD projection utilities for the parliamentary explorer.

Pure computation functions for projecting motions and entities onto ideological axes.
No I/O or external dependencies - fully testable without Streamlit or DuckDB.
"""

from __future__ import annotations

import math
from typing import Any, Dict, List, Tuple

__all__ = [
    "should_swap_axes",
    "swap_axes",
    "project_motion_scores",
    "normalize_coordinates",
]


def should_swap_axes(axis_def: dict) -> bool:
    """Return True if the Y axis is economic left-right and the X axis is not.

    When true, caller should swap x/y positions and metadata so the economic
    dimension (welfare vs market) is conventionally on the horizontal axis.
    """
    economic_labels = {"Verzorgingsstaat–Marktwerking", "Links–Rechts"}
    y_label = axis_def.get("y_label")
    x_label = axis_def.get("x_label")
    return y_label in economic_labels and x_label not in economic_labels


def swap_axes(
    positions_by_window: Dict[str, Dict[str, Tuple[float, float]]],
    axis_def: dict,
) -> Tuple[Dict[str, Dict[str, Tuple[float, float]]], dict]:
    """Swap x and y in all positions and axis metadata.

    Pure function — returns (new_positions_by_window, new_axis_def).
    """
    new_positions: Dict[str, Dict[str, Tuple[float, float]]] = {}
    for wid, pos_dict in positions_by_window.items():
        new_positions[wid] = {ent: (y, x) for ent, (x, y) in pos_dict.items()}

    new_ax = dict(axis_def)
    new_ax["x_label"] = axis_def.get("y_label")
    new_ax["y_label"] = axis_def.get("x_label")

    for x_key, y_key in [
        ("x_quality", "y_quality"),
        ("x_interpretation", "y_interpretation"),
        ("x_top_motions", "y_top_motions"),
        ("x_label_confidence", "y_label_confidence"),
        ("x_axis", "y_axis"),
    ]:
        new_ax[x_key] = axis_def.get(y_key)
        new_ax[y_key] = axis_def.get(x_key)

    return new_positions, new_ax


def project_motion_scores(
    motion_scores: Dict[int, float], top_n: int = 5
) -> Tuple[List[Tuple[int, float]], List[Tuple[int, float]]]:
    """Split motion scores into positive and negative poles.

    Args:
        motion_scores: Dict mapping motion_id to loading score
        top_n: Number of top motions per pole

    Returns:
        Tuple of (positive_pole, negative_pole) where each is a list of
        (motion_id, score) tuples.
    """
    sorted_scores = sorted(motion_scores.items(), key=lambda x: x[1], reverse=True)

    positive_pole = sorted_scores[:top_n]
    negative_pole = sorted_scores[-top_n:][::-1]  # most negative first

    return positive_pole, negative_pole
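A quick worked example of the pole split (the helper is re-declared without type hints so the snippet runs stand-alone):

```python
def project_motion_scores(motion_scores, top_n=5):
    # Rank motions by loading score, then take both extremes.
    sorted_scores = sorted(motion_scores.items(), key=lambda x: x[1], reverse=True)
    positive_pole = sorted_scores[:top_n]
    negative_pole = sorted_scores[-top_n:][::-1]  # most negative first
    return positive_pole, negative_pole


scores = {101: 0.9, 102: -0.7, 103: 0.1, 104: -0.2, 105: 0.5}
pos, neg = project_motion_scores(scores, top_n=2)
print(pos)  # [(101, 0.9), (105, 0.5)]
print(neg)  # [(102, -0.7), (104, -0.2)]
```

One caveat worth knowing: with fewer than `2 * top_n` motions the two poles overlap, since the slices are taken from the same sorted list.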


def normalize_coordinates(
    positions: Dict[str, Tuple[float, float]],
    clamp_abs_value: float = 1e3,
    null_tokens: Tuple[str, ...] = ("nan", "NaN", "None", "none", "null", ""),
) -> Dict[str, Tuple[float, float]]:
    """Normalize coordinate values.

    Pure function that clamps extreme values and handles null tokens.

    Args:
        positions: Dict mapping entity names to (x, y) coordinates
        clamp_abs_value: Maximum absolute coordinate value
        null_tokens: Values to treat as null

    Returns:
        Dict with normalized coordinates
    """

    def _coerce(val: Any) -> float:
        if val is None:
            return float("nan")
        if isinstance(val, (float, int)):
            v = float(val)
            if math.isnan(v) or math.isinf(v):
                return float("nan")
            if abs(v) > clamp_abs_value:
                return float("nan")
            return v
        if isinstance(val, str):
            if val in null_tokens or val.strip() in null_tokens:
                return float("nan")
            try:
                v = float(val)
                if math.isnan(v) or math.isinf(v):
                    return float("nan")
                if abs(v) > clamp_abs_value:
                    return float("nan")
                return v
            except ValueError:
                return float("nan")
        return float("nan")

    result = {}
    for entity, (x, y) in positions.items():
        nx = _coerce(x)
        ny = _coerce(y)
        result[entity] = (nx, ny)
    return result
@@ -0,0 +1,21 @@
"""Tab modules for the parliamentary explorer.

This package contains tab-building functions extracted from explorer.py.
Each module contains a `build_<tab>_tab()` function that implements one tab.
"""

from analysis.tabs.compass import build_compass_tab
from analysis.tabs.trajectories import build_trajectories_tab
from analysis.tabs.search import build_search_tab
from analysis.tabs.browser import build_browser_tab
from analysis.tabs.components import build_svd_components_tab
from analysis.tabs.quiz import build_mp_quiz_tab

__all__ = [
    "build_compass_tab",
    "build_trajectories_tab",
    "build_search_tab",
    "build_browser_tab",
    "build_svd_components_tab",
    "build_mp_quiz_tab",
]
@@ -0,0 +1,18 @@
"""Browser tab for the parliamentary explorer.

This module will contain the browser tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations


def build_browser_tab(db_path: str, show_rejected: bool) -> None:
    """Build the Motie Browser tab.

    Currently delegates to the explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_browser_tab(db_path, show_rejected)
@@ -0,0 +1,20 @@
"""Compass tab for the parliamentary explorer.

This module will contain the compass tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations


def build_compass_tab(db_path: str, window_size: str) -> None:
    """Build the Politiek Kompas tab.

    Currently delegates to the explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_compass_tab(db_path, window_size)
@@ -0,0 +1,18 @@
"""SVD Components tab for the parliamentary explorer.

This module will contain the SVD components tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations


def build_svd_components_tab(db_path: str) -> None:
    """Build the SVD Components tab.

    Currently delegates to the explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_svd_components_tab(db_path)
@@ -0,0 +1,18 @@
"""MP Quiz tab for the parliamentary explorer.

This module will contain the MP quiz tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations


def build_mp_quiz_tab(db_path: str) -> None:
    """Build the MP Quiz tab.

    Currently delegates to the explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_mp_quiz_tab(db_path)
@@ -0,0 +1,18 @@
"""Search tab for the parliamentary explorer.

This module will contain the search tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations


def build_search_tab(db_path: str, show_rejected: bool) -> None:
    """Build the Motie Zoeken tab.

    Currently delegates to the explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_search_tab(db_path, show_rejected)
@ -0,0 +1,20 @@
"""Trajectories tab for the parliamentary explorer.

This module will contain the trajectories tab implementation.
Currently: Tab logic remains in explorer.py pending Streamlit decoupling.
"""

from __future__ import annotations

from typing import List


def build_trajectories_tab(db_path: str, window_size: str) -> None:
    """Build the Partij Trajectories tab.

    Currently delegates to explorer.py implementation.
    Will be extracted when rendering logic is decoupled from Streamlit.
    """
    import explorer

    explorer.build_trajectories_tab(db_path, window_size)
@ -0,0 +1,110 @@
# The Tweede Kamer is more polarized, but not more right-wing: here is what the data shows

*An analysis of 10 years of motions in the Tweede Kamer*

---

**Summary**: We analyzed 10 years of votes in the Tweede Kamer (2016-2026). The key finding: the **coalition lost its majority in 2019 and has structurally lost motions ever since**. This explains why "the right" appears to be winning: not because right-wing parties grew larger, but because the coalition lost more often.

---

## The Discovery: The Coalition Lost in 2019

When we looked at which motions passed, we saw a striking reversal:

| Year | Motions won by | Interpretation |
|------|----------------|----------------|
| 2016 | **Coalition** (+6.48) | Cabinet wins |
| 2017 | **Opposition** (-6.72) | First signs of loss |
| 2018 | **Coalition** (+7.54) | Last stable year |
| **2019** | **Coalition** (+2.92) | *Shrinking, but still winning* |
| **2022-2026** | **Opposition** (-3.22 to -4.70) | *Structural loss* |

The numbers in parentheses are the mean scores on the "coalition-opposition" axis: positive means coalition-like motions won, negative means opposition-like motions won.

---

## Finding 1: The opposition did not win; the coalition lost

The story is more nuanced than "the right wins":

- In 2016-2018, the coalition had a working majority and won motions
- In 2019, the coalition lost its majority (the Rutte III crisis)
- In 2022-2026, the coalition side lost structurally

The PVV and FVD did **not** grow because their positions became mainstream; they grew while the coalition **won fewer motions**.

---

## Finding 2: Polarization has increased

Regardless of who won, motions did become more extreme:

| Year | Spread (std) | Interpretation |
|------|--------------|----------------|
| 2016 | 3.46 | Moderate division |
| 2019 | 6.31 | Increased division |
| **2026** | **7.44** | **Strong polarization** |

The spread **doubled** in ten years, regardless of whether the coalition or the opposition won.

---

## Finding 3: PVV/FVD growth vs. government loss

The party landscape changed drastically:

| Year | Top parties (yes votes) |
|------|-------------------------|
| 2016 | PvdA, VVD, D66 (coalition) |
| 2019 | SP, PvdD, GL (opposition) |
| 2022 | BBB, SP, DENK (anti-cabinet parties) |
| **2026** | **PVV, 50PLUS, DENK** |

The PVV grew large, but that does not mean "right-wing" policy won; it means the coalition **managed to win less often**.

---

## Finding 4: Topics shifted toward migration

The topics that the coalition-like side now wins are different:

### 2016: Administrative motions
- Tax reform
- International treaties
- Administrative legislation

### 2026: Identity/migration
- **An asylum freeze**
- **Revoking Syrian residence permits**
- **Return policy for Ukrainians**

The same structure (who votes with whom), but different topics.

---

## Conclusions

### 1. The coalition lost in 2019
The Rutte III cabinet crisis (2017-2019) marks the end of effective coalition government. Since then, the opposition side has structurally won more motions.

### 2. Polarization increased
Regardless of who won, motions became more extreme. The mean deviation doubled from 3.46 to 7.44.

### 3. Topics shifted
The political axis shifted from economic-administrative issues to identity/migration, but that is a consequence of which topics the coalition can still win.

### 4. Not a rightward shift, but a coalition losing power
Politics polarized, but the "center" stayed neutral. What changed is that the coalition lost its grip on the agenda.

---

## Methodological note

The axis we refer to is the first principal component of all voting behavior: the main division in how parties vote against each other. Positive scores mean a motion has the characteristics of the side we call the "coalition" (historically VVD, CDA, D66); negative scores, of the opposition side.

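The idea behind this axis can be sketched on a toy vote matrix. This is an illustrative sketch with made-up numbers, not the project's actual pipeline, and it assumes NumPy is available:

```python
import numpy as np

# Toy vote matrix: rows = motions, columns = parties
# (+1 = for, -1 = against). Values are invented for illustration.
votes = np.array([
    [ 1,  1, -1, -1],
    [ 1,  1, -1,  1],
    [-1,  1,  1, -1],
    [ 1, -1, -1,  1],
], dtype=float)

# First principal component via SVD of the centered matrix.
centered = votes - votes.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]                      # party loadings on the main axis
motion_scores = centered @ pc1   # per-motion score along that axis
```

The sign of `pc1` is arbitrary, which is exactly why the orientation convention discussed elsewhere in this repository (right-wing parties on the right) has to be enforced separately.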
The full code is available in the [GitHub repository](https://github.com/sgeboers/stemwijzer).

---

*Analysis performed on 5 April 2026. Data: 8,700+ motions, 2016-2026.*
@ -0,0 +1,118 @@
---
date: 2026-04-04
topic: explorer-refactor
---

# Explorer.py Refactor: Extract to analysis/

## Problem Frame

explorer.py is 3715 lines with 39 functions mixing:
- Data loading (DuckDB queries)
- Business logic (SVD projections, trajectory alignment)
- UI rendering (Streamlit components)

This makes the file:
- Hard to navigate (no clear boundaries)
- Hard to test (requires Streamlit + DuckDB)
- Hard to review (changes affect everything)

**Goal**: Improve navigability by extracting computation-heavy logic to `analysis/`, leaving explorer.py as a UI orchestration layer.

## Requirements

### Data Layer

- **R1.1**: Create `analysis/explorer_data.py` containing all data loading functions currently in explorer.py:
  - `get_available_windows()`
  - `get_uniform_dim_windows()`
  - `load_positions()`
  - `load_party_map()`
  - `load_active_mps()`
  - `load_party_axis_scores()`
  - `load_party_scores_all_windows()`
  - `load_party_scores_all_windows_aligned()`
  - `load_party_mp_vectors()`
  - `load_scree_data()`
  - `load_motions_df()`

- **R1.2**: All extracted functions must be callable without Streamlit imports (no `@st.cache_data`, no `st.*` calls)

- **R1.3**: Functions return plain Python data structures (DataFrames, dicts, lists); no Plotly figures

### Business Logic Layer

- **R2.1**: Move computation functions to `analysis/` modules based on domain:
  - `_should_swap_axes()`, `_swap_axes()` → `analysis/axis_utils.py` (new)
  - `compute_party_discipline()` → `analysis/trajectories.py`
  - Trajectory computation functions → `analysis/trajectories.py`
  - SVD projection functions → `analysis/svd_labels.py` or a new `analysis/projections.py`

- **R2.2**: Computations must be pure functions (no IO, deterministic outputs)

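The R2.2 contract (no IO, deterministic outputs, no mutation) can be illustrated with a minimal sketch; the function name and data shape here are hypothetical, not the real `_swap_axes` signature:

```python
def swap_axes(scores: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Return a new mapping with x and y swapped.

    Pure: no IO, deterministic, and the input mapping is left untouched.
    """
    return {party: (y, x) for party, (x, y) in scores.items()}


# Usage: the original dict is not mutated, so callers can rely on it afterwards.
original = {"VVD": (1.2, -0.3), "SP": (-0.8, 0.5)}
swapped = swap_axes(original)
```

Functions shaped like this need neither Streamlit nor DuckDB to test, which is the whole point of the extraction.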
### UI Layer (explorer.py)

- **R3.1**: explorer.py becomes a thin orchestration layer:
  - Imports from `analysis/explorer_data.py` for data
  - Imports from `analysis/` modules for computations
  - Contains only Streamlit UI code and `@st.cache_data` wrappers

- **R3.2**: Render functions (`_render_*`) stay in explorer.py (they are UI-only)

- **R3.3**: Tab-building functions (`build_*_tab()`) stay in explorer.py but delegate to imported functions

### Import Safety

- **R4.1**: New `analysis/` modules must not import from `explorer.py` (no circular dependencies)

- **R4.2**: `analysis/explorer_data.py` may import from `database.py` (already exists)

### Testing

- **R5.1**: Extracted data functions should be testable with mocked DuckDB connections

- **R5.2**: Extracted computation functions should be pure and testable without a database

## Success Criteria

- explorer.py reduced to under 1500 lines (from 3715)
- No function in explorer.py exceeds 100 lines
- Clear module boundaries: data → computation → UI
- All extracted functions have docstrings with type hints
- No circular imports between `analysis/` and `explorer.py`

## Scope Boundaries

**Included:**
- Data loading functions
- Computation/transformation logic
- Clear separation of concerns

**Excluded:**
- UI rendering functions (they can stay in explorer.py)
- Database schema changes
- New features or behavior changes
- Test suite updates (handled separately)

## Key Decisions

- **Domain-based splitting**: Computation goes to the relevant `analysis/` module, not all to one file
- **Import direction**: `explorer.py` imports from `analysis/`, never vice versa
- **Preserve function signatures**: Refactoring shouldn't change public APIs

## Dependencies / Assumptions

- `database.py` provides the `MotionDatabase` singleton; data functions will use this
- The `explorer_helpers.py` pattern is already established; follow its conventions
- Streamlit caching (`@st.cache_data`) stays in explorer.py as the orchestration layer

## Outstanding Questions

### Deferred to Planning
- [ ] [Implementation] Should `_load_mp_vectors_by_party()` and variants be merged or kept separate?
- [ ] [Implementation] Should we create `analysis/projections.py` or extend the existing `analysis/axis_classifier.py`?
- [ ] [Implementation] How to handle `_cached_bootstrap_cis()`: move it to analysis, or keep it as a cache wrapper?

## Next Steps

→ `/ce:plan` for structured implementation planning
@ -0,0 +1,77 @@
---
date: 2026-04-05
topic: right-wing-party-axis-validation
---

# Right-Wing Party Axis Validation

## Problem Frame

The project convention states that PVV, FVD, JA21, and SGP must appear on the RIGHT side of all axes in visualizations (AGENTS.md). This is the #1 documented convention, yet it has zero automated enforcement. A single test prevents regression when SVD labels change or new components are added.

## Requirements

**R1. Canonical party sets defined once, imported everywhere**
- Define `CANONICAL_RIGHT = frozenset({"PVV", "FVD", "JA21", "SGP"})` in `analysis/config.py`
- Define `CANONICAL_LEFT = frozenset({"SP", "PvdA", "GL", "GroenLinks", "GroenLinks-PvdA", "DENK", "PvdD", "Volt"})` in `analysis/config.py`; this matches svd_labels.py LEFT_PARTIES exactly
- All code that checks political orientation (svd_labels.py, political_axis.py) imports from config instead of defining the sets inline

**R2. Validation test loads real data from DuckDB**
- Test file: `tests/test_axis_political_orientation.py`
- Uses existing data loading functions (`load_party_scores_all_windows_aligned` from `analysis/explorer_data.py`)
- No synthetic data; validates against the actual `party_axis_scores` table

**R3. 2D political compass orientation check (statistical, not per-party)**
- The `party_axis_scores` table has `x_axis_aligned` (component 1) and `y_axis_aligned` (component 2)
- For each window, validate both axes using **mean scores**:
  - **Axis 1 (x)**: Compute the mean of `CANONICAL_RIGHT` x-values and the mean of `CANONICAL_LEFT` x-values. Assert `right_mean > left_mean`
  - **Axis 2 (y)**: Same for y-values. Assert `right_mean > left_mean`
- "Right on right" means the **average** right party is right of the **average** left party; individual parties may deviate slightly (e.g., one right party being slightly negative is fine)
- `compute_flip_direction` already implements this logic (it compares group means); use it
- Skip parties not present in a given window (graceful, not a failure)

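A minimal sketch of this mean-based check, assuming a plain `{party: score}` mapping per window and axis (the real test should go through `compute_flip_direction` and the loaders named above rather than reimplementing the comparison):

```python
CANONICAL_RIGHT = frozenset({"PVV", "FVD", "JA21", "SGP"})
CANONICAL_LEFT = frozenset(
    {"SP", "PvdA", "GL", "GroenLinks", "GroenLinks-PvdA", "DENK", "PvdD", "Volt"}
)


def right_of_left(scores: dict[str, float]) -> bool:
    """True when the mean right-party score exceeds the mean left-party score.

    Parties absent from the window are simply skipped; a window missing one
    whole group is treated as passing rather than failing (R3: graceful skip).
    """
    right = [v for p, v in scores.items() if p in CANONICAL_RIGHT]
    left = [v for p, v in scores.items() if p in CANONICAL_LEFT]
    if not right or not left:
        return True
    return sum(right) / len(right) > sum(left) / len(left)
```

Note that one right party scoring slightly negative does not fail the check, only a left-of-left group mean does.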
**R4. `compute_flip_direction` consistency check**
- After loading data, call `compute_flip_direction(1, party_scores)` and `compute_flip_direction(2, party_scores)` per window
- Assert both return `False` (no flip needed) when the data is already correctly oriented
- If either returns `True`, the data violates the convention and the test fails with a clear message

**R5. Clear failure messages**
- When the orientation check fails, report: window, axis (x/y), right_mean, left_mean, difference
- Example: `"Window '2021-2023', x-axis: right_mean=-0.12, left_mean=0.08 (right parties on LEFT side — flip direction=True)"`

## Success Criteria

- Test runs as part of the `pytest` suite (`.venv/bin/python -m pytest tests/test_axis_political_orientation.py`)
- Test passes with current data (the convention currently holds; this establishes the baseline)
- If the convention is violated in future data, the test fails with an actionable message
- Test works for all windows in the database (not just the current one)
- Statistical check (mean-based): the test passes even if individual parties deviate slightly from the group mean

## Scope Boundaries

- **Not included**: Testing unaligned scores (only aligned scores are validated; these are what users see)
- **Not included**: VVD, NSC, BBB, CDA, ChristenUnie (center parties, not right-wing per the AGENTS.md convention)
- **Not included**: Per-party strict sign checks (the statistical mean check is sufficient and more robust)
- **Not included**: Updating `political_axis.py`: R1 only updates `svd_labels.py` to import from config; `political_axis.py` uses different party sets for PCA centroid orientation and is out of scope

## Key Decisions

- **Canonical sets match AGENTS.md for right, svd_labels.py for left**: `CANONICAL_RIGHT = {PVV, FVD, JA21, SGP}` matches AGENTS.md exactly. `CANONICAL_LEFT = {SP, PvdA, GL, GroenLinks, GroenLinks-PvdA, DENK, PvdD, Volt}` matches svd_labels.py LEFT_PARTIES exactly.
- **Single unified source of truth in config.py**: the `CANONICAL_RIGHT` and `CANONICAL_LEFT` frozensets go in `config.py`; this is a prerequisite for the test to work correctly. Only `svd_labels.py` is updated to import from config; `political_axis.py` is out of scope (it uses party sets for PCA centroid orientation, a different usage).
- **Aligned scores only**: Unaligned scores may vary across windows due to Procrustes alignment drift; aligned scores are the stable, user-facing representation.
- **Statistical (mean-based) validation, not per-party**: The orientation check compares group means, not individual party scores. A single right party being slightly negative is not a failure; the mean right score must exceed the mean left score.

## Dependencies / Assumptions

- The DuckDB database is populated with a `party_axis_scores` table that has `x_axis_aligned` and `y_axis_aligned` columns (verified)
- `analysis/explorer_data.py` functions work correctly (already tested)
- `_PARTY_NORMALIZE` already exists in `config.py` (lines 247-256); use it for party name alias normalization
- `config.py` currently lacks the `CANONICAL_RIGHT`/`CANONICAL_LEFT` frozensets; they must be added as part of R1
- `compute_flip_direction()` in `svd_labels.py` currently uses inline `RIGHT_PARTIES`/`LEFT_PARTIES`; it must be updated to import from config after R1

## Outstanding Questions

All resolved. Key decisions are documented above.

## Next Steps
→ `/ce:plan` for structured implementation planning
@ -0,0 +1,86 @@
---
date: 2026-04-13
topic: topic-derived-svd-axis-labels
---

# Topic-Derived SVD Axis Labels

## Problem Frame

The current SVD axis labels in `SVD_THEMES` (config.py) describe which parties land where, not what policy dimension the axis captures. This produces misleading labels:

- **Axis 1**: labeled "Links: PvdD, GL-PvdA", but PvdD and D66 vote the same way on the defining motions (Israel, rent, antipersonnel mines, gas extraction). D66 is known as centrist, not left. The label reflects party positions, not the actual policy divide.
- The negative pole is named after parties that *coincidentally* vote together, not parties that define the axis.

**Users** want to understand what policy dimension each axis represents. A good label should be derived from the topics of the motions that define each axis.

## Requirements

### Label Derivation

- **R1** Labels are derived from the **content of the motions** that define each axis, not from party positions.
- **R2** Use **50 motions per component** (top 25 positive + top 25 negative by absolute loading) to capture the full topic breadth, not just the top 10 (which can show a misleadingly narrow slice).
- **R3** Derive the label using **TF-IDF keyword extraction** on motion titles (Dutch stopwords removed). Use the top 3-5 most distinctive keywords to form a short label.
- **R4** Also consider the `policy_area` field to validate or supplement the keyword-derived label.
- **R5** Labels should be **reviewed manually** before being applied to `SVD_THEMES`. The script outputs suggestions; a human validates them before committing.
- **R6** For each component, the output includes:
  - Suggested short label (≤60 chars)
  - Top 10 representative motions (5 positive-pole + 5 negative-pole)
  - Top 10 TF-IDF keywords
  - Dominant `policy_area`
  - Current SVD_THEMES label for reference

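The R3 extraction step could look roughly like the following dependency-free sketch. A real implementation might instead use scikit-learn's `TfidfVectorizer`, and the stopword set here is only an illustrative subset, not a complete Dutch list:

```python
import math
from collections import Counter

# Illustrative subset of Dutch stopwords; a real run needs a fuller list.
DUTCH_STOPWORDS = {"de", "het", "een", "van", "en", "in", "over", "tot", "motie"}


def tfidf_keywords(titles: list[str], top_n: int = 5) -> list[str]:
    """Rank words by summed TF-IDF across one component's motion titles."""
    docs = [[w for w in t.lower().split() if w not in DUTCH_STOPWORDS] for t in titles]
    docs = [d for d in docs if d]  # drop titles that were all stopwords
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    n = len(docs)
    scores: Counter = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            # Smoothed IDF so words in every title score near zero.
            scores[word] += (count / len(doc)) * math.log((1 + n) / (1 + df[word]))
    return [w for w, _ in scores.most_common(top_n)]
```

The reviewer then turns the ranked keywords into a ≤60-character label per R6, rather than using them verbatim.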
### Tooling

- **R7** Create a new script `scripts/derive_svd_labels.py` that generates a **review report** (markdown) with label suggestions per component.
- **R8** The report is generated by running:
  ```bash
  uv run python3 scripts/derive_svd_labels.py --db data/motions.db --window current_parliament
  ```
- **R9** After review, the validated labels are written to `analysis/config.py` (updating `SVD_THEMES`).

### Output Report Format

For each component (1-10), the review report includes:
- Suggested label
- TF-IDF keyword list
- Dominant policy area
- Top 5 positive-pole motion titles
- Top 5 negative-pole motion titles
- Current label for comparison

## Success Criteria

- Each axis label reflects the actual policy topics that define that axis
- Labels are consistent and interpretable (e.g., "Buitenlandbeleid & Klimaat", not "Links vs Rechts")
- The PvdD and D66 scores on axis 1 make sense given the derived label
- The review report makes it easy for a human to validate or correct labels

## Scope Boundaries

- **In scope**: Label derivation for axes 1-10, the review workflow, updating config
- **Out of scope**: Automatically applying labels without review, changing the SVD computation, modifying the UI
- **Not changing**: The `positive_pole` / `negative_pole` fields in SVD_THEMES (those describe party coalitions, not topics; acceptable as-is)

## Key Decisions

- **TF-IDF over LLM**: TF-IDF is deterministic, fast, and sufficient for keyword extraction. No LLM dependency. A reviewer still validates the output.
- **Static labels in config**: After review, labels go into `SVD_THEMES` in config.py. This keeps the current architecture (no runtime derivation).
- **Large motion sample (≥50)**: 10 motions per component is too few; axis 1's top 10 show a mix of Israel, rent, mines, and gas that looks incoherent. ≥50 gives a clearer picture of what the axis truly captures.

## Dependencies / Assumptions

- Motion titles in the `motions` table are in Dutch and sufficiently descriptive
- The `policy_area` field has meaningful coverage
- The `svd_vectors` table contains all motion loadings for the window

## Outstanding Questions

### Resolve Before Planning
(none)

### Deferred to Planning
- **Tooling approach**: Use parallel subagents (one per axis) to analyze 50 motions each and derive labels, rather than a single sequential script. Each subagent produces a suggested label independently.

## Next Steps
→ `/ce:plan` for structured implementation planning
@ -0,0 +1,149 @@
---
date: 2026-04-04
topic: code-quality-architecture-ideation
focus: code quality and architecture improvements
---

# Ideation: Code Quality & Architecture Improvements

## Codebase Context
- **explorer.py**: 3715 lines; a monolithic Streamlit app with 65+ `except Exception:` handlers
- **database.py**: 1366 lines; a `MotionDatabase` class with similar exception patterns
- **explorer_helpers.py**: 317 lines of pure, import-safe, well-testable functions (the pattern to follow)
- **Anti-patterns**: 208 instances of bare/broad exception handling and nested try-except blocks
- **Tests**: Well organized in `tests/` with good coverage of the helpers

## Ranked Ideas

### 1. Systematic Exception Handler Audit & Refactor
**Description:** Audit all 208 `except Exception:` blocks across the codebase. Categorize them by failure mode (missing dependency, data validation, network, IO) and replace them with specific exceptions. Add error context propagation.

**Rationale:** The current pattern silently swallows errors, making debugging impossible. Refactoring to specific exceptions enables proper error handling, logging, and user feedback. This compounds: each fix removes 2-3 nested exception handlers.

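The target pattern might look like this sketch; `MotionDataError` and `load_motion_count` are hypothetical names for illustration, not existing code:

```python
import logging

logger = logging.getLogger(__name__)


class MotionDataError(RuntimeError):
    """Typed domain error raised when motion data cannot be loaded."""


def load_motion_count(fetch) -> int:
    """Narrow handler: catch only the failures this call can actually produce,
    log with traceback, and re-raise as a typed domain error."""
    try:
        return int(fetch())
    except (KeyError, ValueError, TypeError) as exc:
        logger.exception("Failed to load motion count")
        raise MotionDataError("motion count unavailable") from exc
```

Compared to `except Exception: return 0`, the error stays visible in logs and callers can decide how to recover.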
**Downsides:** The high volume of changes requires careful regression testing.

**Confidence:** 90%

**Complexity:** High

**Status:** Unexplored

---

### 2. Extract Business Logic from explorer.py into Pure Functions
**Description:** Identify and extract computation-heavy sections from the 3715-line explorer.py. Move them to pure functions in a new module (e.g., `explorer_logic.py`), keeping the Streamlit UI glue in the main file.

**Rationale:** explorer.py mixes UI code with business logic, making it untestable and hard to reason about. The existing `explorer_helpers.py` proves this pattern works; the same approach applied more broadly enables unit testing of the core algorithms.

**Downsides:** Requires careful interface design to avoid breaking the Streamlit page.

**Confidence:** 85%

**Complexity:** Medium

**Status:** Unexplored

---

### 3. Create Typed Data Transfer Objects (DTOs) for the Database Layer
**Description:** Replace dictionary-based data passing between `database.py` and its consumers with typed dataclasses or Pydantic models. Define `MotionDTO`, `PartyResultDTO`, `SessionDTO`.

**Rationale:** The 208 exception handlers often mask type mismatches that typed DTOs would surface at type-check time. `src/validators/types.py` shows existing type awareness; extend it systematically to the data layer.

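A minimal sketch of one such DTO; the field names and row shape are illustrative, not the actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionDTO:
    """Typed record for one motion row (field names are illustrative)."""
    motion_id: str
    title: str
    passed: bool

    @classmethod
    def from_row(cls, row: tuple) -> "MotionDTO":
        """Coerce a raw database row into a typed, immutable record."""
        motion_id, title, passed = row
        return cls(str(motion_id), str(title), bool(passed))


motion = MotionDTO.from_row(("2026Z00123", "Motie over de asielstop", True))
```

Because the dataclass is frozen, accidental mutation downstream fails immediately instead of silently corrupting shared state.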
**Downsides:** Migration effort; some DuckDB results may not map cleanly onto typed fields.

**Confidence:** 75%

**Complexity:** Medium

**Status:** Unexplored

---

### 4. Establish Explicit Error Recovery Strategies
**Description:** Rather than catch-all exception handling, implement explicit recovery strategies per failure mode: retry with backoff for transient failures, fallback to cached data for missing dependencies, graceful degradation for optional features.

**Rationale:** The anti-pattern exists because there is no systematic recovery approach. Explicit strategies replace 208 silent catches with intentional behavior; this is the "compound leverage" angle.

**Downsides:** Requires identifying which failures are transient vs. permanent per operation.

**Confidence:** 80%

**Complexity:** Medium

**Status:** Unexplored

---

### 5. Modularize database.py into Focused Modules
**Description:** Split `database.py` (1366 lines) into `db_connection.py` (connection lifecycle), `db_motions.py` (motion queries), `db_sessions.py` (session management), and `db_migrations.py` (schema updates).

**Rationale:** database.py violates single responsibility: it handles connection, schema, queries, and migrations. Splitting enables independent testing and clearer ownership. The modular `pipeline/` structure shows this is already the project's convention.

**Downsides:** Breaking changes for any existing imports.

**Confidence:** 70%

**Complexity:** Medium

**Status:** Unexplored

---

### 6. Add Comprehensive Type Hints to Core Modules
**Description:** Run mypy on `explorer.py`, `database.py`, and `analysis/*.py`. Fix missing type hints and enable strict type checking in CI.

**Rationale:** Type hints catch the errors that the 208 exception handlers are currently masking. `src/types/motion_types.py` shows the project already has some type investment; this extends it to the pain points.

**Downsides:** May require `cast()` in some duckdb interop scenarios.

**Confidence:** 85%

**Complexity:** Low

**Status:** Unexplored

---

### 7. Create Code Climate Metrics & Monitoring
**Description:** Add radon or lizard to measure cyclomatic complexity per module. Set thresholds that fail CI when exceeded. Track the numbers over time.

**Rationale:** Provides a quantitative baseline for refactoring impact. There is currently no way to measure whether the 3715-line explorer.py is improving or degrading. Compounds: each refactor can be measured.

**Downsides:** Tool overhead; thresholds may need tuning.

**Confidence:** 60%

**Complexity:** Low

**Status:** Unexplored

---

### 8. Add a Static Analysis Rule for Bare Except Detection
**Description:** Add a flake8 plugin or ruff rule that flags `except:` and `except Exception:` without re-raising or logging. Document the project-specific exception hierarchy.

**Rationale:** Prevents the anti-pattern from re-entering. The project has 208 violations; a custom lint rule catches new violations and encodes the team's error-handling philosophy. This is the "assumption-breaking" angle: stop fixing cases, fix the system.

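One possible ruff configuration for this (E722 covers bare `except:`; BLE001, from ruff's flake8-blind-except port, flags blind `except Exception:` handlers; whether its exact semantics match the "without re-raising or logging" policy should be verified before adoption):

```toml
[tool.ruff.lint]
extend-select = [
    "E722",   # bare `except:`
    "BLE001", # blind `except Exception:`
]
```

A custom plugin would only be needed if the team wants the narrower "allowed when logged or re-raised" rule that off-the-shelf checks do not express.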
**Downsides:** Requires defining which specific exceptions ARE allowed per context.

**Confidence:** 70%

**Complexity:** Low

**Status:** Unexplored

---

## Rejection Summary

| # | Idea | Reason Rejected |
|---|------|-----------------|
| 1 | Add docstrings to all functions | Too obvious; not leverage-focused |
| 2 | Migrate to async database operations | Premature optimization; duckdb is sync |
| 3 | Add a structured logging library | Tool-focused; does not address the root cause |
| 4 | Replace Streamlit with another framework | Out of scope for this codebase |
| 5 | Add a caching layer for database queries | Already exists via Streamlit caching; does not address architecture |

## Session Log
- 2026-04-04: Initial ideation (13 generated, 8 survived)
@ -0,0 +1,160 @@
---
date: 2026-04-04
topic: reliability-correctness-improvements
focus: reliability and correctness
---

# Ideation: Reliability & Correctness Improvements

## Codebase Context
- **Python + Streamlit + DuckDB** data pipeline application
- **Key issues from docs/solutions/**:
  - SVD labels must reflect voting patterns, not semantic content (850+ SVD component labels in code)
  - Bare exception handlers: 850+ `except Exception:` across the codebase
  - Nested exception handling creates opaque error paths
  - Error handling catches broad Exception and prints to stdout (179 `print()` statements in error paths)
- **Existing pattern**: `explorer_helpers.py` is pure functions, testable, well structured; the model to follow

## Grounding Evidence
1. `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md` documents the SVD labeling convention
2. A grep search found 281 `except Exception:` in `.py` files, plus bare `except:` handlers
3. `database.py` line 47: a bare `except:` that catches everything, including KeyboardInterrupt
4. 179 print statements in error handling paths hide issues from logging

## Ranked Ideas

### 1. Right-Wing Party Axis Validation: Automated Assert
**Description:** Add runtime validation that PVV, FVD, JA21, and SGP appear on the RIGHT side of all SVD/PCA axes. Create a `validate_axis_polarity()` function that checks party loadings and raises `AssertionError` if right-wing parties appear on the left.

**Rationale:** This is the most impactful correctness fix: the project convention is explicitly documented in AGENTS.md yet has no automated enforcement. A single validation pass catches SVD labeling errors before they reach production.

**Downsides:** Requires careful handling of axis flips (sometimes flipping is the correct fix, not a validation failure).

**Confidence:** 95%

**Complexity:** Low

**Status:** Unexplored

---

### 2. Type-Safe Vote Normalization with Exhaustiveness Checking
**Description:** Replace the fragile string-based vote normalization in `database.py` (lines 715-744) with a typed enum plus exhaustiveness checking. Add a `Vote` enum with variants `VOOR`, `TEGEN`, `ONTHOUDEN`, `AFWEZIG`. Use match/case with `case _` to catch unmapped values at development time.

**Rationale:** The current normalization silently returns `None` for unknown vote values; this causes data loss that only manifests as "the agreement percentage is wrong". Typed enums with exhaustiveness checking prevent silent data loss.

**Downsides:** Requires updating all call sites that pass vote strings.

**Confidence:** 90%

**Complexity:** Medium

**Status:** Unexplored

---

### 3. DuckDB Connection Leak Detector: Context Manager Audit
**Description:** Audit all `duckdb.connect()` calls for proper context manager usage or an explicit `.close()`. Many handlers catch exceptions but forget to close connections. Add a `ConnectionTracker` that warns about unclosed connections in development.

**Rationale:** Connection leaks accumulate and eventually exhaust database connections. The codebase has 15+ places where exceptions cause early returns without connection cleanup.

**Downsides:** Tracking adds overhead; some leaks are already handled by DuckDB's connection pooling. |
||||
|
||||
**Confidence:** 85% |
||||
|
||||
**Complexity:** Medium |
||||
|
||||
**Status:** Unexplored |
||||
|
||||
--- |
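One possible shape for the tracker: a thin wrapper that warns when a connection is finalized without having been closed. The class and parameter names are hypothetical; the only assumption about the wrapped object (e.g. the result of `duckdb.connect()`) is that it exposes `close()`:

```python
import warnings


class TrackedConnection:
    """Wrap a DB connection and warn if it is garbage-collected unclosed."""

    def __init__(self, conn, origin: str):
        self._conn = conn
        self._origin = origin  # e.g. "module.py:func", shown in the warning
        self._closed = False

    def close(self):
        self._closed = True
        self._conn.close()

    def __enter__(self):
        return self._conn

    def __exit__(self, exc_type, exc, tb):
        self.close()  # runs even when the body raised, fixing the leak pattern

    def __del__(self):
        if not self._closed:
            warnings.warn(
                f"DB connection opened at {self._origin} was never closed"
            )
```

Used as `with TrackedConnection(duckdb.connect(path), "pipeline.py:ingest") as conn: ...`, an early return or exception can no longer leak the connection, and any remaining bare usage is flagged by the `__del__` warning during development.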

### 4. Replace Print-Based Debugging with Structured Logging

**Description:** Replace the 179 `print()` statements in error paths with structured logging using the existing `_logger`. Create a script that automates this conversion for common patterns.

**Rationale:** Print statements go to stdout and are discarded in production. Proper logging enables log aggregation, alerting, and debugging of production issues.

**Downsides:** High volume of changes; risk of losing context in some print statements.

**Confidence:** 80%

**Complexity:** Medium

**Status:** Unexplored

---
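The common conversion pattern looks like this. The `fetch_motions` stub is hypothetical; in the real code the `except` would wrap the existing API or pipeline call:

```python
import logging

logger = logging.getLogger(__name__)


def fetch_motions():
    """Hypothetical stand-in for the real API/pipeline call."""
    raise RuntimeError("upstream failure")


def load_motions():
    # Before: print(f"Error fetching motions from API: {e}")  # lost on stdout
    # After: logger.exception records the message plus the full traceback.
    try:
        return fetch_motions()
    except Exception:
        logger.exception("Error fetching motions from API")
        return []
```

`logger.exception(...)` inside an `except` block captures the traceback automatically, so the converted call sites preserve more context than the original f-string prints did.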

### 5. SVD Component Label Verification — Pre-Deployment Assertion

**Description:** Create a CI/CD pre-deployment script that verifies SVD labels against actual voting data — checking that labels match the voting pattern, not semantic assumptions. Query which parties vote positive/negative per component and validate label accuracy.

**Rationale:** The SVD label documentation exists but there's no enforcement. This automated check prevents the documented mistake (semantic labels that don't match voting) from recurring.

**Downsides:** Requires understanding of the SVD pipeline and periodic re-calibration as voting data changes.

**Confidence:** 75%

**Complexity:** Medium

**Status:** Unexplored

---
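A minimal check could compare the sign of each party's loading against the documented label poles. Function name and input shapes are assumptions; the expected-pole lists would come from the existing label documentation:

```python
def verify_component_label(loadings, expected_positive, expected_negative):
    """Return a list of mismatches between documented poles and actual loadings.

    loadings: {party: loading} for one SVD component.
    expected_positive / expected_negative: parties the label documentation
    places on each pole of the component (hypothetical shape).
    """
    mismatches = []
    for party in expected_positive:
        if loadings.get(party, 0.0) <= 0:
            mismatches.append(
                f"{party}: expected positive, got {loadings.get(party)}"
            )
    for party in expected_negative:
        if loadings.get(party, 0.0) >= 0:
            mismatches.append(
                f"{party}: expected negative, got {loadings.get(party)}"
            )
    return mismatches
```

A pre-deployment script would run this per component and fail the deploy when any component returns a non-empty mismatch list.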

### 6. Nested Exception Handler Flattening — EAFP to LBYL Migration

**Description:** Replace nested try-except blocks with explicit preconditions (LBYL — Look Before You Leap). Many handlers wrap every operation in `try-except` because they don't trust the data. Add validation functions that check preconditions before operations.

**Rationale:** Nested exception handlers make the control flow impossible to reason about. Replacing them with explicit validation makes code more readable and debuggable.

**Downsides:** Requires understanding what conditions each operation actually needs.

**Confidence:** 70%

**Complexity:** High

**Status:** Unexplored

---
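As a before/after sketch (the field name and handler shape are illustrative, not taken from the real code):

```python
def parse_margin(row: dict) -> float:
    """LBYL version: explicit preconditions, one narrow try for the conversion.

    Replaces a nested EAFP shape like:
        try:
            try:
                value = row["winning_margin"]
            except KeyError:
                value = None
            margin = float(value)
        except (TypeError, ValueError):
            margin = 0.0
    """
    value = row.get("winning_margin")
    if value is None:
        return 0.0  # precondition: key present
    if not isinstance(value, (int, float, str)):
        return 0.0  # precondition: convertible type
    try:
        return float(value)
    except ValueError:
        return 0.0  # only the actual conversion stays inside a try
```

The control flow is now a straight line with named preconditions, and the remaining `try` covers exactly one operation with exactly one expected exception.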

### 7. Database Schema Validation — Foreign Key and Constraint Checks

**Description:** Add startup validation that checks the actual database schema against the expected schema. Verify table existence, column types, and foreign key relationships. Fail fast with clear error messages if the schema is stale.

**Rationale:** The current code tries to add columns with `ALTER TABLE ... IF NOT EXISTS`, which can fail silently. A schema validation pass catches migration failures immediately.

**Downsides:** Schema changes require updating validation code.

**Confidence:** 85%

**Complexity:** Low

**Status:** Unexplored

---
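A sketch of the fail-fast pass, parameterized on a fetch function so it stays testable. The expected-schema table is a hypothetical excerpt; with DuckDB, `fetch_columns` could be something like `lambda t: conn.execute("SELECT column_name, data_type FROM information_schema.columns WHERE table_name = ?", [t]).fetchall()`:

```python
EXPECTED_SCHEMA = {
    # Hypothetical excerpt; fill in from the real schema.
    "motions": {"id": "BIGINT", "policy_area": "VARCHAR"},
}


def validate_schema(fetch_columns, expected=EXPECTED_SCHEMA):
    """Raise RuntimeError listing every mismatch, so startup fails fast.

    fetch_columns(table) -> list of (column_name, data_type) tuples.
    """
    problems = []
    for table, columns in expected.items():
        actual = dict(fetch_columns(table))
        if not actual:
            problems.append(f"missing table: {table}")
            continue
        for col, dtype in columns.items():
            if col not in actual:
                problems.append(f"{table}: missing column {col}")
            elif actual[col] != dtype:
                problems.append(
                    f"{table}.{col}: expected {dtype}, got {actual[col]}"
                )
    if problems:
        raise RuntimeError("Schema validation failed:\n" + "\n".join(problems))
```

Collecting all problems before raising (rather than failing on the first) gives one complete error message per startup, which is what makes a stale schema quick to fix.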

### 8. Motion Data Sanitization Pipeline — Pre-Insert Validation

**Description:** Add a sanitization layer for incoming motion data that validates:

- `winning_margin` is between 0 and 1
- `policy_area` is non-empty
- `voting_results` keys match known parties
- Date parsing succeeds for motion dates

**Rationale:** The current insertion code trusts upstream data. Invalid data causes hard-to-debug issues downstream in SVD computation and similarity calculations.

**Downsides:** Requires defining what "valid" means for each field.

**Confidence:** 80%

**Complexity:** Medium

**Status:** Unexplored

---
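A sketch of the pre-insert checks. The field names mirror the bullet list above but are assumptions about the real motion schema, and the party set is an illustrative subset, not the full list:

```python
from datetime import date

KNOWN_PARTIES = {"PVV", "FVD", "JA21", "SGP", "VVD", "D66"}  # illustrative subset


def sanitize_motion(motion: dict) -> list:
    """Return validation errors for one motion; an empty list means valid."""
    errors = []
    margin = motion.get("winning_margin")
    if not isinstance(margin, (int, float)) or not 0 <= margin <= 1:
        errors.append(f"winning_margin out of [0, 1]: {margin!r}")
    if not motion.get("policy_area"):
        errors.append("policy_area is empty")
    unknown = set(motion.get("voting_results", {})) - KNOWN_PARTIES
    if unknown:
        errors.append(f"unknown parties in voting_results: {sorted(unknown)}")
    try:
        date.fromisoformat(str(motion.get("date", "")))
    except ValueError:
        errors.append(f"unparseable motion date: {motion.get('date')!r}")
    return errors
```

Returning an error list instead of raising lets the ingestion layer decide per call site whether to skip the motion, quarantine it, or abort the batch.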

## Rejection Summary

| # | Idea | Reason Rejected |
|---|------|-----------------|
| 1 | Add unit tests for exception paths | Good idea but lower leverage than preventing errors at source; covered by existing test infrastructure |
| 2 | Refactor all 850+ exception handlers in one pass | Too high volume — needs phased approach captured by idea #1 |
| 3 | Add type hints to all functions | Good hygiene but doesn't directly address reliability — covered by existing typing effort |
| 4 | Implement circuit breaker for external API calls | No external API calls observed in core codebase |

## Session Log

- 2026-04-04: Initial ideation — 8 generated, 8 survived
@ -0,0 +1,149 @@

---
date: 2026-04-04
topic: stemwijzer-improvement-ideas
focus: general
---

# Ideation: Stemwijzer Improvement Ideas

## Codebase Context

**Project shape:** Python/Streamlit Dutch voting advice tool ("Stemwijzer")
- Uses uv for package management, pytest for testing, DuckDB for data
- Key modules: analysis/, pipeline/, database.py (50KB), explorer.py (143KB)
- Notable: 3 venvs (.venv, .venv_axis, .venv_plotly) suggest dependency experimentation
- AGENTS.md exists with conventions (right-wing parties on RIGHT side, SVD labels reflect voting patterns)

**Pain points identified:**
- explorer.py is a 143KB monolith - hard to navigate
- SVD labels must reflect voting patterns (documented as a learning)
- 850+ bare exception handlers documented as an anti-pattern
- No CONTRIBUTING.md for onboarding

**Leverage points:**
- Good test organization (tests/ with subdirs)
- Documented solutions in docs/solutions/
- explorer_helpers.py proves the pure-function pattern works
## Ranked Ideas

### 1. Right-Wing Party Axis Validation
**Description:** Add an automated test that asserts PVV, FVD, JA21, and SGP appear on the RIGHT side (positive loading) of all SVD/PCA axes.

**Rationale:** This is the #1 project convention (from AGENTS.md) with zero automated enforcement. The documented SVD label bug showed how easy it is to get this wrong. A simple test prevents regression.

**Downsides:** Requires defining "RIGHT side" for each component - some components may have flipped poles.

**Confidence:** 95%
**Complexity:** Low
**Status:** Unexplored
### 2. Extract Business Logic from explorer.py
**Description:** Break the 143KB explorer.py monolith into pure functions in a new module (e.g., analysis/explorer_core.py), keeping only UI glue in the main file.

**Rationale:** explorer.py is too large to navigate, review, or refactor safely. The explorer_helpers.py pattern already proves pure functions work. This enables parallel development and safer changes.

**Downsides:** High complexity - requires understanding all the current dependencies and careful extraction to avoid breaking the Streamlit UI.

**Confidence:** 90%
**Complexity:** High
**Status:** Unexplored
### 3. SVD Component Label Verification
**Description:** Create a pre-deployment verification script that checks SVD_THEMES labels against actual voting data, flagging components where labels don't match party score distributions.

**Rationale:** The documented SVD label bug showed labels can drift from reality. A verification step before deployment prevents this from recurring.

**Downsides:** Requires clear criteria for "label matches voting data" - some components are genuinely ambiguous.

**Confidence:** 85%
**Complexity:** Medium
**Status:** Unexplored
### 4. Interactive Component-Explorer UI
**Description:** Add a Streamlit UI selector letting users view any pair of SVD components as a 2D scatter plot, not just the political compass (components 1-2).

**Rationale:** Components 3-10 are essentially black boxes. Making these explorable reveals hidden political dimensions and adds significant user value.

**Downsides:** Requires understanding how to project between arbitrary component pairs.

**Confidence:** 85%
**Complexity:** Medium
**Status:** Unexplored
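The core projection is just column selection, so the logic can stay out of the UI layer. A pure helper keeps the Streamlit glue thin; the column naming (`comp_1`..`comp_n`) and the `party` column are assumptions about the score table:

```python
import pandas as pd


def component_pair_frame(scores: pd.DataFrame, x_comp: int, y_comp: int) -> pd.DataFrame:
    """Select two SVD component columns and rename them for a 2D scatter.

    In the UI, two st.selectbox widgets would drive x_comp and y_comp, and the
    result would feed plotly.express.scatter(frame, x="x", y="y", color="party").
    """
    frame = scores[[f"comp_{x_comp}", f"comp_{y_comp}", "party"]].copy()
    return frame.rename(columns={f"comp_{x_comp}": "x", f"comp_{y_comp}": "y"})
```

Because the helper is pure, the pair-selection behavior can be unit-tested without Streamlit, matching the explorer_helpers.py pattern mentioned under leverage points.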
### 5. Type-Safe Vote Normalization
**Description:** Replace string-based vote normalization (casting '1', '-1', '0' strings) with typed enums and exhaustiveness checking.

**Rationale:** Vote matching is core functionality - wrong types cause silent bugs. Typed enums catch errors at development time.

**Downsides:** Requires updating all callers and ensuring backward compatibility.

**Confidence:** 80%
**Complexity:** Medium
**Status:** Unexplored
### 6. Add CONTRIBUTING.md
**Description:** Create a top-level CONTRIBUTING.md covering setup (uv), running tests, lint/typecheck commands, and key conventions from AGENTS.md.

**Rationale:** AGENTS.md is internal-focused. A CONTRIBUTING.md lowers the barrier for external contributors and encodes project norms explicitly.

**Downsides:** Minimal; this is straightforward, low-risk documentation work.

**Confidence:** 75%
**Complexity:** Low
**Status:** Explored
### 7. Database Schema Validation
**Description:** Add startup validation that checks the actual database schema against the expected schema. Verify table existence, column types, and foreign key relationships. Fail fast with clear error messages if the schema is stale.
**Rationale:** The current code tries to add columns with `ALTER TABLE ... IF NOT EXISTS`, which can fail silently. A schema validation pass catches migration failures immediately.
**Downsides:** Schema changes require updating validation code.
**Confidence:** 85%
**Complexity:** Low
**Status:** Unexplored
### 8. DuckDB Connection Leak Detector
**Description:** Audit all `duckdb.connect()` calls for proper context manager usage or explicit `.close()`. Many handlers catch exceptions but forget to close connections. Add a `ConnectionTracker` that warns on unclosed connections in development.
**Rationale:** Connection leaks accumulate and eventually exhaust database connections. The codebase has 15+ places where exceptions cause early returns without connection cleanup.
**Downsides:** Tracking adds overhead; some leaks are already handled by DuckDB's connection pooling.
**Confidence:** 85%
**Complexity:** Medium
**Status:** Unexplored
### 9. Static Analysis Rule for Bare Except
**Description:** Add a flake8 plugin or ruff rule that flags `except:` and `except Exception:` without re-raising or logging. Document the project-specific exception hierarchy.
**Rationale:** Prevents the anti-pattern from re-entering. The project has 208 violations — a custom lint rule catches new violations and encodes the team's error-handling philosophy.
**Downsides:** Requires defining what specific exceptions ARE allowed per context.
**Confidence:** 70%
**Complexity:** Low
**Status:** Unexplored
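If the project standardizes on ruff, a minimal configuration sketch would be the following. The rule codes are assumptions to verify against the installed ruff version: `E722` flags bare `except:`, and `BLE001` (from the flake8-blind-except rule set) flags catching broad `Exception`:

```toml
# pyproject.toml (sketch)
[tool.ruff.lint]
extend-select = ["E722", "BLE001"]
```

Off-the-shelf rules only flag the catch itself, not whether the handler logs or re-raises; enforcing that part of the convention would still need review or a custom rule.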
### 10. SVD Component Label Verification
**Description:** Create a CI/CD pre-deployment script that verifies SVD labels against actual voting data — checking that labels match the voting pattern, not semantic assumptions.
**Rationale:** The SVD label documentation exists but there's no enforcement. This automated check prevents the documented mistake from recurring.
**Downsides:** Requires understanding of the SVD pipeline and periodic re-calibration.
**Confidence:** 75%
**Complexity:** Medium
**Status:** Unexplored

## Rejection Summary (Raised Bar — 2026-04-05)

| # | Idea | Reason Rejected |
|---|------|-----------------|
| 1 | Consolidate 3 venvs into 1 | Lower priority - works currently, would need investigation |
| 2 | Modularize database.py | Secondary to explorer.py refactor; not a direct user/developer impact |
| 3 | Add Makefile/Task Aliases | Nice-to-have, lower leverage |
| 4 | Exception Handler Audit (208 handlers) | Too large to scope safely; architectural, not fixing root cause |
| 5 | Add Comprehensive Type Hints | Huge scope; hygiene, not correctness |
| 6 | Party Polarization Score | Interesting but niche |
| 7 | Scree Plot Extension | Low urgency feature |
| 8 | Typed DTOs for Database Layer | High migration effort; duckdb interop complications |
| 9 | Nested Exception Handler Flattening | Architectural refactor; too much change for uncertain value |
| 10 | Print→Logging Replacement (179 print statements) | High effort, low leverage — logging exists but is not used |
| 11 | Code Climate Metrics | Measures for their own sake; doesn't directly prevent bugs |
| 12 | CONTRIBUTING.md | Good hygiene, low urgency — can defer |

## Session Log

- 2026-04-04: Initial ideation — 32 generated, 6 survived
- 2026-04-05: Raised the bar — 22 ideas reviewed, 5 survivors after stricter filtering
- Idea #1 (Right-Wing Party Axis Validation) selected for brainstorming
@ -0,0 +1,525 @@

# Fix Trajectory Plot Not Showing - Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Fix the trajectory plot not showing by diagnosing and handling the NaN centroid edge case that's causing `trace_count == 0`

**Architecture:** Add diagnostics to identify why `plottable_parties` is empty, improve the name matching between positions and party_map, and ensure the plot renders even when party centroids have NaN values by falling back to MP trajectories.

**Tech Stack:** Python, Streamlit, Plotly, DuckDB, NumPy

---

## Investigation Summary

The trajectory plot isn't rendering because:
1. `trace_count == 0` at `explorer.py:2099`
2. `plottable_parties` is empty because all party centroids have NaN values
3. NaN centroids occur when MP names in `positions_by_window` don't match names in `party_map`
4. The data exists (73k SVD vectors, 1036 party mappings) but the join fails silently

## Files to Modify

- `explorer.py` - Main trajectory tab logic (lines 1601-2143)
- `explorer_helpers.py` - `compute_party_centroids()` function (line 246)
- `tests/test_trajectory_debug_diagnostics.py` - New test for diagnostics

---

### Task 1: Add Diagnostic Logging to Identify the Root Cause

**Files:**
- Modify: `explorer.py:1966-2010` (around the `select_trajectory_plot_data` call)

- [ ] **Step 1: Add diagnostics to show why trace_count is 0**

Add diagnostic logging before the `trace_count == 0` check to capture the state:

```python
# Around line 2095 in explorer.py, before the trace_count check
# Add detailed diagnostics to understand why trace_count is 0
# (ensure `from datetime import datetime` exists at the top of explorer.py)

# Debug: Log the state of data leading to trace_count
if trace_count == 0:
    _last_trajectories_diagnostics.update({
        "stage": "zero_traces",
        "positions_count": sum(len(pos) for pos in positions_by_window.values()) if positions_by_window else 0,
        "party_map_count": len(party_map) if party_map else 0,
        "centroids_count": len(centroids) if centroids else 0,
        "selected_parties_count": len(selected_parties) if selected_parties else 0,
        "timestamp": datetime.now().isoformat(),
    })

    # Check if there are positions but no centroids (name mismatch)
    if positions_by_window and party_map and not centroids:
        # Sample some MP names from the first window
        sample_mps = []
        for window, positions in list(positions_by_window.items())[:1]:
            sample_mps = list(positions.keys())[:5]
            break

        # Check if these MPs are in party_map
        matched = sum(1 for mp in sample_mps if mp in party_map)
        _last_trajectories_diagnostics["name_match_check"] = {
            "sample_mps": sample_mps,
            "matched_in_party_map": matched,
            "sample_size": len(sample_mps),
        }
```

- [ ] **Step 2: Run the app and check the diagnostics**

Run the Streamlit app and navigate to the trajectory tab:

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m streamlit run Home.py
```

Check whether the diagnostics now show why `trace_count` is 0.

- [ ] **Step 3: Commit the diagnostic changes**

```bash
git add explorer.py
git commit -m "diagnose(trajectory): add diagnostics to identify why trace_count is 0"
```

---

### Task 2: Improve Party Centroid Calculation with NaN Handling

**Files:**
- Modify: `explorer_helpers.py:246-297` (`compute_party_centroids` function)

- [ ] **Step 1: Add diagnostics to compute_party_centroids**

Modify the `compute_party_centroids` function to log when parties have NaN centroids:

```python
# In explorer_helpers.py, modify the compute_party_centroids function
# Add at the start of the function (around line 249)

def compute_party_centroids(positions_by_window, party_map, min_mps=5):
    """
    Compute party centroids from MP positions.

    Returns:
        dict: {party: [(x, y), ...]} for each window
        dict: Diagnostic info about the computation
    """
    diagnostics = {
        "input_windows": len(positions_by_window) if positions_by_window else 0,
        "input_party_map_entries": len(party_map) if party_map else 0,
        "windows_processed": 0,
        "parties_with_positions": set(),
        "parties_all_nan": [],
        "name_mismatch_samples": [],
    }

    if not positions_by_window or not party_map:
        return {}, diagnostics

    # ... rest of existing code ...

    # After computing centroids, check for all-NaN parties
    for party, coords in party_centroids.items():
        if all(np.isnan(x) and np.isnan(y) for x, y in coords):
            diagnostics["parties_all_nan"].append(party)

    return party_centroids, diagnostics
```

- [ ] **Step 2: Update the return signature and handle the new return value**

Change the return from:
```python
return party_centroids
```
to:
```python
return party_centroids, diagnostics
```

Then update all callers to handle the new return value. Search for all usages:

```bash
grep -n "compute_party_centroids" explorer.py
```

Update each call site to unpack the tuple:

```python
# Change from:
centroids = compute_party_centroids(positions_by_window, party_map)

# To:
centroids, centroid_diagnostics = compute_party_centroids(positions_by_window, party_map)
```

- [ ] **Step 3: Run tests to verify the changes work**

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m pytest tests/test_compute_party_centroids.py -v
```

Expected: Tests pass (or need updating if they check the return value)

- [ ] **Step 4: Update tests for the new return signature**

If tests fail, update them to handle the new return signature:

```python
# In tests/test_compute_party_centroids.py
# Change assertions from:
centroids = compute_party_centroids(...)

# To:
centroids, diagnostics = compute_party_centroids(...)
```

- [ ] **Step 5: Commit the centroid diagnostics**

```bash
git add explorer_helpers.py tests/test_compute_party_centroids.py
git commit -m "fix(trajectory): add diagnostics to compute_party_centroids for NaN detection"
```

---

### Task 3: Fix the Name Mismatch Between Positions and Party Map

**Files:**
- Modify: `explorer.py:1645-1660` (around `load_party_map` and centroid computation)

- [ ] **Step 1: Add name normalization to improve matching**

MP names might have slightly different formats between SVD vectors and metadata. Add normalization:

```python
# In explorer.py, after loading party_map (around line 1645)
# Add name normalization to improve matching

def normalize_mp_name(name):
    """Normalize an MP name for better matching between data sources."""
    if not name:
        return name
    # Remove surrounding whitespace
    name = name.strip()
    # Ensure consistent spacing after the comma
    if ',' in name and ', ' not in name:
        name = name.replace(',', ', ')
    return name

# Normalize party_map keys
party_map = {normalize_mp_name(k): v for k, v in party_map.items()}

# Also normalize MP names in positions_by_window
normalized_positions = {}
for window, positions in positions_by_window.items():
    normalized_positions[window] = {
        normalize_mp_name(k): v for k, v in positions.items()
    }
positions_by_window = normalized_positions
```

- [ ] **Step 2: Add validation to log name matching issues**

After normalization, check how many MPs are matched (guarding against an empty set to avoid a division by zero):

```python
# After normalization, log the match rate
all_mp_names = set()
for positions in positions_by_window.values():
    all_mp_names.update(positions.keys())

matched_names = sum(1 for mp in all_mp_names if mp in party_map)
total_names = len(all_mp_names)
if total_names:
    logger.info(
        f"MP name matching: {matched_names}/{total_names} matched "
        f"({100 * matched_names / total_names:.1f}%)"
    )

if matched_names == 0 and total_names > 0:
    logger.warning("No MP names matched between positions and party_map!")
    logger.warning(f"Sample positions names: {list(all_mp_names)[:5]}")
    logger.warning(f"Sample party_map names: {list(party_map.keys())[:5]}")
```

- [ ] **Step 3: Run the app and verify name matching improves**

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m streamlit run Home.py
```

Check the logs for match-rate information.

- [ ] **Step 4: Commit the name normalization fix**

```bash
git add explorer.py
git commit -m "fix(trajectory): normalize MP names to improve party_map matching"
```

---

### Task 4: Ensure Plot Renders Even with Partial Data

**Files:**
- Modify: `explorer.py:1736-1777` (fallback to MP trajectories)
- Modify: `explorer.py:2099-2143` (trace_count == 0 handling)

- [ ] **Step 1: Improve the MP trajectory fallback**

When party centroids fail, ensure the MP trajectory fallback actually works:

```python
# In explorer.py, around line 1750 where mp_positions is computed
# Make sure this path actually produces a plot

if not centroids:
    # Fallback: plot individual MP trajectories
    st.info("Partijcentroiden niet beschikbaar — tonen individuele MP-trajecten als fallback.")

    # Collect MP positions across all windows
    mp_positions = {}
    for window, positions in positions_by_window.items():
        for mp, (x, y) in positions.items():
            if mp not in mp_positions:
                mp_positions[mp] = {}
            mp_positions[mp][window] = (x, y)

    # Filter to MPs with at least 2 windows (need a trajectory, not just a point)
    mp_positions = {
        mp: pos for mp, pos in mp_positions.items()
        if len(pos) >= 2 and not all(np.isnan(x) and np.isnan(y) for x, y in pos.values())
    }

    if not mp_positions:
        st.warning("Geen positiedata beschikbaar voor trajectplotten.")
        _last_trajectories_diagnostics["stage"] = "no_mp_positions"
        return

    # Store for later use
    st.session_state["_trajectory_mp_positions"] = mp_positions
```

- [ ] **Step 2: Fix the trace_count == 0 handling**

When `trace_count == 0`, provide more helpful information:

```python
# In explorer.py, around line 2099, replace the existing trace_count == 0 block

if trace_count == 0:
    st.info("📊 **Geen trajecten getekend**")

    # Show diagnostic information
    with st.expander("🔍 Diagnostische informatie"):
        st.write("**Data status:**")
        st.write(f"- Positie vensters: {len(positions_by_window) if positions_by_window else 0}")
        st.write(f"- Party mappings: {len(party_map) if party_map else 0}")
        st.write(f"- Geselecteerde partijen: {len(selected_parties) if selected_parties else 0}")

        if 'centroid_diagnostics' in locals():
            st.write("**Centroid berekening:**")
            st.write(f"- Partijen met posities: {len(centroid_diagnostics.get('parties_with_positions', []))}")
            st.write(f"- Partijen met alleen NaN: {len(centroid_diagnostics.get('parties_all_nan', []))}")

        st.write("\n**Mogelijke oorzaken:**")
        st.write("1. Geen SVD vectoren berekend voor de geselecteerde vensters")
        st.write("2. MP namen in posities komen niet overeen met party_map")
        st.write("3. Alle geselecteerde partijen hebben te weinig MPs (< 5)")

    # Add a button to run diagnostics
    if st.button("🔧 Database diagnostiek uitvoeren"):
        with st.spinner("Bezig met diagnostiek..."):
            # Import and run diagnostics
            from scripts.diagnose_trajectories_cli import diagnose_trajectories
            results = diagnose_trajectories(db_path)
            st.json(results)
else:
    # Render the plot
    st.plotly_chart(fig, use_container_width=True, key="trajectory_plot")
```

- [ ] **Step 3: Test the improved error handling**

Run the app and verify:
1. When data is missing, helpful diagnostics appear
2. The expander shows detailed information
3. The database diagnostics button works

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m streamlit run Home.py
```

- [ ] **Step 4: Commit the improved fallback**

```bash
git add explorer.py
git commit -m "fix(trajectory): improve fallback handling and diagnostics when trace_count is 0"
```

---

### Task 5: Add Integration Test for the Fix

**Files:**
- Create: `tests/test_trajectory_plot_renders.py`

- [ ] **Step 1: Create a test that verifies the plot renders**

```python
# tests/test_trajectory_plot_renders.py
"""
Test that the trajectory plot renders even with edge cases.
"""

import sys

import numpy as np
import pytest

# Make the project importable when tests run from elsewhere
sys.path.insert(0, '/home/sgeboers/Projects/stemwijzer')

from explorer_helpers import compute_party_centroids


class TestTrajectoryPlotRendering:
    """Tests to ensure the trajectory plot renders in various scenarios."""

    def test_compute_party_centroids_returns_diagnostics(self):
        """compute_party_centroids returns a (centroids, diagnostics) tuple."""
        positions_by_window = {
            "2024-Q1": {"MP1": (1.0, 2.0), "MP2": (3.0, 4.0)},
            "2024-Q2": {"MP1": (1.5, 2.5), "MP2": (3.5, 4.5)},
        }
        party_map = {"MP1": "PartyA", "MP2": "PartyA"}

        centroids, diagnostics = compute_party_centroids(
            positions_by_window, party_map, min_mps=1
        )

        assert isinstance(centroids, dict)
        assert isinstance(diagnostics, dict)
        assert "input_windows" in diagnostics
        assert diagnostics["input_windows"] == 2

    def test_compute_party_centroids_detects_all_nan_parties(self):
        """Diagnostics identify parties with all-NaN centroids."""
        positions_by_window = {
            "2024-Q1": {"MP1": (np.nan, np.nan)},
            "2024-Q2": {"MP1": (np.nan, np.nan)},
        }
        party_map = {"MP1": "PartyA"}

        centroids, diagnostics = compute_party_centroids(
            positions_by_window, party_map, min_mps=1
        )

        assert "PartyA" in diagnostics.get("parties_all_nan", [])

    def test_name_normalization_improves_matching(self):
        """Normalized names match across differently formatted sources."""
        # Positions use a name WITHOUT a space after the comma
        positions_by_window = {
            "2024-Q1": {"Agema,M.": (1.0, 2.0)},
        }
        # Party map uses the canonical spacing, so the raw keys do not match
        party_map = {"Agema, M.": "PVV"}

        def normalize_mp_name(name):
            if not name:
                return name
            name = name.strip()
            if ',' in name and ', ' not in name:
                name = name.replace(',', ', ')
            return name

        normalized_party_map = {
            normalize_mp_name(k): v for k, v in party_map.items()
        }
        normalized_positions = {
            window: {normalize_mp_name(k): v for k, v in positions.items()}
            for window, positions in positions_by_window.items()
        }

        # Check matching
        all_mp_names = set()
        for positions in normalized_positions.values():
            all_mp_names.update(positions.keys())

        matched = sum(1 for mp in all_mp_names if mp in normalized_party_map)
        assert matched > 0, "Name normalization should improve matching"


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
```

- [ ] **Step 2: Run the new tests**

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m pytest tests/test_trajectory_plot_renders.py -v
```

Expected: All tests pass

- [ ] **Step 3: Commit the new tests**

```bash
git add tests/test_trajectory_plot_renders.py
git commit -m "test(trajectory): add tests for plot rendering with edge cases"
```

---

### Task 6: Run Full Test Suite

**Files:**
- All test files

- [ ] **Step 1: Run all trajectory-related tests**

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m pytest tests/test_trajectory*.py tests/test_compute_party_centroids.py -v
```

Expected: All tests pass

- [ ] **Step 2: Verify no regressions in other tests**

```bash
cd /home/sgeboers/Projects/stemwijzer
.venv/bin/python -m pytest tests/test_explorer*.py -v
```

Expected: All tests pass

- [ ] **Step 3: Final commit**

```bash
git log --oneline -5  # Review commits
git status            # Ensure all changes are committed
```

---

## Self-Review Checklist

- [ ] **Spec coverage:** All diagnostic and fallback improvements are covered
- [ ] **Placeholder scan:** No TBD, TODO, or incomplete sections
- [ ] **Type consistency:** Return signatures match between function and callers
- [ ] **Test coverage:** New tests added for edge cases

## Execution Handoff

**Plan complete.** Two execution options:

**1. Subagent-Driven (recommended)** - I dispatch a fresh subagent per task, review between tasks, fast iteration

**2. Inline Execution** - Execute tasks in this session using executing-plans, batch execution with checkpoints for review

Which approach would you prefer?
@ -0,0 +1,220 @@
---
title: "refactor: Extract business logic from explorer.py to analysis/"
type: refactor
status: active
date: 2026-04-04
origin: docs/brainstorms/2026-04-04-explorer-refactor-requirements.md
---

# Refactor: Extract Business Logic from explorer.py to analysis/

## Overview

Split the 3715-line `explorer.py` into clear layers: data loading, business logic, and UI. This improves navigability and testability while preserving all existing behavior.

## Problem Frame

`explorer.py` mixes three concerns (data loading, computation, UI), making it:
- Hard to navigate — no clear boundaries
- Hard to test — requires Streamlit + DuckDB
- Hard to review — changes affect everything

## Requirements Trace

- R1.1: Create `analysis/explorer_data.py` with data loading functions
- R1.2: Data functions callable without Streamlit imports
- R1.3: Functions return pure Python data structures
- R2.1: Move computation to domain-appropriate `analysis/` modules
- R2.2: Computations are pure functions
- R3.1: explorer.py becomes a thin orchestration layer
- R3.2: `_render_*` functions stay in explorer.py
- R3.3: `build_*_tab()` functions delegate to imported functions
- R4.1: No circular imports
- R5.1: Data functions testable with mocked DuckDB
- R5.2: Computation functions pure and testable

## Key Technical Decisions

- **Domain-based splitting**: Computation goes to the relevant `analysis/` module
- **Import direction**: `explorer.py` imports from `analysis/`, never vice versa
- **Preserve signatures**: Refactoring doesn't change public APIs
- **`_load_mp_vectors_by_party` variants**: Keep separate (they serve different use cases)
- **`analysis/projections.py`**: Create new file (distinct from axis_classifier.py)
- **`_cached_bootstrap_cis()`**: Keep as a cache wrapper in explorer.py, move computation to analysis/

## Open Questions

### Resolved During Planning

- **`_load_mp_vectors_by_party` variants**: Keep separate — they have different signatures and use cases
- **`analysis/projections.py`**: Create new file — projections are distinct from axis classification
- **`_cached_bootstrap_cis()`**: Keep wrapper in explorer.py, move computation to analysis/trajectories.py

### Deferred to Implementation

- Exact function grouping within `analysis/explorer_data.py` — will be refined during extraction
- Whether to add `__all__` exports — decide based on usage patterns after extraction

## Implementation Units

- [ ] **Unit 1: Create `analysis/explorer_data.py` skeleton**

**Goal:** Create the data loading module with extracted functions

**Requirements:** R1.1, R1.2, R1.3

**Dependencies:** None

**Files:**
- Create: `analysis/explorer_data.py`

**Approach:**
1. Create the module with docstring and imports
2. Add stub functions with original signatures (no implementation)
3. Copy docstrings and type hints from explorer.py

**Functions to extract:**
- `get_available_windows(db_path: str) -> List[str]`
- `get_uniform_dim_windows(db_path: str) -> List[str]`
- `load_positions(db_path: str, window_size: str) -> pd.DataFrame`
- `load_party_map(db_path: str) -> Dict[str, str]`
- `load_active_mps(db_path: str) -> set`
- `load_party_axis_scores(db_path: str) -> Dict[str, List[float]]`
- `load_party_axis_scores_for_window(db_path: str, window: str) -> Dict[str, List[float]]`
- `load_party_scores_all_windows(db_path: str) -> Dict[str, List[List[float]]]`
- `load_party_scores_all_windows_aligned(db_path: str) -> Dict[str, List[List[float]]]`
- `load_party_mp_vectors(db_path: str) -> Dict[str, List[np.ndarray]]`
- `load_scree_data(db_path: str) -> List[float]`
- `load_motions_df(db_path: str) -> pd.DataFrame`

**Patterns to follow:**
- `explorer_helpers.py` conventions (pure functions, no IO side effects)
- `database.py` for DuckDB connection patterns

**Verification:**
- Module imports without errors
- All functions have correct signatures

---

- [ ] **Unit 2: Create `analysis/projections.py`**

**Goal:** Create module for SVD projection and axis utilities

**Requirements:** R2.1, R2.2

**Dependencies:** Unit 1

**Files:**
- Create: `analysis/projections.py`

**Approach:**
1. Extract `_should_swap_axes()` and `_swap_axes()` from explorer.py
2. Add pure projection computation functions

**Functions to extract:**
- `_should_swap_axes(axis_def: dict) -> bool`
- `_swap_axes(axis_def: dict) -> dict`
- `project_motions_onto_axis(motion_ids, scores) -> List[Tuple[int, float]]` (stub)

**Patterns to follow:**
- Pure function conventions from `explorer_helpers.py`

**Verification:**
- Functions work without Streamlit/DuckDB imports

---

- [ ] **Unit 3: Update `analysis/trajectories.py`**

**Goal:** Add trajectory computation functions from explorer.py

**Requirements:** R2.1, R2.2

**Dependencies:** Unit 1

**Files:**
- Modify: `analysis/trajectories.py`

**Approach:**
1. Add `compute_party_discipline()` and related functions
2. Add `compute_trajectory_points()` (pure computation)

**Functions to add:**
- `compute_party_discipline(mp_scores: Dict[str, List[float]]) -> Dict[str, float]`
- `compute_2d_trajectories(positions_by_window, party_axis_scores)` (stub)
- `compute_aligned_trajectories(positions_by_window, party_scores_all)` (stub)

**Verification:**
- Functions are pure (no IO)
- Existing trajectory.py tests pass
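As a concrete reference point for this unit, here is a hypothetical sketch of `compute_party_discipline` under one plausible reading of the signature: keys are parties, values are member scores on a single axis. The dispersion-based metric below is an assumption for illustration, not the project's actual definition.

```python
import statistics
from typing import Dict, List

def compute_party_discipline(mp_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Hypothetical metric: tighter clustering of member scores -> value closer to 1."""
    discipline: Dict[str, float] = {}
    for party, scores in mp_scores.items():
        if len(scores) < 2:
            discipline[party] = 1.0  # a single member is trivially unanimous
        else:
            # 1 / (1 + stdev): 1.0 for perfect agreement, decaying as spread grows
            discipline[party] = 1.0 / (1.0 + statistics.stdev(scores))
    return discipline
```

Because the function is pure (no IO, no Streamlit), it can be unit tested with plain dicts, which is the point of moving it into `analysis/`.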

---

- [ ] **Unit 4: Wire up imports in explorer.py**

**Goal:** Update explorer.py to import from the new modules

**Requirements:** R3.1, R3.3, R4.1

**Dependencies:** Units 1, 2, 3

**Files:**
- Modify: `explorer.py`

**Approach:**
1. Replace local function definitions with imports
2. Keep wrapper functions where needed for `@st.cache_data`
3. Verify no circular imports

**Verification:**
- explorer.py imports work
- No circular import errors
- Streamlit app runs correctly

---

- [ ] **Unit 5: Final cleanup and verification**

**Goal:** Ensure explorer.py meets the success criteria

**Requirements:** All

**Dependencies:** Unit 4

**Approach:**
1. Count lines in explorer.py — target under 1500
2. Check that no function exceeds 100 lines
3. Verify all extracted functions have docstrings
4. Run existing tests

**Verification:**
- `wc -l explorer.py` < 1500
- All functions under 100 lines
- Tests pass

## System-Wide Impact

- **Interaction graph:** explorer.py imports from analysis/ — no reverse imports
- **Error propagation:** Data functions raise exceptions on DB errors (same as before)
- **API surface parity:** All function signatures preserved
- **Unchanged invariants:** UI behavior identical, no new features

## Risks & Dependencies

| Risk | Mitigation |
|------|------------|
| Breaking existing function signatures | Preserve exact signatures, update in place |
| Circular imports | One-way import direction (explorer → analysis only) |
| Regression in UI behavior | Test after each unit, verify Streamlit app runs |

## Documentation / Operational Notes

- Update `ARCHITECTURE.md` to document the new `analysis/explorer_data.py` module
- No changes to deployment or configuration needed

## Sources & References

- **Requirements doc:** `docs/brainstorms/2026-04-04-explorer-refactor-requirements.md`
- Related code: `explorer.py`, `explorer_helpers.py`, `analysis/trajectories.py`
- Pattern reference: `explorer_helpers.py` (pure function conventions)
@ -0,0 +1,182 @@
---
title: "refactor: Complete explorer.py decomposition — extract tabs, constants, and rendering"
type: refactor
status: completed
date: 2026-04-04
origin: docs/plans/2026-04-04-002-refactor-explorer-extraction-plan.md
completed: 2026-04-04
---

# Refactor: Complete explorer.py Decomposition

## Overview

Completed extraction of constants and tab module structure from `explorer.py`. Tab functions remain in explorer.py pending Streamlit decoupling.

## Problem Frame

The first phase extracted data loading functions to `analysis/explorer_data.py`. The remaining content contains:
- Tab building functions (~1617 lines across 6 tabs)
- Rendering helpers (~600 lines)
- Constants (~237 lines)

## Current State

| Module | Lines | Status |
|--------|-------|--------|
| `explorer.py` | 3102 | In progress |
| `analysis/explorer_data.py` | 549 | Done |
| `analysis/projections.py` | 121 | Done |
| `analysis/trajectory.py` | 380 | Done |
| `analysis/config.py` | 230 | **NEW** |
| `analysis/tabs/` | - | **NEW** (placeholders) |
| `analysis/visualize.py` | 434 | Existing |
| Target | <1500 | Partial |

## Requirements Trace

- R1.1: Extract `build_*_tab()` functions to `analysis/tabs/`
- R1.2: Extract `_render_*` helpers to `analysis/rendering.py`
- R1.3: Extract constants to `analysis/config.py`
- R2.1: Preserve `@st.cache_data` decorators in explorer.py
- R3.1: Maintain import direction: explorer.py → analysis/ only

## Scope Boundaries

**Included:**
- Tab function extraction (6 tabs)
- Rendering helper extraction
- Constant extraction

**Excluded:**
- Behavior changes (UI looks the same)
- New test coverage (existing tests pass)
- Database schema changes

## Key Technical Decisions

- **Tab modules**: Create `analysis/tabs/compass.py`, `trajectories.py`, `search.py`, `browser.py`, `components.py`, `quiz.py`
- **Rendering module**: `analysis/rendering.py` contains all `_render_*` and `_build_*` functions
- **Config module**: `analysis/config.py` contains all constants
- **Backward compatibility**: Keep wrapper functions in explorer.py for `@st.cache_data` decorators
- **Import pattern**: Each tab module imports from `analysis/` (data, projections, config)

## Implementation Units

- [x] **Unit 6: Extract constants to `analysis/config.py`** ✓

**Goal:** Centralize all constants used across the explorer

**Requirements:** R1.3

**Dependencies:** None

**Files:**
- Create: `analysis/config.py`
- Modify: `explorer.py`

**Approach:**
Extracted these constants from explorer.py:
1. `PARTY_COLOURS: Dict[str, str]` - party color mapping
2. `SVD_THEMES: dict[int, dict[str, str]]` - SVD component themes
3. `KNOWN_MAJOR_PARTIES` - ordered party list
4. `CURRENT_PARLIAMENT_PARTIES: frozenset[str]` - current party list
5. `_PARTY_NORMALIZE: dict[str, str]` - party name normalization

**Verification:**
- `explorer.py` imports from `analysis/config.py`
- All tests pass (153 passed)

**Lines saved:** ~237

---

- [x] **Unit 7: Extract `_render_*` helpers** - SKIPPED

**Decision:** UI rendering functions use Streamlit (`st.*`). Per R3.2, UI functions stay in explorer.py.

---

- [x] **Units 8-10: Tab extraction** - PARTIAL

**Goal:** Create module structure for tab functions

**Status:** Created `analysis/tabs/` with placeholder modules. Actual tab functions remain in explorer.py due to tight Streamlit coupling.

**Files:**
- Create: `analysis/tabs/__init__.py`
- Create: `analysis/tabs/compass.py`
- Create: `analysis/tabs/trajectories.py`
- Create: `analysis/tabs/search.py`
- Create: `analysis/tabs/browser.py`
- Create: `analysis/tabs/components.py`
- Create: `analysis/tabs/quiz.py`

**Note:** Full tab extraction requires decoupling rendering logic from Streamlit, which is a larger refactoring effort beyond the current scope.

---

- [x] **Unit 11: Final cleanup and line count verification**

**Verification:**
- `wc -l explorer.py`: 3102 lines (reduced from 3715)
- All tests pass (153 passed, 2 skipped)
- Import verification passes

## File Structure (Target)

```
analysis/
├── __init__.py
├── config.py            # NEW: Constants (PARTY_COLOURS, SVD_THEMES, etc.)
├── explorer_data.py     # Data loading (done)
├── projections.py       # Pure projection math (done)
├── rendering.py         # NEW: _render_* and _build_* helpers
├── trajectory.py        # Trajectory computation (done)
├── visualize.py         # Existing visualization utils
└── tabs/                # NEW: Tab modules
    ├── __init__.py
    ├── compass.py       # build_compass_tab
    ├── trajectories.py  # build_trajectories_tab
    ├── search.py        # build_search_tab
    ├── browser.py       # build_browser_tab
    ├── components.py    # build_svd_components_tab
    └── quiz.py          # build_mp_quiz_tab
```

## System-Wide Impact

- **Interaction graph:** explorer.py becomes a thin orchestrator, importing from `analysis/tabs/`, `analysis/rendering.py`, `analysis/config.py`, and `analysis/explorer_data.py`
- **API surface parity:** All function signatures preserved (wrappers where needed)
- **Unchanged invariants:** UI behavior identical, no behavior changes

## Risks & Dependencies

| Risk | Mitigation |
|------|------------|
| Breaking `@st.cache_data` caching behavior | Keep cache decorators in explorer.py wrappers |
| Circular imports between tabs and rendering | Rendering module has no tab dependencies |
| Test failures from refactoring | Run tests after each unit |
| Missing imports after extraction | Verify imports after each extraction |

## Verification Commands

```bash
# Line count
wc -l explorer.py  # Target: < 1500

# Import verification
uv run python -c "import explorer; print('Import OK')"

# Tests
uv run pytest tests/ -x

# Individual tab tests
uv run pytest tests/test_political_compass.py -v
```

## Sources & References

- **Original plan:** `docs/plans/2026-04-04-002-refactor-explorer-extraction-plan.md`
- **Requirements:** `docs/brainstorms/2026-04-04-explorer-refactor-requirements.md`
- **Pattern reference:** `explorer_helpers.py` (pure function conventions)
@ -0,0 +1,231 @@
---
title: "Right-Wing Party Axis Validation"
type: feat
status: completed
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-right-wing-party-axis-validation-requirements.md
---

# Right-Wing Party Axis Validation

## Overview

Add automated tests that assert PVV, FVD, JA21, and SGP appear on the RIGHT side of the political compass (mean-based), using real DuckDB data. Consolidate the conflicting `RIGHT_PARTIES`/`LEFT_PARTIES` inline definitions into `analysis/config.py`.

## Problem Frame

The AGENTS.md convention states that PVV, FVD, JA21, and SGP must appear on the RIGHT side of all axes. Multiple files define conflicting party sets: `svd_labels.py` has 9 right parties, `political_axis.py` has 6, and neither matches the convention. No automated validation exists.

## Requirements Trace

- R1. Canonical party sets defined once, imported everywhere
- R2. Validation test loads real data from DuckDB
- R3. 2D political compass orientation check (statistical, mean-based)
- R4. `compute_flip_direction` consistency check
- R5. Clear failure messages

## Scope Boundaries

- Only aligned scores validated (not unaligned)
- Center parties (VVD, NSC, BBB, CDA, ChristenUnie) not validated
- Per-party strict sign checks excluded — a statistical mean check is used instead
- `political_axis.py` not updated (out of scope per requirements)

## Context & Research

### Relevant Code and Patterns

- `analysis/config.py` — existing constants module with `__all__`, `_PARTY_NORMALIZE` at lines 247-256
- `analysis/svd_labels.py` — `compute_flip_direction` at lines 127-166, uses inline `RIGHT_PARTIES`/`LEFT_PARTIES`
- `analysis/explorer_data.py` — `load_party_scores_all_windows_aligned` at lines 212-241, returns `{party: [[x,y] per window]}`
- `analysis/trajectory.py` — `_load_window_ids` at line 121 (not exported in `__all__`)
- `tests/conftest.py` — `tmp_duckdb_path` fixture at line 70, `tmp_duckdb_conn` fixture at line 76
- `tests/test_svd_labels.py` — existing tests for `compute_flip_direction` with synthetic data

### Key Structural Insight

`load_party_scores_all_windows_aligned` returns `{party: [[x, y], [x, y], ...]}` — data grouped by party, not by window. To validate per window, the test must iterate window indices and build per-window dicts: `{party: [x, y]}` where the index matches the window position.

`compute_flip_direction(component, {party: [scores]})` indexes into `scores[component-1]`, so:
- `compute_flip_direction(1, party_scores)` checks x-axis orientation
- `compute_flip_direction(2, party_scores)` checks y-axis orientation
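To make the indexing convention concrete, here is a minimal sketch of the check. The real `compute_flip_direction` lives in `analysis/svd_labels.py`; the body below is an illustrative reimplementation (the party sets are the canonical ones named in this plan, and the mean comparison is the plan's stated semantics, not the verified source):

```python
from typing import Dict, List

RIGHT_PARTIES = frozenset({"PVV", "FVD", "JA21", "SGP"})
LEFT_PARTIES = frozenset({"SP", "PvdA", "GL", "GroenLinks", "GroenLinks-PvdA", "DENK", "PvdD", "Volt"})

def compute_flip_direction(component: int, party_scores: Dict[str, List[float]]) -> bool:
    """Return True when the axis should be flipped (right-wing mean below left-wing mean)."""
    idx = component - 1  # component 1 -> x-axis (index 0), component 2 -> y-axis (index 1)
    right = [s[idx] for p, s in party_scores.items() if p in RIGHT_PARTIES and idx < len(s)]
    left = [s[idx] for p, s in party_scores.items() if p in LEFT_PARTIES and idx < len(s)]
    if not right or not left:
        return False  # nothing to compare; no flip
    return sum(right) / len(right) < sum(left) / len(left)
```

With `{"PVV": [0.5, 0.3], "SP": [-0.4, -0.2]}`, both `compute_flip_direction(1, ...)` and `compute_flip_direction(2, ...)` come out `False`: the right-wing mean already sits above the left-wing mean on both axes.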

## Key Technical Decisions

- **Synthetic DuckDB fixture data, not real DB**: A temporary DB with controlled `party_axis_scores` rows avoids dependency on a populated real database. Follows the existing pattern from `test_analysis.py`.
- **Extract window-indexing helper**: A helper `build_window_party_scores(scores_by_party, window_idx)` separates data transformation from DB access — this enables unit testing the logic without DuckDB.
- **`_PARTY_NORMALIZE` for alias handling**: Normalize party names from the DB before building the `party_scores` dict. The DB may return "GL" while the canonical sets expect "GroenLinks-PvdA".

## Open Questions

### Resolved During Planning

- **DB fixture vs real DB**: Use synthetic fixture data in a temporary DuckDB. This is the pattern used by `test_analysis.py` and gives full control over the test scenario.
- **Per-window iteration**: Data is `{party: [[x,y] per window]}` — iterate by window index, not by key lookup.
- **`political_axis.py` scope**: Not updated. It uses separate `right_parties`/`left_parties` for PCA centroid orientation, a distinct concern from this validation.

### Deferred to Implementation

- **Test DB schema exactness**: The `party_axis_scores` schema (column names, nullability) should be verified against the `explorer_data.py` query at implementation time.

## Implementation Units

- [ ] **Unit 1: Add canonical party sets to `config.py`**

**Goal:** Add `CANONICAL_RIGHT` and `CANONICAL_LEFT` frozensets as the single source of truth.

**Requirements:** R1

**Dependencies:** None

**Files:**
- Modify: `analysis/config.py`

**Approach:**
- Add `CANONICAL_RIGHT = frozenset({"PVV", "FVD", "JA21", "SGP"})` matching AGENTS.md exactly
- Add `CANONICAL_LEFT = frozenset({"SP", "PvdA", "GL", "GroenLinks", "GroenLinks-PvdA", "DENK", "PvdD", "Volt"})` matching svd_labels.py LEFT_PARTIES exactly
- Add both to `__all__`

**Patterns to follow:**
- `CURRENT_PARLIAMENT_PARTIES` frozenset pattern at `config.py` line 235

**Test scenarios:**
- Test expectation: none — this is a data definition change, not behavioral code

**Verification:**
- `CANONICAL_RIGHT` and `CANONICAL_LEFT` accessible via `from analysis.config import CANONICAL_RIGHT, CANONICAL_LEFT`

---

- [ ] **Unit 2: Update `svd_labels.py` to import from `config.py`**

**Goal:** `compute_flip_direction` uses the canonical sets from config instead of inline definitions.

**Requirements:** R1

**Dependencies:** Unit 1

**Files:**
- Modify: `analysis/svd_labels.py`

**Approach:**
- Replace the inline `RIGHT_PARTIES` and `LEFT_PARTIES` frozensets with:
```python
from analysis.config import CANONICAL_RIGHT, CANONICAL_LEFT

RIGHT_PARTIES = CANONICAL_RIGHT  # backward compat alias
LEFT_PARTIES = CANONICAL_LEFT  # backward compat alias
```
- This preserves any external callers that import `RIGHT_PARTIES`/`LEFT_PARTIES` from `svd_labels`

**Patterns to follow:**
- Alias pattern (re-export) rather than removing the old names — backward compat

**Test scenarios:**
- Happy path: `compute_flip_direction` produces the same results as before (baseline established by existing tests in `test_svd_labels.py`)
- Existing tests in `test_svd_labels.py` run and pass after the import swap

**Verification:**
- `pytest tests/test_svd_labels.py` passes

---

- [ ] **Unit 3: Extract `build_window_party_scores` helper in `explorer_data.py`**

**Goal:** Separate window-indexing logic from DB access so it can be unit tested without DuckDB.

**Requirements:** R2, R3

**Dependencies:** None

**Files:**
- Modify: `analysis/explorer_data.py` (add function)

**Approach:**
Add a helper:
```python
def build_window_party_scores(
    scores_by_party: Dict[str, List[List[float]]],
    window_idx: int
) -> Dict[str, List[float]]:
    """Extract scores for one window as {party: [x, y]} for compute_flip_direction."""
```

The function takes the output of `load_party_scores_all_windows_aligned` and extracts `scores_by_party[party][window_idx]` for all parties, returning `{party: [x, y]}`. Returns an empty dict if `window_idx` is out of range.

**Patterns to follow:**
- `load_party_scores_all_windows_aligned` pattern at `explorer_data.py` line 212

**Test scenarios:**
- Happy path: Given `{"PVV": [[0.5, 0.3], [0.6, 0.4]], "SP": [[-0.4, -0.2], [-0.5, -0.3]]}` and `window_idx=0`, returns `{"PVV": [0.5, 0.3], "SP": [-0.4, -0.2]}`
- Edge case: `window_idx=99` out of range → returns `{}`
- Edge case: Empty input dict → returns `{}`

**Verification:**
- Unit tests pass without DuckDB
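A minimal sketch matching the test scenarios above (one assumption: parties missing the requested window are skipped rather than raising, which yields `{}` when every party is out of range):

```python
from typing import Dict, List

def build_window_party_scores(
    scores_by_party: Dict[str, List[List[float]]],
    window_idx: int,
) -> Dict[str, List[float]]:
    """Extract scores for one window as {party: [x, y]} for compute_flip_direction."""
    if window_idx < 0:
        return {}
    return {
        party: windows[window_idx]
        for party, windows in scores_by_party.items()
        if window_idx < len(windows)  # skip parties without this window
    }
```

Keeping this as a pure dict-to-dict transformation is what lets Unit 4 exercise the orientation checks per window without touching DuckDB.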

---

- [ ] **Unit 4: Create `tests/test_axis_political_orientation.py`**

**Goal:** Integration test validating political compass orientation against DuckDB data.

**Requirements:** R2, R3, R4, R5

**Dependencies:** Units 1, 2, 3

**Files:**
- Create: `tests/test_axis_political_orientation.py`

**Approach:**
Two-layer test structure:

1. **Synthetic fixture layer** (DuckDB integration test):
   - Create a temporary DB with a `party_axis_scores` table
   - Insert controlled rows: correct orientation (right_mean > left_mean) and incorrect orientation (right_mean < left_mean)
   - Call `load_party_scores_all_windows_aligned` and `build_window_party_scores`
   - Assert orientation checks pass/fail correctly

2. **Validation assertions** (layered on the helper from Unit 3):
   - For each window (iterate `scores_by_party[party]` length):
     - Build the per-window dict via `build_window_party_scores`
     - Call `compute_flip_direction(1, party_scores)` → assert `False` (no flip needed)
     - Call `compute_flip_direction(2, party_scores)` → assert `False`
   - On failure: assert the message includes window, axis, right_mean, left_mean

Use the `tmp_duckdb_conn` fixture. Create the schema and insert rows in test setup.

**Patterns to follow:**
- `test_analysis.py` fixture setup pattern (lines 13-60) for synthetic SVD vector setup
- `test_svd_labels.py` assertion style for `compute_flip_direction` validation

**Test scenarios:**
- Happy path (correct orientation): Right mean > left mean on both axes → both `compute_flip_direction` calls return `False`
- Error path (incorrect orientation): Right mean < left mean → at least one call returns `True`, test fails with a clear message
- Edge case: Party not in canonical sets → gracefully skipped (no crash)
- Edge case: Empty party list → returns `False` (no flip)
- Edge case: Aliased party name ("GL" vs "GroenLinks-PvdA") → normalized before the check

**Verification:**
- `pytest tests/test_axis_political_orientation.py` runs and passes
- `pytest tests/test_svd_labels.py` still passes (backward compat check)

## System-Wide Impact

- **Error propagation**: No error paths in this feature — orientation violations produce assertion failures, not exceptions
- **Unchanged invariants**: `compute_flip_direction` output unchanged for existing callers (alias re-export)
- **API surface parity**: No new public APIs; `CANONICAL_RIGHT`/`CANONICAL_LEFT` are read-only constants

## Risks & Dependencies

| Risk | Mitigation |
|------|------------|
| DuckDB fixture schema mismatch | Verify `party_axis_scores` column names against the `explorer_data.py` query at implementation time |
| Window index boundary errors | `build_window_party_scores` returns `{}` for out-of-range indices — graceful degradation |
| `_PARTY_NORMALIZE` aliases incomplete | Add aliases as needed during implementation — test with edge cases |

## Sources & References

- **Origin document:** [docs/brainstorms/2026-04-05-right-wing-party-axis-validation-requirements.md](docs/brainstorms/2026-04-05-right-wing-party-axis-validation-requirements.md)
- **AGENTS.md convention:** `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md`
- Related code: `analysis/svd_labels.py`, `analysis/config.py`, `analysis/explorer_data.py`
- Related tests: `tests/test_svd_labels.py`, `tests/test_analysis.py`
@ -0,0 +1 @@ |
||||
# empty - placeholder so the directory exists in git |
||||
@ -0,0 +1,68 @@ |
||||
# Semantic Content Shift: Axis 1 Over Time |
||||
|
||||
## What Changed: "Coalition vs Opposition" Axis Content |
||||
|
||||
| Year | Positive Pole (Coalition) | Negative Pole (Opposition) | Key Theme | |
||||
|------|-------------------------|---------------------------|-----------| |
||||
| **2016** | Tax law changes, international treaties | — | Administrative law | |
||||
| **2018** | Budget modifications, infrastructure, social affairs | — | Government spending | |
||||
| **2019** | Working conditions, monitoring issues | — | Administrative oversight | |
||||
| **2022** | Local government info, digital accounts | Digital governance, privacy | Digital transformation | |
||||
| **2023** | Welfare policy, parental support | Social services | Social policy | |
||||
| **2024** | Nuclear weapons, housing, Israel boycott | — | Foreign policy / Justice | |
||||
| **2025** | EU sanctions on Israel, asylum policies | — | Migration / Foreign affairs | |
||||
| **2026** | **Asylum stops**, Syrian permit revocations, Ukraine returns | IND backlog | **Migration dominates** | |
||||
|
||||
## Key Observations |
||||
|
||||
### 1. The "Coalition" Side Evolved Significantly |
||||
|
||||
| Period | Coalition Motions Focused On | |
||||
|--------|---------------------------| |
||||
| 2016-2019 | Administrative law, tax, budgets, infrastructure | |
||||
| 2022-2023 | Digital governance, welfare, social services | |
||||
| 2024-2025 | Foreign policy (Israel sanctions), migration | |
||||
| **2026** | **Asylum restriction**, Syria, Ukraine returns | |
||||
|
||||
### 2. Axis 1 Became Migration-Centric by 2026 |
||||
|
||||
In 2026, the **extreme positive motions** are ALL about asylum/migration: |
||||
- "Motie van het lid Vondeling over een totale asielstop" (total asylum stop) |
||||
- "Motie van het lid Vondeling over alle tijdelijke asielvergunningen van Syriërs intrekken" (revoke Syrian permits) |
||||
- "Motie van het lid Vondeling over een actief terugkeerbeleid voor alle Oekraïners" (active return policy for Ukrainians) |
||||
|
||||
This suggests the coalition/opposition dynamic in 2026 is increasingly defined by **migration policy** rather than the traditional left-right economic divide. |
||||
|
||||
### 3. The "Typical" Motion Changed |
||||
|
||||
Semantic gravity represents the "typical" motion on the axis. Its content shifted: |
||||
|
||||
| Year | Typical Motion Theme | |
||||
|------|---------------------| |
||||
| 2016 | Tax law, health law, financial administration | |
||||
| 2019 | Bureaucracy reduction, Kamer control, administrative burden | |
||||
| 2023 | Student finance, volunteer work, housing | |
||||
| 2024 | Fossil fuel phase-out, whistleblower protection, youth care | |
||||
| 2026 | Asylum, IND backlog, Ukraine, social grievances | |

## Implications

1. **Axis label is temporally bounded**: "Rechts kabinetsbeleid versus links oppositiebeleid" works for 2016-2026 as a whole, but in 2026 it's increasingly about migration policy.

2. **Party voting structure is stable** (0.83 stability), but **what parties vote on** has shifted from economics to migration.

3. **Axis 6 (Migration/Culture)**: its low stability (0.35) suggests it may now overlap with Axis 1 — migration has become a coalition-defining issue.

## Example: Concrete Before/After

**2016 - "Coalition" side:**

> "Wijziging van enkele belastingwetten en enige andere wetten (Fiscale vereenvoudigingswet 2017)"

**2026 - "Coalition" side:**

> "Motie van het lid Vondeling over een totale asielstop"

The same axis (the coalition votes FOR in both cases), but the topics are semantically completely different.

---

*Generated by `scripts/semantic_gravity_examples.py`*

---
title: "SVD Axis Overtone Shift Analysis: Deep Dive"
date: 2026-04-05
module: analysis
problem_type: research
component: motion-analysis
tags: [svd, overtone-shift, semantic-drift, time-series, parliamentary-analysis]
---

# SVD Axis Overtone Shift: Deep Dive Analysis

## Executive Summary

This analysis explores the relationship between **axis stability** (structural consistency of SVD components over time) and **overtone shift** (semantic drift of motion content within those stable axes). The key finding is that these are **independent phenomena**: axes can be structurally stable (same parties voting similarly) while their semantic content drifts dramatically.

## Key Finding: Stability vs. Semantic Content are Independent

| Phenomenon | What it Measures | Typical Value | Interpretation |
|------------|-----------------|---------------|----------------|
| **Axis Stability** | Consistency of which motions load on an axis | 0.70-0.83 | Structural alignment of semantic signatures |
| **Overtone Shift** | How motion content evolves over time | 1.30-1.97 | Semantic drift within stable structure |

### Why This Matters

A stable axis (e.g., "Rechts kabinetsbeleid versus links oppositiebeleid") means:

- The same coalition/opposition voting pattern persists across years
- Parties maintain consistent relative positions

But high overtone shift means:

- The specific topics that define "coalition" vs "opposition" change substantially
- Motions discussed in 2026 are semantically different from 2016 even though they occupy the same axis position

## Detailed Findings

### Axis Stability Results (Lasso Regression, alpha=0.1)

| Axis | Avg Stability | Classification | Interpretation |
|------|---------------|---------------|----------------|
| 1 | 0.83 | Stable | Coalition vs opposition voting pattern is consistent |
| 2 | 0.75 | Stable | PVV/FVD populist positioning vs mainstream |
| 3 | 0.78 | Stable | Welfare state vs market liberalisation |
| 4 | 0.72 | Stable | NSC/BBB vs D66/CDA/JA21 |
| 5 | 0.70 | Stable | Christian-social vs progressive-individual |
| 6 | 0.35 | **Reordered** | Migration/culture axis most volatile |
| 7 | 0.77 | Stable | Administrative pragmatism |
| 8 | 0.79 | Stable | Healthcare/education/regional housing |
| 9 | 0.76 | Stable | System reform vs practical governance |
| 10 | 0.74 | Stable | Regulation vs deregulation |

### Overtone Shift Results

| Axis | Avg Shift | Max Shift | Inflection Points |
|------|-----------|-----------|-------------------|
| 1 | 1.47 | 1.97 | 0 |
| 2 | 1.42 | 1.79 | 0 |
| 3 | 1.38 | 1.83 | 0 |
| 4 | 1.39 | 1.89 | 0 |
| 5 | 1.43 | 1.93 | 0 |
| 7 | 1.31 | 1.84 | 0 |
| 8 | 1.30 | 1.89 | 0 |
| 9 | 1.38 | 1.93 | 0 |
| 10 | 1.30 | 1.72 | 0 |

**Critical observation**: ALL stable axes show high overtone shift (1.30-1.97), with no inflection points detected. This indicates **gradual, continuous semantic drift** rather than sudden shifts.

## Interpretation Framework

### The "Axis Stability" Metric

Axis stability uses **Lasso regression** to learn the semantic signature of each axis:

```
SVD_score ~ fused_embedding
```

The learned weight vector (2610 dimensions) represents which embedding dimensions are most predictive of an axis score. Stability is measured by comparing these weight vectors across windows using:

- **Cosine similarity** of full weight vectors
- **Jaccard similarity** of top-100 weighted dimensions

Why Lasso (alpha=0.1)? The L1 regularization produces sparse weight vectors, concentrating on the most important semantic dimensions. This makes cross-window comparison more robust than dense Ridge regression.
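The two comparison measures can be sketched in plain Python. This is an illustrative implementation of the stated method, not the pipeline's actual code; in practice `k=100` and the vectors are Lasso weight vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def top_k_jaccard(u, v, k=100):
    """Jaccard overlap of the k largest-|weight| dimensions of two weight vectors."""
    top_u = set(sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)[:k])
    top_v = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return len(top_u & top_v) / len(top_u | top_v)
```

Sparse Lasso weights make the top-k set well defined; with dense Ridge weights the top-k cutoff is far noisier.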

### The "Overtone Shift" Metric

Overtone shift computes **semantic gravity** — the weighted mean fused embedding of all motions on an axis:

```
gravity = weighted_mean(fused_embeddings, weights=abs(SVD_scores))
```

The cosine distance between gravity vectors of consecutive windows measures how the "center of mass" of motion content moves. High shift values (1.3-1.9) indicate that the motion topics defining each axis change substantially over time.
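The gravity computation and the shift between two consecutive windows can be sketched as follows; names are illustrative, and the real pipeline operates on fused embedding arrays rather than Python lists:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def semantic_gravity(embeddings, svd_scores):
    """Weighted mean embedding; motions with extreme axis scores pull hardest."""
    weights = [abs(s) for s in svd_scores]
    total = sum(weights) or 1.0  # guard against all-zero scores
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) / total for d in range(dim)]

def overtone_shift(gravity_prev, gravity_next):
    """Cosine distance between gravity vectors of consecutive windows (range 0..2)."""
    return 1.0 - cosine(gravity_prev, gravity_next)
```

Because cosine distance ranges from 0 to 2, shift values above 1.0 (as observed here) mean consecutive gravity vectors point in opposing directions.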

## Implications for Interpretation

### For Users of the Stemwijzer

1. **Axis labels are temporally bounded** — The label "Rechts kabinetsbeleid versus links oppositiebeleid" accurately describes the 2016-2026 period, but the specific motions that exemplify this axis have changed.

2. **Cross-temporal comparison is valid structurally but not semantically** — Party positions along Axis 1 are comparable across years (stable structure), but the meaning of extreme positions has shifted (high overtone).

3. **Axis 6 (Migration/Culture) is an exception** — Low stability (0.35) suggests this axis may have fundamentally changed meaning or composition over the period.

### For Analysts Studying Parliamentary Evolution

1. **Coalition/opposition as a dimension is remarkably stable** — Despite changes in coalition composition (Rutte III, Rutte IV, Schoof), the first axis consistently captures this dynamic.

2. **Policy content evolves within stable voting patterns** — What constitutes "coalition policy" in 2026 differs semantically from 2016, even if the voting alignment remains.

3. **The 2022-2023 period may be significant** — The gap in windows (2020-2021) coincides with COVID and government crises, potentially affecting overtone patterns.

## Methodological Notes

### Why Lasso (alpha=0.1)?

Three alternatives were evaluated and rejected:

| Approach | Problem |
|----------|---------|
| Jaccard similarity of top-N motion IDs | Motions are unique per window — 0% overlap |
| Cosine similarity of embedding centroids | Near-zero similarity due to varying embedding dimensions |
| Ridge regression weights | Dense weights less interpretable; Lasso concentrates signal |

Lasso (alpha=0.1) was chosen for:

- **Interpretability**: Sparse weights identify key semantic dimensions
- **Robustness**: Top-K dimension matching captures structural similarity
- **Stability**: Results are less sensitive to embedding dimension changes

### Dimension Alignment Challenge

Fused embeddings have varying dimensions across windows (typically 768-2610). All comparisons use **minimum common dimension** alignment to ensure valid cosine similarity computation.
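Minimum-common-dimension alignment amounts to truncating both vectors before comparison. A sketch of the assumed behaviour (not the pipeline's exact code), which presumes the leading dimensions of the fused embedding are comparable across windows:

```python
import math

def aligned_cosine(u, v):
    """Truncate both vectors to their shared leading dimensions, then compare."""
    d = min(len(u), len(v))
    u, v = u[:d], v[:d]
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0
```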

### Inflection Point Detection

Inflection points are defined as shift/drift rates exceeding 2× the median rate. The absence of detected inflection points suggests **gradual, continuous drift** rather than sudden semantic shifts — consistent with how policy debates evolve incrementally.
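Under that definition, detection reduces to a simple median threshold; a minimal sketch:

```python
import statistics

def inflection_points(shift_rates):
    """Indices of windows whose shift rate exceeds 2x the median rate."""
    threshold = 2 * statistics.median(shift_rates)
    return [i for i, rate in enumerate(shift_rates) if rate > threshold]
```

With the observed shifts clustered in a narrow 1.3-1.5 band, no rate clears twice the median, which is exactly why zero inflection points were reported.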

## Recommendations

### For Stemwijzer Maintenance

1. **Re-run overtone analysis after SVD recomputation** — Current themes may drift further from the underlying data
2. **Monitor Axis 6 specifically** — Low stability warrants closer attention during axis updates
3. **Consider temporal weighting in visualizations** — Recent windows may better represent current semantics

### For Future Research

1. **Correlate overtone shift with political events** — External factors (elections, crises) may explain inflection patterns
2. **Analyze dimension-level drift patterns** — Which specific embedding dimensions drive the shift?
3. **Extend to party-level analysis** — Do individual parties show consistent voting semantics over time?

## Related Files

- `scripts/motion_drift.py` — Analysis script
- `reports/drift/report.md` — Generated report
- `reports/drift/axis_stability.png` — Stability heatmaps
- `reports/drift/semantic_drift.png` — Drift timelines
- `reports/drift/party_trajectories.png` — Party position plots
---
title: Always Derive Blog Numbers from Pipeline Outputs, Not Memory
date: 2026-04-16
category: docs/solutions/best-practices
module: documentation
problem_type: best_practice
component: documentation
severity: medium
applies_when:
- Writing or updating a data-driven blog post
- Adding EVR percentages, vote counts, or any quantitative claims
- Referencing pipeline components (embeddings, fusion, similarity) in public-facing docs
tags: [blog, pipeline, evr, svd, canonical-outputs, data-driven-docs]
---

# Always Derive Blog Numbers from Pipeline Outputs, Not Memory

## Context

The political compass blog post was written with hardcoded numbers (EVR ~32%/~21%, 38 windows) that drifted from the pipeline's actual outputs as the data and methodology evolved. A maintenance session was required to bring every figure back in sync, generate supporting visuals, and strip references to pipeline components not yet deployed to production.

## Guidance

**Pull every quantitative claim directly from the canonical pipeline functions:**

| Claim | Canonical source |
|-------|-----------------|
| EVR percentages | `analysis.political_axis.compute_svd_spectrum(window_ids=[...])` |
| Vote/motion counts | `SELECT COUNT(*) FROM motions / mp_votes` via `data/motions.db` |
| Window count | `analysis.political_axis` — count of aligned windows |
| Party agreement | `analysis.explorer_data` or direct SQL on `mp_votes` |

**Never reference pipeline components that are not in production.** If `fused_embeddings` rows exist in the DB but the fusion pipeline is not yet in active use, do not describe it as part of the current workflow in blog copy.

**Generate supporting visuals programmatically** (matplotlib → `docs/research/`) and embed them by relative path in the blog HTML. This makes regeneration trivial when numbers change.

## Why This Matters

Hardcoded numbers in blog copy inevitably drift from reality as:

- More parliamentary windows are added (38 → 41 → …)
- SVD methodology changes (e.g., Procrustes alignment, window selection)
- Pipeline components are added or removed from production

When numbers drift, the post loses credibility and requires an expensive archaeology pass to fix. Generating them from the pipeline makes each update a single script run.

## When to Apply

- Before publishing or updating any post that cites quantitative pipeline outputs
- When the pipeline has changed (new windows, new methodology) and existing posts reference old numbers
- When removing or adding a pipeline stage — audit all docs for references to that stage

## Examples

**Before (hardcoded, stale):**

```html
<p>PC1 explains ~32% of the variance and PC2 explains ~21% — together ~52%.</p>
```

**After (derived from pipeline, accurate):**

```python
# scripts/generate_blog_assets.py
from analysis.political_axis import compute_svd_spectrum

evr = compute_svd_spectrum(window_ids=["current_parliament"])
# evr[0] = 0.290, evr[1] = 0.1146 → PC1~29%, PC2~11.5%, total~41%
```

```html
<p>PC1 explains ~29% of the variance and PC2 explains ~11.5% — together ~41%.</p>
```

**Multi-window EVR (Procrustes-aligned across all 41 windows):**

```python
evr_multi = compute_svd_spectrum()  # no window_ids → all windows
# evr_multi[0] = 0.1463, evr_multi[1] = 0.1310
```

**Party agreement for a specific window:**

```python
import duckdb

con = duckdb.connect("data/motions.db")
# Agreement between two parties in a quarter
sql = """
SELECT AVG(CASE WHEN a.vote = b.vote THEN 1.0 ELSE 0.0 END)
FROM mp_votes a JOIN mp_votes b USING (motion_id)
WHERE a.party = 'GroenLinks' AND b.party = 'PvdA'
  AND a.motion_id IN (SELECT id FROM motions WHERE window_id = '2023-Q3')
"""
agreement = con.execute(sql).fetchone()[0]
```

## Related

- `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md` — companion guidance on keeping SVD axis *labels* aligned with voting data rather than semantic assumptions
---
module: explorer
component: explorer.py
problem_type: developer_experience
category: best-practices
severity: medium
tags:
- streamlit
- refactoring
- horizontal-decomposition
- duckdb
- data-loading
- constants
title: "Refactoring Large Streamlit Apps: Extract Pure Data Logic"
date: "2026-04-04"
last_updated: "2026-04-05"
applies_when:
- Streamlit apps with large files (>1500 lines)
- Mixed UI and business logic
- Heavy data loading functions decorated with @st.cache_data
- DuckDB connections scattered throughout
---

## Context

The `explorer.py` Streamlit app (originally 3715 lines) had 39 functions mixing:

- Streamlit UI rendering calls
- DuckDB database queries
- Business logic computations
- Pure data transformations

This pattern makes the code:

- Hard to unit test (requires Streamlit runtime)
- Difficult to understand (UI interwoven with logic)
- Challenging to maintain (changes ripple unpredictably)

## Guidance

### Pattern: Horizontal Decomposition

Extract pure functions by type, not just by moving code. The key insight is **separating concerns by computation type**, not just file size:

**1. Data Loading Functions → `analysis/explorer_data.py`**

```python
# Before: Mixed in explorer.py with @st.cache_data
@st.cache_data(show_spinner="Partijkaart laden…")
def load_party_map(db_path: str) -> Dict[str, str]:
    con = duckdb.connect(database=db_path, read_only=True)
    try:
        rows = con.execute("SELECT mp_name, party FROM mp_metadata...").fetchall()
        return {mp: _PARTY_NORMALIZE.get(party, party) for mp, party in rows if mp and party}
    finally:
        con.close()

# After: Pure data access in analysis/explorer_data.py
def load_party_map(db_path: str) -> Dict[str, str]:
    """Return {mp_name: party} mapping, with party names normalised to abbreviations."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute("SELECT mp_name, party FROM mp_metadata WHERE party IS NOT NULL").fetchall()
        con.close()
        return {mp: _PARTY_NORMALIZE.get(party, party) for mp, party in rows if mp and party}
    except Exception:
        logger.exception("Failed to load party map")
        return {}
```

**2. Pure Computation → `analysis/projections.py`**

```python
# Before: Mixed with UI
def _should_swap_axes(axis_def: dict) -> bool:
    economic_labels = {"Verzorgingsstaat–Marktwerking", "Links–Rechts"}
    return axis_def.get("y_label") in economic_labels and axis_def.get("x_label") not in economic_labels

# After: Pure function, no external dependencies
def should_swap_axes(axis_def: dict) -> bool:
    """Return True if the Y axis is economic left-right and the X axis is not."""
    economic_labels = {"Verzorgingsstaat–Marktwerking", "Links–Rechts"}
    return axis_def.get("y_label") in economic_labels and axis_def.get("x_label") not in economic_labels
```

**3. Constants → `analysis/config.py`**

```python
# Before: Defined inline in the data loading module
_PARTY_NORMALIZE: Dict[str, str] = {
    "Nieuw Sociaal Contract": "NSC",
    "ChristenUnie": "CU",
}

# After: Import from config, define once
from analysis.config import _PARTY_NORMALIZE

def load_party_map(db_path: str) -> Dict[str, str]:
    """Return {mp_name: party} mapping, with party names normalised to abbreviations."""
    try:
        con = duckdb.connect(database=db_path, read_only=True)
        rows = con.execute("SELECT mp_name, party FROM mp_metadata WHERE party IS NOT NULL").fetchall()
        con.close()
        return {mp: _PARTY_NORMALIZE.get(party, party) for mp, party in rows if mp and party}
    except Exception:
        logger.exception("Failed to load party map")
        return {}
```

**Why this matters:** DRY enforcement. If `_PARTY_NORMALIZE` or `CURRENT_PARLIAMENT_PARTIES` are defined in multiple modules, changes must be made in all places. Importing from a single source of truth (`config.py`) prevents drift.

**Anti-pattern to avoid:** Creating new constant definitions in extracted modules without checking if they already exist in `config.py`. Before adding a constant to any `analysis/*.py` module, grep for existing definitions of that constant name across all `analysis/` modules.

When tests import from the original module, preserve wrappers:

```python
# In explorer.py - preserves test compatibility
def _should_swap_axes(axis_def: dict) -> bool:
    """Return True if the Y axis is economic left-right and the X axis is not."""
    return projections.should_swap_axes(axis_def)
```

### Import Direction Rule

```
explorer.py → analysis/*.py (one-way)
analysis/*.py ↛ explorer.py (never)
```

This prevents circular imports and clarifies module boundaries.

### Preserve Function Signatures

Refactoring shouldn't change public APIs:

```python
# Keep these unchanged:
@st.cache_data(show_spinner="...")
def load_party_map(db_path: str) -> Dict[str, str]:
    return explorer_data.load_party_map(db_path)
```

## Why This Matters

1. **Testability**: Pure functions in `analysis/` can be unit tested without a Streamlit runtime
2. **Reusability**: Data loading functions can be imported by other modules
3. **Maintainability**: Changes to business logic are isolated from UI changes
4. **Readability**: Single responsibility - each function does one thing

## Prevention

1. **Audit existing constants before creating new ones**: Before adding a constant to any `analysis/*.py` module, grep across `analysis/` for existing definitions. DRY violations cause maintenance burden and drift.
2. **Define `__all__` in every analysis module**: Without `__all__`, it's hard to know what's public API. Add it when creating the module, not as an afterthought. All 6 public functions in `analysis/trajectory.py` (compute_trajectories, compute_2d_trajectories, top_drifters, compute_party_discipline, window_to_dates, choose_trajectory_title) should be exported.
3. **Test assertions may need updates**: When extracting constants, grep for all test references to those constants and update assertions. See `docs/solutions/test-failures/svd-label-tests-after-refactoring.md` for an example.
4. **Verify tests pass after extraction**: Run the full test suite to catch any broken import chains or assertion mismatches.

## Results

| Metric | Before | After |
|--------|--------|-------|
| explorer.py lines | 3715 | 3069 |
| analysis/ modules | 0 | 12 modules |
| Functions extracted | 0 | 15+ |
| Test pass rate | 164/164 | 173/173 |

**Practical minimum reached at ~3000 lines.** Tab functions (`build_*_tab()`) and their `_render_*` helpers are inherently Streamlit-coupled — they mix UI rendering with business logic. Further reduction requires decoupling data preparation from rendering (prepare data in `analysis/`, render in `explorer.py`), which is a significant architectural refactor.

## When to Apply

Apply this pattern when:

- A Streamlit file exceeds 1500 lines
- Functions mix `@st.cache_data` with DuckDB queries and business logic
- Tests require a Streamlit runtime to run
- Multiple tabs share similar data loading patterns

## Examples

See `analysis/config.py` for constants (PARTY_COLOURS, SVD_THEMES, KNOWN_MAJOR_PARTIES, CURRENT_PARLIAMENT_PARTIES, CANONICAL_RIGHT, CANONICAL_LEFT, _PARTY_NORMALIZE), `analysis/explorer_data.py` for extracted data loading functions, and `analysis/projections.py` for pure computation utilities.
---
title: "SVD Labels Should Reflect Voting Patterns, Not Semantic Content"
date: "2026-04-04"
category: docs/solutions/best-practices
module: stemwijzer
problem_type: best_practice
component: brief_system
severity: medium
applies_when:
- Labeling SVD (Singular Value Decomposition) components in voting analysis
- Interpreting PCA/SVD dimensions in political party voting data
- Creating axis labels for voting compass or similar applications
tags:
- svd
- voting-analysis
- axis-labels
- dimensionality-reduction
- party-voting-patterns
---

# SVD Labels Should Reflect Voting Patterns, Not Semantic Content

## Context

When labeling SVD components in the Stemwijzer explorer (`explorer.py`), initial labels were based on **semantic analysis** of motion titles — what topics motions appeared to discuss. However, SVD captures **voting patterns**, not semantic content.

This mismatch led to:

- Labels that didn't match how parties actually voted
- Right-wing parties appearing on the LEFT side of axes (violating the right-wing parties → right side constraint)
- Confusion about what each component actually represents

## Guidance

### The Core Principle

**SVD components represent voting unity patterns, not topic clusters.**

When a motion appears on a component with a positive loading, it means parties that vote positively on that motion tend to vote similarly. The component captures this behavioral pattern, not the topic's semantic meaning.

### Example: Component 1

| Approach | Label | Rationale |
|----------|-------|-----------|
| Semantic | "Sociale zekerheid vs economische liberalisering" | Wrong: assumes defense + social care = welfare state |
| Voting | "Rechts kabinetsbeleid vs links oppositiebeleid" | Right: matches actual coalition vs opposition voting |

**Why Component 1 captures coalition-opposition:**

- 9 coalition + center parties vote one way
- 6 opposition parties vote the other way
- Motion topics can include defense (right votes for) AND social care (left votes for) because they're on opposite sides of the coalition-opposition divide

### How to Label SVD Components

1. **Analyze actual voting patterns first**
   - Query which parties vote positively vs negatively on each component
   - Look for coalition/opposition splits, cross-block alliances, or isolated parties

2. **Verify right-wing parties land on the RIGHT side**
   - Check PVV, FVD, JA21, SGP positions
   - If they vote negatively while left parties vote positively, flip the axis

3. **Don't assume semantics match voting**
   - A "defense" component may include social care motions if right-wing parties vote the same way on both
   - Cross-block alliances (e.g., PVV with SP on welfare) create components that don't fit left-right semantics

4. **Test with sample motions**
   - Top positive-loading motions should align with positive-voting parties' priorities
   - Top negative-loading motions should align with negative-voting parties' priorities
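Step 2 (flipping) can be automated against the canonical right-wing party set. A sketch, assuming per-party loadings are available as a dict; `CANONICAL_RIGHT` here is a stand-in for the constant of the same name in `analysis/config.py`:

```python
CANONICAL_RIGHT = {"PVV", "FVD", "JA21", "SGP"}

def orient_axis(party_loadings):
    """Flip the component's sign if right-wing parties load negatively on average."""
    right = [v for party, v in party_loadings.items() if party in CANONICAL_RIGHT]
    if right and sum(right) / len(right) < 0:
        return {party: -v for party, v in party_loadings.items()}
    return party_loadings
```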

## Why This Matters

Without understanding that SVD captures voting patterns:

- Labels will be misleading to users
- Right-wing parties may appear on the wrong side of axes
- Components may be mislabeled as "left" when they're actually "opposition"
- Users get incorrect information about party positions

## When to Apply

Apply this guidance when:

- Creating or updating SVD/PCA component labels
- Interpreting dimensionality reduction results in voting analysis
- Building voting compasses or similar political guidance tools
- Analyzing roll call votes or legislative voting data

## Examples

### Wrong Approach (Semantic)

```python
# ❌ BAD: Based on motion topics, not voting patterns
SVD_THEMES = {
    1: {
        "label": "Sociale zekerheid vs economische liberalisering",
        # Reality: 9 coalition parties vote same way, 6 opposition vote opposite
    }
}
```

### Correct Approach (Voting Patterns)

```python
# ✅ GOOD: Based on actual voting behavior
SVD_THEMES = {
    1: {
        "label": "Rechts kabinetsbeleid vs links oppositiebeleid",
        "explanation": (
            "Deze as scheidt het kabinetsbeleid van de oppositie. "
            "9 coalitiepartijen stemmen aan de positieve kant, "
            "6 oppositiepartijen aan de negatieve kant."
        ),
    }
}
```

### Verification Approach

To verify SVD component labels, check which parties have positive vs negative loadings on each component:

```python
# From explorer.py - check party loadings for each component
for comp_num in range(1, 11):
    component_parties = svd_scores[svd_scores['component'] == comp_num]
    positive_parties = component_parties[component_parties['loading'] > 0]['party'].tolist()
    negative_parties = component_parties[component_parties['loading'] < 0]['party'].tolist()

    print(f"Component {comp_num}:")
    print(f"  Positive ({len(positive_parties)}): {positive_parties}")
    print(f"  Negative ({len(negative_parties)}): {negative_parties}")
```

Use this to verify:

- Right-wing parties (PVV, FVD, JA21, SGP) appear on the correct side
- The label matches the voting pattern, not just the topic

## Related

- [SVD Label Unification Plan](docs/superpowers/plans/2026-04-02-svd-label-unification.md)
- [SVD Label Unification Design](docs/superpowers/specs/2026-04-02-svd-label-unification-design.md)
- Commits: `33edb33`, `e77f0ec`, `bfe37c6`, `f7fc908`, `92c3c0e`
--- |
||||
title: "SVD Axis Stability and Overtone Shift are Independent Phenomena" |
||||
date: 2026-04-05 |
||||
module: analysis |
||||
problem_type: insight |
||||
component: motion-analysis |
||||
severity: medium |
||||
tags: [svd, overtone-shift, semantic-drift, axis-stability, parliamentary-analysis] |
||||
applies_when: |
||||
- Interpreting SVD axes over multiple time windows |
||||
- Comparing motion content across different parliamentary periods |
||||
- Understanding why stable axis labels don't guarantee stable motion content |
||||
--- |
||||
|
||||
# SVD Axis Stability and Overtone Shift are Independent Phenomena |
||||
|
||||
## Key Insight |
||||
|
||||
When analyzing SVD axes across time windows, **axis stability** and **overtone shift** measure fundamentally different phenomena: |
||||
|
||||
| Phenomenon | What it Measures | How to Compute | |
||||
|------------|-----------------|----------------| |
||||
| **Axis Stability** | Whether the same motions/embeddings load on an axis | Lasso regression: `SVD_score ~ fused_embedding`, compare weight vectors via cosine similarity + Jaccard | |
||||
| **Overtone Shift** | How motion content evolves over time | Semantic gravity (weighted mean embedding) tracking via cosine distance | |
||||
|
||||
**The implication**: An axis can be "stable" (parties vote similarly across years) while its semantic content drifts dramatically (different motions define the axis). |
||||
|
||||
## Evidence |
||||
|
||||
Analysis of 9 annual windows (2016-2026) revealed: |
||||
|
||||
- **9 of 10 axes are stable** (similarity > 0.7) |
||||
- **All stable axes show high overtone shift** (1.3-1.97 cosine distance) |
||||
- **No inflection points detected** — drift is gradual, not sudden |
||||
|
||||
### Example: Axis 1 (Coalition vs Opposition) |
||||
|
||||
| Metric | Value | Interpretation | |
||||
|--------|-------|----------------| |
||||
| Axis Stability | 0.83 | Coalition/opposition voting pattern is structurally consistent | |
||||
| Overtone Shift | 1.47 avg, 1.97 max | Motion content defining "coalition" vs "opposition" has changed substantially | |
||||
|
||||
This means: PVV, VVD, NSC, BBB consistently vote together against SP, GL-PvdA, PvdD across all windows — but the specific motions that exemplify "coalition policy" in 2026 are semantically different from 2016. |
||||
|
||||
## Why This Matters |
||||
|
||||
1. **Axis labels are temporally bounded** — "Rechts kabinetsbeleid versus links oppositiebeleid" accurately describes 2016-2026, but the underlying motions have evolved. |
||||
|
||||
2. **Cross-temporal comparison is valid structurally but not semantically** — Party positions are comparable; motion content is not. |
||||
|
||||
3. **Axis 6 (Migration/Culture)** is an exception — Low stability (0.35) suggests fundamental change in how this dimension is structured. |
||||
|
||||
## How to Analyze This |
||||
|
||||
Use `scripts/motion_drift.py`: |
||||
|
||||
```bash |
||||
uv run python scripts/motion_drift.py --db data/motions.db --output reports/drift |
||||
``` |
||||
|
||||
The script computes: |
||||
- **Axis stability**: Lasso regression weights compared across windows |
||||
- **Overtone shift**: Semantic gravity tracking |
||||
- **Inflection points**: Sudden drift detection |
||||
- **Party trajectories**: How parties move along stable axes |
## Prevention

When updating SVD themes:

1. Run `scripts/motion_drift.py` to check current overtone shift levels
2. Verify that theme descriptions match current motion content, not historical content
3. Monitor Axis 6 specifically for stability issues
4. Consider temporal weighting in visualizations — recent windows better represent current semantics

## Related

- `scripts/motion_drift.py` — Analysis script
- `docs/research/2026-04-05-svd-overtone-shift-deep-dive.md` — Deep analysis
- `reports/drift/report.md` — Generated report
---
title: SVD component labels incorrect due to semantic vs voting pattern mismatch
date: 2026-04-04
category: docs/solutions/logic-errors/
module: Stemwijzer Data Analysis
problem_type: logic_error
component: explorer
symptoms:
  - Component 1 label "Sociale zekerheid vs economische liberalisering" did not match voting patterns
  - Report analysis showed different party alignment than the label suggested
  - SVD components captured voting patterns but labels described semantic content
root_cause: logic_error
resolution_type: code_fix
severity: high
tags: [svd, voting-analysis, component-labels, logic-error]
---

# SVD component labels incorrect due to semantic vs voting pattern mismatch

## Problem

The SVD (Singular Value Decomposition) component labels in `explorer.py` were based on semantic analysis of motion titles, but the SVD actually captures HOW parties vote, not WHAT topics are discussed. This resulted in misleading component labels that did not match actual voting patterns.
## Symptoms

- Component 1 was labeled "Sociale zekerheid vs economische liberalisering" ("social security vs economic liberalization") but actually captured coalition vs opposition voting
- The analysis report showed different party groupings than the labels suggested
- Report generation used an incorrect slice (`scored[:30]`) instead of positive/negative party separation

## What Didn't Work

- Semantic analysis of motion titles to determine component labels
- Assuming that the topics discussed in motions matched how parties voted on them
- Report generation logic that was inconsistent with the JSON output logic

## Solution

### 1. Report Generation Bug Fix (commit bfe37c6)

Fixed the report generation to use positive/negative party lists correctly instead of `scored[:30]`:

```python
# Before (incorrect): took the first 30 entries regardless of sign
scored[:30]

# After (correct): split parties by the sign of their component score
positive_parties = [p for p, s in scored if s > 0]
negative_parties = [p for p, s in scored if s < 0]
```
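For context, assuming `scored` is a list of `(party, score)` pairs for one component (the party names and scores below are illustrative), the separation behaves like this:

```python
def pole_parties(scored):
    """Split (party, score) pairs into the positive and negative poles of a component."""
    positive = [p for p, s in scored if s > 0]
    negative = [p for p, s in scored if s < 0]
    return positive, negative

# Illustrative scores, not real component data
scored = [("PVV", 0.81), ("VVD", 0.64), ("BBB", 0.22), ("GL-PvdA", -0.58), ("SP", -0.73)]
pos, neg = pole_parties(scored)
print(pos)  # → ['PVV', 'VVD', 'BBB']
print(neg)  # → ['GL-PvdA', 'SP']
```

Unlike `scored[:30]`, this keeps each pole internally consistent: a report section about the positive pole can never accidentally list a negatively scoring party.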
### 2. Component 1 Label Fix (commit f7fc908)

Changed from a semantic-based to a voting-pattern-based label:

```python
# Before (incorrect)
"label": "Sociale zekerheid vs economische liberalisering"

# After (correct)
"label": "Rechts kabinetsbeleid vs links oppositiebeleid"
```

Root cause: Component 1 captures 9 coalition parties voting together vs 6 opposition parties voting together.

### 3. Components 2, 4, 5, 6 Label Updates (commit 92c3c0e)

- **Component 2**: "PVV/FVD-populisme versus mainstream-partijen" — Only PVV and FVD vote positively
- **Component 4**: "Mainstreampartijen versus FVD/DENK-oppositie" — Only FVD and DENK vote negatively
- **Component 5**: "Christelijk-sociaal en gemeenschapswaarden versus progressieve individuele rechten"
- **Component 6**: "Migratie en cultuur versus klimaat en progressieve inclusie"

### 4. Exclusive Motion Assignment (commit 33edb33)

Each motion now appears on only one component, the one with the highest absolute loading:

```python
# Each motion assigned to the component with the highest absolute loading
# Backward compatible via the --no-exclusive flag
```
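A minimal sketch of the exclusive assignment, assuming per-motion loading lists (motion IDs and values are hypothetical; the real logic lives in the JSON generator):

```python
def assign_exclusive(loadings):
    """Assign each motion to the single component with the highest absolute loading.

    loadings: {motion_id: [loading_c1, loading_c2, ...]}
    Returns {motion_id: component_index} (1-based, matching component numbering).
    """
    assignment = {}
    for motion_id, values in loadings.items():
        best = max(range(len(values)), key=lambda i: abs(values[i]))
        assignment[motion_id] = best + 1
    return assignment

# Hypothetical loadings for two motions across three components
loadings = {"M-101": [0.12, -0.87, 0.30], "M-102": [0.55, 0.10, -0.40]}
print(assign_exclusive(loadings))  # → {'M-101': 2, 'M-102': 1}
```

Note that the sign of the loading is ignored for assignment; a strong negative loading still means the motion is most characteristic of that component.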
## Why This Works

**Critical insight**: SVD captures voting patterns, not semantic content. When labeling SVD components:

- Look at which parties vote positively vs negatively
- Don't assume semantics match voting patterns
- Coalition vs opposition is a strong voting dimension in parliamentary data
- Components may include motions from seemingly unrelated topics if parties vote the same way

The fix works because it aligns labels with actual voting data:

- Labels now describe the voting behavior of parties
- Positive/negative poles show which parties vote which way
- Explanations reference specific motions that illustrate the pattern

## Prevention

1. **Always verify SVD labels against voting data** — Before finalizing labels, check which parties score positively and negatively on each component
2. **Test label-party alignment** — Add a test that verifies component labels match the party groupings in the data
3. **Document the semantic vs voting distinction** — Make this a known gotcha in the codebase for future developers

## Related Issues

- Analysis: `thoughts/explorer/top_svd_top_motions_report.md`
- JSON generator: `scripts/generate_svd_json.py`
- Labels source: `analysis/config.py:67+` (SVD_THEMES dictionary)
---
title: "SVD theme divergence from actual party positions"
module: analysis
date: 2026-04-05
problem_type: logic_error
component: analysis
severity: medium
tags: [svd, themes, party-positions, validation, data-drift]
---

# SVD Theme Divergence from Actual Party Positions

## Problem

SVD axis themes in `analysis/config.py` can drift from actual party positions in `svd_vectors`. Themes are derived from subagent summaries of top motions, but party positions reflect voting on ALL motions. When the SVD is recomputed or voting patterns shift, themes may no longer match the data.

## Symptoms

- The axis 4 theme said "Mainstreampartijen versus FVD/DENK-oppositie" but actual party positions showed NSC (-24.47) and BBB (-4.58) on the left extreme, D66 (10.53), CDA (10.11), and JA21 (9.90) on the right extreme, and FVD/DENK in the middle
- Pole labels (`left_pole`/`right_pole`) described parties that weren't actually on those sides after the flip
- The flip mechanism (`compute_flip_direction`) worked correctly, but the theme text was stale

## Root Cause

Themes were written manually by subagents summarizing the top 20 motions per component. This captures which motions drive each axis, but party positions come from how parties voted on ALL 8,732 motions. The divergence occurs because:

1. Motion sponsors ≠ voting patterns (a motion sponsored by FVD/DENK may be voted on differently by all parties)
2. The "long tail" of motions also loads on each component and can shift party positions
3. No automated validation existed to detect when themes drift from actual data
## Solution

### 1. Fixed the axis 4 theme to match actual data

Updated `analysis/config.py` component 4:

```python
# Before (wrong):
"label": "Mainstreampartijen versus FVD/DENK-oppositie",
"left_pole": "FVD en DENK: oppositieposities buiten de mainstream",
"right_pole": "Mainstreampartijen: D66, CDA, VVD, PVV, GL-PvdA, SP, Volt, 50PLUS",

# After (matches actual positions):
"label": "NSC/BBB versus D66/CDA/JA21 (indicatief)",
"left_pole": "NSC, BBB — moties met andere focus",
"right_pole": "D66, CDA, JA21 — moties met brede steun",
```

### 2. Added semantic left_pole/right_pole labels

Added `left_pole` and `right_pole` fields to all 10 SVD_THEMES entries. These describe what is on the left and right sides AFTER the flip, decoupling label text from raw SVD math. Updated 4 rendering locations in `explorer.py` to use these semantic labels, with a backward-compatible fallback.

### 3. Created a validation hook

Created `scripts/validate_svd_themes.py`, which validates that:

- Canonical right-wing parties (PVV, FVD, JA21, SGP) appear on the right side after the flip
- Theme pole labels match actual party positions
- The flip computation uses full vectors (not single-component scores)

Usage:

```bash
uv run python scripts/validate_svd_themes.py --db data/motions.db
```

Returns exit code 1 if any divergence is found — suitable for CI integration.
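The core of the canonical-party check can be sketched as follows (a simplified illustration; the real script works with full average vectors and `compute_flip_direction`):

```python
CANONICAL_RIGHT = {"PVV", "FVD", "JA21", "SGP"}

def validate_axis(party_scores, flip):
    """Return divergence messages for one axis.

    party_scores: {party: score on this axis} (illustrative structure)
    flip: whether the flip computation said to negate the axis
    """
    errors = []
    for party in CANONICAL_RIGHT & set(party_scores):
        score = -party_scores[party] if flip else party_scores[party]
        if score < 0:  # party landed on the left side after the flip
            errors.append(f"{party} is on the left side (score {score:.2f})")
    return errors

# Hypothetical scores: raw axis puts PVV/JA21 negative, so flip=True fixes orientation
scores = {"PVV": -1.2, "SP": 1.5, "JA21": -0.8}
print(validate_axis(scores, flip=True))  # → []
```

A wrapper would collect errors across all axes and call `sys.exit(1)` if any remain, which is what makes the script usable as a CI gate.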
## Why This Works

The flip mechanism (`compute_flip_direction`) correctly positions canonical right-wing parties on the right side by comparing mean scores. The validation hook uses the same function with full average vectors to verify post-flip positions. Theme pole labels are now pre-computed semantic descriptions that match the flipped orientation, not the raw SVD positive/negative poles.

## Prevention

- Run `scripts/validate_svd_themes.py` after any SVD recomputation
- Add to the CI pipeline: `uv run python scripts/validate_svd_themes.py --db data/motions.db`
- When updating themes, verify against actual party positions from `svd_vectors`, not just motion sponsors
- Consider automating theme generation from party positions plus motion analysis

## Related Files

- `analysis/config.py` — SVD_THEMES with left_pole/right_pole fields
- `explorer.py` — rendering functions using semantic pole labels
- `analysis/svd_labels.py` — compute_flip_direction() function
- `scripts/validate_svd_themes.py` — validation hook
---
title: Test assertions failed after extracting SVD_THEMES to separate module
date: 2026-04-04
category: docs/solutions/test-failures/
module: Stemwijzer Data Analysis
problem_type: test_failure
component: explorer
symptoms:
  - '"test_display_label_for_modal" assertion failed with "EU-integratie" not found'
  - '"test_get_svd_label_returns_correct_label" assertion failed with "Nationalisme" not found'
  - Tests expected old fallback labels but SVD_THEMES had updated values
root_cause: test_failure
resolution_type: test_fix
severity: medium
tags: [svd, test-assertions, refactoring, constants]
affected_files:
  - tests/test_axis_label_fallback.py
  - tests/test_svd_labels.py
  - analysis/config.py
---

# Test assertions failed after extracting SVD_THEMES to separate module

## Problem

After extracting the `SVD_THEMES` constant from `explorer.py` to `analysis/config.py`, tests failed because they hardcoded assertions for the old label text.

## Symptoms

- `test_display_label_for_modal`: expected `"EU-integratie" in x_label or "Nationalisme" in x_label`
- `test_get_svd_label_returns_correct_label`: expected `"EU-integratie" in label1`
- `test_manifest_loads`: manifest.yaml had a `categories:` key instead of `files:`
## What Didn't Work

- Investigating the `get_svd_label()` function — it correctly returned values from `SVD_THEMES`
- Checking the import chain — no circular import issues
- The problem was purely that test assertions hardcoded the OLD expected label values

## Solution

Updated test assertions to match the current `SVD_THEMES` values:

**tests/test_axis_label_fallback.py:**

```python
# Before (incorrect)
assert "EU-integratie" in x_label or "Nationalisme" in x_label
assert "Populistisch" in y_label or "Institutioneel" in y_label

# After (correct)
assert "Rechts kabinetsbeleid" in x_label or "links oppositiebeleid" in x_label
assert "PVV/FVD-populisme" in y_label or "mainstream-partijen" in y_label
```

**tests/test_svd_labels.py:**

```python
# Before (incorrect)
assert "EU-integratie" in label1 or "Nationalisme" in label1

# After (correct)
assert "Rechts kabinetsbeleid" in label1 or "links oppositiebeleid" in label1
```

**manifest.yaml fix:**

```yaml
# Before (incorrect)
categories:

# After (correct)
files:
```
## Why This Works

The tests were asserting on hardcoded string values that no longer matched the actual `SVD_THEMES` content. After updating the assertions to check for the current label text, the tests pass because they verify the values actually returned.

## Prevention

1. **Audit tests when extracting constants** — When extracting constants to separate modules, grep for all test references to those constants and update the assertions
2. **Use flexible assertions** — Prefer `in` checks over exact matches when testing label text, or better yet, import the constant directly in tests and assert equality
3. **Update manifest tests early** — When changing YAML structure in config files, check for corresponding manifest/schema tests
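The second prevention point in practice: asserting against the imported constant rather than a hardcoded copy keeps tests in sync with the data they check. A sketch (the inline `SVD_THEMES` stub stands in for `from analysis.config import SVD_THEMES`):

```python
# Stand-in for: from analysis.config import SVD_THEMES
SVD_THEMES = {1: {"label": "Rechts kabinetsbeleid vs links oppositiebeleid"}}

def get_svd_label(component):
    # Stand-in for the real lookup function
    return SVD_THEMES[component]["label"]

def test_get_svd_label_returns_correct_label():
    # Compare against the constant itself, so future label edits never break this test
    assert get_svd_label(1) == SVD_THEMES[1]["label"]

test_get_svd_label_returns_correct_label()
```

The trade-off: such a test no longer catches a wrong label value, only a broken lookup path, so it complements rather than replaces a small number of substring checks on labels that matter.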
## Related Issues

- `analysis/config.py` — Contains `SVD_THEMES` (extracted from `explorer.py`)
- `analysis/svd_labels.py` — Uses `SVD_THEMES` via a runtime import from `explorer.py`
- `docs/solutions/logic-errors/svd-component-labels-mismatch.md` — Background on why SVD labels were updated from semantic to voting-pattern based
---
title: "SVD compass vs components tab party ordering inconsistency"
date: 2026-04-13
module: analysis
problem_type: ui_bug
component: analysis
symptoms:
  - "SVD components tab and political compass showed different party orderings for the same data"
  - "Party positions in compass did not match positions in SVD Components tab for components 1-2"
root_cause: logic_error
resolution_type: code_fix
severity: medium
tags:
  - svd
  - pca
  - compass
  - alignment
  - procrustes
---

# SVD Compass vs Components Tab Party Ordering Inconsistency

## Problem

The SVD Components tab and the political compass visualization showed different party orderings for the same data. Users would see a party at position X in the compass but the same party at position Y in the SVD Components tab for components 1-2.

## Symptoms

- The same party (e.g., PVV) has a different x-coordinate in the compass vs the SVD Components tab
- Party ordering along the political axis differs between the two views
- Confusing user experience when exploring voting patterns

## What Didn't Work

Using raw SVD scores directly in the SVD Components tab. The compass uses Procrustes-aligned PCA positions from `load_positions()`, but components 1-2 in the SVD Components tab were using unaligned raw SVD scores. These are in different coordinate frames.
## Solution

For components 1-2 in the SVD Components tab, use aligned PCA positions from `load_positions()` (the same data source as the compass) instead of raw SVD scores. Components 3-10 continue to use raw SVD scores.

Added a `_get_aligned_party_coords()` helper function in `explorer.py` that:

1. Calls `load_positions()` to get aligned MP positions
2. Aggregates MP positions to party centroids using `load_party_map()`
3. Returns `{party: (x, y)}` coordinates

```python
from typing import Dict, List, Tuple

import numpy as np

# db_path, load_positions, and load_party_map are module-level names in explorer.py

def _get_aligned_party_coords(window: str) -> Dict[str, Tuple[float, float]]:
    """Get party (x, y) coordinates from aligned PCA positions for a window."""
    positions_by_window, _ = load_positions(db_path, "annual")
    window_pos = positions_by_window.get(window, {})
    if not window_pos:
        return {}

    # Load party map to convert MP names to parties
    _party_map = load_party_map(db_path)

    # Aggregate MP positions to party centroids
    party_coords: Dict[str, List[Tuple[float, float]]] = {}
    for mp_name, (x, y) in window_pos.items():
        party = _party_map.get(
            mp_name, _party_map.get(mp_name.split("(")[0].strip(), None)
        )
        if party:
            party_coords.setdefault(party, []).append((x, y))

    # Compute mean position per party
    return {
        party: (
            float(np.mean([c[0] for c in coords])),
            float(np.mean([c[1] for c in coords])),
        )
        for party, coords in party_coords.items()
        if coords
    }
```
The rendering code now branches on the selected component:

```python
if comp_sel <= 2:
    # Components 1-2: use aligned PCA positions (consistent with compass)
    aligned_coords = _get_aligned_party_coords(svd_window)
    for party, (x, y) in aligned_coords.items():
        party_1d_coords[party] = (x,) if comp_sel == 1 else (y,)
else:
    # Components 3-10: use raw SVD scores
    idx = comp_sel - 1
    for party, scores in party_scores.items():
        if scores and len(scores) > idx:
            party_1d_coords[party] = (float(scores[idx]),)
```
## Why This Works

1. **Same coordinate frame**: Both visualizations now use Procrustes-aligned PCA positions for components 1-2
2. **Consistent party centroids**: Both aggregate MP positions to party centroids the same way
3. **Clear separation of concerns**: Components 1-2 represent political compass axes (which need alignment), while components 3-10 are topic dimensions (which use raw SVD scores)

## Prevention

- When adding new SVD/PCA visualizations, always check which data source the compass uses and use the same source for consistency
- Document coordinate frame requirements: "aligned" vs "raw" SVD scores have different interpretations
- Consider adding integration tests that verify the compass and the SVD Components tab show consistent positions
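Such an integration test could compare party orderings from the two views (a sketch with hypothetical coordinates; real data would come from `load_positions()` and the tab's data loader):

```python
def party_order(coords):
    """Order parties along the x-axis given {party: (x, y)} coordinates."""
    return [p for p, _ in sorted(coords.items(), key=lambda kv: kv[1][0])]

def test_compass_matches_components_tab():
    # Hypothetical aligned coordinates as both views should report them
    compass = {"SP": (-1.1, 0.2), "VVD": (0.9, -0.1), "PVV": (1.4, 0.5)}
    components_tab = {"SP": (-1.1, 0.2), "VVD": (0.9, -0.1), "PVV": (1.4, 0.5)}
    # Components 1-2 must use the same aligned frame as the compass,
    # so the ordering along each axis must be identical
    assert party_order(compass) == party_order(components_tab)

test_compass_matches_components_tab()
```

Comparing orderings rather than exact floats keeps the test robust to small numerical differences between code paths.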
## Related Files

- `explorer.py` — `_get_aligned_party_coords()` helper, component 1-2 data loading
- `analysis/political_axis.py` — `load_positions()` and PCA alignment logic
- `analysis/explorer_data.py` — `load_party_scores_all_windows()` for components 3-10

## Related Issues

- This fix builds on the earlier SVD axis pole label alignment fix (`docs/solutions/ui-bugs/svd-axis-pole-labels-incorrect-after-flip.md`)