Delete 17 malformed YAML constraint files and 10 stale numbered constraint files. Convert the domain glossary, patterns, stack, and anti-patterns documents to Markdown. Update manifest.yaml to reference the new Markdown files.
parent 910ef0dc3b
commit 88595c869b

@@ -0,0 +1,127 @@
---
title: Anti-Patterns in Stemwijzer
category: anti-patterns
severity: critical
---

# Anti-Patterns

> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details.

## CRITICAL: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception:
        _logger.exception("Error fetching motions from API")
        return []
```

---

## CRITICAL: Global `_DummySt` Replacement

**File**: `explorer.py`
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement

**Problem**: Creates a module-level variable `st` that shadows the `streamlit` module, causing subtle bugs.

**Fix**: Use conditional flags instead of global replacement:
```python
# GOOD: Use conditional logic
try:
    import plotly.express as px
    import plotly.graph_objects as go
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    px = None
    go = None

def render_chart(data):
    if not HAS_PLOTLY:
        _logger.warning("Plotly not available")
        return
    # ... rest of chart logic
```

---

## WARNING: Logger Naming Inconsistency

**Evidence**: 16 files use `logger`, 17 files use `_logger`

**Files with `logger`** (without underscore):
- api_client.py, ai_provider.py, pipeline files, analysis files

**Files with `_logger`** (with underscore):
- database.py, explorer.py, explorer_helpers.py

**Recommendation**: Standardize on `_logger` for module-level loggers.
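
A minimal sketch of the recommended module-level setup (the function and its argument are illustrative, not from the codebase):

```python
import logging

# Module-level logger named after the module, so records show their origin.
_logger = logging.getLogger(__name__)

def fetch_count(records: list) -> int:
    """Log with lazy %-formatting; the message is only built if the level is enabled."""
    _logger.info("Fetched %d records", len(records))
    return len(records)
```

Lazy `%`-formatting (passing arguments instead of an f-string) also lets log aggregators group identical message templates.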

---

## WARNING: Bare except with pass

**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped: %s", exc)
```

---

## INVESTIGATED: Entity-ID / Party-Name Mismatch

**Status**: INVALID - investigated and resolved

**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists.

---

## Pattern: Three Separate Party Alias Dictionaries

**Problem**: Party name variations exist in 3+ places with no canonical alias mapping.

**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    # ...
}
```
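
A hedged sketch of a lookup helper built on such a dict (the `canonical_party` name is an assumption, not an existing function):

```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
}

# Invert once at import time: every alias (and the canonical name itself)
# maps to the canonical name.
_CANONICAL = {
    alias: canonical
    for canonical, aliases in PARTY_ALIASES.items()
    for alias in [canonical, *aliases]
}

def canonical_party(name: str) -> str:
    """Return the canonical party name, or the input unchanged if unknown."""
    return _CANONICAL.get(name.strip(), name.strip())
```

With one inverted map, all three call sites reduce to a single `canonical_party()` call.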
@@ -1,34 +0,0 @@
# Naming & Style Conventions

## Rules
- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Import order: stdlib, third-party, local; prefer absolute imports, grouped by origin.
- Use black, ruff, isort, and mypy as the recommended toolchain; the repository currently lacks config files for them.

## Examples

### Function example (from pipeline/run_pipeline.py)
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```

### Class example (from database.py)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns
- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations
- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.

## Evidence pointers
- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (database.py, lines 1-400+)
@@ -1,74 +0,0 @@
# Database Schema (DuckDB) — extracted DDL

## Rules
- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ... ADD COLUMN IF NOT EXISTS is used in database.py).

## Examples (DDL snippets extracted from database.py)

### motions table
```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table
```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

## Anti-patterns
- Broad try/except around the duckdb import (top of database.py) — acceptable for an optional dependency, but it should explicitly log the missing dependency and document the test behavior.

## Remediations
- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80).

## Evidence pointers
- database.py: DDL strings and sequences (lines ~1-300 and beyond). See the CREATE TABLE blocks for motions, mp_votes, embeddings, fused_embeddings.
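
A minimal sketch of the schema_version idea from the remediation above, using the stdlib sqlite3 module as a stand-in for DuckDB (the pattern carries over; the table layout and function name are assumptions):

```python
import sqlite3

# Ordered, append-only list of migrations; new changes go at the end.
MIGRATIONS = [
    "CREATE TABLE IF NOT EXISTS motions (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    "ALTER TABLE motions ADD COLUMN controversy_score FLOAT",
]

def migrate(conn: sqlite3.Connection) -> int:
    """Apply pending migrations in order; return the resulting schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        conn.execute("INSERT INTO schema_version VALUES (?)", (version + 1,))
    conn.commit()
    return max(current, len(MIGRATIONS))
```

Because applied versions are recorded, re-running `migrate()` on an up-to-date database is a no-op, which makes deployments deterministic.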
@@ -1,22 +0,0 @@
# Domain Glossary

## Rules
- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms
- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage
- pipeline.run_pipeline._generate_windows produces the window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120

## Evidence pointers
- database.py: motions, mp_votes, embeddings, fused_embeddings tables
- pipeline/run_pipeline.py: window generation and pipeline phases

## Anti-patterns
- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` in database.insert_motion and the pipeline extraction). Prefer canonical names matching the DB columns, and use small adapter functions when converting between representations.
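
The window-identifier convention can be sketched as a tiny helper (the function name is illustrative, not the repository's `_generate_windows`):

```python
from datetime import date

def window_id_for(d: date, granularity: str = "quarterly") -> str:
    """Return a window identifier such as "2024-Q1" (quarterly) or "2024" (annual)."""
    if granularity == "annual":
        return str(d.year)
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return f"{d.year}-Q{quarter}"
```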
@@ -1,30 +0,0 @@
# Code Clusters / Organization

## Rules
- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in the DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)
```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations
- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers
- pipeline/run_pipeline.py: orchestrator and cluster boundaries
- ai_provider.py: AI adapter for embeddings and chat
- analysis/visualize.py: visualization cluster
@@ -1,46 +0,0 @@
# Design Patterns & Code Patterns

## Rules
- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py runs in phases and uses a ThreadPoolExecutor for parallel SVD computation, with careful DuckDB connection handling (collect results before writing).

## Examples

### Repository pattern (database.py MotionDatabase)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into the database."""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)
```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff; handles 429 with Retry-After and 5xx responses
```

### Pipeline parallelism pattern (run_pipeline)
```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait for all futures, then write results to DuckDB sequentially
```
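
The collect-before-write idea can be sketched end to end; this is an illustrative reconstruction with simulated work, not the repository's actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def compute_for_window(window_id: str) -> tuple[str, int]:
    # Stand-in for a read-only SVD computation on one window.
    return window_id, len(window_id)

def run_windows(windows: list[str], max_workers: int = 4) -> dict[str, int]:
    results = {}
    # Phase 1: compute in parallel against read-only data.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(compute_for_window, w): w for w in windows}
        for fut in as_completed(futures):
            window_id, value = fut.result()
            results[window_id] = value
    # Phase 2: write collected results sequentially
    # (a single writer avoids DuckDB write contention).
    return results
```

Separating compute (parallel, read-only) from writes (sequential) is what makes the worker pool safe with a single-writer database.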

## Anti-patterns
- Broad excepts in several places (the top-level try/except on the duckdb import in database.py, many generic excepts around DB operations) — these can hide real errors.

## Remediations
- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers
- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (lines ~120-260)
- database.py: MotionDatabase methods
@@ -1,24 +0,0 @@
# Anti-patterns, Issues and Recommended Fixes

## Rules
- Flagged issues discovered in Phase 1 must be remediated with concrete actions.

## Issues
- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive dependencies into production.
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of the openai package.
- Many dependencies use permissive ">=" version ranges and no lockfile is present, which reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended: add configs and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes
- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on the toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches, and re-raise or log with a traceback when the error is unexpected. Example locations: the database.py top-level import, the broad except in insert_motion, and the ai_provider fallback blocks.

## Evidence pointers
- pyproject.toml: dependency list (lines 1-40)
- database.py: multiple broad except blocks (top of file and in methods)
- ai_provider.py: uses requests + env keys
@@ -1,117 +0,0 @@
# Example Extractions

## Rules
- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual → "2024"
    """
```

2) database.py::append_audit_event
```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries the DB, then falls back to a ledger file."""
```

3) ai_provider.py::get_embedding
```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch
```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter
```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```
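
A hedged reconstruction of what a quarterly implementation behind the `_generate_windows` signature above could look like (the real function may differ; this sketch only follows its docstring):

```python
from datetime import date, timedelta

def generate_quarterly_windows(start: date, end: date) -> list[tuple[str, str, str]]:
    """Return (window_id, start_str, end_str) tuples covering [start, end] by quarter."""
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while date(year, 3 * quarter - 2, 1) <= end:
        q_start = date(year, 3 * quarter - 2, 1)
        # First day of the next quarter, minus one day, is the quarter's last day.
        next_start = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
        q_end = next_start - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", q_start.isoformat(), q_end.isoformat()))
        year, quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)
    return windows
```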

## (b) SQL / DDL snippets (3 examples inferred from database.py)
1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)
2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes
3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)
Create tests under tests/ named test_*.py, using fixtures from conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py
```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py
```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```

3) tests/test_pipeline_windows.py
```python
from pipeline.run_pipeline import _generate_windows

def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py
```python
import pytest

def test_plot_umap_scatter_no_plotly(monkeypatch):
    # Simulate plotly being unavailable; the guard should raise ImportError
    # with guidance (flag name assumed per the HAS_PLOTLY convention).
    import analysis.visualize as vis

    monkeypatch.setattr(vis, "HAS_PLOTLY", False, raising=False)
    with pytest.raises(ImportError):
        vis._require_plotly()
```

## Evidence pointers
- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py CREATE TABLE blocks
@@ -1,43 +0,0 @@
# Stack and Dependencies

## Rules
- Primary language: Python >=3.13 (evidence: pyproject.toml, requires-python = ">=3.13")
- Application: Streamlit app (streamlit>=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)
```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes
- pytest is listed under runtime dependencies in pyproject.toml. Move pytest to dev dependencies to avoid shipping the test runner in production images.
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating a lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai is declared, but no static imports were found; it is a possible unused dependency (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai).

## Remediations
- Move test-only libraries (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai): remove them or confirm dynamic usage.

## Evidence pointers
- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry point
- database.py: duckdb table creation and connection (lines ~1-350)
@@ -1,29 +0,0 @@
# DB connection handling constraints

rules:
  - name: use_context_managers_for_connections
    rule: "Prefer 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible."
    rationale: "Ensures proper resource cleanup and avoids connection leaks."

  - name: read_only_for_compute
    rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)."
    rationale: "Allows safe parallel workers and reduces write contention."

  - name: short_lived_writes
    rule: "When performing database writes, open short-lived connections, commit quickly, and close."
    rationale: "Avoids long-lived transactions and reduces lock windows."

examples:
  - path: pipeline/svd_pipeline.py
    snippet: |
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(...).fetchall()
      finally:
          conn.close()

anti_patterns_and_remediations:
  - bad: "Creating a global connection at import time that performs migrations."
    remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time."
  - bad: "Not closing connections on exceptions."
    remediation: "Wrap connects in 'with' blocks or close in 'finally'."
@@ -0,0 +1,143 @@
---
title: Error Handling Patterns
category: constraints
severity: high
---

# Error Handling Patterns

## Core Rules

1. **Catch `Exception`, return safe fallbacks** (False/[]/None)
2. **Log exceptions with traceback** using `_logger.exception()`
3. **Never swallow exceptions silently** - always log or return a sensible default
4. **Avoid nested try/except blocks** - flatten exception handling

## Pattern: Try/Except Safe Fallback

This is the dominant pattern in the codebase (219+ instances).

```python
# Standard pattern from database.py, api_client.py, etc.
try:
    result = risky_operation()
    return process(result)
except Exception as exc:
    _logger.warning("Operation failed: %s", exc)
    return safe_fallback  # False, [], None, {}
```

### Examples from Codebase

**database.py** - DuckDB operations:
```python
def get_svd_vectors(self, window: str):
    try:
        conn = duckdb.connect(self.db_path, read_only=True)
        try:
            result = conn.execute(query, (window,)).fetchall()
            return self._parse_vectors(result)
        finally:
            conn.close()
    except Exception as exc:
        _logger.warning("Failed to get SVD vectors: %s", exc)
        return []
```

**ai_provider.py** - HTTP retries:
```python
try:
    resp = requests.post(url, json=json, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(f"Connection error: {exc}") from exc
    # ... retry logic
```
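
A self-contained sketch of the retry loop implied above, with the HTTP call injected so the backoff logic is visible on its own (`ProviderError` and the delay schedule are assumptions):

```python
import time

class ProviderError(Exception):
    """Raised when the provider fails after all retries."""

def post_with_retries(do_post, retries: int = 3, base_delay: float = 0.0):
    """Call do_post() up to `retries` times, backing off exponentially between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return do_post()
        except ConnectionError as exc:
            if attempt == retries:
                # Out of attempts: surface a provider-level error with the cause chained.
                raise ProviderError(f"Connection error: {exc}") from exc
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Injecting `do_post` keeps the retry policy testable without a network; in production it would wrap the actual `requests.post` call.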

## Pattern: Optional Dependency Fallback

Gracefully degrade when optional packages are unavailable.

```python
# UMAP fallback in explorer_helpers.py
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False
    _logger.debug("UMAP not available, using SVD vectors directly")

def project_to_2d(vectors):
    if HAS_UMAP:
        return umap.UMAP().fit_transform(vectors)
    return vectors[:, :2]  # Fallback: first 2 SVD dimensions
```

## Anti-Patterns

### 1. Bare except with pass (CRITICAL)
**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**: Catch a specific exception, or log and continue:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped (may already exist): %s", exc)
```

### 2. Nested Exception Handling
**File**: `explorer.py`, lines 244-261

```python
# BAD - opaque error paths
try:
    result = compute_svd(motions)
except Exception:
    try:
        result = fallback_compute(motions)
    except Exception:
        pass  # Both exceptions silently dropped
```

**Fix**: Flatten and handle each case explicitly:
```python
# GOOD - explicit handling
try:
    result = compute_svd(motions)
except Exception as exc:
    _logger.warning("SVD failed, trying fallback: %s", exc)
    try:
        result = fallback_compute(motions)
    except Exception as fallback_exc:
        _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc)
        raise
```

## Rule Summary

| Pattern | When to Use | Return Value |
|---------|-------------|--------------|
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
| Re-raise | Critical operations that must succeed | raise |
| Log and continue | Optional steps in a pipeline | (continue) |
| Graceful degradation | Optional dependencies | Default behavior |

## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |
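
The safe-fallback rows of the table can be packaged once as a decorator; a hedged sketch (the decorator is illustrative, not an existing codebase utility):

```python
import functools
import logging

_logger = logging.getLogger(__name__)

def safe_fallback(default):
    """Return `default` and log a warning if the wrapped function raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                _logger.warning("%s failed: %s", func.__name__, exc)
                return default
        return wrapper
    return decorator

@safe_fallback(default=[])
def parse_votes(raw: str) -> list[str]:
    return [v.strip() for v in raw.split(",")]
```

A decorator keeps the fallback policy in one place instead of repeating the try/except boilerplate at every call site.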
@ -1,184 +0,0 @@ |
||||
# Error Handling Constraints |
||||
|
||||
## Core Rule |
||||
|
||||
**Catch `Exception`, return safe fallbacks (False/[]/None)** |
||||
|
||||
Never let exceptions propagate to user-facing code. Always provide a safe default. |
||||
|
||||
## Patterns |
||||
|
||||
### For Not-Found Operations |
||||
|
||||
Return `None` or falsy value when item not found: |
||||
|
||||
```python |
||||
# GOOD: Return None on not found |
||||
def get_motion_by_id(self, motion_id: int) -> Optional[Dict]: |
||||
try: |
||||
conn = duckdb.connect(self.db_path) |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone() |
||||
conn.close() |
||||
return result |
||||
except Exception: |
||||
conn.close() |
||||
return None |
||||
``` |
||||
|
||||
### For Collection Operations |
||||
|
||||
Return empty list when no results: |
||||
|
||||
```python |
||||
# GOOD: Return empty list on failure |
||||
def get_filtered_motions(self, **kwargs) -> List[Dict]: |
||||
try: |
||||
conn = duckdb.connect(self.db_path) |
||||
rows = conn.execute(query, params).fetchall() |
||||
conn.close() |
||||
return rows |
||||
except Exception: |
||||
conn.close() |
||||
return [] |
||||
``` |
||||
|
||||
### For Boolean Operations |
||||
|
||||
Return `False` for failed boolean checks: |
||||
|
||||
```python |
||||
# GOOD: Return False on failure |
||||
def motion_exists(self, motion_id: int) -> bool: |
||||
try: |
||||
conn = duckdb.connect(self.db_path) |
||||
count = conn.execute( |
||||
"SELECT COUNT(*) FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone()[0] |
||||
conn.close() |
||||
return count > 0 |
||||
except Exception: |
||||
return False |
||||
``` |
||||
|
||||
### For Creation Operations |
||||
|
||||
Return `False` or empty string on failure: |
||||
|
||||
```python |
||||
# GOOD: Return empty string on failure |
||||
def generate_summary(self, title: str, body: str) -> str: |
||||
try: |
||||
return ai_provider.chat_completion(messages) |
||||
except ai_provider.ProviderError: |
||||
logger.exception("AI provider failed") |
||||
return "" |
||||
``` |
||||
|
||||
## Anti-Patterns to Avoid |
||||
|
||||
### Don't Catch Specific Exceptions Only |
||||
```python |
||||
# BAD: Catches only FileNotFoundError, misses other issues |
||||
try: |
||||
with open(path) as f: |
||||
return json.load(f) |
||||
except FileNotFoundError: |
||||
return None |
||||
``` |
||||
|
||||
### Don't Re-raise Without Context |
||||
```python |
||||
# BAD: Loses information |
||||
try: |
||||
process(data) |
||||
except Exception: |
||||
raise # No context added |
||||
``` |
||||
|
||||
### Don't Swallow Exceptions Silently |
||||
```python |
||||
# BAD: No logging, no fallback |
||||
try: |
||||
return risky_operation() |
||||
except Exception: |
||||
pass # What happened? |
||||
``` |
||||
|
||||
## Nested Exception Handling

When calling code that has its own error handling, wrap only if needed:

```python
# Accept result from wrapped function (it handles errors)
def fetch_motions(self, start_date):
    # ai_provider_wrapper handles retries internally
    embeddings = get_embeddings_with_retry(texts)

    # Only wrap if wrapper doesn't handle errors
    if all(e is None for e in embeddings):
        logger.error("All embeddings failed")
        return []

    return process(embeddings)
```

||||
## Context Managers

Use `try/finally` (or a context manager) for cleanup:

```python
def process_with_temp_file(self):
    temp = NamedTemporaryFile(delete=False)
    try:
        temp.write(data)
        temp.close()
        return process_file(temp.name)
    finally:
        temp.close()  # no-op if already closed; needed if write() raised
        os.unlink(temp.name)
```

||||
## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |

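The table above can be sketched as two small helpers. This is illustrative only: `parse_vote`, `require_db_path`, and the `STEMWIJZER_DB_PATH` variable are hypothetical names, not the project's API.

```python
import logging
import os

logger = logging.getLogger(__name__)

VOTE_VALUES = {"for": 1, "against": -1, "abstain": 0}

def parse_vote(raw) -> int:
    """User-input failure: log a warning, return the safe default (abstain)."""
    try:
        return VOTE_VALUES[raw.strip().lower()]
    except (KeyError, AttributeError):
        logger.warning("Unrecognised vote value %r; defaulting to abstain", raw)
        return 0

def require_db_path() -> str:
    """Configuration error: raise with a clear message instead of defaulting."""
    path = os.environ.get("STEMWIJZER_DB_PATH")  # hypothetical env var
    if not path:
        raise RuntimeError("STEMWIJZER_DB_PATH environment variable is required")
    return path
```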
||||
## Exception Propagation

Only raise exceptions for:
1. Configuration/setup errors (missing required env vars)
2. Programming errors (invalid arguments)
3. Fatal system errors (database corruption)

```python
# GOOD: Raise for configuration errors
def _get_api_key(self) -> str:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ProviderError(
            "OPENROUTER_API_KEY environment variable is required"
        )
    return key
```

||||
## Logging Errors

Always include context:

```python
# GOOD: Include relevant context
_logger.error(
    "Failed to fetch motion %d: %s",
    motion_id,
    exc,
)

# BAD: No context
_logger.error("Failed to fetch")
```
@@ -1,36 +0,0 @@
||||
# Error handling style rules (YAML constraint example)

rules:
  - name: explicit_exceptions
    rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values."
    examples:
      - good: |
          if not isinstance(text, str):
              raise ProviderError('text must be a string')
      - bad: |
          if not isinstance(text, str):
              return []

  - name: avoid_broad_except
    rule: "Avoid 'except Exception:' that swallows errors. If broad except is used for best-effort, log the exception with logger.exception and re-raise or convert."
    examples:
      - bad: |
          try:
              do_work()
          except Exception:
              return []
      - remediation: |
          try:
              do_work()
          except SpecificError as exc:
              logger.warning('Handled error: %s', exc)
              raise

  - name: logging_over_print
    rule: "Prefer logger.* over print() for messages and errors."
    examples:
      - bad: "print('Error fetching motions from API: %s' % e)"
      - good: "logger.exception('Error fetching motions from API')"

enforcement_examples:
  - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception."
@@ -0,0 +1,92 @@
||||
---
title: Dependencies and Library Usage
category: dependencies
---

# Dependencies and Library Usage

## Core Dependencies

### duckdb
- **Required**: Yes
- **Fallback**: None (core functionality)
- **Usage**: SQL database for motions, embeddings, SVD vectors
- **Files**: database.py, analysis/*.py, pipeline/*.py

### streamlit
- **Required**: Yes
- **Fallback**: None
- **Usage**: Web UI framework
- **Files**: app.py, pages/*.py, explorer.py

### requests
- **Required**: Yes
- **Fallback**: None
- **Usage**: HTTP client for API calls
- **Files**: api_client.py, ai_provider.py

### plotly
- **Required**: Yes
- **Fallback**: None (raises ImportError)
- **Usage**: Interactive charts for explorer
- **Files**: explorer.py, explorer_helpers.py

## Optional Dependencies

### umap-learn
- **Required**: No
- **Fallback**: Use raw SVD vectors (first 2 dimensions)
- **Usage**: Dimensionality reduction for visualization
- **Files**: analysis/clustering.py

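The graceful fallback can be sketched as below. This is illustrative; the actual logic lives in analysis/clustering.py and the function names here are assumptions.

```python
import numpy as np

try:
    import umap  # optional dependency (umap-learn)
except ImportError:
    umap = None

def svd_fallback(vectors) -> np.ndarray:
    """Fallback projection: take the first two raw SVD dimensions."""
    return np.asarray(vectors)[:, :2]

def project_2d(vectors) -> np.ndarray:
    """Use UMAP when installed, otherwise fall back to raw SVD dims."""
    if umap is not None:
        return umap.UMAP(n_components=2).fit_transform(vectors)
    return svd_fallback(vectors)
```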
||||
### matplotlib
- **Required**: No
- **Fallback**: Plotly or raw output
- **Usage**: Static charting
- **Files**: Various analysis scripts

## ML Dependencies

### sklearn
- **Required**: Yes
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler
- **Files**: analysis/clustering.py, similarity/compute.py

### scipy
- **Required**: Yes
- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py

### numpy
- **Required**: Yes
- **Usage**: Array operations, linear algebra
- **Files**: Throughout codebase

## Key Imports by File

### explorer.py
- `import streamlit as st`
- `from database import db`
- `from explorer_helpers import *`

### explorer_helpers.py
- `import pandas as pd`
- `import plotly.graph_objects as go`
- `from database import db` (optional, for type hints)

### database.py
- `import ibis`
- `import duckdb`
- `from config import config, PARTY_COLOURS`

### config.py
- `from dataclasses import dataclass, field`
- `import streamlit as st` (optional, for warnings)

## Singleton Instances

| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |
@@ -1,78 +0,0 @@
||||
# Dependencies

## Core Library Wiring

### Database Layer
```
ibis → DuckDB → MotionDatabase singleton (database.py)
  ↑
sqlglot (ibis dependency)
```

### Data Processing
```
pandas → (used throughout for DataFrame operations)
numpy  → (used by sklearn, scipy, umap)
scipy  → spatial.procrustes for window alignment
```

### ML Pipeline
```
sklearn.cluster       → KMeans, Procrustes
sklearn.preprocessing → StandardScaler
umap                  → UMAP (optional, graceful fallback)
```

### Visualization
```
plotly          → explorer_helpers.py chart builders
st.plotly_chart → explorer.py rendering
```

### Streamlit
```
streamlit → all pages, @st.cache_data decorators
```

## Optional Dependencies
| Package | Required | Fallback |
|---------|----------|----------|
| `umap` | No | Use raw SVD vectors (first 2 dims) |
| `plotly` | Yes | Raises ImportError |
| `duckdb` | Yes | — |
| `ibis` | Yes | — |
| `sklearn` | Yes | — |

## Singleton Instances
| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |

## Key Imports by File
```
explorer.py:
  - import streamlit as st
  - from database import db
  - from explorer_helpers import *

explorer_helpers.py:
  - import pandas as pd
  - import plotly.graph_objects as go
  - from database import db (optional, for type hints)

database.py:
  - import ibis
  - import duckdb
  - from config import config, PARTY_COLOURS

config.py:
  - from dataclasses import dataclass, field
  - import streamlit as st (optional, for warnings)
```

## Environment
- Python ≥3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)
@@ -0,0 +1,146 @@
||||
---
title: Domain Glossary
category: domain
---

# Domain Glossary - Dutch Political Terms

## CRITICAL INVARIANTS

> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT
> - Individual right-wing parties may vary slightly from the centroid
> - This is non-negotiable for any compass/axis visualization

> **Rule 2**: SVD labels are empirically derived from voting data
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative)
> - See SVD Label Derivation section below

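Rule 1 can be checked and repaired mechanically after every SVD run, since singular vectors are only defined up to sign: flip any axis on which the right-wing centroid lands negative. A minimal sketch; the function name and data shape (party name to 2D position) are illustrative assumptions.

```python
import numpy as np

RIGHT_WING = ("PVV", "FVD", "JA21", "SGP")

def orient_axes(positions: dict) -> dict:
    """Flip each axis on which the right-wing centroid is negative,
    so the centroid always appears on the RIGHT (positive) side."""
    centroid = np.mean(
        [positions[p] for p in RIGHT_WING if p in positions], axis=0
    )
    flip = np.where(centroid < 0, -1.0, 1.0)  # per-axis sign correction
    return {name: np.asarray(vec) * flip for name, vec in positions.items()}
```

Sign-flipping an entire SVD axis preserves all relative distances, so this is always a safe post-processing step.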
||||
---

## SVD Label Derivation

### The Process

SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis:

1. **Identify outliers**: Find the two MPs with the most extreme positions on that axis
2. **Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes)
3. **Interpret theme**: Read the motion titles to derive what the axis represents
4. **Assign label**: The label describes the empirical theme, which could be:
   - Left-Right
   - Coalition-Opposition
   - Progressive-Conservative
   - EU-National sovereignty
   - Populist-Establishment
   - Or whatever the voting patterns show

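Steps 1 and 2 above can be sketched as follows. The data shapes are assumptions for illustration: `axis_scores` maps MP name to position on one axis, `votes` maps MP name to `{motion_id: vote}`.

```python
def pick_outliers(axis_scores: dict) -> tuple:
    """Step 1: the two MPs with the most extreme positions on the axis."""
    hi = max(axis_scores, key=axis_scores.get)
    lo = min(axis_scores, key=axis_scores.get)
    return hi, lo

def opposed_motions(votes: dict, mp_a: str, mp_b: str, k: int = 10) -> list:
    """Step 2 (first half): up to k motions where the outliers voted opposite."""
    return [
        motion
        for motion, vote in votes.get(mp_a, {}).items()
        if vote != 0 and votes.get(mp_b, {}).get(motion) == -vote
    ][:k]
```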
||||
### Example

| Step | Description |
|------|-------------|
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 |
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 |
| 20 Motions | Immigration, integration, law & order themes dominate |
| Label | "Links-Rechts" (Left-Right) |

### Labeling Rules

- **Never use party names in labels** (e.g., not "PVV-SP axis")
- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show)
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy")
- **Fallback**: If the theme is unclear, use "Axis 1", "Axis 2"

---

## Core Entities

### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**

### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector

### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party

### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors

---

## Time & Analysis Concepts

### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window

### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering

---

## Mathematical / Algorithmic Terms

### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space

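How such a 2D position falls out of the decomposition, in rough form. This is a sketch using numpy; the real pipeline lives in pipeline/svd_pipeline.py and uses scipy, so the function name and details here are illustrative.

```python
import numpy as np

def mp_positions(vote_matrix: np.ndarray) -> np.ndarray:
    """Project an MP x Motion matrix of votes (+1/0/-1) onto its first
    two singular directions, yielding one 2D point per MP."""
    centered = vote_matrix - vote_matrix.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :2] * S[:2]  # scale components by singular values
```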
||||
### SVD Label
- Empirically derived axis label based on outlier MPs and representative motions
- Describes the theme of disagreement on that axis
- NOT based on party ideology or semantic labels

### Political Compass
- 2D visualization with SVD axes mapped to compass quadrants
- X-axis: First SVD dimension (labeled from voting data)
- Y-axis: Second SVD dimension (labeled from voting data)

### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters

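The alignment idea can be sketched with plain orthogonal Procrustes. Note the dependency notes elsewhere mention `scipy.spatial.procrustes`, which additionally normalises scale and translation; this minimal version only rotates/reflects, and the function name is an assumption.

```python
import numpy as np

def align_window(ref: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Rotate/reflect `new` window positions onto `ref`.

    Both arrays are (n_entities, 2) with rows matched by entity. The
    optimal orthogonal map R minimising ||new @ R - ref|| comes from
    the SVD of new.T @ ref.
    """
    U, _, Vt = np.linalg.svd(new.T @ ref)
    return new @ (U @ Vt)
```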
||||
### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency with graceful SVD fallback

---

## Database Table Reference

| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |

---

## Dutch Political Parties

### Canonical Right-Wing (centroid on RIGHT of axes)
- PVV (Partij voor de Vrijheid)
- FVD (Forum voor Democratie)
- JA21
- SGP (Staatkundig Gereformeerde Partij)

### Other Major Parties
- VVD (Volkspartij voor Vrijheid en Democratie)
- GL-PvdA (GroenLinks-PvdA)
- NSC (Nieuw Sociaal Contract)
- BBB (BoerBurgerBeweging)
- SP (Socialistische Partij)
- D66 (Democraten 66)
@@ -1,107 +0,0 @@
||||
# Domain Glossary - Dutch Political Terms

## Core Entities

### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**

### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector
- Historical: `mp_party_history` tracks party changes over time

### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party
- Aliases: multiple spelling variants exist (see anti-patterns.yaml)

### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors

---

## Time & Analysis Concepts

### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window
- Windows can be aligned across time using Procrustes

### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering
- Used for trend analysis in Evolution tab

---

## Mathematical / Algorithmic Terms

### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space
- `entity_id` in `svd_vectors`: either MP name (when individual MPs) or party name (when party-level)

### Political Compass
- 2D visualization: X-axis = Left↔Right, Y-axis = Progressive↔Conservative
- SVD vectors mapped to compass quadrants
- UMAP used for projection

### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters
- Implemented via `scipy.spatial.procrustes` or scikit-learn

### Centroid
- Geometric center of a set of points
- Party centroid = average SVD position of all MPs in that party
- Computed from `svd_vectors` filtered by party

### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency — graceful fallback if unavailable

---

## Visualization

### PARTY_COLOURS
- Dict mapping party names to hex color codes
- Used in all Plotly charts for consistent party coloring
- Source: `config.py` → `PARTY_COLOURS` constant
- **Issue**: 3 separate alias dictionaries exist (no single source of truth)

---

## Application Pages

### Home
- Landing page with app overview

### Stemwijzer (Quiz)
- User answers questions → matched to parties
- Thin wrapper around quiz module

### Explorer (4 tabs)
- **Motion tab**: SVD positions colored by vote on selected motion
- **MP tab**: Individual MP trajectories across windows
- **Party tab**: Party centroids with members as scatter
- **Evolution tab**: How positions change over time

---

## Database Table Reference
| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `party_centroids` | party, window, centroid_2d |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |
@@ -0,0 +1,79 @@
||||
---
title: DuckDB Access Pattern
category: patterns
---
# DuckDB Access Pattern

## Rules

- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
- Prefer `with duckdb.connect(db_path, read_only=True) as conn` for scoped connections so conn.close() is automatic.
- If a long-lived connection is created at module level, provide an explicit close() or ensure the operation is safe for Streamlit's lifecycle.
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).

## Examples

### database.py - Explicit connect/close for schema init

```python
conn = duckdb.connect(self.db_path)
...
conn.execute("""
    CREATE TABLE IF NOT EXISTS fused_embeddings (
        id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
        motion_id INTEGER NOT NULL,
        window_id TEXT NOT NULL,
        vector JSON NOT NULL,
        svd_dims INTEGER NOT NULL,
        text_dims INTEGER NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (id)
    )
""")
conn.close()
```

### pipeline/svd_pipeline.py - Read-only connection

```python
conn = duckdb.connect(db_path, read_only=True)
try:
    rows = conn.execute(
        "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
        (start_date, end_date),
    ).fetchall()
finally:
    conn.close()
```

### similarity/compute.py - Preferred 'with' context

```python
try:
    import duckdb
except Exception:
    logger.exception("duckdb import failed; cannot load vectors")
    return 0

with duckdb.connect(db.db_path) as conn:
    rows = conn.execute(query, params).fetchall()
```

## Anti-Patterns

### Bad: Connection without closure

```python
# BAD: connection may leak if an exception occurs before the explicit close
conn = duckdb.connect(db_path)
rows = conn.execute("SELECT ...").fetchall()
# missing finally/close
```

**Remediation**: Use a `with` context or ensure conn.close() runs in a finally block.

### Bad: Parallel write connections

**Problem**: Opening write connections from many parallel workers without coordination.

**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.
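The single-writer remediation can be sketched with a queue. This is illustrative and not from the codebase; `run_writer` takes any connection-like object with an `execute` method, so it works the same with a DuckDB connection.

```python
import queue

def run_writer(conn, jobs: "queue.Queue") -> int:
    """Drain (sql, params) jobs on the single write connection.

    Workers put jobs on the queue instead of opening their own write
    connections; a None sentinel shuts the writer down. Returns the
    number of statements executed.
    """
    done = 0
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop draining
            break
        sql, params = job
        conn.execute(sql, params)
        done += 1
    return done
```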
||||
@@ -1,70 +0,0 @@
name: duckdb_access

rules:
  - Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
  - Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
  - If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle.
  - Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).

examples:
  - path: database.py
    excerpt: |
      ```python
      conn = duckdb.connect(self.db_path)
      ...
      conn.execute("""
          CREATE TABLE IF NOT EXISTS fused_embeddings (
              id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
              motion_id INTEGER NOT NULL,
              window_id TEXT NOT NULL,
              vector JSON NOT NULL,
              svd_dims INTEGER NOT NULL,
              text_dims INTEGER NOT NULL,
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
              PRIMARY KEY (id)
          )
      """)
      conn.close()
      ```
    note: explicit connect/close used when initializing schema

  - path: pipeline/svd_pipeline.py
    excerpt: |
      ```python
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(
              "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
              (start_date, end_date),
          ).fetchall()
      finally:
          conn.close()
      ```
    note: read_only connection used for compute-heavy worker

  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0

      with duckdb.connect(db.db_path) as conn:
          rows = conn.execute(query, params).fetchall()
      ```
    note: preferred 'with' context for automatic close

anti_patterns:
  - Bad: creating a connection without closure in a long-running process
    remediation: use "with" context or ensure conn.close() in finally block
    example: |
      ```python
      # BAD: connection may leak if exception occurs before explicit close
      conn = duckdb.connect(db_path)
      rows = conn.execute("SELECT ...").fetchall()
      # missing finally/close
      ```
  - Bad: Opening write connections from many parallel workers without coordination
    remediation: open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.
@@ -0,0 +1,74 @@
||||
---
title: Embeddings Similarity Pipeline
category: patterns
---
# Embeddings Similarity Pipeline

## Rules

- Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure.
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically the concatenation [svd + text].
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
- Use read_only DuckDB connections in compute workers to allow parallel runs.

## Examples

### pipeline/ai_provider_wrapper.py - Batched embed + fallback

```python
for start in range(0, len(texts), batch_size):
    chunk = texts[start : start + batch_size]
    resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
    ...
    for j in range(i, end):
        t = texts[j]
        single, single_exc = _attempt_batch([t], j)
        if single:
            results[j] = single[0]
```

### pipeline/fusion.py - Concatenation and storage

```python
try:
    svd_vec = json.loads(svd_json)
except Exception:
    _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
    skipped_missing_svd += 1
    continue
...
fused = list(svd_vec) + list(text_vec)
res = db.store_fused_embedding(
    int(entity_id),
    window_id,
    fused,
    svd_dims=len(svd_vec),
    text_dims=len(text_vec),
)
```

### similarity/compute.py - Normalized cosine similarity

```python
# Normalize rows
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
sim = normalized @ normalized.T
...
# pick top-k neighbors and write to similarity_cache
```

## Anti-Patterns

### Bad: Assuming consistent vector length

**Problem**: Assuming a consistent vector length without checks leads to shape errors.

**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).

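That remediation, roughly sketched. This is illustrative; compute.py's actual padding logic may differ.

```python
import logging

import numpy as np

_logger = logging.getLogger(__name__)

def pad_to_matrix(vectors: list) -> np.ndarray:
    """Zero-pad variable-length vectors into one (n, max_len) matrix,
    logging a warning when lengths are inconsistent."""
    max_len = max(len(v) for v in vectors)
    if any(len(v) != max_len for v in vectors):
        _logger.warning(
            "Inconsistent vector lengths; zero-padding to %d dims", max_len
        )
    out = np.zeros((len(vectors), max_len))
    for i, v in enumerate(vectors):
        out[i, : len(v)] = v
    return out
```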
||||
### Bad: Inline heavy computation in UI

**Problem**: Recomputing heavy pipelines inline in UI requests.

**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in the UI.
@@ -1,63 +0,0 @@
||||
name: embeddings_similarity_pipeline

rules:
  - Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure.
  - Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text].
  - Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
  - Use read_only DuckDB connections in compute workers to allow parallel runs.

examples:
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      for start in range(0, len(texts), batch_size):
          chunk = texts[start : start + batch_size]
          resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
          ...
          for j in range(i, end):
              t = texts[j]
              single, single_exc = _attempt_batch([t], j)
              if single:
                  results[j] = single[0]
      ```
    note: batched embed + fallback per-item retry

  - path: pipeline/fusion.py
    excerpt: |
      ```python
      try:
          svd_vec = json.loads(svd_json)
      except Exception:
          _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
          skipped_missing_svd += 1
          continue
      ...
      fused = list(svd_vec) + list(text_vec)
      res = db.store_fused_embedding(
          int(entity_id),
          window_id,
          fused,
          svd_dims=len(svd_vec),
          text_dims=len(text_vec),
      )
      ```
    note: concatenation of vectors and storage via MotionDatabase

  - path: similarity/compute.py
    excerpt: |
      ```python
      # Normalize rows
      norms = np.linalg.norm(matrix, axis=1, keepdims=True)
      norms[norms == 0] = 1.0
      normalized = matrix / norms
      sim = normalized @ normalized.T
      ...
      # pick top-k neighbors and write to similarity_cache
      ```
    note: numeric pipeline and padding to consistent dimensionality

anti_patterns:
  - Bad: Assuming consistent vector length without checks (leads to shape errors).
    remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
  - Bad: Recomputing heavy pipelines inline in UI requests.
    remediation: schedule heavy work in scripts/subprocesses and read precomputed results in UI.
@@ -0,0 +1,63 @@
||||
---
title: Error Handling Pattern
category: patterns
---
# Error Handling Pattern

## Rules

- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
- Prefer logging.exception when catching an exception where the stack trace is useful.
- Avoid broad `except:` clauses that swallow exceptions; if a broad except is used for "best-effort" fallback, log at warning level and include the original exception context.
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented.

## Examples

### ai_provider.py - Network error to ProviderError

```python
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(
            f"Connection error when calling provider: {exc}"
        ) from exc
    ...
```

### pipeline/ai_provider_wrapper.py - Best-effort with logging

```python
except Exception:
    _logger.exception("Failed to append audit event for embedding failure")
    results[j] = None
```

### similarity/compute.py - Defensive import handling

```python
try:
    import duckdb
except Exception:
    logger.exception("duckdb import failed; cannot load vectors")
    return 0
```

## Anti-Patterns

### Bad: Silent exception swallowing

```python
try:
    do_work()
except Exception:
    return []
# BAD: hides the root cause and returns an ambiguous default
```

**Remediation**: Narrow the exception types, or at minimum call logger.exception() and re-raise or convert to a domain error if truly handled.

### Bad: Mixing print() and logging

**Problem**: Mixing print() and logging for errors.

**Remediation**: Replace print() calls with logger.* calls; use a structured logging configuration.
@@ -1,54 +0,0 @@
||||
name: error_handling |
||||
|
||||
rules: |
||||
- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError). |
||||
- Prefer logging.exception when catching an exception where stack trace is useful. |
||||
- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context. |
||||
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented. |
||||
|
||||
examples: |
||||
- path: ai_provider.py |
||||
excerpt: | |
||||
```python |
||||
except requests.ConnectionError as exc: |
||||
if attempt == retries: |
||||
raise ProviderError( |
||||
f"Connection error when calling provider: {exc}" |
||||
) from exc |
||||
... |
||||
``` |
||||
note: mapping network error to ProviderError with re-raise chaining |
||||
|
||||
- path: pipeline/ai_provider_wrapper.py |
||||
excerpt: | |
||||
```python |
||||
except Exception: |
||||
_logger.exception("Failed to append audit event for embedding failure") |
||||
results[j] = None |
||||
``` |
||||
note: logs and assigns None for failure; fallback behavior documented earlier in wrapper rule |
||||
|
||||
- path: similarity/compute.py |
||||
excerpt: | |
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: |
||||
logger.exception("duckdb import failed; cannot load vectors") |
||||
return 0 |
||||
``` |
||||
note: defensive import handling and early return on failure |
||||
|
||||
anti_patterns: |
||||
- Bad: Broad except without logging and without re-raising (silently hides bugs) |
||||
remediation: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled. |
||||
example: | |
||||
```python |
||||
try: |
||||
do_work() |
||||
except Exception: |
||||
return [] |
||||
# BAD: hides the root cause and returns an ambiguous default |
||||
``` |
||||
- Bad: Mixing print() and logging for errors |
||||
remediation: Replace print() calls with logger.* calls; use structured logging configuration. |
||||
@ -0,0 +1,41 @@
---
title: Module Singletons Pattern
category: patterns
---

# Module Singletons Pattern

## Rules

- Module-level singletons (e.g., `db = MotionDatabase()`) are acceptable but should be created carefully:
  - Avoid expensive initialization at import time.
  - Provide a way to construct with a test DB path or to reinitialize in tests.
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

## Examples

### database.py - Safe class initialization

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        # If duckdb is not available, operate in lightweight file-backed mode
        self._file_mode = duckdb is None
        self._init_database()
```

### similarity/lookup.py - Local instances

```python
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
if hasattr(db, "get_cached_similarities"):
    rows = db.get_cached_similarities(...)
```

## Anti-Patterns

### Bad: Heavy initialization at import time

**Problem**: Creating connections and performing heavy schema migrations during import.

**Remediation**: Move heavy init to an explicit initialize() method and keep import fast.
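A sketch of that lazy alternative; the method and attribute names here are illustrative, not the real `database.py` API:

```python
class MotionDatabase:
    def __init__(self, db_path: str = "motions.duckdb"):
        # Constructor stays cheap: record configuration, open nothing.
        self.db_path = db_path
        self._conn = None

    def initialize(self):
        # Heavy work (connection, schema migration) runs once, on demand.
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def _connect(self):
        # Stand-in for duckdb.connect(self.db_path) plus migrations.
        return object()

# Module-level singleton remains fast to import:
db = MotionDatabase()
```

Tests can also construct `MotionDatabase(db_path=...)` directly and never touch the shared instance.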
@ -1,33 +0,0 @@
name: module_singletons

rules:
  - Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
    - Avoid expensive initialization at import time.
    - Provide a way to construct with a test DB path or to reinitialize in tests.
  - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

examples:
  - path: database.py
    excerpt: |
      ```python
      class MotionDatabase:
          def __init__(self, db_path: str = config.DATABASE_PATH):
              self.db_path = db_path
              # If duckdb is not available, operate in lightweight file-backed mode
              self._file_mode = duckdb is None
              self._init_database()
      ```
    note: class is safe to instantiate and creates DB at init; consider lazy init if heavy

  - path: similarity/lookup.py
    excerpt: |
      ```python
      db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
      if hasattr(db, "get_cached_similarities"):
          rows = db.get_cached_similarities(...)
      ```
    note: consumers create local MotionDatabase instances, not relying on a single global

anti_patterns:
  - Bad: Creating connections and performing heavy schema migrations during import
    remediation: Move heavy init to an explicit initialize() method and keep import fast.
@ -0,0 +1,77 @@
---
title: Requests HTTP Pattern
category: patterns
---

# Requests HTTP Pattern

## Rules

- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
- Raise or wrap non-OK responses into a domain ProviderError to make behavior consistent across the codebase.
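The first two rules can be combined in a single session factory. This is a sketch using urllib3's `Retry` (the retry counts and method list are illustrative choices, not project settings):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    # Session gives connection pooling; the mounted adapter adds
    # retry/backoff. urllib3 honours Retry-After on 429 by default.
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

One factory keeps retry policy in a single place instead of scattering per-call loops.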

## Examples

### ai_provider.py - 429 handling with Retry-After

```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
    if attempt == retries:
        raise ProviderError(f"Provider returned HTTP {resp.status_code}")
    retry_after = None
    raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
    if raw:
        try:
            retry_after = int(raw)
        except Exception:
            ...
    if retry_after is not None:
        time.sleep(retry_after)
    continue
```

### api_client.py - Session + raise_for_status

```python
response = self.session.get(
    base_url, params=params, timeout=config.API_TIMEOUT
)
response.raise_for_status()
data = response.json()
```

### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper

```python
def _attempt_batch(chunk_texts, start_index):
    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            emb_chunk = _embedder(
                chunk_texts, model=model, batch_size=len(chunk_texts)
            )
            return emb_chunk, None
        except Exception as exc:
            if attempt == retries:
                break
            sleep = backoff * (2 ** (attempt - 1))
            time.sleep(sleep)
            continue
```

## Anti-Patterns

### Bad: Silent exception swallowing

**Problem**: Blindly catching all requests exceptions and returning an empty response.

**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log details.
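One way to sketch that mapping; the `classify` helper is hypothetical, and only `ProviderError` is a name from the rules above:

```python
import requests

class ProviderError(Exception):
    """Terminal provider failure, as in the rules above."""

RETRYABLE = (requests.ConnectionError, requests.Timeout)

def classify(exc: Exception) -> str:
    # Decide whether a failed call is worth retrying or should be
    # surfaced to the caller as a terminal ProviderError.
    if isinstance(exc, RETRYABLE):
        return "retry"
    if isinstance(exc, requests.HTTPError):
        status = exc.response.status_code if exc.response is not None else 0
        return "retry" if status == 429 or status >= 500 else "terminal"
    return "terminal"
```

A retry loop can then retry only on `"retry"` and wrap everything else in `ProviderError`.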

### Bad: Using print() for errors

**Problem**: Using print() for network errors instead of structured logging.

**Remediation**: Use `_logger.exception()` instead (api_client.py currently uses print() and still needs this fix).
@ -1,65 +0,0 @@
name: requests_http

rules:
  - Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
  - Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
  - Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
  - Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      resp = requests.post(url, json=json, headers=headers, timeout=10)
      ...
      if getattr(resp, "status_code", 0) == 429:
          if attempt == retries:
              raise ProviderError(f"Provider returned HTTP {resp.status_code}")
          retry_after = None
          raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
          if raw:
              try:
                  retry_after = int(raw)
              except Exception:
                  ...
          if retry_after is not None:
              time.sleep(retry_after)
          continue
      ```
    note: explicit handling of 429 and Retry-After

  - path: api_client.py
    excerpt: |
      ```python
      response = self.session.get(
          base_url, params=params, timeout=config.API_TIMEOUT
      )
      response.raise_for_status()
      data = response.json()
      ```
    note: uses session + raise_for_status() to surface HTTP errors

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      def _attempt_batch(chunk_texts, start_index):
          backoff = 0.5
          for attempt in range(1, retries + 1):
              try:
                  emb_chunk = _embedder(
                      chunk_texts, model=model, batch_size=len(chunk_texts)
                  )
                  return emb_chunk, None
              except Exception as exc:
                  if attempt == retries:
                      break
                  sleep = backoff * (2 ** (attempt - 1))
                  time.sleep(sleep)
                  continue
      ```
    note: wrapper adds retry/backoff and per-item fallback

anti_patterns:
  - Bad: Blindly catching all requests exceptions and returning empty response
    remediation: map network exceptions to retryable vs terminal (ProviderError) and log details.
  - Bad: Using print() for network errors instead of structured logging (see api_client.py where print() is used; prefer logging).
@ -0,0 +1,37 @@
---
title: Validation Pattern
category: patterns
---

# Validation Pattern

## Rules

- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
- Tests should assert that invalid inputs raise the expected exceptions.
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).

## Examples

### ai_provider.py - Type validation

```python
if not isinstance(text, str):
    raise ProviderError("text must be a string")
```

### pipeline/ai_provider_wrapper.py - Defensive empty handling

```python
if not texts:
    return []
if motion_ids is None:
    motion_ids = [None for _ in texts]
```

## Anti-Patterns

### Bad: Invalid values reaching heavy computation

**Problem**: Allowing invalid values to propagate into heavy computation (e.g., a non-string entering the embedding pipeline).

**Remediation**: Fail fast with a typed exception and add unit tests to cover the validations.
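A minimal sketch of fail-fast validation with a matching unit test; `get_embedding` and its body are stand-ins, not the real embedding call:

```python
import unittest

class ProviderError(Exception):
    """Typed validation error, as in the rules above."""

def get_embedding(text):
    # Fail fast before any heavy computation runs.
    if not isinstance(text, str):
        raise ProviderError("text must be a string")
    return [float(len(text))]  # stand-in for the real embedding call

class ValidationTests(unittest.TestCase):
    def test_rejects_non_string(self):
        # The test pins the contract: invalid input raises, never degrades.
        with self.assertRaises(ProviderError):
            get_embedding(123)
```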
@ -1,29 +0,0 @@
name: validation

rules:
  - Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
  - Tests should assert that invalid inputs raise the expected exceptions.
  - Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      if not isinstance(text, str):
          raise ProviderError("text must be a string")
      ```
    note: explicit type validation before network call

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      if not texts:
          return []
      if motion_ids is None:
          motion_ids = [None for _ in texts]
      ```
    note: defensive handling of empty inputs

anti_patterns:
  - Bad: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline).
    remediation: Fail fast with a typed exception and add unit tests to cover validations.
@ -0,0 +1,67 @@
---
title: Tech Stack
category: stack
---

# Tech Stack

## Runtime & Language
- **Python >=3.13**

## Web Framework
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages

## Data Layer
- **DuckDB** - Embedded OLAP database
  - Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata
- **ibis** - ORM (referenced but DuckDB-native implementation used)

## AI / LLM
- **OpenRouter** - API abstraction for AI providers
- **QWEN** - Primary model
  - Embeddings: `qwen/qwen3-embedding-4b`
  - Chat: `qwen/qwen-2.5-72b-instruct`
- **requests** - HTTP client (not raw openai)

## ML / Analytics
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD)
- **numpy** - Numerical computing

## Visualization
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback)
- **matplotlib** - Static plotting (optional)

## HTTP & Parsing
- **requests** - Session pooling, retry with backoff
- **beautifulsoup4** - HTML parsing
- **lxml** - XML/HTML processing

## Key Source Files

| File | Purpose |
|------|---------|
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema |
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) |
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders |
| `analysis/` | SVD pipeline, UMAP projection, clustering |
| `pipeline/` | Data fetch, transform, store pipeline |
| `pages/1_Stemwijzer.py` | Quiz page |
| `pages/2_Explorer.py` | Explorer page |
| `config.py` | Dataclass Config pattern |
| `ai_provider.py` | OpenRouter API wrapper with retry |
| `api_client.py` | TweedeKamer OData API client |

## Singleton Instances

| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |

## Environment
- Python >=3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)
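The dataclass-plus-`.env` setup can be sketched like this; the field names and defaults are illustrative, not the real config.py contents:

```python
import os
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Config:
    # Values come from the environment (populated from .env at startup);
    # no secrets are hard-coded in the module itself.
    database_path: str = field(
        default_factory=lambda: os.getenv("DATABASE_PATH", "motions.duckdb")
    )
    api_timeout: int = field(
        default_factory=lambda: int(os.getenv("API_TIMEOUT", "30"))
    )

# Module-level singleton, as listed in the table above.
config = Config()
```

Freezing the dataclass keeps the singleton read-only after startup.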