Delete 17 malformed YAML constraint files and 10 stale numbered constraint files. Convert the domain glossary, patterns, stack, and anti-patterns files to Markdown, and update manifest.yaml to reference the new Markdown files.

Branch: main
Parent: 910ef0dc3b
Commit: 88595c869b
@ -0,0 +1,127 @@
---
title: Anti-Patterns in Stemwijzer
category: anti-patterns
severity: critical
---

# Anti-Patterns

> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details.

## CRITICAL: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        # ...
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception as e:
        _logger.exception("Error fetching motions from API: %s", e)
        return []
```

---

## CRITICAL: Global `_DummySt` Replacement

**File**: `explorer.py`
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement

**Problem**: Creates a module-level variable `st` that shadows the `streamlit` module, causing subtle bugs.

**Fix**: Use conditional flags instead of global replacement:
```python
# GOOD: Use conditional logic
try:
    import plotly.express as px
    import plotly.graph_objects as go
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    px = None
    go = None

def render_chart(data):
    if not HAS_PLOTLY:
        _logger.warning("Plotly not available")
        return
    # ... rest of chart logic
```

---

## WARNING: Logger Naming Inconsistency

**Evidence**: 16 files use `logger`, 17 files use `_logger`

**Files with `logger`** (without underscore):
- api_client.py, ai_provider.py, pipeline files, analysis files

**Files with `_logger`** (with underscore):
- database.py, explorer.py, explorer_helpers.py

**Recommendation**: Standardize on `_logger` for module-level loggers.

---

## WARNING: Bare except with pass

**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped: %s", exc)
```

---

## INVESTIGATED: Entity-ID / Party-Name Mismatch

**Status**: INVALID - investigated and resolved

**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists.

---

## Pattern: Three Separate Party Alias Dictionaries

**Problem**: Party name variations exist in 3+ places with no canonical alias mapping.

**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    # ...
}
```
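A single alias table is easiest to consume through a small normalization helper. The sketch below repeats the mapping for self-containment; the helper name `normalize_party` is illustrative, not something the codebase currently defines. It inverts the mapping once at import time and passes unknown names through unchanged.

```python
# Canonical name -> known aliases (subset of the PARTY_ALIASES above).
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
}

# Invert once so each lookup is a single dict access.
_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in PARTY_ALIASES.items()
    for alias in aliases
}

def normalize_party(name: str) -> str:
    """Map a known alias to its canonical party name; pass through unknowns."""
    return _ALIAS_TO_CANONICAL.get(name, name)
```

Callers then normalize once at the ingestion boundary instead of re-checking variants in every module.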
@ -1,34 +0,0 @@
# Naming & Style Conventions

## Rules
- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Import order: stdlib, third-party, local; prefer absolute imports, grouped.
- Use black, ruff, isort, and mypy as the recommended toolchain; the repository currently lacks config files for them (no black/ruff sections in pyproject).

## Examples

### Function example (from pipeline/run_pipeline.py)
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```

### Class example (from database.py)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns
- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations
- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.
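A pyproject.toml fragment along these lines would implement the remediation. The line length, target version, and rule selection shown here are illustrative defaults, not settings taken from the repository:

```toml
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle, pyflakes, import sorting

[tool.isort]
profile = "black"

[tool.mypy]
python_version = "3.13"
warn_unused_ignores = true
```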

## Evidence pointers
- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (database.py lines 1-400+)
@ -1,74 +0,0 @@
# Database Schema (DuckDB) — extracted DDL

## Rules
- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS is used in database.py).

## Examples (DDL snippets extracted from database.py)

### motions table
```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table
```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings
```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

## Anti-patterns
- Broad try/except around the duckdb import (database.py top) — acceptable for an optional dependency, but it should log the missing dependency explicitly and document test behavior.

## Remediations
- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and the JSON-fallback database paths. Evidence: database.py contains JSON fallback logic (lines ~1-80).
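A minimal version table in the spirit of the first remediation could look like this. The table and column names are illustrative; the project does not currently define them:

```sql
CREATE TABLE IF NOT EXISTS schema_version (
    version INTEGER NOT NULL,
    applied_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    description TEXT
);

-- At startup, read MAX(version) and apply any pending migrations above it,
-- inserting one row per migration applied.
```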

## Evidence pointers
- database.py: DDL strings and sequences (database.py lines ~1-300 and further). See the CREATE TABLE blocks for motions, mp_votes, embeddings, fused_embeddings.
@ -1,22 +0,0 @@
# Domain Glossary

## Rules
- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms
- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (database.py lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage
- pipeline.run_pipeline._generate_windows produces the window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120
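The quarterly window-id convention can be sketched as a standalone function. This mirrors the documented format ("2024-Q1") but is an illustration, not the pipeline's actual `_generate_windows` code:

```python
from datetime import date

def quarterly_window_id(d: date) -> str:
    """Return the quarterly window id (e.g. '2024-Q1') for a date."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return f"{d.year}-Q{quarter}"
```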

## Evidence pointers
- database.py: motions, mp_votes, embeddings, fused_embeddings tables (file: database.py)
- pipeline/run_pipeline.py: window generation and pipeline phases (file: pipeline/run_pipeline.py)

## Anti-patterns
- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` usage in database.insert_motion and the pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when transitioning representations.
@ -1,30 +0,0 @@
# Code Clusters / Organization

## Rules
- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in the DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)
```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations
- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers
- pipeline/run_pipeline.py: orchestrator and cluster boundaries (file: pipeline/run_pipeline.py)
- ai_provider.py: AI adapter for embeddings and chat (file: ai_provider.py)
- analysis/visualize.py: visualization cluster (file: analysis/visualize.py)
@ -1,46 +0,0 @@
# Design Patterns & Code Patterns

## Rules
- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py uses phases and a ThreadPoolExecutor for parallel SVD computation, with careful DuckDB connection handling (collect results before writes).

## Examples

### Repository pattern (database.py MotionDatabase)
```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into the database."""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)
```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
```
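A generic retry-with-exponential-backoff loop in this spirit might look like the following. This is a standalone sketch: the real `_post_with_retries` handles HTTP-specific cases (429 Retry-After, 5xx) that are omitted here, and the function name is illustrative:

```python
import time

def call_with_retries(fn, retries: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```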

### Pipeline parallelism pattern (run_pipeline)
```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait then write sequentially to DuckDB
```
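The collect-results-then-write-sequentially idea can be sketched end to end. All names here are illustrative: `compute` stands in for `compute_svd_for_window` and `write` for the DuckDB insert step, so the shape of the pattern is visible without the pipeline's details:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_windows(windows, compute, write, max_workers=4):
    """Compute each window in parallel, then write all results sequentially."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(compute, w): w for w in windows}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # collect before any write
    for window_id in windows:  # single-threaded writes avoid DB lock contention
        write(window_id, results[window_id])
```

Keeping writes on one thread sidesteps concurrent-writer issues with a single DuckDB file.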

## Anti-patterns
- Broad excepts are used in several places (database.py's top-level try/except on the duckdb import, many generic excepts around DB operations) — these can hide real errors.

## Remediations
- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers
- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (file: ai_provider.py lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (file: pipeline/run_pipeline.py lines ~120-260)
- database.py: MotionDatabase methods (file: database.py)
@ -1,24 +0,0 @@
# Anti-patterns, Issues and Recommended Fixes

## Rules
- Issues flagged in Phase 1 must be remediated with concrete actions.

## Issues
- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and env keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges and no lockfile is present. This reduces reproducibility.
- Formatting/linting configs (black, ruff, isort, mypy) are missing. Recommended: add config and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes
- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If dynamically imported at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches, and re-raise or log with traceback when the error is unexpected. Example locations: database.py top import, insert_motion's broad except, ai_provider fallback blocks.
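A PEP 621-style way to express the pytest move (the version pin matches the one currently in pyproject.toml; the `dev` extra name is a common convention, not something the repository defines yet):

```toml
[project]
dependencies = [
    # ... runtime deps only; pytest removed from here
]

[project.optional-dependencies]
dev = [
    "pytest>=9.0.2",
]
```

Development environments then install with `pip install -e .[dev]`, while production images install only the base dependencies.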

## Evidence pointers
- pyproject.toml: dependencies list (file: pyproject.toml lines 1-40)
- database.py: multiple broad except blocks (file: database.py, top and methods)
- ai_provider.py: uses requests + env keys (file: ai_provider.py)
@ -1,117 +0,0 @@
# Example Extractions

## Rules
- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows
```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual → "2024"
    """
```

2) database.py::append_audit_event
```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```

3) ai_provider.py::get_embedding
```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch
```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter
```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)

1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)

Create tests under tests/ named test_*.py using fixtures in conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py
```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py
```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```

3) tests/test_pipeline_windows.py
```python
from pipeline.run_pipeline import _generate_windows


def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py
```python
def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path):
    # If plotly is missing, the function should raise ImportError with guidance
    import analysis.visualize as vis

    try:
        vis._require_plotly()
    except ImportError:
        assert True
```

## Evidence pointers
- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py CREATE TABLE blocks
@ -1,43 +0,0 @@
# Stack and Dependencies

## Rules
- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)
```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes
- pytest is listed under runtime dependencies in pyproject.toml. Move pytest to dev dependencies to avoid shipping the test runner in production images. Evidence: pyproject.toml
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating a lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai appears declared but no static imports were found; it is possibly an unused dependency (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai).

## Remediations
- Move test-only libs (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai) and remove them or confirm dynamic usage.

## Evidence pointers
- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry (file: Home.py)
- database.py: duckdb table creation and connection (file: database.py lines ~1-350)
@ -1,29 +0,0 @@
# DB connection handling constraints

rules:
  - name: use_context_managers_for_connections
    rule: "Prefer using 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible."
    rationale: "Ensures proper resource cleanup and avoids connection leaks."

  - name: read_only_for_compute
    rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)."
    rationale: "Allows safe parallel workers and reduces write contention."

  - name: short_lived_writes
    rule: "When performing database writes, open short-lived connections, commit quickly and close."
    rationale: "Avoids long-lived transactions and reduces lock windows."

examples:
  - path: pipeline/svd_pipeline.py
    snippet: |
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(...).fetchall()
      finally:
          conn.close()

anti_patterns_and_remediations:
  - bad: "Creating a global connection at import that performs migrations."
    remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time."
  - bad: "Not closing connections on exceptions."
    remediation: "Wrap connects in `with` or finally: conn.close() blocks."
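The short-lived-connection rule generalizes to a small context manager. This sketch takes the connect callable as a parameter (e.g. `duckdb.connect`) so it is not tied to any one driver; the helper name is illustrative, not part of the codebase:

```python
from contextlib import contextmanager

@contextmanager
def short_lived(connect, path, **kwargs):
    """Open a connection, yield it, and always close it — even on error."""
    conn = connect(path, **kwargs)
    try:
        yield conn
    finally:
        conn.close()
```

Usage would look like `with short_lived(duckdb.connect, db_path, read_only=True) as conn: ...`, which enforces the close-on-exception remediation by construction.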
@ -0,0 +1,143 @@
---
title: Error Handling Patterns
category: constraints
severity: high
---

# Error Handling Patterns

## Core Rules

1. **Catch `Exception`, return safe fallbacks** (False/[]/None)
2. **Log exceptions with traceback** using `_logger.exception()`
3. **Never swallow exceptions silently** - always log or return a sensible default
4. **Avoid nested try/except blocks** - flatten exception handling

## Pattern: Try/Except Safe Fallback

This is the dominant pattern in the codebase (219+ instances).

```python
# Standard pattern from database.py, api_client.py, etc.
try:
    result = risky_operation()
    return process(result)
except Exception as exc:
    _logger.warning("Operation failed: %s", exc)
    return safe_fallback  # False, [], None, {}
```
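The pattern is repetitive enough that a decorator can capture it once. The sketch below illustrates that idea; it is not a utility the codebase currently defines:

```python
import functools
import logging

_logger = logging.getLogger(__name__)

def safe_fallback(default):
    """Decorator: catch Exception, log a warning, and return `default`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                _logger.warning("%s failed: %s", fn.__name__, exc)
                return default
        return wrapper
    return decorate

@safe_fallback(default=[])
def parse_rows(raw: str) -> list[int]:
    # Any ValueError here is logged and turned into the [] fallback.
    return [int(x) for x in raw.split(",")]
```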

### Examples from Codebase

**database.py** - DuckDB operations:
```python
def get_svd_vectors(self, window: str):
    try:
        conn = duckdb.connect(self.db_path, read_only=True)
        try:
            result = conn.execute(query, (window,)).fetchall()
            return self._parse_vectors(result)
        finally:
            conn.close()
    except Exception as exc:
        _logger.warning("Failed to get SVD vectors: %s", exc)
        return []
```

**ai_provider.py** - HTTP retries:
```python
try:
    resp = requests.post(url, json=json, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(f"Connection error: {exc}") from exc
    # ... retry logic
```

## Pattern: Optional Dependency Fallback

Gracefully degrade when optional packages are unavailable.

```python
# UMAP fallback in explorer_helpers.py
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False
    _logger.debug("UMAP not available, using SVD vectors directly")

def project_to_2d(vectors):
    if HAS_UMAP:
        return umap.UMAP().fit_transform(vectors)
    return vectors[:, :2]  # Fallback: first 2 SVD dimensions
```

## Anti-Patterns

### 1. Bare except with pass (CRITICAL)
**File**: `database.py`, line 47

```python
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**: Catch a specific exception, or log and continue:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped (may already exist): %s", exc)
```

### 2. Nested Exception Handling
**File**: `explorer.py`, lines 244-261

```python
# BAD - opaque error paths
try:
    result = compute_svd(motions)
except Exception:
    try:
        result = fallback_compute(motions)
    except Exception:
        pass  # Both exceptions silently dropped
```

**Fix**: Flatten and handle each case explicitly:
```python
# GOOD - explicit handling
try:
    result = compute_svd(motions)
except Exception as exc:
    _logger.warning("SVD failed, trying fallback: %s", exc)
    try:
        result = fallback_compute(motions)
    except Exception as fallback_exc:
        _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc)
        raise
```

## Rule Summary

| Pattern | When to Use | Return Value |
|---------|-------------|--------------|
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
| Re-raise | Critical operations that must succeed | raise |
| Log and continue | Optional steps in pipeline | (continue) |
| Graceful degradation | Optional dependencies | Default behavior |

## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |
@ -1,184 +0,0 @@
# Error Handling Constraints

## Core Rule

**Catch `Exception`, return safe fallbacks (False/[]/None)**

Never let exceptions propagate to user-facing code. Always provide a safe default.

## Patterns

### For Not-Found Operations

Return `None` or falsy value when item not found:

```python
# GOOD: Return None on not found
def get_motion_by_id(self, motion_id: int) -> Optional[Dict]:
    try:
        conn = duckdb.connect(self.db_path)
        result = conn.execute(
            "SELECT * FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()
        conn.close()
        return result
    except Exception:
        conn.close()
        return None
```

### For Collection Operations

Return empty list when no results:

```python
# GOOD: Return empty list on failure
def get_filtered_motions(self, **kwargs) -> List[Dict]:
    try:
        conn = duckdb.connect(self.db_path)
        rows = conn.execute(query, params).fetchall()
        conn.close()
        return rows
    except Exception:
        conn.close()
        return []
```

### For Boolean Operations

Return `False` for failed boolean checks:

```python
# GOOD: Return False on failure
def motion_exists(self, motion_id: int) -> bool:
    try:
        conn = duckdb.connect(self.db_path)
        count = conn.execute(
            "SELECT COUNT(*) FROM motions WHERE id = ?", (motion_id,)
        ).fetchone()[0]
        conn.close()
        return count > 0
    except Exception:
        return False
```

### For Creation Operations

Return `False` or empty string on failure:

```python
# GOOD: Return empty string on failure
def generate_summary(self, title: str, body: str) -> str:
    try:
        return ai_provider.chat_completion(messages)
    except ai_provider.ProviderError:
        logger.exception("AI provider failed")
        return ""
```

## Anti-Patterns to Avoid

### Don't Catch Specific Exceptions Only

```python
# BAD: Catches only FileNotFoundError, misses other issues
try:
    with open(path) as f:
        return json.load(f)
except FileNotFoundError:
    return None
```

### Don't Re-raise Without Context

```python
# BAD: Loses information
try:
    process(data)
except Exception:
    raise  # No context added
```

### Don't Swallow Exceptions Silently

```python
# BAD: No logging, no fallback
try:
    return risky_operation()
except Exception:
    pass  # What happened?
```

## Nested Exception Handling

When calling code that has its own error handling, wrap only if needed:

```python
# Accept result from wrapped function (it handles errors)
def fetch_motions(self, start_date):
    # ai_provider_wrapper handles retries internally
    embeddings = get_embeddings_with_retry(texts)

    # Only wrap if wrapper doesn't handle errors
    if all(e is None for e in embeddings):
        logger.error("All embeddings failed")
        return []

    return process(embeddings)
```

## Context Managers

Use `try/finally` for cleanup:

```python
def process_with_temp_file(self):
    temp = NamedTemporaryFile(delete=False)
    try:
        temp.write(data)
        temp.close()
        return process_file(temp.name)
    finally:
        os.unlink(temp.name)
        temp.close()
```

## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |

## Exception Propagation

Only raise exceptions for:
1. Configuration/setup errors (missing required env vars)
2. Programming errors (invalid arguments)
3. Fatal system errors (database corruption)

```python
# GOOD: Raise for configuration errors
def _get_api_key(self) -> str:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise ProviderError(
            "OPENROUTER_API_KEY environment variable is required"
        )
    return key
```

## Logging Errors

Always include context:

```python
# GOOD: Include relevant context
_logger.error(
    "Failed to fetch motion %d: %s",
    motion_id,
    exc
)

# BAD: No context
_logger.error("Failed to fetch")
```
@ -1,36 +0,0 @@
# Error handling style rules (YAML constraint example)

rules:
  - name: explicit_exceptions
    rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values."
    examples:
      - good: |
          if not isinstance(text, str):
              raise ProviderError('text must be a string')
      - bad: |
          if not isinstance(text, str):
              return []

  - name: avoid_broad_except
    rule: "Avoid 'except Exception:' that swallows errors. If broad except is used for best-effort, log the exception with logger.exception and re-raise or convert."
    examples:
      - bad: |
          try:
              do_work()
          except Exception:
              return []
      - remediation: |
          try:
              do_work()
          except SpecificError as exc:
              logger.warning('Handled error: %s', exc)
              raise

  - name: logging_over_print
    rule: "Prefer logger.* over print() for messages and errors."
    examples:
      - bad: "print('Error fetching motions from API: %s' % e)"
      - good: "logger.exception('Error fetching motions from API')"

enforcement_examples:
  - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception."
@ -0,0 +1,92 @@
---
title: Dependencies and Library Usage
category: dependencies
---

# Dependencies and Library Usage

## Core Dependencies

### duckdb
- **Required**: Yes
- **Fallback**: None (core functionality)
- **Usage**: SQL database for motions, embeddings, SVD vectors
- **Files**: database.py, analysis/*.py, pipeline/*.py

### streamlit
- **Required**: Yes
- **Fallback**: None
- **Usage**: Web UI framework
- **Files**: app.py, pages/*.py, explorer.py

### requests
- **Required**: Yes
- **Fallback**: None
- **Usage**: HTTP client for API calls
- **Files**: api_client.py, ai_provider.py

### plotly
- **Required**: Yes
- **Fallback**: None (raises ImportError)
- **Usage**: Interactive charts for explorer
- **Files**: explorer.py, explorer_helpers.py

## Optional Dependencies

### umap-learn
- **Required**: No
- **Fallback**: Use raw SVD vectors (first 2 dimensions)
- **Usage**: Dimensionality reduction for visualization
- **Files**: analysis/clustering.py

### matplotlib
- **Required**: No
- **Fallback**: Plotly or raw output
- **Usage**: Static charting
- **Files**: Various analysis scripts
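The graceful fallback for optional dependencies like umap-learn is typically a guarded import. A minimal sketch (illustrative only; the actual code in `analysis/clustering.py` may differ):

```python
import logging

_logger = logging.getLogger(__name__)

try:
    import umap  # optional dependency (umap-learn)
    _HAS_UMAP = True
except ImportError:
    _HAS_UMAP = False

def project_2d(vectors):
    """Reduce vectors to 2D: UMAP when available, else the first two
    raw SVD dimensions (the documented fallback)."""
    if _HAS_UMAP:
        return umap.UMAP(n_components=2).fit_transform(vectors)
    _logger.warning("umap-learn not installed; falling back to raw SVD dims")
    return [row[:2] for row in vectors]
```

The import check runs once at module load, so every call site degrades consistently without scattering try/except blocks through the analysis code.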
## ML Dependencies

### sklearn
- **Required**: Yes
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler
- **Files**: analysis/clustering.py, similarity/compute.py

### scipy
- **Required**: Yes
- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py

### numpy
- **Required**: Yes
- **Usage**: Array operations, linear algebra
- **Files**: Throughout codebase

## Key Imports by File

### explorer.py
- `import streamlit as st`
- `from database import db`
- `from explorer_helpers import *`

### explorer_helpers.py
- `import pandas as pd`
- `import plotly.graph_objects as go`
- `from database import db` (optional, for type hints)

### database.py
- `import ibis`
- `import duckdb`
- `from config import config, PARTY_COLOURS`

### config.py
- `from dataclasses import dataclass, field`
- `import streamlit as st` (optional, for warnings)

## Singleton Instances

| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |
@ -1,78 +0,0 @@
# Dependencies

## Core Library Wiring

### Database Layer
```
ibis → DuckDB → MotionDatabase singleton (database.py)
  ↑
sqlglot (ibis dependency)
```

### Data Processing
```
pandas → (used throughout for DataFrame operations)
numpy  → (used by sklearn, scipy, umap)
scipy  → spatial.procrustes for window alignment
```

### ML Pipeline
```
sklearn.cluster       → KMeans, Procrustes
sklearn.preprocessing → StandardScaler
umap                  → UMAP (optional, graceful fallback)
```

### Visualization
```
plotly          → explorer_helpers.py chart builders
st.plotly_chart → explorer.py rendering
```

### Streamlit
```
streamlit → all pages, @st.cache_data decorators
```

## Optional Dependencies
| Package | Required | Fallback |
|---------|----------|----------|
| `umap` | No | Use raw SVD vectors (first 2 dims) |
| `plotly` | Yes | Raises ImportError |
| `duckdb` | Yes | — |
| `ibis` | Yes | — |
| `sklearn` | Yes | — |

## Singleton Instances
| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |

## Key Imports by File
```
explorer.py:
  - import streamlit as st
  - from database import db
  - from explorer_helpers import *

explorer_helpers.py:
  - import pandas as pd
  - import plotly.graph_objects as go
  - from database import db (optional, for type hints)

database.py:
  - import ibis
  - import duckdb
  - from config import config, PARTY_COLOURS

config.py:
  - from dataclasses import dataclass, field
  - import streamlit as st (optional, for warnings)
```

## Environment
- Python ≥3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)
@ -0,0 +1,146 @@
---
title: Domain Glossary
category: domain
---

# Domain Glossary - Dutch Political Terms

## CRITICAL INVARIANTS

> **Rule 1**: Centroid of right-wing parties on RIGHT side of ALL axes
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT
> - Individual right-wing parties may vary slightly from the centroid
> - This is non-negotiable for any compass/axis visualization

> **Rule 2**: SVD labels are empirically derived from voting data
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative)
> - See SVD Label Derivation section below

---

## SVD Label Derivation

### The Process

SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis:

1. **Identify outliers**: Find the two MPs with most extreme positions on that axis
2. **Select representative motions**: Pick 20 motions where these outliers disagreed most sharply (10 they voted opposite on, 10 where both voted same direction but with other extremes)
3. **Interpret theme**: Read the motion titles to derive what the axis represents
4. **Assign label**: Label describes the empirical theme, could be:
   - Left-Right
   - Coalition-Opposition
   - Progressive-Conservative
   - EU-National sovereignty
   - Populist-Establishment
   - Or whatever the voting patterns show
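Steps 1-2 above can be sketched with NumPy (a minimal illustration; the function name and exact selection criterion are assumptions, not the project's actual implementation):

```python
import numpy as np

def axis_outliers_and_motions(positions, votes, axis=0, k=20):
    """Find the two most extreme MPs on one SVD axis, then the k
    motions where their votes diverge most sharply.

    positions: (n_mps, n_dims) SVD coordinates
    votes:     (n_mps, n_motions) matrix of +1 / 0 / -1 votes
    """
    coords = positions[:, axis]
    lo, hi = int(np.argmin(coords)), int(np.argmax(coords))
    # Per-motion disagreement: distance between the two outliers' votes
    gap = np.abs(votes[hi] - votes[lo])
    motion_ids = np.argsort(gap)[::-1][:k]
    return lo, hi, motion_ids
```

Reading the titles of the returned motions (step 3) is then a manual interpretation step before a label is assigned.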
### Example

| Step | Description |
|------|-------------|
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 |
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 |
| 20 Motions | Immigration, integration, law & order themes dominate |
| Label | "Links-Rechts" (Left-Right) |

### Labeling Rules

- **Never use party names in labels** (e.g., not "PVV-SP axis")
- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show)
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy")
- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2"

---

## Core Entities

### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**

### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector

### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party

### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors

---

## Time & Analysis Concepts

### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window

### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering

---

## Mathematical / Algorithmic Terms

### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space

### SVD Label
- Empirically derived axis label based on outlier MPs and representative motions
- Describes the theme of disagreement on that axis
- NOT based on party ideology or semantic labels

### Political Compass
- 2D visualization with SVD axes mapped to compass quadrants
- X-axis: First SVD dimension (labeled from voting data)
- Y-axis: Second SVD dimension (labeled from voting data)

### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters
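Since the stack documentation lists `scipy.spatial.procrustes` for window alignment, the idea can be shown in a few lines (a toy example with made-up coordinates, not project data):

```python
import numpy as np
from scipy.spatial import procrustes

# Two windows' SVD positions for the same three MPs (rows aligned by MP).
window_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# Same configuration, rotated 90° and scaled by 2: differences that
# Procrustes alignment is meant to remove.
window_b = np.array([[0.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])

# mtx1/mtx2 are standardized, optimally aligned copies; disparity is
# the residual sum of squares after alignment (0 = identical shapes).
mtx1, mtx2, disparity = procrustes(window_a, window_b)
```

A disparity near zero confirms the two windows describe the same relative positions, so the aligned coordinates are comparable across years/quarters.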
### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency with graceful SVD fallback

---

## Database Table Reference

| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |

---

## Dutch Political Parties

### Canonical Right-Wing (centroid on RIGHT of axes)
- PVV (Partij voor de Vrijheid)
- FVD (Forum voor Democratie)
- JA21
- SGP (Staatkundig Gereformeerde Partij)

### Other Major Parties
- VVD (Volkspartij voor Vrijheid en Democratie)
- GL-PvdA (GroenLinks-PvdA)
- NSC (Nieuw Sociaal Contract)
- BBB (BoerBurgerBeweging)
- SP (Socialistische Partij)
- D66 (Democraten 66)
@ -1,107 +0,0 @@
# Domain Glossary - Dutch Political Terms

## Core Entities

### Motion / Motie
- Parliamentary motion submitted by MPs
- Fields: `id`, `title`, `date`, `category`
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent**

### MP / Kamerlid
- Member of Parliament (Tweede Kamerlid)
- Identified by full name (e.g., "Van Dijk, I.")
- Has voting record, party affiliation, SVD position vector
- Historical: `mp_party_history` tracks party changes over time

### Party / Fractie
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD")
- Party centroids: average SVD position of all MPs in party
- Aliases: multiple spelling variants exist (see anti-patterns.yaml)

### Vote / Stemming
- Individual MP's vote on a motion: +1, 0, -1
- Aggregated to compute SVD vectors

---

## Time & Analysis Concepts

### Window / Tijdsvenster
- Time period for analysis (annual or quarterly)
- Values: "2023", "2023-Q1", "2024", etc.
- SVD vectors computed per window
- Windows can be aligned across time using Procrustes

### Trajectory
- MP's position change across multiple windows
- Computed from `svd_vectors` + window ordering
- Used for trend analysis in Evolution tab

---

## Mathematical / Algorithmic Terms

### SVD Vector
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix
- Represents MP's position in political space
- `entity_id` in `svd_vectors`: either MP name (when individual MPs) or party name (when party-level)

### Political Compass
- 2D visualization: X-axis = Left↔Right, Y-axis = Progressive↔Conservative
- SVD vectors mapped to compass quadrants
- UMAP used for projection

### Procrustes Alignment
- Algorithm to align SVD vectors across time windows
- Ensures comparable positions across years/quarters
- Implemented via `scipy.spatial.procrustes` or scikit-learn

### Centroid
- Geometric center of a set of points
- Party centroid = average SVD position of all MPs in that party
- Computed from `svd_vectors` filtered by party

### UMAP
- Uniform Manifold Approximation and Projection
- Dimensionality reduction for visualization
- Optional dependency — graceful fallback if unavailable

---

## Visualization

### PARTY_COLOURS
- Dict mapping party names to hex color codes
- Used in all Plotly charts for consistent party coloring
- Source: `config.py` → `PARTY_COLOURS` constant
- **Issue**: 3 separate alias dictionaries exist (no single source of truth)

---

## Application Pages

### Home
- Landing page with app overview

### Stemwijzer (Quiz)
- User answers questions → matched to parties
- Thin wrapper around quiz module

### Explorer (4 tabs)
- **Motion tab**: SVD positions colored by vote on selected motion
- **MP tab**: Individual MP trajectories across windows
- **Party tab**: Party centroids with members as scatter
- **Evolution tab**: How positions change over time

---

## Database Table Reference
| Table | Key Fields |
|-------|-----------|
| `motions` | id, title, date, category |
| `mp_votes` | mp_id, motion_id, vote |
| `svd_vectors` | entity_id, window, vector_2d (list[2]) |
| `party_centroids` | party, window, centroid_2d |
| `mp_party_history` | mp_id, party, start_date, end_date |
| `windows` | window_id, start_date, end_date, period_type |
| `mp_trajectories` | mp_id, window, trajectory_vector |
@ -0,0 +1,79 @@
---
title: DuckDB Access Pattern
category: patterns
---

# DuckDB Access Pattern

## Rules

- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle.
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).

## Examples

### database.py - Explicit connect/close for schema init

```python
conn = duckdb.connect(self.db_path)
...
conn.execute("""
    CREATE TABLE IF NOT EXISTS fused_embeddings (
        id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
        motion_id INTEGER NOT NULL,
        window_id TEXT NOT NULL,
        vector JSON NOT NULL,
        svd_dims INTEGER NOT NULL,
        text_dims INTEGER NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (id)
    )
""")
conn.close()
```

### pipeline/svd_pipeline.py - Read-only connection

```python
conn = duckdb.connect(db_path, read_only=True)
try:
    rows = conn.execute(
        "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
        (start_date, end_date),
    ).fetchall()
finally:
    conn.close()
```

### similarity/compute.py - Preferred 'with' context

```python
try:
    import duckdb
except Exception:
    logger.exception("duckdb import failed; cannot load vectors")
    return 0

with duckdb.connect(db.db_path) as conn:
    rows = conn.execute(query, params).fetchall()
```

## Anti-Patterns

### Bad: Connection without closure

```python
# BAD: connection may leak if exception occurs before explicit close
conn = duckdb.connect(db_path)
rows = conn.execute("SELECT ...").fetchall()
# missing finally/close
```

**Remediation**: Use "with" context or ensure conn.close() in finally block.

### Bad: Parallel write connections

**Problem**: Opening write connections from many parallel workers without coordination.

**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.
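The single-writer remediation can be sketched with a queue feeding one writer thread (a generic pattern sketch; the function and queue names are hypothetical, and the project may centralize writes differently):

```python
import queue
import threading

def writer_loop(connect, write_queue):
    """Single writer: drain queued (sql, params) pairs on one connection.

    Workers enqueue writes instead of opening their own write
    connections; only this thread ever holds the write handle.
    """
    conn = connect()
    try:
        while True:
            item = write_queue.get()
            if item is None:  # sentinel: shut down cleanly
                break
            sql, params = item
            conn.execute(sql, params)
    finally:
        conn.close()

# A worker would enqueue writes like:
# write_queue.put(("INSERT INTO similarity_cache VALUES (?, ?)", (a, b)))
```

Compute workers keep their read_only connections for queries, and all mutations funnel through the one queue, which avoids write-lock contention.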
@ -1,70 +0,0 @@
name: duckdb_access

rules:
  - Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
  - Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
  - If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle.
  - Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).

examples:
  - path: database.py
    excerpt: |
      ```python
      conn = duckdb.connect(self.db_path)
      ...
      conn.execute("""
          CREATE TABLE IF NOT EXISTS fused_embeddings (
              id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
              motion_id INTEGER NOT NULL,
              window_id TEXT NOT NULL,
              vector JSON NOT NULL,
              svd_dims INTEGER NOT NULL,
              text_dims INTEGER NOT NULL,
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
              PRIMARY KEY (id)
          )
      """)
      conn.close()
      ```
    note: explicit connect/close used when initializing schema

  - path: pipeline/svd_pipeline.py
    excerpt: |
      ```python
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(
              "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
              (start_date, end_date),
          ).fetchall()
      finally:
          conn.close()
      ```
    note: read_only connection used for compute-heavy worker

  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0

      with duckdb.connect(db.db_path) as conn:
          rows = conn.execute(query, params).fetchall()
      ```
    note: preferred 'with' context for automatic close

anti_patterns:
  - Bad: creating a connection without closure in a long-running process
    remediation: use "with" context or ensure conn.close() in finally block
    example: |
      ```python
      # BAD: connection may leak if exception occurs before explicit close
      conn = duckdb.connect(db_path)
      rows = conn.execute("SELECT ...").fetchall()
      # missing finally/close
      ```
  - Bad: Opening write connections from many parallel workers without coordination
    remediation: open read_only for compute processes and centralize writes via short-lived connections or a single writer worker.
@ -0,0 +1,74 @@
---
title: Embeddings Similarity Pipeline
category: patterns
---

# Embeddings Similarity Pipeline

## Rules

- Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure.
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text].
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
- Use read_only DuckDB connections in compute workers to allow parallel runs.

## Examples

### pipeline/ai_provider_wrapper.py - Batched embed + fallback

```python
for start in range(0, len(texts), batch_size):
    chunk = texts[start : start + batch_size]
    resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
...
for j in range(i, end):
    t = texts[j]
    single, single_exc = _attempt_batch([t], j)
    if single:
        results[j] = single[0]
```

### pipeline/fusion.py - Concatenation and storage

```python
try:
    svd_vec = json.loads(svd_json)
except Exception:
    _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
    skipped_missing_svd += 1
    continue
...
fused = list(svd_vec) + list(text_vec)
res = db.store_fused_embedding(
    int(entity_id),
    window_id,
    fused,
    svd_dims=len(svd_vec),
    text_dims=len(text_vec),
)
```

### similarity/compute.py - Normalized cosine similarity

```python
# Normalize rows
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
sim = normalized @ normalized.T
...
# pick top-k neighbors and write to similarity_cache
```
|
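The elided top-k step in the compute.py excerpt can be sketched like this (illustrative only; the real compute.py code and the neighbor count `k` are assumptions):

```python
import numpy as np

def top_k_neighbors(sim: np.ndarray, k: int = 3) -> list[list[int]]:
    # For each row, rank candidate indices by descending similarity,
    # drop the self-similarity entry on the diagonal, and keep the first k.
    neighbors = []
    for i, row in enumerate(sim):
        order = np.argsort(row)[::-1]
        neighbors.append([int(j) for j in order if j != i][:k])
    return neighbors
```

The resulting index lists are what would be written to similarity_cache.
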
## Anti-Patterns

### Bad: Assuming consistent vector length

**Problem**: Assuming consistent vector length without checks leads to shape errors.

**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).

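A minimal sketch of the padding remediation above (illustrative; the actual compute.py implementation may differ):

```python
import logging

import numpy as np

_logger = logging.getLogger(__name__)

def pad_to_matrix(vectors: list[list[float]]) -> np.ndarray:
    # Pad shorter vectors with zeros so every row has the same width,
    # logging a warning when lengths are inconsistent.
    width = max(len(v) for v in vectors)
    if any(len(v) != width for v in vectors):
        _logger.warning("Inconsistent vector lengths; padding to %d dims", width)
    out = np.zeros((len(vectors), width))
    for i, v in enumerate(vectors):
        out[i, : len(v)] = v
    return out
```
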
### Bad: Inline heavy computation in UI

**Problem**: Recomputing heavy pipelines inline in UI requests.

**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI.
@ -1,63 +0,0 @@
name: embeddings_similarity_pipeline

rules:
  - Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure.
  - Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text].
  - Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache.
  - Use read_only DuckDB connections in compute workers to allow parallel runs.

examples:
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      for start in range(0, len(texts), batch_size):
          chunk = texts[start : start + batch_size]
          resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
      ...
      for j in range(i, end):
          t = texts[j]
          single, single_exc = _attempt_batch([t], j)
          if single:
              results[j] = single[0]
      ```
    note: batched embed + fallback per-item retry
  - path: pipeline/fusion.py
    excerpt: |
      ```python
      try:
          svd_vec = json.loads(svd_json)
      except Exception:
          _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
          skipped_missing_svd += 1
          continue
      ...
      fused = list(svd_vec) + list(text_vec)
      res = db.store_fused_embedding(
          int(entity_id),
          window_id,
          fused,
          svd_dims=len(svd_vec),
          text_dims=len(text_vec),
      )
      ```
    note: concatenation of vectors and storage via MotionDatabase
  - path: similarity/compute.py
    excerpt: |
      ```python
      # Normalize rows
      norms = np.linalg.norm(matrix, axis=1, keepdims=True)
      norms[norms == 0] = 1.0
      normalized = matrix / norms
      sim = normalized @ normalized.T
      ...
      # pick top-k neighbors and write to similarity_cache
      ```
    note: numeric pipeline and padding to consistent dimensionality
anti_patterns:
  - Bad: Assuming consistent vector length without checks (leads to shape errors).
    remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
  - Bad: Recomputing heavy pipelines inline in UI requests.
    remediation: schedule heavy work in scripts/subprocesses and read precomputed results in UI.
@ -0,0 +1,63 @@
---
title: Error Handling Pattern
category: patterns
---

# Error Handling Pattern

## Rules

- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
- Prefer logging.exception when catching an exception where the stack trace is useful.
- Avoid broad `except:` clauses that swallow exceptions; if a broad except is used for "best-effort" fallback, log at warning level and include the original exception context.
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False); only return safe defaults where documented.

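The "typed exceptions over magic values" rule can be sketched as follows (a hedged illustration: ProviderError's real definition lives in ai_provider.py, and `fetch_embedding` is a hypothetical helper, not project code):

```python
class ProviderError(Exception):
    """Domain error for provider failures (sketch; see ai_provider.py)."""

def fetch_embedding(payload: dict) -> list[float]:
    # Raise a typed error instead of returning [] so callers can
    # distinguish "no data" from "the provider call failed".
    if "vector" not in payload:
        raise ProviderError("provider response missing 'vector'")
    return payload["vector"]
```

A caller can then catch ProviderError specifically instead of testing for an ambiguous empty list.
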
## Examples

### ai_provider.py - Network error to ProviderError

```python
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(
            f"Connection error when calling provider: {exc}"
        ) from exc
...
```

### pipeline/ai_provider_wrapper.py - Best-effort with logging

```python
except Exception:
    _logger.exception("Failed to append audit event for embedding failure")
    results[j] = None
```

### similarity/compute.py - Defensive import handling

```python
try:
    import duckdb
except Exception:
    logger.exception("duckdb import failed; cannot load vectors")
    return 0
```

## Anti-Patterns

### Bad: Silent exception swallowing

```python
try:
    do_work()
except Exception:
    return []
# BAD: hides the root cause and returns an ambiguous default
```

**Remediation**: Narrow the exception types, or at minimum call logger.exception() and re-raise or convert to a domain error if truly handled.

### Bad: Mixing print() and logging

**Problem**: Mixing print() and logging for errors.

**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration.
@ -1,54 +0,0 @@
name: error_handling

rules:
  - Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
  - Prefer logging.exception when catching an exception where stack trace is useful.
  - Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context.
  - For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      except requests.ConnectionError as exc:
          if attempt == retries:
              raise ProviderError(
                  f"Connection error when calling provider: {exc}"
              ) from exc
      ...
      ```
    note: mapping network error to ProviderError with re-raise chaining
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      except Exception:
          _logger.exception("Failed to append audit event for embedding failure")
          results[j] = None
      ```
    note: logs and assigns None for failure; fallback behavior documented earlier in wrapper rule
  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0
      ```
    note: defensive import handling and early return on failure
anti_patterns:
  - Bad: Broad except without logging and without re-raising (silently hides bugs)
    remediation: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled.
    example: |
      ```python
      try:
          do_work()
      except Exception:
          return []
      # BAD: hides the root cause and returns an ambiguous default
      ```
  - Bad: Mixing print() and logging for errors
    remediation: Replace print() calls with logger.* calls; use structured logging configuration.
@ -0,0 +1,41 @@
---
title: Module Singletons Pattern
category: patterns
---

# Module Singletons Pattern

## Rules

- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
  - Avoid expensive initialization at import time.
  - Provide a way to construct with a test DB path or to reinitialize in tests.
  - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

## Examples

### database.py - Safe class initialization

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        # If duckdb is not available, operate in lightweight file-backed mode
        self._file_mode = duckdb is None
        self._init_database()
```

### similarity/lookup.py - Local instances

```python
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
if hasattr(db, "get_cached_similarities"):
    rows = db.get_cached_similarities(...)
```

## Anti-Patterns

### Bad: Heavy initialization at import time

**Problem**: Creating connections and performing heavy schema migrations during import.

**Remediation**: Move heavy init to an explicit initialize() method and keep import fast.

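A sketch of that remediation, assuming a hypothetical `initialize()` split (this is not the current MotionDatabase API; the connection stand-in is illustrative):

```python
class MotionDatabase:
    def __init__(self, db_path: str = "motions.duckdb"):
        # Import-time work stays cheap: just record configuration.
        self.db_path = db_path
        self._conn = None

    def initialize(self) -> "MotionDatabase":
        # Heavy work (connections, schema migrations) runs only when
        # explicitly requested, keeping `import database` fast.
        if self._conn is None:
            self._conn = {"path": self.db_path}  # stand-in for a real connection
        return self

db = MotionDatabase()  # cheap at import time; call db.initialize() before use
```

Tests can then construct `MotionDatabase(db_path=tmp_path)` and initialize it explicitly, instead of fighting a connection opened at import.
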
@ -1,33 +0,0 @@
name: module_singletons

rules:
  - Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully:
    - Avoid expensive initialization at import time.
    - Provide a way to construct with a test DB path or to reinitialize in tests.
    - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

examples:
  - path: database.py
    excerpt: |
      ```python
      class MotionDatabase:
          def __init__(self, db_path: str = config.DATABASE_PATH):
              self.db_path = db_path
              # If duckdb is not available, operate in lightweight file-backed mode
              self._file_mode = duckdb is None
              self._init_database()
      ```
    note: class is safe to instantiate and creates DB at init; consider lazy init if heavy
  - path: similarity/lookup.py
    excerpt: |
      ```python
      db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
      if hasattr(db, "get_cached_similarities"):
          rows = db.get_cached_similarities(...)
      ```
    note: consumers create local MotionDatabase instances, not relying on a single global
anti_patterns:
  - Bad: Creating connections and performing heavy schema migrations during import
    remediation: Move heavy init to an explicit initialize() method and keep import fast.
@ -0,0 +1,77 @@
---
title: Requests HTTP Pattern
category: patterns
---

# Requests HTTP Pattern

## Rules

- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
- Raise or wrap non-OK responses into a domain ProviderError to make behavior consistent across the codebase.

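The 4xx-vs-5xx rule can be sketched as a small classifier (an illustrative helper, not part of the codebase):

```python
def is_retryable(status_code: int) -> bool:
    # 5xx and 429 are transient and worth retrying; other 4xx codes are
    # client/configuration errors that retrying will not fix.
    return status_code == 429 or 500 <= status_code < 600
```

A retry loop can call this once per response and raise ProviderError immediately for terminal codes.
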
## Examples

### ai_provider.py - 429 handling with Retry-After

```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
    if attempt == retries:
        raise ProviderError(f"Provider returned HTTP {resp.status_code}")
    retry_after = None
    raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
    if raw:
        try:
            retry_after = int(raw)
        except Exception:
            ...
    if retry_after is not None:
        time.sleep(retry_after)
    continue
```

### api_client.py - Session + raise_for_status

```python
response = self.session.get(
    base_url, params=params, timeout=config.API_TIMEOUT
)
response.raise_for_status()
data = response.json()
```

### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper

```python
def _attempt_batch(chunk_texts, start_index):
    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            emb_chunk = _embedder(
                chunk_texts, model=model, batch_size=len(chunk_texts)
            )
            return emb_chunk, None
        except Exception as exc:
            if attempt == retries:
                break
            sleep = backoff * (2 ** (attempt - 1))
            time.sleep(sleep)
            continue
```

## Anti-Patterns

### Bad: Silent exception swallowing

**Problem**: Blindly catching all requests exceptions and returning an empty response.

**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log the details.

### Bad: Using print() for errors

**Problem**: Using print() for network errors instead of structured logging.

**Remediation**: Use `_logger.exception()` instead (api_client.py still needs this fix).
@ -1,65 +0,0 @@
name: requests_http

rules:
  - Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling.
  - Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
  - Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
  - Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      resp = requests.post(url, json=json, headers=headers, timeout=10)
      ...
      if getattr(resp, "status_code", 0) == 429:
          if attempt == retries:
              raise ProviderError(f"Provider returned HTTP {resp.status_code}")
          retry_after = None
          raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
          if raw:
              try:
                  retry_after = int(raw)
              except Exception:
                  ...
          if retry_after is not None:
              time.sleep(retry_after)
          continue
      ```
    note: explicit handling of 429 and Retry-After
  - path: api_client.py
    excerpt: |
      ```python
      response = self.session.get(
          base_url, params=params, timeout=config.API_TIMEOUT
      )
      response.raise_for_status()
      data = response.json()
      ```
    note: uses session + raise_for_status() to surface HTTP errors
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      def _attempt_batch(chunk_texts, start_index):
          backoff = 0.5
          for attempt in range(1, retries + 1):
              try:
                  emb_chunk = _embedder(
                      chunk_texts, model=model, batch_size=len(chunk_texts)
                  )
                  return emb_chunk, None
              except Exception as exc:
                  if attempt == retries:
                      break
                  sleep = backoff * (2 ** (attempt - 1))
                  time.sleep(sleep)
                  continue
      ```
    note: wrapper adds retry/backoff and per-item fallback
anti_patterns:
  - Bad: Blindly catching all requests exceptions and returning empty response
    remediation: map network exceptions to retryable vs terminal (ProviderError) and log details.
  - Bad: Using print() for network errors instead of structured logging (see api_client.py where print() is used; prefer logging).
@ -0,0 +1,37 @@
---
title: Validation Pattern
category: patterns
---

# Validation Pattern

## Rules

- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
- Tests should assert that invalid inputs raise the expected exceptions.
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).

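The testing rule above can be exercised with pytest.raises (a hedged sketch: the ProviderError stand-in and `get_embedding` helper are illustrative, not the real ai_provider.py code):

```python
import pytest

class ProviderError(Exception):
    """Stand-in for the real class defined in ai_provider.py."""

def get_embedding(text):
    # Fail fast on contract violations instead of embedding garbage.
    if not isinstance(text, str):
        raise ProviderError("text must be a string")
    return [0.0]

def test_non_string_rejected():
    with pytest.raises(ProviderError):
        get_embedding(42)
```
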
## Examples

### ai_provider.py - Type validation

```python
if not isinstance(text, str):
    raise ProviderError("text must be a string")
```

### pipeline/ai_provider_wrapper.py - Defensive empty handling

```python
if not texts:
    return []
if motion_ids is None:
    motion_ids = [None for _ in texts]
```

## Anti-Patterns

### Bad: Invalid values into computation

**Problem**: Allowing invalid values to propagate into heavy computation (e.g., a non-string into the embedding pipeline).

**Remediation**: Fail fast with a typed exception and add unit tests to cover validations.
@ -1,29 +0,0 @@
name: validation

rules:
  - Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
  - Tests should assert that invalid inputs raise the expected exceptions.
  - Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding).

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      if not isinstance(text, str):
          raise ProviderError("text must be a string")
      ```
    note: explicit type validation before network call
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      if not texts:
          return []
      if motion_ids is None:
          motion_ids = [None for _ in texts]
      ```
    note: defensive handling of empty inputs
anti_patterns:
  - Bad: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline).
    remediation: Fail fast with a typed exception and add unit tests to cover validations.
@ -0,0 +1,67 @@
---
title: Tech Stack
category: stack
---

# Tech Stack

## Runtime & Language
- **Python >=3.13**

## Web Framework
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages

## Data Layer
- **DuckDB** - Embedded OLAP database
  - Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata
- **ibis** - ORM (referenced but DuckDB-native implementation used)

## AI / LLM
- **OpenRouter** - API abstraction for AI providers
- **QWEN** - Primary model
  - Embeddings: `qwen/qwen3-embedding-4b`
  - Chat: `qwen/qwen-2.5-72b-instruct`
- **requests** - HTTP client (not raw openai)

## ML / Analytics
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD)
- **numpy** - Numerical computing

## Visualization
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback)
- **matplotlib** - Static plotting (optional)

## HTTP & Parsing
- **requests** - Session pooling, retry with backoff
- **beautifulsoup4** - HTML parsing
- **lxml** - XML/HTML processing

## Key Source Files

| File | Purpose |
|------|---------|
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema |
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) |
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders |
| `analysis/` | SVD pipeline, UMAP projection, clustering |
| `pipeline/` | Data fetch, transform, store pipeline |
| `pages/1_Stemwijzer.py` | Quiz page |
| `pages/2_Explorer.py` | Explorer page |
| `config.py` | Dataclass Config pattern |
| `ai_provider.py` | OpenRouter API wrapper with retry |
| `api_client.py` | TweedeKamer OData API client |

## Singleton Instances

| Module | Instance | Type |
|--------|----------|------|
| `database.py` | `db` | `MotionDatabase` |
| `config.py` | `config` | `Config` (dataclass) |
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` |

## Environment
- Python >=3.13
- Environment variables via `.env` (DB path, API keys)
- No `.env` values in constraint files (security)