Removes:
- .mindmodel/ directory and related CI workflows (mindmodel-schedule.yml, mindmodel-validation.yml)
- scripts/mindmodel/ and scripts/validate_mindmodel.py
- src/types/ and src/validators/ (orphaned type modules, only used by mindmodel)
- tests/ci/, tests/scripts/mindmodel/, tests/types/, tests/validators/ (mindmodel-only tests)
- thoughts/ledgers/ and thoughts/shared/ (stale transient directories)
- .venv_axis and .venv_plotly (orphaned virtual environments, ~1.1 GB)
- outputs/blog-charts/ (stale generated HTML files)
- data/*.json sidecars (empty cache artifacts)
- __pycache__ and *.pyc files across repo

Updates:
- .gitignore: remove thoughts/shared/analyses/ entry

Space reclaimed: ~1.1 GB+
main
parent
6e36fa2604
commit
07dd393533
@ -1,37 +0,0 @@ |
||||
name: mindmodel scheduled validate |
||||
|
||||
on: |
||||
schedule: |
||||
- cron: '0 0 * * 0' # weekly |
||||
|
||||
jobs: |
||||
validate: |
||||
runs-on: ubuntu-latest |
||||
steps: |
||||
- name: Checkout |
||||
uses: actions/checkout@v4 |
||||
|
||||
- name: Install uv |
||||
uses: astral-sh/setup-uv@v5 |
||||
with: |
||||
version: "0.6.x" |
||||
|
||||
- name: Set up Python |
||||
uses: actions/setup-python@v5 |
||||
with: |
||||
python-version: "3.13" |
||||
|
||||
- name: Install dependencies |
||||
run: uv sync --locked |
||||
|
||||
- name: Run tests |
||||
run: uv run pytest tests/ -q |
||||
|
||||
- name: Run mindmodel validator if manifest exists |
||||
if: ${{ always() }} |
||||
run: | |
||||
if [ -f .mindmodel/manifest.yaml ]; then |
||||
uv run python -m scripts.mindmodel.cli || true |
||||
else |
||||
echo "No .mindmodel/manifest.yaml present — skipping validator" |
||||
fi |
||||
@ -1,47 +0,0 @@ |
||||
name: mindmodel validation |
||||
|
||||
on: |
||||
push: |
||||
branches: [ main ] |
||||
pull_request: |
||||
branches: [ main ] |
||||
|
||||
jobs: |
||||
validate: |
||||
runs-on: ubuntu-latest |
||||
steps: |
||||
- name: Checkout |
||||
uses: actions/checkout@v4 |
||||
|
||||
- name: Set up Python |
||||
uses: actions/setup-python@v4 |
||||
with: |
||||
python-version: '3.x' |
||||
|
||||
- name: Install development dependencies (if present) |
||||
run: | |
||||
python -m pip install --upgrade pip |
||||
if [ -f requirements-dev.txt ]; then |
||||
pip install -r requirements-dev.txt |
||||
else |
||||
echo "requirements-dev.txt not found, skipping" |
||||
fi |
||||
|
||||
- name: Run mindmodel validator (report-only) |
||||
if: ${{ always() }} |
||||
run: | |
||||
# Make this step report-only: run the validator but always exit 0 so PRs are not blocked |
||||
set +e |
||||
if [ -f .mindmodel/manifest.yaml ]; then |
||||
python scripts/validate_mindmodel.py --manifest .mindmodel/manifest.yaml --report reports/out.json || true |
||||
else |
||||
echo "No .mindmodel/manifest.yaml present — skipping validator" |
||||
fi |
||||
exit 0 |
||||
|
||||
- name: Upload mindmodel reports |
||||
if: ${{ always() }} |
||||
uses: actions/upload-artifact@v4 |
||||
with: |
||||
name: mindmodel-reports |
||||
path: reports/mindmodel-report-*.json |
||||
@ -1,11 +0,0 @@ |
||||
# .mindmodel |
||||
|
||||
This directory contains a generated, read-only snapshot of the repository's "mind model" — structured metadata and evidence used by tooling to reason about repository intent, patterns, and decisions. |
||||
|
||||
Guidelines |
||||
- Read-only: Treat files in this directory as generated artifacts. Local tooling or CI may regenerate or validate them; avoid manual edits unless you are intentionally updating the generator. |
||||
- No secrets: Do not place any credentials, tokens, or sensitive data here. The validator that consumes this folder is designed to detect common secret patterns and will fail if secrets are found. |
||||
- Safe to read: Tools and CI may read these files. They must avoid opening or parsing arbitrary repository secrets and should operate in read-only mode. |
||||
- Validation: CI workflows will run a validator against this folder (if present) to ensure manifest shape, evidence snippets, and referenced files meet project rules. |
||||
|
||||
If you need to propose a change to the mind model, open a PR describing the intent and the generator changes. The CI validator will validate the submitted artifact before merge. |
||||
@ -1,127 +0,0 @@ |
||||
--- |
||||
title: Anti-Patterns in Stemwijzer |
||||
category: anti-patterns |
||||
severity: critical |
||||
--- |
||||
|
||||
# Anti-Patterns |
||||
|
||||
> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details. |
||||
|
||||
## CRITICAL: print() Instead of Logging |
||||
|
||||
**File**: `api_client.py` |
||||
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)` |
||||
|
||||
**Broken code**: |
||||
```python |
||||
def get_motions(self, ...): |
||||
try: |
||||
# ... |
||||
print(f"Fetched {len(voting_records)} voting records from API") # BAD |
||||
print(f"Processed into {len(motions)} unique motions") # BAD |
||||
except Exception as e: |
||||
print(f"Error fetching motions from API: {e}") # BAD - no traceback |
||||
``` |
||||
|
||||
**Fix**: |
||||
```python |
||||
import logging |
||||
|
||||
_logger = logging.getLogger(__name__) |
||||
|
||||
def get_motions(self, ...): |
||||
try: |
||||
_logger.info("Fetched %d voting records from API", len(voting_records)) |
||||
_logger.info("Processed into %d unique motions", len(motions)) |
||||
except Exception as e: |
||||
_logger.exception("Error fetching motions from API: %s", e) |
||||
return [] |
||||
``` |
||||
|
||||
--- |
||||
|
||||
## CRITICAL: Global `_DummySt` Replacement |
||||
|
||||
**File**: `explorer.py` |
||||
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement |
||||
|
||||
**Problem**: Creates a module-level variable `st` that shadows the `streamlit` module, causing subtle bugs. |
||||
|
||||
**Fix**: Use conditional flags instead of global replacement: |
||||
```python |
||||
# GOOD: Use conditional logic |
||||
try: |
||||
import plotly.express as px |
||||
import plotly.graph_objects as go |
||||
HAS_PLOTLY = True |
||||
except ImportError: |
||||
HAS_PLOTLY = False |
||||
px = None |
||||
go = None |
||||
|
||||
def render_chart(data): |
||||
if not HAS_PLOTLY: |
||||
_logger.warning("Plotly not available") |
||||
return |
||||
# ... rest of chart logic |
||||
``` |
||||
|
||||
--- |
||||
|
||||
## WARNING: Logger Naming Inconsistency |
||||
|
||||
**Evidence**: 16 files use `logger`, 17 files use `_logger` |
||||
|
||||
**Files with `logger`** (without underscore): |
||||
- api_client.py, ai_provider.py, pipeline files, analysis files |
||||
|
||||
**Files with `_logger`** (with underscore): |
||||
- database.py, explorer.py, explorer_helpers.py |
||||
|
||||
**Recommendation**: Standardize on `_logger` for module-level loggers. |
||||
|
||||
--- |
||||
|
||||
## WARNING: Bare except with pass |
||||
|
||||
**File**: `database.py`, line 47 |
||||
|
||||
```python |
||||
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except: # bare except |
||||
pass |
||||
``` |
||||
|
||||
**Fix**: |
||||
```python |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except Exception as exc: |
||||
_logger.debug("Sequence creation skipped: %s", exc) |
||||
``` |
||||
|
||||
--- |
||||
|
||||
## INVESTIGATED: Entity-ID / Party-Name Mismatch |
||||
|
||||
**Status**: INVALID - investigated and resolved |
||||
|
||||
**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists. |
||||
|
||||
--- |
||||
|
||||
## Pattern: Three Separate Party Alias Dictionaries |
||||
|
||||
**Problem**: Party name variations exist in 3+ places with no canonical alias mapping. |
||||
|
||||
**Fix**: Create one `PARTY_ALIASES` dict in `config.py`: |
||||
```python |
||||
PARTY_ALIASES = { |
||||
"GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"], |
||||
"PVV": ["Partij voor de Vrijheid"], |
||||
# ... |
||||
} |
||||
``` |
||||
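One way a single alias table could be consumed is through a reverse index built once at import time. The sketch below is an illustration only: `canonical_party` and `_ALIAS_INDEX` are hypothetical names, and the dictionary repeats just the two entries shown above.

```python
# Only PARTY_ALIASES comes from the text above; the helper names are hypothetical.
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
}

# Reverse index: case-folded alias (or canonical name) -> canonical name.
_ALIAS_INDEX = {
    name.casefold(): canonical
    for canonical, aliases in PARTY_ALIASES.items()
    for name in (canonical, *aliases)
}


def canonical_party(name: str) -> str:
    """Return the canonical party name, or the stripped input if it is unknown."""
    cleaned = name.strip()
    return _ALIAS_INDEX.get(cleaned.casefold(), cleaned)
```

Returning unknown names unchanged keeps the lookup safe for parties that have no aliases yet; a stricter variant could raise instead.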
@ -1,143 +0,0 @@ |
||||
--- |
||||
title: Error Handling Patterns |
||||
category: constraints |
||||
severity: high |
||||
--- |
||||
|
||||
# Error Handling Patterns |
||||
|
||||
## Core Rules |
||||
|
||||
1. **Catch `Exception`, return safe fallbacks** (False/[]/None) |
||||
2. **Log exceptions with traceback** using `_logger.exception()` |
||||
3. **Never swallow exceptions silently** - always log or return sensible default |
||||
4. **Avoid nested try/except blocks** - flatten exception handling |
||||
|
||||
## Pattern: Try/Except Safe Fallback |
||||
|
||||
This is the dominant pattern in the codebase (219+ instances). |
||||
|
||||
```python |
||||
# Standard pattern from database.py, api_client.py, etc. |
||||
try: |
||||
result = risky_operation() |
||||
return process(result) |
||||
except Exception as exc: |
||||
_logger.warning("Operation failed: %s", exc) |
||||
return safe_fallback # False, [], None, {} |
||||
``` |
||||
|
||||
### Examples from Codebase |
||||
|
||||
**database.py** - DuckDB operations: |
||||
```python |
||||
def get_svd_vectors(self, window: str): |
||||
try: |
||||
conn = duckdb.connect(self.db_path, read_only=True) |
||||
try: |
||||
result = conn.execute(query, (window,)).fetchall() |
||||
return self._parse_vectors(result) |
||||
finally: |
||||
conn.close() |
||||
except Exception as exc: |
||||
_logger.warning("Failed to get SVD vectors: %s", exc) |
||||
return [] |
||||
``` |
||||
|
||||
**ai_provider.py** - HTTP retries: |
||||
```python |
||||
try: |
||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
||||
resp.raise_for_status() |
||||
return resp.json() |
||||
except requests.ConnectionError as exc: |
||||
if attempt == retries: |
||||
raise ProviderError(f"Connection error: {exc}") from exc |
||||
# ... retry logic |
||||
``` |
||||
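The retry logic itself is elided in the snippet above. As a generic illustration (not the code in `ai_provider.py`), a bounded retry loop with exponential backoff might look like the sketch below. `call_with_retries`, `base_delay`, and the injectable `sleep` are assumptions, and the built-in `ConnectionError` stands in for `requests.ConnectionError`, which is a different class and would have to be caught explicitly when using `requests`.

```python
import logging
import time
from typing import Callable, TypeVar

_logger = logging.getLogger(__name__)
T = TypeVar("T")


class ProviderError(Exception):
    """Raised when every attempt fails (name borrowed from the snippet above)."""


def call_with_retries(
    operation: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be >= 1")
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except ConnectionError as exc:
            if attempt == retries:
                # Last attempt: surface the failure to the caller.
                raise ProviderError(f"Connection error: {exc}") from exc
            delay = base_delay * 2 ** (attempt - 1)
            _logger.warning(
                "Attempt %d/%d failed, retrying in %.1fs: %s",
                attempt, retries, delay, exc,
            )
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Passing `sleep` as a parameter is a design choice that lets tests run without real delays.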
|
||||
## Pattern: Optional Dependency Fallback |
||||
|
||||
Gracefully degrade when optional packages are unavailable. |
||||
|
||||
```python |
||||
# UMAP fallback in explorer_helpers.py |
||||
try: |
||||
import umap |
||||
HAS_UMAP = True |
||||
except ImportError: |
||||
HAS_UMAP = False |
||||
_logger.debug("UMAP not available, using SVD vectors directly") |
||||
|
||||
def project_to_2d(vectors): |
||||
if HAS_UMAP: |
||||
return umap.UMAP().fit_transform(vectors) |
||||
return vectors[:, :2] # Fallback: first 2 SVD dimensions |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### 1. Bare except with pass (CRITICAL) |
||||
**File**: `database.py`, line 47 |
||||
|
||||
```python |
||||
# BAD - catches KeyboardInterrupt, SystemExit, MemoryError |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except: # bare except |
||||
pass |
||||
``` |
||||
|
||||
**Fix**: Catch a specific exception, or log and continue: |
||||
```python |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except Exception as exc: |
||||
_logger.debug("Sequence creation skipped (may already exist): %s", exc) |
||||
``` |
||||
|
||||
### 2. Nested Exception Handling |
||||
**File**: `explorer.py`, lines 244-261 |
||||
|
||||
```python |
||||
# BAD - opaque error paths |
||||
try: |
||||
result = compute_svd(motions) |
||||
except Exception: |
||||
try: |
||||
result = fallback_compute(motions) |
||||
except Exception: |
||||
pass # Both exceptions silently dropped |
||||
``` |
||||
|
||||
**Fix**: Log each failure explicitly and re-raise when the fallback also fails: |
||||
```python |
||||
# GOOD - explicit handling |
||||
try: |
||||
result = compute_svd(motions) |
||||
except Exception as exc: |
||||
_logger.warning("SVD failed, trying fallback: %s", exc) |
||||
try: |
||||
result = fallback_compute(motions) |
||||
except Exception as fallback_exc: |
||||
_logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc) |
||||
raise |
||||
``` |
||||
|
||||
## Rule Summary |
||||
|
||||
| Pattern | When to Use | Return Value | |
||||
|---------|-------------|--------------| |
||||
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` | |
||||
| Re-raise | Critical operations that must succeed | raise | |
||||
| Log and continue | Optional steps in pipeline | (continue) | |
||||
| Graceful degradation | Optional dependencies | Default behavior | |
||||
|
||||
## When to Log vs Return |
||||
|
||||
| Scenario | Action | |
||||
|----------|--------| |
||||
| User action fails | Log warning, return safe default | |
||||
| Internal error (corrupt data) | Log error, return safe default | |
||||
| Transient failure (network) | Log warning, retry if appropriate | |
||||
| Configuration error | Log error, raise with clear message | |
||||
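Two rows of the table above can be contrasted in code: a configuration error that is logged and raised, and corrupt data that is logged and replaced by a safe default. This is a minimal sketch; `ConfigError`, `read_timeout`, and `parse_vote_counts` are hypothetical names, not functions from the codebase.

```python
import logging

_logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Hypothetical exception for invalid configuration."""


def read_timeout(settings: dict) -> float:
    """Configuration error: log at ERROR level, then raise with a clear message."""
    raw = settings.get("API_TIMEOUT")
    try:
        return float(raw)
    except (TypeError, ValueError) as exc:
        _logger.error("Invalid API_TIMEOUT value: %r", raw)
        raise ConfigError(f"API_TIMEOUT must be a number, got {raw!r}") from exc


def parse_vote_counts(rows: list) -> dict:
    """Corrupt data: log the error and return a safe default."""
    try:
        return {party: int(count) for party, count in rows}
    except (TypeError, ValueError) as exc:
        _logger.error("Could not parse vote counts: %s", exc)
        return {}
```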
@ -1,205 +0,0 @@ |
||||
# Import Organization Constraints |
||||
|
||||
## Standard Order |
||||
|
||||
Organize imports in three groups with blank lines between: |
||||
|
||||
```python |
||||
# 1. Standard library imports (alphabetical within group) |
||||
import json |
||||
import logging |
||||
import os |
||||
from datetime import datetime, timedelta |
||||
from typing import Dict, List, Optional, Tuple |
||||
|
||||
# 2. Third-party packages (alphabetical within group) |
||||
import duckdb |
||||
import requests |
||||
|
||||
# 3. Local application modules (can use relative imports) |
||||
from config import config |
||||
from database import db |
||||
from summarizer import summarizer |
||||
``` |
||||
|
||||
## Alphabetical Ordering |
||||
|
||||
Within each group, sort imports alphabetically: |
||||
|
||||
```python |
||||
# GOOD - alphabetical |
||||
import json |
||||
import logging |
||||
from datetime import datetime |
||||
from typing import Dict, List, Optional |
||||
|
||||
# BAD - random order |
||||
from typing import Optional |
||||
import json |
||||
from datetime import datetime |
||||
import logging |
||||
from typing import Dict, List |
||||
``` |
||||
|
||||
## Grouping Rules |
||||
|
||||
### Standard Library |
||||
- `json`, `logging`, `os`, `sys`, `time` |
||||
- `datetime`, `timedelta` from `datetime` |
||||
- `Dict`, `List`, `Optional`, etc. from `typing` |
||||
- `argparse`, `pathlib`, `re`, `uuid` |
||||
|
||||
### Third-Party |
||||
- `duckdb`, `requests`, `streamlit` |
||||
- `numpy`, `scipy`, `sklearn` |
||||
- `plotly`, `bs4` (installed as the `beautifulsoup4` package) |
||||
- `pytest` |
||||
|
||||
### Local Application |
||||
- Modules from same package |
||||
- Relative imports when appropriate |
||||
|
||||
## When to Use `from X import Y` |
||||
|
||||
### Prefer `from module import specific_items` for: |
||||
- Constants and config |
||||
- Single classes or functions used frequently |
||||
- Type annotations |
||||
|
||||
```python |
||||
# GOOD - clear about what we're using |
||||
from config import config |
||||
from database import db |
||||
|
||||
# GOOD - type hints |
||||
from typing import Dict, List, Optional |
||||
``` |
||||
|
||||
### Use `import module` when: |
||||
- You need multiple items from the module |
||||
- Using module.namespace is clearer |
||||
|
||||
```python |
||||
# GOOD - duckdb used for types and module access |
||||
import duckdb |
||||
|
||||
conn = duckdb.connect(...) |
||||
result = conn.execute(...) |
||||
|
||||
# Also acceptable for types |
||||
from typing import Dict |
||||
``` |
||||
|
||||
## Relative Imports |
||||
|
||||
In package modules, prefer relative imports: |
||||
|
||||
```python |
||||
# pipeline/svd_pipeline.py |
||||
from ..database import MotionDatabase # relative import |
||||
from .text_pipeline import process_text # relative import |
||||
``` |
||||
|
||||
## Circular Imports |
||||
|
||||
Avoid circular imports by: |
||||
1. Moving shared code to a third module |
||||
2. Using TYPE_CHECKING for type hints only |
||||
|
||||
```python |
||||
# types.py - shared type definitions |
||||
from typing import TypedDict |
||||
|
||||
class MotionDict(TypedDict): |
||||
id: int |
||||
title: str |
||||
... |
||||
|
||||
# module_a.py |
||||
from .types import MotionDict |
||||
|
||||
# module_b.py - if needed here too |
||||
from .types import MotionDict |
||||
``` |
||||
|
||||
## Import Patterns to Avoid |
||||
|
||||
### Wildcard Imports |
||||
```python |
||||
# BAD |
||||
from database import * |
||||
|
||||
# GOOD |
||||
from database import db, MotionDatabase |
||||
``` |
||||
|
||||
### Import in Function Scope (unless necessary) |
||||
```python |
||||
# AVOID - delays import, makes dependencies unclear |
||||
def some_function(): |
||||
import pandas as pd # Late import |
||||
return pd.DataFrame(...) |
||||
|
||||
# PREFER - import at module level |
||||
import pandas as pd |
||||
|
||||
def some_function(): |
||||
return pd.DataFrame(...) |
||||
``` |
||||
|
||||
### Reassigning Imported Names |
||||
```python |
||||
# BAD - confusing |
||||
from module import process |
||||
process = something_else # Reassigning |
||||
|
||||
# GOOD - clear naming |
||||
from module import process as process_data |
||||
``` |
||||
|
||||
## Type Checking Imports |
||||
|
||||
For type hints only, use TYPE_CHECKING: |
||||
|
||||
```python |
||||
from typing import TYPE_CHECKING |
||||
|
||||
if TYPE_CHECKING: |
||||
from .models import Motion |
||||
|
||||
def get_motion(motion_id: int) -> "Motion": # String quote for forward ref |
||||
... |
||||
``` |
||||
|
||||
## Optional Dependency Imports |
||||
|
||||
Handle optional dependencies gracefully: |
||||
|
||||
```python |
||||
try: |
||||
import duckdb |
||||
except ImportError: |
||||
duckdb = None # Will be checked later |
||||
|
||||
class MotionDatabase: |
||||
def __init__(self): |
||||
if duckdb is None: |
||||
self._file_mode = True # Fallback mode |
||||
``` |
||||
|
||||
## Example: Complete Import Block |
||||
|
||||
```python |
||||
# Complete example from database.py |
||||
import json |
||||
import logging |
||||
import uuid |
||||
from datetime import datetime, timedelta |
||||
from typing import Dict, List, Optional, Tuple |
||||
|
||||
import duckdb |
||||
|
||||
from config import config |
||||
|
||||
from database import db |
||||
``` |
||||
@ -1,131 +0,0 @@ |
||||
--- |
||||
title: Logging Constraints |
||||
category: constraints |
||||
severity: critical |
||||
--- |
||||
|
||||
# Logging Constraints |
||||
|
||||
## Core Rule |
||||
|
||||
Use `logging.getLogger(__name__)` - never use `print()` |
||||
|
||||
**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances). |
||||
|
||||
## CRITICAL Anti-Pattern: print() Instead of Logging |
||||
|
||||
**File**: `api_client.py` |
||||
**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)` |
||||
|
||||
**Broken code**: |
||||
```python |
||||
def get_motions(self, ...): |
||||
try: |
||||
# ... |
||||
print(f"Fetched {len(voting_records)} voting records from API") # BAD |
||||
print(f"Processed into {len(motions)} unique motions") # BAD |
||||
except Exception as e: |
||||
print(f"Error fetching motions from API: {e}") # BAD - no traceback |
||||
``` |
||||
|
||||
**Fix**: |
||||
```python |
||||
import logging |
||||
|
||||
_logger = logging.getLogger(__name__) |
||||
|
||||
def get_motions(self, ...): |
||||
try: |
||||
_logger.info("Fetched %d voting records from API", len(voting_records)) |
||||
_logger.info("Processed into %d unique motions", len(motions)) |
||||
except Exception as e: |
||||
_logger.exception("Error fetching motions from API: %s", e) |
||||
return [] |
||||
``` |
||||
|
||||
## Logger Initialization |
||||
|
||||
Get logger at module level: |
||||
|
||||
```python |
||||
# GOOD: Use logging.getLogger(__name__) |
||||
import logging |
||||
|
||||
_logger = logging.getLogger(__name__) |
||||
|
||||
def some_function(): |
||||
_logger.info("Processing started") |
||||
_logger.debug("Detail: %s", detail) |
||||
``` |
||||
|
||||
## Logger Naming |
||||
|
||||
Use `__name__` for automatic module path: |
||||
|
||||
```python |
||||
# In database.py - logger will be "database" |
||||
_logger = logging.getLogger(__name__) |
||||
|
||||
# In pipeline/svd_pipeline.py - logger will be "pipeline.svd_pipeline" |
||||
_logger = logging.getLogger(__name__) |
||||
``` |
||||
|
||||
**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention. |
||||
|
||||
**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish from class-level loggers. |
||||
|
||||
## Log Levels |
||||
|
||||
| Level | When to Use | |
||||
|-------|-------------| |
||||
| DEBUG | Detailed diagnostic info (dev only) | |
||||
| INFO | Normal operation milestones | |
||||
| WARNING | Unexpected but handled (fallbacks) | |
||||
| ERROR | Operation failed, may need attention | |
||||
| CRITICAL | Fatal error, program may crash | |
||||
|
||||
## Exception Logging |
||||
|
||||
Use `_logger.exception()` for caught exceptions (includes traceback): |
||||
|
||||
```python |
||||
try: |
||||
result = risky_operation() |
||||
except Exception as exc: |
||||
_logger.exception("Operation failed: %s", exc) |
||||
return fallback_value |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Debug Prints in Production Code |
||||
```python |
||||
# BAD |
||||
print(f"[TRAJ DEBUG] processing window {wid}") |
||||
|
||||
# GOOD |
||||
_logger.debug("Processing window %s", wid) |
||||
``` |
||||
|
||||
### Inconsistent Logger Names |
||||
```python |
||||
# BAD - mixing _logger and logger |
||||
_logger = logging.getLogger(__name__) |
||||
logger = logging.getLogger("other") # Inconsistent |
||||
``` |
||||
|
||||
## Sensitive Data |
||||
|
||||
Never log sensitive information: |
||||
- API keys |
||||
- User votes |
||||
- Session IDs (if tied to user data) |
||||
- Personal information |
||||
|
||||
```python |
||||
# BAD |
||||
_logger.info("User %s voted %s", user_id, vote) |
||||
|
||||
# GOOD - record the event without the vote or the full session ID |
||||
_logger.info("Vote recorded for session %s", session_id[:8]) |
||||
``` |
||||
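As a second line of defence, a `logging.Filter` can mask secret-looking values before a handler writes them. This is a sketch under assumptions: the regular expression is illustrative and far from exhaustive, and `RedactSecretsFilter` is a hypothetical name.

```python
import logging
import re

# Illustrative pattern only: matches "api_key=<value>" or "token: <value>".
_SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|token)(\s*[=:]\s*)\S+")


class RedactSecretsFilter(logging.Filter):
    """Mask secret-looking values in the fully formatted log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # applies %-style arguments
        record.msg = _SECRET_RE.sub(r"\1\2[REDACTED]", message)
        record.args = ()               # the message is already formatted
        return True                    # keep the record


handler = logging.StreamHandler()
handler.addFilter(RedactSecretsFilter())
_logger = logging.getLogger("redaction_demo")
_logger.addHandler(handler)
_logger.warning("Request failed with api_key=%s", "abc123")
```

A filter like this reduces accidental leaks but does not replace the rule above: sensitive values should not be passed to the logger in the first place.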
@ -1,141 +0,0 @@ |
||||
# Naming Constraints |
||||
|
||||
## File Names |
||||
|
||||
### Python Modules |
||||
- **Convention**: `snake_case.py` |
||||
- **Examples**: `motion_database.py`, `api_client.py`, `text_pipeline.py` |
||||
|
||||
### Test Files |
||||
- **Convention**: `test_<module_name>.py` |
||||
- **Examples**: `test_database.py`, `test_api_client.py` |
||||
|
||||
### Config Files |
||||
- **Convention**: `snake_case` |
||||
- **Examples**: `config.py`, `.env.example`, `pyproject.toml` |
||||
|
||||
### Directories |
||||
- **Convention**: `snake_case/` |
||||
- **Examples**: `pipeline/`, `tests/integration/`, `src/validators/` |
||||
|
||||
## Class Names |
||||
|
||||
- **Convention**: `PascalCase` |
||||
- **Examples**: `MotionDatabase`, `TweedeKamerAPI`, `MotionSummarizer` |
||||
|
||||
### Naming Patterns |
||||
| Pattern | Example | |
||||
|---------|---------| |
||||
| Database wrapper | `MotionDatabase` | |
||||
| API client | `TweedeKamerAPI` | |
||||
| Service/Helpers | `MotionScraper`, `MotionAnalyzer` | |
||||
| Exceptions | `ProviderError` | |
||||
|
||||
## Function Names |
||||
|
||||
- **Convention**: `snake_case` |
||||
- **Examples**: `get_motions`, `compute_similarity`, `process_voting_records` |
||||
|
||||
### Private Methods |
||||
- **Convention**: `_snake_case` (single underscore prefix) |
||||
- **Examples**: `_get_voting_records`, `_parse_response` |
||||
|
||||
## Variable Names |
||||
|
||||
### Regular Variables |
||||
- **Convention**: `snake_case` |
||||
- **Examples**: `motion_id`, `party_name`, `voting_results` |
||||
|
||||
### Constants (Module-Level) |
||||
- **Convention**: `UPPER_SNAKE_CASE` |
||||
- **Examples**: `DATABASE_PATH`, `API_TIMEOUT`, `MAX_RETRIES` |
||||
|
||||
### Config Variables (in dataclass) |
||||
- **Convention**: `UPPER_SNAKE_CASE` |
||||
- **Examples**: `QWEN_MODEL`, `POLICY_AREAS` |
||||
|
||||
### Booleans |
||||
- **Convention**: `is_`, `has_`, `can_` prefixes or `_flag` suffix |
||||
- **Examples**: `is_active`, `has_votes`, `skip_extract` |
||||
|
||||
### Private Variables |
||||
- **Convention**: `_underscore_prefix` |
||||
- **Examples**: `_conn`, `_cache`, `_session` |
||||
|
||||
## Singleton Instances |
||||
|
||||
- **Convention**: `lower_snake_case` at module level |
||||
- **Examples**: `db = MotionDatabase()`, `summarizer = MotionSummarizer()` |
||||
|
||||
```python |
||||
# database.py |
||||
class MotionDatabase: |
||||
... |
||||
|
||||
# Singleton instance |
||||
db = MotionDatabase() |
||||
|
||||
# Usage |
||||
from database import db |
||||
motions = db.get_motions() |
||||
``` |
||||
|
||||
## Type Variables and Aliases |
||||
|
||||
- **Convention**: `PascalCase` |
||||
- **Examples**: `T = TypeVar('T')`, `MotionDict = Dict[str, Any]` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Inconsistent Naming |
||||
```python |
||||
# BAD - mixing styles |
||||
get_motions() # snake_case |
||||
GetMotionById() # PascalCase |
||||
processData() # camelCase |
||||
|
||||
# GOOD - consistent snake_case |
||||
get_motions() |
||||
get_motion_by_id() |
||||
process_voting_data() |
||||
``` |
||||
|
||||
### Abbreviations |
||||
```python |
||||
# AVOID - unclear abbreviations |
||||
calc_similarity() # calculate_* |
||||
proc_votes() # process_* |
||||
get_mp_data() # get_mp_metadata() |
||||
|
||||
# PREFER - full words |
||||
calculate_similarity() |
||||
process_votes() |
||||
get_mp_metadata() |
||||
``` |
||||
|
||||
### Hungarian Notation |
||||
```python |
||||
# BAD - Hungarian notation |
||||
str_title = "..." |
||||
int_count = 0 |
||||
b_is_active = True |
||||
|
||||
# GOOD - clear types via naming |
||||
title = "..." |
||||
count = 0 |
||||
is_active = True |
||||
``` |
||||
|
||||
## Special Cases |
||||
|
||||
### Window IDs |
||||
- **Format**: `"YYYY-QN"` or `"YYYY"` |
||||
- **Examples**: `"2024-Q1"`, `"2024-Q2"`, `"2024"` |
||||
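The two window ID formats can be checked with a small parser. The sketch below is hypothetical (`parse_window_id` is not a function from the codebase) and assumes quarters run from 1 to 4.

```python
import re
from typing import Optional, Tuple

# "YYYY" or "YYYY-QN"; restricting N to 1..4 is an assumption.
_WINDOW_ID_RE = re.compile(r"([0-9]{4})(?:-Q([1-4]))?")


def parse_window_id(window_id: str) -> Tuple[int, Optional[int]]:
    """Return (year, quarter); quarter is None for whole-year windows."""
    match = _WINDOW_ID_RE.fullmatch(window_id)
    if match is None:
        raise ValueError(f"Invalid window ID: {window_id!r}")
    year, quarter = match.groups()
    return int(year), (int(quarter) if quarter else None)
```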
|
||||
### Policy Areas |
||||
- **Convention**: Title Case Dutch terms (spaces allowed) |
||||
- **Examples**: `"Economie"`, `"Sociale Zaken"`, `"Klimaat"` |
||||
|
||||
### Vote Values |
||||
- **Convention**: Capitalized Dutch terms |
||||
- **Values**: `"Voor"`, `"Tegen"`, `"Onthouden"`, `"Geen stem"`, `"Afwezig"` |
||||
@ -1,26 +0,0 @@ |
||||
# Testing conventions constraint (YAML) |
||||
|
||||
rules: |
||||
- name: test_naming |
||||
rule: "Use pytest and name tests test_*.py and test_* functions." |
||||
examples: |
||||
- good: "tests/test_text_pipeline.py" |
||||
- bad: "tests/text_pipeline_test.py" |
||||
|
||||
- name: fixtures_and_conftest |
||||
rule: "Place shared fixtures in tests/conftest.py or tests/fixtures/ for reuse." |
||||
examples: |
||||
- good: "use fixtures declared in tests/conftest.py" |
||||
|
||||
- name: assert_raises |
||||
rule: "Explicitly assert expected exceptions with pytest.raises for invalid input." |
||||
examples: |
||||
- good: | |
||||
import pytest |
||||
|
||||
def test_invalid_input(): |
||||
with pytest.raises(ValueError): |
||||
function_under_test('bad') |
||||
|
||||
enforcement_examples: |
||||
- "Run pytest in CI; fail if tests don't run or if there are regressions." |
||||
@ -1,233 +0,0 @@ |
||||
# Type Hint Constraints |
||||
|
||||
## Core Rule |
||||
|
||||
**Use type hints on all public functions and methods** |
||||
|
||||
## Function Type Hints |
||||
|
||||
### Required on Public APIs |
||||
|
||||
```python |
||||
# GOOD - complete type hints |
||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
||||
... |
||||
|
||||
def get_filtered_motions( |
||||
self, |
||||
policy_area: str = "Alle", |
||||
limit: int = 10 |
||||
) -> List[Dict]: |
||||
... |
||||
|
||||
def calculate_similarity(self, motion_a: int, motion_b: int) -> float: |
||||
... |
||||
``` |
||||
|
||||
### Optional Parameters |
||||
|
||||
Use `Optional[X]` or `X | None`: |
||||
|
||||
```python |
||||
# Both forms are acceptable |
||||
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]: |
||||
... |
||||
|
||||
def get_motion(self, motion_id: int | None = None) -> dict | None: |
||||
... |
||||
``` |
||||
|
||||
### Multiple Return Types |
||||
|
||||
Use `Union[X, Y]` or `|` operator: |
||||
|
||||
```python |
||||
# Acceptable forms |
||||
def parse_value(self, value: str) -> Union[bool, str, None]: |
||||
... |
||||
|
||||
def parse_value(self, value: str) -> bool | str | None: |
||||
... |
||||
``` |
||||
|
||||
### Generic Types |
||||
|
||||
Use `List[X]`, `Dict[K, V]`, `Tuple[X, Y]`: |
||||
|
||||
```python |
||||
from typing import Dict, List, Optional, Tuple |
||||
|
||||
def get_motions(self, ids: List[int]) -> Dict[int, Dict]: |
||||
"""Map motion_id -> motion data.""" |
||||
... |
||||
|
||||
def process_batch(self, items: List[str]) -> Tuple[List[str], List[str]]: |
||||
"""Returns (successes, failures).""" |
||||
... |
||||
``` |
||||
|
||||
## Collection Types |
||||
|
||||
Prefer specific types over bare `list`/`dict`: |
||||
|
||||
```python |
||||
# GOOD - specific types |
||||
def get_votes(self) -> List[str]: |
||||
... |
||||
|
||||
def get_metadata(self) -> Dict[str, Any]: |
||||
... |
||||
|
||||
# ACCEPTABLE - for truly generic collections |
||||
def merge_dicts(*dicts: dict) -> dict: |
||||
... |
||||
``` |
||||
|
||||
## DuckDB Result Types |
||||
|
||||
DuckDB returns tuples/lists - document expected structure: |
||||
|
||||
```python |
||||
def get_motion(self, motion_id: int) -> Optional[Tuple]: |
||||
"""Returns (id, title, description, date, ...) or None.""" |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone() |
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
# Or use Dict for clarity |
||||
def get_motion_as_dict(self, motion_id: int) -> Optional[Dict]: |
||||
"""Returns motion dict or None.""" |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
row = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone() |
||||
if row: |
||||
return { |
||||
"id": row[0], |
||||
"title": row[1], |
||||
"description": row[2], |
||||
... |
||||
} |
||||
return None |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
|
||||
## Class/Instance Types |
||||
|
||||
Use `Self` for methods returning instance type: |
||||
|
||||
```python |
||||
from typing import Self |
||||
|
||||
class MotionDatabase: |
||||
def with_connection(self, path: str) -> Self: |
||||
"""Return new instance with different path.""" |
||||
return type(self)(db_path=path)  # keeps the Self return type correct for subclasses |
||||
``` |
||||
|
||||
## Callback/Function Types |
||||
|
||||
Use `Callable` for function parameters: |
||||
|
||||
```python |
||||
from typing import Any, Callable, Dict, List |
||||
|
||||
def process_motions( |
||||
motions: List[Dict], |
||||
processor: Callable[[Dict], Any] |
||||
) -> List[Any]: |
||||
return [processor(m) for m in motions] |
||||
``` |
||||
|
||||
## Type Aliases |
||||
|
||||
Define clear type aliases for domain concepts: |
||||
|
||||
```python |
||||
from typing import Dict, List, Literal, Optional, TypedDict |
||||
|
||||
# Vote values |
||||
VoteValue = Literal["Voor", "Tegen", "Onthouden", "Geen stem", "Afwezig"] |
||||
|
||||
# Policy areas |
||||
PolicyArea = Literal["Alle", "Economie", "Klimaat", "Immigratie", ...] |
||||
|
||||
# Motion dict |
||||
class MotionDict(TypedDict): |
||||
id: int |
||||
title: str |
||||
description: Optional[str] |
||||
date: Optional[str] |
||||
policy_area: Optional[str] |
||||
voting_results: Optional[str] # JSON string |
||||
winning_margin: Optional[float] |
||||
|
||||
def get_motion(self, motion_id: int) -> Optional[MotionDict]: |
||||
... |
||||
``` |
||||
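A `Literal` alias is checked only by static tools, but its members can be reused at runtime through `typing.get_args`. The sketch below assumes a hypothetical `parse_vote` helper; only the `VoteValue` alias comes from the text above.

```python
from typing import Literal, cast, get_args

VoteValue = Literal["Voor", "Tegen", "Onthouden", "Geen stem", "Afwezig"]

# get_args returns the literal members as a tuple.
VALID_VOTES = frozenset(get_args(VoteValue))


def parse_vote(raw: str) -> VoteValue:
    """Validate a raw string against the VoteValue members."""
    value = raw.strip()
    if value not in VALID_VOTES:
        raise ValueError(f"Unknown vote value: {raw!r}")
    return cast(VoteValue, value)  # narrowed by the membership check above
```

Deriving the set from the alias keeps the static type and the runtime check from drifting apart.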
|
||||
## Avoid `Any` |
||||
|
||||
Use `Any` sparingly - prefer specific types: |
||||
|
||||
```python |
||||
# AVOID - too vague |
||||
def process(data: Any) -> Any: |
||||
... |
||||
|
||||
# PREFER - specific types |
||||
def process(motion: MotionDict) -> Optional[SimilarityResult]: |
||||
... |
||||
``` |
||||
|
||||
## Inline Type Hints |
||||
|
||||
For simple cases, inline hints are fine: |
||||
|
||||
```python |
||||
def get_count(self) -> int: |
||||
... |
||||
|
||||
def is_empty(self) -> bool: |
||||
... |
||||
``` |
||||
|
||||
## Docstring Type Hints |
||||
|
||||
For complex types, describe the structure in the docstring as well: |
||||
|
||||
```python |
||||
def get_party_positions(self, window_id: str) -> Dict[str, List[float]]: |
||||
"""Get party positions in political space. |
||||
|
||||
Args: |
||||
window_id: Time window (e.g., "2024-Q1") |
||||
|
||||
Returns: |
||||
Dict mapping party_name -> [x, y] coordinates |
||||
|
||||
Example: |
||||
>>> positions = db.get_party_positions("2024-Q1") |
||||
>>> positions["VVD"] |
||||
[0.5, -0.3] |
||||
""" |
||||
... |
||||
``` |
||||
|
||||
## Type Checking |
||||
|
||||
Type hints are not enforced at runtime. Where a value must be validated, check it explicitly: |
||||
|
||||
```python |
||||
def set_count(self, count: int) -> None: |
||||
    if not isinstance(count, int): |
||||
        raise TypeError(f"Expected int, got {type(count).__name__}") |
||||
    self._count = count |
||||
``` |
||||
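One caveat with an `isinstance(count, int)` check like the one in `set_count`: `bool` is a subclass of `int` in Python, so `True` and `False` pass it. The sketch below shows one way to reject them. The function name `validate_count` is hypothetical and not part of the codebase.

```python
def validate_count(count: object) -> int:
    """Return count if it is a real int; reject bool, which subclasses int."""
    if isinstance(count, bool) or not isinstance(count, int):
        raise TypeError(f"Expected int, got {type(count).__name__}")
    return count
```

Whether booleans should be rejected depends on the caller. The stricter check is only worth it where a stray `True` would silently be treated as `1`.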
@ -1,124 +0,0 @@ |
||||
# Naming Conventions |
||||
|
||||
## Files |
||||
- **snake_case** for all Python files: `database.py`, `explorer_helpers.py`, `motion_cache.py` |
||||
- **PascalCase** is not used for file names |
||||
|
||||
## Functions |
||||
- **snake_case**: `get_svd_vectors()`, `compute_party_coords()`, `build_scatter_trace()` |
||||
- Private helpers prefixed with `_`: `_get_window_data()` |
||||
|
||||
## Classes |
||||
- **PascalCase**: `MotionDatabase`, `Config` |
||||
- **Dataclass pattern** for Config: `@dataclass` decorator with typed fields |
||||
|
||||
## Variables |
||||
- **snake_case**: `party_map`, `mp_name`, `svd_vectors`, `party_centroids` |
||||
- **CONSTANT_SNAKE_CASE** for module-level constants: `PARTY_COLOURS`, `DEFAULT_WINDOW` |
||||
|
||||
## Module-Level Exports |
||||
- **Singleton instance**: `db = MotionDatabase()` at module bottom (not class-level) |
||||
- **Config instance**: `config = Config(...)` at module bottom |
||||
- **Dicts**: `PARTY_COLOURS` exported from `config.py` |
||||
|
||||
--- |
||||
|
||||
# Error Handling |
||||
|
||||
## Known Patterns |
||||
1. **Bare except with pass** (ANTI-PATTERN - see anti-patterns.yaml) |
||||
```python |
||||
except: |
||||
pass # database.py:47 |
||||
``` |
||||
|
||||
2. **Graceful degradation**: catch specific exceptions, fall back to default |
||||
```python |
||||
try: |
||||
result = compute_svd() |
||||
except ImportError: |
||||
result = DEFAULT_SVD |
||||
``` |
||||
|
||||
3. **Optional dependency fallbacks**: |
||||
```python |
||||
try: |
||||
import umap |
||||
use_umap = True |
||||
except ImportError: |
||||
use_umap = False |
||||
``` |
||||
|
||||
4. **Nested exception handling** (ANTI-PATTERN - see anti-patterns.yaml): |
||||
```python |
||||
try: |
||||
... |
||||
except Exception: |
||||
try: |
||||
... |
||||
except Exception: |
||||
pass |
||||
``` |
||||
|
||||
## Rules |
||||
- Never use bare `except:` — always specify exception type |
||||
- Never swallow exceptions silently — log or return a sensible default |
||||
- For optional deps, use `ImportError` or `ModuleNotFoundError` explicitly |
||||
- Avoid nested try/except blocks |
||||
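The rules above can be sketched as follows: catch one specific exception, log it, and return a sensible default instead of swallowing the error. The helper `parse_margin` and its default value are hypothetical and only illustrate the pattern.

```python
import logging

logger = logging.getLogger(__name__)


def parse_margin(raw: str, default: float = 0.0) -> float:
    """Hypothetical helper: parse a margin, logging and falling back on bad input."""
    try:
        return float(raw)
    except ValueError:
        # Specific exception type, logged rather than silently ignored
        logger.warning("Could not parse margin %r; using default %s", raw, default)
        return default
```

Because only `ValueError` is caught, unrelated failures (for example a `TypeError` from passing `None`) still surface to the caller.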
|
||||
--- |
||||
|
||||
# Code Organization |
||||
|
||||
## Singleton Pattern |
||||
Each module owns one shared instance: |
||||
```python |
||||
# database.py |
||||
db = MotionDatabase() |
||||
|
||||
# config.py |
||||
config = Config(...) |
||||
PARTY_COLOURS = {...} |
||||
``` |
||||
|
||||
## Pure Functions in Helpers |
||||
`explorer_helpers.py` contains only pure functions (no IO, no Streamlit calls): |
||||
```python |
||||
def compute_party_coords(svd_vectors, party_map): |
||||
"""Pure: no side effects, no imports from this module""" |
||||
... |
||||
|
||||
def build_scatter_trace(df, color_col): |
||||
"""Pure: returns Plotly trace dict""" |
||||
... |
||||
``` |
||||
|
||||
## Cached Data Loaders |
||||
Use `@st.cache_data` for expensive data loading: |
||||
```python |
||||
@st.cache_data |
||||
def load_svd_vectors(window: str) -> pd.DataFrame: |
||||
return db.get_svd_vectors(window) |
||||
``` |
||||
|
||||
## Dataclass Config |
||||
```python |
||||
from dataclasses import dataclass, field |
||||
|
||||
@dataclass |
||||
class Config: |
||||
    db_path: str = "data/stemwijzer.duckdb" |
||||
    default_window: str = "2023" |
||||
    party_colours: dict = field(default_factory=lambda: PARTY_COLOURS) |
||||
``` |
||||
|
||||
--- |
||||
|
||||
# Imports |
||||
|
||||
## Ordering (convention) |
||||
1. Standard library |
||||
2. Third-party (streamlit, ibis, plotly, sklearn, umap) |
||||
3. Local/relative imports |
||||
|
||||
## Avoid |
||||
- Wildcard imports (`from module import *`) |
||||
- Circular imports (ensure dependency direction: helpers → database → config) |
||||
@ -1,92 +0,0 @@ |
||||
--- |
||||
title: Dependencies and Library Usage |
||||
category: dependencies |
||||
--- |
||||
|
||||
# Dependencies and Library Usage |
||||
|
||||
## Core Dependencies |
||||
|
||||
### duckdb |
||||
- **Required**: Yes |
||||
- **Fallback**: None (core functionality) |
||||
- **Usage**: SQL database for motions, embeddings, SVD vectors |
||||
- **Files**: database.py, analysis/*.py, pipeline/*.py |
||||
|
||||
### streamlit |
||||
- **Required**: Yes |
||||
- **Fallback**: None |
||||
- **Usage**: Web UI framework |
||||
- **Files**: app.py, pages/*.py, explorer.py |
||||
|
||||
### requests |
||||
- **Required**: Yes |
||||
- **Fallback**: None |
||||
- **Usage**: HTTP client for API calls |
||||
- **Files**: api_client.py, ai_provider.py |
||||
|
||||
### plotly |
||||
- **Required**: Yes |
||||
- **Fallback**: None (raises ImportError) |
||||
- **Usage**: Interactive charts for explorer |
||||
- **Files**: explorer.py, explorer_helpers.py |
||||
|
||||
## Optional Dependencies |
||||
|
||||
### umap-learn |
||||
- **Required**: No |
||||
- **Fallback**: Use raw SVD vectors (first 2 dimensions) |
||||
- **Usage**: Dimensionality reduction for visualization |
||||
- **Files**: analysis/clustering.py |
||||
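A minimal sketch of the fallback described for umap-learn, assuming vectors are plain lists of floats. The function name `project_2d` and the `use_umap` flag are hypothetical. `umap.UMAP(n_components=2).fit_transform(...)` is the standard umap-learn entry point, but how `analysis/clustering.py` actually calls it is not shown here.

```python
from typing import List, Sequence

try:
    import umap  # optional dependency (umap-learn)
except ImportError:
    umap = None


def project_2d(vectors: Sequence[Sequence[float]], use_umap: bool = True) -> List[List[float]]:
    """Project vectors to 2D with UMAP when available, else keep the first 2 dimensions."""
    if use_umap and umap is not None:
        reducer = umap.UMAP(n_components=2)
        return reducer.fit_transform(vectors).tolist()
    # Fallback: raw SVD vectors, first two dimensions only
    return [list(v[:2]) for v in vectors]


fallback = project_2d([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], use_umap=False)
```

The import is attempted once at module load, so callers never need their own `try`/`except` around the projection.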
|
||||
### matplotlib |
||||
- **Required**: No |
||||
- **Fallback**: Plotly or raw output |
||||
- **Usage**: Static charting |
||||
- **Files**: Various analysis scripts |
||||
|
||||
## ML Dependencies |
||||
|
||||
### sklearn |
||||
- **Required**: Yes |
||||
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler |
||||
- **Files**: analysis/clustering.py, similarity/compute.py |
||||
|
||||
### scipy |
||||
- **Required**: Yes |
||||
- **Usage**: SVD (scipy.linalg.svd), spatial.procrustes for alignment |
||||
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py |
||||
|
||||
### numpy |
||||
- **Required**: Yes |
||||
- **Usage**: Array operations, linear algebra |
||||
- **Files**: Throughout codebase |
||||
|
||||
## Key Imports by File |
||||
|
||||
### explorer.py |
||||
- `import streamlit as st` |
||||
- `from database import db` |
||||
- `from explorer_helpers import *` |
||||
|
||||
### explorer_helpers.py |
||||
- `import pandas as pd` |
||||
- `import plotly.graph_objects as go` |
||||
- `from database import db` (optional, for type hints) |
||||
|
||||
### database.py |
||||
- `import ibis` |
||||
- `import duckdb` |
||||
- `from config import config, PARTY_COLOURS` |
||||
|
||||
### config.py |
||||
- `from dataclasses import dataclass, field` |
||||
- `import streamlit as st` (optional, for warnings) |
||||
|
||||
## Singleton Instances |
||||
|
||||
| Module | Instance | Type | |
||||
|--------|----------|------| |
||||
| `database.py` | `db` | `MotionDatabase` | |
||||
| `config.py` | `config` | `Config` (dataclass) | |
||||
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | |
||||
@ -1,146 +0,0 @@ |
||||
--- |
||||
title: Domain Glossary |
||||
category: domain |
||||
--- |
||||
|
||||
# Domain Glossary - Dutch Political Terms |
||||
|
||||
## CRITICAL INVARIANTS |
||||
|
||||
> **Rule 1**: The centroid of the right-wing parties must lie on the RIGHT side of ALL axes |
||||
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT |
||||
> - Individual right-wing parties may vary slightly from the centroid |
||||
> - This is non-negotiable for any compass/axis visualization |
||||
|
||||
> **Rule 2**: SVD labels are empirically derived from voting data |
||||
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion |
||||
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative) |
||||
> - See SVD Label Derivation section below |
||||
|
||||
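Rule 1 can be enforced mechanically, since the sign of each SVD axis is arbitrary. The sketch below flips any axis on which the right-wing centroid is negative. It assumes that "RIGHT" means the positive direction of each axis, and the function name `orient_axes` and the sample coordinates are hypothetical.

```python
from typing import Dict, List

RIGHT_WING = {"PVV", "FVD", "JA21", "SGP"}


def orient_axes(party_coords: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Flip every axis whose right-wing centroid is negative (assumed: right = positive)."""
    right = [xy for party, xy in party_coords.items() if party in RIGHT_WING]
    if not right:
        return party_coords
    n_dims = len(right[0])
    # Centroid of the canonical right-wing parties, one value per axis
    centroid = [sum(xy[d] for xy in right) / len(right) for d in range(n_dims)]
    signs = [-1.0 if c < 0 else 1.0 for c in centroid]
    return {p: [s * v for s, v in zip(signs, xy)] for p, xy in party_coords.items()}


coords = {"PVV": [-0.8, 0.2], "SGP": [-0.4, 0.1], "SP": [0.7, -0.3]}
oriented = orient_axes(coords)
```

Only the centroid is constrained, so an individual right-wing party may still land slightly left of zero after the flip, as the rule allows.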
--- |
||||
|
||||
## SVD Label Derivation |
||||
|
||||
### The Process |
||||
|
||||
SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis: |
||||
|
||||
1. **Identify outliers**: Find the two MPs with most extreme positions on that axis |
||||
2. **Select representative motions**: Pick 20 motions that characterize the axis (10 on which the two outliers voted in opposite directions, and 10 on which both voted in the same direction but with other extremes) |
||||
3. **Interpret theme**: Read the motion titles to derive what the axis represents |
||||
4. **Assign label**: Label describes the empirical theme, could be: |
||||
- Left-Right |
||||
- Coalition-Opposition |
||||
- Progressive-Conservative |
||||
- EU-National sovereignty |
||||
- Populist-Establishment |
||||
- Or whatever the voting patterns show |
||||
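Steps 1 and 2 of the process above can be sketched in plain Python. This is a simplified illustration, not the project's implementation: it covers only outlier detection and the "opposite votes" half of motion selection, and the function names, MP names, and vote data are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_outliers(axis_scores: Dict[str, float]) -> Tuple[str, str]:
    """Return (most positive MP, most negative MP) on one axis."""
    most_positive = max(axis_scores, key=axis_scores.get)
    most_negative = min(axis_scores, key=axis_scores.get)
    return most_positive, most_negative


def opposing_motions(
    votes: Dict[str, Dict[int, int]], mp_a: str, mp_b: str, limit: int = 10
) -> List[int]:
    """Motion ids on which the two MPs cast opposite votes (+1 vs -1)."""
    shared = sorted(votes[mp_a].keys() & votes[mp_b].keys())
    opposite = [m for m in shared if votes[mp_a][m] * votes[mp_b][m] == -1]
    return opposite[:limit]


scores = {"MP A": 0.9, "MP B": -0.7, "MP C": 0.1}
votes = {
    "MP A": {1: 1, 2: -1, 3: 1},
    "MP B": {1: -1, 2: -1, 3: -1},
    "MP C": {1: 1, 2: 1, 3: 0},
}
pos, neg = find_outliers(scores)
motions = opposing_motions(votes, pos, neg)
```

Steps 3 and 4 (reading the motion titles and choosing a label) remain a manual, interpretive task.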
|
||||
### Example |
||||
|
||||
| Step | Description | |
||||
|------|-------------| |
||||
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 | |
||||
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 | |
||||
| 20 Motions | Immigration, integration, law & order themes dominate | |
||||
| Label | "Links-Rechts" (Left-Right) | |
||||
|
||||
### Labeling Rules |
||||
|
||||
- **Never use party names in labels** (e.g., not "PVV-SP axis") |
||||
- **Never use semantic/ideological labels** (e.g., not "progressive-conservative" unless that's what the motions show) |
||||
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy") |
||||
- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2" |
||||
|
||||
--- |
||||
|
||||
## Core Entities |
||||
|
||||
### Motion / Motie |
||||
- Parliamentary motion submitted by MPs |
||||
- Fields: `id`, `title`, `date`, `category` |
||||
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** |
||||
|
||||
### MP / Kamerlid |
||||
- Member of Parliament (Tweede Kamerlid) |
||||
- Identified by full name (e.g., "Van Dijk, I.") |
||||
- Has voting record, party affiliation, SVD position vector |
||||
|
||||
### Party / Fractie |
||||
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") |
||||
- Party centroids: average SVD position of all MPs in party |
||||
|
||||
### Vote / Stemming |
||||
- Individual MP's vote on a motion: +1, 0, -1 |
||||
- Aggregated to compute SVD vectors |
||||
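The numeric encoding above can be written as a small lookup table. The mapping from the Dutch labels used elsewhere in this documentation (`"Voor"`, `"Tegen"`, `"Onthouden"`, `"Geen stem"`, `"Afwezig"`) is an assumption, as is encoding absent or missing votes as `None`. The glossary gives no number for "Absent".

```python
from typing import Dict, Optional

# Assumed mapping from Dutch vote labels to the numeric encoding
VOTE_ENCODING: Dict[str, Optional[int]] = {
    "Voor": 1,        # For
    "Tegen": -1,      # Against
    "Onthouden": 0,   # Abstain
    "Geen stem": None,  # no vote cast (assumption: treated as missing)
    "Afwezig": None,    # Absent (assumption: treated as missing)
}


def encode_vote(label: str) -> Optional[int]:
    """Translate a vote label to +1 / 0 / -1, or None when no vote was cast."""
    if label not in VOTE_ENCODING:
        raise ValueError(f"Unknown vote label: {label!r}")
    return VOTE_ENCODING[label]
```

Raising on unknown labels, rather than defaulting to `0`, keeps an unexpected API value from being counted as an abstention.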
|
||||
--- |
||||
|
||||
## Time & Analysis Concepts |
||||
|
||||
### Window / Tijdsvenster |
||||
- Time period for analysis (annual or quarterly) |
||||
- Values: "2023", "2023-Q1", "2024", etc. |
||||
- SVD vectors computed per window |
||||
|
||||
### Trajectory |
||||
- MP's position change across multiple windows |
||||
- Computed from `svd_vectors` + window ordering |
||||
|
||||
--- |
||||
|
||||
## Mathematical / Algorithmic Terms |
||||
|
||||
### SVD Vector |
||||
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix |
||||
- Represents MP's position in political space |
||||
|
||||
### SVD Label |
||||
- Empirically derived axis label based on outlier MPs and representative motions |
||||
- Describes the theme of disagreement on that axis |
||||
- NOT based on party ideology or semantic labels |
||||
|
||||
### Political Compass |
||||
- 2D visualization with SVD axes mapped to compass quadrants |
||||
- X-axis: First SVD dimension (labeled from voting data) |
||||
- Y-axis: Second SVD dimension (labeled from voting data) |
||||
|
||||
### Procrustes Alignment |
||||
- Algorithm to align SVD vectors across time windows |
||||
- Ensures comparable positions across years/quarters |
||||
|
||||
### UMAP |
||||
- Uniform Manifold Approximation and Projection |
||||
- Dimensionality reduction for visualization |
||||
- Optional dependency with graceful SVD fallback |
||||
|
||||
--- |
||||
|
||||
## Database Table Reference |
||||
|
||||
| Table | Key Fields | |
||||
|-------|-----------| |
||||
| `motions` | id, title, date, category | |
||||
| `mp_votes` | mp_id, motion_id, vote | |
||||
| `svd_vectors` | entity_id, window, vector_2d (list[2]) | |
||||
| `mp_party_history` | mp_id, party, start_date, end_date | |
||||
| `windows` | window_id, start_date, end_date, period_type | |
||||
| `mp_trajectories` | mp_id, window, trajectory_vector | |
||||
|
||||
--- |
||||
|
||||
## Dutch Political Parties |
||||
|
||||
### Canonical Right-Wing (centroid on RIGHT of axes) |
||||
- PVV (Partij voor de Vrijheid) |
||||
- FVD (Forum voor Democratie) |
||||
- JA21 |
||||
- SGP (Staatkundig Gereformeerde Partij) |
||||
|
||||
### Other Major Parties |
||||
- VVD (Volkspartij voor Vrijheid en Democratie) |
||||
- GL-PvdA (GroenLinks-PvdA) |
||||
- NSC (Nieuw Sociaal Contract) |
||||
- BBB (BoerBurgerBeweging) |
||||
- SP (Socialistische Partij) |
||||
- D66 (Democraten 66) |
||||
@ -1,196 +0,0 @@ |
||||
"""Example: TweedeKamerAPI usage - from api_client.py and actual codebase.""" |
||||
|
||||
from datetime import datetime, timedelta |
||||
from typing import Dict, List |
||||
|
||||
# Import the API client |
||||
from api_client import TweedeKamerAPI |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Basic API usage |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_fetch_motions(): |
||||
"""Fetch recent parliamentary motions from TweedeKamer API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
# Fetch motions from last 30 days |
||||
start_date = datetime.now() - timedelta(days=30) |
||||
|
||||
try: |
||||
motions = api.get_motions(start_date=start_date, limit=100) |
||||
|
||||
print(f"Fetched {len(motions)} motions") |
||||
|
||||
for motion in motions[:5]: # Show first 5 |
||||
print(f" - {motion.get('title', 'N/A')}") |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Fetching with date range |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_date_range(): |
||||
"""Fetch motions from a specific date range.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start = datetime(2024, 1, 1) |
||||
end = datetime(2024, 3, 31) # Q1 2024 |
||||
|
||||
try: |
||||
motions = api.get_motions(start_date=start, end_date=end, limit=500) |
||||
|
||||
# Group by policy area |
||||
by_area = {} |
||||
for m in motions: |
||||
area = m.get("policy_area", "Onbekend") |
||||
by_area.setdefault(area, []).append(m) |
||||
|
||||
for area, area_motions in sorted(by_area.items()): |
||||
print(f"{area}: {len(area_motions)} motions") |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Context manager usage |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_context_manager(): |
||||
"""Use API client as context manager.""" |
||||
|
||||
with TweedeKamerAPI() as api: |
||||
motions = api.get_motions( |
||||
start_date=datetime.now() - timedelta(days=7), limit=50 |
||||
) |
||||
|
||||
print(f"Fetched {len(motions)} motions this week") |
||||
|
||||
return motions |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Processing voting records |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_process_votes(): |
||||
"""Process individual voting records from API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start_date = datetime.now() - timedelta(days=7) |
||||
|
||||
try: |
||||
# Get voting records directly |
||||
voting_records, besluit_meta = api._get_voting_records( |
||||
start_date=start_date, limit=1000 |
||||
) |
||||
|
||||
print(f"Fetched {len(voting_records)} voting records") |
||||
print(f"From {len(besluit_meta)} unique decisions") |
||||
|
||||
# Count votes by party |
||||
party_votes = {} |
||||
for record in voting_records: |
||||
party = record.get("Fractie", "Onbekend") |
||||
vote = record.get("Soort", "Onbekend") |
||||
party_votes.setdefault(party, {})[vote] = ( |
||||
party_votes.get(party, {}).get(vote, 0) + 1 |
||||
) |
||||
|
||||
for party, votes in sorted(party_votes.items()): |
||||
total = sum(votes.values()) |
||||
voor = votes.get("Voor", 0) |
||||
print(f"{party}: {total} votes ({voor} voor)") |
||||
|
||||
return voting_records |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 5: Safe API call with fallback |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_safe_call(): |
||||
"""Make API call with safe fallback on failure.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
try: |
||||
# This will return [] on any error |
||||
motions = api.get_motions( |
||||
start_date=datetime.now() - timedelta(days=30), limit=100 |
||||
) |
||||
|
||||
if not motions: |
||||
print("No motions returned - using cached data") |
||||
# Fallback to cached/local data |
||||
from database import db |
||||
|
||||
return db.get_filtered_motions(limit=10) |
||||
|
||||
return motions |
||||
finally: |
||||
api.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 6: Pagination handling |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_pagination(): |
||||
"""Understand how pagination works in the API.""" |
||||
|
||||
api = TweedeKamerAPI() |
||||
|
||||
start_date = datetime.now() - timedelta(days=365) |
||||
|
||||
# Simulate pagination |
||||
page_size = 250 |
||||
total_limit = 500 |
||||
|
||||
all_motions = [] |
||||
skip = 0 |
||||
|
||||
while len(all_motions) < total_limit: |
||||
print(f"Fetching page with skip={skip}...") |
||||
|
||||
# In real usage, get_motions handles pagination internally |
||||
# This demonstrates what's happening under the hood |
||||
page_motions = api._fetch_page(start_date=start_date, skip=skip, top=page_size) |
||||
|
||||
if not page_motions: |
||||
break |
||||
|
||||
all_motions.extend(page_motions) |
||||
skip += page_size |
||||
|
||||
if len(page_motions) < page_size: |
||||
break # Last page |
||||
|
||||
print(f"Total fetched: {len(all_motions)} motions") |
||||
return all_motions |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
print("=== Basic Fetch ===") |
||||
example_fetch_motions() |
||||
|
||||
print("\n=== Process Votes ===") |
||||
example_process_votes() |
||||
@ -1,191 +0,0 @@ |
||||
"""Example: MotionDatabase usage - from database.py and actual codebase.""" |
||||
|
||||
from typing import Dict, List, Optional |
||||
import duckdb |
||||
import json |
||||
from config import config |
||||
|
||||
# Import the singleton instance |
||||
from database import db |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Getting filtered motions |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_get_filtered_motions(): |
||||
"""Get controversial motions from a specific policy area.""" |
||||
|
||||
motions = db.get_filtered_motions( |
||||
policy_area="Klimaat", |
||||
min_margin=0.0, |
||||
max_margin=0.3, # Controversial: close margin |
||||
limit=10, |
||||
) |
||||
|
||||
for motion in motions: |
||||
print(f"{motion['title']}: {motion['winning_margin']:.1%} margin") |
||||
|
||||
return motions |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Creating a voting session |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_voting_session(): |
||||
"""Create a new user session and record votes.""" |
||||
|
||||
# Create session for 10 motions |
||||
session_id = db.create_session(total_motions=10) |
||||
print(f"Created session: {session_id}") |
||||
|
||||
# Get motions for the session |
||||
motions = db.get_filtered_motions(policy_area="Alle", limit=10) |
||||
|
||||
# Record votes |
||||
for motion in motions: |
||||
# In real app, user would choose vote |
||||
vote = "Voor" # Example vote |
||||
db.record_vote(session_id=session_id, motion_id=motion["id"], vote=vote) |
||||
|
||||
# Get results |
||||
results = db.get_party_results(session_id) |
||||
|
||||
for party, result in sorted(results.items(), key=lambda x: -x[1]["agreement"]): |
||||
print(f"{party}: {result['agreement']:.1%} agreement") |
||||
|
||||
return results |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Working with DuckDB connections directly |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_direct_duckdb(): |
||||
"""Example of proper DuckDB connection handling.""" |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
# Get motion with votes |
||||
result = conn.execute( |
||||
""" |
||||
SELECT m.*, |
||||
JSON_EXTRACT(voting_results, '$.total_votes') as total_votes |
||||
FROM motions m |
||||
WHERE m.id = ? |
||||
""", |
||||
(123,), |
||||
).fetchone() |
||||
|
||||
if result: |
||||
print(f"Motion: {result[1]}") # title is index 1 |
||||
|
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Bulk operations |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_bulk_insert(): |
||||
"""Example of bulk inserting motions.""" |
||||
|
||||
# Sample data |
||||
motions = [ |
||||
{ |
||||
"title": "Motion about climate policy", |
||||
"description": "Proposal to reduce emissions", |
||||
"date": "2024-01-15", |
||||
"policy_area": "Klimaat", |
||||
"voting_results": json.dumps({"Voor": 75, "Tegen": 65}), |
||||
"winning_margin": 0.07, |
||||
"controversy_score": 0.85, |
||||
}, |
||||
{ |
||||
"title": "Motion about healthcare", |
||||
"description": "Increase healthcare budget", |
||||
"date": "2024-01-20", |
||||
"policy_area": "Zorg", |
||||
"voting_results": json.dumps({"Voor": 90, "Tegen": 50}), |
||||
"winning_margin": 0.29, |
||||
"controversy_score": 0.42, |
||||
}, |
||||
] |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
for motion in motions: |
||||
conn.execute( |
||||
""" |
||||
INSERT INTO motions |
||||
(title, description, date, policy_area, voting_results, |
||||
winning_margin, controversy_score) |
||||
VALUES (?, ?, ?, ?, ?, ?, ?) |
||||
""", |
||||
( |
||||
motion["title"], |
||||
motion["description"], |
||||
motion["date"], |
||||
motion["policy_area"], |
||||
motion["voting_results"], |
||||
motion["winning_margin"], |
||||
motion["controversy_score"], |
||||
), |
||||
) |
||||
print(f"Inserted {len(motions)} motions") |
||||
except Exception as e: |
||||
print(f"Error inserting motions: {e}") |
||||
finally: |
||||
# Close the connection exactly once, on success or failure |
||||
conn.close() |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 5: Query with aggregation |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_aggregation(): |
||||
"""Example of aggregate queries.""" |
||||
|
||||
conn = duckdb.connect(config.DATABASE_PATH) |
||||
try: |
||||
# Get statistics by policy area |
||||
results = conn.execute(""" |
||||
SELECT |
||||
policy_area, |
||||
COUNT(*) as motion_count, |
||||
AVG(winning_margin) as avg_margin, |
||||
AVG(controversy_score) as avg_controversy |
||||
FROM motions |
||||
WHERE policy_area IS NOT NULL |
||||
GROUP BY policy_area |
||||
ORDER BY motion_count DESC |
||||
""").fetchall() |
||||
|
||||
for row in results: |
||||
print( |
||||
f"{row[0]}: {row[1]} motions, " |
||||
f"avg margin {row[2]:.1%}, " |
||||
f"controversy {row[3]:.2f}" |
||||
) |
||||
|
||||
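return results |
||||
except Exception as e: |
||||
# Report the failure instead of swallowing it silently |
||||
print(f"Error running aggregation: {e}") |
||||
return [] |
||||
finally: |
||||
conn.close() |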
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
print("=== Filtered Motions ===") |
||||
example_get_filtered_motions() |
||||
|
||||
print("\n=== Aggregation ===") |
||||
example_aggregation() |
||||
@ -1,116 +0,0 @@ |
||||
# Extracted pattern examples (representative snippets) |
||||
|
||||
Note: snippets are verbatim extracts from repository files (Phase 1). Paths shown. |
||||
|
||||
## DuckDB connect + schema init (database.py) |
||||
```python |
||||
conn = duckdb.connect(self.db_path) |
||||
|
||||
# Create sequence for auto-incrementing IDs |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except: |
||||
pass |
||||
|
||||
# Create tables with proper ID handling |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS motions ( |
||||
id INTEGER DEFAULT nextval('motions_id_seq'), |
||||
title TEXT NOT NULL, |
||||
description TEXT, |
||||
date DATE, |
||||
policy_area TEXT, |
||||
voting_results JSON, |
||||
winning_margin FLOAT, |
||||
controversy_score FLOAT, |
||||
layman_explanation TEXT, |
||||
externe_identifier TEXT, |
||||
body_text TEXT, |
||||
url TEXT UNIQUE, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
conn.close() |
||||
``` |
||||
|
||||
## Read-only compute worker (svd_pipeline.py) |
||||
```python |
||||
conn = duckdb.connect(db_path, read_only=True) |
||||
try: |
||||
rows = conn.execute( |
||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
||||
(start_date, end_date), |
||||
).fetchall() |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
|
||||
## Requests with retry/backoff (ai_provider.py) |
||||
```python |
||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
||||
... |
||||
if getattr(resp, "status_code", 0) == 429: |
||||
if attempt == retries: |
||||
raise ProviderError(f"Provider returned HTTP {resp.status_code}") |
||||
retry_after = None |
||||
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None |
||||
if raw: |
||||
try: |
||||
retry_after = int(raw) |
||||
except Exception: |
||||
try: |
||||
dt = parsedate_to_datetime(raw) |
||||
now = datetime.now(tz=dt.tzinfo or timezone.utc) |
||||
secs = (dt - now).total_seconds() |
||||
retry_after = max(0, int(secs)) |
||||
except Exception: |
||||
retry_after = None |
||||
|
||||
if retry_after is not None: |
||||
time.sleep(retry_after) |
||||
continue |
||||
``` |
||||
|
||||
## Embedding batch + per-item fallback (pipeline/ai_provider_wrapper.py) |
||||
```python |
||||
for start in range(0, len(texts), batch_size): |
||||
    i = start |
||||
    end = min(start + batch_size, len(texts)) |
||||
    chunk = texts[i:end] |
||||
    emb_chunk, emb_exc = _attempt_batch(chunk, i) |
||||
    if emb_chunk is not None: |
||||
        for j, emb in enumerate(emb_chunk): |
||||
            results[i + j] = emb |
||||
        continue |
||||
|
||||
    # batch failed -> fallback to per-item attempts |
||||
    for j in range(i, end): |
||||
        t = texts[j] |
||||
        single, single_exc = _attempt_batch([t], j) |
||||
        if single: |
||||
            results[j] = single[0] |
||||
            continue |
||||
        results[j] = None |
||||
``` |
||||
|
||||
## Similarity compute (similarity/compute.py) |
||||
```python |
||||
# Ensure consistent dimensionality: pad shorter vectors with zeros |
||||
lengths = [len(v) for v in vecs] |
||||
max_dim = max(lengths) |
||||
if len(set(lengths)) != 1: |
||||
logger.warning( |
||||
"Inconsistent vector dimensions detected (max=%d). Padding shorter vectors with zeros.", |
||||
max_dim, |
||||
) |
||||
|
||||
matrix = np.zeros((len(vecs), max_dim), dtype=np.float32) |
||||
for i, v in enumerate(vecs): |
||||
matrix[i, : len(v)] = v |
||||
|
||||
# Normalize rows and compute cosine similarity |
||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
||||
norms[norms == 0] = 1.0 |
||||
normalized = matrix / norms |
||||
sim = normalized @ normalized.T |
||||
``` |
||||
@ -1,217 +0,0 @@ |
||||
"""Example: Pipeline phase execution - from pipeline/run_pipeline.py and actual codebase.""" |
||||
|
||||
import argparse |
||||
from datetime import date, timedelta |
||||
from typing import List, Tuple |
||||
|
||||
# Import pipeline modules |
||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
||||
from pipeline.extract_mp_votes import extract_mp_votes |
||||
from pipeline.svd_pipeline import run_svd_pipeline |
||||
from pipeline.text_pipeline import run_text_pipeline |
||||
from pipeline.fusion import run_fusion |
||||
|
||||
from database import MotionDatabase |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 1: Running full pipeline |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_full_pipeline(): |
||||
"""Run the complete data ingestion pipeline.""" |
||||
|
||||
# Parse arguments like CLI would |
||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
||||
parser.add_argument("--db-path", default="data/motions.db") |
||||
parser.add_argument("--start-date", default=None) |
||||
parser.add_argument("--end-date", default=None) |
||||
parser.add_argument( |
||||
"--window-size", choices=["quarterly", "annual"], default="quarterly" |
||||
) |
||||
parser.add_argument("--svd-k", type=int, default=50) |
||||
|
||||
args = parser.parse_args([]) |
||||
|
||||
# Resolve dates |
||||
end_date = date.fromisoformat(args.end_date) if args.end_date else date.today() |
||||
start_date = ( |
||||
date.fromisoformat(args.start_date) |
||||
if args.start_date |
||||
else end_date - timedelta(days=730) |
||||
) |
||||
|
||||
print(f"Running pipeline: {start_date} → {end_date}") |
||||
print(f"Window size: {args.window_size}") |
||||
print(f"DB path: {args.db_path}") |
||||
|
||||
# Initialize database |
||||
db = MotionDatabase(args.db_path) |
||||
|
||||
# Phase 1: Fetch MP metadata |
||||
print("\n=== Phase 1: MP Metadata ===") |
||||
n_mp = fetch_mp_metadata(db_path=args.db_path) |
||||
print(f"Processed {n_mp} MPs") |
||||
|
||||
# Phase 2: Extract MP votes |
||||
print("\n=== Phase 2: Extract Votes ===") |
||||
n_votes = extract_mp_votes(db_path=args.db_path) |
||||
print(f"Extracted {n_votes} vote records") |
||||
|
||||
# Phase 3: SVD pipeline (generate time windows, then run SVD per window) |
||||
print("\n=== Phase 3: SVD Pipeline ===") |
||||
windows = generate_windows(start_date, end_date, args.window_size) |
||||
print(f"Generated {len(windows)} windows: {windows}") |
||||
|
||||
# SVD per window (still part of Phase 3) |
||||
run_svd_pipeline(db, windows, args.svd_k) |
||||
print(f"Computed SVD for {len(windows)} windows") |
||||
|
||||
# Phase 4: Text embeddings |
||||
print("\n=== Phase 4: Text Embeddings ===") |
||||
run_text_pipeline(args.db_path, batch_size=50) |
||||
print("Text embeddings completed") |
||||
|
||||
# Phase 5: Fusion |
||||
print("\n=== Phase 5: Fusion ===") |
||||
run_fusion(args.db_path, windows) |
||||
print("Fusion completed") |
||||
|
||||
print("\n=== Pipeline Complete ===") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 2: Generate time windows |
||||
# ============================================================================= |
||||
|
||||
|
||||
def generate_windows( |
||||
start: date, end: date, granularity: str |
||||
) -> List[Tuple[str, str, str]]: |
||||
"""Generate time windows for pipeline processing.""" |
||||
|
||||
windows = [] |
||||
cursor = date(start.year, start.month, 1) |
||||
|
||||
if granularity == "annual": |
||||
cursor = date(start.year, 1, 1) |
||||
while cursor <= end: |
||||
year_end = date(cursor.year, 12, 31) |
||||
w_end = min(year_end, end) |
||||
windows.append((str(cursor.year), cursor.isoformat(), w_end.isoformat())) |
||||
cursor = date(cursor.year + 1, 1, 1) |
||||
else: |
||||
# quarterly |
||||
quarter_starts = {1: 1, 2: 4, 3: 7, 4: 10} |
||||
quarter_ends = {1: 3, 2: 6, 3: 9, 4: 12} |
||||
|
||||
q = (cursor.month - 1) // 3 + 1 |
||||
cursor = date(cursor.year, quarter_starts[q], 1) |
||||
|
||||
import calendar  # imported once, before the loop |
||||
while cursor <= end: |
||||
q = (cursor.month - 1) // 3 + 1 |
||||
|
||||
q_end_month = quarter_ends[q] |
||||
last_day = calendar.monthrange(cursor.year, q_end_month)[1] |
||||
q_end = date(cursor.year, q_end_month, last_day) |
||||
w_end = min(q_end, end) |
||||
window_id = f"{cursor.year}-Q{q}" |
||||
windows.append((window_id, cursor.isoformat(), w_end.isoformat())) |
||||
cursor = q_end + timedelta(days=1) |
||||
|
||||
return windows |
||||
|
||||
|
||||
def example_window_generation(): |
||||
"""Example of window generation.""" |
||||
|
||||
start = date(2023, 1, 1) |
||||
end = date(2024, 6, 30) |
||||
|
||||
print("Quarterly windows:") |
||||
quarterly = generate_windows(start, end, "quarterly") |
||||
for wid, s, e in quarterly: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
print("\nAnnual windows:") |
||||
annual = generate_windows(start, end, "annual") |
||||
for wid, s, e in annual: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 3: Running individual phases |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_individual_phases(): |
||||
"""Run pipeline phases individually for debugging.""" |
||||
|
||||
db_path = "data/motions.db" |
||||
db = MotionDatabase(db_path) |
||||
|
||||
# Only run MP metadata fetch |
||||
print("Fetching MP metadata...") |
||||
n = fetch_mp_metadata(db_path=db_path) |
||||
print(f" {n} MPs processed") |
||||
|
||||
# Only run vote extraction |
||||
print("Extracting votes...") |
||||
n = extract_mp_votes(db_path=db_path) |
||||
print(f" {n} votes extracted") |
||||
|
||||
# Only run SVD for specific window |
||||
print("Computing SVD...") |
||||
windows = [("2024-Q1", "2024-01-01", "2024-03-31")] |
||||
run_svd_pipeline(db, windows, k=50) |
||||
print(" SVD computed") |
||||
|
||||
# Only run text embeddings |
||||
print("Computing embeddings...") |
||||
run_text_pipeline(db_path, batch_size=25) # Smaller batch for testing |
||||
print(" Embeddings computed") |
||||
|
||||
|
||||
# ============================================================================= |
||||
# Example 4: Dry run |
||||
# ============================================================================= |
||||
|
||||
|
||||
def example_dry_run(): |
||||
"""Show what pipeline would do without making changes.""" |
||||
|
||||
print("DRY RUN - no writes will be made") |
||||
|
||||
start_date = date(2024, 1, 1) |
||||
end_date = date(2024, 6, 30) |
||||
|
||||
# Generate and show windows |
||||
windows = generate_windows(start_date, end_date, "quarterly") |
||||
|
||||
print(f"Would process {len(windows)} windows:") |
||||
for wid, s, e in windows: |
||||
print(f" {wid}: {s} to {e}") |
||||
|
||||
print("\nWould run phases:") |
||||
print(" 1. fetch_mp_metadata") |
||||
print(" 2. extract_mp_votes") |
||||
print(" 3. svd_pipeline") |
||||
print(" 4. text_pipeline") |
||||
print(" 5. fusion") |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
import logging |
||||
|
||||
logging.basicConfig( |
||||
level=logging.INFO, |
||||
format="%(asctime)s %(levelname)s %(name)s: %(message)s", |
||||
) |
||||
|
||||
print("=== Window Generation ===") |
||||
example_window_generation() |
||||
|
||||
print("\n=== Dry Run ===") |
||||
example_dry_run() |
||||
@ -1,108 +0,0 @@ |
||||
# stemwijzer Mind Model - Manifest |
||||
# Generated: 2026-04-12 |
||||
# Phase: 2 - Assembly from Phase 1 Analysis |
||||
|
||||
name: stemwijzer |
||||
version: 2 |
||||
description: Dutch political voting compass (Stemwijzer) - Mind Model constraints |
||||
|
||||
categories: |
||||
# Core documentation |
||||
- path: system.md |
||||
description: System overview and architecture summary |
||||
group: docs |
||||
- path: stack/stack.md |
||||
description: Technology stack with versions and purposes |
||||
group: stack |
||||
- path: domain/domain-glossary.md |
||||
description: Domain entities, terms, relationships, and CRITICAL INVARIANTS |
||||
group: domain |
||||
|
||||
# Design patterns |
||||
- path: patterns/patterns.yaml |
||||
description: Code patterns (Singleton, Repository, Pipeline, etc.) |
||||
group: patterns |
||||
- path: patterns/streamlit.yaml |
||||
description: Streamlit-specific patterns (session state, cache) |
||||
group: patterns |
||||
- path: patterns/api.yaml |
||||
description: API client patterns with retry and pagination |
||||
group: patterns |
||||
- path: patterns/database.yaml |
||||
description: DuckDB patterns and connection management |
||||
group: patterns |
||||
- path: patterns/python.yaml |
||||
description: Python-specific patterns (dataclass, typing) |
||||
group: patterns |
||||
- path: patterns/duckdb-access.md |
||||
description: DuckDB connection patterns and best practices |
||||
group: patterns |
||||
- path: patterns/embeddings-similarity.md |
||||
description: Embeddings and similarity computation patterns |
||||
group: patterns |
||||
- path: patterns/error-handling.md |
||||
description: Error handling and exception patterns |
||||
group: patterns |
||||
- path: patterns/module-singletons.md |
||||
description: Module-level singleton patterns |
||||
group: patterns |
||||
- path: patterns/requests-http.md |
||||
description: HTTP client patterns with retry |
||||
group: patterns |
||||
- path: patterns/validation.md |
||||
description: Input validation patterns |
||||
group: patterns |
||||
|
||||
# Coding constraints |
||||
- path: constraints/error-handling.md |
||||
description: Error handling patterns with safe fallbacks |
||||
group: constraints |
||||
- path: constraints/logging.md |
||||
description: Logging conventions |
||||
group: constraints |
||||
- path: constraints/naming.yaml |
||||
description: File, class, function naming rules |
||||
group: constraints |
||||
- path: constraints/imports.yaml |
||||
description: Import organization and module structure |
||||
group: constraints |
||||
- path: constraints/types.yaml |
||||
description: Type hint conventions |
||||
group: constraints |
||||
- path: constraints/testing.yaml |
||||
description: Testing conventions |
||||
group: constraints |
||||
|
||||
# Anti-patterns |
||||
- path: anti-patterns/anti-patterns.md |
||||
description: Known anti-patterns with evidence and fixes |
||||
group: anti-patterns |
||||
|
||||
# Dependencies |
||||
- path: dependencies/dependencies.md |
||||
description: Library usage and singleton instances |
||||
group: dependencies |
||||
|
||||
# Code examples |
||||
- path: examples/database-example.py |
||||
description: MotionDatabase usage examples |
||||
group: examples |
||||
- path: examples/api-client-example.py |
||||
description: TweedeKamerAPI usage examples |
||||
group: examples |
||||
- path: examples/pipeline-example.py |
||||
description: Pipeline orchestration examples |
||||
group: examples |
||||
- path: examples/streamlit-page-example.py |
||||
description: Streamlit page patterns |
||||
group: examples |
||||
- path: examples/pattern-examples.md |
||||
description: Consolidated pattern examples |
||||
group: examples |
||||
|
||||
# Phase 1 findings summary: |
||||
# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN) |
||||
# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline |
||||
# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except |
||||
# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons |
||||
# - 3 groups: stdlib, 3rd party, local imports |
||||
@ -1,265 +0,0 @@ |
||||
# API Client Patterns |
||||
|
||||
## Base API Client Pattern |
||||
|
||||
Using requests.Session for connection pooling: |
||||
|
||||
```python |
||||
# api_client.py |
||||
from datetime import datetime, timedelta |
||||
import requests |
||||
from typing import Dict, List, Optional |
||||
from config import config |
||||
|
||||
class TweedeKamerAPI: |
||||
def __init__(self): |
||||
self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0" |
||||
self.session = requests.Session() |
||||
self.session.headers.update({ |
||||
"Accept": "application/json", |
||||
"User-Agent": "Dutch-Political-Compass-Tool/1.0", |
||||
}) |
||||
|
||||
def get_motions( |
||||
self, |
||||
start_date: Optional[datetime] = None, |
||||
end_date: Optional[datetime] = None, |
||||
limit: int = 500, |
||||
) -> List[Dict]: |
||||
"""Get motions with voting results using OData API.""" |
||||
if not start_date: |
||||
start_date = datetime.now() - timedelta(days=730) |
||||
|
||||
try: |
||||
voting_records, besluit_meta = self._get_voting_records( |
||||
start_date, end_date, limit |
||||
) |
||||
return self._process_voting_records(voting_records, besluit_meta) |
||||
except Exception as e: |
||||
print(f"Error fetching motions from API: {e}") |
||||
return [] |
||||
``` |
||||
|
||||
## OData Pagination Pattern |
||||
|
||||
Handle server-side pagination with $skip: |
||||
|
||||
```python |
||||
def _get_voting_records( |
||||
self, |
||||
start_date: datetime, |
||||
end_date: datetime = None, |
||||
limit: int = 50000 |
||||
) -> tuple: |
||||
"""Fetch with automatic pagination.""" |
||||
|
||||
filter_query = ( |
||||
f"GewijzigdOp ge {start_date.strftime('%Y-%m-%d')}T00:00:00Z" |
||||
" and StemmingsSoort ne null" |
||||
" and Verwijderd eq false" |
||||
) |
||||
|
||||
page_size = 250 # API caps $top at 250 |
||||
base_url = f"{self.odata_base_url}/Besluit" |
||||
base_params = { |
||||
"$filter": filter_query, |
||||
"$top": page_size, |
||||
"$expand": "Stemming", |
||||
"$orderby": "GewijzigdOp desc", |
||||
} |
||||
|
||||
all_records = [] |
||||
skip = 0 |
||||
|
||||
while len(all_records) < limit: |
||||
params = {**base_params, "$skip": skip} |
||||
response = self.session.get( |
||||
base_url, |
||||
params=params, |
||||
timeout=config.API_TIMEOUT |
||||
) |
||||
response.raise_for_status() |
||||
data = response.json() |
||||
|
||||
besluit_page = data.get("value", []) |
||||
if not besluit_page: |
||||
break |
||||
|
||||
# Process page |
||||
for besluit in besluit_page: |
||||
all_records.extend(self._extract_votes(besluit)) |
||||
|
||||
skip += page_size |
||||
|
||||
return all_records |
||||
``` |
||||
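
The `$skip` loop above can be exercised without the network by injecting the page fetcher. The following is a minimal sketch under stated assumptions: `fetch_all_pages` and `fake_fetch_page` are hypothetical names that do not appear in the removed files, and the fetcher stands in for the `self.session.get(...)` call. Unlike the excerpt, it trims the result so that it never exceeds `limit`.

```python
from typing import Callable, Dict, List


def fetch_all_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 250,
    limit: int = 50000,
) -> List[Dict]:
    """Collect records page by page until a page comes back empty or `limit` is reached."""
    records: List[Dict] = []
    skip = 0
    while len(records) < limit:
        page = fetch_page(skip, page_size)  # hypothetical stand-in for the HTTP request
        if not page:
            break  # an empty page marks the end of the result set
        records.extend(page)
        skip += page_size
    # The last page may push the total past `limit`, so trim before returning.
    return records[:limit]


def fake_fetch_page(skip: int, top: int) -> List[Dict]:
    """In-memory stand-in for the API: 600 records, sliced like $skip/$top."""
    data = [{"Id": i} for i in range(600)]
    return data[skip : skip + top]


all_rows = fetch_all_pages(fake_fetch_page)
capped = fetch_all_pages(fake_fetch_page, limit=300)
print(len(all_rows), len(capped))
```

Passing the fetcher as a parameter keeps the paging logic testable on its own; the real client would wrap its session call in a function with the same `(skip, top)` shape.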
|
||||
## Retry with Backoff Pattern |
||||
|
||||
For transient failures: |
||||
|
||||
```python |
||||
# ai_provider.py |
||||
import time |
||||
import random |
||||
import requests |
||||
from requests.exceptions import ConnectionError |
||||
|
||||
def _post_with_retries( |
||||
path: str, |
||||
json: dict, |
||||
retries: int = 3 |
||||
) -> requests.Response: |
||||
"""POST with exponential backoff retry.""" |
||||
|
||||
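backoff = 0.5  # base delay in seconds; `url` and `headers` are not defined in this excerpt and are assumed to be built elsewhere in the module |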
||||
for attempt in range(1, retries + 1): |
||||
try: |
||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
||||
|
||||
# Handle rate limiting |
||||
if resp.status_code == 429: |
||||
if attempt == retries: |
||||
raise ProviderError("Rate limited") |
||||
|
||||
retry_after = resp.headers.get("Retry-After") |
||||
if retry_after: |
||||
time.sleep(int(retry_after)) |
||||
else: |
||||
sleep = backoff * (2 ** (attempt - 1)) |
||||
sleep += random.uniform(0, sleep * 0.1) |
||||
time.sleep(sleep) |
||||
continue |
||||
|
||||
# Handle server errors |
||||
if 500 <= resp.status_code < 600: |
||||
if attempt == retries: |
||||
raise ProviderError(f"Server error: {resp.status_code}") |
||||
time.sleep(backoff * (2 ** (attempt - 1))) |
||||
continue |
||||
|
||||
return resp |
||||
|
||||
except ConnectionError as exc: |
||||
if attempt == retries: |
||||
raise ProviderError(f"Connection error: {exc}") |
||||
time.sleep(backoff * (2 ** (attempt - 1))) |
||||
|
||||
raise ProviderError("Failed after retries") |
||||
``` |
||||
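
The delay arithmetic in the retry loop can be isolated into two small helpers. This is a sketch, not the removed module's code: `backoff_delay` and `retry_after_seconds` are hypothetical names. The second helper also covers a case the excerpt does not handle: a `Retry-After` header may carry an HTTP date instead of a number of seconds, in which case `int(retry_after)` would raise.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, jitter: float = 0.1) -> float:
    """Exponential delay for a 1-based attempt number, plus up to `jitter` (10%) extra."""
    delay = base * (2 ** (attempt - 1))
    return delay + random.uniform(0, delay * jitter)


def retry_after_seconds(
    value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Parse a Retry-After header given as delta-seconds or as an HTTP date.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to backoff_delay().
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

A caller would sleep for `retry_after_seconds(header)` when it returns a number and for `backoff_delay(attempt)` otherwise.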
|
||||
## Batch Processing Pattern |
||||
|
||||
Process items in batches to manage API limits: |
||||
|
||||
```python |
||||
def get_embeddings_with_retry( |
||||
texts: List[str], |
||||
batch_size: int = 50, |
||||
retries: int = 3, |
||||
) -> List[Optional[List[float]]]: |
||||
"""Process embeddings in batches with fallback to single items.""" |
||||
|
||||
results = [None] * len(texts) |
||||
|
||||
i = 0 |
||||
while i < len(texts): |
||||
end = min(len(texts), i + batch_size) |
||||
chunk = texts[i:end] |
||||
|
||||
# Try batch first |
||||
try: |
||||
emb_chunk = get_embeddings_batch(chunk) |
||||
for j, emb in enumerate(emb_chunk): |
||||
results[i + j] = emb |
||||
i = end |
||||
continue |
||||
except Exception: |
||||
pass |
||||
|
||||
# Fallback: single items |
||||
for j, text in enumerate(chunk): |
||||
try: |
||||
results[i + j] = get_embedding(text) |
||||
except Exception: |
||||
results[i + j] = None |
||||
|
||||
i = end |
||||
|
||||
return results |
||||
``` |
||||
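
The batch-then-single fallback can be checked offline by passing the two embedding calls in as parameters. The sketch below assumes hypothetical stand-ins (`embed_in_batches`, `fake_embed_batch`, `fake_embed_one`) in place of `get_embeddings_batch` and `get_embedding`, whose real behavior is not shown in the excerpt.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    embed_one: Callable[[str], Vector],
    batch_size: int = 50,
) -> List[Optional[Vector]]:
    """Embed texts chunk by chunk; if a chunk fails, retry its items one at a time."""
    results: List[Optional[Vector]] = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        chunk = texts[start : start + batch_size]
        try:
            vectors = embed_batch(chunk)
            if len(vectors) != len(chunk):
                raise ValueError("batch returned an unexpected number of vectors")
            for offset, vec in enumerate(vectors):
                results[start + offset] = vec
        except Exception:
            # Fallback: one bad item should not cost the whole chunk.
            for offset, text in enumerate(chunk):
                try:
                    results[start + offset] = embed_one(text)
                except Exception:
                    results[start + offset] = None
    return results


def fake_embed_one(text: str) -> Vector:
    """Toy embedder: rejects empty text, otherwise returns the text length."""
    if not text:
        raise ValueError("empty text")
    return [float(len(text))]


def fake_embed_batch(chunk: Sequence[str]) -> List[Vector]:
    """Toy batch embedder: fails as a whole if any item fails."""
    return [fake_embed_one(t) for t in chunk]


vectors = embed_in_batches(
    ["aa", "", "cccc", "d"], fake_embed_batch, fake_embed_one, batch_size=2
)
print(vectors)
```

The result list keeps one slot per input, so a `None` entry identifies exactly which text could not be embedded.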
|
||||
## Response Validation Pattern |
||||
|
||||
Validate API responses before processing: |
||||
|
||||
```python |
||||
def _process_response(self, response: requests.Response) -> Dict: |
||||
"""Validate and parse API response.""" |
||||
|
||||
response.raise_for_status() |
||||
data = response.json() |
||||
|
||||
if "value" not in data: |
||||
raise ValueError("Unexpected response format: missing 'value' key") |
||||
|
||||
return data |
||||
|
||||
def _validate_besluit(self, besluit: Dict) -> bool: |
||||
"""Check required fields exist.""" |
||||
required = ["Id", "GewijzigdOp"] |
||||
return all(field in besluit for field in required) |
||||
``` |
||||
|
||||
## Error Handling Patterns |
||||
|
||||
Always provide safe fallbacks: |
||||
|
||||
```python |
||||
def safe_api_call(self, endpoint: str, params: Dict = None) -> List[Dict]: |
||||
"""Call API with error handling and fallback.""" |
||||
try: |
||||
response = self.session.get( |
||||
endpoint, |
||||
params=params, |
||||
timeout=config.API_TIMEOUT |
||||
) |
||||
response.raise_for_status() |
||||
data = response.json() |
||||
return data.get("value", []) |
||||
except requests.Timeout: |
||||
_logger.warning(f"API timeout for {endpoint}") |
||||
return [] |
||||
except requests.HTTPError as e: |
||||
_logger.error(f"HTTP error: {e}") |
||||
return [] |
||||
except Exception as e: |
||||
_logger.error(f"API call failed: {e}") |
||||
return [] |
||||
``` |
||||
|
||||
## Session Management |
||||
|
||||
Reuse session for connection pooling: |
||||
|
||||
```python |
||||
class TweedeKamerAPI: |
||||
def __init__(self): |
||||
self.session = requests.Session() |
||||
self.session.headers.update({ |
||||
"Accept": "application/json", |
||||
"User-Agent": "Dutch-Political-Compass-Tool/1.0", |
||||
}) |
||||
|
||||
def close(self): |
||||
"""Clean up session when done.""" |
||||
self.session.close() |
||||
|
||||
def __enter__(self): |
||||
return self |
||||
|
||||
def __exit__(self, *args): |
||||
self.close() |
||||
|
||||
# Usage |
||||
with TweedeKamerAPI() as api: |
||||
motions = api.get_motions(start_date) |
||||
``` |
||||
@ -1,230 +0,0 @@ |
||||
# Architectural Patterns |
||||
|
||||
## Repository Pattern |
||||
|
||||
The `MotionDatabase` class acts as a repository, encapsulating all database operations behind a clean interface. |
||||
|
||||
```python |
||||
# database.py |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._init_database() |
||||
|
||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
||||
"""Get a single motion by ID.""" |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
||||
).fetchone() |
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
def get_filtered_motions( |
||||
self, |
||||
policy_area: str = "Alle", |
||||
min_margin: float = 0.0, |
||||
max_margin: float = 1.0, |
||||
limit: int = 10 |
||||
) -> List[Dict]: |
||||
"""Get filtered list of motions.""" |
||||
... |
||||
``` |
||||
|
||||
**Usage**: Import the singleton instance for all DB operations. |
||||
```python |
||||
from database import db |
||||
|
||||
motions = db.get_filtered_motions(policy_area="Klimaat", limit=20) |
||||
``` |
||||
|
||||
## Facade Pattern |
||||
|
||||
Simplified interfaces over complex subsystems. |
||||
|
||||
### MotionDatabase Facade |
||||
```python |
||||
# Single entry point for all database operations |
||||
db = MotionDatabase() # Singleton instance |
||||
|
||||
# Operations are abstracted: |
||||
db.create_session(total_motions) |
||||
db.record_vote(session_id, motion_id, vote) |
||||
db.get_party_results(session_id) |
||||
``` |
||||
|
||||
### API Client Facade |
||||
```python |
||||
# api_client.py |
||||
class TweedeKamerAPI: |
||||
def __init__(self): |
||||
self.session = requests.Session() # Connection pooling |
||||
|
||||
def get_motions(self, start_date, end_date) -> List[Dict]: |
||||
"""Simple interface hiding OData pagination details.""" |
||||
voting_records, besluit_meta = self._get_voting_records(start_date, end_date) |
||||
return self._process_voting_records(voting_records, besluit_meta) |
||||
``` |
||||
|
||||
### MotionScraper Facade |
||||
```python |
||||
# scraper.py (if used) |
||||
class MotionScraper: |
||||
def get_motion_content(self, url: str) -> Optional[str]: |
||||
"""Extract body text from official website.""" |
||||
... |
||||
``` |
||||
|
||||
## Pipeline Pattern |
||||
|
||||
Sequential phases with explicit dependencies: |
||||
|
||||
``` |
||||
pipeline/run_pipeline.py |
||||
├── Phase 1: fetch_mp_metadata |
||||
│ └── pipeline/fetch_mp_metadata.py |
||||
├── Phase 2: extract_mp_votes |
||||
│ └── pipeline/extract_mp_votes.py |
||||
├── Phase 3: svd_pipeline |
||||
│ └── pipeline/svd_pipeline.py |
||||
├── Phase 4: text_pipeline (gap-fill) |
||||
│ └── pipeline/text_pipeline.py |
||||
└── Phase 5: fusion (combine SVD + text) |
||||
└── pipeline/fusion.py |
||||
``` |
||||
|
||||
### Phase Orchestration |
||||
```python |
||||
# pipeline/run_pipeline.py |
||||
def run(args: argparse.Namespace) -> int: |
||||
db = MotionDatabase(args.db_path) |
||||
|
||||
# Phase 1: MP metadata |
||||
if not args.skip_metadata: |
||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
||||
fetch_mp_metadata(db_path=db.db_path) |
||||
|
||||
# Phase 2: Extract votes |
||||
if not args.skip_extract: |
||||
from pipeline.extract_mp_votes import extract_mp_votes |
||||
extract_mp_votes(db_path=db.db_path) |
||||
|
||||
# Phase 3: SVD per window |
||||
if not args.skip_svd: |
||||
from pipeline.svd_pipeline import run_svd_pipeline |
||||
run_svd_pipeline(db, windows, args.svd_k) |
||||
|
||||
# ... additional phases |
||||
``` |
||||
|
||||
## Strategy Pattern |
||||
|
||||
Interchangeable algorithms for axis computation: |
||||
|
||||
```python |
||||
# analysis/political_axis.py |
||||
def compute_political_axis( |
||||
vectors: Dict[str, np.ndarray], |
||||
method: str = "pca" # or "anchor" |
||||
) -> Tuple[np.ndarray, np.ndarray]: |
||||
"""Compute political axis using specified method. |
||||
|
||||
Methods: |
||||
- 'pca': Use first principal component |
||||
- 'anchor': Use predefined anchor motions |
||||
""" |
||||
if method == "pca": |
||||
return _compute_pca_axis(vectors) |
||||
elif method == "anchor": |
||||
return _compute_anchor_axis(vectors) |
||||
``` |
||||
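
As written, the `if`/`elif` chain returns `None` for an unrecognized `method`. One possible alternative is a lookup table that fails loudly on unknown names. The sketch below is an illustration only: `_mean_axis` and `_first_axis` are toy placeholder strategies, not the PCA or anchor computations from `analysis/political_axis.py`, and plain lists stand in for numpy arrays.

```python
from typing import Callable, Dict, List

Vector = List[float]
AxisStrategy = Callable[[Dict[str, Vector]], Vector]


def _mean_axis(vectors: Dict[str, Vector]) -> Vector:
    """Placeholder strategy: component-wise mean of all vectors."""
    columns = zip(*vectors.values())
    return [sum(col) / len(vectors) for col in columns]


def _first_axis(vectors: Dict[str, Vector]) -> Vector:
    """Placeholder strategy: the vector of the alphabetically first key."""
    return list(vectors[sorted(vectors)[0]])


# Registry of interchangeable algorithms, keyed by method name.
STRATEGIES: Dict[str, AxisStrategy] = {
    "mean": _mean_axis,
    "first": _first_axis,
}


def compute_axis(vectors: Dict[str, Vector], method: str = "mean") -> Vector:
    """Dispatch to the selected strategy; reject unknown method names."""
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        known = ", ".join(sorted(STRATEGIES))
        raise ValueError(f"Unknown method {method!r}; expected one of: {known}") from None
    return strategy(vectors)


sample = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(compute_axis(sample, "mean"), compute_axis(sample, "first"))
```

Adding a new algorithm then means registering one more entry in `STRATEGIES` rather than extending the conditional.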
|
||||
## Visitor Pattern |
||||
|
||||
External operations on data structures: |
||||
|
||||
```python |
||||
# analysis/trajectory.py |
||||
def _procrustes_align_windows( |
||||
window_vecs: Dict[str, Dict[str, np.ndarray]], |
||||
min_overlap: int = 5, |
||||
) -> Dict[str, Dict[str, np.ndarray]]: |
||||
"""Align SVD vectors across windows using Procrustes rotations. |
||||
|
||||
Takes the first window as reference and aligns each subsequent window |
||||
to it via orthogonal Procrustes on the set of common entities. |
||||
""" |
||||
``` |
||||
|
||||
## Builder Pattern |
||||
|
||||
Configuration assembled step by step through successive `add_argument` calls (argparse does not chain methods): |
||||
|
||||
```python |
||||
# CLI argument parsing |
||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
||||
parser.add_argument("--db-path", default="data/motions.db") |
||||
parser.add_argument("--start-date", default=None) |
||||
parser.add_argument("--end-date", default=None) |
||||
parser.add_argument("--window-size", choices=["quarterly", "annual"], default="quarterly") |
||||
parser.add_argument("--svd-k", type=int, default=50) |
||||
``` |
||||
|
||||
## Decorator Pattern |
||||
|
||||
Retry logic for transient failures: |
||||
|
||||
```python |
||||
# pipeline/ai_provider_wrapper.py |
||||
def get_embeddings_with_retry( |
||||
texts: List[str], |
||||
retries: int = 3, |
||||
batch_size: int = 50, |
||||
) -> List[Optional[List[float]]]: |
||||
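"""Return embeddings with automatic retry on failure.""" |
||||
backoff = 0.5  # assumed base delay in seconds (same value as _post_with_retries); `time` must be imported |
||||
for attempt in range(1, retries + 1): |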
||||
try: |
||||
return _embedder(texts, batch_size=len(texts)) |
||||
except Exception as exc: |
||||
if attempt == retries: |
||||
break |
||||
time.sleep(backoff * (2 ** (attempt - 1))) |
||||
return [None] * len(texts) # Safe fallback |
||||
``` |
||||
|
||||
## Data Patterns |
||||
|
||||
### Batch Processing |
||||
Process items in chunks to manage memory and API limits: |
||||
```python |
||||
for i in range(0, len(items), batch_size): |
||||
chunk = items[i:i + batch_size] |
||||
process_batch(chunk) |
||||
``` |
||||
|
||||
### Caching |
||||
Pre-compute and store expensive results: |
||||
```python |
||||
# SimilarityCache table stores computed similarities |
||||
db.get_similarity(motion_a, motion_b) |
||||
``` |
||||
|
||||
### Lazy Loading |
||||
Load data only when needed: |
||||
```python |
||||
class MotionDatabase: |
||||
@property |
||||
def _connection(self): |
||||
if self._conn is None: |
||||
self._conn = duckdb.connect(self.db_path) |
||||
return self._conn |
||||
``` |
||||
|
||||
### Vectorization |
||||
Use numpy for batch operations: |
||||
```python |
||||
vectors = np.array([v for v in entity_vectors.values()]) |
||||
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True) |
||||
``` |
||||
@ -1,239 +0,0 @@ |
||||
# DuckDB Database Patterns |
||||
|
||||
## Connection Management |
||||
|
||||
### Pattern 1: Short-lived per Method (Most Common) |
||||
|
||||
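Always create a new connection and close it on both the success path and the error path (a `finally` block, as in the Error Handling section, avoids repeating `conn.close()`): |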
||||
|
||||
```python |
||||
# database.py |
||||
class MotionDatabase: |
||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", |
||||
(motion_id,) |
||||
).fetchone() |
||||
conn.close() |
||||
return result |
||||
except Exception: |
||||
conn.close() |
||||
return None |
||||
|
||||
def get_filtered_motions( |
||||
self, |
||||
policy_area: str = "Alle", |
||||
min_margin: float = 0.0, |
||||
max_margin: float = 1.0, |
||||
limit: int = 10 |
||||
) -> List[Dict]: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
query = """ |
||||
SELECT * FROM motions |
||||
WHERE (? = 'Alle' OR policy_area = ?) |
||||
AND winning_margin BETWEEN ? AND ? |
||||
ORDER BY RANDOM() |
||||
LIMIT ? |
||||
""" |
||||
rows = conn.execute(query, (policy_area, policy_area, min_margin, max_margin, limit)).fetchall() |
||||
conn.close() |
||||
return rows |
||||
except Exception: |
||||
conn.close() |
||||
return [] |
||||
``` |
||||
|
||||
### Pattern 2: With Statement (Cleaner) |
||||
|
||||
```python |
||||
def execute_query(self, query: str, params: tuple = ()): |
||||
with duckdb.connect(self.db_path) as conn: |
||||
return conn.execute(query, params).fetchall() |
||||
``` |
||||
|
||||
### Pattern 3: Lazy Connection Caching |
||||
|
||||
For frequently accessed connections: |
||||
|
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._conn = None |
||||
|
||||
@property |
||||
def connection(self): |
||||
if self._conn is None: |
||||
self._conn = duckdb.connect(self.db_path) |
||||
return self._conn |
||||
|
||||
def close(self): |
||||
if self._conn: |
||||
self._conn.close() |
||||
self._conn = None |
||||
``` |
||||
|
||||
## Table Initialization |
||||
|
||||
Create tables with proper constraints and sequences: |
||||
|
||||
```python |
||||
def _init_database(self): |
||||
conn = duckdb.connect(self.db_path) |
||||
|
||||
# Create sequence for auto-incrementing IDs |
||||
try: |
||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
||||
except Exception: |
||||
pass |
||||
|
||||
# Create tables |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS motions ( |
||||
id INTEGER DEFAULT nextval('motions_id_seq'), |
||||
title TEXT NOT NULL, |
||||
description TEXT, |
||||
date DATE, |
||||
policy_area TEXT, |
||||
voting_results JSON, |
||||
winning_margin FLOAT, |
||||
controversy_score FLOAT, |
||||
layman_explanation TEXT, |
||||
externe_identifier TEXT, |
||||
body_text TEXT, |
||||
url TEXT UNIQUE, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
|
||||
# Add columns to existing tables safely |
||||
try: |
||||
conn.execute("ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text TEXT") |
||||
except Exception: |
||||
pass # Column may already exist |
||||
|
||||
conn.close() |
||||
``` |
||||
|
||||
## JSON Column Handling |
||||
|
||||
Store and retrieve JSON data: |
||||
|
||||
```python |
||||
# Insert JSON |
||||
def store_motion(self, motion: Dict): |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
conn.execute( |
||||
"INSERT INTO motions (title, voting_results) VALUES (?, ?)", |
||||
(motion["title"], json.dumps(motion["voting_results"])) |
||||
) |
||||
conn.close() |
||||
except Exception: |
||||
conn.close() |
||||
|
||||
# Query JSON |
||||
def get_motions_with_votes(self, party: str) -> List[Dict]: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
rows = conn.execute(""" |
||||
SELECT title, voting_results |
||||
FROM motions |
||||
WHERE JSON_EXTRACT(voting_results, '$.party') = ? |
||||
""", (party,)).fetchall() |
||||
conn.close() |
||||
return rows |
||||
except Exception: |
||||
conn.close() |
||||
return [] |
||||
``` |
||||
|
||||
## Query Patterns |
||||
|
||||
### Parameterized Queries (Always!) |
||||
```python |
||||
# SAFE - uses parameterized query |
||||
conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)) |
||||
|
||||
# AVOID - SQL injection risk |
||||
# conn.execute(f"SELECT * FROM motions WHERE id = {motion_id}") # BAD! |
||||
``` |
||||
|
||||
### Batch Inserts |
||||
```python |
||||
def bulk_insert_motions(self, motions: List[Dict]): |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
for motion in motions: |
||||
conn.execute( |
||||
"""INSERT OR IGNORE INTO motions |
||||
(title, date, policy_area) VALUES (?, ?, ?)""", |
||||
(motion["title"], motion["date"], motion["policy_area"]) |
||||
) |
||||
conn.close() |
||||
except Exception: |
||||
conn.close() |
||||
``` |
||||
|
||||
### Aggregation Queries |
||||
```python |
||||
def get_party_vote_stats(self, party: str) -> Dict: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute(""" |
||||
SELECT |
||||
COUNT(*) as total_votes, |
||||
SUM(CASE WHEN vote = 'Voor' THEN 1 ELSE 0 END) as voor, |
||||
SUM(CASE WHEN vote = 'Tegen' THEN 1 ELSE 0 END) as tegen |
||||
FROM mp_votes |
||||
WHERE party = ? |
||||
""", (party,)).fetchone() |
||||
conn.close() |
||||
return {"total": result[0], "voor": result[1], "tegen": result[2]} |
||||
except Exception: |
||||
conn.close() |
||||
return {"total": 0, "voor": 0, "tegen": 0} |
||||
``` |
||||
|
||||
## Error Handling |
||||
|
||||
Always close connections in finally block or with context manager: |
||||
|
||||
```python |
||||
def safe_query(self, query: str, params: tuple = ()): |
||||
conn = None |
||||
try: |
||||
conn = duckdb.connect(self.db_path) |
||||
result = conn.execute(query, params).fetchall() |
||||
return result |
||||
except Exception as e: |
||||
_logger.error(f"Query failed: {e}") |
||||
return [] |
||||
finally: |
||||
if conn: |
||||
conn.close() |
||||
``` |
||||
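
The same close-on-every-path guarantee can be had with less code by using a context manager. The sketch below deliberately swaps in the standard-library `sqlite3` module so that it runs without DuckDB installed; it is a stand-in, not the project's code. With `sqlite3`, a bare `with connection:` block manages the transaction but does not close the connection, hence `contextlib.closing`. The module-level `safe_query` and the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing
from typing import List, Tuple


def safe_query(db_path: str, query: str, params: Tuple = ()) -> List[Tuple]:
    """Run a parameterized query; always close the connection; return [] on error."""
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            return conn.execute(query, params).fetchall()
    except sqlite3.Error:
        return []  # safe fallback, mirroring the pattern above


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    # Set up a throwaway table to query against.
    with closing(sqlite3.connect(path)) as conn:
        conn.execute("CREATE TABLE motions (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute("INSERT INTO motions (id, title) VALUES (?, ?)", (1, "Motie A"))
        conn.commit()

    found = safe_query(path, "SELECT title FROM motions WHERE id = ?", (1,))
    missing_table = safe_query(path, "SELECT * FROM no_such_table")

print(found, missing_table)
```

Returning an empty list on failure hides the difference between "no rows" and "query failed", so logging the exception, as the `_logger.error` call above does, remains important.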
|
||||
## Testing with Mock |
||||
|
||||
For unit tests without DuckDB: |
||||
|
||||
```python |
||||
# In MotionDatabase.__init__ |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._file_mode = duckdb is None |
||||
|
||||
if duckdb is None: |
||||
# Create JSON fallback files |
||||
for p in (f"{db_path}.embeddings.json", f"{db_path}.similarity_cache.json"): |
||||
if not os.path.exists(p): |
||||
with open(p, "w") as fh: |
||||
fh.write("[]") |
||||
else: |
||||
self._init_database() |
||||
``` |
||||
@ -1,79 +0,0 @@ |
||||
--- |
||||
title: DuckDB Access Pattern |
||||
category: patterns |
||||
--- |
||||
# DuckDB Access Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. |
||||
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. |
||||
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. |
||||
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). |
||||
|
||||
## Examples |
||||
|
||||
### database.py - Explicit connect/close for schema init |
||||
|
||||
```python |
||||
conn = duckdb.connect(self.db_path) |
||||
... |
||||
conn.execute(""" |
||||
CREATE TABLE IF NOT EXISTS fused_embeddings ( |
||||
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), |
||||
motion_id INTEGER NOT NULL, |
||||
window_id TEXT NOT NULL, |
||||
vector JSON NOT NULL, |
||||
svd_dims INTEGER NOT NULL, |
||||
text_dims INTEGER NOT NULL, |
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
||||
PRIMARY KEY (id) |
||||
) |
||||
""") |
||||
conn.close() |
||||
``` |
||||
|
||||
### pipeline/svd_pipeline.py - Read-only connection |
||||
|
||||
```python |
||||
conn = duckdb.connect(db_path, read_only=True) |
||||
try: |
||||
rows = conn.execute( |
||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
||||
(start_date, end_date), |
||||
).fetchall() |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
|
||||
### similarity/compute.py - Preferred 'with' context |
||||
|
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: |
||||
logger.exception("duckdb import failed; cannot load vectors") |
||||
return 0 |
||||
|
||||
with duckdb.connect(db.db_path) as conn: |
||||
rows = conn.execute(query, params).fetchall() |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Connection without closure |
||||
|
||||
```python |
||||
# BAD: connection may leak if exception occurs before explicit close |
||||
conn = duckdb.connect(db_path) |
||||
rows = conn.execute("SELECT ...").fetchall() |
||||
# missing finally/close |
||||
``` |
||||
|
||||
**Remediation**: Use "with" context or ensure conn.close() in finally block. |
||||
|
||||
### Bad: Parallel write connections |
||||
|
||||
**Problem**: Opening write connections from many parallel workers without coordination. |
||||
|
||||
**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. |
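
One way to centralize writes is a single writer thread fed by a queue. The sketch below uses only the standard library and takes the actual write as a callable, so it does not assume any particular DuckDB call. `run_single_writer`, `WriteJob`, `demo`, and the column layout of the `INSERT` statement are hypothetical names chosen for illustration and do not come from the codebase.

```python
import queue
import threading
from typing import Callable, List, Optional, Tuple

# Hypothetical write job: (sql, params). Not taken from the original codebase.
WriteJob = Tuple[str, tuple]


def run_single_writer(
    jobs: "queue.Queue[Optional[WriteJob]]",
    write: Callable[[str, tuple], None],
) -> int:
    """Drain jobs on one thread until a None sentinel arrives; return the count written."""
    written = 0
    while True:
        job = jobs.get()
        if job is None:  # sentinel: producers are done
            break
        sql, params = job
        write(sql, params)  # real code would open a short-lived write connection here
        written += 1
    return written


def demo() -> List[WriteJob]:
    """Run four compute workers that enqueue results for one writer thread."""
    jobs: "queue.Queue[Optional[WriteJob]]" = queue.Queue()
    seen: List[WriteJob] = []

    writer = threading.Thread(
        target=run_single_writer,
        args=(jobs, lambda sql, params: seen.append((sql, params))),
    )
    writer.start()

    # Compute workers only enqueue results; they never open a write connection.
    def worker(worker_id: int) -> None:
        jobs.put(("INSERT INTO similarity_cache VALUES (?, ?)", (worker_id, 0.5)))

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

    jobs.put(None)  # tell the writer to stop
    writer.join()
    return seen
```

Because only the writer thread ever calls `write`, the compute workers can keep their own connections read-only.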
||||
@ -1,74 +0,0 @@ |
||||
--- |
||||
title: Embeddings Similarity Pipeline |
||||
category: patterns |
||||
--- |
||||
# Embeddings Similarity Pipeline |
||||
|
||||
## Rules |
||||
|
||||
- Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure. |
||||
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically the concatenation [svd + text]. |
||||
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. |
||||
- Use read_only DuckDB connections in compute workers to allow parallel runs. |
||||
|
||||
## Examples |
||||
|
||||
### pipeline/ai_provider_wrapper.py - Batched embed + fallback |
||||
|
||||
```python |
||||
for start in range(0, len(texts), batch_size): |
||||
chunk = texts[start : start + batch_size] |
||||
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) |
||||
... |
||||
for j in range(i, end): |
||||
t = texts[j] |
||||
single, single_exc = _attempt_batch([t], j) |
||||
if single: |
||||
results[j] = single[0] |
||||
``` |
||||
|
||||
### pipeline/fusion.py - Concatenation and storage |
||||
|
||||
```python |
||||
try: |
||||
svd_vec = json.loads(svd_json) |
||||
except Exception: |
||||
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id) |
||||
skipped_missing_svd += 1 |
||||
continue |
||||
... |
||||
fused = list(svd_vec) + list(text_vec) |
||||
res = db.store_fused_embedding( |
||||
int(entity_id), |
||||
window_id, |
||||
fused, |
||||
svd_dims=len(svd_vec), |
||||
text_dims=len(text_vec), |
||||
) |
||||
``` |
||||
|
||||
### similarity/compute.py - Normalized cosine similarity |
||||
|
||||
```python |
||||
# Normalize rows |
||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
||||
norms[norms == 0] = 1.0 |
||||
normalized = matrix / norms |
||||
sim = normalized @ normalized.T |
||||
... |
||||
# pick top-k neighbors and write to similarity_cache |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Assuming consistent vector length |
||||
|
||||
**Problem**: Assuming consistent vector length without checks leads to shape errors. |
||||
|
||||
**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). |
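
A minimal sketch of this remediation in pure Python, assuming vectors arrive as plain lists of floats. `pad_vectors` and `cosine` are illustrative names, not the functions in compute.py, which works on NumPy matrices.

```python
import logging
import math
from typing import List, Sequence

logger = logging.getLogger(__name__)


def pad_vectors(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Right-pad every vector with zeros to the longest length seen."""
    if not vectors:
        return []
    target = max(len(v) for v in vectors)
    if any(len(v) != target for v in vectors):
        logger.warning("Inconsistent vector lengths; padding with zeros to %d", target)
    return [list(v) + [0.0] * (target - len(v)) for v in vectors]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Padding with zeros leaves each vector's norm unchanged, so the similarity of two vectors that already had the same length is not affected.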
||||
|
||||
### Bad: Inline heavy computation in UI |
||||
|
||||
**Problem**: Recomputing heavy pipelines inline in UI requests. |
||||
|
||||
**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI. |
||||
@ -1,63 +0,0 @@ |
||||
--- |
||||
title: Error Handling Pattern |
||||
category: patterns |
||||
--- |
||||
# Error Handling Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError). |
||||
- Prefer logging.exception when catching an exception where stack trace is useful. |
||||
- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context. |
||||
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented. |
||||
|
||||
## Examples |
||||
|
||||
### ai_provider.py - Network error to ProviderError |
||||
|
||||
```python |
||||
except requests.ConnectionError as exc: |
||||
if attempt == retries: |
||||
raise ProviderError( |
||||
f"Connection error when calling provider: {exc}" |
||||
) from exc |
||||
... |
||||
``` |
||||
|
||||
### pipeline/ai_provider_wrapper.py - Best-effort with logging |
||||
|
||||
```python |
||||
except Exception: |
||||
_logger.exception("Failed to append audit event for embedding failure") |
||||
results[j] = None |
||||
``` |
||||
|
||||
### similarity/compute.py - Defensive import handling |
||||
|
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: |
||||
logger.exception("duckdb import failed; cannot load vectors") |
||||
return 0 |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Silent exception swallowing |
||||
|
||||
```python |
||||
try: |
||||
do_work() |
||||
except Exception: |
||||
return [] |
||||
# BAD: hides the root cause and returns an ambiguous default |
||||
``` |
||||
|
||||
**Remediation**: Narrow the exception types, or at minimum call logger.exception() and then re-raise or, if the failure is truly handled, convert it to a domain error. |
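
A short sketch of the narrowed version, assuming a JSON payload is being parsed. `parse_embedding` is a hypothetical helper, and the `ProviderError` class here is a stand-in for the project's own domain error, which may be defined differently.

```python
import json
import logging

logger = logging.getLogger(__name__)


class ProviderError(Exception):
    """Stand-in for the project's domain error; the real class may differ."""


def parse_embedding(payload: str) -> list:
    """Parse a JSON-encoded vector, converting parse failures to ProviderError."""
    try:
        vector = json.loads(payload)
    except json.JSONDecodeError as exc:  # narrow: only the failure we expect
        logger.exception("Invalid embedding JSON")
        raise ProviderError("embedding payload is not valid JSON") from exc
    if not isinstance(vector, list):
        raise ProviderError("embedding payload must be a JSON list")
    return vector
```

Using `raise ... from exc` keeps the original exception attached as `__cause__`, so the root cause stays visible in the traceback.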
||||
|
||||
### Bad: Mixing print() and logging |
||||
|
||||
**Problem**: Mixing print() and logging for errors. |
||||
|
||||
**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration. |
||||
@ -1,41 +0,0 @@ |
||||
--- |
||||
title: Module Singletons Pattern |
||||
category: patterns |
||||
--- |
||||
# Module Singletons Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: |
||||
- Avoid expensive initialization at import time. |
||||
- Provide a way to construct with a test DB path or to reinitialize in tests. |
||||
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit. |
||||
|
||||
## Examples |
||||
|
||||
### database.py - Safe class initialization |
||||
|
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
# If duckdb is not available, operate in lightweight file-backed mode |
||||
self._file_mode = duckdb is None |
||||
self._init_database() |
||||
``` |
||||
|
||||
### similarity/lookup.py - Local instances |
||||
|
||||
```python |
||||
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase() |
||||
if hasattr(db, "get_cached_similarities"): |
||||
rows = db.get_cached_similarities(...) |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Heavy initialization at import time |
||||
|
||||
**Problem**: Creating connections and performing heavy schema migrations during import. |
||||
|
||||
**Remediation**: Move heavy init to an explicit initialize() method and keep import fast. |
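
A sketch of what deferred initialization could look like. `LazyDatabase`, `get_db`, and `reset_db` are hypothetical names used for illustration and are not part of database.py; the expensive setup step is reduced to a flag.

```python
from typing import Optional


class LazyDatabase:
    """Hypothetical stand-in for MotionDatabase with deferred setup."""

    def __init__(self, db_path: str = "data/motions.db") -> None:
        self.db_path = db_path  # cheap: no I/O at construction time
        self.initialized = False

    def initialize(self) -> None:
        """Run expensive setup once; repeated calls are no-ops."""
        if self.initialized:
            return
        # Real code would create tables or run migrations here.
        self.initialized = True


_db: Optional[LazyDatabase] = None


def get_db(db_path: Optional[str] = None) -> LazyDatabase:
    """Return the shared instance, creating and initializing it on first use."""
    global _db
    if _db is None:
        _db = LazyDatabase(db_path) if db_path else LazyDatabase()
        _db.initialize()
    return _db


def reset_db() -> None:
    """Drop the shared instance so tests can start from a clean state."""
    global _db
    _db = None
```

Importing the module now does no work; the first `get_db()` call pays the setup cost, and tests can call `reset_db()` and pass their own path.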
||||
@ -1,196 +0,0 @@ |
||||
# Python-Specific Patterns |
||||
|
||||
## Singleton Pattern |
||||
|
||||
Use module-level instances for shared resources: |
||||
|
||||
```python |
||||
# database.py |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._init_database() |
||||
|
||||
def _init_database(self): |
||||
# Initialize tables on first instantiation |
||||
... |
||||
|
||||
# Bottom of file - the singleton |
||||
db = MotionDatabase() |
||||
``` |
||||
|
||||
**Usage across the codebase:** |
||||
```python |
||||
# In other modules |
||||
from database import db |
||||
|
||||
def some_function(): |
||||
motions = db.get_filtered_motions(limit=10) |
||||
return motions |
||||
``` |
||||
|
||||
Similarly for other singletons: |
||||
```python |
||||
# summarizer.py |
||||
class MotionSummarizer: |
||||
def __init__(self): |
||||
pass # Stateless |
||||
|
||||
def generate_layman_explanation(self, title: str, body: str) -> str: |
||||
... |
||||
|
||||
summarizer = MotionSummarizer() |
||||
``` |
||||
|
||||
## Dataclass Config Pattern |
||||
|
||||
Use dataclass for configuration with environment variable support: |
||||
|
||||
```python |
||||
# config.py |
||||
from dataclasses import dataclass |
||||
from typing import List |
||||
import os |
||||
|
||||
@dataclass |
||||
class Config: |
||||
# Database settings |
||||
DATABASE_PATH = "data/motions.db" |
||||
|
||||
# API settings |
||||
TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0" |
||||
API_TIMEOUT = 30 |
||||
API_BATCH_SIZE = 250 |
||||
|
||||
# AI settings |
||||
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY") |
||||
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1" |
||||
QWEN_MODEL = "qwen/qwen-2.5-72b-instruct" |
||||
|
||||
# App settings |
||||
DEFAULT_MOTION_COUNT = 10 |
||||
SESSION_TIMEOUT_DAYS = 30 |
||||
|
||||
# Policy areas |
||||
    POLICY_AREAS: List[str] | None = None  # populated in __post_init__ |
||||
def __post_init__(self): |
||||
self.POLICY_AREAS = [ |
||||
"Alle", "Economie", "Klimaat", "Immigratie", |
||||
"Zorg", "Onderwijs", "Defensie", "Sociale Zaken", "Algemeen" |
||||
] |
||||
|
||||
config = Config() |
||||
``` |
||||
|
||||
**Usage:** |
||||
```python |
||||
from config import config |
||||
|
||||
# Access as attributes |
||||
timeout = config.API_TIMEOUT |
||||
areas = config.POLICY_AREAS |
||||
``` |
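
Note that in the `Config` class above only the annotated attribute is a dataclass field; the unannotated ones are ordinary class attributes. The sketch below shows an alternative layout in which every setting is a field and mutable or environment-dependent defaults use `field(default_factory=...)`. `ExampleConfig` and its lowercase field names are illustrative only, not the project's actual configuration.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleConfig:
    # Annotated attributes become dataclass fields and can be overridden per instance.
    database_path: str = "data/motions.db"
    api_timeout: int = 30
    # default_factory runs per instance, so the env var is read at construction time.
    api_key: Optional[str] = field(
        default_factory=lambda: os.getenv("OPENROUTER_API_KEY")
    )
    # A fresh list per instance avoids sharing one mutable default.
    policy_areas: List[str] = field(
        default_factory=lambda: ["Alle", "Economie", "Klimaat"]
    )
```

With this layout a test can build `ExampleConfig(database_path=...)` directly instead of patching module-level state.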
||||
|
||||
## DuckDB Connection Pattern |
||||
|
||||
Short-lived connections with explicit cleanup: |
||||
|
||||
```python |
||||
class MotionDatabase: |
||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
result = conn.execute( |
||||
"SELECT * FROM motions WHERE id = ?", |
||||
(motion_id,) |
||||
).fetchone() |
||||
return result |
||||
finally: |
||||
conn.close() |
||||
|
||||
def get_filtered_motions(self, **kwargs) -> List[Dict]: |
||||
conn = duckdb.connect(self.db_path) |
||||
try: |
||||
rows = conn.execute(query, params).fetchall() |
||||
return rows |
||||
except Exception: |
||||
return [] # Safe fallback |
||||
finally: |
||||
conn.close() |
||||
``` |
||||
|
||||
**Context manager alternative (preferred when applicable):** |
||||
```python |
||||
def some_operation(self): |
||||
with duckdb.connect(self.db_path) as conn: |
||||
result = conn.execute("SELECT ...").fetchall() |
||||
return result |
||||
``` |
||||
|
||||
## Try/Except with Fallback Pattern |
||||
|
||||
Always provide safe fallbacks: |
||||
|
||||
```python |
||||
def get_motion_or_default(self, motion_id: int) -> Dict: |
||||
    try: |
||||
        # The context manager closes the connection even if execute() raises |
||||
        with duckdb.connect(self.db_path) as conn: |
||||
            result = conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)).fetchone() |
||||
        return result if result else {} |
||||
    except Exception: |
||||
        return {} |
||||
``` |
||||
|
||||
## Optional Import Pattern |
||||
|
||||
Handle optional dependencies gracefully: |
||||
|
||||
```python |
||||
try: |
||||
import duckdb |
||||
except Exception: # pragma: no cover |
||||
duckdb = None |
||||
|
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self._file_mode = duckdb is None |
||||
... |
||||
``` |
||||
|
||||
## Property Pattern |
||||
|
||||
Lazy initialization of expensive resources: |
||||
|
||||
```python |
||||
class MotionDatabase: |
||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
||||
self.db_path = db_path |
||||
self._session_cache = None |
||||
|
||||
@property |
||||
def session(self): |
||||
"""Lazy-load expensive resources.""" |
||||
if self._session_cache is None: |
||||
self._session_cache = self._create_session() |
||||
return self._session_cache |
||||
``` |
||||
|
||||
## Type Annotation Patterns |
||||
|
||||
```python |
||||
from typing import Dict, List, Optional, Tuple, Any |
||||
|
||||
# Optional with None default |
||||
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]: |
||||
... |
||||
|
||||
# Multiple return types |
||||
def parse_vote(self, vote_str: str) -> Tuple[bool, str]: |
||||
"""Returns (success, error_message)""" |
||||
... |
||||
|
||||
# Generic types |
||||
def get_batch(self, ids: List[int]) -> Dict[str, Any]: |
||||
... |
||||
``` |
||||
@ -1,77 +0,0 @@ |
||||
--- |
||||
title: Requests HTTP Pattern |
||||
category: patterns |
||||
--- |
||||
# Requests HTTP Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling. |
||||
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429. |
||||
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429). |
||||
- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase. |
||||
|
||||
## Examples |
||||
|
||||
### ai_provider.py - 429 handling with Retry-After |
||||
|
||||
```python |
||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
||||
... |
||||
if getattr(resp, "status_code", 0) == 429: |
||||
if attempt == retries: |
||||
raise ProviderError(f"Provider returned HTTP {resp.status_code}") |
||||
retry_after = None |
||||
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None |
||||
if raw: |
||||
try: |
||||
retry_after = int(raw) |
||||
except Exception: |
||||
... |
||||
if retry_after is not None: |
||||
time.sleep(retry_after) |
||||
continue |
||||
``` |
||||
|
||||
### api_client.py - Session + raise_for_status |
||||
|
||||
```python |
||||
response = self.session.get( |
||||
base_url, params=params, timeout=config.API_TIMEOUT |
||||
) |
||||
response.raise_for_status() |
||||
data = response.json() |
||||
``` |
||||
|
||||
### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper |
||||
|
||||
```python |
||||
def _attempt_batch(chunk_texts, start_index): |
||||
backoff = 0.5 |
||||
for attempt in range(1, retries + 1): |
||||
try: |
||||
emb_chunk = _embedder( |
||||
chunk_texts, model=model, batch_size=len(chunk_texts) |
||||
) |
||||
return emb_chunk, None |
||||
except Exception as exc: |
||||
if attempt == retries: |
||||
break |
||||
sleep = backoff * (2 ** (attempt - 1)) |
||||
time.sleep(sleep) |
||||
continue |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Silent exception swallowing |
||||
|
||||
**Problem**: Blindly catching all requests exceptions and returning empty response. |
||||
|
||||
**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log details. |
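
The classification can be sketched without any HTTP library by injecting the call as a function. `call_with_retries`, the simplified `Response` tuple, and the local `ProviderError` are assumptions made for this example and do not mirror the signatures in ai_provider.py. Only the integer-seconds form of `Retry-After` is handled; any other value falls back to exponential backoff.

```python
import time
from typing import Callable, Mapping, Tuple

# (status_code, headers, body): a simplified response shape assumed for this sketch.
Response = Tuple[int, Mapping[str, str], str]


class ProviderError(Exception):
    """Stand-in for the project's domain error."""


def call_with_retries(
    send: Callable[[], Response],
    retries: int = 3,
    backoff: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry 429/5xx and connection errors; fail immediately on other 4xx."""
    for attempt in range(1, retries + 1):
        delay = backoff * (2 ** (attempt - 1))
        try:
            status, headers, body = send()
        except ConnectionError as exc:
            if attempt == retries:
                raise ProviderError(f"Connection error: {exc}") from exc
            sleep(delay)
            continue
        if status < 400:
            return body
        retryable = status == 429 or status >= 500
        if not retryable or attempt == retries:
            raise ProviderError(f"Provider returned HTTP {status}")
        if status == 429:
            raw = headers.get("Retry-After")
            if raw is not None and raw.isdigit():
                delay = float(raw)  # the server-provided wait overrides backoff
        sleep(delay)
    raise ProviderError("retries must be at least 1")
```

Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.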
||||
|
||||
### Bad: Using print() for errors |
||||
|
||||
**Problem**: Using print() for network errors instead of structured logging. |
||||
|
||||
**Remediation**: Use `_logger.exception()` instead (api_client.py still needs this fix). |
||||
@ -1,37 +0,0 @@ |
||||
--- |
||||
title: Validation Pattern |
||||
category: patterns |
||||
--- |
||||
# Validation Pattern |
||||
|
||||
## Rules |
||||
|
||||
- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs. |
||||
- Tests should assert that invalid inputs raise the expected exceptions. |
||||
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding). |
||||
|
||||
## Examples |
||||
|
||||
### ai_provider.py - Type validation |
||||
|
||||
```python |
||||
if not isinstance(text, str): |
||||
raise ProviderError("text must be a string") |
||||
``` |
||||
|
||||
### pipeline/ai_provider_wrapper.py - Defensive empty handling |
||||
|
||||
```python |
||||
if not texts: |
||||
return [] |
||||
if motion_ids is None: |
||||
motion_ids = [None for _ in texts] |
||||
``` |
||||
|
||||
## Anti-Patterns |
||||
|
||||
### Bad: Invalid values into computation |
||||
|
||||
**Problem**: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline). |
||||
|
||||
**Remediation**: Fail fast with a typed exception and add unit tests to cover validations. |
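
A minimal fail-fast check, assuming the public entry point receives a sequence of texts. `validate_texts` is a hypothetical helper and is not taken from ai_provider.py.

```python
from typing import List, Sequence


def validate_texts(texts: Sequence[object]) -> List[str]:
    """Return texts as a list of str, or raise before any expensive work starts."""
    if isinstance(texts, (str, bytes)):
        raise TypeError("texts must be a sequence of strings, not a single string")
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            raise ValueError(
                f"texts[{index}] must be a string, got {type(text).__name__}"
            )
    return list(texts)
```

A unit test for this would assert both the happy path and that a non-string element raises `ValueError` naming the offending index.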
||||
@ -1,67 +0,0 @@ |
||||
--- |
||||
title: Tech Stack |
||||
category: stack |
||||
--- |
||||
|
||||
# Tech Stack |
||||
|
||||
## Runtime & Language |
||||
- **Python >=3.13** |
||||
|
||||
## Web Framework |
||||
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages |
||||
|
||||
## Data Layer |
||||
- **DuckDB** - Embedded OLAP database |
||||
- Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata |
||||
- **ibis** - Dataframe/query expression library (referenced, but the DuckDB-native implementation is used) |
||||
|
||||
## AI / LLM |
||||
- **OpenRouter** - API abstraction for AI providers |
||||
- **QWEN** - Primary model |
||||
- Embeddings: `qwen/qwen3-embedding-4b` |
||||
- Chat: `qwen/qwen-2.5-72b-instruct` |
||||
- **requests** - HTTP client (not raw openai) |
||||
|
||||
## ML / Analytics |
||||
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler |
||||
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes |
||||
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD) |
||||
- **numpy** - Numerical computing |
||||
|
||||
## Visualization |
||||
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback) |
||||
- **matplotlib** - Static plotting (optional) |
||||
|
||||
## HTTP & Parsing |
||||
- **requests** - Session pooling, retry with backoff |
||||
- **beautifulsoup4** - HTML parsing |
||||
- **lxml** - XML/HTML processing |
||||
|
||||
## Key Source Files |
||||
|
||||
| File | Purpose | |
||||
|------|---------| |
||||
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema | |
||||
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) | |
||||
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders | |
||||
| `analysis/` | SVD pipeline, UMAP projection, clustering | |
||||
| `pipeline/` | Data fetch, transform, store pipeline | |
||||
| `pages/1_Stemwijzer.py` | Quiz page | |
||||
| `pages/2_Explorer.py` | Explorer page | |
||||
| `config.py` | Dataclass Config pattern | |
||||
| `ai_provider.py` | OpenRouter API wrapper with retry | |
||||
| `api_client.py` | TweedeKamer OData API client | |
||||
|
||||
## Singleton Instances |
||||
|
||||
| Module | Instance | Type | |
||||
|--------|----------|------| |
||||
| `database.py` | `db` | `MotionDatabase` | |
||||
| `config.py` | `config` | `Config` (dataclass) | |
||||
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | |
||||
|
||||
## Environment |
||||
- Python >=3.13 |
||||
- Environment variables via `.env` (DB path, API keys) |
||||
- No `.env` values in constraint files (security) |
||||
@ -1,72 +0,0 @@ |
||||
import os |
||||
import re |
||||
from typing import List |
||||
|
||||
|
||||
def file_exists(base_dir: str, path: str) -> bool: |
||||
"""Check whether a path exists under base_dir without opening the file. |
||||
|
||||
This resolves the path relative to base_dir and returns True if the |
||||
resolved path exists on the filesystem (file or directory). |
||||
""" |
||||
if not base_dir: |
||||
base = "" |
||||
else: |
||||
base = base_dir |
||||
full = os.path.join(base, path) |
||||
return os.path.exists(full) |
||||
|
||||
|
||||
def detect_truncated(snippet: str) -> bool: |
||||
"""Heuristic detection whether a snippet is truncated. |
||||
|
||||
Returns True if the snippet ends with an ellipsis '...' (after |
||||
trimming whitespace) or contains a common truncation marker like |
||||
the substring 'truncat' (case-insensitive). |
||||
""" |
||||
if snippet is None: |
||||
return False |
||||
s = snippet.strip() |
||||
if s.endswith("..."): |
||||
return True |
||||
if "truncat" in s.lower(): |
||||
return True |
||||
return False |
||||
|
||||
|
||||
def find_potential_secrets(text: str) -> List[str]: |
||||
"""Scan the provided text and return a list of potential secret-like |
||||
strings. This uses a few common heuristics and regex patterns and only |
||||
scans the provided text (no external resources). |
||||
|
||||
The function returns a list of found token strings (values when |
||||
capture groups are available, otherwise the matched substring). |
||||
""" |
||||
if not text: |
||||
return [] |
||||
|
||||
candidates: List[str] = [] |
||||
|
||||
# AWS access key id pattern (common): AKIA followed by 16 alphanumeric |
||||
aws_pattern = re.compile(r"AKIA[0-9A-Z]{16}") |
||||
candidates.extend(aws_pattern.findall(text)) |
||||
|
||||
# Common key/value patterns like api_key = "..." or "api-key: ..." |
||||
# allow shorter secret values (down to 4 chars) to catch short test values |
||||
kv_pattern = re.compile( |
||||
r"(?i)(?:api[_-]?key|secret[_-]?key|access[_-]?token|access[_-]?key|token|password|passwd|pwd)\s*[=:]+\s*['\"]?([A-Za-z0-9\-_=+/\.]{4,128})['\"]?" |
||||
) |
||||
candidates.extend(m.group(1) for m in kv_pattern.finditer(text)) |
||||
|
||||
# Generic long hex or base64-like strings (heuristic) |
||||
long_hex = re.compile(r"\b([a-f0-9]{32,128})\b", re.IGNORECASE) |
||||
candidates.extend(long_hex.findall(text)) |
||||
|
||||
# Deduplicate while preserving order |
||||
seen = set() |
||||
result: List[str] = [] |
||||
for c in candidates: |
||||
if c and c not in seen: |
||||
seen.add(c) |
||||
result.append(c) |
||||
return result |
||||
@ -1,32 +0,0 @@ |
||||
from typing import List, Optional |
||||
|
||||
|
||||
def main(argv: Optional[List[str]] = None) -> int: |
||||
"""CLI wrapper that delegates to scripts.mindmodel.validator.main. |
||||
|
||||
Returns the integer exit code from the delegated main. If the |
||||
validator module is not available or raises, return a non-zero |
||||
exit code. |
||||
""" |
||||
try: |
||||
# Import here to avoid side-effects on module import |
||||
from scripts.mindmodel import validator |
||||
|
||||
# Call the validator.main if present |
||||
if hasattr(validator, "main"): |
||||
result = validator.main(argv) |
||||
# Ensure we return an int |
||||
try: |
||||
return int(result) # type: ignore |
||||
except Exception: |
||||
return 1 |
||||
else: |
||||
return 2 |
||||
except Exception: |
||||
# Import error or runtime error — return non-zero so callers |
||||
# can detect failure (tests expect non-zero on missing manifest) |
||||
return 2 |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
raise SystemExit(main()) |
||||
@ -1,67 +0,0 @@ |
||||
"""Simple manifest loader for mindmodel manifests. |
||||
|
||||
Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`. |
||||
|
||||
Behavior: |
||||
- If PyYAML is installed, uses yaml.safe_load to parse the file. |
||||
- Otherwise falls back to the stdlib json parser. |
||||
- If the top-level document is a list it will be normalized to {"constraints": <list>}. |
||||
- Raises ManifestLoadError for missing file or parse errors. |
||||
""" |
||||
|
||||
from typing import Any, Dict |
||||
import json |
||||
from pathlib import Path |
||||
|
||||
|
||||
class ManifestLoadError(Exception): |
||||
"""Raised when a manifest cannot be loaded or parsed.""" |
||||
|
||||
|
||||
try: |
||||
import yaml # type: ignore |
||||
except Exception: # YAML not available |
||||
yaml = None # type: ignore |
||||
|
||||
|
||||
def _parse_with_yaml(text: str) -> Any: |
||||
    # yaml.safe_load may return any Python structure |
||||
try: |
||||
return yaml.safe_load(text) |
||||
except Exception as exc: # pragma: no cover - defensive |
||||
raise ManifestLoadError(f"YAML parse error: {exc}") from exc |
||||
|
||||
|
||||
def _parse_with_json(text: str) -> Any: |
||||
try: |
||||
return json.loads(text) |
||||
except Exception as exc: |
||||
raise ManifestLoadError(f"JSON parse error: {exc}") from exc |
||||
|
||||
|
||||
def load_manifest(path: str) -> Dict[str, Any]: |
||||
"""Load a manifest from the given file path and normalize it to a dict. |
||||
|
||||
If the top-level document is a list, it will be returned as {"constraints": list}. |
||||
Raises ManifestLoadError if the file does not exist or if parsing fails. |
||||
""" |
||||
p = Path(path) |
||||
if not p.exists(): |
||||
raise ManifestLoadError(f"Manifest file not found: {path}") |
||||
|
||||
text = p.read_text(encoding="utf-8") |
||||
|
||||
if yaml is not None: |
||||
data = _parse_with_yaml(text) |
||||
else: |
||||
data = _parse_with_json(text) |
||||
|
||||
# Normalize |
||||
if isinstance(data, list): |
||||
return {"constraints": data} |
||||
|
||||
if isinstance(data, dict): |
||||
return data |
||||
|
||||
# Unexpected top-level type, wrap it |
||||
return {"manifest": data} |
||||
@ -1,108 +0,0 @@ |
||||
from typing import Dict, Tuple, List, Any |
||||
import json |
||||
from pathlib import Path |
||||
|
||||
from scripts.mindmodel import loader |
||||
from scripts.mindmodel import checks |
||||
|
||||
|
||||
def validate_manifest(path: str, base_dir: str | None = None) -> Tuple[int, Dict[str, Any]]: |
||||
"""Validate a manifest file at `path`. |
||||
|
||||
Returns a tuple (exit_code, report). |
||||
|
||||
exit codes: |
||||
0 - ok (no issues) |
||||
1 - warnings (only truncated snippets found) |
||||
2 - critical (missing files, secrets, or parse error) |
||||
""" |
||||
report: Dict[str, Any] = { |
||||
"path": path, |
||||
"secrets": [], |
||||
"missing_files": [], |
||||
"truncated": 0, |
||||
"constraints": [], |
||||
} |
||||
|
||||
p = Path(path) |
||||
try: |
||||
raw_text = p.read_text(encoding="utf-8") |
||||
except Exception as exc: |
||||
report["load_error"] = f"Manifest file not readable: {exc}" |
||||
return 2, report |
||||
|
||||
# scan for secrets in the manifest text |
||||
secrets = checks.find_potential_secrets(raw_text) |
||||
report["secrets"] = secrets |
||||
|
||||
try: |
||||
manifest = loader.load_manifest(path) |
||||
except loader.ManifestLoadError as exc: |
||||
report["load_error"] = str(exc) |
||||
# treat parse/load errors as critical |
||||
return 2, report |
||||
|
||||
constraints = manifest.get("constraints") or [] |
||||
|
||||
for constraint in constraints: |
||||
c_rep: Dict[str, Any] = {"constraint": constraint, "evidence": []} |
||||
for ev in ( |
||||
constraint.get("evidence", []) |
||||
if isinstance(constraint.get("evidence", []), list) |
||||
else [] |
||||
): |
||||
text = ev.get("text") if isinstance(ev, dict) else None |
||||
file_ref = ev.get("file") if isinstance(ev, dict) else None |
||||
|
||||
exists = True |
||||
if file_ref: |
||||
if not checks.file_exists(base_dir or "", file_ref): |
||||
exists = False |
||||
report["missing_files"].append(file_ref) |
||||
|
||||
truncated = False |
||||
if text: |
||||
truncated = checks.detect_truncated(text) |
||||
if truncated: |
||||
report["truncated"] += 1 |
||||
|
||||
c_rep["evidence"].append( |
||||
{ |
||||
"text": text, |
||||
"file": file_ref, |
||||
"exists": exists, |
||||
"truncated": truncated, |
||||
} |
||||
) |
||||
|
||||
report["constraints"].append(c_rep) |
||||
|
||||
# decide exit code |
||||
if report["secrets"]: |
||||
return 2, report |
||||
|
||||
if report["missing_files"]: |
||||
return 2, report |
||||
|
||||
if report["truncated"] > 0: |
||||
return 1, report |
||||
|
||||
return 0, report |
||||
|
||||
|
||||
def main(argv: List[str]) -> int: |
||||
import sys |
||||
|
||||
if len(argv) < 2: |
||||
print(json.dumps({"error": "manifest path required"})) |
||||
return 2 |
||||
|
||||
path = argv[1] |
||||
base_dir = argv[2] if len(argv) > 2 else None |
||||
|
||||
code, report = validate_manifest(path, base_dir=base_dir) |
||||
print(json.dumps(report)) |
||||
return code |
||||
|
||||
|
||||
# no execution at import time |
||||
@ -1,56 +0,0 @@ |
||||
"""Command-line wrapper around src.validators.mindmodel_validator.validate_manifest |
||||
|
||||
This tiny CLI loads a manifest and writes a structured JSON report to stdout |
||||
and optionally to a file path. It is report-only: it never raises an error or |
||||
changes exit code based on findings. |
||||
""" |
||||
|
||||
from __future__ import annotations |
||||
|
||||
import argparse |
||||
import json |
||||
import os |
||||
from pathlib import Path |
||||
from typing import Any |
||||
|
||||
|
||||
def _write_report(report: dict[str, Any], path: Path | None) -> None: |
||||
text = json.dumps(report, indent=2, ensure_ascii=False) |
||||
print(text) |
||||
if path: |
||||
path.parent.mkdir(parents=True, exist_ok=True) |
||||
path.write_text(text, encoding="utf-8") |
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int: |
||||
parser = argparse.ArgumentParser("validate_mindmodel") |
||||
parser.add_argument("manifest", nargs="?", help="path to manifest file") |
||||
parser.add_argument("--manifest", dest="manifest_opt", help="path to manifest file") |
||||
parser.add_argument("--report", help="optional output report path") |
||||
args = parser.parse_args(argv) |
||||
|
||||
manifest = args.manifest_opt or args.manifest |
||||
if not manifest: |
||||
parser.error("manifest path is required (positional or --manifest)") |
||||
|
||||
# import here to keep CLI tiny when unused |
||||
try: |
||||
from src.validators.mindmodel_validator import validate_manifest |
||||
except Exception as e: # pragma: no cover - defensive |
||||
print(f"Failed to import validator: {e}") |
||||
return 0 |
||||
|
||||
try: |
||||
report = validate_manifest(manifest, report_only=True) |
||||
except Exception as e: # never fail the process |
||||
report = {"error": str(e)} |
||||
|
||||
report_path = Path(args.report) if args.report else None |
||||
_write_report(report, report_path) |
||||
|
||||
# always exit zero for report-only operation |
||||
return 0 |
||||
|
||||
|
||||
if __name__ == "__main__": |
||||
raise SystemExit(main()) |
||||
@ -1,35 +0,0 @@ |
||||
"""Motion-related simple types and JSON helpers. |
||||
|
||||
Decision: MotionId is an alias for str for simplicity. |
||||
""" |
||||
|
||||
from dataclasses import dataclass, asdict |
||||
from typing import List |
||||
import json |
||||
|
||||
MotionId = str |
||||
Embedding = List[float] |
||||
|
||||
|
||||
@dataclass |
||||
class SimilarityNeighbor: |
||||
motion_id: MotionId |
||||
score: float |
||||
|
||||
|
||||
def to_json(neighbors: List[SimilarityNeighbor]) -> str: |
||||
"""Serialize a list of SimilarityNeighbor to a JSON string. |
||||
|
||||
The format is a JSON list of objects with keys 'motion_id' and 'score'. |
||||
""" |
||||
list_of_dicts = [asdict(n) for n in neighbors] |
||||
return json.dumps(list_of_dicts) |
||||
|
||||
|
||||
def from_json(json_str: str) -> List[SimilarityNeighbor]: |
||||
"""Deserialize a JSON string (list of dicts) into SimilarityNeighbor list.""" |
||||
parsed = json.loads(json_str) |
||||
return [ |
||||
SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"])) |
||||
for item in parsed |
||||
] |
||||
@@ -1,142 +0,0 @@
"""Conservative, report-only mindmodel/manifest validator.

This module provides a small validator that reads a manifest (YAML if
PyYAML is available, otherwise a tiny fallback parser) and reports
potential issues without making changes.

The returned report contains the keys:
- missing_files: list of file paths referenced in the manifest that don't exist
- truncated_evidence: list of items (dicts) where evidence_excerpt appears truncated
- potential_secrets: list of items (dicts) where evidence_excerpt looks like it may contain secrets

The manifest is expected to contain a top-level `files` list with
entries that are mappings and have at least a `path` (or `file_path`)
and optionally `evidence_excerpt`.
"""

from __future__ import annotations

import os
from typing import List, Dict, Any


def _load_yaml_native(path: str) -> Dict[str, Any]:
    try:
        import yaml  # type: ignore

        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f) or {}
    except Exception:
        raise


def _load_yaml_fallback(path: str) -> Dict[str, Any]:
    """Tiny YAML-ish fallback parser that understands a minimal manifest.

    It only supports a top-level `files:` key and a sequence of simple
    mappings with `-` list items and `key: value` pairs indented.
    This is intentionally conservative and fragile; it's only used when
    PyYAML is not available.
    """
    result: Dict[str, Any] = {}
    files: List[Dict[str, Any]] = []
    current: Dict[str, Any] | None = None

    with open(path, "r", encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            stripped = line.lstrip()
            if not stripped or stripped.startswith("#"):
                continue
            if stripped.startswith("files:") and line.startswith(stripped):
                # top-level marker, skip
                continue
            if stripped.startswith("- "):
                # start new item
                if current is not None:
                    files.append(current)
                current = {}
                # possible inline key: - path: something
                rest = stripped[2:].strip()
                if rest:
                    if ":" in rest:
                        k, v = rest.split(":", 1)
                        current[k.strip()] = v.strip()
                continue
            # key: value lines (indented)
            if ":" in stripped and current is not None:
                k, v = stripped.split(":", 1)
                current[k.strip()] = v.strip()

    if current is not None:
        files.append(current)
    if files:
        result["files"] = files
    return result


def _normalize_entry(entry: Any) -> Dict[str, Any]:
    if not isinstance(entry, dict):
        return {"path": str(entry)}
    # prefer path or file_path
    if "file_path" in entry and "path" not in entry:
        entry = dict(entry)
        entry["path"] = entry.pop("file_path")
    return entry


def validate_manifest(manifest_path: str, report_only: bool = True) -> dict:
    """Validate a minimal mindmodel manifest and return a report.

    Parameters
    - manifest_path: path to the YAML manifest file
    - report_only: unused flag for now; kept to emphasise this is report-only

    Returns a dict with keys: missing_files, truncated_evidence, potential_secrets
    """
    if not os.path.exists(manifest_path):
        raise FileNotFoundError(manifest_path)

    # attempt to use PyYAML if available, otherwise fallback
    try:
        manifest = _load_yaml_native(manifest_path)
    except Exception:
        manifest = _load_yaml_fallback(manifest_path)

    files = manifest.get("files") or []
    report = {"missing_files": [], "truncated_evidence": [], "potential_secrets": []}

    def _strip_surrounding_quotes(s: str) -> str:
        s = s.strip()
        if len(s) >= 2 and s[0] == s[-1] and s[0] in ('"', "'"):
            return s[1:-1]
        return s

    for raw in files:
        entry = _normalize_entry(raw)
        path = entry.get("path")
        evidence = entry.get("evidence_excerpt") or entry.get("evidence") or ""
        # Remove surrounding quotes if the fallback YAML parser left them in place
        if isinstance(evidence, str):
            evidence = _strip_surrounding_quotes(evidence)

        # missing files
        if path:
            if not os.path.exists(path):
                report["missing_files"].append(path)

        # truncated evidence heuristics
        if isinstance(evidence, str):
            if len(evidence) > 1000 or evidence.strip().endswith("..."):
                report["truncated_evidence"].append(
                    {"path": path, "evidence_excerpt": evidence}
                )

            # potential secrets heuristics
            up = evidence.upper()
            if "PASSWORD" in up or "SECRET" in up or "BEGIN PRIVATE KEY" in evidence:
                report["potential_secrets"].append(
                    {"path": path, "evidence_excerpt": evidence}
                )

    return report
@@ -1,11 +0,0 @@
import pathlib


def test_schedule_workflow_exists():
    path = pathlib.Path(".github/workflows/mindmodel-schedule.yml")
    assert path.exists(), f"Expected {path} to exist"

    text = path.read_text(encoding="utf-8")
    # ensure the file is a GitHub Actions workflow that declares a schedule
    assert "on:" in text
    assert "schedule" in text
@@ -1,26 +0,0 @@
import os

try:
    import yaml

    _HAS_YAML = True
except Exception:
    _HAS_YAML = False


def test_mindmodel_workflow_exists_and_parses():
    path = os.path.join(".github", "workflows", "mindmodel-validation.yml")
    assert os.path.exists(path), f"Workflow file {path} does not exist"

    # Minimal parse: if PyYAML is available, try safe_load; otherwise do a token check
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

    if _HAS_YAML:
        data = yaml.safe_load(content)
        assert data is not None and isinstance(data, dict)
        assert "on" in data or "name" in data
    else:
        # fall back to simple checks to avoid introducing new deps
        assert "name:" in content
        assert "on:" in content
@@ -1,43 +0,0 @@
import os
import tempfile

from scripts.mindmodel import checks


def test_file_exists(tmp_path):
    # create a file under tmp_path
    base = str(tmp_path)
    p = tmp_path / "subdir"
    p.mkdir()
    f = p / "file.txt"
    f.write_text("hello")

    # path relative to base
    assert checks.file_exists(base, "subdir/file.txt")
    # non-existing
    assert not checks.file_exists(base, "subdir/missing.txt")


def test_detect_truncated():
    assert checks.detect_truncated("This is a truncated snippet...")
    assert checks.detect_truncated("Truncation marker: [truncated]")
    assert checks.detect_truncated("contains truncatED word")
    assert not checks.detect_truncated("This is complete")
    assert not checks.detect_truncated("")


def test_find_potential_secrets():
    text = """
    api_key = "abcdEFGH1234ijklMNOP"
    password: 'hunter2'
    aws = AKIA1234567890ABCD12
    random_hex = deadbeefdeadbeefdeadbeefdeadbeef
    not_a_secret = short
    """

    found = checks.find_potential_secrets(text)
    # should find api_key value, password, aws and long hex
    assert "abcdEFGH1234ijklMNOP" in found
    assert "hunter2" in found
    assert any(item.startswith("AKIA") for item in found)
    assert any("deadbeef" in item for item in found)
@@ -1,14 +0,0 @@
import os


def test_cli_with_nonexistent_manifest():
    """Calling cli.main with a non-existent manifest should return non-zero."""
    from scripts.mindmodel import cli

    # Provide a path that is extremely unlikely to exist
    fake_manifest = "/this/path/does/not/exist/manifest.json"

    code = cli.main([fake_manifest])

    assert isinstance(code, int)
    assert code != 0
@@ -1,21 +0,0 @@
import json
import pytest

from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")

    loaded = loader.load_manifest(str(p))

    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")
@@ -1,70 +0,0 @@
import json
import os

from scripts.mindmodel import validator


def write_manifest(path, data: str):
    p = path
    p.write_text(data, encoding="utf-8")
    return str(p)


def test_validate_ok(tmp_path):
    # manifest with one constraint and evidence pointing to an existing file
    evidence_file = tmp_path / "file.txt"
    evidence_file.write_text("hello")

    manifest = {
        "constraints": [
            {"id": "c1", "evidence": [{"file": "file.txt", "text": "complete content"}]}
        ]
    }

    manifest_path = tmp_path / "manifest.json"
    manifest_path.write_text(json.dumps(manifest))

    code, report = validator.validate_manifest(
        str(manifest_path), base_dir=str(tmp_path)
    )
    assert code == 0
    assert report["missing_files"] == []
    assert report["secrets"] == []


def test_missing_file_flags_failure(tmp_path):
    # manifest refers to missing file
    manifest = {
        "constraints": [{"id": "c2", "evidence": [{"file": "nope.txt", "text": "foo"}]}]
    }
    manifest_path = tmp_path / "manifest.json"
    manifest_path.write_text(json.dumps(manifest))

    code, report = validator.validate_manifest(
        str(manifest_path), base_dir=str(tmp_path)
    )
    assert code == 2
    assert "nope.txt" in report["missing_files"]


def test_truncated_produces_warning(tmp_path):
    # evidence text is truncated -> warning
    f = tmp_path / "manifest.json"
    manifest = {
        "constraints": [{"id": "c3", "evidence": [{"text": "This is truncated..."}]}]
    }
    f.write_text(json.dumps(manifest))

    code, report = validator.validate_manifest(str(f), base_dir=str(tmp_path))
    assert code == 1
    assert report["truncated"] >= 1


def test_manifest_scanned_for_secrets(tmp_path):
    # manifest text contains an api_key pattern
    f = tmp_path / "manifest.json"
    f.write_text('api_key = "secretVALUE1234"')

    code, report = validator.validate_manifest(str(f), base_dir=str(tmp_path))
    assert code == 2
    assert any("secretVALUE1234" in s for s in report["secrets"]) or report["secrets"]
@@ -1,52 +0,0 @@
import json
import subprocess
import sys
from pathlib import Path


def test_cli_runs(tmp_path):
    manifest = Path(".mindmodel/manifest.yaml")
    assert manifest.exists(), "expected .mindmodel/manifest.yaml to exist in repo"

    report_path = tmp_path / "report.json"

    # Try module mode first, fallback to direct script invocation
    cmds = [
        [
            sys.executable,
            "-m",
            "scripts.validate_mindmodel",
            str(manifest),
            "--report",
            str(report_path),
        ],
        [
            sys.executable,
            "scripts/validate_mindmodel.py",
            str(manifest),
            "--report",
            str(report_path),
        ],
    ]

    result = None
    for cmd in cmds:
        try:
            result = subprocess.run(cmd, check=False, capture_output=True, text=True)
            # if process ran (any exit code), break and use this result
            break
        except FileNotFoundError:
            continue

    assert result is not None, "Failed to run script (no suitable invocation)"
    # CLI should exit with 0 (report-only)
    assert result.returncode == 0, (
        f"CLI exited non-zero: {result.returncode}\nstderr: {result.stderr}"
    )

    assert report_path.exists(), f"Report file was not created at {report_path}"

    data = json.loads(report_path.read_text(encoding="utf-8"))
    # top-level keys expected from validator
    for key in ("missing_files", "truncated_evidence", "potential_secrets"):
        assert key in data, f"Report JSON missing key: {key}"
@@ -1,22 +0,0 @@
import json

from src.types.motion_types import SimilarityNeighbor, to_json, from_json


def test_similarity_neighbor_json_roundtrip():
    neighbors = [
        SimilarityNeighbor(motion_id="m1", score=0.9),
        SimilarityNeighbor(motion_id="m2", score=0.75),
    ]

    # Serialize to JSON string
    json_str = to_json(neighbors)
    assert isinstance(json_str, str)

    # Ensure it's valid JSON
    parsed = json.loads(json_str)
    assert isinstance(parsed, list)

    # Deserialize back to objects
    recovered = from_json(json_str)
    assert recovered == neighbors
@@ -1,45 +0,0 @@
import os
import tempfile
from pathlib import Path

import pytest

from src.validators.mindmodel_validator import validate_manifest


def _write_temp_manifest(contents: str) -> str:
    fd, path = tempfile.mkstemp(prefix="manifest_", suffix=".yaml")
    os.close(fd)
    with open(path, "w", encoding="utf-8") as f:
        f.write(contents)
    return path


def test_validator_reports_missing_file(tmp_path):
    # manifest referencing a non-existent file
    missing = str(tmp_path / "no_such_file.txt")
    manifest = f"""
files:
  - path: {missing}
"""
    mpath = _write_temp_manifest(manifest)
    try:
        report = validate_manifest(mpath)
        assert "missing_files" in report
        assert missing in report["missing_files"]
    finally:
        Path(mpath).unlink()


def test_validator_detects_potential_secret(tmp_path):
    # manifest with evidence_excerpt containing PASSWORD
    evidence = "This shows a PASSWORD=hunter2 in the output"
    manifest = f'files:\n - path: some_file.txt\n evidence_excerpt: "{evidence}"\n'
    mpath = _write_temp_manifest(manifest)
    try:
        report = validate_manifest(mpath)
        assert "potential_secrets" in report
        items = report["potential_secrets"]
        assert any(evidence in (item.get("evidence_excerpt") or "") for item in items)
    finally:
        Path(mpath).unlink()
@@ -1,24 +0,0 @@
import os
from pathlib import Path

import pytest

from src.validators.types import parse_manifest, Manifest


def test_manifest_model_parses_sample(tmp_path: Path):
    sample = """
files:
  - path: data/file1.txt
    evidence_excerpt: "some evidence"
  - file_path: data/file2.txt
    evidence_excerpt: "other evidence"
"""
    p = tmp_path / "manifest.yaml"
    p.write_text(sample, encoding="utf-8")

    manifest = parse_manifest(str(p))
    assert isinstance(manifest, Manifest)
    assert len(manifest.files) == 2
    assert manifest.files[0]["path"] == "data/file1.txt"
    assert manifest.files[1]["path"] == "data/file2.txt"
@@ -1,56 +0,0 @@
import os
from pathlib import Path

from src.validators.mindmodel_validator import validate_manifest


def test_missing_files_reported(tmp_path):
    # create two paths that do not exist
    p1 = str(tmp_path / "missing_one.txt")
    p2 = str(tmp_path / "missing_two.txt")

    manifest = f"""
files:
  - path: {p1}
  - path: {p2}
"""

    mpath = tmp_path / "manifest_missing.yaml"
    mpath.write_text(manifest, encoding="utf-8")

    report = validate_manifest(str(mpath))
    assert "missing_files" in report
    # both missing paths should be reported
    assert p1 in report["missing_files"]
    assert p2 in report["missing_files"]


def test_truncated_evidence_and_secrets_reported(tmp_path):
    # entry with truncated evidence (ends with ...)
    trunc_path = str(tmp_path / "trunc.txt")
    trunc_evidence = "This output was cut off..."

    # entry with potential secret (contains PASSWORD)
    secret_path = str(tmp_path / "secret.txt")
    secret_evidence = "Found PASSWORD=sekret123 in the logs"

    manifest = f"""
files:
  - path: {trunc_path}
    evidence_excerpt: "{trunc_evidence}"
  - path: {secret_path}
    evidence_excerpt: "{secret_evidence}"
"""

    mpath = tmp_path / "manifest_edgecases.yaml"
    mpath.write_text(manifest, encoding="utf-8")

    report = validate_manifest(str(mpath))

    # truncated evidence should report the trunc_path
    assert "truncated_evidence" in report
    assert any(item.get("path") == trunc_path for item in report["truncated_evidence"])

    # potential secrets should report the secret_path
    assert "potential_secrets" in report
    assert any(item.get("path") == secret_path for item in report["potential_secrets"])
@@ -1,40 +0,0 @@
# 2026-03-28 Ansible package implementation

Summary of changes added to repository:

- packages/@ansible/example/
  - package.json (scoped package @ansible/example)
  - README.md
  - src/index.js
  - tests/ (test_package_json.js, test_pack_inspect.js, _pack_helpers.js, run.js)
- .github/workflows/publish-ansible-example.yml
- .github/workflows/deploy-motief.yml
- docs/deployment/ansible-package-deploy.md
- docs/embeddings.md
- README.md (top-level)
- thoughts/shared/changes/2026-03-28-ansible-package-implementation.md (this file)

Verification commands (run from repo root):

1. Run package tests:
   cd packages/@ansible/example && npm test

2. Run pack inspection:
   cd packages/@ansible/example && node tests/test_pack_inspect.js

3. Simulate pack locally:
   cd packages/@ansible/example && npm pack && tar -tzf <produced-tgz> | head -n 20

4. Check workflows syntax locally (optional):
   - Use `act` or `nektos/act` to run workflow_dispatch triggers in a container; ensure secrets are not printed.

5. Verify docs updated for embeddings and deployment: open docs/embeddings.md and docs/deployment/ansible-package-deploy.md

Notes:
- Do NOT add secrets to repo. Secrets: NPM_TOKEN, DEPLOY_SSH_KEY, DEPLOY_HOST, DEPLOY_USER, DEPLOY_SSH_PORT, OPENROUTER_API_KEY

Contact: Sven Geboers

End of changelog.

Write the file with neutral tone and concise steps for verification.
@@ -1,36 +0,0 @@
---
date: 2026-03-28
title: "Remove .env from tracking - report"
---

Summary
-------

I removed `.env` from the repository index and added it to `.gitignore` to prevent accidental future commits. This was a non-destructive, forward-facing change; the repository history still contains prior commits that touched `.env`.

What I ran
-----------

- git rm --cached .env
- ensured `.gitignore` contains `.env`
- committed the change: chore(secrets): stop tracking .env and add to .gitignore

Commits that referenced .env
----------------------------

These commits touched `.env` in the repository history (from git log --all -- .env):

- 35f4667 2026-03-28 Sven Geboers chore(secrets): stop tracking .env and add to .gitignore
- 3551a82 2026-03-21 Sven Geboers feat(analysis): add 2D political compass and 2D trajectories

Notes
-----

- The `.env` file was removed from the index but remains in historical commits. If you need to remove it from history, we can perform a history rewrite (git-filter-repo or BFG) and force-push; this is destructive and requires coordination.
- I created a CI guard to fail builds if a `.env` file is present in the repository root (see .github/workflows/forbid-env.yml). This prevents accidental re-adding via pushes/PRs.

Next steps (recommended)
------------------------

1. Rotate secrets that might have been in `.env` (see the secrets-rotation checklist next). This is mandatory if those keys were used anywhere publicly or in shared CI.
2. If you require history purge, reply confirming and I'll prepare a filter-repo run and the exact force-push sequence.
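
The CI guard mentioned in the notes amounts to a presence check at the repository root. The contents of forbid-env.yml are not shown here, so the following is only a minimal Python sketch of the idea; `find_forbidden_env` and `main` are hypothetical names chosen for illustration.

```python
from pathlib import Path


def find_forbidden_env(repo_root: str, name: str = ".env") -> bool:
    """Return True if a file called `name` sits directly in repo_root.

    Hypothetical helper for illustration; the real workflow may differ.
    """
    return (Path(repo_root) / name).is_file()


def main(repo_root: str = ".") -> int:
    # A non-zero return value, used as the exit code, is what fails a CI step.
    if find_forbidden_env(repo_root):
        print(f"error: {name_hint(repo_root)} is present; remove it from the repository")
        return 1
    return 0


def name_hint(repo_root: str) -> str:
    # Build a readable path for the error message.
    return str(Path(repo_root) / ".env")
```

In a fresh CI checkout the working tree contains only tracked files, so a `.env` found at the root there would mean it was committed again.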
@@ -1,25 +0,0 @@
---
date: 2026-03-28
title: "Secrets rotation checklist"
---

Rotate these secrets if they were stored in `.env` or otherwise exposed:

- OPENROUTER_API_KEY / OPENAI_API_KEY
- NPM_TOKEN
- DEPLOY SSH keys or passwords (DEPLOY_SSH_KEY, DEPLOY_PASSWORD)
- Any database credentials, API keys, or third-party service tokens

Steps
-----

1. Revoke the current tokens in each provider's dashboard.
2. Create new tokens/keys and store them in the repository secrets (GitHub Settings → Secrets).
3. Update any running services / CI variables to use the new tokens.
4. If you used SSH keys and replaced them, update the authorized_keys on the VPS and remove the old key.

Verification
------------

- Use CI dry-run jobs that check connectivity and token validity.
- Run local commands that use the new tokens.
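
Before running commands that depend on the rotated tokens, it can help to confirm they are set at all. A minimal sketch, assuming the secrets are exposed as environment variables under the names listed above; `missing_secrets` and `REQUIRED` are hypothetical and not part of this repository.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Names taken from the checklist above; adjust to what the deployment actually uses.
REQUIRED = ("OPENROUTER_API_KEY", "NPM_TOKEN", "DEPLOY_SSH_KEY")


def missing_secrets(
    required: Iterable[str] = REQUIRED,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or blank in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

This only checks presence. Whether a token is still valid has to be confirmed against the provider, for example with the CI dry-run jobs mentioned above.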