Removes:
- .mindmodel/ directory and related CI workflows (mindmodel-schedule.yml, mindmodel-validation.yml)
- scripts/mindmodel/ and scripts/validate_mindmodel.py
- src/types/ and src/validators/ (orphaned type modules, only used by mindmodel)
- tests/ci/, tests/scripts/mindmodel/, tests/types/, tests/validators/ (mindmodel-only tests)
- thoughts/ledgers/ and thoughts/shared/ (stale transient directories)
- .venv_axis and .venv_plotly (orphaned virtual environments, ~1.1 GB)
- outputs/blog-charts/ (stale generated HTML files)
- data/*.json sidecars (empty cache artifacts)
- __pycache__ and *.pyc files across repo

Updates:
- .gitignore: remove thoughts/shared/analyses/ entry

Space reclaimed: ~1.1 GB+
main
parent
6e36fa2604
commit
07dd393533
@@ -1,37 +0,0 @@
name: mindmodel scheduled validate

on:
  schedule:
    - cron: '0 0 * * 0' # weekly

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          version: "0.6.x"

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"

      - name: Install dependencies
        run: uv sync --locked

      - name: Run tests
        run: uv run pytest tests/ -q

      - name: Run mindmodel validator if manifest exists
        if: ${{ always() }}
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            uv run python -m scripts.mindmodel.cli || true
          else
            echo "No .mindmodel/manifest.yaml present — skipping validator"
          fi
@@ -1,47 +0,0 @@
name: mindmodel validation

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Install development dependencies (if present)
        run: |
          python -m pip install --upgrade pip
          if [ -f requirements-dev.txt ]; then
            pip install -r requirements-dev.txt
          else
            echo "requirements-dev.txt not found, skipping"
          fi

      - name: Run mindmodel validator (report-only)
        if: ${{ always() }}
        run: |
          # Make this step report-only: run the validator but always exit 0 so PRs are not blocked
          set +e
          if [ -f .mindmodel/manifest.yaml ]; then
            python scripts/validate_mindmodel.py --manifest .mindmodel/manifest.yaml --report reports/out.json || true
          else
            echo "No .mindmodel/manifest.yaml present — skipping validator"
          fi
          exit 0

      - name: Upload mindmodel reports
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: mindmodel-reports
          path: reports/mindmodel-report-*.json
@@ -1,11 +0,0 @@
# .mindmodel

This directory contains a generated, read-only snapshot of the repository's "mind model" — structured metadata and evidence used by tooling to reason about repository intent, patterns, and decisions.

Guidelines
- Read-only: Treat files in this directory as generated artifacts. Local tooling or CI may regenerate or validate them; avoid manual edits unless you are intentionally updating the generator.
- No secrets: Do not place any credentials, tokens, or sensitive data here. The validator that consumes this folder is designed to detect common secret patterns and will fail if secrets are found.
- Safe to read: Tools and CI may read these files. They must avoid opening or parsing arbitrary repository secrets and should operate in read-only mode.
- Validation: CI workflows will run a validator against this folder (if present) to ensure manifest shape, evidence snippets, and referenced files meet project rules.

If you need to propose a change to the mind model, open a PR describing the intent and the generator changes. The CI validator will validate the submitted artifact before merge.
@@ -1,127 +0,0 @@
---
title: Anti-Patterns in Stemwijzer
category: anti-patterns
severity: critical
---

# Anti-Patterns

> **NOTE**: Some anti-patterns below were investigated and found to be resolved or invalid. See individual entries for details.

## CRITICAL: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: 11 instances of `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception as e:
        _logger.exception("Error fetching motions from API: %s", e)
        return []
```

---

## CRITICAL: Global `_DummySt` Replacement

**File**: `explorer.py`
**Evidence**: Lines ~50-70, module-level `st = _DummySt()` global replacement

**Problem**: Creates a module-level variable `st` that shadows the `streamlit` module, causing subtle bugs.

**Fix**: Use conditional flags instead of global replacement (illustrated here with the optional `plotly` import):
```python
# GOOD: Use conditional logic
try:
    import plotly.express as px
    import plotly.graph_objects as go
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    px = None
    go = None

def render_chart(data):
    if not HAS_PLOTLY:
        _logger.warning("Plotly not available")
        return
    # ... rest of chart logic
```

---

## WARNING: Logger Naming Inconsistency

**Evidence**: 16 files use `logger`, 17 files use `_logger`

**Files with `logger`** (without underscore):
- api_client.py, ai_provider.py, pipeline files, analysis files

**Files with `_logger`** (with underscore):
- database.py, explorer.py, explorer_helpers.py

**Recommendation**: Standardize on `_logger` for module-level loggers.

---

## WARNING: Bare except with pass

**File**: `database.py`, line 47

```python
# BAD - a bare except also catches KeyboardInterrupt and SystemExit
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped: %s", exc)
```

---

## INVESTIGATED: Entity-ID / Party-Name Mismatch

**Status**: INVALID - investigated and resolved

**Investigation Summary**: `svd_vectors.entity_id` only contains MP names (not party names). Party centroids are correctly computed via `mp_metadata` lookups. No production bug exists.

---

## Pattern: Three Separate Party Alias Dictionaries

**Problem**: Party name variations exist in 3+ places with no canonical alias mapping.

**Fix**: Create one `PARTY_ALIASES` dict in `config.py`:
```python
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
    # ...
}
```
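
A single alias dict is only useful if callers can resolve any variant to its canonical name. The following is a minimal sketch of one way to do that, assuming the two entries shown above; `normalize_party` and `_ALIAS_TO_CANONICAL` are hypothetical names that do not appear in the repository.

```python
# Hypothetical sketch: invert PARTY_ALIASES once into a flat lookup table.
PARTY_ALIASES = {
    "GroenLinks-PvdA": ["GL-PvdA", "GroenLinks PvdA", "PvdA-GroenLinks"],
    "PVV": ["Partij voor de Vrijheid"],
}

# Map every alias (and each canonical name itself) to the canonical name.
# Keys are case-folded so lookups are case-insensitive.
_ALIAS_TO_CANONICAL = {
    alias.casefold(): canonical
    for canonical, aliases in PARTY_ALIASES.items()
    for alias in [canonical, *aliases]
}


def normalize_party(name: str) -> str:
    """Return the canonical party name, or the stripped input if unknown."""
    cleaned = name.strip()
    return _ALIAS_TO_CANONICAL.get(cleaned.casefold(), cleaned)
```

Returning unknown names unchanged keeps the helper safe to call on parties that have no alias entry yet; raising instead would be an equally valid choice.
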
@@ -1,143 +0,0 @@
---
title: Error Handling Patterns
category: constraints
severity: high
---

# Error Handling Patterns

## Core Rules

1. **Catch `Exception`, return safe fallbacks** (False/[]/None)
2. **Log exceptions with traceback** using `_logger.exception()`
3. **Never swallow exceptions silently** - always log before returning a sensible default
4. **Avoid nested try/except blocks** - flatten exception handling

## Pattern: Try/Except Safe Fallback

This is the dominant pattern in the codebase (219+ instances).

```python
# Standard pattern from database.py, api_client.py, etc.
try:
    result = risky_operation()
    return process(result)
except Exception as exc:
    _logger.warning("Operation failed: %s", exc)
    return safe_fallback  # False, [], None, {}
```

### Examples from Codebase

**database.py** - DuckDB operations:
```python
def get_svd_vectors(self, window: str):
    try:
        conn = duckdb.connect(self.db_path, read_only=True)
        try:
            result = conn.execute(query, (window,)).fetchall()
            return self._parse_vectors(result)
        finally:
            conn.close()
    except Exception as exc:
        _logger.warning("Failed to get SVD vectors: %s", exc)
        return []
```

**ai_provider.py** - HTTP retries:
```python
try:
    resp = requests.post(url, json=json, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
except requests.ConnectionError as exc:
    if attempt == retries:
        raise ProviderError(f"Connection error: {exc}") from exc
    # ... retry logic
```

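The `ai_provider.py` excerpt above elides its retry logic. As a rough illustration of what such a loop can look like, here is a dependency-free sketch with exponential backoff. It is not the project's implementation: `call_with_retries`, `base_delay`, and the injectable `sleep` parameter are assumptions, `ProviderError` is redefined locally as a stand-in, and the built-in `ConnectionError` is used in place of `requests.ConnectionError` so the block runs without third-party packages.

```python
import logging
import time
from typing import Callable, TypeVar

_logger = logging.getLogger(__name__)
T = TypeVar("T")


class ProviderError(Exception):
    """Stand-in for the project's ProviderError (its real definition is assumed)."""


def call_with_retries(
    operation: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except ConnectionError as exc:
            if attempt == retries:
                # Out of attempts: surface a domain error, keep the cause.
                raise ProviderError(f"Connection error: {exc}") from exc
            delay = base_delay * 2 ** (attempt - 1)
            _logger.warning(
                "Attempt %d/%d failed, retrying in %.1fs: %s",
                attempt, retries, delay, exc,
            )
            sleep(delay)
    # Only reachable when retries < 1, so the loop body never ran.
    raise ProviderError("retries must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real delays; production callers can rely on the `time.sleep` default.
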
## Pattern: Optional Dependency Fallback

Gracefully degrade when optional packages are unavailable.

```python
# UMAP fallback in explorer_helpers.py
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False
    _logger.debug("UMAP not available, using SVD vectors directly")

def project_to_2d(vectors):
    if HAS_UMAP:
        return umap.UMAP().fit_transform(vectors)
    return vectors[:, :2]  # Fallback: first 2 SVD dimensions
```

## Anti-Patterns

### 1. Bare except with pass (CRITICAL)
**File**: `database.py`, line 47

```python
# BAD - a bare except also catches KeyboardInterrupt and SystemExit
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:  # bare except
    pass
```

**Fix**: Catch specific exception or log and continue:
```python
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception as exc:
    _logger.debug("Sequence creation skipped (may already exist): %s", exc)
```

### 2. Nested Exception Handling
**File**: `explorer.py`, lines 244-261

```python
# BAD - opaque error paths
try:
    result = compute_svd(motions)
except Exception:
    try:
        result = fallback_compute(motions)
    except Exception:
        pass  # Both exceptions silently dropped
```

**Fix**: Log each failure and re-raise when the fallback also fails, so that no error is dropped silently:
```python
# GOOD - explicit handling
try:
    result = compute_svd(motions)
except Exception as exc:
    _logger.warning("SVD failed, trying fallback: %s", exc)
    try:
        result = fallback_compute(motions)
    except Exception as fallback_exc:
        _logger.error("Both SVD approaches failed: %s, %s", exc, fallback_exc)
        raise
```

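Core rule 4 asks for flattened handling, yet a primary-plus-fallback computation tends to pull a second `try` inside the first `except`. One possible way to avoid the nesting is to iterate over the candidate strategies in a single loop. The sketch below is illustrative only; `first_successful` is a hypothetical helper, not something from the codebase.

```python
import logging
from typing import Callable, Optional, Sequence, TypeVar

_logger = logging.getLogger(__name__)
T = TypeVar("T")


def first_successful(strategies: Sequence[Callable[[], T]]) -> T:
    """Try each strategy in order; log every failure, re-raise the last one."""
    last_exc: Optional[Exception] = None
    for strategy in strategies:
        try:
            return strategy()
        except Exception as exc:
            name = getattr(strategy, "__name__", repr(strategy))
            _logger.warning("%s failed: %s", name, exc)
            last_exc = exc
    if last_exc is None:
        # The sequence was empty, so nothing was attempted.
        raise ValueError("no strategies were provided")
    raise last_exc
```

With the names from the example above, a call could look like `first_successful([lambda: compute_svd(motions), lambda: fallback_compute(motions)])`. Every failure is logged, and the caller still sees an exception if all strategies fail.
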
## Rule Summary

| Pattern | When to Use | Return Value |
|---------|-------------|--------------|
| Safe fallback | Best-effort operations | `[]`, `{}`, `False`, `None` |
| Re-raise | Critical operations that must succeed | raise |
| Log and continue | Optional steps in pipeline | (continue) |
| Graceful degradation | Optional dependencies | Default behavior |

## When to Log vs Return

| Scenario | Action |
|----------|--------|
| User action fails | Log warning, return safe default |
| Internal error (corrupt data) | Log error, return safe default |
| Transient failure (network) | Log warning, retry if appropriate |
| Configuration error | Log error, raise with clear message |
@@ -1,205 +0,0 @@
# Import Organization Constraints

## Standard Order

Organize imports in three groups with blank lines between:

```python
# 1. Standard library imports (alphabetical within group)
import json
import logging
import os
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# 2. Third-party packages (alphabetical within group)
import duckdb
import requests

# 3. Local application modules (can use relative imports)
from config import config
from database import db
from summarizer import summarizer
```

## Alphabetical Ordering

Within each group, sort imports alphabetically:

```python
# GOOD - alphabetical
import json
import logging
from datetime import datetime
from typing import Dict, List, Optional

# BAD - random order
from typing import Optional
import json
from datetime import datetime
import logging
from typing import Dict, List
```

## Grouping Rules

### Standard Library
- `json`, `logging`, `os`, `sys`, `time`
- `datetime`, `timedelta` from `datetime`
- `Dict`, `List`, `Optional`, etc. from `typing`
- `argparse`, `pathlib`, `re`, `uuid`

### Third-Party
- `duckdb`, `requests`, `streamlit`
- `numpy`, `scipy`, `sklearn`
- `plotly`, `bs4` (installed as `beautifulsoup4`)
- `pytest`

### Local Application
- Modules from same package
- Relative imports when appropriate

## When to Use `from X import Y`

### Prefer `from module import specific_items` for:
- Constants and config
- Single classes or functions used frequently
- Type annotations

```python
# GOOD - clear about what we're using
from config import config
from database import db

# GOOD - type hints
from typing import Dict, List, Optional
```

### Use `import module` when:
- You need multiple items from the module
- Using module.namespace is clearer

```python
# GOOD - duckdb used for types and module access
import duckdb

conn = duckdb.connect(...)
result = conn.execute(...)

# Also acceptable for types
from typing import Dict
```

## Relative Imports

In package modules, prefer relative imports:

```python
# pipeline/svd_pipeline.py
from ..database import MotionDatabase  # relative import
from .text_pipeline import process_text  # relative import
```

## Circular Imports

Avoid circular imports by:
1. Moving shared code to a third module
2. Using TYPE_CHECKING for type hints only

```python
# types.py - shared type definitions
from typing import TypedDict

class MotionDict(TypedDict):
    id: int
    title: str
    ...

# module_a.py
from .types import MotionDict

# module_b.py - if needed here too
from .types import MotionDict
```

## Import Patterns to Avoid

### Wildcard Imports
```python
# BAD
from database import *

# GOOD
from database import db, MotionDatabase
```

### Import in Function Scope (unless necessary)
```python
# AVOID - delays import, makes dependencies unclear
def some_function():
    import pandas as pd  # Late import
    return pd.DataFrame(...)

# PREFER - import at module level
import pandas as pd

def some_function():
    return pd.DataFrame(...)
```

### Reassigning Imported Names
```python
# BAD - confusing
from module import process
process = something_else  # Reassigning

# GOOD - clear naming
from module import process as process_data
```

## Type Checking Imports

For type hints only, use TYPE_CHECKING:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .models import Motion

def get_motion(motion_id: int) -> "Motion":  # String quote for forward ref
    ...
```

## Optional Dependency Imports

Handle optional dependencies gracefully:

```python
try:
    import duckdb
except ImportError:
    duckdb = None  # Will be checked later

class MotionDatabase:
    def __init__(self):
        # Fall back to file mode when DuckDB is unavailable
        self._file_mode = duckdb is None
```

## Example: Complete Import Block

```python
# Complete example from database.py
import json
import logging
import uuid
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

import duckdb

from config import config

from database import db
```
@@ -1,131 +0,0 @@
---
title: Logging Constraints
category: constraints
severity: critical
---

# Logging Constraints

## Core Rule

Use `logging.getLogger(__name__)` - never use `print()`

**CRITICAL ANTI-PATTERN**: `api_client.py` uses `print()` instead of logging (11 instances).

## CRITICAL Anti-Pattern: print() Instead of Logging

**File**: `api_client.py`
**Evidence**: Lines with `print(f"...")` instead of `_logger.info(...)`

**Broken code**:
```python
def get_motions(self, ...):
    try:
        # ...
        print(f"Fetched {len(voting_records)} voting records from API")  # BAD
        print(f"Processed into {len(motions)} unique motions")  # BAD
    except Exception as e:
        print(f"Error fetching motions from API: {e}")  # BAD - no traceback
```

**Fix**:
```python
import logging

_logger = logging.getLogger(__name__)

def get_motions(self, ...):
    try:
        _logger.info("Fetched %d voting records from API", len(voting_records))
        _logger.info("Processed into %d unique motions", len(motions))
    except Exception as e:
        _logger.exception("Error fetching motions from API: %s", e)
        return []
```

## Logger Initialization

Get logger at module level:

```python
# GOOD: Use logging.getLogger(__name__)
import logging

_logger = logging.getLogger(__name__)

def some_function():
    _logger.info("Processing started")
    _logger.debug("Detail: %s", detail)
```

## Logger Naming

Use `__name__` for automatic module path:

```python
# In database.py - logger will be "database"
_logger = logging.getLogger(__name__)

# In pipeline/svd_pipeline.py - logger will be "pipeline.svd_pipeline"
_logger = logging.getLogger(__name__)
```

**INCONSISTENCY WARNING**: 16 files use `logger`, 17 files use `_logger`. Choose one convention.

**Recommendation**: Use `_logger` (with underscore) for module-level loggers to distinguish from class-level loggers.

## Log Levels

| Level | When to Use |
|-------|-------------|
| DEBUG | Detailed diagnostic info (dev only) |
| INFO | Normal operation milestones |
| WARNING | Unexpected but handled (fallbacks) |
| ERROR | Operation failed, may need attention |
| CRITICAL | Fatal error, program may crash |

## Exception Logging

Use `_logger.exception()` for caught exceptions (includes traceback):

```python
try:
    result = risky_operation()
except Exception as exc:
    _logger.exception("Operation failed: %s", exc)
    return fallback_value
```

## Anti-Patterns

### Debug Prints in Production Code
```python
# BAD
print(f"[TRAJ DEBUG] processing window {wid}")

# GOOD
_logger.debug("Processing window %s", wid)
```

### Inconsistent Logger Names
```python
# BAD - mixing _logger and logger
_logger = logging.getLogger(__name__)
logger = logging.getLogger("other")  # Inconsistent
```

## Sensitive Data

Never log sensitive information:
- API keys
- User votes
- Session IDs (if tied to user data)
- Personal information

```python
# BAD
_logger.info("User %s voted %s", user_id, vote)

# GOOD - log only that a vote was recorded, with a truncated session ID
_logger.info("Vote recorded for session %s", session_id[:8])
```
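
Call-site discipline can be backed up by a filter that masks secret-looking values before a record is emitted. The sketch below is one possible safety net, not something the repository is known to contain: `RedactSecretsFilter` and the `api_key=` / `token=` pattern are assumptions chosen for illustration.

```python
import logging
import re

# Assumed pattern: values that follow "api_key=" or "token=" style keys.
_SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token)=([^\s&]+)")


class RedactSecretsFilter(logging.Filter):
    """Mask secret-looking key=value pairs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # applies the %-style arguments
        record.msg = _SECRET_PATTERN.sub(r"\1=***", message)
        record.args = None  # the message is already fully formatted
        return True  # never drop the record, only rewrite it
```

Attaching the filter to a handler with `handler.addFilter(RedactSecretsFilter())` applies it to every record that handler emits, including records propagated from child loggers.
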
@@ -1,141 +0,0 @@
# Naming Constraints

## File Names

### Python Modules
- **Convention**: `snake_case.py`
- **Examples**: `motion_database.py`, `api_client.py`, `text_pipeline.py`

### Test Files
- **Convention**: `test_<module_name>.py`
- **Examples**: `test_database.py`, `test_api_client.py`

### Config Files
- **Convention**: `snake_case`
- **Examples**: `config.py`, `.env.example`, `pyproject.toml`

### Directories
- **Convention**: `snake_case/`
- **Examples**: `pipeline/`, `tests/integration/`, `src/validators/`

## Class Names

- **Convention**: `PascalCase`
- **Examples**: `MotionDatabase`, `TweedeKamerAPI`, `MotionSummarizer`

### Naming Patterns
| Pattern | Example |
|---------|---------|
| Database wrapper | `MotionDatabase` |
| API client | `TweedeKamerAPI` |
| Service/Helpers | `MotionScraper`, `MotionAnalyzer` |
| Exceptions | `ProviderError` |

## Function Names

- **Convention**: `snake_case`
- **Examples**: `get_motions`, `compute_similarity`, `process_voting_records`

### Private Methods
- **Convention**: `_snake_case` (single underscore prefix)
- **Examples**: `_get_voting_records`, `_parse_response`

## Variable Names

### Regular Variables
- **Convention**: `snake_case`
- **Examples**: `motion_id`, `party_name`, `voting_results`

### Constants (Module-Level)
- **Convention**: `UPPER_SNAKE_CASE`
- **Examples**: `DATABASE_PATH`, `API_TIMEOUT`, `MAX_RETRIES`

### Config Variables (in dataclass)
- **Convention**: `UPPER_SNAKE_CASE`
- **Examples**: `QWEN_MODEL`, `POLICY_AREAS`

### Booleans
- **Convention**: `is_`, `has_`, `can_` prefixes or `_flag` suffix
- **Examples**: `is_active`, `has_votes`, `skip_extract`

### Private Variables
- **Convention**: `_underscore_prefix`
- **Examples**: `_conn`, `_cache`, `_session`

## Singleton Instances

- **Convention**: `lower_snake_case` at module level
- **Examples**: `db = MotionDatabase()`, `summarizer = MotionSummarizer()`

```python
# database.py
class MotionDatabase:
    ...

# Singleton instance
db = MotionDatabase()

# Usage
from database import db
motions = db.get_motions()
```

## Type Variables and Aliases

- **Convention**: `PascalCase`
- **Examples**: `T = TypeVar('T')`, `MotionDict = Dict[str, Any]`

## Anti-Patterns

### Inconsistent Naming
```python
# BAD - mixing styles
get_motions()  # snake_case
GetMotionById()  # PascalCase
processData()  # camelCase

# GOOD - consistent snake_case
get_motions()
get_motion_by_id()
process_voting_data()
```

### Abbreviations
```python
# AVOID - unclear abbreviations
calc_similarity()  # calculate_*
proc_votes()  # process_*
get_mp_data()  # get_mp_metadata()

# PREFER - full words
calculate_similarity()
process_votes()
get_mp_metadata()
```

### Hungarian Notation
```python
# BAD - Hungarian notation
str_title = "..."
int_count = 0
b_is_active = True

# GOOD - clear types via naming
title = "..."
count = 0
is_active = True
```

## Special Cases

### Window IDs
- **Format**: `"YYYY-QN"` or `"YYYY"`
- **Examples**: `"2024-Q1"`, `"2024-Q2"`, `"2024"`

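A format rule like the one for window IDs is easiest to enforce with a small parser. The following is a sketch under stated assumptions: `parse_window_id` is a hypothetical helper, and restricting the quarter to 1 through 4 is an inference from the `"YYYY-QN"` notation rather than something the document specifies.

```python
import re
from typing import Optional, Tuple

# "YYYY" or "YYYY-QN"; limiting N to 1..4 is an assumption.
_WINDOW_ID = re.compile(r"(\d{4})(?:-Q([1-4]))?")


def parse_window_id(window_id: str) -> Tuple[int, Optional[int]]:
    """Return (year, quarter); quarter is None for whole-year windows."""
    match = _WINDOW_ID.fullmatch(window_id)
    if match is None:
        raise ValueError(f"Invalid window ID: {window_id!r}")
    year, quarter = match.groups()
    return int(year), (int(quarter) if quarter else None)
```

Using `fullmatch` rather than `match` rejects trailing junk such as `"2024-Q1x"` without needing explicit anchors in the pattern.
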
### Policy Areas
- **Convention**: Title Case, spaces allowed
- **Examples**: `"Economie"`, `"Sociale Zaken"`, `"Klimaat"`

### Vote Values
- **Convention**: Capitalized Dutch terms
- **Values**: `"Voor"`, `"Tegen"`, `"Onthouden"`, `"Geen stem"`, `"Afwezig"`
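
Because the vote values form a small closed set, they could be modeled as an enumeration so that typos fail loudly instead of passing through as arbitrary strings. This is only a sketch: the `Vote` class, its member names, and `parse_vote` are hypothetical, while the string values are the five listed above.

```python
from enum import Enum


class Vote(str, Enum):
    """Vote values as listed above; the member names are hypothetical."""

    VOOR = "Voor"
    TEGEN = "Tegen"
    ONTHOUDEN = "Onthouden"
    GEEN_STEM = "Geen stem"
    AFWEZIG = "Afwezig"


def parse_vote(raw: str) -> Vote:
    """Convert a raw string to a Vote; raises ValueError if it is unknown."""
    return Vote(raw.strip())
```

Mixing in `str` lets members compare equal to the plain strings, so code that still handles raw values keeps working while it is migrated.
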
@@ -1,26 +0,0 @@
# Testing conventions constraint (YAML)

rules:
  - name: test_naming
    rule: "Use pytest and name tests test_*.py and test_* functions."
    examples:
      - good: "tests/test_text_pipeline.py"
      - bad: "tests/text_pipeline_test.py"

  - name: fixtures_and_conftest
    rule: "Place shared fixtures in tests/conftest.py or tests/fixtures/ for reuse."
    examples:
      - good: "use fixtures declared in tests/conftest.py"

  - name: assert_raises
    rule: "Explicitly assert expected exceptions with pytest.raises for invalid input."
    examples:
      - good: |
          import pytest

          def test_invalid_input():
              with pytest.raises(ValueError):
                  function_under_test('bad')

enforcement_examples:
  - "Run pytest in CI; fail if tests don't run or if there are regressions."
@@ -1,233 +0,0 @@
# Type Hint Constraints

## Core Rule

**Use type hints on all public functions and methods**

## Function Type Hints

### Required on Public APIs

```python
# GOOD - complete type hints
def get_motion(self, motion_id: int) -> Optional[Dict]:
    ...

def get_filtered_motions(
    self,
    policy_area: str = "Alle",
    limit: int = 10
) -> List[Dict]:
    ...

def calculate_similarity(self, motion_a: int, motion_b: int) -> float:
    ...
```

### Optional Parameters

Use `Optional[X]` or `X | None` (the latter requires Python 3.10+):

```python
# Both forms are acceptable
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]:
    ...

def get_motion(self, motion_id: int | None = None) -> dict | None:
    ...
```

### Multiple Return Types

Use `Union[X, Y]` or the `|` operator (Python 3.10+):

```python
# Acceptable forms
def parse_value(self, value: str) -> Union[bool, str, None]:
    ...

def parse_value(self, value: str) -> bool | str | None:
    ...
```

### Generic Types |
|
||||||
|
|
||||||
Use `List[X]`, `Dict[K, V]`, `Tuple[X, Y]`: |
|
||||||
|
|
||||||
```python |
|
||||||
from typing import Dict, List, Optional, Tuple |
|
||||||
|
|
||||||
def get_motions(self, ids: List[int]) -> Dict[int, Dict]: |
|
||||||
"""Map motion_id -> motion data.""" |
|
||||||
... |
|
||||||
|
|
||||||
def process_batch(self, items: List[str]) -> Tuple[List[str], List[str]]: |
|
||||||
"""Returns (successes, failures).""" |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Collection Types |
|
||||||
|
|
||||||
Prefer specific types over bare `list`/`dict`: |
|
||||||
|
|
||||||
```python |
|
||||||
# GOOD - specific types |
|
||||||
def get_votes(self) -> List[str]: |
|
||||||
... |
|
||||||
|
|
||||||
def get_metadata(self) -> Dict[str, Any]: |
|
||||||
... |
|
||||||
|
|
||||||
# ACCEPTABLE - for truly generic collections |
|
||||||
def merge_dicts(*dicts: dict) -> dict: |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## DuckDB Result Types |
|
||||||
|
|
||||||
DuckDB's `fetchone()` returns a tuple and `fetchall()` returns a list of tuples, so document the expected column order: |
|
||||||
|
|
||||||
```python |
|
||||||
def get_motion(self, motion_id: int) -> Optional[Tuple]: |
|
||||||
"""Returns (id, title, description, date, ...) or None.""" |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
result = conn.execute( |
|
||||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
|
||||||
).fetchone() |
|
||||||
return result |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
# Or use Dict for clarity |
|
||||||
def get_motion_as_dict(self, motion_id: int) -> Optional[Dict]: |
|
||||||
"""Returns motion dict or None.""" |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
row = conn.execute( |
|
||||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
|
||||||
).fetchone() |
|
||||||
if row: |
|
||||||
return { |
|
||||||
"id": row[0], |
|
||||||
"title": row[1], |
|
||||||
"description": row[2], |
|
||||||
... |
|
||||||
} |
|
||||||
return None |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
## Class/Instance Types |
|
||||||
|
|
||||||
Use `Self` (Python 3.11+) for methods that return an instance of their own class: |
|
||||||
|
|
||||||
```python |
|
||||||
from typing import Self |
|
||||||
|
|
||||||
class MotionDatabase: |
|
||||||
    def with_connection(self, path: str) -> Self: |
|
||||||
        """Return a new instance with a different path.""" |
|
||||||
        # type(self) keeps the Self annotation correct for subclasses. |
|
||||||
        return type(self)(db_path=path) |
|
||||||
``` |
|
||||||
|
|
||||||
## Callback/Function Types |
|
||||||
|
|
||||||
Use `Callable` to annotate parameters that accept functions: |
|
||||||
|
|
||||||
```python |
|
||||||
from typing import Callable |
|
||||||
|
|
||||||
def process_motions( |
|
||||||
motions: List[Dict], |
|
||||||
processor: Callable[[Dict], Any] |
|
||||||
) -> List[Any]: |
|
||||||
return [processor(m) for m in motions] |
|
||||||
``` |
|
||||||
|
|
||||||
## Type Aliases |
|
||||||
|
|
||||||
Define clear type aliases for domain concepts: |
|
||||||
|
|
||||||
```python |
|
||||||
from typing import Dict, List, Literal, Optional, TypedDict |
|
||||||
|
|
||||||
# Vote values |
|
||||||
VoteValue = Literal["Voor", "Tegen", "Onthouden", "Geen stem", "Afwezig"] |
|
||||||
|
|
||||||
# Policy areas |
|
||||||
PolicyArea = Literal["Alle", "Economie", "Klimaat", "Immigratie", ...] |
|
||||||
|
|
||||||
# Motion dict |
|
||||||
class MotionDict(TypedDict): |
|
||||||
id: int |
|
||||||
title: str |
|
||||||
description: Optional[str] |
|
||||||
date: Optional[str] |
|
||||||
policy_area: Optional[str] |
|
||||||
voting_results: Optional[str] # JSON string |
|
||||||
winning_margin: Optional[float] |
|
||||||
|
|
||||||
def get_motion(self, motion_id: int) -> Optional[MotionDict]: |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Avoid `Any` |
|
||||||
|
|
||||||
Use `Any` sparingly - prefer specific types: |
|
||||||
|
|
||||||
```python |
|
||||||
# AVOID - too vague |
|
||||||
def process(data: Any) -> Any: |
|
||||||
... |
|
||||||
|
|
||||||
# PREFER - specific types |
|
||||||
def process(motion: MotionDict) -> Optional[SimilarityResult]: |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Inline Type Hints |
|
||||||
|
|
||||||
For simple cases, inline hints are fine: |
|
||||||
|
|
||||||
```python |
|
||||||
def get_count(self) -> int: |
|
||||||
... |
|
||||||
|
|
||||||
def is_empty(self) -> bool: |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Docstring Type Hints |
|
||||||
|
|
||||||
For complex types, describe the structure in the docstring as well: |
|
||||||
|
|
||||||
```python |
|
||||||
def get_party_positions(self, window_id: str) -> Dict[str, List[float]]: |
|
||||||
"""Get party positions in political space. |
|
||||||
|
|
||||||
Args: |
|
||||||
window_id: Time window (e.g., "2024-Q1") |
|
||||||
|
|
||||||
Returns: |
|
||||||
Dict mapping party_name -> [x, y] coordinates |
|
||||||
|
|
||||||
Example: |
|
||||||
>>> positions = db.get_party_positions("2024-Q1") |
|
||||||
>>> positions["VVD"] |
|
||||||
[0.5, -0.3] |
|
||||||
""" |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Type Checking |
|
||||||
|
|
||||||
Type hints are not enforced at runtime; where validation matters, add an explicit `isinstance` check: |
|
||||||
|
|
||||||
```python |
|
||||||
def set_count(self, count: int) -> None: |
|
||||||
if not isinstance(count, int): |
|
||||||
raise TypeError(f"Expected int, got {type(count).__name__}") |
|
||||||
self._count = count |
|
||||||
``` |
|
||||||
@ -1,124 +0,0 @@ |
|||||||
# Naming Conventions |
|
||||||
|
|
||||||
## Files |
|
||||||
- **snake_case** for all Python files: `database.py`, `explorer_helpers.py`, `motion_cache.py` |
|
||||||
- **PascalCase** is not used for file names |
|
||||||
|
|
||||||
## Functions |
|
||||||
- **snake_case**: `get_svd_vectors()`, `compute_party_coords()`, `build_scatter_trace()` |
|
||||||
- Private helpers prefixed with `_`: `_get_window_data()` |
|
||||||
|
|
||||||
## Classes |
|
||||||
- **PascalCase**: `MotionDatabase`, `Config` |
|
||||||
- **Dataclass pattern** for Config: `@dataclass` decorator with typed fields |
|
||||||
|
|
||||||
## Variables |
|
||||||
- **snake_case**: `party_map`, `mp_name`, `svd_vectors`, `party_centroids` |
|
||||||
- **UPPER_SNAKE_CASE** for module-level constants: `PARTY_COLOURS`, `DEFAULT_WINDOW` |
|
||||||
|
|
||||||
## Module-Level Exports |
|
||||||
- **Singleton instance**: `db = MotionDatabase()` at module bottom (not class-level) |
|
||||||
- **Config instance**: `config = Config(...)` at module bottom |
|
||||||
- **Dicts**: `PARTY_COLOURS` exported from `config.py` |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
# Error Handling |
|
||||||
|
|
||||||
## Known Patterns |
|
||||||
1. **Bare except with pass** (ANTI-PATTERN - see anti-patterns.yaml) |
|
||||||
```python |
|
||||||
except: |
|
||||||
pass # database.py:47 |
|
||||||
``` |
|
||||||
|
|
||||||
2. **Graceful degradation**: catch specific exceptions, fall back to default |
|
||||||
```python |
|
||||||
try: |
|
||||||
result = compute_svd() |
|
||||||
except ImportError: |
|
||||||
result = DEFAULT_SVD |
|
||||||
``` |
|
||||||
|
|
||||||
3. **Optional dependency fallbacks**: |
|
||||||
```python |
|
||||||
try: |
|
||||||
import umap |
|
||||||
use_umap = True |
|
||||||
except ImportError: |
|
||||||
use_umap = False |
|
||||||
``` |
|
||||||
|
|
||||||
4. **Nested exception handling** (ANTI-PATTERN - see anti-patterns.yaml): |
|
||||||
```python |
|
||||||
try: |
|
||||||
... |
|
||||||
except Exception: |
|
||||||
try: |
|
||||||
... |
|
||||||
except Exception: |
|
||||||
pass |
|
||||||
``` |
|
||||||
|
|
||||||
## Rules |
|
||||||
- Never use bare `except:` — always specify exception type |
|
||||||
- Never swallow exceptions silently — log or return a sensible default |
|
||||||
- For optional deps, use `ImportError` or `ModuleNotFoundError` explicitly |
|
||||||
- Avoid nested try/except blocks |
|
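The rules above can be combined in one small function. This is a minimal sketch, not code from the repository: `parse_window` and the fallback value `DEFAULT_WINDOW = "2023"` are hypothetical names chosen for illustration. It catches one specific exception, logs it, and returns a sensible default.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_WINDOW = "2023"  # hypothetical fallback value, for illustration only


def parse_window(raw: str) -> str:
    """Return `raw` if it starts with a four-digit year, else the default."""
    try:
        int(raw[:4])
    except ValueError as exc:  # a specific exception type, never a bare `except:`
        # Log the failure instead of swallowing it silently.
        logger.warning("Invalid window %r (%s); using %s", raw, exc, DEFAULT_WINDOW)
        return DEFAULT_WINDOW
    return raw
```

A single flat `try`/`except` keeps the failure path visible and avoids the nested pattern listed as an anti-pattern.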
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
# Code Organization |
|
||||||
|
|
||||||
## Singleton Pattern |
|
||||||
Each module owns one shared instance: |
|
||||||
```python |
|
||||||
# database.py |
|
||||||
db = MotionDatabase() |
|
||||||
|
|
||||||
# config.py |
|
||||||
config = Config(...) |
|
||||||
PARTY_COLOURS = {...} |
|
||||||
``` |
|
||||||
|
|
||||||
## Pure Functions in Helpers |
|
||||||
`explorer_helpers.py` contains only pure functions (no IO, no Streamlit calls): |
|
||||||
```python |
|
||||||
def compute_party_coords(svd_vectors, party_map): |
|
||||||
"""Pure: no side effects, no imports from this module""" |
|
||||||
... |
|
||||||
|
|
||||||
def build_scatter_trace(df, color_col): |
|
||||||
"""Pure: returns Plotly trace dict""" |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Cached Data Loaders |
|
||||||
Use `@st.cache_data` for expensive data loading: |
|
||||||
```python |
|
||||||
@st.cache_data |
|
||||||
def load_svd_vectors(window: str) -> pd.DataFrame: |
|
||||||
return db.get_svd_vectors(window) |
|
||||||
``` |
|
||||||
|
|
||||||
## Dataclass Config |
|
||||||
```python |
|
||||||
@dataclass |
|
||||||
class Config: |
|
||||||
db_path: str = "data/stemwijzer.duckdb" |
|
||||||
default_window: str = "2023" |
|
||||||
party_colours: dict = field(default_factory=lambda: dict(PARTY_COLOURS))  # copy, so instances do not share one dict |
|
||||||
``` |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
# Imports |
|
||||||
|
|
||||||
## Ordering (convention) |
|
||||||
1. Standard library |
|
||||||
2. Third-party (streamlit, ibis, plotly, sklearn, umap) |
|
||||||
3. Local/relative imports |
|
||||||
|
|
||||||
## Avoid |
|
||||||
- Wildcard imports (`from module import *`) |
|
||||||
- Circular imports (ensure dependency direction: helpers → database → config) |
|
||||||
@ -1,92 +0,0 @@ |
|||||||
--- |
|
||||||
title: Dependencies and Library Usage |
|
||||||
category: dependencies |
|
||||||
--- |
|
||||||
|
|
||||||
# Dependencies and Library Usage |
|
||||||
|
|
||||||
## Core Dependencies |
|
||||||
|
|
||||||
### duckdb |
|
||||||
- **Required**: Yes |
|
||||||
- **Fallback**: None (core functionality) |
|
||||||
- **Usage**: SQL database for motions, embeddings, SVD vectors |
|
||||||
- **Files**: database.py, analysis/*.py, pipeline/*.py |
|
||||||
|
|
||||||
### streamlit |
|
||||||
- **Required**: Yes |
|
||||||
- **Fallback**: None |
|
||||||
- **Usage**: Web UI framework |
|
||||||
- **Files**: app.py, pages/*.py, explorer.py |
|
||||||
|
|
||||||
### requests |
|
||||||
- **Required**: Yes |
|
||||||
- **Fallback**: None |
|
||||||
- **Usage**: HTTP client for API calls |
|
||||||
- **Files**: api_client.py, ai_provider.py |
|
||||||
|
|
||||||
### plotly |
|
||||||
- **Required**: Yes |
|
||||||
- **Fallback**: None (raises ImportError) |
|
||||||
- **Usage**: Interactive charts for explorer |
|
||||||
- **Files**: explorer.py, explorer_helpers.py |
|
||||||
|
|
||||||
## Optional Dependencies |
|
||||||
|
|
||||||
### umap-learn |
|
||||||
- **Required**: No |
|
||||||
- **Fallback**: Use raw SVD vectors (first 2 dimensions) |
|
||||||
- **Usage**: Dimensionality reduction for visualization |
|
||||||
- **Files**: analysis/clustering.py |
|
||||||
|
|
||||||
### matplotlib |
|
||||||
- **Required**: No |
|
||||||
- **Fallback**: Plotly or raw output |
|
||||||
- **Usage**: Static charting |
|
||||||
- **Files**: Various analysis scripts |
|
||||||
|
|
||||||
## ML Dependencies |
|
||||||
|
|
||||||
### sklearn |
|
||||||
- **Required**: Yes |
|
||||||
- **Usage**: KMeans clustering, cosine_similarity, StandardScaler |
|
||||||
- **Files**: analysis/clustering.py, similarity/compute.py |
|
||||||
|
|
||||||
### scipy |
|
||||||
- **Required**: Yes |
|
||||||
- **Usage**: SVD (`scipy.linalg.svd`) and Procrustes alignment (`scipy.spatial.procrustes`) |
|
||||||
- **Files**: analysis/trajectory.py, pipeline/svd_pipeline.py |
|
||||||
|
|
||||||
### numpy |
|
||||||
- **Required**: Yes |
|
||||||
- **Usage**: Array operations, linear algebra |
|
||||||
- **Files**: Throughout codebase |
|
||||||
|
|
||||||
## Key Imports by File |
|
||||||
|
|
||||||
### explorer.py |
|
||||||
- `import streamlit as st` |
|
||||||
- `from database import db` |
|
||||||
- `from explorer_helpers import *` |
|
||||||
|
|
||||||
### explorer_helpers.py |
|
||||||
- `import pandas as pd` |
|
||||||
- `import plotly.graph_objects as go` |
|
||||||
- `from database import db` (optional, for type hints) |
|
||||||
|
|
||||||
### database.py |
|
||||||
- `import ibis` |
|
||||||
- `import duckdb` |
|
||||||
- `from config import config, PARTY_COLOURS` |
|
||||||
|
|
||||||
### config.py |
|
||||||
- `from dataclasses import dataclass, field` |
|
||||||
- `import streamlit as st` (optional, for warnings) |
|
||||||
|
|
||||||
## Singleton Instances |
|
||||||
|
|
||||||
| Module | Instance | Type | |
|
||||||
|--------|----------|------| |
|
||||||
| `database.py` | `db` | `MotionDatabase` | |
|
||||||
| `config.py` | `config` | `Config` (dataclass) | |
|
||||||
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | |
|
||||||
@ -1,146 +0,0 @@ |
|||||||
--- |
|
||||||
title: Domain Glossary |
|
||||||
category: domain |
|
||||||
--- |
|
||||||
|
|
||||||
# Domain Glossary - Dutch Political Terms |
|
||||||
|
|
||||||
## CRITICAL INVARIANTS |
|
||||||
|
|
||||||
> **Rule 1**: The centroid of the right-wing parties must lie on the RIGHT side of ALL axes |
|
||||||
> - PVV, FVD, JA21, SGP centroid must appear on the RIGHT |
|
||||||
> - Individual right-wing parties may vary slightly from the centroid |
|
||||||
> - This is non-negotiable for any compass/axis visualization |
|
||||||
|
|
||||||
> **Rule 2**: SVD labels are empirically derived from voting data |
|
||||||
> - Labels represent WHAT THE DATA SHOWS, not party self-identification or public opinion |
|
||||||
> - Labels are derived from outliers and 20 representative motions (10 positive, 10 negative) |
|
||||||
> - See SVD Label Derivation section below |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## SVD Label Derivation |
|
||||||
|
|
||||||
### The Process |
|
||||||
|
|
||||||
SVD (Singular Value Decomposition) finds axes that maximize variance in the MP × Motion voting matrix. To label each axis: |
|
||||||
|
|
||||||
1. **Identify outliers**: Find the two MPs with most extreme positions on that axis |
|
||||||
2. **Select representative motions**: Pick 20 motions that characterize the axis: 10 on which the two outliers voted opposite ways, and 10 on which they voted the same way but against MPs at other extremes |
|
||||||
3. **Interpret theme**: Read the motion titles to derive what the axis represents |
|
||||||
4. **Assign label**: Label describes the empirical theme, could be: |
|
||||||
- Left-Right |
|
||||||
- Coalition-Opposition |
|
||||||
- Progressive-Conservative |
|
||||||
- EU-National sovereignty |
|
||||||
- Populist-Establishment |
|
||||||
- Or whatever the voting patterns show |
|
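Steps 1 and 2 of the process might be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function names and the input shapes (a dict of axis positions per MP, and a nested dict of votes encoded as +1, 0, -1) are hypothetical. Because votes are discrete, "disagreed most sharply" is simplified here to "cast opposite non-zero votes".

```python
from typing import Dict, List, Tuple

# Hypothetical shape: MP name -> motion id -> vote in {+1, 0, -1}
Votes = Dict[str, Dict[str, int]]


def find_outliers(positions: Dict[str, float]) -> Tuple[str, str]:
    """Return (most negative MP, most positive MP) on a single axis."""
    low = min(positions, key=lambda mp: positions[mp])
    high = max(positions, key=lambda mp: positions[mp])
    return low, high


def opposed_motions(votes: Votes, mp_a: str, mp_b: str, limit: int = 10) -> List[str]:
    """Return up to `limit` motions on which the two MPs voted opposite ways."""
    shared = votes[mp_a].keys() & votes[mp_b].keys()
    # A product of -1 means one MP voted +1 and the other -1.
    opposed = sorted(m for m in shared if votes[mp_a][m] * votes[mp_b][m] == -1)
    return opposed[:limit]


positions = {"A": 0.9, "B": -0.8, "C": 0.1}
votes = {
    "A": {"m1": 1, "m2": 1, "m3": -1, "m4": 0},
    "B": {"m1": -1, "m2": 1, "m3": 1, "m4": -1},
}
low, high = find_outliers(positions)
print(low, high, opposed_motions(votes, low, high))
```

Reading the titles of the returned motions (step 3) remains a manual, interpretive step.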
||||||
|
|
||||||
### Example |
|
||||||
|
|
||||||
| Step | Description | |
|
||||||
|------|-------------| |
|
||||||
| Outlier A | Wilders (PVV) - extreme positive on Dim 1 | |
|
||||||
| Outlier B | Marijnissen (SP) - extreme negative on Dim 1 | |
|
||||||
| 20 Motions | Immigration, integration, law & order themes dominate | |
|
||||||
| Label | "Links-Rechts" (Left-Right) | |
|
||||||
|
|
||||||
### Labeling Rules |
|
||||||
|
|
||||||
- **Never use party names in labels** (e.g., not "PVV-SP axis") |
|
||||||
- **Do not impose ideological labels a priori** (e.g., use "progressive-conservative" only if that is what the motions show) |
|
||||||
- **Use motion-derived themes** (e.g., "Immigration", "EU", "Economy") |
|
||||||
- **Fallback**: If theme is unclear, use "Axis 1", "Axis 2" |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## Core Entities |
|
||||||
|
|
||||||
### Motion / Motie |
|
||||||
- Parliamentary motion submitted by MPs |
|
||||||
- Fields: `id`, `title`, `date`, `category` |
|
||||||
- MPs vote: **For** (+1), **Against** (-1), **Abstain** (0), **Absent** |
|
||||||
|
|
||||||
### MP / Kamerlid |
|
||||||
- Member of Parliament (Tweede Kamerlid) |
|
||||||
- Identified by full name (e.g., "Van Dijk, I.") |
|
||||||
- Has voting record, party affiliation, SVD position vector |
|
||||||
|
|
||||||
### Party / Fractie |
|
||||||
- Political party (e.g., "GroenLinks-PvdA", "PVV", "VVD") |
|
||||||
- Party centroids: average SVD position of all MPs in party |
|
||||||
|
|
||||||
### Vote / Stemming |
|
||||||
- Individual MP's vote on a motion: +1, 0, -1 |
|
||||||
- Aggregated to compute SVD vectors |
|
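As a rough sketch of how individual votes could be aggregated into the MP × Motion matrix, the snippet below encodes the Dutch vote values numerically. The encoding of "Voor", "Tegen" and "Onthouden" follows the +1 / -1 / 0 convention above; treating "Geen stem" and "Afwezig" as missing, and filling missing cells with 0.0, are assumptions made for this example. `build_vote_matrix` and the record shape are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

VOTE_ENCODING: Dict[str, Optional[int]] = {
    "Voor": 1,
    "Tegen": -1,
    "Onthouden": 0,
    "Geen stem": None,  # assumption: treated as missing
    "Afwezig": None,    # assumption: treated as missing
}

# Hypothetical record shape: (mp_name, motion_id, vote_label)
Record = Tuple[str, int, str]


def build_vote_matrix(records: List[Record]) -> Tuple[List[str], List[int], List[List[float]]]:
    """Return (mp_names, motion_ids, matrix) with one row per MP."""
    mps = sorted({mp for mp, _, _ in records})
    motions = sorted({motion for _, motion, _ in records})
    row = {mp: i for i, mp in enumerate(mps)}
    col = {motion: j for j, motion in enumerate(motions)}
    # Missing votes stay at 0.0 (an assumption of this sketch).
    matrix = [[0.0] * len(motions) for _ in mps]
    for mp, motion, vote in records:
        value = VOTE_ENCODING.get(vote)
        if value is not None:
            matrix[row[mp]][col[motion]] = float(value)
    return mps, motions, matrix


records = [
    ("Jansen", 1, "Voor"),
    ("Jansen", 2, "Tegen"),
    ("De Vries", 1, "Afwezig"),
    ("De Vries", 2, "Voor"),
]
mps, motions, matrix = build_vote_matrix(records)
```

The resulting nested list can be handed to an SVD routine after conversion to an array.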
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## Time & Analysis Concepts |
|
||||||
|
|
||||||
### Window / Tijdsvenster |
|
||||||
- Time period for analysis (annual or quarterly) |
|
||||||
- Values: "2023", "2023-Q1", "2024", etc. |
|
||||||
- SVD vectors computed per window |
|
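Window identifiers such as "2023" and "2023-Q1" do not sort chronologically in every case as plain strings once annual and quarterly values are mixed with other formats, so an explicit sort key can help. The sketch below is an assumption-laden example: `window_sort_key` is a hypothetical helper, and placing an annual window before its own quarters is a choice made here, not something the glossary specifies.

```python
import re
from typing import Tuple

# Matches "2023" or "2023-Q1" through "2023-Q4".
_WINDOW_RE = re.compile(r"^(\d{4})(?:-Q([1-4]))?$")


def window_sort_key(window_id: str) -> Tuple[int, int]:
    """Return (year, quarter); annual windows use quarter 0."""
    match = _WINDOW_RE.match(window_id)
    if match is None:
        raise ValueError(f"Unrecognised window id: {window_id!r}")
    year = int(match.group(1))
    quarter = int(match.group(2)) if match.group(2) else 0
    return year, quarter


ordered = sorted(["2024", "2023-Q2", "2023", "2023-Q1"], key=window_sort_key)
print(ordered)
```

Rejecting malformed identifiers with `ValueError` keeps bad window values from silently ending up at an arbitrary position.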
||||||
|
|
||||||
### Trajectory |
|
||||||
- MP's position change across multiple windows |
|
||||||
- Computed from `svd_vectors` + window ordering |
|
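One possible reading of "position change across multiple windows" is the displacement between consecutive windows. The sketch below assumes that reading; the actual contents of `trajectory_vector` are not described here, and `trajectory` with its argument shapes is hypothetical. The window list is expected to be in chronological order already.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float]


def trajectory(positions_by_window: Dict[str, Vec], ordered_windows: List[str]) -> List[Vec]:
    """Return the displacement between consecutive windows that have a position."""
    # Skip windows in which the MP has no SVD vector (e.g. not yet in parliament).
    present = [w for w in ordered_windows if w in positions_by_window]
    return [
        (
            positions_by_window[b][0] - positions_by_window[a][0],
            positions_by_window[b][1] - positions_by_window[a][1],
        )
        for a, b in zip(present, present[1:])
    ]


steps = trajectory({"2023": (0.0, 0.0), "2024": (0.5, -0.25)}, ["2022", "2023", "2024"])
```

Displacements are only meaningful if the per-window vectors share a coordinate frame, which is what the alignment step is for.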
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## Mathematical / Algorithmic Terms |
|
||||||
|
|
||||||
### SVD Vector |
|
||||||
- 2D vector from Singular Value Decomposition of MP × Motion vote matrix |
|
||||||
- Represents MP's position in political space |
|
||||||
|
|
||||||
### SVD Label |
|
||||||
- Empirically derived axis label based on outlier MPs and representative motions |
|
||||||
- Describes the theme of disagreement on that axis |
|
||||||
- NOT based on party ideology or semantic labels |
|
||||||
|
|
||||||
### Political Compass |
|
||||||
- 2D visualization with SVD axes mapped to compass quadrants |
|
||||||
- X-axis: First SVD dimension (labeled from voting data) |
|
||||||
- Y-axis: Second SVD dimension (labeled from voting data) |
|
||||||
|
|
||||||
### Procrustes Alignment |
|
||||||
- Algorithm to align SVD vectors across time windows |
|
||||||
- Ensures comparable positions across years/quarters |
|
||||||
|
|
||||||
### UMAP |
|
||||||
- Uniform Manifold Approximation and Projection |
|
||||||
- Dimensionality reduction for visualization |
|
||||||
- Optional dependency with graceful SVD fallback |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## Database Table Reference |
|
||||||
|
|
||||||
| Table | Key Fields | |
|
||||||
|-------|-----------| |
|
||||||
| `motions` | id, title, date, category | |
|
||||||
| `mp_votes` | mp_id, motion_id, vote | |
|
||||||
| `svd_vectors` | entity_id, window, vector_2d (list[2]) | |
|
||||||
| `mp_party_history` | mp_id, party, start_date, end_date | |
|
||||||
| `windows` | window_id, start_date, end_date, period_type | |
|
||||||
| `mp_trajectories` | mp_id, window, trajectory_vector | |
|
||||||
|
|
||||||
--- |
|
||||||
|
|
||||||
## Dutch Political Parties |
|
||||||
|
|
||||||
### Canonical Right-Wing (centroid on RIGHT of axes) |
|
||||||
- PVV (Partij voor de Vrijheid) |
|
||||||
- FVD (Forum voor Democratie) |
|
||||||
- JA21 |
|
||||||
- SGP (Staatkundig Gereformeerde Partij) |
|
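Singular vectors are only determined up to sign, so the centroid invariant has to be enforced after the decomposition. The sketch below shows one way this might be done, assuming that "RIGHT side" means a non-negative coordinate on each axis; `orient_axes` and the input shape (party name mapped to a 2D position) are hypothetical.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float]

# Canonical right-wing parties, as listed in this glossary.
RIGHT_WING = ("PVV", "FVD", "JA21", "SGP")


def centroid(points: List[Vec]) -> Vec:
    """Return the mean position of a list of 2D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def orient_axes(party_positions: Dict[str, Vec]) -> Dict[str, Vec]:
    """Flip every axis on which the right-wing centroid is negative."""
    anchors = [party_positions[p] for p in RIGHT_WING if p in party_positions]
    if not anchors:
        raise ValueError("No right-wing anchor parties present")
    cx, cy = centroid(anchors)
    sx = -1.0 if cx < 0 else 1.0
    sy = -1.0 if cy < 0 else 1.0
    return {party: (x * sx, y * sy) for party, (x, y) in party_positions.items()}


positions = {"PVV": (-0.8, 0.2), "SGP": (-0.4, -0.6), "SP": (0.7, 0.1)}
oriented = orient_axes(positions)
```

Only the centroid is constrained, so an individual right-wing party may still end up with a negative coordinate, as Rule 1 allows.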
||||||
|
|
||||||
### Other Major Parties |
|
||||||
- VVD (Volkspartij voor Vrijheid en Democratie) |
|
||||||
- GL-PvdA (GroenLinks-PvdA) |
|
||||||
- NSC (Nieuw Sociaal Contract) |
|
||||||
- BBB (BoerBurgerBeweging) |
|
||||||
- SP (Socialistische Partij) |
|
||||||
- D66 (Democraten 66) |
|
||||||
@ -1,196 +0,0 @@ |
|||||||
"""Example: TweedeKamerAPI usage - from api_client.py and actual codebase.""" |
|
||||||
|
|
||||||
from datetime import datetime, timedelta |
|
||||||
from typing import Dict, List |
|
||||||
|
|
||||||
# Import the API client |
|
||||||
from api_client import TweedeKamerAPI |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 1: Basic API usage |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_fetch_motions(): |
|
||||||
"""Fetch recent parliamentary motions from TweedeKamer API.""" |
|
||||||
|
|
||||||
api = TweedeKamerAPI() |
|
||||||
|
|
||||||
# Fetch motions from last 30 days |
|
||||||
start_date = datetime.now() - timedelta(days=30) |
|
||||||
|
|
||||||
try: |
|
||||||
motions = api.get_motions(start_date=start_date, limit=100) |
|
||||||
|
|
||||||
print(f"Fetched {len(motions)} motions") |
|
||||||
|
|
||||||
for motion in motions[:5]: # Show first 5 |
|
||||||
print(f" - {motion.get('title', 'N/A')}") |
|
||||||
|
|
||||||
return motions |
|
||||||
finally: |
|
||||||
api.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 2: Fetching with date range |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_date_range(): |
|
||||||
"""Fetch motions from a specific date range.""" |
|
||||||
|
|
||||||
api = TweedeKamerAPI() |
|
||||||
|
|
||||||
start = datetime(2024, 1, 1) |
|
||||||
end = datetime(2024, 3, 31) # Q1 2024 |
|
||||||
|
|
||||||
try: |
|
||||||
motions = api.get_motions(start_date=start, end_date=end, limit=500) |
|
||||||
|
|
||||||
# Group by policy area |
|
||||||
by_area = {} |
|
||||||
for m in motions: |
|
||||||
area = m.get("policy_area", "Onbekend") |
|
||||||
by_area.setdefault(area, []).append(m) |
|
||||||
|
|
||||||
for area, area_motions in sorted(by_area.items()): |
|
||||||
print(f"{area}: {len(area_motions)} motions") |
|
||||||
|
|
||||||
return motions |
|
||||||
finally: |
|
||||||
api.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 3: Context manager usage |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_context_manager(): |
|
||||||
"""Use API client as context manager.""" |
|
||||||
|
|
||||||
with TweedeKamerAPI() as api: |
|
||||||
motions = api.get_motions( |
|
||||||
start_date=datetime.now() - timedelta(days=7), limit=50 |
|
||||||
) |
|
||||||
|
|
||||||
print(f"Fetched {len(motions)} motions this week") |
|
||||||
|
|
||||||
return motions |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 4: Processing voting records |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_process_votes(): |
|
||||||
"""Process individual voting records from API.""" |
|
||||||
|
|
||||||
api = TweedeKamerAPI() |
|
||||||
|
|
||||||
start_date = datetime.now() - timedelta(days=7) |
|
||||||
|
|
||||||
try: |
|
||||||
# Get voting records directly |
|
||||||
voting_records, besluit_meta = api._get_voting_records( |
|
||||||
start_date=start_date, limit=1000 |
|
||||||
) |
|
||||||
|
|
||||||
print(f"Fetched {len(voting_records)} voting records") |
|
||||||
print(f"From {len(besluit_meta)} unique decisions") |
|
||||||
|
|
||||||
# Count votes by party |
|
||||||
party_votes = {} |
|
||||||
for record in voting_records: |
|
||||||
party = record.get("Fractie", "Onbekend") |
|
||||||
vote = record.get("Soort", "Onbekend") |
|
||||||
party_votes.setdefault(party, {})[vote] = ( |
|
||||||
party_votes.get(party, {}).get(vote, 0) + 1 |
|
||||||
) |
|
||||||
|
|
||||||
for party, votes in sorted(party_votes.items()): |
|
||||||
total = sum(votes.values()) |
|
||||||
voor = votes.get("Voor", 0) |
|
||||||
print(f"{party}: {total} votes ({voor} voor)") |
|
||||||
|
|
||||||
return voting_records |
|
||||||
finally: |
|
||||||
api.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 5: Safe API call with fallback |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_safe_call(): |
|
||||||
"""Make API call with safe fallback on failure.""" |
|
||||||
|
|
||||||
api = TweedeKamerAPI() |
|
||||||
|
|
||||||
try: |
|
||||||
# This will return [] on any error |
|
||||||
motions = api.get_motions( |
|
||||||
start_date=datetime.now() - timedelta(days=30), limit=100 |
|
||||||
) |
|
||||||
|
|
||||||
if not motions: |
|
||||||
print("No motions returned - using cached data") |
|
||||||
# Fallback to cached/local data |
|
||||||
from database import db |
|
||||||
|
|
||||||
return db.get_filtered_motions(limit=10) |
|
||||||
|
|
||||||
return motions |
|
||||||
finally: |
|
||||||
api.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 6: Pagination handling |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_pagination(): |
|
||||||
"""Understand how pagination works in the API.""" |
|
||||||
|
|
||||||
api = TweedeKamerAPI() |
|
||||||
|
|
||||||
start_date = datetime.now() - timedelta(days=365) |
|
||||||
|
|
||||||
# Simulate pagination |
|
||||||
page_size = 250 |
|
||||||
total_limit = 500 |
|
||||||
|
|
||||||
all_motions = [] |
|
||||||
skip = 0 |
|
||||||
|
|
||||||
while len(all_motions) < total_limit: |
|
||||||
print(f"Fetching page with skip={skip}...") |
|
||||||
|
|
||||||
# In real usage, get_motions handles pagination internally |
|
||||||
# This demonstrates what's happening under the hood |
|
||||||
page_motions = api._fetch_page(start_date=start_date, skip=skip, top=page_size) |
|
||||||
|
|
||||||
if not page_motions: |
|
||||||
break |
|
||||||
|
|
||||||
all_motions.extend(page_motions) |
|
||||||
skip += page_size |
|
||||||
|
|
||||||
if len(page_motions) < page_size: |
|
||||||
break # Last page |
|
||||||
|
|
||||||
print(f"Total fetched: {len(all_motions)} motions") |
|
||||||
return all_motions |
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__": |
|
||||||
print("=== Basic Fetch ===") |
|
||||||
example_fetch_motions() |
|
||||||
|
|
||||||
print("\n=== Process Votes ===") |
|
||||||
example_process_votes() |
|
||||||
@ -1,191 +0,0 @@ |
|||||||
"""Example: MotionDatabase usage - from database.py and actual codebase.""" |
|
||||||
|
|
||||||
from typing import Dict, List, Optional |
|
||||||
import duckdb |
|
||||||
import json |
|
||||||
from config import config |
|
||||||
|
|
||||||
# Import the singleton instance |
|
||||||
from database import db |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 1: Getting filtered motions |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_get_filtered_motions(): |
|
||||||
"""Get controversial motions from a specific policy area.""" |
|
||||||
|
|
||||||
motions = db.get_filtered_motions( |
|
||||||
policy_area="Klimaat", |
|
||||||
min_margin=0.0, |
|
||||||
max_margin=0.3, # Controversial: close margin |
|
||||||
limit=10, |
|
||||||
) |
|
||||||
|
|
||||||
for motion in motions: |
|
||||||
print(f"{motion['title']}: {motion['winning_margin']:.1%} margin") |
|
||||||
|
|
||||||
return motions |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 2: Creating a voting session |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_voting_session(): |
|
||||||
"""Create a new user session and record votes.""" |
|
||||||
|
|
||||||
# Create session for 10 motions |
|
||||||
session_id = db.create_session(total_motions=10) |
|
||||||
print(f"Created session: {session_id}") |
|
||||||
|
|
||||||
# Get motions for the session |
|
||||||
motions = db.get_filtered_motions(policy_area="Alle", limit=10) |
|
||||||
|
|
||||||
# Record votes |
|
||||||
for motion in motions: |
|
||||||
# In real app, user would choose vote |
|
||||||
vote = "Voor" # Example vote |
|
||||||
db.record_vote(session_id=session_id, motion_id=motion["id"], vote=vote) |
|
||||||
|
|
||||||
# Get results |
|
||||||
results = db.get_party_results(session_id) |
|
||||||
|
|
||||||
for party, result in sorted(results.items(), key=lambda x: -x[1]["agreement"]): |
|
||||||
print(f"{party}: {result['agreement']:.1%} agreement") |
|
||||||
|
|
||||||
return results |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 3: Working with DuckDB connections directly |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_direct_duckdb(): |
|
||||||
"""Example of proper DuckDB connection handling.""" |
|
||||||
|
|
||||||
conn = duckdb.connect(config.DATABASE_PATH) |
|
||||||
try: |
|
||||||
# Get motion with votes |
|
||||||
result = conn.execute( |
|
||||||
""" |
|
||||||
SELECT m.*, |
|
||||||
JSON_EXTRACT(voting_results, '$.total_votes') as total_votes |
|
||||||
FROM motions m |
|
||||||
WHERE m.id = ? |
|
||||||
""", |
|
||||||
(123,), |
|
||||||
).fetchone() |
|
||||||
|
|
||||||
if result: |
|
||||||
print(f"Motion: {result[1]}") # title is index 1 |
|
||||||
|
|
||||||
return result |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 4: Bulk operations |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_bulk_insert(): |
|
||||||
"""Example of bulk inserting motions.""" |
|
||||||
|
|
||||||
# Sample data |
|
||||||
motions = [ |
|
||||||
{ |
|
||||||
"title": "Motion about climate policy", |
|
||||||
"description": "Proposal to reduce emissions", |
|
||||||
"date": "2024-01-15", |
|
||||||
"policy_area": "Klimaat", |
|
||||||
"voting_results": json.dumps({"Voor": 75, "Tegen": 65}), |
|
||||||
"winning_margin": 0.07, |
|
||||||
"controversy_score": 0.85, |
|
||||||
}, |
|
||||||
{ |
|
||||||
"title": "Motion about healthcare", |
|
||||||
"description": "Increase healthcare budget", |
|
||||||
"date": "2024-01-20", |
|
||||||
"policy_area": "Zorg", |
|
||||||
"voting_results": json.dumps({"Voor": 90, "Tegen": 50}), |
|
||||||
"winning_margin": 0.29, |
|
||||||
"controversy_score": 0.42, |
|
||||||
}, |
|
||||||
] |
|
||||||
|
|
||||||
conn = duckdb.connect(config.DATABASE_PATH) |
|
||||||
try: |
|
||||||
for motion in motions: |
|
||||||
conn.execute( |
|
||||||
""" |
|
||||||
INSERT INTO motions |
|
||||||
(title, description, date, policy_area, voting_results, |
|
||||||
winning_margin, controversy_score) |
|
||||||
VALUES (?, ?, ?, ?, ?, ?, ?) |
|
||||||
""", |
|
||||||
( |
|
||||||
motion["title"], |
|
||||||
motion["description"], |
|
||||||
motion["date"], |
|
||||||
motion["policy_area"], |
|
||||||
motion["voting_results"], |
|
||||||
motion["winning_margin"], |
|
||||||
motion["controversy_score"], |
|
||||||
), |
|
||||||
) |
|
||||||
print(f"Inserted {len(motions)} motions") |
|
||||||
except Exception as e: |
|
||||||
print(f"Error inserting motions: {e}") |
|
||||||
finally: |
|
||||||
# Close exactly once, on success and on failure |
|
||||||
conn.close() |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 5: Query with aggregation |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_aggregation(): |
|
||||||
"""Example of aggregate queries.""" |
|
||||||
|
|
||||||
conn = duckdb.connect(config.DATABASE_PATH) |
|
||||||
try: |
|
||||||
# Get statistics by policy area |
|
||||||
results = conn.execute(""" |
|
||||||
SELECT |
|
||||||
policy_area, |
|
||||||
COUNT(*) as motion_count, |
|
||||||
AVG(winning_margin) as avg_margin, |
|
||||||
AVG(controversy_score) as avg_controversy |
|
||||||
FROM motions |
|
||||||
WHERE policy_area IS NOT NULL |
|
||||||
GROUP BY policy_area |
|
||||||
ORDER BY motion_count DESC |
|
||||||
""").fetchall() |
|
||||||
|
|
||||||
for row in results: |
|
||||||
print( |
|
||||||
f"{row[0]}: {row[1]} motions, " |
|
||||||
f"avg margin {row[2]:.1%}, " |
|
||||||
f"controversy {row[3]:.2f}" |
|
||||||
) |
|
||||||
|
|
||||||
return results |
|
||||||
except Exception: |
|
||||||
return [] |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__": |
|
||||||
print("=== Filtered Motions ===") |
|
||||||
example_get_filtered_motions() |
|
||||||
|
|
||||||
print("\n=== Aggregation ===") |
|
||||||
example_aggregation() |
|
||||||
@ -1,116 +0,0 @@ |
|||||||
# Extracted pattern examples (representative snippets) |
|
||||||
|
|
||||||
Note: snippets are verbatim extracts from repository files (Phase 1). Paths shown. |
|
||||||
|
|
||||||
## DuckDB connect + schema init (database.py) |
|
||||||
```python |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
|
|
||||||
# Create sequence for auto-incrementing IDs |
|
||||||
try: |
|
||||||
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1") |
|
||||||
except: |
|
||||||
pass |
|
||||||
|
|
||||||
# Create tables with proper ID handling |
|
||||||
conn.execute(""" |
|
||||||
CREATE TABLE IF NOT EXISTS motions ( |
|
||||||
id INTEGER DEFAULT nextval('motions_id_seq'), |
|
||||||
title TEXT NOT NULL, |
|
||||||
description TEXT, |
|
||||||
date DATE, |
|
||||||
policy_area TEXT, |
|
||||||
voting_results JSON, |
|
||||||
winning_margin FLOAT, |
|
||||||
controversy_score FLOAT, |
|
||||||
layman_explanation TEXT, |
|
||||||
externe_identifier TEXT, |
|
||||||
body_text TEXT, |
|
||||||
url TEXT UNIQUE, |
|
||||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
|
||||||
PRIMARY KEY (id) |
|
||||||
) |
|
||||||
""") |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
## Read-only compute worker (svd_pipeline.py) |
|
||||||
```python |
|
||||||
conn = duckdb.connect(db_path, read_only=True) |
|
||||||
try: |
|
||||||
rows = conn.execute( |
|
||||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
|
||||||
(start_date, end_date), |
|
||||||
).fetchall() |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
## Requests with retry/backoff (ai_provider.py) |
|
||||||
```python |
|
||||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
|
||||||
... |
|
||||||
if getattr(resp, "status_code", 0) == 429: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError(f"Provider returned HTTP {resp.status_code}") |
|
||||||
retry_after = None |
|
||||||
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None |
|
||||||
if raw: |
|
||||||
try: |
|
||||||
retry_after = int(raw) |
|
||||||
except Exception: |
|
||||||
try: |
|
||||||
dt = parsedate_to_datetime(raw) |
|
||||||
now = datetime.now(tz=dt.tzinfo or timezone.utc) |
|
||||||
secs = (dt - now).total_seconds() |
|
||||||
retry_after = max(0, int(secs)) |
|
||||||
except Exception: |
|
||||||
retry_after = None |
|
||||||
|
|
||||||
if retry_after is not None: |
|
||||||
time.sleep(retry_after) |
|
||||||
continue |
|
||||||
``` |
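
The `Retry-After` handling above can be isolated into a small helper. The following is a minimal sketch, not the code in `ai_provider.py`: the function name `parse_retry_after` and the injectable `now` argument are assumptions made so the logic can be exercised without a live HTTP response.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(raw: Optional[str], now: Optional[datetime] = None) -> Optional[int]:
    """Return the delay in whole seconds, or None if the header is missing or unparseable.

    Hypothetical helper: the name and the `now` parameter are illustrative only.
    """
    if not raw:
        return None
    try:
        # Form 1: delay in seconds, e.g. "120"
        return max(0, int(raw))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        dt = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        # Treat a naive date as UTC so the subtraction below is well defined
        dt = dt.replace(tzinfo=timezone.utc)
    now = now or datetime.now(tz=timezone.utc)
    # A date in the past yields 0 rather than a negative sleep
    return max(0, int((dt - now).total_seconds()))
```

Returning `None` for an unusable header lets the caller fall back to its own backoff schedule instead of sleeping for a guessed duration.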
|
||||||
|
|
||||||
## Embedding batch + per-item fallback (pipeline/ai_provider_wrapper.py) |
|
||||||
```python |
|
||||||
for i in range(0, len(texts), batch_size): |
|
||||||
end = min(len(texts), i + batch_size) |
|
||||||
chunk = texts[i:end] |
|
||||||
emb_chunk, emb_exc = _attempt_batch(chunk, i) |
|
||||||
if emb_chunk is not None: |
|
||||||
for j, emb in enumerate(emb_chunk): |
|
||||||
results[i + j] = emb |
|
||||||
i = end |
|
||||||
continue |
|
||||||
|
|
||||||
# batch failed -> fallback to per-item attempts |
|
||||||
for j in range(i, end): |
|
||||||
t = texts[j] |
|
||||||
single, single_exc = _attempt_batch([t], j) |
|
||||||
if single: |
|
||||||
results[j] = single[0] |
|
||||||
continue |
|
||||||
results[j] = None |
|
||||||
``` |
|
||||||
|
|
||||||
## Similarity compute (similarity/compute.py) |
|
||||||
```python |
|
||||||
# Ensure consistent dimensionality: pad shorter vectors with zeros |
|
||||||
lengths = [len(v) for v in vecs] |
|
||||||
max_dim = max(lengths) |
|
||||||
if len(set(lengths)) != 1: |
|
||||||
logger.warning( |
|
||||||
"Inconsistent vector dimensions detected (max=%d). Padding shorter vectors with zeros.", |
|
||||||
max_dim, |
|
||||||
) |
|
||||||
|
|
||||||
matrix = np.zeros((len(vecs), max_dim), dtype=np.float32) |
|
||||||
for i, v in enumerate(vecs): |
|
||||||
matrix[i, : len(v)] = v |
|
||||||
|
|
||||||
# Normalize rows and compute cosine similarity |
|
||||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
|
||||||
norms[norms == 0] = 1.0 |
|
||||||
normalized = matrix / norms |
|
||||||
sim = normalized @ normalized.T |
|
||||||
``` |
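
To check the arithmetic of the pad, normalize, and multiply steps by hand, the same computation can be written on plain lists. This is a sketch using only the standard library; the function name `cosine_matrix` is an assumption, and the NumPy version above is the one the repository uses.

```python
import math
from typing import List


def cosine_matrix(vecs: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarity after zero-padding to a common dimension.

    Hypothetical pure-Python counterpart of the NumPy snippet, for illustration.
    """
    max_dim = max(len(v) for v in vecs)
    # Pad shorter vectors with trailing zeros
    padded = [list(v) + [0.0] * (max_dim - len(v)) for v in vecs]
    normalized = []
    for row in padded:
        # A zero vector keeps norm 1.0 so the division is safe (it stays all zeros)
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        normalized.append([x / norm for x in row])
    # Dot product of every pair of unit-length rows
    return [
        [sum(a * b for a, b in zip(r1, r2)) for r2 in normalized]
        for r1 in normalized
    ]


sim = cosine_matrix([[3.0, 4.0], [1.0], [0.0, 2.0]])
# sim[0][1] compares [3, 4] with [1, 0] (padded): 3 / 5 = 0.6
```

Zero-padding keeps the computation running, but a padded vector is only comparable to the others if the missing trailing components really are zero, which is why the original logs a warning.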
|
||||||
@ -1,217 +0,0 @@ |
|||||||
"""Example: Pipeline phase execution - from pipeline/run_pipeline.py and actual codebase.""" |
|
||||||
|
|
||||||
import argparse |
|
||||||
from datetime import date, timedelta |
|
||||||
from typing import List, Tuple |
|
||||||
|
|
||||||
# Import pipeline modules |
|
||||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
|
||||||
from pipeline.extract_mp_votes import extract_mp_votes |
|
||||||
from pipeline.svd_pipeline import run_svd_pipeline |
|
||||||
from pipeline.text_pipeline import run_text_pipeline |
|
||||||
from pipeline.fusion import run_fusion |
|
||||||
|
|
||||||
from database import MotionDatabase |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 1: Running full pipeline |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_full_pipeline(): |
|
||||||
"""Run the complete data ingestion pipeline.""" |
|
||||||
|
|
||||||
# Parse arguments like CLI would |
|
||||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
|
||||||
parser.add_argument("--db-path", default="data/motions.db") |
|
||||||
parser.add_argument("--start-date", default=None) |
|
||||||
parser.add_argument("--end-date", default=None) |
|
||||||
parser.add_argument( |
|
||||||
"--window-size", choices=["quarterly", "annual"], default="quarterly" |
|
||||||
) |
|
||||||
parser.add_argument("--svd-k", type=int, default=50) |
|
||||||
|
|
||||||
args = parser.parse_args([]) |
|
||||||
|
|
||||||
# Resolve dates |
|
||||||
end_date = date.fromisoformat(args.end_date) if args.end_date else date.today() |
|
||||||
start_date = ( |
|
||||||
date.fromisoformat(args.start_date) |
|
||||||
if args.start_date |
|
||||||
else end_date - timedelta(days=730) |
|
||||||
) |
|
||||||
|
|
||||||
print(f"Running pipeline: {start_date} → {end_date}") |
|
||||||
print(f"Window size: {args.window_size}") |
|
||||||
print(f"DB path: {args.db_path}") |
|
||||||
|
|
||||||
# Initialize database |
|
||||||
db = MotionDatabase(args.db_path) |
|
||||||
|
|
||||||
# Phase 1: Fetch MP metadata |
|
||||||
print("\n=== Phase 1: MP Metadata ===") |
|
||||||
n_mp = fetch_mp_metadata(db_path=args.db_path) |
|
||||||
print(f"Processed {n_mp} MPs") |
|
||||||
|
|
||||||
# Phase 2: Extract MP votes |
|
||||||
print("\n=== Phase 2: Extract Votes ===") |
|
||||||
n_votes = extract_mp_votes(db_path=args.db_path) |
|
||||||
print(f"Extracted {n_votes} vote records") |
|
||||||
|
|
||||||
# Phase 3: Generate time windows |
|
||||||
print("\n=== Phase 3: SVD Pipeline ===") |
|
||||||
windows = generate_windows(start_date, end_date, args.window_size) |
|
||||||
print(f"Generated {len(windows)} windows: {windows}") |
|
||||||
|
|
||||||
# Phase 3 (continued): SVD per window |
|
||||||
run_svd_pipeline(db, windows, args.svd_k) |
|
||||||
print(f"Computed SVD for {len(windows)} windows") |
|
||||||
|
|
||||||
# Phase 4: Text embeddings |
|
||||||
print("\n=== Phase 4: Text Embeddings ===") |
|
||||||
run_text_pipeline(args.db_path, batch_size=50) |
|
||||||
print("Text embeddings completed") |
|
||||||
|
|
||||||
# Phase 5: Fusion |
|
||||||
print("\n=== Phase 5: Fusion ===") |
|
||||||
run_fusion(args.db_path, windows) |
|
||||||
print("Fusion completed") |
|
||||||
|
|
||||||
print("\n=== Pipeline Complete ===") |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 2: Generate time windows |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def generate_windows( |
|
||||||
start: date, end: date, granularity: str |
|
||||||
) -> List[Tuple[str, str, str]]: |
|
||||||
"""Generate time windows for pipeline processing.""" |
|
||||||
|
|
||||||
windows = [] |
|
||||||
cursor = date(start.year, start.month, 1) |
|
||||||
|
|
||||||
if granularity == "annual": |
|
||||||
cursor = date(start.year, 1, 1) |
|
||||||
while cursor <= end: |
|
||||||
year_end = date(cursor.year, 12, 31) |
|
||||||
w_end = min(year_end, end) |
|
||||||
windows.append((str(cursor.year), cursor.isoformat(), w_end.isoformat())) |
|
||||||
cursor = date(cursor.year + 1, 1, 1) |
|
||||||
else: |
|
||||||
# quarterly |
|
||||||
quarter_starts = {1: 1, 2: 4, 3: 7, 4: 10} |
|
||||||
quarter_ends = {1: 3, 2: 6, 3: 9, 4: 12} |
|
||||||
|
|
||||||
q = (cursor.month - 1) // 3 + 1 |
|
||||||
cursor = date(cursor.year, quarter_starts[q], 1) |
|
||||||
|
|
||||||
while cursor <= end: |
|
||||||
q = (cursor.month - 1) // 3 + 1 |
|
||||||
import calendar |
|
||||||
|
|
||||||
q_end_month = quarter_ends[q] |
|
||||||
last_day = calendar.monthrange(cursor.year, q_end_month)[1] |
|
||||||
q_end = date(cursor.year, q_end_month, last_day) |
|
||||||
w_end = min(q_end, end) |
|
||||||
window_id = f"{cursor.year}-Q{q}" |
|
||||||
windows.append((window_id, cursor.isoformat(), w_end.isoformat())) |
|
||||||
cursor = q_end + timedelta(days=1) |
|
||||||
|
|
||||||
return windows |
|
||||||
|
|
||||||
|
|
||||||
def example_window_generation(): |
|
||||||
"""Example of window generation.""" |
|
||||||
|
|
||||||
start = date(2023, 1, 1) |
|
||||||
end = date(2024, 6, 30) |
|
||||||
|
|
||||||
print("Quarterly windows:") |
|
||||||
quarterly = generate_windows(start, end, "quarterly") |
|
||||||
for wid, s, e in quarterly: |
|
||||||
print(f" {wid}: {s} to {e}") |
|
||||||
|
|
||||||
print("\nAnnual windows:") |
|
||||||
annual = generate_windows(start, end, "annual") |
|
||||||
for wid, s, e in annual: |
|
||||||
print(f" {wid}: {s} to {e}") |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 3: Running individual phases |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_individual_phases(): |
|
||||||
"""Run pipeline phases individually for debugging.""" |
|
||||||
|
|
||||||
db_path = "data/motions.db" |
|
||||||
db = MotionDatabase(db_path) |
|
||||||
|
|
||||||
# Only run MP metadata fetch |
|
||||||
print("Fetching MP metadata...") |
|
||||||
n = fetch_mp_metadata(db_path=db_path) |
|
||||||
print(f" {n} MPs processed") |
|
||||||
|
|
||||||
# Only run vote extraction |
|
||||||
print("Extracting votes...") |
|
||||||
n = extract_mp_votes(db_path=db_path) |
|
||||||
print(f" {n} votes extracted") |
|
||||||
|
|
||||||
# Only run SVD for specific window |
|
||||||
print("Computing SVD...") |
|
||||||
windows = [("2024-Q1", "2024-01-01", "2024-03-31")] |
|
||||||
run_svd_pipeline(db, windows, k=50) |
|
||||||
print(" SVD computed") |
|
||||||
|
|
||||||
# Only run text embeddings |
|
||||||
print("Computing embeddings...") |
|
||||||
run_text_pipeline(db_path, batch_size=25) # Smaller batch for testing |
|
||||||
print(" Embeddings computed") |
|
||||||
|
|
||||||
|
|
||||||
# ============================================================================= |
|
||||||
# Example 4: Dry run |
|
||||||
# ============================================================================= |
|
||||||
|
|
||||||
|
|
||||||
def example_dry_run(): |
|
||||||
"""Show what pipeline would do without making changes.""" |
|
||||||
|
|
||||||
print("DRY RUN - no writes will be made") |
|
||||||
|
|
||||||
start_date = date(2024, 1, 1) |
|
||||||
end_date = date(2024, 6, 30) |
|
||||||
|
|
||||||
# Generate and show windows |
|
||||||
windows = generate_windows(start_date, end_date, "quarterly") |
|
||||||
|
|
||||||
print(f"Would process {len(windows)} windows:") |
|
||||||
for wid, s, e in windows: |
|
||||||
print(f" {wid}: {s} to {e}") |
|
||||||
|
|
||||||
print("\nWould run phases:") |
|
||||||
print(" 1. fetch_mp_metadata") |
|
||||||
print(" 2. extract_mp_votes") |
|
||||||
print(" 3. svd_pipeline") |
|
||||||
print(" 4. text_pipeline") |
|
||||||
print(" 5. fusion") |
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__": |
|
||||||
import logging |
|
||||||
|
|
||||||
logging.basicConfig( |
|
||||||
level=logging.INFO, |
|
||||||
format="%(asctime)s %(levelname)s %(name)s: %(message)s", |
|
||||||
) |
|
||||||
|
|
||||||
print("=== Window Generation ===") |
|
||||||
example_window_generation() |
|
||||||
|
|
||||||
print("\n=== Dry Run ===") |
|
||||||
example_dry_run() |
|
||||||
@ -1,108 +0,0 @@ |
|||||||
# stemwijzer Mind Model - Manifest |
|
||||||
# Generated: 2026-04-12 |
|
||||||
# Phase: 2 - Assembly from Phase 1 Analysis |
|
||||||
|
|
||||||
name: stemwijzer |
|
||||||
version: 2 |
|
||||||
description: Dutch political voting compass (Stemwijzer) - Mind Model constraints |
|
||||||
|
|
||||||
categories: |
|
||||||
# Core documentation |
|
||||||
- path: system.md |
|
||||||
description: System overview and architecture summary |
|
||||||
group: docs |
|
||||||
- path: stack/stack.md |
|
||||||
description: Technology stack with versions and purposes |
|
||||||
group: stack |
|
||||||
- path: domain/domain-glossary.md |
|
||||||
description: Domain entities, terms, relationships, and CRITICAL INVARIANTS |
|
||||||
group: domain |
|
||||||
|
|
||||||
# Design patterns |
|
||||||
- path: patterns/patterns.yaml |
|
||||||
description: Code patterns (Singleton, Repository, Pipeline, etc.) |
|
||||||
group: patterns |
|
||||||
- path: patterns/streamlit.yaml |
|
||||||
description: Streamlit-specific patterns (session state, cache) |
|
||||||
group: patterns |
|
||||||
- path: patterns/api.yaml |
|
||||||
description: API client patterns with retry and pagination |
|
||||||
group: patterns |
|
||||||
- path: patterns/database.yaml |
|
||||||
description: DuckDB patterns and connection management |
|
||||||
group: patterns |
|
||||||
- path: patterns/python.yaml |
|
||||||
description: Python-specific patterns (dataclass, typing) |
|
||||||
group: patterns |
|
||||||
- path: patterns/duckdb-access.md |
|
||||||
description: DuckDB connection patterns and best practices |
|
||||||
group: patterns |
|
||||||
- path: patterns/embeddings-similarity.md |
|
||||||
description: Embeddings and similarity computation patterns |
|
||||||
group: patterns |
|
||||||
- path: patterns/error-handling.md |
|
||||||
description: Error handling and exception patterns |
|
||||||
group: patterns |
|
||||||
- path: patterns/module-singletons.md |
|
||||||
description: Module-level singleton patterns |
|
||||||
group: patterns |
|
||||||
- path: patterns/requests-http.md |
|
||||||
description: HTTP client patterns with retry |
|
||||||
group: patterns |
|
||||||
- path: patterns/validation.md |
|
||||||
description: Input validation patterns |
|
||||||
group: patterns |
|
||||||
|
|
||||||
# Coding constraints |
|
||||||
- path: constraints/error-handling.md |
|
||||||
description: Error handling patterns with safe fallbacks |
|
||||||
group: constraints |
|
||||||
- path: constraints/logging.md |
|
||||||
description: Logging conventions |
|
||||||
group: constraints |
|
||||||
- path: constraints/naming.yaml |
|
||||||
description: File, class, function naming rules |
|
||||||
group: constraints |
|
||||||
- path: constraints/imports.yaml |
|
||||||
description: Import organization and module structure |
|
||||||
group: constraints |
|
||||||
- path: constraints/types.yaml |
|
||||||
description: Type hint conventions |
|
||||||
group: constraints |
|
||||||
- path: constraints/testing.yaml |
|
||||||
description: Testing conventions |
|
||||||
group: constraints |
|
||||||
|
|
||||||
# Anti-patterns |
|
||||||
- path: anti-patterns/anti-patterns.md |
|
||||||
description: Known anti-patterns with evidence and fixes |
|
||||||
group: anti-patterns |
|
||||||
|
|
||||||
# Dependencies |
|
||||||
- path: dependencies/dependencies.md |
|
||||||
description: Library usage and singleton instances |
|
||||||
group: dependencies |
|
||||||
|
|
||||||
# Code examples |
|
||||||
- path: examples/database-example.py |
|
||||||
description: MotionDatabase usage examples |
|
||||||
group: examples |
|
||||||
- path: examples/api-client-example.py |
|
||||||
description: TweedeKamerAPI usage examples |
|
||||||
group: examples |
|
||||||
- path: examples/pipeline-example.py |
|
||||||
description: Pipeline orchestration examples |
|
||||||
group: examples |
|
||||||
- path: examples/streamlit-page-example.py |
|
||||||
description: Streamlit page patterns |
|
||||||
group: examples |
|
||||||
- path: examples/pattern-examples.md |
|
||||||
description: Consolidated pattern examples |
|
||||||
group: examples |
|
||||||
|
|
||||||
# Phase 1 findings summary: |
|
||||||
# - Tech: Python 3.13+, Streamlit, DuckDB, scipy/sklearn/umap, OpenRouter (QWEN) |
|
||||||
# - 10 patterns discovered: Module singletons, Repository, Service layer, Pipeline |
|
||||||
# - 8 anti-patterns: print() instead of logging, _DummySt global, bare except |
|
||||||
# - 6 code clusters: Database, Streamlit UI, API, Analysis/ML, Config, Singletons |
|
||||||
# - 3 groups: stdlib, 3rd party, local imports |
|
||||||
@ -1,265 +0,0 @@ |
|||||||
# API Client Patterns |
|
||||||
|
|
||||||
## Base API Client Pattern |
|
||||||
|
|
||||||
Using requests.Session for connection pooling: |
|
||||||
|
|
||||||
```python |
|
||||||
# api_client.py |
|
||||||
from datetime import datetime, timedelta |
|
||||||
import requests |
|
||||||
from typing import Dict, List, Optional |
|
||||||
from config import config |
|
||||||
|
|
||||||
class TweedeKamerAPI: |
|
||||||
def __init__(self): |
|
||||||
self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0" |
|
||||||
self.session = requests.Session() |
|
||||||
self.session.headers.update({ |
|
||||||
"Accept": "application/json", |
|
||||||
"User-Agent": "Dutch-Political-Compass-Tool/1.0", |
|
||||||
}) |
|
||||||
|
|
||||||
def get_motions( |
|
||||||
self, |
|
||||||
start_date: datetime = None, |
|
||||||
end_date: datetime = None, |
|
||||||
limit: int = 500, |
|
||||||
) -> List[Dict]: |
|
||||||
"""Get motions with voting results using OData API.""" |
|
||||||
if not start_date: |
|
||||||
start_date = datetime.now() - timedelta(days=730) |
|
||||||
|
|
||||||
try: |
|
||||||
voting_records, besluit_meta = self._get_voting_records( |
|
||||||
start_date, end_date, limit |
|
||||||
) |
|
||||||
return self._process_voting_records(voting_records, besluit_meta) |
|
||||||
except Exception as e: |
|
||||||
print(f"Error fetching motions from API: {e}") |
|
||||||
return [] |
|
||||||
``` |
|
||||||
|
|
||||||
## OData Pagination Pattern |
|
||||||
|
|
||||||
Handle server-side pagination with $skip: |
|
||||||
|
|
||||||
```python |
|
||||||
def _get_voting_records( |
|
||||||
self, |
|
||||||
start_date: datetime, |
|
||||||
end_date: datetime = None, |
|
||||||
limit: int = 50000 |
|
||||||
) -> tuple: |
|
||||||
"""Fetch with automatic pagination.""" |
|
||||||
|
|
||||||
filter_query = ( |
|
||||||
f"GewijzigdOp ge {start_date.strftime('%Y-%m-%d')}T00:00:00Z" |
|
||||||
" and StemmingsSoort ne null" |
|
||||||
" and Verwijderd eq false" |
|
||||||
) |
|
||||||
|
|
||||||
page_size = 250 # API caps $top at 250 |
|
||||||
base_url = f"{self.odata_base_url}/Besluit" |
|
||||||
base_params = { |
|
||||||
"$filter": filter_query, |
|
||||||
"$top": page_size, |
|
||||||
"$expand": "Stemming", |
|
||||||
"$orderby": "GewijzigdOp desc", |
|
||||||
} |
|
||||||
|
|
||||||
all_records = [] |
|
||||||
skip = 0 |
|
||||||
|
|
||||||
while len(all_records) < limit: |
|
||||||
params = {**base_params, "$skip": skip} |
|
||||||
response = self.session.get( |
|
||||||
base_url, |
|
||||||
params=params, |
|
||||||
timeout=config.API_TIMEOUT |
|
||||||
) |
|
||||||
response.raise_for_status() |
|
||||||
data = response.json() |
|
||||||
|
|
||||||
besluit_page = data.get("value", []) |
|
||||||
if not besluit_page: |
|
||||||
break |
|
||||||
|
|
||||||
# Process page |
|
||||||
for besluit in besluit_page: |
|
||||||
all_records.extend(self._extract_votes(besluit)) |
|
||||||
|
|
||||||
skip += page_size |
|
||||||
|
|
||||||
return all_records |
|
||||||
``` |
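
The `$skip` loop can be tried without network access by injecting the page fetcher. The sketch below is a simplification under stated assumptions: `fetch_all` and `fake_fetch_page` are hypothetical names, and each row counts as one record, whereas the original expands every `Besluit` into several vote records.

```python
from typing import Callable, Dict, List

Page = List[Dict]


def fetch_all(
    fetch_page: Callable[[int, int], Page],
    page_size: int = 250,
    limit: int = 50000,
) -> Page:
    """Collect rows page by page until the source is exhausted or `limit` is reached."""
    records: Page = []
    skip = 0
    while len(records) < limit:
        page = fetch_page(skip, page_size)
        if not page:
            break  # empty page: nothing left on the server
        records.extend(page)
        if len(page) < page_size:
            break  # short page: this was the last one, skip the extra request
        skip += page_size
    # The last page may overshoot, so trim to the requested limit
    return records[:limit]


def fake_fetch_page(skip: int, top: int) -> Page:
    """Stand-in for session.get(...): serves slices of 600 in-memory rows."""
    rows = [{"Id": i} for i in range(600)]
    return rows[skip:skip + top]


all_rows = fetch_all(fake_fetch_page)                  # three pages: 250, 250, 100
first_300 = fetch_all(fake_fetch_page, limit=300)      # two pages, trimmed to 300
```

Two choices here differ from the snippet above: stopping on a short page saves one request, and slicing to `limit` prevents the final page from pushing the result past the cap.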
|
||||||
|
|
||||||
## Retry with Backoff Pattern |
|
||||||
|
|
||||||
For transient failures: |
|
||||||
|
|
||||||
```python |
|
||||||
# ai_provider.py |
|
||||||
import time |
|
||||||
import random |
|
||||||
import requests |
|
||||||
from requests.exceptions import ConnectionError |
|
||||||
|
|
||||||
def _post_with_retries( |
|
||||||
path: str, |
|
||||||
json: dict, |
|
||||||
retries: int = 3 |
|
||||||
) -> requests.Response: |
|
||||||
"""POST with exponential backoff retry.""" |
|
||||||
|
|
||||||
backoff = 0.5 |
|
||||||
for attempt in range(1, retries + 1): |
|
||||||
try: |
|
||||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
|
||||||
|
|
||||||
# Handle rate limiting |
|
||||||
if resp.status_code == 429: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError("Rate limited") |
|
||||||
|
|
||||||
retry_after = resp.headers.get("Retry-After") |
|
||||||
if retry_after: |
|
||||||
time.sleep(int(retry_after))  # assumes the delta-seconds form; an HTTP-date value would raise ValueError |
|
||||||
else: |
|
||||||
sleep = backoff * (2 ** (attempt - 1)) |
|
||||||
sleep += random.uniform(0, sleep * 0.1) |
|
||||||
time.sleep(sleep) |
|
||||||
continue |
|
||||||
|
|
||||||
# Handle server errors |
|
||||||
if 500 <= resp.status_code < 600: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError(f"Server error: {resp.status_code}") |
|
||||||
time.sleep(backoff * (2 ** (attempt - 1))) |
|
||||||
continue |
|
||||||
|
|
||||||
return resp |
|
||||||
|
|
||||||
except ConnectionError as exc: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError(f"Connection error: {exc}") |
|
||||||
time.sleep(backoff * (2 ** (attempt - 1))) |
|
||||||
|
|
||||||
raise ProviderError("Failed after retries") |
|
||||||
``` |
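
As a worked example of the schedule this produces: with `backoff = 0.5` and `retries = 3`, the function sleeps after attempts 1 and 2 and raises on attempt 3, so the waits are 0.5 s and 1.0 s before jitter. The helper name `backoff_delays` below is an assumption used only to show the arithmetic.

```python
from typing import List


def backoff_delays(retries: int = 3, base: float = 0.5) -> List[float]:
    """Delays (seconds, before jitter) slept between attempts.

    Only attempts 1..retries-1 are followed by a sleep; the final attempt
    raises instead of waiting.
    """
    return [base * (2 ** (attempt - 1)) for attempt in range(1, retries)]


delays = backoff_delays()      # two sleeps for three attempts
worst_case = sum(delays)       # total wait before the final attempt, jitter excluded
```

The 429 branch adds up to 10% random jitter on top of each delay, which spreads out clients that were rate limited at the same moment.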
|
||||||
|
|
||||||
## Batch Processing Pattern |
|
||||||
|
|
||||||
Process items in batches to manage API limits: |
|
||||||
|
|
||||||
```python |
|
||||||
def get_embeddings_with_retry( |
|
||||||
texts: List[str], |
|
||||||
batch_size: int = 50, |
|
||||||
retries: int = 3, |
|
||||||
) -> List[Optional[List[float]]]: |
|
||||||
"""Process embeddings in batches with fallback to single items.""" |
|
||||||
|
|
||||||
results = [None] * len(texts) |
|
||||||
|
|
||||||
i = 0 |
|
||||||
while i < len(texts): |
|
||||||
end = min(len(texts), i + batch_size) |
|
||||||
chunk = texts[i:end] |
|
||||||
|
|
||||||
# Try batch first |
|
||||||
try: |
|
||||||
emb_chunk = get_embeddings_batch(chunk) |
|
||||||
for j, emb in enumerate(emb_chunk): |
|
||||||
results[i + j] = emb |
|
||||||
i = end |
|
||||||
continue |
|
||||||
except Exception: |
|
||||||
pass |
|
||||||
|
|
||||||
# Fallback: single items |
|
||||||
for j, text in enumerate(chunk): |
|
||||||
try: |
|
||||||
results[i + j] = get_embedding(text) |
|
||||||
except Exception: |
|
||||||
results[i + j] = None |
|
||||||
|
|
||||||
i = end |
|
||||||
|
|
||||||
return results |
|
||||||
``` |
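
The batch-then-single fallback can be demonstrated with a stand-in embedder. In this sketch the embedding function is passed in as a parameter, and both `embed_with_fallback` and `fake_embed` are hypothetical names rather than functions from the repository.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_fallback(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 50,
) -> List[Optional[Vector]]:
    """Embed in batches; when a batch fails, retry its items one at a time."""
    results: List[Optional[Vector]] = [None] * len(texts)
    for i in range(0, len(texts), batch_size):
        chunk = list(texts[i:i + batch_size])
        try:
            for j, emb in enumerate(embed_batch(chunk)):
                results[i + j] = emb
            continue  # whole batch succeeded
        except Exception:
            pass  # fall through to per-item attempts
        for j, text in enumerate(chunk):
            try:
                results[i + j] = embed_batch([text])[0]
            except Exception:
                results[i + j] = None  # keep position, mark as failed
    return results


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in embedder: rejects any batch that contains an empty string."""
    if any(not t for t in batch):
        raise ValueError("empty text")
    return [[float(len(t))] for t in batch]


out = embed_with_fallback(["ab", "", "cde"], fake_embed, batch_size=2)
```

The result list always has the same length and order as the input, so a single bad item costs one `None` slot instead of the whole batch.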
|
||||||
|
|
||||||
## Response Validation Pattern |
|
||||||
|
|
||||||
Validate API responses before processing: |
|
||||||
|
|
||||||
```python |
|
||||||
def _process_response(self, response: requests.Response) -> Dict: |
|
||||||
"""Validate and parse API response.""" |
|
||||||
|
|
||||||
response.raise_for_status() |
|
||||||
data = response.json() |
|
||||||
|
|
||||||
if "value" not in data: |
|
||||||
raise ValueError("Unexpected response format: missing 'value' key") |
|
||||||
|
|
||||||
return data |
|
||||||
|
|
||||||
def _validate_besluit(self, besluit: Dict) -> bool: |
|
||||||
"""Check required fields exist.""" |
|
||||||
required = ["Id", "GewijzigdOp"] |
|
||||||
return all(field in besluit for field in required) |
|
||||||
``` |
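
The two checks can be combined to separate usable records from malformed ones. A minimal sketch on plain dictionaries follows; `split_valid` and the sample payload are assumptions for illustration and do not come from `api_client.py`.

```python
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("Id", "GewijzigdOp")


def split_valid(payload: Dict) -> Tuple[List[Dict], List[Dict]]:
    """Return (valid, rejected) records from an OData-style payload."""
    if "value" not in payload:
        raise ValueError("Unexpected response format: missing 'value' key")
    valid: List[Dict] = []
    rejected: List[Dict] = []
    for record in payload["value"]:
        if all(field in record for field in REQUIRED_FIELDS):
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected


sample = {
    "value": [
        {"Id": "a1", "GewijzigdOp": "2024-01-15T00:00:00Z"},
        {"Id": "b2"},  # missing GewijzigdOp
    ]
}
valid, rejected = split_valid(sample)
```

Keeping the rejected records, rather than silently dropping them, makes it possible to log how many rows an upstream schema change has affected.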
|
||||||
|
|
||||||
## Error Handling Patterns |
|
||||||
|
|
||||||
Always provide safe fallbacks: |
|
||||||
|
|
||||||
```python |
|
||||||
def safe_api_call(self, endpoint: str, params: Dict = None) -> List[Dict]: |
|
||||||
"""Call API with error handling and fallback.""" |
|
||||||
try: |
|
||||||
response = self.session.get( |
|
||||||
endpoint, |
|
||||||
params=params, |
|
||||||
timeout=config.API_TIMEOUT |
|
||||||
) |
|
||||||
response.raise_for_status() |
|
||||||
data = response.json() |
|
||||||
return data.get("value", []) |
|
||||||
except requests.Timeout: |
|
||||||
_logger.warning(f"API timeout for {endpoint}") |
|
||||||
return [] |
|
||||||
except requests.HTTPError as e: |
|
||||||
_logger.error(f"HTTP error: {e}") |
|
||||||
return [] |
|
||||||
except Exception as e: |
|
||||||
_logger.error(f"API call failed: {e}") |
|
||||||
return [] |
|
||||||
``` |
|
||||||
|
|
||||||
## Session Management |
|
||||||
|
|
||||||
Reuse session for connection pooling: |
|
||||||
|
|
||||||
```python |
|
||||||
class TweedeKamerAPI: |
|
||||||
def __init__(self): |
|
||||||
self.session = requests.Session() |
|
||||||
self.session.headers.update({ |
|
||||||
"Accept": "application/json", |
|
||||||
"User-Agent": "Dutch-Political-Compass-Tool/1.0", |
|
||||||
}) |
|
||||||
|
|
||||||
def close(self): |
|
||||||
"""Clean up session when done.""" |
|
||||||
self.session.close() |
|
||||||
|
|
||||||
def __enter__(self): |
|
||||||
return self |
|
||||||
|
|
||||||
def __exit__(self, *args): |
|
||||||
self.close() |
|
||||||
|
|
||||||
# Usage |
|
||||||
with TweedeKamerAPI() as api: |
|
||||||
motions = api.get_motions(start_date) |
|
||||||
``` |
|
||||||
@ -1,230 +0,0 @@ |
|||||||
# Architectural Patterns |
|
||||||
|
|
||||||
## Repository Pattern |
|
||||||
|
|
||||||
The `MotionDatabase` class acts as a repository, encapsulating all database operations behind a clean interface. |
|
||||||
|
|
||||||
```python |
|
||||||
# database.py |
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
self._init_database() |
|
||||||
|
|
||||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
|
||||||
"""Get a single motion by ID.""" |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
result = conn.execute( |
|
||||||
"SELECT * FROM motions WHERE id = ?", (motion_id,) |
|
||||||
).fetchone() |
|
||||||
return result |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
def get_filtered_motions( |
|
||||||
self, |
|
||||||
policy_area: str = "Alle", |
|
||||||
min_margin: float = 0.0, |
|
||||||
max_margin: float = 1.0, |
|
||||||
limit: int = 10 |
|
||||||
) -> List[Dict]: |
|
||||||
"""Get filtered list of motions.""" |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
**Usage**: Import the singleton instance for all DB operations. |
|
||||||
```python |
|
||||||
from database import db |
|
||||||
|
|
||||||
motions = db.get_filtered_motions(policy_area="Klimaat", limit=20) |
|
||||||
``` |
|
||||||
|
|
||||||
## Facade Pattern |
|
||||||
|
|
||||||
Simplified interfaces over complex subsystems. |
|
||||||
|
|
||||||
### MotionDatabase Facade |
|
||||||
```python |
|
||||||
# Single entry point for all database operations |
|
||||||
db = MotionDatabase() # Singleton instance |
|
||||||
|
|
||||||
# Operations are abstracted: |
|
||||||
db.create_session(total_motions) |
|
||||||
db.record_vote(session_id, motion_id, vote) |
|
||||||
db.get_party_results(session_id) |
|
||||||
``` |
|
||||||
|
|
||||||
### API Client Facade |
|
||||||
```python |
|
||||||
# api_client.py |
|
||||||
class TweedeKamerAPI: |
|
||||||
def __init__(self): |
|
||||||
self.session = requests.Session() # Connection pooling |
|
||||||
|
|
||||||
def get_motions(self, start_date, end_date) -> List[Dict]: |
|
||||||
"""Simple interface hiding OData pagination details.""" |
|
||||||
voting_records, besluit_meta = self._get_voting_records(start_date, end_date) |
|
||||||
return self._process_voting_records(voting_records, besluit_meta) |
|
||||||
``` |
|
||||||
|
|
||||||
### MotionScraper Facade |
|
||||||
```python |
|
||||||
# scraper.py (if used) |
|
||||||
class MotionScraper: |
|
||||||
def get_motion_content(self, url: str) -> Optional[str]: |
|
||||||
"""Extract body text from official website.""" |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Pipeline Pattern |
|
||||||
|
|
||||||
Sequential phases with explicit dependencies: |
|
||||||
|
|
||||||
``` |
|
||||||
pipeline/run_pipeline.py |
|
||||||
├── Phase 1: fetch_mp_metadata |
|
||||||
│ └── pipeline/fetch_mp_metadata.py |
|
||||||
├── Phase 2: extract_mp_votes |
|
||||||
│ └── pipeline/extract_mp_votes.py |
|
||||||
├── Phase 3: svd_pipeline |
|
||||||
│ └── pipeline/svd_pipeline.py |
|
||||||
├── Phase 4: text_pipeline (gap-fill) |
|
||||||
│ └── pipeline/text_pipeline.py |
|
||||||
└── Phase 5: fusion (combine SVD + text) |
|
||||||
└── pipeline/fusion.py |
|
||||||
``` |
|
||||||
|
|
||||||
### Phase Orchestration |
|
||||||
```python |
|
||||||
# pipeline/run_pipeline.py |
|
||||||
def run(args: argparse.Namespace) -> int: |
|
||||||
db = MotionDatabase(args.db_path) |
|
||||||
|
|
||||||
# Phase 1: MP metadata |
|
||||||
if not args.skip_metadata: |
|
||||||
from pipeline.fetch_mp_metadata import fetch_mp_metadata |
|
||||||
fetch_mp_metadata(db_path=db.db_path) |
|
||||||
|
|
||||||
# Phase 2: Extract votes |
|
||||||
if not args.skip_extract: |
|
||||||
from pipeline.extract_mp_votes import extract_mp_votes |
|
||||||
extract_mp_votes(db_path=db.db_path) |
|
||||||
|
|
||||||
# Phase 3: SVD per window |
|
||||||
if not args.skip_svd: |
|
||||||
from pipeline.svd_pipeline import run_svd_pipeline |
|
||||||
run_svd_pipeline(db, windows, args.svd_k) |
|
||||||
|
|
||||||
# ... additional phases |
|
||||||
``` |
|
||||||
|
|
||||||
## Strategy Pattern |
|
||||||
|
|
||||||
Interchangeable algorithms for axis computation: |
|
||||||
|
|
||||||
```python |
|
||||||
# analysis/political_axis.py |
|
||||||
def compute_political_axis( |
|
||||||
vectors: Dict[str, np.ndarray], |
|
||||||
method: str = "pca" # or "anchor" |
|
||||||
) -> Tuple[np.ndarray, np.ndarray]: |
|
||||||
"""Compute political axis using specified method. |
|
||||||
|
|
||||||
Methods: |
|
||||||
- 'pca': Use first principal component |
|
||||||
- 'anchor': Use predefined anchor motions |
|
||||||
""" |
|
||||||
if method == "pca": |
|
||||||
return _compute_pca_axis(vectors) |
|
||||||
elif method == "anchor": |
|
||||||
return _compute_anchor_axis(vectors) |
|
||||||
``` |
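
Note that the `if`/`elif` chain above returns `None` for an unrecognized `method`. One way to make the strategy selection explicit is a dispatch table that rejects unknown names. This is a sketch with placeholder strategies: `_pca_stub`, `_anchor_stub`, and `compute_axis` are hypothetical and do not perform any real axis computation.

```python
from typing import Callable, Dict, List

Strategy = Callable[[List[float]], str]


def _pca_stub(values: List[float]) -> str:
    """Placeholder for the PCA-based strategy."""
    return f"pca over {len(values)} values"


def _anchor_stub(values: List[float]) -> str:
    """Placeholder for the anchor-based strategy."""
    return f"anchor over {len(values)} values"


STRATEGIES: Dict[str, Strategy] = {
    "pca": _pca_stub,
    "anchor": _anchor_stub,
}


def compute_axis(values: List[float], method: str = "pca") -> str:
    """Look up the strategy by name and fail loudly on an unknown one."""
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {sorted(STRATEGIES)}"
        ) from None
    return strategy(values)


result = compute_axis([0.1, 0.2, 0.3], method="anchor")
```

Adding a new algorithm then means registering one more entry in `STRATEGIES`, with no change to the calling code.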
|
||||||
|
|
||||||
## Visitor Pattern |
|
||||||
|
|
||||||
External operations on data structures: |
|
||||||
|
|
||||||
```python |
|
||||||
# analysis/trajectory.py |
|
||||||
def _procrustes_align_windows( |
|
||||||
window_vecs: Dict[str, Dict[str, np.ndarray]], |
|
||||||
min_overlap: int = 5, |
|
||||||
) -> Dict[str, Dict[str, np.ndarray]]: |
|
||||||
"""Align SVD vectors across windows using Procrustes rotations. |
|
||||||
|
|
||||||
Takes the first window as reference and aligns each subsequent window |
|
||||||
to it via orthogonal Procrustes on the set of common entities. |
|
||||||
""" |
|
||||||
``` |
|
||||||
|
|
||||||
## Builder Pattern |
|
||||||
|
|
||||||
Configuration assembled step by step through successive `add_argument` calls on one parser object:
|
||||||
|
|
||||||
```python |
|
||||||
# CLI argument parsing |
|
||||||
parser = argparse.ArgumentParser(description="Pipeline runner") |
|
||||||
parser.add_argument("--db-path", default="data/motions.db") |
|
||||||
parser.add_argument("--start-date", default=None) |
|
||||||
parser.add_argument("--end-date", default=None) |
|
||||||
parser.add_argument("--window-size", choices=["quarterly", "annual"], default="quarterly") |
|
||||||
parser.add_argument("--svd-k", type=int, default=50) |
|
||||||
``` |
|
||||||
|
|
||||||
## Decorator Pattern |
|
||||||
|
|
||||||
Retry logic for transient failures, written as a wrapper function with exponential backoff:

```python
# pipeline/ai_provider_wrapper.py
import time
from typing import List, Optional


def get_embeddings_with_retry(
    texts: List[str],
    retries: int = 3,
    batch_size: int = 50,
) -> List[Optional[List[float]]]:
    """Return embeddings with automatic retry on failure."""
    backoff = 0.5  # base delay in seconds, doubled after each failed attempt
    for attempt in range(1, retries + 1):
        try:
            return _embedder(texts, batch_size=batch_size)
        except Exception:
            if attempt == retries:
                break
            time.sleep(backoff * (2 ** (attempt - 1)))
    return [None] * len(texts)  # Safe fallback
```
|
||||||
|
|
||||||
## Data Patterns |
|
||||||
|
|
||||||
### Batch Processing |
|
||||||
Process items in chunks to manage memory and API limits: |
|
||||||
```python |
|
||||||
for i in range(0, len(items), batch_size): |
|
||||||
chunk = items[i:i + batch_size] |
|
||||||
process_batch(chunk) |
|
||||||
``` |
|
||||||
|
|
||||||
### Caching |
|
||||||
Pre-compute and store expensive results: |
|
||||||
```python |
|
||||||
# SimilarityCache table stores computed similarities |
|
||||||
db.get_similarity(motion_a, motion_b) |
|
||||||
``` |
|
||||||
|
|
||||||
### Lazy Loading |
|
||||||
Load data only when needed: |
|
||||||
```python |
|
||||||
class MotionDatabase: |
|
||||||
@property |
|
||||||
def _connection(self): |
|
||||||
if self._conn is None: |
|
||||||
self._conn = duckdb.connect(self.db_path) |
|
||||||
return self._conn |
|
||||||
``` |
|
||||||
|
|
||||||
### Vectorization |
|
||||||
Use numpy for batch operations: |
|
||||||
```python
vectors = np.array(list(entity_vectors.values()))
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
normalized = vectors / norms
```
|
||||||
@ -1,239 +0,0 @@ |
|||||||
# DuckDB Database Patterns |
|
||||||
|
|
||||||
## Connection Management |
|
||||||
|
|
||||||
### Pattern 1: Short-lived per Method (Most Common) |
|
||||||
|
|
||||||
Open a new connection in each method and close it in a `finally` block, so it is released even when the query raises:

```python
# database.py
class MotionDatabase:
    def get_motion(self, motion_id: int) -> Optional[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            return conn.execute(
                "SELECT * FROM motions WHERE id = ?",
                (motion_id,),
            ).fetchone()
        except Exception:
            return None
        finally:
            conn.close()  # runs on both the success and the error path

    def get_filtered_motions(
        self,
        policy_area: str = "Alle",
        min_margin: float = 0.0,
        max_margin: float = 1.0,
        limit: int = 10,
    ) -> List[Dict]:
        conn = duckdb.connect(self.db_path)
        try:
            query = """
                SELECT * FROM motions
                WHERE (? = 'Alle' OR policy_area = ?)
                AND winning_margin BETWEEN ? AND ?
                ORDER BY RANDOM()
                LIMIT ?
            """
            return conn.execute(
                query,
                (policy_area, policy_area, min_margin, max_margin, limit),
            ).fetchall()
        except Exception:
            return []
        finally:
            conn.close()
```
|
||||||
|
|
||||||
### Pattern 2: With Statement (Cleaner) |
|
||||||
|
|
||||||
```python |
|
||||||
def execute_query(self, query: str, params: tuple = ()): |
|
||||||
with duckdb.connect(self.db_path) as conn: |
|
||||||
return conn.execute(query, params).fetchall() |
|
||||||
``` |
|
||||||
|
|
||||||
### Pattern 3: Lazy Connection Caching |
|
||||||
|
|
||||||
For frequently accessed connections: |
|
||||||
|
|
||||||
```python |
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
self._conn = None |
|
||||||
|
|
||||||
@property |
|
||||||
def connection(self): |
|
||||||
if self._conn is None: |
|
||||||
self._conn = duckdb.connect(self.db_path) |
|
||||||
return self._conn |
|
||||||
|
|
||||||
def close(self): |
|
||||||
if self._conn: |
|
||||||
self._conn.close() |
|
||||||
self._conn = None |
|
||||||
``` |
|
||||||
|
|
||||||
## Table Initialization |
|
||||||
|
|
||||||
Create tables with proper constraints and sequences: |
|
||||||
|
|
||||||
```python
def _init_database(self):
    conn = duckdb.connect(self.db_path)
    try:
        # Create sequence for auto-incrementing IDs
        try:
            conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
        except Exception:  # narrowed from a bare `except:`, which also traps KeyboardInterrupt
            pass

        # Create tables
        conn.execute("""
            CREATE TABLE IF NOT EXISTS motions (
                id INTEGER DEFAULT nextval('motions_id_seq'),
                title TEXT NOT NULL,
                description TEXT,
                date DATE,
                policy_area TEXT,
                voting_results JSON,
                winning_margin FLOAT,
                controversy_score FLOAT,
                layman_explanation TEXT,
                externe_identifier TEXT,
                body_text TEXT,
                url TEXT UNIQUE,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)

        # Add columns to existing tables safely
        try:
            conn.execute("ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text TEXT")
        except Exception:
            pass  # Column may already exist
    finally:
        conn.close()  # release the connection even if table creation fails
```
|
||||||
|
|
||||||
## JSON Column Handling |
|
||||||
|
|
||||||
Store and retrieve JSON data: |
|
||||||
|
|
||||||
```python |
|
||||||
# Insert JSON |
|
||||||
def store_motion(self, motion: Dict): |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
conn.execute( |
|
||||||
"INSERT INTO motions (title, voting_results) VALUES (?, ?)", |
|
||||||
(motion["title"], json.dumps(motion["voting_results"])) |
|
||||||
) |
|
||||||
conn.close() |
|
||||||
except Exception: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
# Query JSON |
|
||||||
def get_motions_with_votes(self, party: str) -> List[Dict]: |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
rows = conn.execute(""" |
|
||||||
SELECT title, voting_results |
|
||||||
FROM motions |
|
||||||
WHERE JSON_EXTRACT(voting_results, '$.party') = ? |
|
||||||
""", (party,)).fetchall() |
|
||||||
conn.close() |
|
||||||
return rows |
|
||||||
except Exception: |
|
||||||
conn.close() |
|
||||||
return [] |
|
||||||
``` |
|
||||||
|
|
||||||
## Query Patterns |
|
||||||
|
|
||||||
### Parameterized Queries (Always!) |
|
||||||
```python |
|
||||||
# SAFE - uses parameterized query |
|
||||||
conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)) |
|
||||||
|
|
||||||
# AVOID - SQL injection risk |
|
||||||
# conn.execute(f"SELECT * FROM motions WHERE id = {motion_id}") # BAD! |
|
||||||
``` |
|
||||||
|
|
||||||
### Batch Inserts |
|
||||||
```python |
|
||||||
def bulk_insert_motions(self, motions: List[Dict]): |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
for motion in motions: |
|
||||||
conn.execute( |
|
||||||
"""INSERT OR IGNORE INTO motions |
|
||||||
(title, date, policy_area) VALUES (?, ?, ?)""", |
|
||||||
(motion["title"], motion["date"], motion["policy_area"]) |
|
||||||
) |
|
||||||
conn.close() |
|
||||||
except Exception: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
### Aggregation Queries |
|
||||||
```python |
|
||||||
def get_party_vote_stats(self, party: str) -> Dict: |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
result = conn.execute(""" |
|
||||||
SELECT |
|
||||||
COUNT(*) as total_votes, |
|
||||||
SUM(CASE WHEN vote = 'Voor' THEN 1 ELSE 0 END) as voor, |
|
||||||
SUM(CASE WHEN vote = 'Tegen' THEN 1 ELSE 0 END) as tegen |
|
||||||
FROM mp_votes |
|
||||||
WHERE party = ? |
|
||||||
""", (party,)).fetchone() |
|
||||||
conn.close() |
|
||||||
return {"total": result[0], "voor": result[1], "tegen": result[2]} |
|
||||||
except Exception: |
|
||||||
conn.close() |
|
||||||
return {"total": 0, "voor": 0, "tegen": 0} |
|
||||||
``` |
|
||||||
|
|
||||||
## Error Handling |
|
||||||
|
|
||||||
Always close connections in a `finally` block or with a context manager:
|
||||||
|
|
||||||
```python |
|
||||||
def safe_query(self, query: str, params: tuple = ()): |
|
||||||
conn = None |
|
||||||
try: |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
result = conn.execute(query, params).fetchall() |
|
||||||
return result |
|
||||||
except Exception as e: |
|
||||||
_logger.error(f"Query failed: {e}") |
|
||||||
return [] |
|
||||||
finally: |
|
||||||
if conn: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
## Testing with Mock |
|
||||||
|
|
||||||
For unit tests without DuckDB: |
|
||||||
|
|
||||||
```python |
|
||||||
# In MotionDatabase.__init__ |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
self._file_mode = duckdb is None |
|
||||||
|
|
||||||
if duckdb is None: |
|
||||||
# Create JSON fallback files |
|
||||||
for p in (f"{db_path}.embeddings.json", f"{db_path}.similarity_cache.json"): |
|
||||||
if not os.path.exists(p): |
|
||||||
with open(p, "w") as fh: |
|
||||||
fh.write("[]") |
|
||||||
else: |
|
||||||
self._init_database() |
|
||||||
``` |
|
||||||
@ -1,79 +0,0 @@ |
|||||||
--- |
|
||||||
title: DuckDB Access Pattern |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# DuckDB Access Pattern |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Prefer using read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers. |
|
||||||
- Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic. |
|
||||||
- If a long-lived connection is created at module level, provide explicit close() or ensure operation is safe for Streamlit's lifecycle. |
|
||||||
- Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads). |
|
||||||
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### database.py - Explicit connect/close for schema init |
|
||||||
|
|
||||||
```python |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
... |
|
||||||
conn.execute(""" |
|
||||||
CREATE TABLE IF NOT EXISTS fused_embeddings ( |
|
||||||
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'), |
|
||||||
motion_id INTEGER NOT NULL, |
|
||||||
window_id TEXT NOT NULL, |
|
||||||
vector JSON NOT NULL, |
|
||||||
svd_dims INTEGER NOT NULL, |
|
||||||
text_dims INTEGER NOT NULL, |
|
||||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, |
|
||||||
PRIMARY KEY (id) |
|
||||||
) |
|
||||||
""") |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
### pipeline/svd_pipeline.py - Read-only connection |
|
||||||
|
|
||||||
```python |
|
||||||
conn = duckdb.connect(db_path, read_only=True) |
|
||||||
try: |
|
||||||
rows = conn.execute( |
|
||||||
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?", |
|
||||||
(start_date, end_date), |
|
||||||
).fetchall() |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
### similarity/compute.py - Preferred 'with' context |
|
||||||
|
|
||||||
```python |
|
||||||
try: |
|
||||||
import duckdb |
|
||||||
except Exception: |
|
||||||
logger.exception("duckdb import failed; cannot load vectors") |
|
||||||
return 0 |
|
||||||
|
|
||||||
with duckdb.connect(db.db_path) as conn: |
|
||||||
rows = conn.execute(query, params).fetchall() |
|
||||||
``` |
|
||||||
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Connection without closure |
|
||||||
|
|
||||||
```python |
|
||||||
# BAD: connection may leak if exception occurs before explicit close |
|
||||||
conn = duckdb.connect(db_path) |
|
||||||
rows = conn.execute("SELECT ...").fetchall() |
|
||||||
# missing finally/close |
|
||||||
``` |
|
||||||
|
|
||||||
**Remediation**: Use "with" context or ensure conn.close() in finally block. |
|
||||||
|
|
||||||
### Bad: Parallel write connections |
|
||||||
|
|
||||||
**Problem**: Opening write connections from many parallel workers without coordination. |
|
||||||
|
|
||||||
**Remediation**: Open read_only for compute processes and centralize writes via short-lived connections or a single writer worker. |
|
||||||
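The single-writer remediation can be sketched with a queue and one dedicated thread. This is a minimal illustration under stated assumptions: `run_single_writer`, the `Job` alias, and the placeholder SQL are hypothetical, and the `write` callable stands in for whatever would execute the statement on a short-lived DuckDB write connection.

```python
import queue
import threading
from typing import Callable, List, Tuple

Job = Tuple[str, tuple]  # (SQL text, parameters)
_STOP = None  # sentinel that tells the writer thread to exit


def run_single_writer(jobs: "queue.Queue", write: Callable[[str, tuple], None]) -> None:
    """Drain the queue on one thread so only one writer is ever active."""
    while True:
        job = jobs.get()
        if job is _STOP:
            break
        sql, params = job
        write(sql, params)


# Usage: workers only enqueue; the writer thread applies jobs in arrival order.
jobs: "queue.Queue" = queue.Queue()
applied: List[Job] = []
writer = threading.Thread(
    target=run_single_writer,
    args=(jobs, lambda sql, params: applied.append((sql, params))),
)
writer.start()
for i in range(3):
    jobs.put(("INSERT INTO similarity_cache VALUES (?)", (i,)))  # placeholder SQL
jobs.put(_STOP)
writer.join()
```

Compute workers would keep their own `read_only=True` connections and hand results to the queue, so they never compete for the write lock.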
@ -1,74 +0,0 @@ |
|||||||
--- |
|
||||||
title: Embeddings Similarity Pipeline |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# Embeddings Similarity Pipeline |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Keep embedding calls batched where possible; fallback to per-item attempts on persistent batch failure. |
|
||||||
- Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically concatenation [svd + text]. |
|
||||||
- Compute similarity as normalized cosine on padded vectors; record top-k neighbors in similarity_cache. |
|
||||||
- Use read_only DuckDB connections in compute workers to allow parallel runs. |
|
||||||
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### pipeline/ai_provider_wrapper.py - Batched embed + fallback |
|
||||||
|
|
||||||
```python |
|
||||||
for start in range(0, len(texts), batch_size): |
|
||||||
chunk = texts[start : start + batch_size] |
|
||||||
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk}) |
|
||||||
... |
|
||||||
for j in range(i, end): |
|
||||||
t = texts[j] |
|
||||||
single, single_exc = _attempt_batch([t], j) |
|
||||||
if single: |
|
||||||
results[j] = single[0] |
|
||||||
``` |
|
||||||
|
|
||||||
### pipeline/fusion.py - Concatenation and storage |
|
||||||
|
|
||||||
```python |
|
||||||
try: |
|
||||||
svd_vec = json.loads(svd_json) |
|
||||||
except Exception: |
|
||||||
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id) |
|
||||||
skipped_missing_svd += 1 |
|
||||||
continue |
|
||||||
... |
|
||||||
fused = list(svd_vec) + list(text_vec) |
|
||||||
res = db.store_fused_embedding( |
|
||||||
int(entity_id), |
|
||||||
window_id, |
|
||||||
fused, |
|
||||||
svd_dims=len(svd_vec), |
|
||||||
text_dims=len(text_vec), |
|
||||||
) |
|
||||||
``` |
|
||||||
|
|
||||||
### similarity/compute.py - Normalized cosine similarity |
|
||||||
|
|
||||||
```python |
|
||||||
# Normalize rows |
|
||||||
norms = np.linalg.norm(matrix, axis=1, keepdims=True) |
|
||||||
norms[norms == 0] = 1.0 |
|
||||||
normalized = matrix / norms |
|
||||||
sim = normalized @ normalized.T |
|
||||||
... |
|
||||||
# pick top-k neighbors and write to similarity_cache |
|
||||||
``` |
|
||||||
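The elided top-k step could look roughly like the sketch below. It assumes a dense float matrix with one row per motion. `top_k_neighbors` is a hypothetical name, and writing the result to `similarity_cache` is left out because the storage call is not shown in the source.

```python
from typing import List, Tuple

import numpy as np


def top_k_neighbors(matrix: np.ndarray, k: int) -> List[List[Tuple[int, float]]]:
    """For each row, return up to k (row_index, cosine) pairs, best first."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows from dividing by zero
    normalized = matrix / norms
    sim = normalized @ normalized.T
    np.fill_diagonal(sim, -np.inf)  # exclude each row's match with itself
    k = min(k, max(sim.shape[0] - 1, 0))
    order = np.argsort(-sim, axis=1)[:, :k]  # indices sorted by descending similarity
    return [[(int(j), float(sim[i, j])) for j in row] for i, row in enumerate(order)]
```

A full sort per row is the simplest choice. For large matrices, `np.argpartition` would avoid sorting every column, at the cost of a second pass to order the k winners.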
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Assuming consistent vector length |
|
||||||
|
|
||||||
**Problem**: Assuming consistent vector length without checks leads to shape errors. |
|
||||||
|
|
||||||
**Remediation**: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py). |
|
||||||
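A minimal sketch of that remediation, assuming vectors arrive as plain Python sequences. `pad_to_matrix` is a hypothetical helper, not the code in `compute.py`.

```python
import logging
from typing import Sequence

import numpy as np

_logger = logging.getLogger(__name__)


def pad_to_matrix(vectors: Sequence[Sequence[float]]) -> np.ndarray:
    """Stack vectors into one matrix, zero-padding shorter rows on the right."""
    lengths = {len(v) for v in vectors}
    width = max(lengths, default=0)
    if len(lengths) > 1:
        _logger.warning(
            "Inconsistent vector lengths %s; padding to %d", sorted(lengths), width
        )
    matrix = np.zeros((len(vectors), width), dtype=float)
    for row, vec in enumerate(vectors):
        matrix[row, : len(vec)] = vec
    return matrix
```

Zero padding leaves dot products between the original components unchanged, so cosine similarity on the padded rows stays well defined.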
|
|
||||||
### Bad: Inline heavy computation in UI |
|
||||||
|
|
||||||
**Problem**: Recomputing heavy pipelines inline in UI requests. |
|
||||||
|
|
||||||
**Remediation**: Schedule heavy work in scripts/subprocesses and read precomputed results in UI. |
|
||||||
@ -1,63 +0,0 @@ |
|||||||
--- |
|
||||||
title: Error Handling Pattern |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# Error Handling Pattern |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError). |
|
||||||
- Prefer logging.exception when catching an exception where stack trace is useful. |
|
||||||
- Avoid broad except: clauses that swallow exceptions; if broad except is used for "best-effort" fallback, log at warning and include original exception context. |
|
||||||
- For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False) — only return safe defaults where documented. |
|
||||||
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### ai_provider.py - Network error to ProviderError |
|
||||||
|
|
||||||
```python |
|
||||||
except requests.ConnectionError as exc: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError( |
|
||||||
f"Connection error when calling provider: {exc}" |
|
||||||
) from exc |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
### pipeline/ai_provider_wrapper.py - Best-effort with logging |
|
||||||
|
|
||||||
```python |
|
||||||
except Exception: |
|
||||||
_logger.exception("Failed to append audit event for embedding failure") |
|
||||||
results[j] = None |
|
||||||
``` |
|
||||||
|
|
||||||
### similarity/compute.py - Defensive import handling |
|
||||||
|
|
||||||
```python |
|
||||||
try: |
|
||||||
import duckdb |
|
||||||
except Exception: |
|
||||||
logger.exception("duckdb import failed; cannot load vectors") |
|
||||||
return 0 |
|
||||||
``` |
|
||||||
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Silent exception swallowing |
|
||||||
|
|
||||||
```python |
|
||||||
try: |
|
||||||
do_work() |
|
||||||
except Exception: |
|
||||||
return [] |
|
||||||
# BAD: hides the root cause and returns an ambiguous default |
|
||||||
``` |
|
||||||
|
|
||||||
**Remediation**: Narrow exception types or at minimum log.exception() and re-raise or convert to a domain error if truly handled. |
|
||||||
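One way to apply that remediation is sketched below: catch only the expected exception types, log with the stack trace, and re-raise as a domain error. `VectorDecodeError` and `decode_vector` are hypothetical names chosen for illustration; the project's own domain error is `ProviderError`.

```python
import json
import logging
from typing import List

_logger = logging.getLogger(__name__)


class VectorDecodeError(ValueError):
    """Hypothetical domain error for a stored vector that cannot be parsed."""


def decode_vector(raw: str, entity_id: int) -> List[float]:
    """Parse a JSON-encoded vector, converting parse failures to a domain error."""
    try:
        values = json.loads(raw)
        return [float(v) for v in values]
    except (json.JSONDecodeError, TypeError, ValueError) as exc:
        # Narrow exception types, keep the stack trace, and re-raise with context.
        _logger.exception("Invalid vector JSON for entity %s", entity_id)
        raise VectorDecodeError(f"entity {entity_id}: invalid vector") from exc
```

The caller can then decide whether to skip the entity or abort, instead of receiving an ambiguous empty list.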
|
|
||||||
### Bad: Mixing print() and logging |
|
||||||
|
|
||||||
**Problem**: Mixing print() and logging for errors. |
|
||||||
|
|
||||||
**Remediation**: Replace print() calls with logger.* calls; use structured logging configuration. |
|
||||||
@ -1,41 +0,0 @@ |
|||||||
--- |
|
||||||
title: Module Singletons Pattern |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# Module Singletons Pattern |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: |
|
||||||
- Avoid expensive initialization at import time. |
|
||||||
- Provide a way to construct with a test DB path or to reinitialize in tests. |
|
||||||
- If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit. |
|
||||||
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### database.py - Safe class initialization |
|
||||||
|
|
||||||
```python |
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
# If duckdb is not available, operate in lightweight file-backed mode |
|
||||||
self._file_mode = duckdb is None |
|
||||||
self._init_database() |
|
||||||
``` |
|
||||||
|
|
||||||
### similarity/lookup.py - Local instances |
|
||||||
|
|
||||||
```python |
|
||||||
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase() |
|
||||||
if hasattr(db, "get_cached_similarities"): |
|
||||||
rows = db.get_cached_similarities(...) |
|
||||||
``` |
|
||||||
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Heavy initialization at import time |
|
||||||
|
|
||||||
**Problem**: Creating connections and performing heavy schema migrations during import. |
|
||||||
|
|
||||||
**Remediation**: Move heavy init to an explicit initialize() method and keep import fast. |
|
||||||
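A sketch of that remediation, assuming a cached accessor function is acceptable in place of a module-level instance. `LazyDatabase` and `get_db` are hypothetical stand-ins, not names from `database.py`.

```python
from functools import lru_cache


class LazyDatabase:
    """Hypothetical stand-in for MotionDatabase with deferred setup."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path  # cheap: no I/O in the constructor
        self.initialized = False

    def initialize(self) -> None:
        """Run the expensive schema work once, on explicit request."""
        if not self.initialized:
            # schema creation and migrations would go here
            self.initialized = True


@lru_cache(maxsize=None)
def get_db(db_path: str = "data/motions.db") -> LazyDatabase:
    """Create and initialize one instance per path on first use, not at import."""
    db = LazyDatabase(db_path)
    db.initialize()
    return db
```

Tests can pass their own path to `get_db` or call `get_db.cache_clear()` to start from a fresh instance, which covers the rule about reinitializing in tests.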
@ -1,196 +0,0 @@ |
|||||||
# Python-Specific Patterns |
|
||||||
|
|
||||||
## Singleton Pattern |
|
||||||
|
|
||||||
Use module-level instances for shared resources: |
|
||||||
|
|
||||||
```python |
|
||||||
# database.py |
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
self._init_database() |
|
||||||
|
|
||||||
def _init_database(self): |
|
||||||
# Initialize tables on first instantiation |
|
||||||
... |
|
||||||
|
|
||||||
# Bottom of file - the singleton |
|
||||||
db = MotionDatabase() |
|
||||||
``` |
|
||||||
|
|
||||||
**Usage across the codebase:** |
|
||||||
```python |
|
||||||
# In other modules |
|
||||||
from database import db |
|
||||||
|
|
||||||
def some_function(): |
|
||||||
motions = db.get_filtered_motions(limit=10) |
|
||||||
return motions |
|
||||||
``` |
|
||||||
|
|
||||||
Similarly for other singletons: |
|
||||||
```python |
|
||||||
# summarizer.py |
|
||||||
class MotionSummarizer: |
|
||||||
def __init__(self): |
|
||||||
pass # Stateless |
|
||||||
|
|
||||||
def generate_layman_explanation(self, title: str, body: str) -> str: |
|
||||||
... |
|
||||||
|
|
||||||
summarizer = MotionSummarizer() |
|
||||||
``` |
|
||||||
|
|
||||||
## Dataclass Config Pattern |
|
||||||
|
|
||||||
Use a dataclass for configuration with environment variable support. Each setting needs a type annotation to become a real dataclass field:

```python
# config.py
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Config:
    # Database settings
    DATABASE_PATH: str = "data/motions.db"

    # API settings
    TWEEDE_KAMER_ODATA_API: str = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
    API_TIMEOUT: int = 30
    API_BATCH_SIZE: int = 250

    # AI settings
    OPENROUTER_API_KEY: Optional[str] = os.getenv("OPENROUTER_API_KEY")
    OPENROUTER_BASE_URL: str = "https://openrouter.ai/api/v1"
    QWEN_MODEL: str = "qwen/qwen-2.5-72b-instruct"

    # App settings
    DEFAULT_MOTION_COUNT: int = 10
    SESSION_TIMEOUT_DAYS: int = 30

    # Policy areas (default_factory gives each instance its own list)
    POLICY_AREAS: List[str] = field(
        default_factory=lambda: [
            "Alle", "Economie", "Klimaat", "Immigratie",
            "Zorg", "Onderwijs", "Defensie", "Sociale Zaken", "Algemeen",
        ]
    )


config = Config()
```
|
||||||
|
|
||||||
**Usage:** |
|
||||||
```python |
|
||||||
from config import config |
|
||||||
|
|
||||||
# Access as attributes |
|
||||||
timeout = config.API_TIMEOUT |
|
||||||
areas = config.POLICY_AREAS |
|
||||||
``` |
|
||||||
|
|
||||||
## DuckDB Connection Pattern |
|
||||||
|
|
||||||
Short-lived connections with explicit cleanup: |
|
||||||
|
|
||||||
```python |
|
||||||
class MotionDatabase: |
|
||||||
def get_motion(self, motion_id: int) -> Optional[Dict]: |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
result = conn.execute( |
|
||||||
"SELECT * FROM motions WHERE id = ?", |
|
||||||
(motion_id,) |
|
||||||
).fetchone() |
|
||||||
return result |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
|
|
||||||
def get_filtered_motions(self, **kwargs) -> List[Dict]: |
|
||||||
conn = duckdb.connect(self.db_path) |
|
||||||
try: |
|
||||||
rows = conn.execute(query, params).fetchall() |
|
||||||
return rows |
|
||||||
except Exception: |
|
||||||
return [] # Safe fallback |
|
||||||
finally: |
|
||||||
conn.close() |
|
||||||
``` |
|
||||||
|
|
||||||
**Context manager alternative (preferred when applicable):** |
|
||||||
```python |
|
||||||
def some_operation(self): |
|
||||||
with duckdb.connect(self.db_path) as conn: |
|
||||||
result = conn.execute("SELECT ...").fetchall() |
|
||||||
return result |
|
||||||
``` |
|
||||||
|
|
||||||
## Try/Except with Fallback Pattern |
|
||||||
|
|
||||||
Always provide safe fallbacks: |
|
||||||
|
|
||||||
```python
def get_motion_or_default(self, motion_id: int) -> Dict:
    conn = None
    try:
        conn = duckdb.connect(self.db_path)
        result = conn.execute("SELECT * FROM motions WHERE id = ?", (motion_id,)).fetchone()
        return result if result else {}
    except Exception:
        return {}
    finally:
        if conn is not None:
            conn.close()  # also closes when execute() raises
```
|
||||||
|
|
||||||
## Optional Import Pattern |
|
||||||
|
|
||||||
Handle optional dependencies gracefully: |
|
||||||
|
|
||||||
```python |
|
||||||
try: |
|
||||||
import duckdb |
|
||||||
except Exception: # pragma: no cover |
|
||||||
duckdb = None |
|
||||||
|
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self._file_mode = duckdb is None |
|
||||||
... |
|
||||||
``` |
|
||||||
|
|
||||||
## Property Pattern |
|
||||||
|
|
||||||
Lazy initialization of expensive resources: |
|
||||||
|
|
||||||
```python |
|
||||||
class MotionDatabase: |
|
||||||
def __init__(self, db_path: str = config.DATABASE_PATH): |
|
||||||
self.db_path = db_path |
|
||||||
self._session_cache = None |
|
||||||
|
|
||||||
@property |
|
||||||
def session(self): |
|
||||||
"""Lazy-load expensive resources.""" |
|
||||||
if self._session_cache is None: |
|
||||||
self._session_cache = self._create_session() |
|
||||||
return self._session_cache |
|
||||||
``` |
|
||||||
|
|
||||||
## Type Annotation Patterns |
|
||||||
|
|
||||||
```python
from typing import Any, Dict, List, Optional, Tuple

# Optional parameter with a None default, and an Optional return value
def get_motion(self, motion_id: Optional[int] = None) -> Optional[Dict]:
    ...

# Fixed-shape tuple return: (success, error_message)
def parse_vote(self, vote_str: str) -> Tuple[bool, str]:
    """Returns (success, error_message)"""
    ...

# Parameterized container types
def get_batch(self, ids: List[int]) -> Dict[str, Any]:
    ...
```
|
||||||
@ -1,77 +0,0 @@ |
|||||||
--- |
|
||||||
title: Requests HTTP Pattern |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# Requests HTTP Pattern |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Reuse requests.Session when making multiple calls to the same host to benefit from connection pooling. |
|
||||||
- Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429. |
|
||||||
- Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429). |
|
||||||
- Raise or wrap non-OK responses into domain ProviderError to make behavior consistent across the codebase. |
|
||||||
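The retry rules above can be expressed as two small pure functions. This is a sketch under stated assumptions: `is_retryable` and `retry_delay` are hypothetical names, the 0.5 second base delay is illustrative, and only the integer-seconds form of `Retry-After` is handled (the HTTP-date form falls back to exponential backoff).

```python
from typing import Mapping, Optional


def is_retryable(status_code: int) -> bool:
    """429 and 5xx are treated as transient; other 4xx are client errors."""
    return status_code == 429 or 500 <= status_code < 600


def retry_delay(
    status_code: int,
    headers: Mapping[str, str],
    attempt: int,
    backoff: float = 0.5,
) -> Optional[float]:
    """Seconds to wait before the next attempt, or None if no retry is due."""
    if not is_retryable(status_code):
        return None
    raw = headers.get("Retry-After")
    if status_code == 429 and raw is not None and raw.strip().isdigit():
        return float(raw)  # server-specified delay in whole seconds
    return backoff * (2 ** (attempt - 1))  # exponential backoff, attempt starts at 1
```

A caller would sleep for the returned delay and try again, or raise `ProviderError` when the result is `None` or the attempts are exhausted. Header lookup here is case-sensitive, which is fine for the case-insensitive mapping that `requests` provides but matters for a plain `dict`.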
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### ai_provider.py - 429 handling with Retry-After |
|
||||||
|
|
||||||
```python |
|
||||||
resp = requests.post(url, json=json, headers=headers, timeout=10) |
|
||||||
... |
|
||||||
if getattr(resp, "status_code", 0) == 429: |
|
||||||
if attempt == retries: |
|
||||||
raise ProviderError(f"Provider returned HTTP {resp.status_code}") |
|
||||||
retry_after = None |
|
||||||
raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None |
|
||||||
if raw: |
|
||||||
try: |
|
||||||
retry_after = int(raw) |
|
||||||
except Exception: |
|
||||||
... |
|
||||||
if retry_after is not None: |
|
||||||
time.sleep(retry_after) |
|
||||||
continue |
|
||||||
``` |
|
||||||
|
|
||||||
### api_client.py - Session + raise_for_status |
|
||||||
|
|
||||||
```python |
|
||||||
response = self.session.get( |
|
||||||
base_url, params=params, timeout=config.API_TIMEOUT |
|
||||||
) |
|
||||||
response.raise_for_status() |
|
||||||
data = response.json() |
|
||||||
``` |
|
||||||
|
|
||||||
### pipeline/ai_provider_wrapper.py - Retry/backoff wrapper |
|
||||||
|
|
||||||
```python |
|
||||||
def _attempt_batch(chunk_texts, start_index): |
|
||||||
backoff = 0.5 |
|
||||||
for attempt in range(1, retries + 1): |
|
||||||
try: |
|
||||||
emb_chunk = _embedder( |
|
||||||
chunk_texts, model=model, batch_size=len(chunk_texts) |
|
||||||
) |
|
||||||
return emb_chunk, None |
|
||||||
except Exception as exc: |
|
||||||
if attempt == retries: |
|
||||||
break |
|
||||||
sleep = backoff * (2 ** (attempt - 1)) |
|
||||||
time.sleep(sleep) |
|
||||||
continue |
|
||||||
``` |
|
||||||
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Silent exception swallowing |
|
||||||
|
|
||||||
**Problem**: Blindly catching all requests exceptions and returning empty response. |
|
||||||
|
|
||||||
**Remediation**: Map network exceptions to retryable vs terminal (ProviderError) and log details. |
|
||||||
|
|
||||||
### Bad: Using print() for errors |
|
||||||
|
|
||||||
**Problem**: Using print() for network errors instead of structured logging. |
|
||||||
|
|
||||||
**Remediation**: Use `_logger.exception()` instead (api_client.py still needs this fix).
|
||||||
@ -1,37 +0,0 @@ |
|||||||
--- |
|
||||||
title: Validation Pattern |
|
||||||
category: patterns |
|
||||||
--- |
|
||||||
# Validation Pattern |
|
||||||
|
|
||||||
## Rules |
|
||||||
|
|
||||||
- Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs. |
|
||||||
- Tests should assert that invalid inputs raise the expected exceptions. |
|
||||||
- Use explicit checks for types and shapes on public APIs (e.g., ensure text is str before embedding). |
|
||||||
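The first two rules can be illustrated together: a validator that fails fast, and a test that asserts the failure. This is a sketch, assuming `ValueError` is the chosen exception. `validate_texts` and the test function are hypothetical names, not part of `ai_provider.py`.

```python
from typing import List, Sequence


def validate_texts(texts: Sequence[object]) -> List[str]:
    """Return texts as a list of str, or raise ValueError on the first bad item."""
    if isinstance(texts, (str, bytes)):
        raise ValueError("texts must be a sequence of strings, not a single string")
    checked: List[str] = []
    for index, text in enumerate(texts):
        if not isinstance(text, str):
            raise ValueError(f"texts[{index}] must be str, got {type(text).__name__}")
        checked.append(text)
    return checked


def test_rejects_non_string() -> None:
    """Assert that an invalid item raises the expected exception."""
    try:
        validate_texts(["ok", 42])
    except ValueError as exc:
        assert "texts[1]" in str(exc)
    else:
        raise AssertionError("expected ValueError")
```

Rejecting a bare string up front matters because a `str` is itself a sequence of one-character strings and would otherwise pass the per-item check.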
|
|
||||||
## Examples |
|
||||||
|
|
||||||
### ai_provider.py - Type validation |
|
||||||
|
|
||||||
```python |
|
||||||
if not isinstance(text, str): |
|
||||||
raise ProviderError("text must be a string") |
|
||||||
``` |
|
||||||
|
|
||||||
### pipeline/ai_provider_wrapper.py - Defensive empty handling |
|
||||||
|
|
||||||
```python |
|
||||||
if not texts: |
|
||||||
return [] |
|
||||||
if motion_ids is None: |
|
||||||
motion_ids = [None for _ in texts] |
|
||||||
``` |
|
||||||
|
|
||||||
## Anti-Patterns |
|
||||||
|
|
||||||
### Bad: Invalid values into computation |
|
||||||
|
|
||||||
**Problem**: Allowing invalid values to propagate into heavy computation (e.g., non-string into embedding pipeline). |
|
||||||
|
|
||||||
**Remediation**: Fail fast with a typed exception and add unit tests to cover validations. |
|
||||||
@ -1,67 +0,0 @@ |
|||||||
--- |
|
||||||
title: Tech Stack |
|
||||||
category: stack |
|
||||||
--- |
|
||||||
|
|
||||||
# Tech Stack |
|
||||||
|
|
||||||
## Runtime & Language |
|
||||||
- **Python >=3.13** |
|
||||||
|
|
||||||
## Web Framework |
|
||||||
- **Streamlit** - Multi-page app with Home, Stemwijzer, Explorer pages |
|
||||||
|
|
||||||
## Data Layer |
|
||||||
- **DuckDB** - Embedded OLAP database |
|
||||||
- Tables: motions, mp_votes, svd_vectors, fused_embeddings, embeddings, user_sessions, party_results, mp_metadata |
|
||||||
- **ibis** - dataframe-style query library, not an ORM (referenced, but the DuckDB-native implementation is used)
|
||||||
|
|
||||||
## AI / LLM |
|
||||||
- **OpenRouter** - API abstraction for AI providers |
|
||||||
- **QWEN** - Primary model |
|
||||||
- Embeddings: `qwen/qwen3-embedding-4b` |
|
||||||
- Chat: `qwen/qwen-2.5-72b-instruct` |
|
||||||
- **requests** - HTTP client (not raw openai) |
|
||||||
|
|
||||||
## ML / Analytics |
|
||||||
- **scikit-learn** - KMeans clustering, cosine_similarity, StandardScaler |
|
||||||
- **scipy** - SVD (scipy.linalg.svd), spatial.procrustes |
|
||||||
- **umap-learn** - Dimensionality reduction (optional, graceful fallback to SVD) |
|
||||||
- **numpy** - Numerical computing |
|
||||||
|
|
||||||
## Visualization |
|
||||||
- **Plotly** - Interactive charts (go.Figure, _DummyTrace fallback) |
|
||||||
- **matplotlib** - Static plotting (optional) |
|
||||||
|
|
||||||
## HTTP & Parsing |
|
||||||
- **requests** - Session pooling, retry with backoff |
|
||||||
- **beautifulsoup4** - HTML parsing |
|
||||||
- **lxml** - XML/HTML processing |
|
||||||
|
|
||||||
## Key Source Files |
|
||||||
|
|
||||||
| File | Purpose | |
|
||||||
|------|---------| |
|
||||||
| `database.py` | MotionDatabase singleton, DuckDB connection, 9-table schema | |
|
||||||
| `explorer.py` | Explorer page with 4 tabs (Motion, MP, Party, Evolution) | |
|
||||||
| `explorer_helpers.py` | Pure helper functions, Plotly chart builders | |
|
||||||
| `analysis/` | SVD pipeline, UMAP projection, clustering | |
|
||||||
| `pipeline/` | Data fetch, transform, store pipeline | |
|
||||||
| `pages/1_Stemwijzer.py` | Quiz page | |
|
||||||
| `pages/2_Explorer.py` | Explorer page | |
|
||||||
| `config.py` | Dataclass Config pattern | |
|
||||||
| `ai_provider.py` | OpenRouter API wrapper with retry | |
|
||||||
| `api_client.py` | TweedeKamer OData API client | |
|
||||||
|
|
||||||
## Singleton Instances |
|
||||||
|
|
||||||
| Module | Instance | Type | |
|
||||||
|--------|----------|------| |
|
||||||
| `database.py` | `db` | `MotionDatabase` | |
|
||||||
| `config.py` | `config` | `Config` (dataclass) | |
|
||||||
| `config.py` | `PARTY_COLOURS` | `dict[str, str]` | |
|
||||||
|
|
||||||
## Environment |
|
||||||
- Python >=3.13 |
|
||||||
- Environment variables via `.env` (DB path, API keys) |
|
||||||
- No `.env` values in constraint files (security) |
|
||||||
@ -1,72 +0,0 @@ |
|||||||
import os |
|
||||||
import re |
|
||||||
from typing import List |
|
||||||
|
|
||||||
|
|
||||||
def file_exists(base_dir: str, path: str) -> bool: |
|
||||||
"""Check whether a path exists under base_dir without opening the file. |
|
||||||
|
|
||||||
This resolves the path relative to base_dir and returns True if the |
|
||||||
resolved path exists on the filesystem (file or directory). |
|
||||||
""" |
|
||||||
if not base_dir: |
|
||||||
base = "" |
|
||||||
else: |
|
||||||
base = base_dir |
|
||||||
full = os.path.join(base, path) |
|
||||||
return os.path.exists(full) |
|
||||||
|
|
||||||
|
|
||||||
def detect_truncated(snippet: str) -> bool: |
|
||||||
"""Heuristic detection whether a snippet is truncated. |
|
||||||
|
|
||||||
Returns True if the snippet ends with an ellipsis '...' (after |
|
||||||
trimming whitespace) or contains a common truncation marker like |
|
||||||
the substring 'truncat' (case-insensitive). |
|
||||||
""" |
|
||||||
if snippet is None: |
|
||||||
return False |
|
||||||
s = snippet.strip() |
|
||||||
if s.endswith("..."): |
|
||||||
return True |
|
||||||
if "truncat" in s.lower(): |
|
||||||
return True |
|
||||||
return False |
|
||||||
|
|
||||||
|
|
||||||
def find_potential_secrets(text: str) -> List[str]: |
|
||||||
"""Scan the provided text and return a list of potential secret-like |
|
||||||
strings. This uses a few common heuristics and regex patterns and only |
|
||||||
scans the provided text (no external resources). |
|
||||||
|
|
||||||
The function returns a list of found token strings (values when |
|
||||||
capture groups are available, otherwise the matched substring). |
|
||||||
""" |
|
||||||
if not text: |
|
||||||
return [] |
|
||||||
|
|
||||||
candidates: List[str] = [] |
|
||||||
|
|
||||||
# AWS access key ID pattern (common): "AKIA" followed by 16 uppercase letters or digits |
|
||||||
aws_pattern = re.compile(r"AKIA[0-9A-Z]{16}") |
|
||||||
candidates.extend(aws_pattern.findall(text)) |
|
||||||
|
|
||||||
# Common key/value patterns like api_key = "..." or "api-key: ..." |
|
||||||
# allow shorter secret values (down to 4 chars) to catch short test values |
|
||||||
kv_pattern = re.compile( |
|
||||||
r"(?i)(?:api[_-]?key|secret[_-]?key|access[_-]?token|access[_-]?key|token|password|passwd|pwd)\s*[=:]+\s*['\"]?([A-Za-z0-9\-_=+/\.]{4,128})['\"]?" |
|
||||||
) |
|
||||||
candidates.extend(m.group(1) for m in kv_pattern.finditer(text)) |
|
||||||
|
|
||||||
# Generic long hex strings, 32-128 characters (heuristic; base64-like strings are not matched) |
|
||||||
long_hex = re.compile(r"\b([a-f0-9]{32,128})\b", re.IGNORECASE) |
|
||||||
candidates.extend(long_hex.findall(text)) |
|
||||||
|
|
||||||
# Deduplicate while preserving order |
|
||||||
seen = set() |
|
||||||
result: List[str] = [] |
|
||||||
for c in candidates: |
|
||||||
if c and c not in seen: |
|
||||||
seen.add(c) |
|
||||||
result.append(c) |
|
||||||
return result |
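
To illustrate the key/value heuristic used by `find_potential_secrets` above, here is a standalone sketch. The regular expression is copied from the deleted module; the wrapper name `extract_kv_secrets` is hypothetical and covers only the key/value pattern, not the AWS or hex patterns.

```python
import re

# Same key/value pattern as in the deleted checks module, split over two
# raw-string literals for readability.
KV_PATTERN = re.compile(
    r"(?i)(?:api[_-]?key|secret[_-]?key|access[_-]?token|access[_-]?key|token|password|passwd|pwd)"
    r"\s*[=:]+\s*['\"]?([A-Za-z0-9\-_=+/\.]{4,128})['\"]?"
)


def extract_kv_secrets(text):
    """Return captured secret-like values, deduplicated in first-seen order."""
    seen = set()
    ordered = []
    for match in KV_PATTERN.finditer(text):
        value = match.group(1)
        if value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered
```

Because the value group accepts as few as four characters, short placeholder values such as test passwords are reported as well, which trades more false positives for fewer misses.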
|
||||||
@ -1,32 +0,0 @@ |
|||||||
from typing import List, Optional |
|
||||||
|
|
||||||
|
|
||||||
def main(argv: Optional[List[str]] = None) -> int: |
|
||||||
"""CLI wrapper that delegates to scripts.mindmodel.validator.main. |
|
||||||
|
|
||||||
Returns the integer exit code from the delegated main. If the |
|
||||||
validator module is not available or raises, return a non-zero |
|
||||||
exit code. |
|
||||||
""" |
|
||||||
try: |
|
||||||
# Import here to avoid side-effects on module import |
|
||||||
from scripts.mindmodel import validator |
|
||||||
|
|
||||||
# Call the validator.main if present |
|
||||||
if hasattr(validator, "main"): |
|
||||||
result = validator.main(argv) |
|
||||||
# Ensure we return an int |
|
||||||
try: |
|
||||||
return int(result) # type: ignore |
|
||||||
except Exception: |
|
||||||
return 1 |
|
||||||
else: |
|
||||||
return 2 |
|
||||||
except Exception: |
|
||||||
# Import error or runtime error — return non-zero so callers |
|
||||||
# can detect failure (tests expect non-zero on missing manifest) |
|
||||||
return 2 |
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__": |
|
||||||
raise SystemExit(main()) |
|
||||||
@ -1,67 +0,0 @@ |
|||||||
"""Simple manifest loader for mindmodel manifests. |
|
||||||
|
|
||||||
Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`. |
|
||||||
|
|
||||||
Behavior: |
|
||||||
- If PyYAML is installed, uses yaml.safe_load to parse the file. |
|
||||||
- Otherwise falls back to the stdlib json parser. |
|
||||||
- If the top-level document is a list it will be normalized to {"constraints": <list>}. |
|
||||||
- Raises ManifestLoadError for missing file or parse errors. |
|
||||||
""" |
|
||||||
|
|
||||||
from typing import Any, Dict |
|
||||||
import json |
|
||||||
from pathlib import Path |
|
||||||
|
|
||||||
|
|
||||||
class ManifestLoadError(Exception): |
|
||||||
"""Raised when a manifest cannot be loaded or parsed.""" |
|
||||||
|
|
||||||
|
|
||||||
try: |
|
||||||
import yaml # type: ignore |
|
||||||
except Exception: # YAML not available |
|
||||||
yaml = None # type: ignore |
|
||||||
|
|
||||||
|
|
||||||
def _parse_with_yaml(text: str) -> Any: |
|
||||||
# yaml.safe_load may return any Python structure |
|
||||||
try: |
|
||||||
return yaml.safe_load(text) |
|
||||||
except Exception as exc: # pragma: no cover - defensive |
|
||||||
raise ManifestLoadError(f"YAML parse error: {exc}") from exc |
|
||||||
|
|
||||||
|
|
||||||
def _parse_with_json(text: str) -> Any: |
|
||||||
try: |
|
||||||
return json.loads(text) |
|
||||||
except Exception as exc: |
|
||||||
raise ManifestLoadError(f"JSON parse error: {exc}") from exc |
|
||||||
|
|
||||||
|
|
||||||
def load_manifest(path: str) -> Dict[str, Any]: |
|
||||||
"""Load a manifest from the given file path and normalize it to a dict. |
|
||||||
|
|
||||||
If the top-level document is a list, it will be returned as {"constraints": list}. |
|
||||||
Raises ManifestLoadError if the file does not exist or if parsing fails. |
|
||||||
""" |
|
||||||
p = Path(path) |
|
||||||
if not p.exists(): |
|
||||||
raise ManifestLoadError(f"Manifest file not found: {path}") |
|
||||||
|
|
||||||
text = p.read_text(encoding="utf-8") |
|
||||||
|
|
||||||
if yaml is not None: |
|
||||||
data = _parse_with_yaml(text) |
|
||||||
else: |
|
||||||
data = _parse_with_json(text) |
|
||||||
|
|
||||||
# Normalize |
|
||||||
if isinstance(data, list): |
|
||||||
return {"constraints": data} |
|
||||||
|
|
||||||
if isinstance(data, dict): |
|
||||||
return data |
|
||||||
|
|
||||||
# Unexpected top-level type, wrap it |
|
||||||
return {"manifest": data} |
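
The normalization step at the end of `load_manifest` can be exercised in isolation. The sketch below restates that logic as a separate function for illustration; the name `normalize_manifest` is hypothetical and does not appear in the deleted module.

```python
import json
from typing import Any, Dict


def normalize_manifest(data: Any) -> Dict[str, Any]:
    """Mirror the loader's normalization of a parsed top-level document."""
    if isinstance(data, list):
        # A bare list is treated as the list of constraints.
        return {"constraints": data}
    if isinstance(data, dict):
        return data
    # Any other top-level type (string, number, None) is wrapped.
    return {"manifest": data}


parsed = json.loads('[{"id": "c1", "description": "a constraint"}]')
normalized = normalize_manifest(parsed)
```

Callers can then always use `normalized.get("constraints")` without first checking the top-level type.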
|
||||||
@ -1,108 +0,0 @@ |
|||||||
from typing import Dict, Tuple, List, Any |
|
||||||
import json |
|
||||||
from pathlib import Path |
|
||||||
|
|
||||||
from scripts.mindmodel import loader |
|
||||||
from scripts.mindmodel import checks |
|
||||||
|
|
||||||
|
|
||||||
def validate_manifest(path: str, base_dir: str | None = None) -> Tuple[int, Dict[str, Any]]: |
|
||||||
"""Validate a manifest file at `path`. |
|
||||||
|
|
||||||
Returns a tuple (exit_code, report). |
|
||||||
|
|
||||||
exit codes: |
|
||||||
0 - ok (no issues) |
|
||||||
1 - warnings (only truncated snippets found) |
|
||||||
2 - critical (missing files, secrets, or parse error) |
|
||||||
""" |
|
||||||
report: Dict[str, Any] = { |
|
||||||
"path": path, |
|
||||||
"secrets": [], |
|
||||||
"missing_files": [], |
|
||||||
"truncated": 0, |
|
||||||
"constraints": [], |
|
||||||
} |
|
||||||
|
|
||||||
p = Path(path) |
|
||||||
try: |
|
||||||
raw_text = p.read_text(encoding="utf-8") |
|
||||||
except Exception as exc: |
|
||||||
report["load_error"] = f"Manifest file not readable: {exc}" |
|
||||||
return 2, report |
|
||||||
|
|
||||||
# scan for secrets in the manifest text |
|
||||||
secrets = checks.find_potential_secrets(raw_text) |
|
||||||
report["secrets"] = secrets |
|
||||||
|
|
||||||
try: |
|
||||||
manifest = loader.load_manifest(path) |
|
||||||
except loader.ManifestLoadError as exc: |
|
||||||
report["load_error"] = str(exc) |
|
||||||
# treat parse/load errors as critical |
|
||||||
return 2, report |
|
||||||
|
|
||||||
constraints = manifest.get("constraints") or [] |
|
||||||
|
|
||||||
for constraint in constraints: |
|
||||||
c_rep: Dict[str, Any] = {"constraint": constraint, "evidence": []} |
|
||||||
for ev in ( |
|
||||||
constraint.get("evidence", []) |
|
||||||
if isinstance(constraint.get("evidence", []), list) |
|
||||||
else [] |
|
||||||
): |
|
||||||
text = ev.get("text") if isinstance(ev, dict) else None |
|
||||||
file_ref = ev.get("file") if isinstance(ev, dict) else None |
|
||||||
|
|
||||||
exists = True |
|
||||||
if file_ref: |
|
||||||
if not checks.file_exists(base_dir or "", file_ref): |
|
||||||
exists = False |
|
||||||
report["missing_files"].append(file_ref) |
|
||||||
|
|
||||||
truncated = False |
|
||||||
if text: |
|
||||||
truncated = checks.detect_truncated(text) |
|
||||||
if truncated: |
|
||||||
report["truncated"] += 1 |
|
||||||
|
|
||||||
c_rep["evidence"].append( |
|
||||||
{ |
|
||||||
"text": text, |
|
||||||
"file": file_ref, |
|
||||||
"exists": exists, |
|
||||||
"truncated": truncated, |
|
||||||
} |
|
||||||
) |
|
||||||
|
|
||||||
report["constraints"].append(c_rep) |
|
||||||
|
|
||||||
# decide exit code |
|
||||||
if report["secrets"]: |
|
||||||
return 2, report |
|
||||||
|
|
||||||
if report["missing_files"]: |
|
||||||
return 2, report |
|
||||||
|
|
||||||
if report["truncated"] > 0: |
|
||||||
return 1, report |
|
||||||
|
|
||||||
return 0, report |
|
||||||
|
|
||||||
|
|
||||||
def main(argv: List[str]) -> int: |
|
||||||
import sys |
|
||||||
|
|
||||||
if len(argv) < 2: |
|
||||||
print(json.dumps({"error": "manifest path required"})) |
|
||||||
return 2 |
|
||||||
|
|
||||||
path = argv[1] |
|
||||||
base_dir = argv[2] if len(argv) > 2 else None |
|
||||||
|
|
||||||
code, report = validate_manifest(path, base_dir=base_dir) |
|
||||||
print(json.dumps(report)) |
|
||||||
return code |
|
||||||
|
|
||||||
|
|
||||||
# no execution at import time |
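
The exit-code policy documented in `validate_manifest` (0 for ok, 1 for warnings, 2 for critical) can be summarized as a small pure function. This is a sketch under the assumption that the report dict has the keys shown in the module above; the function name `exit_code_for` is hypothetical.

```python
def exit_code_for(report):
    """Map a validation report to an exit code.

    2 = critical (load error, secrets, or missing files)
    1 = warning  (only truncated snippets)
    0 = ok
    """
    if report.get("load_error") or report.get("secrets") or report.get("missing_files"):
        return 2
    if report.get("truncated", 0) > 0:
        return 1
    return 0
```

Critical findings are checked first, so a manifest with both a missing file and a truncated snippet is reported as critical rather than as a warning.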
|
||||||
@ -1,56 +0,0 @@ |
|||||||
"""Command-line wrapper around src.validators.mindmodel_validator.validate_manifest |
|
||||||
|
|
||||||
This tiny CLI loads a manifest and writes a structured JSON report to stdout |
|
||||||
and optionally to a file path. It is report-only: it never raises an error or |
|
||||||
changes exit code based on findings. |
|
||||||
""" |
|
||||||
|
|
||||||
from __future__ import annotations |
|
||||||
|
|
||||||
import argparse |
|
||||||
import json |
|
||||||
import os |
|
||||||
from pathlib import Path |
|
||||||
from typing import Any |
|
||||||
|
|
||||||
|
|
||||||
def _write_report(report: dict[str, Any], path: Path | None) -> None: |
|
||||||
text = json.dumps(report, indent=2, ensure_ascii=False) |
|
||||||
print(text) |
|
||||||
if path: |
|
||||||
path.parent.mkdir(parents=True, exist_ok=True) |
|
||||||
path.write_text(text, encoding="utf-8") |
|
||||||
|
|
||||||
|
|
||||||
def main(argv: list[str] | None = None) -> int: |
|
||||||
parser = argparse.ArgumentParser("validate_mindmodel") |
|
||||||
parser.add_argument("manifest", nargs="?", help="path to manifest file") |
|
||||||
parser.add_argument("--manifest", dest="manifest_opt", help="path to manifest file") |
|
||||||
parser.add_argument("--report", help="optional output report path") |
|
||||||
args = parser.parse_args(argv) |
|
||||||
|
|
||||||
manifest = args.manifest_opt or args.manifest |
|
||||||
if not manifest: |
|
||||||
parser.error("manifest path is required (positional or --manifest)") |
|
||||||
|
|
||||||
# import here to keep CLI tiny when unused |
|
||||||
try: |
|
||||||
from src.validators.mindmodel_validator import validate_manifest |
|
||||||
except Exception as e: # pragma: no cover - defensive |
|
||||||
print(f"Failed to import validator: {e}") |
|
||||||
return 0 |
|
||||||
|
|
||||||
try: |
|
||||||
report = validate_manifest(manifest, report_only=True) |
|
||||||
except Exception as e: # never fail the process |
|
||||||
report = {"error": str(e)} |
|
||||||
|
|
||||||
report_path = Path(args.report) if args.report else None |
|
||||||
_write_report(report, report_path) |
|
||||||
|
|
||||||
# always exit zero for report-only operation |
|
||||||
return 0 |
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__": |
|
||||||
raise SystemExit(main()) |
|
||||||
@ -1,35 +0,0 @@ |
|||||||
"""Motion-related simple types and JSON helpers. |
|
||||||
|
|
||||||
Decision: MotionId is an alias for str for simplicity. |
|
||||||
""" |
|
||||||
|
|
||||||
from dataclasses import dataclass, asdict |
|
||||||
from typing import List |
|
||||||
import json |
|
||||||
|
|
||||||
MotionId = str |
|
||||||
Embedding = List[float] |
|
||||||
|
|
||||||
|
|
||||||
@dataclass |
|
||||||
class SimilarityNeighbor: |
|
||||||
motion_id: MotionId |
|
||||||
score: float |
|
||||||
|
|
||||||
|
|
||||||
def to_json(neighbors: List[SimilarityNeighbor]) -> str: |
|
||||||
"""Serialize a list of SimilarityNeighbor to a JSON string. |
|
||||||
|
|
||||||
The format is a JSON list of objects with keys 'motion_id' and 'score'. |
|
||||||
""" |
|
||||||
list_of_dicts = [asdict(n) for n in neighbors] |
|
||||||
return json.dumps(list_of_dicts) |
|
||||||
|
|
||||||
|
|
||||||
def from_json(json_str: str) -> List[SimilarityNeighbor]: |
|
||||||
"""Deserialize a JSON string (list of dicts) into SimilarityNeighbor list.""" |
|
||||||
parsed = json.loads(json_str) |
|
||||||
return [ |
|
||||||
SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"])) |
|
||||||
for item in parsed |
|
||||||
] |
|
||||||
@ -1,142 +0,0 @@ |
|||||||
"""Conservative, report-only mindmodel/manifest validator. |
|
||||||
|
|
||||||
This module provides a small validator that reads a manifest (YAML if |
|
||||||
PyYAML is available, otherwise a tiny fallback parser) and reports |
|
||||||
potential issues without making changes. |
|
||||||
|
|
||||||
The returned report contains the keys: |
|
||||||
- missing_files: list of file paths referenced in the manifest that don't exist |
|
||||||
- truncated_evidence: list of items (dicts) where evidence_excerpt appears truncated |
|
||||||
- potential_secrets: list of items (dicts) where evidence_excerpt looks like it may contain secrets |
|
||||||
|
|
||||||
The manifest is expected to contain a top-level `files` list with |
|
||||||
entries that are mappings and have at least a `path` (or `file_path`) |
|
||||||
and optionally `evidence_excerpt`. |
|
||||||
""" |
|
||||||
|
|
||||||
from __future__ import annotations |
|
||||||
|
|
||||||
import os |
|
||||||
from typing import List, Dict, Any |
|
||||||
|
|
||||||
|
|
||||||
def _load_yaml_native(path: str) -> Dict[str, Any]: |
|
||||||
try: |
|
||||||
import yaml # type: ignore |
|
||||||
|
|
||||||
with open(path, "r", encoding="utf-8") as f: |
|
||||||
return yaml.safe_load(f) or {} |
|
||||||
except Exception: |
|
||||||
raise |
|
||||||
|
|
||||||
|
|
||||||
def _load_yaml_fallback(path: str) -> Dict[str, Any]: |
|
||||||
"""Tiny YAML-ish fallback parser that understands a minimal manifest. |
|
||||||
|
|
||||||
It only supports a top-level `files:` key and a sequence of simple |
|
||||||
mappings with `-` list items and `key: value` pairs indented. |
|
||||||
This is intentionally conservative and fragile; it's only used when |
|
||||||
PyYAML is not available. |
|
||||||
""" |
|
||||||
result: Dict[str, Any] = {} |
|
||||||
files: List[Dict[str, Any]] = [] |
|
||||||
current: Dict[str, Any] | None = None |
|
||||||
|
|
||||||
with open(path, "r", encoding="utf-8") as f: |
|
||||||
for raw in f: |
|
||||||
line = raw.rstrip("\n") |
|
||||||
stripped = line.lstrip() |
|
||||||
if not stripped or stripped.startswith("#"): |
|
||||||
continue |
|
||||||
if stripped.startswith("files:") and line.startswith(stripped): |
|
||||||
# top-level marker, skip |
|
||||||
continue |
|
||||||
if stripped.startswith("- "): |
|
||||||
# start new item |
|
||||||
if current is not None: |
|
||||||
files.append(current) |
|
||||||
current = {} |
|
||||||
# possible inline key: - path: something |
|
||||||
rest = stripped[2:].strip() |
|
||||||
if rest: |
|
||||||
if ":" in rest: |
|
||||||
k, v = rest.split(":", 1) |
|
||||||
current[k.strip()] = v.strip() |
|
||||||
continue |
|
||||||
# key: value lines (indented) |
|
||||||
if ":" in stripped and current is not None: |
|
||||||
k, v = stripped.split(":", 1) |
|
||||||
current[k.strip()] = v.strip() |
|
||||||
|
|
||||||
if current is not None: |
|
||||||
files.append(current) |
|
||||||
if files: |
|
||||||
result["files"] = files |
|
||||||
return result |
|
||||||
|
|
||||||
|
|
||||||
def _normalize_entry(entry: Any) -> Dict[str, Any]: |
|
||||||
if not isinstance(entry, dict): |
|
||||||
return {"path": str(entry)} |
|
||||||
# prefer path or file_path |
|
||||||
if "file_path" in entry and "path" not in entry: |
|
||||||
entry = dict(entry) |
|
||||||
entry["path"] = entry.pop("file_path") |
|
||||||
return entry |
|
||||||
|
|
||||||
|
|
||||||
def validate_manifest(manifest_path: str, report_only: bool = True) -> dict: |
|
||||||
"""Validate a minimal mindmodel manifest and return a report. |
|
||||||
|
|
||||||
Parameters |
|
||||||
- manifest_path: path to the YAML manifest file |
|
||||||
- report_only: unused flag for now; kept to emphasise this is report-only |
|
||||||
|
|
||||||
Returns a dict with keys: missing_files, truncated_evidence, potential_secrets |
|
||||||
""" |
|
||||||
if not os.path.exists(manifest_path): |
|
||||||
raise FileNotFoundError(manifest_path) |
|
||||||
|
|
||||||
# attempt to use PyYAML if available, otherwise fallback |
|
||||||
try: |
|
||||||
manifest = _load_yaml_native(manifest_path) |
|
||||||
except Exception: |
|
||||||
manifest = _load_yaml_fallback(manifest_path) |
|
||||||
|
|
||||||
files = manifest.get("files") or [] |
|
||||||
report = {"missing_files": [], "truncated_evidence": [], "potential_secrets": []} |
|
||||||
|
|
||||||
def _strip_surrounding_quotes(s: str) -> str: |
|
||||||
s = s.strip() |
|
||||||
if len(s) >= 2 and s[0] == s[-1] and s[0] in ('"', "'"): |
|
||||||
return s[1:-1] |
|
||||||
return s |
|
||||||
|
|
||||||
for raw in files: |
|
||||||
entry = _normalize_entry(raw) |
|
||||||
path = entry.get("path") |
|
||||||
evidence = entry.get("evidence_excerpt") or entry.get("evidence") or "" |
|
||||||
# Remove surrounding quotes if the fallback YAML parser left them in place |
|
||||||
if isinstance(evidence, str): |
|
||||||
evidence = _strip_surrounding_quotes(evidence) |
|
||||||
|
|
||||||
# missing files |
|
||||||
if path: |
|
||||||
if not os.path.exists(path): |
|
||||||
report["missing_files"].append(path) |
|
||||||
|
|
||||||
# truncated evidence heuristics |
|
||||||
if isinstance(evidence, str): |
|
||||||
if len(evidence) > 1000 or evidence.strip().endswith("..."): |
|
||||||
report["truncated_evidence"].append( |
|
||||||
{"path": path, "evidence_excerpt": evidence} |
|
||||||
) |
|
||||||
|
|
||||||
# potential secrets heuristics |
|
||||||
up = evidence.upper() |
|
||||||
if "PASSWORD" in up or "SECRET" in up or "BEGIN PRIVATE KEY" in evidence: |
|
||||||
report["potential_secrets"].append( |
|
||||||
{"path": path, "evidence_excerpt": evidence} |
|
||||||
) |
|
||||||
|
|
||||||
return report |
|
||||||
@ -1,11 +0,0 @@ |
|||||||
import pathlib |
|
||||||
|
|
||||||
|
|
||||||
def test_schedule_workflow_exists(): |
|
||||||
path = pathlib.Path(".github/workflows/mindmodel-schedule.yml") |
|
||||||
assert path.exists(), f"Expected {path} to exist" |
|
||||||
|
|
||||||
text = path.read_text(encoding="utf-8") |
|
||||||
# ensure the file is a GitHub Actions workflow that declares a schedule |
|
||||||
assert "on:" in text |
|
||||||
assert "schedule" in text |
|
||||||
@ -1,26 +0,0 @@ |
|||||||
import os |
|
||||||
|
|
||||||
try: |
|
||||||
import yaml |
|
||||||
|
|
||||||
_HAS_YAML = True |
|
||||||
except Exception: |
|
||||||
_HAS_YAML = False |
|
||||||
|
|
||||||
|
|
||||||
def test_mindmodel_workflow_exists_and_parses(): |
|
||||||
path = os.path.join(".github", "workflows", "mindmodel-validation.yml") |
|
||||||
assert os.path.exists(path), f"Workflow file {path} does not exist" |
|
||||||
|
|
||||||
# Minimal parse: if PyYAML is available, try safe_load; otherwise do a token check |
|
||||||
with open(path, "r", encoding="utf-8") as f: |
|
||||||
content = f.read() |
|
||||||
|
|
||||||
if _HAS_YAML: |
|
||||||
data = yaml.safe_load(content) |
|
||||||
assert data is not None and isinstance(data, dict) |
|
||||||
assert "on" in data or "name" in data |
|
||||||
else: |
|
||||||
# fall back to simple checks to avoid introducing new deps |
|
||||||
assert "name:" in content |
|
||||||
assert "on:" in content |
|
||||||
@ -1,43 +0,0 @@ |
|||||||
import os |
|
||||||
import tempfile |
|
||||||
|
|
||||||
from scripts.mindmodel import checks |
|
||||||
|
|
||||||
|
|
||||||
def test_file_exists(tmp_path): |
|
||||||
# create a file under tmp_path |
|
||||||
base = str(tmp_path) |
|
||||||
p = tmp_path / "subdir" |
|
||||||
p.mkdir() |
|
||||||
f = p / "file.txt" |
|
||||||
f.write_text("hello") |
|
||||||
|
|
||||||
# path relative to base |
|
||||||
assert checks.file_exists(base, "subdir/file.txt") |
|
||||||
# non-existing |
|
||||||
assert not checks.file_exists(base, "subdir/missing.txt") |
|
||||||
|
|
||||||
|
|
||||||
def test_detect_truncated(): |
|
||||||
assert checks.detect_truncated("This is a truncated snippet...") |
|
||||||
assert checks.detect_truncated("Truncation marker: [truncated]") |
|
||||||
assert checks.detect_truncated("contains truncatED word") |
|
||||||
assert not checks.detect_truncated("This is complete") |
|
||||||
assert not checks.detect_truncated("") |
|
||||||
|
|
||||||
|
|
||||||
def test_find_potential_secrets(): |
|
||||||
text = """ |
|
||||||
api_key = "abcdEFGH1234ijklMNOP" |
|
||||||
password: 'hunter2' |
|
||||||
aws = AKIA1234567890ABCD12 |
|
||||||
random_hex = deadbeefdeadbeefdeadbeefdeadbeef |
|
||||||
not_a_secret = short |
|
||||||
""" |
|
||||||
|
|
||||||
found = checks.find_potential_secrets(text) |
|
||||||
# should find api_key value, password, aws and long hex |
|
||||||
assert "abcdEFGH1234ijklMNOP" in found |
|
||||||
assert "hunter2" in found |
|
||||||
assert any(item.startswith("AKIA") for item in found) |
|
||||||
assert any("deadbeef" in item for item in found) |
|
||||||
@ -1,14 +0,0 @@ |
|||||||
import os |
|
||||||
|
|
||||||
|
|
||||||
def test_cli_with_nonexistent_manifest(): |
|
||||||
"""Calling cli.main with a non-existent manifest should return non-zero.""" |
|
||||||
from scripts.mindmodel import cli |
|
||||||
|
|
||||||
# Provide a path that is extremely unlikely to exist |
|
||||||
fake_manifest = "/this/path/does/not/exist/manifest.json" |
|
||||||
|
|
||||||
code = cli.main([fake_manifest]) |
|
||||||
|
|
||||||
assert isinstance(code, int) |
|
||||||
assert code != 0 |
|
||||||
@ -1,21 +0,0 @@ |
|||||||
import json |
|
||||||
import pytest |
|
||||||
|
|
||||||
from scripts.mindmodel import loader |
|
||||||
|
|
||||||
|
|
||||||
def test_load_json_manifest(tmp_path): |
|
||||||
data = [{"id": "c1", "description": "a constraint"}] |
|
||||||
p = tmp_path / "manifest.json" |
|
||||||
p.write_text(json.dumps(data), encoding="utf-8") |
|
||||||
|
|
||||||
loaded = loader.load_manifest(str(p)) |
|
||||||
|
|
||||||
assert isinstance(loaded, dict) |
|
||||||
assert "constraints" in loaded |
|
||||||
assert any(c.get("id") == "c1" for c in loaded["constraints"]) |
|
||||||
|
|
||||||
|
|
||||||
def test_missing_manifest_raises(): |
|
||||||
with pytest.raises(loader.ManifestLoadError): |
|
||||||
loader.load_manifest("nonexistent-file-manifest.json") |
|
||||||
@ -1,70 +0,0 @@ |
|||||||
import json |
|
||||||
import os |
|
||||||
|
|
||||||
from scripts.mindmodel import validator |
|
||||||
|
|
||||||
|
|
||||||
def write_manifest(path, data: str): |
|
||||||
p = path |
|
||||||
p.write_text(data, encoding="utf-8") |
|
||||||
return str(p) |
|
||||||
|
|
||||||
|
|
||||||
def test_validate_ok(tmp_path): |
|
||||||
# manifest with one constraint and evidence pointing to an existing file |
|
||||||
evidence_file = tmp_path / "file.txt" |
|
||||||
evidence_file.write_text("hello") |
|
||||||
|
|
||||||
manifest = { |
|
||||||
"constraints": [ |
|
||||||
{"id": "c1", "evidence": [{"file": "file.txt", "text": "complete content"}]} |
|
||||||
] |
|
||||||
} |
|
||||||
|
|
||||||
manifest_path = tmp_path / "manifest.json" |
|
||||||
manifest_path.write_text(json.dumps(manifest)) |
|
||||||
|
|
||||||
code, report = validator.validate_manifest( |
|
||||||
str(manifest_path), base_dir=str(tmp_path) |
|
||||||
) |
|
||||||
assert code == 0 |
|
||||||
assert report["missing_files"] == [] |
|
||||||
assert report["secrets"] == [] |
|
||||||
|
|
||||||
|
|
||||||
def test_missing_file_flags_failure(tmp_path): |
|
||||||
# manifest refers to missing file |
|
||||||
manifest = { |
|
||||||
"constraints": [{"id": "c2", "evidence": [{"file": "nope.txt", "text": "foo"}]}] |
|
||||||
} |
|
||||||
manifest_path = tmp_path / "manifest.json" |
|
||||||
manifest_path.write_text(json.dumps(manifest)) |
|
||||||
|
|
||||||
code, report = validator.validate_manifest( |
|
||||||
str(manifest_path), base_dir=str(tmp_path) |
|
||||||
) |
|
||||||
assert code == 2 |
|
||||||
assert "nope.txt" in report["missing_files"] |
|
||||||
|
|
||||||
|
|
||||||
def test_truncated_produces_warning(tmp_path): |
|
||||||
# evidence text is truncated -> warning |
|
||||||
f = tmp_path / "manifest.json" |
|
||||||
manifest = { |
|
||||||
"constraints": [{"id": "c3", "evidence": [{"text": "This is truncated..."}]}] |
|
||||||
} |
|
||||||
f.write_text(json.dumps(manifest)) |
|
||||||
|
|
||||||
code, report = validator.validate_manifest(str(f), base_dir=str(tmp_path)) |
|
||||||
assert code == 1 |
|
||||||
assert report["truncated"] >= 1 |
|
||||||
|
|
||||||
|
|
||||||
def test_manifest_scanned_for_secrets(tmp_path): |
|
||||||
# manifest text contains an api_key pattern |
|
||||||
f = tmp_path / "manifest.json" |
|
||||||
f.write_text('api_key = "secretVALUE1234"') |
|
||||||
|
|
||||||
code, report = validator.validate_manifest(str(f), base_dir=str(tmp_path)) |
|
||||||
assert code == 2 |
|
||||||
assert any("secretVALUE1234" in s for s in report["secrets"]) |
|
||||||
@ -1,52 +0,0 @@ |
|||||||
import json |
|
||||||
import subprocess |
|
||||||
import sys |
|
||||||
from pathlib import Path |
|
||||||
|
|
||||||
|
|
||||||
def test_cli_runs(tmp_path): |
|
||||||
manifest = Path(".mindmodel/manifest.yaml") |
|
||||||
assert manifest.exists(), "expected .mindmodel/manifest.yaml to exist in repo" |
|
||||||
|
|
||||||
report_path = tmp_path / "report.json" |
|
||||||
|
|
||||||
# Try module mode first, fallback to direct script invocation |
|
||||||
cmds = [ |
|
||||||
[ |
|
||||||
sys.executable, |
|
||||||
"-m", |
|
||||||
"scripts.validate_mindmodel", |
|
||||||
str(manifest), |
|
||||||
"--report", |
|
||||||
str(report_path), |
|
||||||
], |
|
||||||
[ |
|
||||||
sys.executable, |
|
||||||
"scripts/validate_mindmodel.py", |
|
||||||
str(manifest), |
|
||||||
"--report", |
|
||||||
str(report_path), |
|
||||||
], |
|
||||||
] |
|
||||||
|
|
||||||
result = None |
|
||||||
for cmd in cmds: |
|
||||||
try: |
|
||||||
result = subprocess.run(cmd, check=False, capture_output=True, text=True) |
|
||||||
# if process ran (any exit code), break and use this result |
|
||||||
break |
|
||||||
except FileNotFoundError: |
|
||||||
continue |
|
||||||
|
|
||||||
assert result is not None, "Failed to run script (no suitable invocation)" |
|
||||||
# CLI should exit with 0 (report-only) |
|
||||||
assert result.returncode == 0, ( |
|
||||||
f"CLI exited non-zero: {result.returncode}\nstderr: {result.stderr}" |
|
||||||
) |
|
||||||
|
|
||||||
assert report_path.exists(), f"Report file was not created at {report_path}" |
|
||||||
|
|
||||||
data = json.loads(report_path.read_text(encoding="utf-8")) |
|
||||||
# top-level keys expected from validator |
|
||||||
for key in ("missing_files", "truncated_evidence", "potential_secrets"): |
|
||||||
assert key in data, f"Report JSON missing key: {key}" |
|
||||||
@ -1,22 +0,0 @@ |
|||||||
import json |
|
||||||
|
|
||||||
from src.types.motion_types import SimilarityNeighbor, to_json, from_json |
|
||||||
|
|
||||||
|
|
||||||
def test_similarity_neighbor_json_roundtrip(): |
|
||||||
neighbors = [ |
|
||||||
SimilarityNeighbor(motion_id="m1", score=0.9), |
|
||||||
SimilarityNeighbor(motion_id="m2", score=0.75), |
|
||||||
] |
|
||||||
|
|
||||||
# Serialize to JSON string |
|
||||||
json_str = to_json(neighbors) |
|
||||||
assert isinstance(json_str, str) |
|
||||||
|
|
||||||
# Ensure it's valid JSON |
|
||||||
parsed = json.loads(json_str) |
|
||||||
assert isinstance(parsed, list) |
|
||||||
|
|
||||||
# Deserialize back to objects |
|
||||||
recovered = from_json(json_str) |
|
||||||
assert recovered == neighbors |
|
||||||
@ -1,45 +0,0 @@ |
|||||||
import os
import tempfile
from pathlib import Path

import pytest

from src.validators.mindmodel_validator import validate_manifest


def _write_temp_manifest(contents: str) -> str:
    fd, path = tempfile.mkstemp(prefix="manifest_", suffix=".yaml")
    os.close(fd)
    with open(path, "w", encoding="utf-8") as f:
        f.write(contents)
    return path


def test_validator_reports_missing_file(tmp_path):
    # manifest referencing a non-existent file
    missing = str(tmp_path / "no_such_file.txt")
    manifest = f"""
files:
  - path: {missing}
"""
    mpath = _write_temp_manifest(manifest)
    try:
        report = validate_manifest(mpath)
        assert "missing_files" in report
        assert missing in report["missing_files"]
    finally:
        Path(mpath).unlink()


def test_validator_detects_potential_secret(tmp_path):
    # manifest with evidence_excerpt containing PASSWORD
    evidence = "This shows a PASSWORD=hunter2 in the output"
    manifest = f'files:\n  - path: some_file.txt\n    evidence_excerpt: "{evidence}"\n'
    mpath = _write_temp_manifest(manifest)
    try:
        report = validate_manifest(mpath)
        assert "potential_secrets" in report
        items = report["potential_secrets"]
        assert any(evidence in (item.get("evidence_excerpt") or "") for item in items)
    finally:
        Path(mpath).unlink()
@ -1,24 +0,0 @@ |
|||||||
import os
from pathlib import Path

import pytest

from src.validators.types import parse_manifest, Manifest


def test_manifest_model_parses_sample(tmp_path: Path):
    sample = """
files:
  - path: data/file1.txt
    evidence_excerpt: "some evidence"
  - file_path: data/file2.txt
    evidence_excerpt: "other evidence"
"""
    p = tmp_path / "manifest.yaml"
    p.write_text(sample, encoding="utf-8")

    manifest = parse_manifest(str(p))
    assert isinstance(manifest, Manifest)
    assert len(manifest.files) == 2
    assert manifest.files[0]["path"] == "data/file1.txt"
    assert manifest.files[1]["path"] == "data/file2.txt"
@ -1,56 +0,0 @@ |
|||||||
import os
from pathlib import Path

from src.validators.mindmodel_validator import validate_manifest


def test_missing_files_reported(tmp_path):
    # create two paths that do not exist
    p1 = str(tmp_path / "missing_one.txt")
    p2 = str(tmp_path / "missing_two.txt")

    manifest = f"""
files:
  - path: {p1}
  - path: {p2}
"""

    mpath = tmp_path / "manifest_missing.yaml"
    mpath.write_text(manifest, encoding="utf-8")

    report = validate_manifest(str(mpath))
    assert "missing_files" in report
    # both missing paths should be reported
    assert p1 in report["missing_files"]
    assert p2 in report["missing_files"]


def test_truncated_evidence_and_secrets_reported(tmp_path):
    # entry with truncated evidence (ends with ...)
    trunc_path = str(tmp_path / "trunc.txt")
    trunc_evidence = "This output was cut off..."

    # entry with potential secret (contains PASSWORD)
    secret_path = str(tmp_path / "secret.txt")
    secret_evidence = "Found PASSWORD=sekret123 in the logs"

    manifest = f"""
files:
  - path: {trunc_path}
    evidence_excerpt: "{trunc_evidence}"
  - path: {secret_path}
    evidence_excerpt: "{secret_evidence}"
"""

    mpath = tmp_path / "manifest_edgecases.yaml"
    mpath.write_text(manifest, encoding="utf-8")

    report = validate_manifest(str(mpath))

    # truncated evidence should report the trunc_path
    assert "truncated_evidence" in report
    assert any(item.get("path") == trunc_path for item in report["truncated_evidence"])

    # potential secrets should report the secret_path
    assert "potential_secrets" in report
    assert any(item.get("path") == secret_path for item in report["potential_secrets"])
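The removed `src/validators/mindmodel_validator.py` is not shown in this part of the diff, so the following is only a minimal sketch of the behaviour the tests above exercise, not the deleted implementation. The function name `check_entries`, the `SECRET_MARKERS` tuple, and the choice to work on already-parsed entries (rather than reading YAML from disk) are all assumptions made for illustration.

```python
import os
from typing import Any

# Assumed marker list; the tests above only confirm that "PASSWORD" is flagged.
SECRET_MARKERS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def check_entries(entries: list[dict[str, Any]]) -> dict[str, list]:
    """Classify parsed manifest entries into the report keys the tests look for."""
    report: dict[str, list] = {
        "missing_files": [],
        "truncated_evidence": [],
        "potential_secrets": [],
    }
    for entry in entries:
        # Accept either key, since the parsing test uses both `path` and `file_path`.
        path = entry.get("path") or entry.get("file_path")
        evidence = entry.get("evidence_excerpt") or ""

        # A referenced file that does not exist on disk is reported by path.
        if path and not os.path.exists(path):
            report["missing_files"].append(path)

        # Evidence ending in "..." is treated as cut off.
        if evidence.rstrip().endswith("..."):
            report["truncated_evidence"].append(
                {"path": path, "evidence_excerpt": evidence}
            )

        # Evidence containing any marker word is flagged as a possible secret.
        if any(marker in evidence.upper() for marker in SECRET_MARKERS):
            report["potential_secrets"].append(
                {"path": path, "evidence_excerpt": evidence}
            )
    return report
```

Keeping the classification separate from YAML loading means the rules can be tested with plain dictionaries, without temporary files.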
@ -1,40 +0,0 @@ |
|||||||
# 2026-03-28 Ansible package implementation

Summary of changes added to repository:

- packages/@ansible/example/
  - package.json (scoped package @ansible/example)
  - README.md
  - src/index.js
  - tests/ (test_package_json.js, test_pack_inspect.js, _pack_helpers.js, run.js)
- .github/workflows/publish-ansible-example.yml
- .github/workflows/deploy-motief.yml
- docs/deployment/ansible-package-deploy.md
- docs/embeddings.md
- README.md (top-level)
- thoughts/shared/changes/2026-03-28-ansible-package-implementation.md (this file)

Verification commands (run from repo root):

1. Run package tests:
   cd packages/@ansible/example && npm test

2. Run pack inspection:
   cd packages/@ansible/example && node tests/test_pack_inspect.js

3. Simulate pack locally:
   cd packages/@ansible/example && npm pack && tar -tzf <produced-tgz> | head -n 20

4. Check workflows syntax locally (optional):
   - Use `act` or `nektos/act` to run workflow_dispatch triggers in a container; ensure secrets are not printed.

5. Verify docs updated for embeddings and deployment: open docs/embeddings.md and docs/deployment/ansible-package-deploy.md

Notes:
- Do NOT add secrets to repo. Secrets: NPM_TOKEN, DEPLOY_SSH_KEY, DEPLOY_HOST, DEPLOY_USER, DEPLOY_SSH_PORT, OPENROUTER_API_KEY

Contact: Sven Geboers

End of changelog.

Write the file with neutral tone and concise steps for verification.
@ -1,36 +0,0 @@ |
|||||||
---
date: 2026-03-28
title: "Remove .env from tracking: report"
---

Summary
-------

I removed `.env` from the repository index and added it to `.gitignore` to prevent accidental future commits. This was a non-destructive, forward-facing change; the repository history still contains prior commits that touched `.env`.

What I ran
-----------

- git rm --cached .env
- ensured `.gitignore` contains `.env`
- committed the change: chore(secrets): stop tracking .env and add to .gitignore

Commits that referenced .env
----------------------------

These commits touched `.env` in the repository history (from git log --all -- .env):

- 35f4667 2026-03-28 Sven Geboers chore(secrets): stop tracking .env and add to .gitignore
- 3551a82 2026-03-21 Sven Geboers feat(analysis): add 2D political compass and 2D trajectories

Notes
-----

- The `.env` file was removed from the index but remains in historical commits. If you need to remove it from history, we can perform a history rewrite (git-filter-repo or BFG) and force-push; this is destructive and requires coordination.
- I created a CI guard to fail builds if a `.env` file is present in the repository root (see .github/workflows/forbid-env.yml). This prevents accidental re-adding via pushes/PRs.

Next steps (recommended)
------------------------

1. Rotate secrets that might have been in `.env` (see the secrets-rotation checklist next). This is mandatory if those keys were used anywhere publicly or in shared CI.
2. If you require history purge, reply confirming and I'll prepare a filter-repo run and the exact force-push sequence.
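The contents of `.github/workflows/forbid-env.yml` are not shown in this diff. As a rough illustration of the check described in the notes above, a guard script could look like the sketch below. The function names `env_file_present` and `main` are hypothetical, and the script only inspects the working tree, not the Git history.

```python
from pathlib import Path


def env_file_present(repo_root: str = ".") -> bool:
    """Return True if a `.env` file exists directly in repo_root."""
    return (Path(repo_root) / ".env").is_file()


def main(repo_root: str = ".") -> int:
    """Return a process exit code: 1 if `.env` is present, 0 otherwise."""
    if env_file_present(repo_root):
        print("error: .env found in repository root; remove it before merging")
        return 1
    print("ok: no .env in repository root")
    return 0
```

A CI step could call `main()` and pass its return value to `sys.exit`, so that a non-zero code fails the build.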
@ -1,25 +0,0 @@ |
|||||||
---
date: 2026-03-28
title: "Secrets rotation checklist"
---

Rotate these secrets if they were stored in `.env` or otherwise exposed:

- OPENROUTER_API_KEY / OPENAI_API_KEY
- NPM_TOKEN
- DEPLOY SSH keys or passwords (DEPLOY_SSH_KEY, DEPLOY_PASSWORD)
- Any database credentials, API keys, or third-party service tokens

Steps
-----

1. Revoke the current tokens in each provider's dashboard.
2. Create new tokens/keys and store them in the repository secrets (GitHub Settings → Secrets).
3. Update any running services / CI variables to use the new tokens.
4. If you used SSH keys and replaced them, update the authorized_keys on the VPS and remove the old key.

Verification
------------

- Use CI dry-run jobs that check connectivity and token validity.
- Run local commands that use the new tokens.
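As a small aid to the verification steps above, the sketch below checks that the rotated secrets are at least present in the environment before a dry-run job starts. It does not test whether a token is valid with its provider. The `REQUIRED_SECRETS` tuple is an assumption built from names in this checklist, and `missing_secrets` is a hypothetical helper.

```python
import os
from collections.abc import Mapping

# Names taken from the checklist; adjust to the secrets actually in use.
REQUIRED_SECRETS = ("OPENROUTER_API_KEY", "NPM_TOKEN", "DEPLOY_SSH_KEY")


def missing_secrets(
    env: Mapping[str, str] = os.environ,
    required: tuple[str, ...] = REQUIRED_SECRETS,
) -> list[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary, for example `missing_secrets({"NPM_TOKEN": "x"})`.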