parent 9c82962d47
commit 2efd7ba3a0
@@ -0,0 +1,34 @@

# Naming & Style Conventions

## Rules

- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Imports order: stdlib, third-party, local; prefer absolute imports, grouped.
- Use black, ruff, isort and mypy as the recommended toolchain; the repository currently lacks config files for them (no black/ruff sections in pyproject).

## Examples

### Function example (from pipeline/run_pipeline.py)

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```
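A minimal sketch of what such a window generator might look like, limited to the quarterly case (a hypothetical implementation for illustration; the real function lives in pipeline/run_pipeline.py and also handles other granularities):

```python
from datetime import date, timedelta
from typing import List, Tuple


def generate_quarterly_windows(start: date, end: date) -> List[Tuple[str, str, str]]:
    """Return (window_id, start_str, end_str) for each quarter overlapping [start, end]."""
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while True:
        q_start = date(year, 3 * (quarter - 1) + 1, 1)
        if q_start > end:
            break
        # First day of the next quarter, minus one day, is the quarter's last day.
        next_start = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
        q_end = next_start - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", q_start.isoformat(), q_end.isoformat()))
        year, quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)
    return windows


print(generate_quarterly_windows(date(2024, 1, 1), date(2024, 6, 30)))
# [('2024-Q1', '2024-01-01', '2024-03-31'), ('2024-Q2', '2024-04-01', '2024-06-30')]
```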

### Class example (from database.py)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns

- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations

- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.
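A possible starting point for those pyproject.toml sections (illustrative values; the line length and rule selection are assumptions, not taken from the repository):

```toml
[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle errors, pyflakes, import sorting

[tool.isort]
profile = "black"
```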

## Evidence pointers

- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (database.py, lines 1-400+)
@@ -0,0 +1,74 @@

# Database Schema (DuckDB) — extracted DDL

## Rules

- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ... ADD COLUMN IF NOT EXISTS is used in database.py).
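The additive-migration idea can be sketched in Python against sqlite3 as a stand-in (sqlite3 lacks `ADD COLUMN IF NOT EXISTS`, so the existence guard is explicit; with DuckDB the clause itself is enough — this is not the database.py code):

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, col_type: str) -> bool:
    """Additively add a column; return True if it was added, False if it already existed."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE motions (id INTEGER PRIMARY KEY, title TEXT)")
print(add_column_if_missing(conn, "motions", "body_text", "TEXT"))  # True
print(add_column_if_missing(conn, "motions", "body_text", "TEXT"))  # False
```

Running the migration twice is a no-op, which is the property that keeps re-runs of the pipeline safe.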

## Examples (DDL snippets extracted from database.py)

### motions table

```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table

```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

## Anti-patterns

- Broad try/except around the duckdb import (top of database.py) — acceptable for an optional dependency, but it should log the missing dependency explicitly and document the test behavior.
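One way to keep the fallback while logging it explicitly (a sketch using stdlib logging; this is not the current database.py code):

```python
import importlib
import logging

logger = logging.getLogger(__name__)

try:
    duckdb = importlib.import_module("duckdb")
except ImportError as exc:  # only the expected failure mode, not a bare Exception
    duckdb = None
    logger.info("duckdb not installed (%s); using JSON file fallback", exc)

# Downstream code can branch on this flag instead of re-trying the import.
HAS_DUCKDB = duckdb is not None
```

Catching only ImportError means a broken duckdb installation (e.g., an ABI error raising something else) surfaces loudly instead of silently degrading to JSON.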

## Remediations

- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and JSON-fallback database paths. Evidence: database.py contains the JSON fallback logic (lines ~1-80).

## Evidence pointers

- database.py: DDL strings and sequences (lines ~1-300 and further). See the CREATE TABLE blocks for motions, mp_votes, embeddings and fused_embeddings.
@@ -0,0 +1,22 @@

# Domain Glossary

## Rules

- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms

- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage

- pipeline.run_pipeline._generate_windows produces the window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120

## Evidence pointers

- database.py: motions, mp_votes, embeddings, fused_embeddings tables
- pipeline/run_pipeline.py: window generation and pipeline phases

## Anti-patterns

- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` in database.insert_motion and the pipeline extraction). Prefer canonical names that match DB columns, and use small adapter functions when converting between representations.
@@ -0,0 +1,30 @@

# Code Clusters / Organization

## Rules

- The repository organizes code into the following observed clusters:
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)

```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations

- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers

- pipeline/run_pipeline.py: orchestrator and cluster boundaries
- ai_provider.py: AI adapter for embeddings and chat
- analysis/visualize.py: visualization cluster
@@ -0,0 +1,46 @@

# Design Patterns & Code Patterns

## Rules

- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py runs in phases and uses a ThreadPoolExecutor for parallel SVD computation, with careful DuckDB connection handling (results are collected before writes).

## Examples

### Repository pattern (database.py MotionDatabase)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into the database."""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)

```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
    ...
```
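The retry loop itself can be sketched generically by taking the request as a callable, which keeps the sketch independent of requests (illustrative, not the actual ai_provider.py code; the 429/Retry-After handling would be layered on by inspecting the response before deciding to retry):

```python
import time
from typing import Any, Callable


def call_with_retries(send: Callable[[], Any], retries: int = 3, backoff: float = 0.05) -> Any:
    """Call send(); on a transient ConnectionError, retry with exponential backoff."""
    for attempt in range(retries):
        try:
            return send()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))


attempts = {"n": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


print(call_with_retries(flaky))  # ok
```

Note the catch is a named exception type, in line with the narrow-except remediation below.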

### Pipeline parallelism pattern (run_pipeline)

```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait, then write sequentially to DuckDB
```
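A self-contained sketch of the same shape — parallel compute, sequential write — with a stand-in work function (hypothetical; the real phase submits compute_svd_for_window and the final loop writes to DuckDB):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compute(window_id: str) -> tuple[str, int]:
    # Stand-in for compute_svd_for_window: returns (window_id, some result).
    return window_id, len(window_id)


windows = ["2024-Q1", "2024-Q2", "2024-Q3"]
results = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(compute, w): w for w in windows}
    for fut in as_completed(futures):
        window_id, value = fut.result()
        results[window_id] = value  # collect in memory only

# Only now touch the database, from a single thread, in a stable order:
for window_id in windows:
    print(window_id, results[window_id])
```

Collecting results before writing sidesteps concurrent writers on a single DuckDB file, which is the point of the pattern.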

## Anti-patterns

- Broad excepts in several places (the top-level try/except on the duckdb import in database.py, many generic excepts around DB operations) — these can hide real errors.

## Remediations

- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers

- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (lines ~120-260)
- database.py: MotionDatabase methods
@@ -0,0 +1,24 @@

# Anti-patterns, Issues and Recommended Fixes

## Rules

- Issues flagged in Phase 1 must be remediated with concrete actions.

## Issues

- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production.
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges and no lockfile is present, which reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended to add config and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes

- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on the toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches; re-raise or log with a traceback when the error is unexpected. Example locations: the database.py top-level import, the insert_motion broad except, and the ai_provider fallback blocks.

## Evidence pointers

- pyproject.toml: dependencies list (lines 1-40)
- database.py: multiple broad except blocks (top of file and in methods)
- ai_provider.py: uses requests + env keys
@@ -0,0 +1,117 @@

# Example Extractions

## Rules

- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual    → "2024"
    """
```

2) database.py::append_audit_event

```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```

3) ai_provider.py::get_embedding

```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch

```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter

```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)

1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)

Create tests under tests/ named test_*.py, using fixtures from conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py

```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py

```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```
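For reference, a deterministic local embedding of the kind this test assumes could be built from a hash digest (a hypothetical sketch; the real _local_embedding in ai_provider.py may differ):

```python
import hashlib


def local_embedding(text: str, dim: int = 16) -> list[float]:
    """Deterministic, provider-free embedding: hashed bytes mapped into [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Repeat the 32-byte digest so any dim can be served, then normalize bytes.
    raw = (digest * (dim // len(digest) + 1))[:dim]
    return [b / 255.0 for b in raw]


v = local_embedding("hello world", dim=16)
print(len(v))  # 16
```

Determinism is the useful property here: the same text always yields the same vector, so tests can run without network access or API keys.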

3) tests/test_pipeline_windows.py

```python
from pipeline.run_pipeline import _generate_windows


def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py

```python
def test_plot_umap_scatter_no_plotly(monkeypatch, tmp_path):
    # If plotly is missing, the function should raise ImportError with guidance
    import analysis.visualize as vis

    try:
        vis._require_plotly()
    except ImportError:
        assert True
```

## Evidence pointers

- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py create table blocks
@@ -0,0 +1,43 @@

# Stack and Dependencies

## Rules

- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit>=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)

```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes

- pytest is listed under runtime dependencies in pyproject.toml. Move it to dev dependencies to avoid shipping the test runner in production images.
- Many dependencies use permissive ">=" ranges. Pin versions or generate a lockfile (poetry.lock/requirements.txt) and add upper bounds for reproducibility.
- openai is declared but no static imports were found; it is a possible unused dependency (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead).

## Remediations

- Move test-only libs (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai) and remove them or confirm dynamic usage.

## Evidence pointers

- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry
- database.py: duckdb table creation and connection (lines ~1-350)
@@ -0,0 +1,5 @@

# Mindmodel constraints README

Files in .mindmodel/constraints/ are YAML-like constraint documents describing
conventions, patterns and remediation steps. Use these to guide PR reviews and
CI automation.
@@ -1,60 +1,36 @@
 name: stemwijzer
 version: 2
+summary: >-
+  Mindmodel constraints for the Stemwijzer repository (Python + Streamlit +
+  DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns,
+  anti-patterns and example extractions. Generated from Phase 1 analysis.
+main_patterns:
+  - Repository DB wrapper (MotionDatabase)
+  - AI provider adapter with retry/backoff and local fallback
+  - SVD + embedding fusion pipeline with windowed processing
+total_files: 11
 categories:
-  - path: stack.yaml
-    description: Project technology stack (languages, frameworks, runtime)
+  - path: .mindmodel/constraints/99-stack.yaml
+    description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis)
     group: stack
-  - path: dependencies.yaml
-    description: Declared and recommended dependencies grouped by purpose
-    group: stack
-  - path: system.md
-    description: System overview and architecture high-level notes
-    group: architecture
+  - path: .mindmodel/constraints/01-naming.yaml
+    description: Naming, import and style conventions
+    group: conventions
+  - path: .mindmodel/constraints/10-db-schema.yaml
+    description: DuckDB schema DDL extracted from database.py
+    group: database
-  - path: architecture.yaml
-    description: Architectural layers, organization and confidence levels
-    group: architecture
-  - path: conventions.yaml
-    description: Coding conventions cheat-sheet (naming, imports, types)
-    group: style
-  - path: domain-glossary.yaml
-    description: Business domain glossary for the project
+  - path: .mindmodel/constraints/20-domain-glossary.yaml
+    description: Domain glossary and terminology (motions, MP, embeddings, windows)
     group: domain
-  - path: patterns/duckdb_access.yaml
-    description: DuckDB access patterns, examples, and anti-patterns
-    group: patterns
+  - path: .mindmodel/constraints/30-clusters.yaml
+    description: Code clusters and module organization
+    group: architecture
-  - path: patterns/requests_http.yaml
-    description: Requests/HTTP client usage and retry best-practices
+  - path: .mindmodel/constraints/40-patterns.yaml
+    description: Design patterns and coding patterns observed with examples
     group: patterns
-  - path: patterns/embeddings_similarity.yaml
-    description: Embedding, SVD, fusion and similarity pipeline patterns
-    group: patterns
+  - path: .mindmodel/constraints/50-anti-patterns.yaml
+    description: Anti-patterns, issues and recommended remediations
+    group: ops
-  - path: patterns/error_handling.yaml
-    description: Error handling patterns and rules
-    group: patterns
+  - path: .mindmodel/constraints/60-examples.yaml
+    description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs"
+    group: examples
-  - path: patterns/validation.yaml
-    description: Input/domain validation patterns and examples
-    group: patterns
-  - path: patterns/module_singletons.yaml
-    description: Module-level singletons and lifecycle patterns
-    group: patterns
-  - path: anti-patterns.yaml
-    description: Known anti-patterns and remediation steps
-    group: patterns
-  - path: examples/pattern-examples.md
-    description: Consolidated extracted code examples across patterns
-    group: patterns
-  - path: constraints/naming.yaml
-    description: Enforce naming rules (snake_case, PascalCase, constants)
-    group: constraints
-  - path: constraints/imports.yaml
-    description: Enforce import grouping and ordering
-    group: constraints
-  - path: constraints/db_connection.yaml
-    description: Rules for opening/closing DB connections and read-only usage
-    group: constraints
-  - path: constraints/error_handling.yaml
-    description: Error handling style and allowed exception scopes
-    group: constraints
-  - path: constraints/testing.yaml
-    description: Test conventions (pytest, test naming, fixtures)
-    group: constraints
@@ -1,18 +1,14 @@
-# System overview
+# System Overview: Stemwijzer
 
-This project is a Streamlit-based UI and data-processing pipeline that computes embeddings,
-performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes
-a similarity cache for quick lookup in the UI.
+This mindmodel documents constraints, conventions and patterns for the Stemwijzer
+project (Python Streamlit app with DuckDB-backed pipeline for parliamentary
+motions embedding analysis).
 
-Key subsystems:
-- UI: Streamlit pages (Home.py, pages/*). Exposes interactive explorer and quizzes.
-- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData).
-- Processing pipelines: pipeline/* (text embeddings, SVD, fusion).
-- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors.
-- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb.
-- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API).
+Key points:
+- Language: Python >=3.13
+- UI: Streamlit multi-page app (Home.py, pages/)
+- Storage: DuckDB with JSON fallback for tests/dev (database.py)
+- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py)
+- AI: ai_provider adapter uses HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and local fallback
 
-Operational notes:
-- Dockerfile exists; Streamlit default port 8501 exposed.
-- Tests use pytest. CI uses Drone (.drone.yml).
-- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.
+Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding.
@@ -0,0 +1,67 @@
"""Simple manifest loader for mindmodel manifests.

Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`.

Behavior:
- If PyYAML is installed, uses yaml.safe_load to parse the file.
- Otherwise falls back to the stdlib json parser.
- If the top-level document is a list it is normalized to {"constraints": <list>}.
- Raises ManifestLoadError for a missing file or parse errors.
"""

import json
from pathlib import Path
from typing import Any, Dict


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be loaded or parsed."""


try:
    import yaml  # type: ignore
except ImportError:  # YAML not available
    yaml = None  # type: ignore


def _parse_with_yaml(text: str) -> Any:
    # yaml.safe_load may return any Python structure
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:  # pragma: no cover - defensive
        raise ManifestLoadError(f"YAML parse error: {exc}") from exc


def _parse_with_json(text: str) -> Any:
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ManifestLoadError(f"JSON parse error: {exc}") from exc


def load_manifest(path: str) -> Dict[str, Any]:
    """Load a manifest from the given file path and normalize it to a dict.

    If the top-level document is a list, it is returned as {"constraints": list}.
    Raises ManifestLoadError if the file does not exist or if parsing fails.
    """
    p = Path(path)
    if not p.exists():
        raise ManifestLoadError(f"Manifest file not found: {path}")

    text = p.read_text(encoding="utf-8")

    if yaml is not None:
        data = _parse_with_yaml(text)
    else:
        data = _parse_with_json(text)

    # Normalize
    if isinstance(data, list):
        return {"constraints": data}

    if isinstance(data, dict):
        return data

    # Unexpected top-level type, wrap it
    return {"manifest": data}
@@ -0,0 +1,21 @@
import json

import pytest

from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")

    loaded = loader.load_manifest(str(p))

    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")
@ -0,0 +1,73 @@ |
|||||||
|
--- |
||||||
|
date: 2026-03-24 |
||||||
|
topic: "mindmodel-generation" |
||||||
|
status: draft |
||||||
|
--- |
||||||
|
|
||||||
|
## Problem Statement |
||||||
|
|
||||||
|
We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions). |
||||||
|
|
||||||
|
## Constraints |
||||||
|
|
||||||
|
**Non-negotiables:** |
||||||
|
- Keep the generated .mindmodel/ files read-only until validated. |
||||||
|
- Do not make behavioral changes to production code in the same change as model metadata updates. |
||||||
|
- Avoid committing secrets or lockfiles without explicit review. |
||||||
|
|
||||||
|
**Limitations:** |
||||||
|
- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate. |
||||||
|
- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative. |
||||||
|
|
||||||
|
## Approach

I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) make fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass.

Alternatives considered:

- Accept-and-commit everything immediately (faster): rejected because of truncated reads and potentially wrong pointers.
- Manual rewrite of constraints by hand (accurate): rejected due to time cost; validation plus targeted fixes gives the best ROI.
## Architecture

This is a documentation/metadata integration task, not a runtime service. Components:

- **.mindmodel/**: constraint files and manifest produced by the orchestrator. Source of truth for conventions and inferred patterns.
- **Validator job (CI)**: a lightweight script/CI step that verifies manifest consistency, that required files exist, and that key evidence pointers resolve.
- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, black/ruff/isort configs, pre-commit) that enable future automation.

## Components

- Constraint Validator: verifies that every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets.
- Staging branch: holds small remediation commits; each commit is limited to one class of change (dev/prod dependency moves, linters, CI YAML).
- CI pipeline changes: add a validation job and a docs check that ensures the .mindmodel/ manifest is up to date.
## Data Flow

1. Orchestrator output (.mindmodel/) exists in the working tree.
2. Validator runs locally or in CI to check pointers and file existence.
3. Developer reviews validator report and accepts/edits constraint files.
4. Remediation PRs are opened for low-risk fixes.
5. CI runs tests + validator; on green we merge and enable scheduled checks.
## Error Handling

- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing to the reviewer.
## Testing Strategy

- Unit: small pytest tests that assert README/pyproject presence and that the manifest YAML parses.
- Integration: a CI job that runs the Constraint Validator and fails on missing files or secrets.
- Manual: a reviewer inspects a sample of constraint files (3-5) for accuracy before merging.
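The unit layer could be sketched as below; the `manifest_parses` helper name is hypothetical, the manifest location is an assumption, and PyYAML is assumed available in the dev environment:

```python
# Hypothetical unit-test sketch for tests/scripts/mindmodel/; the helper
# name and manifest path are assumptions, and PyYAML is assumed installed.
from pathlib import Path

import yaml


def manifest_parses(path: str) -> bool:
    """True if the manifest exists and parses to a YAML mapping."""
    p = Path(path)
    if not p.is_file():
        return False
    return isinstance(yaml.safe_load(p.read_text(encoding="utf-8")), dict)


def test_manifest_parses(tmp_path):
    good = tmp_path / "manifest.yaml"
    good.write_text("constraints:\n  - id: c1\n", encoding="utf-8")
    assert manifest_parses(str(good))
    assert not manifest_parses(str(tmp_path / "missing.yaml"))
```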

## Open Questions

- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review plus a CI check, not a protected branch yet.

## Next Steps (what I'll do now)

1. Create this design doc (done).
2. Commit the design doc to the repo (doing now).
3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).
@@ -0,0 +1,76 @@
---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---

# Implementation Plan: mindmodel-generation

Goal: Implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot, plus small CI and config artifacts, to validate and integrate the manifest incrementally and safely.

Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md

---

## Overview

This plan breaks the work into four batches: Foundation, Core, Components, and Integration/Configs. Each micro-task is small and independently testable. Tests accompany core modules. The validator intentionally avoids reading repository secret files and only scans manifest text and evidence snippets.
## Batch 1: Foundation (parallel)

- Task 1.1: Manifest loader
  - Path: scripts/mindmodel/loader.py
  - Test: tests/scripts/mindmodel/test_loader.py
  - Behavior: load a YAML or JSON manifest, normalize it to a dict, raise ManifestLoadError on failure
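One possible shape for this loader, sketched under the assumption that PyYAML is available for the YAML case (the JSON path needs only the stdlib):

```python
# Hypothetical sketch of scripts/mindmodel/loader.py; names follow the plan.
import json
from pathlib import Path

try:
    import yaml  # PyYAML, assumed available; JSON-only fallback otherwise
except ImportError:
    yaml = None


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be read or parsed."""


def load_manifest(path: str) -> dict:
    p = Path(path)
    if not p.is_file():
        raise ManifestLoadError(f"manifest not found: {path}")
    text = p.read_text(encoding="utf-8")
    try:
        if p.suffix in (".yaml", ".yml") and yaml is not None:
            data = yaml.safe_load(text)
        else:
            data = json.loads(text)
    except Exception as exc:
        raise ManifestLoadError(f"cannot parse {path}: {exc}") from exc
    if not isinstance(data, dict):
        raise ManifestLoadError(f"manifest root must be a mapping: {path}")
    return data
```

Normalizing both formats to a plain dict keeps downstream checks format-agnostic.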

- Task 1.2: Low-level checks
  - Path: scripts/mindmodel/checks.py
  - Test: tests/scripts/mindmodel/test_checks.py
  - Behavior: file existence (without opening), truncated-snippet heuristics, manifest-text secret heuristics
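A sketch of what these checks might look like; the truncation threshold and the secret regex are illustrative guesses, not the repository's actual heuristics:

```python
# Hypothetical sketch of scripts/mindmodel/checks.py.
import os
import re

# Crude, illustrative secret pattern; a real validator would use a vetted list.
SECRET_RE = re.compile(r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*\S+")


def file_exists(path: str) -> bool:
    """Existence check only; never opens the file."""
    return os.path.isfile(path)


def looks_truncated(snippet: str) -> bool:
    """Heuristic: snippets cut mid-read often end in an ellipsis or run long."""
    s = snippet.rstrip()
    return s.endswith(("...", "…")) or len(s) >= 2000


def find_secret_like(text: str) -> list[str]:
    """Return manifest-text lines matching the crude secret pattern."""
    return [line for line in text.splitlines() if SECRET_RE.search(line)]
```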

## Batch 2: Core Modules (depends on Batch 1)

- Task 2.1: Constraint Validator (core)
  - Path: scripts/mindmodel/validator.py
  - Test: tests/scripts/mindmodel/test_validator.py
  - Behavior: load manifest, scan for secrets, verify referenced files exist, detect truncated snippets, produce a machine-readable report and exit codes: 0 ok, 1 warnings, 2 critical
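A minimal sketch of the core check loop and the exit-code mapping; the manifest field names (`constraints`, `evidence_path`, `evidence_snippet`) are assumptions about the orchestrator's output shape:

```python
# Hypothetical sketch of scripts/mindmodel/validator.py.
import os

# Exit-code contract from the plan: 0 ok, 1 warnings, 2 critical.
EXIT_OK, EXIT_WARN, EXIT_CRITICAL = 0, 1, 2


def validate(manifest: dict) -> dict:
    """Produce a machine-readable report; field names here are assumptions."""
    warnings, critical = [], []
    for c in manifest.get("constraints", []):
        cid = c.get("id", "<unknown>")
        path = c.get("evidence_path")
        if path and not os.path.isfile(path):
            critical.append(f"{cid}: referenced file missing: {path}")
        snippet = c.get("evidence_snippet", "")
        if snippet.rstrip().endswith("..."):
            warnings.append(f"{cid}: evidence snippet looks truncated")
    code = EXIT_CRITICAL if critical else (EXIT_WARN if warnings else EXIT_OK)
    return {"critical": critical, "warnings": warnings, "exit_code": code}
```

Critical findings (missing files) outrank warnings (truncated snippets), so CI can gate on exit code 2 while merely reporting 1.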

## Batch 3: Components (depends on Batch 2)

- Task 3.1: CLI wrapper for CI and local runs
  - Path: scripts/mindmodel/cli.py
  - Test: tests/scripts/mindmodel/test_cli.py
  - Behavior: simple wrapper delegating to validator; callable as python -m scripts.mindmodel.cli
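One way the wrapper could look; this sketch only shows the argparse shell and the skip-when-missing behavior, with the delegation to the (hypothetical) loader and validator modules left as a comment:

```python
# Hypothetical sketch of scripts/mindmodel/cli.py.
import argparse
import os
import sys


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="mindmodel-validate")
    parser.add_argument("manifest", nargs="?", default=".mindmodel/manifest.yaml")
    args = parser.parse_args(argv)
    if not os.path.isfile(args.manifest):
        # Per the Verification section, a missing manifest is a skip, not a failure.
        print(f"no manifest at {args.manifest}; skipping", file=sys.stderr)
        return 0
    # A real implementation would call loader.load_manifest(args.manifest),
    # pass the result to validator.validate, and return the report's exit code.
    return 0
```

A `python -m scripts.mindmodel.cli` entry point would just call `raise SystemExit(main())` under the usual `__main__` guard.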

## Batch 4: Integration / Configs / Docs (parallel)

- Task 4.1: CI workflow to run validator on PRs and scheduled checks
  - Path: .github/workflows/mindmodel-validate.yml
  - Behavior: run tests, then run validator against .mindmodel/manifest.yaml if present
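A sketch of that workflow under assumed action versions, Python version, and a weekly schedule; adjust the install step to match the repository's actual dependency setup:

```yaml
# Illustrative sketch of .github/workflows/mindmodel-validate.yml.
name: mindmodel-validate
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"  # weekly scheduled check (illustrative)
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"  # assumed; match the repo's version
      - run: pip install pytest pyyaml
      - run: pytest -q
      - name: Run mindmodel validator if manifest present
        run: |
          if [ -f .mindmodel/manifest.yaml ]; then
            python -m scripts.mindmodel.cli .mindmodel/manifest.yaml
          else
            echo "no manifest; skipping"
          fi
```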

- Task 4.2: .mindmodel/ README describing read-only policy
  - Path: .mindmodel/README.md
- Task 4.3: Add a minimal pre-commit config (trailing whitespace, eof fixer, check-yaml)
  - Path: .pre-commit-config.yaml
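A minimal config matching the three hooks named above; the pinned `rev` is illustrative and should be refreshed with `pre-commit autoupdate`:

```yaml
# Illustrative sketch of .pre-commit-config.yaml.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # illustrative pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```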

## Verification

- Each unit has a focused pytest test to validate behavior.
- CI will run the validator and the tests; the validator should skip if no manifest is present.
## Implementation Checklist

- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
- [ ] Add .github/workflows/mindmodel-validate.yml
- [ ] Add .mindmodel/README.md
- [ ] Add .pre-commit-config.yaml
## Next steps

1. Create the files above in small commits (one micro-task per commit).
2. Run the unit tests for each new module as it is added.
3. Open a small PR with the validator, CI workflow, and docs; ask reviewers to run the validator locally.