parent 9c82962d47
commit 2efd7ba3a0
@@ -0,0 +1,34 @@
# Naming & Style Conventions

## Rules

- Modules and files: snake_case.py. Evidence: pipeline/run_pipeline.py, database.py, ai_provider.py
- Functions and methods: snake_case. Evidence: compute_svd_for_window (pipeline), _generate_windows (pipeline/run_pipeline.py)
- Classes: PascalCase. Evidence: MotionDatabase (database.py)
- Constants: UPPER_SNAKE_CASE. Evidence: VOTE_MAP, DATABASE_PATH (config inferred)
- Import order: stdlib, third-party, local; prefer absolute imports, grouped with a blank line between groups.
- Use black, ruff, isort, and mypy as the recommended toolchain; the repository currently lacks config for them (no black/ruff sections in pyproject.toml and no dedicated config files).

## Examples

### Function example (from pipeline/run_pipeline.py)

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples."""
```

### Class example (from database.py)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        ...
```

## Anti-patterns

- Missing formatting configs (black, ruff, isort). Add pyproject.toml sections or dedicated config files.

## Remediations

- Add pyproject.toml tool sections for black/ruff/isort and a pre-commit config. Run a ruff/black lint step in CI.

## Evidence pointers

- pipeline/run_pipeline.py: function _generate_windows (lines ~1-120)
- database.py: MotionDatabase class and methods (lines 1-400+)
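A minimal sketch of the recommended pyproject.toml tool sections. The line length, rule selection, and target version shown here are assumptions, not settings taken from the repository:

```toml
[tool.black]
line-length = 100
target-version = ["py313"]

[tool.ruff]
line-length = 100

[tool.ruff.lint]
# E = pycodestyle errors, F = pyflakes, I = import sorting (isort-compatible)
select = ["E", "F", "I"]

[tool.isort]
profile = "black"

[tool.mypy]
python_version = "3.13"
ignore_missing_imports = true
```

With ruff's `I` rules enabled, a separate isort run becomes optional; the `[tool.isort]` section is kept only for editors that invoke isort directly.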
@@ -0,0 +1,74 @@
# Database Schema (DuckDB) — extracted DDL

## Rules

- Use DuckDB for persistent storage when available; fall back to JSON files when duckdb is not installed (database.py).
- Keep schema migrations additive (ALTER TABLE ADD COLUMN IF NOT EXISTS is used in database.py).

## Examples (DDL snippets extracted from database.py)

### motions table

```sql
CREATE TABLE IF NOT EXISTS motions (
    id INTEGER DEFAULT nextval('motions_id_seq'),
    title TEXT NOT NULL,
    description TEXT,
    date DATE,
    policy_area TEXT,
    voting_results JSON,
    winning_margin FLOAT,
    controversy_score FLOAT,
    layman_explanation TEXT,
    externe_identifier TEXT,
    body_text TEXT,
    url TEXT UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### mp_votes table

```sql
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
)
```

### embeddings / fused_embeddings

```sql
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER DEFAULT nextval('embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    model TEXT,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
```

## Anti-patterns

- Broad try/except around the duckdb import (top of database.py). Acceptable for an optional dependency, but it should log the missing dependency explicitly and document the fallback behavior for tests.

## Remediations

- Add a simple migration/versioning table (schema_version) to track schema changes and apply migrations deterministically.
- Add tests that exercise both the duckdb-backed and JSON-fallback database paths. Evidence: database.py contains the JSON fallback logic (lines ~1-80).

## Evidence pointers

- database.py: DDL strings and sequences (lines ~1-300 and beyond). See the CREATE TABLE blocks for motions, mp_votes, embeddings, fused_embeddings.
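The additive-migration rule above can be sketched as a helper that emits idempotent DDL instead of rewriting tables. This is a hypothetical illustration, not code from database.py; the column names are invented:

```python
# Sketch of the additive-migration rule: generate idempotent ALTER TABLE
# statements for new columns rather than recreating the table.
def additive_migration_ddl(table: str, new_columns: dict[str, str]) -> list[str]:
    """Return one ALTER TABLE ... ADD COLUMN IF NOT EXISTS statement per column."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {sql_type}"
        for name, sql_type in new_columns.items()
    ]

# Example: extend motions with two hypothetical columns.
ddl = additive_migration_ddl("motions", {"summary_nl": "TEXT", "vote_count": "INTEGER"})
```

Because every statement carries IF NOT EXISTS, re-running the migration against an already-updated database is a no-op.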
@@ -0,0 +1,22 @@
# Domain Glossary

## Rules

- Use consistent domain terms across code and DB: Motion, MP, Party, embedding, window, svd_vector, fused_embedding, similarity_cache, session_id.

## Terms

- Motion: parliamentary motion stored in the `motions` table. Evidence: database.py CREATE TABLE motions (lines ~40-110)
- MP (Member of Parliament): individual whose votes are stored in `mp_votes`. Evidence: database.py CREATE TABLE mp_votes
- Embedding: text embedding stored in the `embeddings` table; fused vectors live in `fused_embeddings`.
- SVD vector: reduced-dimensional vector stored in the `svd_vectors` table.
- Window: time-window identifier (e.g., "2024-Q1") used across the SVD/fusion pipelines. Evidence: pipeline/run_pipeline.py _generate_windows
- Controversy score: derived field stored on motions as controversy_score. Evidence: database.py insert_motion sets controversy_score

## Examples / Usage

- pipeline.run_pipeline._generate_windows produces window ids used when storing svd_vectors and fused_embeddings. Evidence: pipeline/run_pipeline.py lines ~1-120

## Evidence pointers

- database.py: motions, mp_votes, embeddings, fused_embeddings tables
- pipeline/run_pipeline.py: window generation and pipeline phases

## Anti-patterns

- Inconsistent naming of domain terms across modules (e.g., `mp_vote_parties` vs `mp_votes` between database.insert_motion and the pipeline extraction). Prefer canonical names matching DB columns and use small adapter functions when converting between representations.
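The "small adapter functions" recommendation can be sketched as follows. The input keys here are hypothetical scraper field names; only the output keys are taken from the mp_votes DDL:

```python
def to_mp_vote_row(raw: dict) -> dict:
    """Adapt a raw vote record to the canonical mp_votes column names.

    `raw` uses hypothetical upstream keys; the returned dict matches the
    mp_votes columns (mp_name, party, vote, date).
    """
    return {
        "mp_name": raw.get("name") or raw.get("mp_name"),
        "party": raw.get("party"),
        "vote": raw["vote"],  # required field: fail loudly if absent
        "date": raw.get("date"),
    }

row = to_mp_vote_row({"name": "J. Jansen", "party": "X", "vote": "Voor"})
```

Keeping the mapping in one function means the canonical names appear in exactly one place per source representation.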
@@ -0,0 +1,30 @@
# Code Clusters / Organization

## Rules

- The repository organizes code into the following clusters (observed):
  - UI / Streamlit: Home.py, pages/, app.py, explorer.py
  - Database & persistence: database.py, config.py
  - ETL / pipeline: pipeline/ (run_pipeline.py, svd_pipeline, text_pipeline, fusion)
  - AI provider & summarization: ai_provider.py, pipeline/..., analysis/
  - Similarity & caching: similarity/*, similarity_cache table in the DB
  - API client & scraping: api_client.py, pipeline/fetch_mp_metadata
  - Analysis & visualization: analysis/visualize.py, explorer.py
  - CLI & scheduler: scheduler.py, pipeline/run_pipeline.py
  - Tests & migrations: tests/ (pytest) and database reset helpers

## Examples

### Pipeline orchestrator (cluster: CLI & pipeline)

```python
from database import MotionDatabase

db = MotionDatabase(db_path)
# then phases: fetch_mp_metadata, extract_mp_votes, compute svd, ensure_text_embeddings, fuse_for_window
```

## Remediations

- Add a brief CONTRIBUTING.md describing where to add new pipeline stages and how to run tests locally. Include notes about the optional duckdb dependency and the JSON fallback used in tests.

## Evidence pointers

- pipeline/run_pipeline.py: orchestrator and cluster boundaries
- ai_provider.py: AI adapter for embeddings and chat
- analysis/visualize.py: visualization cluster
@@ -0,0 +1,46 @@
# Design Patterns & Code Patterns

## Rules

- Use a repository-style DB wrapper: MotionDatabase encapsulates DuckDB access and schema management.
- AI provider adapter pattern: ai_provider.py exposes get_embedding(s) and chat_completion with retry/backoff and a local fallback.
- Pipeline orchestration: run_pipeline.py runs in phases and uses a ThreadPoolExecutor for parallel SVD computation with careful DuckDB connection handling (results are collected before writes).

## Examples

### Repository pattern (database.py MotionDatabase)

```python
class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into database"""
        # uses duckdb.connect and parameterized queries
```

### Provider adapter with retries (ai_provider.py)

```python
def _post_with_retries(path: str, json: dict[str, Any], retries: int = 3) -> requests.Response:
    # Implements retries/backoff, handles 429 with Retry-After and 5xx responses
    ...
```

### Pipeline parallelism pattern (run_pipeline)

```python
with ThreadPoolExecutor(max_workers=max_workers) as pool:
    for window_id, w_start, w_end in windows:
        fut = pool.submit(compute_svd_for_window, db.db_path, window_id, w_start, w_end, args.svd_k)
        futures[fut] = window_id
# wait, then write sequentially to DuckDB
```

## Anti-patterns

- Broad excepts in several places (database.py top-level try/except on the duckdb import, many generic excepts around DB operations) can hide real errors.

## Remediations

- Replace broad `except Exception` with targeted exceptions and explicit logging. Where a fallback is intended (e.g., optional duckdb), log at INFO/DEBUG with a clear message and include guidance in CONTRIBUTING.md.

## Evidence pointers

- ai_provider.py: _post_with_retries, get_embedding(s), _local_embedding (lines ~1-300)
- pipeline/run_pipeline.py: ThreadPoolExecutor usage and duckdb connection handling (lines ~120-260)
- database.py: MotionDatabase methods
@@ -0,0 +1,24 @@
# Anti-patterns, Issues and Recommended Fixes

## Rules

- Flagged issues discovered in Phase 1 must be remediated with concrete actions.

## Issues

- pytest is listed as a runtime dependency (pyproject.toml). This increases image size and may pull dev-only transitive deps into production. Evidence: pyproject.toml
- openai is declared but no static imports were found; it may be unused. Evidence: pyproject.toml; ai_provider.py uses requests and env keys instead of openai imports.
- Many dependencies use permissive ">=" version ranges and no lockfile is present, which reduces reproducibility.
- Missing formatting/linting configs (black, ruff, isort, mypy). Recommended: add config and CI steps.
- Broad `except Exception` is used in many places (database.py, ai_provider.py fallback logic, analysis/visualize.py). This can mask bugs and slow debugging.

## Remediations / Recommended fixes

- Move pytest from runtime dependencies to dev dependencies in pyproject.toml.
  - Suggested patch: under [project.optional-dependencies] or [tool.poetry.dev-dependencies], depending on the toolchain.
- Audit `openai` usage. If unused, remove it from pyproject.toml. If it is imported dynamically at runtime, add a small shim or an explicit lazy import with a documented env var.
- Pin critical dependencies or add upper bounds; generate a lockfile (poetry.lock or pip-tools requirements.txt). Add a CI job that fails on permissive ranges.
- Add black/ruff/isort/mypy config blocks to pyproject.toml and enable pre-commit hooks. Add a CI lint stage.
- Replace broad `except Exception` with narrower catches; re-raise or log with traceback when the error is unexpected. Example locations: database.py top-level import, insert_motion's broad except, ai_provider fallback blocks.

## Evidence pointers

- pyproject.toml: dependencies list (lines 1-40)
- database.py: multiple broad except blocks (top of file and in methods)
- ai_provider.py: uses requests + env keys
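The recommended narrowing for the optional-dependency import can be sketched as a small helper. The module name and log message are illustrative; only the pattern (catch ImportError, log, return None) is the point:

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(name: str):
    """Import an optional dependency, logging (not silencing) its absence.

    Catches only ImportError, so real bugs inside the imported module's
    top-level code are not swallowed the way a bare `except Exception` would.
    """
    try:
        return importlib.import_module(name)
    except ImportError:  # narrow: only the expected failure mode
        logger.info("%s not installed; falling back to JSON file storage", name)
        return None
```

Call sites then read `duckdb = optional_import("duckdb")` and branch on `None`, making the fallback explicit at the import site.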
@@ -0,0 +1,117 @@
# Example Extractions

## Rules

- Include concrete examples extracted from the codebase: function signatures with docstrings, SQL DDL snippets, and pytest stubs following repository conventions.

## (a) Function signatures with docstrings (5 examples)

1) pipeline/run_pipeline.py::_generate_windows

```python
def _generate_windows(start: date, end: date, granularity: str) -> List[Tuple[str, str, str]]:
    """Return list of (window_id, start_str, end_str) tuples.

    window_id format:
        quarterly → "2024-Q1", "2024-Q2", …
        annual → "2024"
    """
```

2) database.py::append_audit_event

```python
def append_audit_event(
    self,
    actor_id: Optional[str],
    action: str,
    target_type: Optional[str] = None,
    target_id: Optional[str] = None,
    metadata: Optional[Dict] = None,
) -> bool:
    """Record an audit event. Tries DB then falls back to ledger file."""
```

3) ai_provider.py::get_embedding

```python
def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
```

4) ai_provider.py::get_embeddings_batch

```python
def get_embeddings_batch(
    texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
    """Return embedding vectors for multiple texts using batched API calls."""
```

5) analysis/visualize.py::plot_umap_scatter

```python
def plot_umap_scatter(
    motion_ids: List[int],
    coords: List[List[float]],
    labels: Optional[List[int]] = None,
    window_id: Optional[str] = None,
    output_path: str = "analysis_umap.html",
) -> str:
    """Produce a 2D scatter plot of UMAP-reduced fused embeddings."""
```

## (b) SQL / DDL snippets (3 examples inferred from database.py)

1) motions table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE motions (lines ~40-110)

2) mp_votes table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE mp_votes

3) fused_embeddings table (see constraints/10-db-schema.yaml) — evidence: database.py CREATE TABLE fused_embeddings

## (c) Pytest stubs (4 sample tests matching conventions)

Create tests under tests/ named test_*.py using fixtures in conftest.py. The examples below are stubs to add.

1) tests/test_database_basic.py

```python
def test_init_database_creates_tables(tmp_path):
    db_path = str(tmp_path / "motions.db")
    from database import MotionDatabase

    db = MotionDatabase(db_path=db_path)
    # If duckdb is not available, the JSON fallback should create .embeddings.json
    assert db is not None
```

2) tests/test_ai_provider.py

```python
def test_local_embedding_fallback():
    from ai_provider import _local_embedding

    v = _local_embedding("hello world", dim=16)
    assert isinstance(v, list) and len(v) == 16
```

3) tests/test_pipeline_windows.py

```python
from pipeline.run_pipeline import _generate_windows


def test_generate_quarterly_windows():
    from datetime import date

    start = date(2024, 1, 1)
    end = date(2024, 3, 31)
    windows = _generate_windows(start, end, "quarterly")
    assert any(w[0].endswith("Q1") for w in windows)
```

4) tests/test_visualize_plot.py

```python
def test_plot_umap_scatter_no_plotly():
    # If plotly is missing, _require_plotly should raise ImportError with guidance
    import analysis.visualize as vis

    try:
        vis._require_plotly()
    except ImportError as exc:
        assert "plotly" in str(exc).lower()
```

## Evidence pointers

- Function docstrings: pipeline/run_pipeline.py, ai_provider.py, analysis/visualize.py, database.py
- DDL: database.py CREATE TABLE blocks
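The _generate_windows signature and docstring above can be fleshed out into a runnable sketch. This is a plausible reimplementation for the quarterly case only, not the actual code from pipeline/run_pipeline.py, so it is given a distinct name:

```python
from datetime import date, timedelta
from typing import List, Tuple


def generate_quarterly_windows(start: date, end: date) -> List[Tuple[str, str, str]]:
    """Sketch: return (window_id, start_str, end_str) per quarter overlapping [start, end]."""
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while date(year, 3 * quarter - 2, 1) <= end:
        q_start = date(year, 3 * quarter - 2, 1)
        # Last day of the quarter: first day of the next quarter minus one day.
        if quarter == 4:
            next_start = date(year + 1, 1, 1)
        else:
            next_start = date(year, 3 * quarter + 1, 1)
        q_end = next_start - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", q_start.isoformat(), q_end.isoformat()))
        quarter += 1
        if quarter > 4:
            quarter, year = 1, year + 1
    return windows
```

Window ids match the "2024-Q1" format documented in the docstring, so the same ids can key rows in svd_vectors and fused_embeddings.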
@@ -0,0 +1,43 @@
# Stack and Dependencies

## Rules

- Primary language: Python >=3.13 (evidence: pyproject.toml requires-python = ">=3.13")
- Application: Streamlit app (streamlit >=1.48.0). Entrypoint: Home.py (CMD: streamlit run Home.py). Evidence: Home.py, pages/1_Stemwijzer.py, pyproject.toml, Dockerfile
- Database: DuckDB + Ibis (duckdb>=1.3.2, ibis-framework[duckdb]>=10.8.0). Evidence: pyproject.toml, database.py
- ML: scikit-learn, umap-learn, scipy. Evidence: pyproject.toml, pipeline/svd.py, analysis/

## Examples

### pyproject dependencies (evidence: pyproject.toml)

```toml
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
    "scikit-learn>=1.8.0",
    "beautifulsoup4>=4.14.3",
    "lxml>=6.0.2",
]
```

## Anti-patterns / Notes

- pytest is listed under runtime dependencies in pyproject.toml. Move pytest to dev dependencies to avoid shipping the test runner in production images. Evidence: pyproject.toml
- Many dependencies use permissive ">=" ranges. Recommend pinning or generating a lockfile (poetry.lock/requirements.txt) and adding upper bounds for reproducibility.
- openai is declared but no static imports were found; it is possibly unused (evidence: pyproject.toml; ai_provider.py uses requests and environment keys instead of openai).

## Remediations

- Move test-only libraries (pytest) to dev dependencies in pyproject.toml.
- Add a lockfile and a CI step that checks for pinned dependencies.
- Audit declared-but-unused packages (openai) and remove them or confirm dynamic usage.

## Evidence pointers

- pyproject.toml: full dependency list (lines 1-40)
- Home.py: streamlit usage and app entry
- database.py: duckdb table creation and connection (lines ~1-350)
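The "CI step that checks for pinned dependencies" could start as a small filter over the requirement strings. A sketch; a real CI job would read the list from pyproject.toml via tomllib rather than hard-coding it:

```python
def permissive_deps(dependencies: list[str]) -> list[str]:
    """Return requirement strings that use '>=' without any upper bound or exact pin."""
    return [
        dep for dep in dependencies
        if ">=" in dep and "<" not in dep and "==" not in dep
    ]


# Illustrative inputs: one unbounded range, one bounded range, one exact pin.
flagged = permissive_deps(["duckdb>=1.3.2", "requests>=2.32.4,<3", "streamlit==1.48.0"])
```

A CI step would then fail when `flagged` is non-empty and print the offending requirements.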
@@ -0,0 +1,5 @@
# Mindmodel constraints README

Files in .mindmodel/constraints/ are YAML-like constraint documents describing
conventions, patterns and remediation steps. Use these to guide PR reviews and
CI automation.
@@ -1,60 +1,36 @@
name: stemwijzer
version: 2
summary: >-
  Mindmodel constraints for the Stemwijzer repository (Python + Streamlit +
  DuckDB). Captures tech stack, conventions, DB schema, clusters, patterns,
  anti-patterns and example extractions. Generated from Phase 1 analysis.
main_patterns:
  - Repository DB wrapper (MotionDatabase)
  - AI provider adapter with retry/backoff and local fallback
  - SVD + embedding fusion pipeline with windowed processing
total_files: 11
categories:
  - path: stack.yaml
    description: Project technology stack (languages, frameworks, runtime)
  - path: .mindmodel/constraints/99-stack.yaml
    description: Runtime tech stack and primary dependencies (Python, Streamlit, DuckDB, Ibis)
    group: stack
  - path: dependencies.yaml
    description: Declared and recommended dependencies grouped by purpose
    group: stack
  - path: system.md
    description: System overview and architecture high-level notes
    group: architecture
  - path: architecture.yaml
    description: Architectural layers, organization and confidence levels
    group: architecture
  - path: conventions.yaml
    description: Coding conventions cheat-sheet (naming, imports, types)
    group: style
  - path: domain-glossary.yaml
    description: Business domain glossary for the project
  - path: .mindmodel/constraints/01-naming.yaml
    description: Naming, import and style conventions
    group: conventions
  - path: .mindmodel/constraints/10-db-schema.yaml
    description: DuckDB schema DDL extracted from database.py
    group: database
  - path: .mindmodel/constraints/20-domain-glossary.yaml
    description: Domain glossary and terminology (motions, MP, embeddings, windows)
    group: domain
  - path: patterns/duckdb_access.yaml
    description: DuckDB access patterns, examples, and anti-patterns
    group: patterns
  - path: patterns/requests_http.yaml
    description: Requests/HTTP client usage and retry best-practices
    group: patterns
  - path: patterns/embeddings_similarity.yaml
    description: Embedding, SVD, fusion and similarity pipeline patterns
    group: patterns
  - path: patterns/error_handling.yaml
    description: Error handling patterns and rules
    group: patterns
  - path: patterns/validation.yaml
    description: Input/domain validation patterns and examples
    group: patterns
  - path: patterns/module_singletons.yaml
    description: Module-level singletons and lifecycle patterns
    group: patterns
  - path: anti-patterns.yaml
    description: Known anti-patterns and remediation steps
    group: patterns
  - path: examples/pattern-examples.md
    description: Consolidated extracted code examples across patterns
    group: patterns
  - path: constraints/naming.yaml
    description: Enforce naming rules (snake_case, PascalCase, constants)
    group: constraints
  - path: constraints/imports.yaml
    description: Enforce import grouping and ordering
    group: constraints
  - path: constraints/db_connection.yaml
    description: Rules for opening/closing DB connections and read-only usage
    group: constraints
  - path: constraints/error_handling.yaml
    description: Error handling style and allowed exception scopes
    group: constraints
  - path: constraints/testing.yaml
    description: Test conventions (pytest, test naming, fixtures)
    group: constraints
  - path: .mindmodel/constraints/30-clusters.yaml
    description: Code clusters and module organization
    group: architecture
  - path: .mindmodel/constraints/40-patterns.yaml
    description: Design patterns and coding patterns observed with examples
    group: patterns
  - path: .mindmodel/constraints/50-anti-patterns.yaml
    description: Anti-patterns, issues and recommended remediations
    group: ops
  - path: .mindmodel/constraints/60-examples.yaml
    description: "Example extractions: function signatures, SQL DDL snippets, pytest stubs"
    group: examples
@@ -1,18 +1,14 @@
# System overview
# System Overview: Stemwijzer

This project is a Streamlit-based UI and data-processing pipeline that computes embeddings,
performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes
a similarity cache for quick lookup in the UI.
This mindmodel documents constraints, conventions and patterns for the Stemwijzer
project (Python Streamlit app with DuckDB-backed pipeline for parliamentary
motions embedding analysis).

Key subsystems:
- UI: Streamlit pages (Home.py, pages/*). Exposes interactive explorer and quizzes.
- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData).
- Processing pipelines: pipeline/* (text embeddings, SVD, fusion).
- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors.
- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb.
- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API).
Key points:
- Language: Python >=3.13
- UI: Streamlit multi-page app (Home.py, pages/)
- Storage: DuckDB with JSON fallback for tests/dev (database.py)
- Pipeline: ETL and SVD/text fusion pipeline (pipeline/run_pipeline.py)
- AI: ai_provider adapter uses HTTP-based OpenRouter/OpenAI-compatible API with retry/backoff and local fallback

Operational notes:
- Dockerfile exists; Streamlit default port 8501 exposed.
- Tests use pytest. CI uses Drone (.drone.yml).
- There is no lockfile present in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.
Use the .mindmodel/ constraints files to guide code changes, CI, and onboarding.
@@ -0,0 +1,67 @@
"""Simple manifest loader for mindmodel manifests.

Provides `load_manifest(path: str) -> dict` and `ManifestLoadError`.

Behavior:
- If PyYAML is installed, uses yaml.safe_load to parse the file.
- Otherwise falls back to the stdlib json parser.
- If the top-level document is a list it will be normalized to {"constraints": <list>}.
- Raises ManifestLoadError for missing file or parse errors.
"""

from typing import Any, Dict
import json
from pathlib import Path


class ManifestLoadError(Exception):
    """Raised when a manifest cannot be loaded or parsed."""


try:
    import yaml  # type: ignore
except ImportError:  # PyYAML not available
    yaml = None  # type: ignore


def _parse_with_yaml(text: str) -> Any:
    # yaml.safe_load may return any Python structure
    try:
        return yaml.safe_load(text)
    except Exception as exc:  # pragma: no cover - defensive
        raise ManifestLoadError(f"YAML parse error: {exc}") from exc


def _parse_with_json(text: str) -> Any:
    try:
        return json.loads(text)
    except Exception as exc:
        raise ManifestLoadError(f"JSON parse error: {exc}") from exc


def load_manifest(path: str) -> Dict[str, Any]:
    """Load a manifest from the given file path and normalize it to a dict.

    If the top-level document is a list, it will be returned as {"constraints": list}.
    Raises ManifestLoadError if the file does not exist or if parsing fails.
    """
    p = Path(path)
    if not p.exists():
        raise ManifestLoadError(f"Manifest file not found: {path}")

    text = p.read_text(encoding="utf-8")

    if yaml is not None:
        data = _parse_with_yaml(text)
    else:
        data = _parse_with_json(text)

    # Normalize
    if isinstance(data, list):
        return {"constraints": data}

    if isinstance(data, dict):
        return data

    # Unexpected top-level type, wrap it
    return {"manifest": data}
@@ -0,0 +1,21 @@
import json

import pytest

from scripts.mindmodel import loader


def test_load_json_manifest(tmp_path):
    data = [{"id": "c1", "description": "a constraint"}]
    p = tmp_path / "manifest.json"
    p.write_text(json.dumps(data), encoding="utf-8")

    loaded = loader.load_manifest(str(p))

    assert isinstance(loaded, dict)
    assert "constraints" in loaded
    assert any(c.get("id") == "c1" for c in loaded["constraints"])


def test_missing_manifest_raises():
    with pytest.raises(loader.ManifestLoadError):
        loader.load_manifest("nonexistent-file-manifest.json")
@ -0,0 +1,73 @@ |
||||
--- |
||||
date: 2026-03-24 |
||||
topic: "mindmodel-generation" |
||||
status: draft |
||||
--- |
||||
|
||||
## Problem Statement |
||||
|
||||
We generated a .mindmodel/ snapshot for this repository using an automated orchestrator. The output includes inferred constraints, patterns, schema snippets, and remediation recommendations. We need a short, validated design that explains what was produced, how to verify and integrate it safely, and a recommended next set of changes (low-risk remediation and CI additions). |
||||
|
||||
## Constraints |
||||
|
||||
**Non-negotiables:** |
||||
- Keep the generated .mindmodel/ files read-only until validated. |
||||
- Do not make behavioral changes to production code in the same change as model metadata updates. |
||||
- Avoid committing secrets or lockfiles without explicit review. |
||||
|
||||
**Limitations:** |
||||
- The orchestrator used heuristic file reads; some evidence pointers may be truncated or approximate. |
||||
- No poetry.lock / requirements.txt or CI workflows were found; dependency remediation must be conservative. |
||||
|
||||
## Approach |
||||
|
||||
I'm choosing an **audit-first, incremental integration** approach because the generated artifacts are high-value policy documents but rely on evidence that needs verification. We will: (1) validate evidence pointers and missing files, (2) mark fixes for trivial issues (move pytest to dev-deps, add formatter configs) in a small non-invasive PR, (3) integrate the .mindmodel/ into the repo and add a CI lint step that validates the manifest, and (4) iterate on higher-risk changes after tests pass. |
||||
|
||||
Alternatives considered: |
||||
- Accept-and-commit everything immediately (faster) — rejected because of truncated reads and potential wrong pointers. |
||||
- Manual rewrite of constraints by hand (accurate) — rejected due to time cost; validation + targeted fixes gives best ROI. |
||||
|
||||
## Architecture |
||||
|
||||
This is a documentation/metadata integration task, not a runtime service. Components: |
||||
|
||||
- **.mindmodel/**: constraint files and manifest produced by orchestrator. Source of truth for conventions and inferred patterns. |
||||
- **Validator job (CI)**: lightweight script/CI step that verifies manifest consistency, required files exist, and key evidence pointers resolve. |
||||
- **Small remediation PRs**: conservative code/config edits (pyproject tweaks, add black/ruff/isort configs, pre-commit) that enable future automation. |
||||
|
||||
## Components |
||||
|
||||
- Constraint Validator: verifies every .mindmodel/ constraint references existing files; flags truncated evidence ranges; ensures no secrets. |
||||
- Staging branch: holds small remediation commits; each commit is limited to one class of change (deps dev/prod move, linters, CI yaml). |
||||
- CI pipeline changes: add a validation job and a docs check that ensures .mindmodel/ manifest is up to date. |
||||
|
||||
## Data Flow |
||||
|
||||
1. Orchestrator output (.mindmodel/) exists in the working tree. |
||||
2. Validator runs locally or in CI to check pointers and file existence. |
||||
3. Developer reviews validator report and accepts/edits constraint files. |
||||
4. Remediation PRs are opened for low-risk fixes. |
||||
5. CI runs tests + validator; on green we merge and enable scheduled checks. |
||||
|
||||
## Error Handling

- Validator failures are non-blocking for mainline but must be resolved before we rely on constraints for automation.
- If a constraint references a deleted or moved file, mark the constraint as "needs-review" in the manifest and leave the file unchanged.
- For ambiguous evidence (truncated reads), add an explicit comment in the constraint file pointing the reviewer at the gap.
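
As a sketch of the "needs-review" convention, with illustrative field names (the actual manifest schema may differ), a constraint whose evidence file has moved might be marked like:

```yaml
- id: naming-snake-case
  status: needs-review   # evidence file moved or deleted; re-point before relying on this
  evidence:
    path: pipeline/run_pipeline.py
    note: "path did not resolve at validation time; see validator report"
```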
|
## Testing Strategy

- Unit: small pytest tests that assert README/pyproject presence and that the manifest YAML parses.
- Integration: a CI job that runs the Constraint Validator and fails on missing files or secrets.
- Manual: a reviewer inspects a sample of constraint files (3-5) for accuracy before merging.

## Open Questions

- Do we want the validator to auto-fix trivial issues (reformatting YAML paths) or only report? I'm leaning toward report-only for safety.
- Should .mindmodel/ be protected by branch policy or just reviewed by humans? Recommend human review plus a CI check, not a protected branch yet.

## Next Steps (what I'll do now)

1. Create this design doc (done).
2. Commit the design doc to the repo (doing now).
3. Spawn the planner to create a step-by-step implementation plan based on this design (spawning now).

---
date: 2026-03-24
topic: "mindmodel-generation"
status: draft
---

# Implementation Plan: mindmodel-generation

Goal: implement a lightweight, safe Constraint Validator for the generated .mindmodel/ snapshot, plus small CI and config artifacts, so the manifest can be validated and integrated incrementally and safely.

Design reference: thoughts/shared/designs/2026-03-24-mindmodel-generation-design.md

---

## Overview

This plan breaks the work into four batches: Foundation, Core, Components, and Integration/Configs. Each micro-task is small and independently testable, and tests accompany the core modules. The validator intentionally avoids reading repository secret files and scans only manifest text and evidence snippets.

## Batch 1: Foundation (parallel)

- Task 1.1: Manifest loader
  - Path: scripts/mindmodel/loader.py
  - Test: tests/scripts/mindmodel/test_loader.py
  - Behavior: load a YAML or JSON manifest, normalize it to a dict, and raise ManifestLoadError on failure
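
A minimal sketch of the Task 1.1 behavior. `ManifestLoadError` comes from the task description; the JSON branch is shown for brevity, and the real loader would also accept YAML (e.g. via PyYAML's `yaml.safe_load`):

```python
import json
from pathlib import Path


class ManifestLoadError(Exception):
    """Raised when the manifest cannot be read or parsed into a mapping."""


def load_manifest(path: str) -> dict:
    """Load a JSON manifest and normalize it to a plain dict.

    Sketch only: the real Task 1.1 loader would also try YAML before
    giving up. Any read or parse failure becomes ManifestLoadError.
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise ManifestLoadError(f"{path}: {exc}") from exc
    if not isinstance(data, dict):
        raise ManifestLoadError(f"{path}: expected a mapping at top level")
    return data
```

Normalizing to a plain dict keeps the downstream checks independent of the on-disk format.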
|
- Task 1.2: Low-level checks
  - Path: scripts/mindmodel/checks.py
  - Test: tests/scripts/mindmodel/test_checks.py
  - Behavior: file-existence checks (without opening files), truncated-snippet heuristics, and manifest-text secret heuristics
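
The Task 1.2 checks are heuristics, not guarantees. A sketch with illustrative patterns and truncation markers (the real regexes and thresholds are open to tuning):

```python
import re
from pathlib import Path

# Illustrative heuristics only; tune before relying on them.
_SECRET_RE = re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*\S+")
_TRUNCATION_MARKERS = ("...", "\u2026", "<truncated>")


def file_exists(path: str) -> bool:
    """Check existence via stat only, without opening the file."""
    return Path(path).exists()


def looks_truncated(snippet: str) -> bool:
    """Flag evidence snippets that end in a truncation marker."""
    return snippet.rstrip().endswith(_TRUNCATION_MARKERS)


def looks_like_secret(text: str) -> bool:
    """Flag manifest text that appears to embed a credential."""
    return bool(_SECRET_RE.search(text))
```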
|
## Batch 2: Core Modules (depends on Batch 1)

- Task 2.1: Constraint Validator (core)
  - Path: scripts/mindmodel/validator.py
  - Test: tests/scripts/mindmodel/test_validator.py
  - Behavior: load the manifest, scan for secrets, verify that referenced files exist, detect truncated snippets, and produce a machine-readable report with exit codes: 0 ok, 1 warnings, 2 critical
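
A sketch of the report and exit-code contract. The manifest shape here (`constraints` entries with `evidence.path` and `evidence.snippet`) is an assumption for illustration, and the Batch 1 checks are folded inline:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Report:
    warnings: list = field(default_factory=list)
    criticals: list = field(default_factory=list)

    @property
    def exit_code(self) -> int:
        # Contract from Task 2.1: 0 ok, 1 warnings only, 2 any critical.
        if self.criticals:
            return 2
        return 1 if self.warnings else 0


def validate(manifest: dict) -> Report:
    """Check each constraint's evidence pointer; assumes an already-loaded dict."""
    report = Report()
    for constraint in manifest.get("constraints", []):
        evidence = constraint.get("evidence", {})
        path = evidence.get("path")
        if path and not Path(path).exists():
            # Missing evidence file: the constraint cannot be trusted.
            report.criticals.append(f"{constraint.get('id')}: missing file {path}")
        snippet = evidence.get("snippet", "")
        if snippet.rstrip().endswith(("...", "\u2026")):
            # Truncated read: flag for human review, not failure.
            report.warnings.append(f"{constraint.get('id')}: snippet looks truncated")
    return report
```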
|
## Batch 3: Components (depends on Batch 2)

- Task 3.1: CLI wrapper for CI and local runs
  - Path: scripts/mindmodel/cli.py
  - Test: tests/scripts/mindmodel/test_cli.py
  - Behavior: a simple wrapper delegating to the validator, callable as python -m scripts.mindmodel.cli
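
A sketch of the wrapper showing the skip-if-absent behavior the CI job relies on; the delegation to the loader and validator modules is indicated in comments:

```python
import argparse
import sys
from pathlib import Path


def main(argv=None) -> int:
    """Parse args, delegate to the validator, and return its exit code."""
    parser = argparse.ArgumentParser(prog="python -m scripts.mindmodel.cli")
    parser.add_argument("manifest", nargs="?", default=".mindmodel/manifest.yaml")
    args = parser.parse_args(argv)
    if not Path(args.manifest).exists():
        # CI contract: skip cleanly (exit 0) when no manifest is present.
        print(f"no manifest at {args.manifest}; skipping")
        return 0
    # Sketch only; the real module would delegate:
    #   manifest = loader.load_manifest(args.manifest)
    #   return validator.validate(manifest).exit_code
    return 0


if __name__ == "__main__":
    sys.exit(main())
```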
|
## Batch 4: Integration / Configs / Docs (parallel)

- Task 4.1: CI workflow to run the validator on PRs and scheduled checks
  - Path: .github/workflows/mindmodel-validate.yml
  - Behavior: run tests, then run the validator against .mindmodel/manifest.yaml if present
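
A sketch of the workflow, assuming GitHub Actions; the action pins, Python version, and requirements path are placeholders to adjust for this repo:

```yaml
name: mindmodel-validate
on:
  pull_request:
  schedule:
    - cron: "0 6 * * 1"   # weekly scheduled check
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt pytest
      - run: pytest tests/scripts/mindmodel
      - run: python -m scripts.mindmodel.cli .mindmodel/manifest.yaml
```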
|
- Task 4.2: .mindmodel/ README describing the read-only policy
  - Path: .mindmodel/README.md
|
- Task 4.3: Add a minimal pre-commit config (trailing whitespace, EOF fixer, check-yaml)
  - Path: .pre-commit-config.yaml
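
A minimal config matching Task 4.3; the `rev` pin is illustrative, so pin whatever version is current when adding it:

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
```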
|
## Verification

- Each unit has a focused pytest test that validates its behavior.
- CI will run the validator and the tests; the validator should skip when no manifest is present.

## Implementation Checklist

- [ ] Add scripts/mindmodel/loader.py + tests/scripts/mindmodel/test_loader.py
- [ ] Add scripts/mindmodel/checks.py + tests/scripts/mindmodel/test_checks.py
- [ ] Add scripts/mindmodel/validator.py + tests/scripts/mindmodel/test_validator.py
- [ ] Add scripts/mindmodel/cli.py + tests/scripts/mindmodel/test_cli.py
- [ ] Add .github/workflows/mindmodel-validate.yml
- [ ] Add .mindmodel/README.md
- [ ] Add .pre-commit-config.yaml

## Next steps

1. Create the files above in small commits (one micro-task per commit).
2. Run unit tests for each new module as it is added.
3. Open a small PR with the validator, CI workflow, and docs; ask reviewers to run the validator locally.