Generated mindmodel covering stack, architecture, domain glossary, coding conventions, DuckDB/requests/embeddings/error-handling patterns, anti-patterns, and explicit constraints for naming, imports, DB access, error handling, and testing.

@@ -0,0 +1,43 @@
# Known anti-patterns and recommended remediation (Phase 1 findings)

anti_patterns:
  - id: broad_except_swallows_errors
    description: "Wide 'except:' clauses that swallow exceptions without logging or re-raising."
    examples:
      - path: multiple
        note: "Observed in various pipeline and ingestion spots where 'except Exception:' returns a default without context."
    remediation:
      - "Replace broad except clauses with specific exceptions."
      - "When a broad except is absolutely needed, call logger.exception(...) and re-raise or convert to a typed domain error."
      - "Add unit tests to ensure critical errors are visible in CI logs."

  - id: mixed_print_and_logging
    description: "Mixing print() and the logging module for errors and info messages."
    examples:
      - path: api_client.py
        excerpt: |
          ```python
          print(f"Fetched {len(voting_records)} voting records from API")
          ...
          except Exception as e:
              print(f"Error fetching motions from API: {e}")
          ```
    remediation:
      - "Use logging.getLogger(__name__) and logger.info/warning/exception consistently."
      - "Add a top-level logging configuration for Streamlit and scripts."

  - id: no_lockfile
    description: "No lockfile present -> unreproducible installs and CI unpredictability."
    remediation:
      - "Add a lockfile (poetry.lock, requirements.txt produced by pip-tools) and pin versions in CI."
      - "Make CI use the lockfile for reproducible builds."

  - id: declared_but_unused_dependency
    description: "Dependency declared but unused (openai in pyproject)."
    remediation:
      - "Either remove the dependency or add clear adapter code/tests that exercise it. Keep pyproject tidy."

  - id: brittle_identity_heuristics
    description: "Heuristics for MP identity (comma-based parsing) are brittle."
    remediation:
      - "Add robust parsing rules and unit tests; prefer canonical identifiers (persoon_id) where available."

@@ -0,0 +1,35 @@
# Architecture overview and confidence levels

layers:
  - name: ui
    description: "Streamlit pages and app entrypoints (Home.py, pages/*)."
    confidence: high
  - name: ingestion
    description: "API client and scrapers (api_client.py, scraper.py)."
    confidence: high
  - name: processing
    description: "Pipelines for embeddings, SVD, fusion (pipeline/*, similarity/*)."
    confidence: high
  - name: storage
    description: "DuckDB primary store; JSON fallback used in tests when duckdb is missing."
    confidence: high
  - name: ai_provider
    description: "Lightweight HTTP wrapper around OpenRouter/OpenAI-style backends in ai_provider.py."
    confidence: medium
  - name: orchestration
    description: "Script-based orchestration (scripts/*.py), rerun_embeddings, scheduler."
    confidence: medium

organization:
  - Keep UI code separated from heavy compute; Streamlit runs should avoid heavy compute inline (use a subprocess or a scheduled job).
  - Pipelines are implemented as re-entrant functions returning summary dicts to facilitate testing and subprocess usage (seen in svd_pipeline.compute_svd_for_window; a sketch of this shape follows below).
  - DB access is centralised via the MotionDatabase helper (database.py) with convenience methods (store_fused_embedding, append_audit_event).
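
# Illustrative sketch (assumption, not extracted code) of the re-entrant pipeline shape
# described above; the real signature of svd_pipeline.compute_svd_for_window may differ.
organization_example: |
  ```python
  def compute_svd_for_window(db_path: str, window_id: str, n_components: int = 50) -> dict:
      """Compute SVD vectors for one window and return a summary dict for callers/tests."""
      summary = {"window_id": window_id, "rows_read": 0, "vectors_written": 0}
      # ... read votes via a read-only connection, compute SVD, write vectors
      #     through short-lived write connections, updating the counters ...
      return summary
  ```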

design_decisions:
  - Use DuckDB for local fast analytics storage; read_only connections used in compute stages to allow parallel workers.
  - Embeddings and similarity cache are stored as JSON in DuckDB tables (vector columns).
  - The ai_provider uses requests with retry/backoff rather than a heavy SDK to keep testing simple.

confidence_summary:
  overall_confidence: high
  notes: "Phase 1 input inspected files across the repo; design mapping is consistent with code samples."

@@ -0,0 +1,29 @@
# DB connection handling constraints

rules:
  - name: use_context_managers_for_connections
    rule: "Prefer using 'with duckdb.connect(path, read_only=...) as conn' for scoped DB interactions where possible."
    rationale: "Ensures proper resource cleanup and avoids connection leaks."

  - name: read_only_for_compute
    rule: "Use read_only=True for compute steps that only read data (SVD, similarity compute)."
    rationale: "Allows safe parallel workers and reduces write contention."

  - name: short_lived_writes
    rule: "When performing database writes, open short-lived connections, commit quickly and close."
    rationale: "Avoids long-lived transactions and reduces lock windows."

examples:
  - path: pipeline/svd_pipeline.py
    snippet: |
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(...).fetchall()
      finally:
          conn.close()
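
  # Illustrative sketch (not extracted from the repository): the same read path written
  # in the preferred context-manager form from the first rule above.
  - note: "Sketch of a scoped read-only connection"
    snippet: |
      with duckdb.connect(db_path, read_only=True) as conn:
          rows = conn.execute("SELECT motion_id, mp_name, vote FROM mp_votes").fetchall()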

anti_patterns_and_remediations:
  - bad: "Creating a global connection at import time that performs migrations."
    remediation: "Move migrations to an explicit init function that runs at deployment/upgrade time."
  - bad: "Not closing connections on exceptions."
    remediation: "Wrap connects in `with` blocks or close them in finally: conn.close() blocks."

@@ -0,0 +1,36 @@
# Error handling style rules (YAML constraint example)

rules:
  - name: explicit_exceptions
    rule: "Raise explicit exceptions (ValueError, ProviderError) for known error conditions rather than returning magic values."
    examples:
      - good: |
          if not isinstance(text, str):
              raise ProviderError('text must be a string')
      - bad: |
          if not isinstance(text, str):
              return []

  - name: avoid_broad_except
    rule: "Avoid 'except Exception:' that swallows errors. If broad except is used for best-effort, log the exception with logger.exception and re-raise or convert."
    examples:
      - bad: |
          try:
              do_work()
          except Exception:
              return []
      - remediation: |
          try:
              do_work()
          except SpecificError as exc:
              logger.warning('Handled error: %s', exc)
              raise

  - name: logging_over_print
    rule: "Prefer logger.* over print() for messages and errors."
    examples:
      - bad: "print('Error fetching motions from API: %s' % e)"
      - good: "logger.exception('Error fetching motions from API')"

enforcement_examples:
  - "Add a static code check to flag 'print(' in modules (except in simple scripts) and 'except Exception:' usages without logger.exception."

@@ -0,0 +1,24 @@
# Import grouping and ordering constraints

rules:
  - name: grouping
    rule: "Group imports in three sections separated by a single blank line: stdlib, third-party, local."
    examples:
      - good: |
          import json
          import logging

          import requests
          import duckdb

          from .pipeline import text_pipeline
      - bad: |
          import duckdb
          import json
          from pipeline import text_pipeline

  - name: from_imports
    rule: "Prefer 'from x import y' only when it improves clarity or avoids a circular import; otherwise import the module and reference attributes."

enforcement_examples:
  - "Run isort or ruff's import-sorting rules in pre-commit or CI to enforce ordering."

@@ -0,0 +1,30 @@
# Naming constraint rules (example constraint file)

rules:
  - name: module_file_names
    rule: "Use snake_case for Python module filenames (e.g., text_pipeline.py, ai_provider.py)."
    examples:
      - good: "text_pipeline.py"
      - bad: "TextPipeline.py"

  - name: function_names
    rule: "Use snake_case for functions and methods."
    examples:
      - good: "def compute_similarities(...):"
      - bad: "def ComputeSimilarities(...):"

  - name: class_names
    rule: "Use PascalCase for classes."
    examples:
      - good: "class MotionDatabase:"
      - bad: "class motion_database:"

  - name: constants
    rule: "Constants use UPPER_SNAKE_CASE."
    examples:
      - good: "VOTE_MAP = { ... }"
      - bad: "vote_map = { ... }"

enforcement_examples:
  - "Add a linter rule in CI: ruff or flake8 naming plugin to detect violations."
  - "Run `python -m pip install ruff` and `ruff check` as part of CI."

@@ -0,0 +1,26 @@
# Testing conventions constraint (YAML)

rules:
  - name: test_naming
    rule: "Use pytest and name tests test_*.py and test_* functions."
    examples:
      - good: "tests/test_text_pipeline.py"
      - bad: "tests/text_pipeline_test.py"

  - name: fixtures_and_conftest
    rule: "Place shared fixtures in tests/conftest.py or tests/fixtures/ for reuse."
    examples:
      - good: "use fixtures declared in tests/conftest.py"

  - name: assert_raises
    rule: "Explicitly assert expected exceptions with pytest.raises for invalid input."
    examples:
      - good: |
          import pytest

          def test_invalid_input():
              with pytest.raises(ValueError):
                  function_under_test('bad')

enforcement_examples:
  - "Run pytest in CI; fail if tests don't run or if there are regressions."

@@ -0,0 +1,32 @@
# Coding conventions cheat-sheet (extracted from Phase 1)

naming:
  module_files: snake_case (e.g., text_pipeline.py, ai_provider.py)
  functions: snake_case
  classes: PascalCase
  constants: UPPER_SNAKE_CASE
  module_singletons: module-level instances, named lower_snake (e.g., db = MotionDatabase())

imports:
  order:
    - stdlib
    - third-party
    - local application imports
  style:
    - group imports with a blank line between groups
    - prefer "from x import y" only when needed to avoid circular imports

types_and_dataclasses:
  - Use type hints broadly (functions, public APIs)
  - config should be a dataclass in config.py (see the sketch below)
  - Module-level singletons are allowed (but follow the lifecycle rules in the db_connection constraints)
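
# Illustrative sketch (assumption, not extracted from config.py) of the dataclass-based
# config implied above; the field names come from usages seen in Phase 1 (DATABASE_PATH,
# API_TIMEOUT), but the default values here are placeholders.
config_example: |
  ```python
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Config:
      DATABASE_PATH: str = "motions.duckdb"
      API_TIMEOUT: int = 30

  config = Config()  # module-level singleton, lower_snake per the naming rules
  ```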

tests:
  - pytest
  - tests/ directory, files named test_*.py
  - Use fixtures in tests/fixtures and conftest.py
  - Tests expect raises(...) for invalid input or ProviderError

error_handling:
  - Prefer explicit exceptions (ValueError, ProviderError)
  - "Avoid overly-broad except: clauses (see anti-patterns)"

@@ -0,0 +1,55 @@
# Dependencies map and recommended extras (Phase 1 authoritative)
declared:
  - streamlit
  - duckdb
  - ibis-framework[duckdb]
  - plotly
  - scikit-learn
  - scipy
  - umap-learn
  - openai  # note: declared but not observed imported; review usage
  - requests

observed:
  - requests
  - duckdb  # used, but the import is sometimes guarded
  - numpy
  - pytest

grouped:
  core:
    - python >=3.13
    - streamlit
    - duckdb
    - ibis-framework[duckdb]
    - requests
  ml:
    - scikit-learn
    - scipy
    - umap-learn
    - numpy
  viz:
    - plotly
  testing:
    - pytest

recommended_extras:
  reproducibility:
    - poetry (poetry.lock) or pip-tools (requirements.txt + requirements.in)
    - pipx or virtualenv usage documented
  linting_and_formatting:
    - black
    - ruff
    - isort
    - mypy
  logging_and_monitoring:
    - structlog (optional)
  containerization:
    - docker (already used)
  heavy_analytics:  # optional
    - pandas
    - altair
    - dash (if more interactive dashboards are needed)
notes:
  - Because no lockfile was present during Phase 1, adding one is a high priority for reproducible CI builds.
  - openai is declared but not imported anywhere in the Phase 1 files; either remove it or add an explicit adapter usage and tests.

@@ -0,0 +1,37 @@
# Domain glossary (core concepts from Phase 1)

terms:
  Motion:
    short: "A parliamentary motion/decision"
    keys: [id, title, description, date, body_text, url]
  motie:
    short: "Dutch: motion (motie). Equivalent to Motion in code comments and UI."
  MP:
    short: "Member of Parliament (kamerlid)"
    keys: [mp_name, party, van, tot_en_met, persoon_id]
  mp_votes:
    short: "Raw voting rows: motion_id, mp_name, vote, date"
  mp_metadata:
    short: "Per-MP metadata table and fields"
  user_sessions:
    short: "Streamlit user quiz session state (session_id, user_votes, completed_motions, ...)"
  embeddings:
    short: "Raw text embeddings stored per motion (embeddings table)"
  svd_vectors:
    short: "SVD-derived vectors from the vote matrix (svd_vectors table)"
  fused_embeddings:
    short: "Concatenation of SVD and text embeddings (fused_embeddings table)"
  similarity_cache:
    short: "Precomputed nearest neighbors for each motion"
  window_id:
    short: "Processing window identifier used for SVD/fusion runs"
  controversy_score:
    short: "Numeric measure stored in the motions table"
  winning_margin:
    short: "Numeric field indicating the margin of win in a vote"
  Politiek_Kompas:
    short: "Political compass; also appears in UI features"
  MP_quiz:
    short: "Interactive quiz derived from motions and mp_votes"
notes:
  - Use these canonical terms in docs, tests, variable names and DB schemas.

@@ -0,0 +1,116 @@
# Extracted pattern examples (representative snippets)

Note: snippets are extracts from repository files (Phase 1); some are trimmed or lightly normalized. Paths shown.

## DuckDB connect + schema init (database.py)
```python
conn = duckdb.connect(self.db_path)

# Create sequence for auto-incrementing IDs
try:
    conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except:
    pass

# Create tables with proper ID handling
conn.execute("""
    CREATE TABLE IF NOT EXISTS motions (
        id INTEGER DEFAULT nextval('motions_id_seq'),
        title TEXT NOT NULL,
        description TEXT,
        date DATE,
        policy_area TEXT,
        voting_results JSON,
        winning_margin FLOAT,
        controversy_score FLOAT,
        layman_explanation TEXT,
        externe_identifier TEXT,
        body_text TEXT,
        url TEXT UNIQUE,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (id)
    )
""")
conn.close()
```

## Read-only compute worker (svd_pipeline.py)
```python
conn = duckdb.connect(db_path, read_only=True)
try:
    rows = conn.execute(
        "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
        (start_date, end_date),
    ).fetchall()
finally:
    conn.close()
```

## Requests with retry/backoff (ai_provider.py)
```python
resp = requests.post(url, json=json, headers=headers, timeout=10)
...
if getattr(resp, "status_code", 0) == 429:
    if attempt == retries:
        raise ProviderError(f"Provider returned HTTP {resp.status_code}")
    retry_after = None
    raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
    if raw:
        try:
            retry_after = int(raw)
        except Exception:
            try:
                dt = parsedate_to_datetime(raw)
                now = datetime.now(tz=dt.tzinfo or timezone.utc)
                secs = (dt - now).total_seconds()
                retry_after = max(0, int(secs))
            except Exception:
                retry_after = None

    if retry_after is not None:
        time.sleep(retry_after)
    continue
```

## Embedding batch + per-item fallback (pipeline/ai_provider_wrapper.py)
```python
for start in range(0, len(texts), batch_size):
    end = min(start + batch_size, len(texts))
    chunk = texts[start:end]
    emb_chunk, emb_exc = _attempt_batch(chunk, start)
    if emb_chunk is not None:
        for j, emb in enumerate(emb_chunk):
            results[start + j] = emb
        continue

    # batch failed -> fall back to per-item attempts
    for j in range(start, end):
        t = texts[j]
        single, single_exc = _attempt_batch([t], j)
        if single:
            results[j] = single[0]
            continue
        results[j] = None
```

## Similarity compute (similarity/compute.py)
```python
# Ensure consistent dimensionality: pad shorter vectors with zeros
lengths = [len(v) for v in vecs]
max_dim = max(lengths)
if len(set(lengths)) != 1:
    logger.warning(
        "Inconsistent vector dimensions detected (max=%d). Padding shorter vectors with zeros.",
        max_dim,
    )

matrix = np.zeros((len(vecs), max_dim), dtype=np.float32)
for i, v in enumerate(vecs):
    matrix[i, : len(v)] = v

# Normalize rows and compute cosine similarity
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
sim = normalized @ normalized.T
```
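
## Top-k neighbor selection (illustrative sketch)

The similarity pipeline stores the top-k neighbors per motion in similarity_cache. The repository's code for that step is not reproduced here; the following is a minimal sketch of one way to select neighbors from the `sim` matrix above, assuming `motion_ids` is a list of motion ids aligned with the rows of `sim` and `top_k` is a configured constant.

```python
import numpy as np

top_k = 10
neighbors = {}
for row_idx in range(sim.shape[0]):
    scores = sim[row_idx].copy()
    scores[row_idx] = -np.inf  # exclude self-similarity
    top = np.argsort(scores)[::-1][:top_k]
    neighbors[motion_ids[row_idx]] = [(motion_ids[col], float(scores[col])) for col in top]
```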

@@ -0,0 +1,60 @@
name: stemwijzer
version: 2
categories:
  - path: stack.yaml
    description: Project technology stack (languages, frameworks, runtime)
    group: stack
  - path: dependencies.yaml
    description: Declared and recommended dependencies grouped by purpose
    group: stack
  - path: system.md
    description: System overview and high-level architecture notes
    group: architecture
  - path: architecture.yaml
    description: Architectural layers, organization and confidence levels
    group: architecture
  - path: conventions.yaml
    description: Coding conventions cheat-sheet (naming, imports, types)
    group: style
  - path: domain-glossary.yaml
    description: Business domain glossary for the project
    group: domain
  - path: patterns/duckdb_access.yaml
    description: DuckDB access patterns, examples, and anti-patterns
    group: patterns
  - path: patterns/requests_http.yaml
    description: Requests/HTTP client usage and retry best practices
    group: patterns
  - path: patterns/embeddings_similarity.yaml
    description: Embedding, SVD, fusion and similarity pipeline patterns
    group: patterns
  - path: patterns/error_handling.yaml
    description: Error handling patterns and rules
    group: patterns
  - path: patterns/validation.yaml
    description: Input/domain validation patterns and examples
    group: patterns
  - path: patterns/module_singletons.yaml
    description: Module-level singletons and lifecycle patterns
    group: patterns
  - path: anti-patterns.yaml
    description: Known anti-patterns and remediation steps
    group: patterns
  - path: examples/pattern-examples.md
    description: Consolidated extracted code examples across patterns
    group: patterns
  - path: constraints/naming.yaml
    description: Enforce naming rules (snake_case, PascalCase, constants)
    group: constraints
  - path: constraints/imports.yaml
    description: Enforce import grouping and ordering
    group: constraints
  - path: constraints/db_connection.yaml
    description: Rules for opening/closing DB connections and read-only usage
    group: constraints
  - path: constraints/error_handling.yaml
    description: Error handling style and allowed exception scopes
    group: constraints
  - path: constraints/testing.yaml
    description: Test conventions (pytest, test naming, fixtures)
    group: constraints

@@ -0,0 +1,70 @@
name: duckdb_access

rules:
  - Prefer read_only=True for compute-only subprocesses (e.g., SVD compute) to allow concurrent readers.
  - Prefer "with duckdb.connect(db_path, read_only=True) as conn" for scoped connections so conn.close() is automatic.
  - If a long-lived connection is created at module level, provide an explicit close() or ensure the operation is safe for Streamlit's lifecycle.
  - Prefer parameterizing db_path in pipelines and creating connections locally (avoid global connections that cross threads).

examples:
  - path: database.py
    excerpt: |
      ```python
      conn = duckdb.connect(self.db_path)
      ...
      conn.execute("""
          CREATE TABLE IF NOT EXISTS fused_embeddings (
              id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
              motion_id INTEGER NOT NULL,
              window_id TEXT NOT NULL,
              vector JSON NOT NULL,
              svd_dims INTEGER NOT NULL,
              text_dims INTEGER NOT NULL,
              created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
              PRIMARY KEY (id)
          )
      """)
      conn.close()
      ```
    note: explicit connect/close used when initializing the schema

  - path: pipeline/svd_pipeline.py
    excerpt: |
      ```python
      conn = duckdb.connect(db_path, read_only=True)
      try:
          rows = conn.execute(
              "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
              (start_date, end_date),
          ).fetchall()
      finally:
          conn.close()
      ```
    note: read_only connection used for a compute-heavy worker

  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0

      with duckdb.connect(db.db_path) as conn:
          rows = conn.execute(query, params).fetchall()
      ```
    note: preferred 'with' context for automatic close

anti_patterns:
  - Bad: creating a connection without closure in a long-running process
    remediation: use a "with" context or ensure conn.close() in a finally block
    example: |
      ```python
      # BAD: connection may leak if an exception occurs before the explicit close
      conn = duckdb.connect(db_path)
      rows = conn.execute("SELECT ...").fetchall()
      # missing finally/close
      ```
  - Bad: Opening write connections from many parallel workers without coordination
    remediation: open read_only connections for compute processes and centralize writes via short-lived connections or a single writer worker (see the sketch below).
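
# Illustrative sketch (assumption, not extracted code) of the centralized short-lived
# write recommended above; the real code goes through MotionDatabase.store_fused_embedding,
# and this only shows the connection-lifecycle shape.
write_helper_sketch: |
  ```python
  import json

  import duckdb

  def store_fused_vector(db_path, motion_id, window_id, fused, svd_dims, text_dims):
      # Open, write, and close in one short scope to keep the lock window small.
      with duckdb.connect(db_path) as conn:
          conn.execute(
              "INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) "
              "VALUES (?, ?, ?, ?, ?)",
              (motion_id, window_id, json.dumps(fused), svd_dims, text_dims),
          )
  ```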

@@ -0,0 +1,63 @@
name: embeddings_similarity_pipeline

rules:
  - Keep embedding calls batched where possible; fall back to per-item attempts on persistent batch failure.
  - Store raw embeddings, SVD vectors, and fused_embeddings separately; fused_embeddings are typically the concatenation [svd + text].
  - Compute similarity as normalized cosine on padded vectors; record the top-k neighbors in similarity_cache.
  - Use read_only DuckDB connections in compute workers to allow parallel runs.

examples:
  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      for start in range(0, len(texts), batch_size):
          chunk = texts[start : start + batch_size]
          resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
          ...
          for j in range(start, start + len(chunk)):
              t = texts[j]
              single, single_exc = _attempt_batch([t], j)
              if single:
                  results[j] = single[0]
      ```
    note: batched embed + fallback per-item retry

  - path: pipeline/fusion.py
    excerpt: |
      ```python
      try:
          svd_vec = json.loads(svd_json)
      except Exception:
          _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
          skipped_missing_svd += 1
          continue
      ...
      fused = list(svd_vec) + list(text_vec)
      res = db.store_fused_embedding(
          int(entity_id),
          window_id,
          fused,
          svd_dims=len(svd_vec),
          text_dims=len(text_vec),
      )
      ```
    note: concatenation of vectors and storage via MotionDatabase

  - path: similarity/compute.py
    excerpt: |
      ```python
      # Normalize rows
      norms = np.linalg.norm(matrix, axis=1, keepdims=True)
      norms[norms == 0] = 1.0
      normalized = matrix / norms
      sim = normalized @ normalized.T
      ...
      # pick top-k neighbors and write to similarity_cache
      ```
    note: numeric pipeline and padding to consistent dimensionality

anti_patterns:
  - Bad: Assuming consistent vector length without checks (leads to shape errors)
    remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
  - Bad: Recomputing heavy pipelines inline in UI requests
    remediation: Schedule heavy work in scripts/subprocesses and read precomputed results in the UI.

@@ -0,0 +1,54 @@
name: error_handling

rules:
  - Use explicit exceptions for domain/error classification (e.g., ProviderError, ValueError).
  - Prefer logging.exception when catching an exception where the stack trace is useful.
  - Avoid broad except clauses that swallow exceptions; if a broad except is used for a "best-effort" fallback, log at warning level and include the original exception context.
  - For public library-like functions, prefer raising typed exceptions instead of returning magic values ([], False); only return safe defaults where documented.

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      except requests.ConnectionError as exc:
          if attempt == retries:
              raise ProviderError(
                  f"Connection error when calling provider: {exc}"
              ) from exc
          ...
      ```
    note: mapping a network error to ProviderError with re-raise chaining

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      except Exception:
          _logger.exception("Failed to append audit event for embedding failure")
          results[j] = None
      ```
    note: logs and assigns None on failure; the fallback behavior is documented earlier in the wrapper rule

  - path: similarity/compute.py
    excerpt: |
      ```python
      try:
          import duckdb
      except Exception:
          logger.exception("duckdb import failed; cannot load vectors")
          return 0
      ```
    note: defensive import handling and early return on failure

anti_patterns:
  - Bad: Broad except without logging and without re-raising (silently hides bugs)
    remediation: Narrow the exception types, or at minimum call logger.exception() and re-raise or convert to a domain error if truly handled.
    example: |
      ```python
      try:
          do_work()
      except Exception:
          return []
      # BAD: hides the root cause and returns an ambiguous default
      ```
  - Bad: Mixing print() and logging for errors
    remediation: Replace print() calls with logger.* calls; use a structured logging configuration.

@@ -0,0 +1,33 @@
name: module_singletons

rules:
  - "Module-level singletons (e.g., db = MotionDatabase()) are acceptable but should be created carefully: avoid expensive initialization at import time, and provide a way to construct with a test DB path or to reinitialize in tests."
  - If a singleton holds resources (DB connections, sessions), ensure safe shutdown on program exit.

examples:
  - path: database.py
    excerpt: |
      ```python
      class MotionDatabase:
          def __init__(self, db_path: str = config.DATABASE_PATH):
              self.db_path = db_path
              # If duckdb is not available, operate in lightweight file-backed mode
              self._file_mode = duckdb is None
              self._init_database()
      ```
    note: the class is safe to instantiate and creates the DB at init; consider lazy init if this becomes heavy

  - path: similarity/lookup.py
    excerpt: |
      ```python
      db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
      if hasattr(db, "get_cached_similarities"):
          rows = db.get_cached_similarities(...)
      ```
    note: consumers create local MotionDatabase instances rather than relying on a single global

anti_patterns:
  - Bad: Creating connections and performing heavy schema migrations during import
    remediation: Move heavy init to an explicit initialize() method and keep import fast (see the sketch below).
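
# Illustrative sketch (assumption, not extracted code) of the remediation above:
# keep __init__ cheap and move schema work into an explicit, idempotent initialize().
# The real default db_path is config.DATABASE_PATH; a placeholder is used here.
lazy_init_sketch: |
  ```python
  class MotionDatabase:
      def __init__(self, db_path: str = "motions.duckdb"):
          self.db_path = db_path
          self._initialized = False  # no connection or migration at construction time

      def initialize(self) -> None:
          """Run schema creation/migrations once, explicitly, at deploy or startup."""
          if self._initialized:
              return
          # ... create tables/sequences here using a short-lived connection ...
          self._initialized = True
  ```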

@@ -0,0 +1,65 @@
name: requests_http

rules:
  - Reuse a requests.Session when making multiple calls to the same host to benefit from connection pooling (see the sketch below).
  - Wrap outbound HTTP calls with retry/backoff logic and respect Retry-After on 429.
  - Treat 5xx as transient and retry; surface 4xx as configuration/client errors (do not retry unless 429).
  - Raise or wrap non-OK responses into a domain ProviderError to make behavior consistent across the codebase.
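
# Illustrative sketch (assumption, not extracted code) of session reuse with
# transport-level retries; the retry counts, backoff factor and status list are examples only.
session_sketch: |
  ```python
  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  def make_session() -> requests.Session:
      session = requests.Session()
      retry = Retry(
          total=3,
          backoff_factor=0.5,
          status_forcelist=[429, 500, 502, 503, 504],
          allowed_methods=["GET", "POST"],
      )
      session.mount("https://", HTTPAdapter(max_retries=retry))
      return session
  ```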

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      resp = requests.post(url, json=json, headers=headers, timeout=10)
      ...
      if getattr(resp, "status_code", 0) == 429:
          if attempt == retries:
              raise ProviderError(f"Provider returned HTTP {resp.status_code}")
          retry_after = None
          raw = resp.headers.get("Retry-After") if getattr(resp, "headers", None) else None
          if raw:
              try:
                  retry_after = int(raw)
              except Exception:
                  ...
          if retry_after is not None:
              time.sleep(retry_after)
          continue
      ```
    note: explicit handling of 429 and Retry-After

  - path: api_client.py
    excerpt: |
      ```python
      response = self.session.get(
          base_url, params=params, timeout=config.API_TIMEOUT
      )
      response.raise_for_status()
      data = response.json()
      ```
    note: uses a session + raise_for_status() to surface HTTP errors

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      def _attempt_batch(chunk_texts, start_index):
          backoff = 0.5
          for attempt in range(1, retries + 1):
              try:
                  emb_chunk = _embedder(
                      chunk_texts, model=model, batch_size=len(chunk_texts)
                  )
                  return emb_chunk, None
              except Exception as exc:
                  if attempt == retries:
                      break
                  sleep = backoff * (2 ** (attempt - 1))
                  time.sleep(sleep)
                  continue
      ```
    note: wrapper adds retry/backoff and per-item fallback

anti_patterns:
  - Bad: Blindly catching all requests exceptions and returning an empty response
    remediation: map network exceptions to retryable vs terminal (ProviderError) and log details.
  - Bad: Using print() for network errors instead of structured logging (see api_client.py where print() is used; prefer logging).

@@ -0,0 +1,29 @@
name: validation

rules:
  - Validate inputs early and raise ValueError or domain-specific exceptions (ProviderError) for invalid contract inputs.
  - Tests should assert that invalid inputs raise the expected exceptions.
  - Use explicit checks for types and shapes on public APIs (e.g., ensure text is a str before embedding).

examples:
  - path: ai_provider.py
    excerpt: |
      ```python
      if not isinstance(text, str):
          raise ProviderError("text must be a string")
      ```
    note: explicit type validation before the network call

  - path: pipeline/ai_provider_wrapper.py
    excerpt: |
      ```python
      if not texts:
          return []
      if motion_ids is None:
          motion_ids = [None for _ in texts]
      ```
    note: defensive handling of empty inputs

anti_patterns:
  - Bad: Allowing invalid values to propagate into heavy computation (e.g., a non-string into the embedding pipeline)
    remediation: Fail fast with a typed exception and add unit tests to cover validations.

@@ -0,0 +1,33 @@
# Tech stack (Phase 1 authoritative)

language:
  name: python
  version: ">=3.13"

frameworks:
  - streamlit: ">=1.48.0"  # UI: Home.py, pages/..., app.py

database:
  primary: duckdb
  orm_or_adapter: ibis-framework[duckdb]  # used for some parts

visualization:
  - plotly

ml:
  - scikit-learn
  - scipy
  - umap-learn

ai:
  declared_dependency: openai  # declared in pyproject but not observed imported; ai_provider uses requests
  runtime_adapter: custom requests-based wrapper (ai_provider.py)

container:
  - docker: "Dockerfile FROM python:3.13-slim, EXPOSE 8501, CMD streamlit run Home.py"

testing:
  - pytest

ci:
  - drone: .drone.yml present

@@ -0,0 +1,18 @@
# System overview

This project is a Streamlit-based UI and data-processing pipeline that computes embeddings,
performs SVD over MP/motion voting matrices, fuses vector representations, and precomputes
a similarity cache for quick lookup in the UI.

Key subsystems:
- UI: Streamlit pages (Home.py, pages/*). Exposes the interactive explorer and quizzes.
- Data ingestion: scripts and scraper/api_client.py (Tweede Kamer OData).
- Processing pipelines: pipeline/* (text embeddings, SVD, fusion).
- Similarity layer: similarity/compute.py and similarity/lookup.py storing precomputed neighbors.
- Storage: DuckDB (primary), with a JSON-file fallback used in tests/environments without duckdb.
- AI/Embedding provider: ai_provider.py (HTTP wrapper around an OpenRouter/OpenAI-compatible API).

Operational notes:
- Dockerfile exists; Streamlit default port 8501 is exposed.
- Tests use pytest. CI uses Drone (.drone.yml).
- There is no lockfile in the repository snapshot; add one (poetry.lock or requirements.txt) for reproducible installs.