---
date: 2026-03-19
topic: "Stemwijzer AI & DB design"
status: draft
---

## Problem Statement

We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.

**Goal:**

- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).

## Constraints

- Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.

## Approach (chosen)

I'll introduce two small layers:

- **ai_provider**: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired. Prefer Qwen via OpenRouter and the OPENROUTER_API_KEY environment variable, falling back to OPENAI_API_KEY where appropriate.
- **query_dal**: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.

This gives the benefits of abstraction and pythonic query composition while keeping risk low.
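A minimal sketch of the ai_provider adapter described above. This is illustrative only: the class name `AIProvider`, the injected `transport` callable (standing in for the real OpenRouter/OpenAI HTTP client), and the use of `TimeoutError` as the transient-error signal are all assumptions, chosen so the adapter stays mockable in unit tests.

```python
import os
import time


class ProviderError(Exception):
    """Terminal provider failure; callers decide their own retry semantics."""


class AIProvider:
    """Thin adapter over an LLM/embedding backend.

    The HTTP client is injected as `transport`, a callable taking
    (endpoint, payload) and returning a parsed response dict, so unit
    tests can pass a stub instead of a real OpenRouter/OpenAI client.
    """

    def __init__(self, transport, max_retries=3, base_delay=0.5):
        self.transport = transport
        self.max_retries = max_retries
        self.base_delay = base_delay
        # Prefer OpenRouter, fall back to OpenAI, per the design above.
        self.api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")

    def _call(self, endpoint, payload):
        for attempt in range(self.max_retries):
            try:
                return self.transport(endpoint, payload)
            except TimeoutError:  # stand-in for "transient error" detection
                if attempt < self.max_retries - 1:
                    # Exponential backoff between retries.
                    time.sleep(self.base_delay * (2 ** attempt))
        raise ProviderError(f"{endpoint} failed after {self.max_retries} attempts")

    def get_embedding(self, text: str) -> list[float]:
        resp = self._call("embeddings", {"input": text})
        return resp["data"][0]["embedding"]

    def chat_completion(self, messages: list[dict]) -> str:
        resp = self._call("chat/completions", {"messages": messages})
        return resp["choices"][0]["message"]["content"]
```

Callers only ever see plain Python objects (a `list[float]` or a `str`) or a `ProviderError`, which keeps the summarizer and query paths decoupled from any particular provider's SDK.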
## Architecture

High-level components (repo root):

- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generates layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)

Additions:

- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)

## Components and responsibilities

- **ai_provider**: choose the provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- **database (existing)**: add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk.
- **query_dal**: use ibis for the read queries behind the Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- **summarizer**: call ai_provider.chat_completion to get a summary; update motions.layman_explanation; optionally compute an embedding via ai_provider.get_embedding and store it via database.store_embedding.
- **app.py**: replace direct duckdb selects with query_dal functions.

## Data Flow

1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to the motions table. Optionally computes an embedding and writes it to the embeddings table.
3. Query: the Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
4. Semantic search (future): query_dal or app can call database.search_similar with an embedding computed via ai_provider.get_embedding.

## Error Handling

- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.

## Testing Strategy

- Unit tests for ai_provider using mocks for HTTP/OpenAI responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert that the DB writes happen.

## Open Questions

- Store embeddings inside the motions table vs. a separate embeddings table? Recommendation: a separate embeddings table, for clarity and easier upserts.
- Do we want to prefer other providers (Copilot) automatically? This repo currently references OpenRouter. If Copilot preference is wanted, we can add env vars and selection logic later.

## Next steps (short)

1. Add ai_provider.py (adapter) and tests.
2. Add the embeddings table and store/search helpers in database.py, with tests.
3. Add query_dal.py with ibis reads and tests.
4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
5. Update the Streamlit app read paths to use query_dal.
6. Fix housekeeping bugs: reset.py references a missing reset_database(), and the scraper uses an undefined SCRAPING_DELAY — address these small fixes in a separate patch.

I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.
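The summarizer's non-fatal error handling described above might be wired as follows. This is a sketch, not the repo's actual code: `Summarizer` and `update_layman_explanation` are hypothetical names, and the provider and DB handle are injected so the path stays unit-testable as outlined in the Testing Strategy.

```python
class Summarizer:
    """Generate a layman summary and persist it; AI failures are non-fatal."""

    def __init__(self, provider, db):
        self.provider = provider  # exposes chat_completion(messages) -> str
        self.db = db              # exposes update_layman_explanation(id, text)

    def summarize(self, motion_id: str, motion_text: str) -> str:
        try:
            summary = self.provider.chat_completion(
                [{"role": "user", "content": f"Explain for a layman: {motion_text}"}]
            )
        except Exception:
            # Non-fatal per the design: store a fallback so ingestion
            # continues, and let the caller log / surface the failure.
            summary = ""
        self.db.update_layman_explanation(motion_id, summary)
        return summary
```

Because both collaborators are plain attributes, the "mock ai_provider to assert DB writes happen" test reduces to passing two stubs and inspecting what was recorded.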
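The naive in-Python cosine scan behind search_similar could look like this. A sketch under assumptions: `rows` stands in for what a DuckDB `fetchall()` over the embeddings table would return (motion_id plus the JSON-encoded vector), and the function and parameter names are illustrative.

```python
import json
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search_similar(rows, query_vec, top_k=5):
    """Naive scan over embeddings stored as JSON strings.

    `rows` is a list of (motion_id, vector_json) tuples, i.e. the shape of
    conn.execute("SELECT motion_id, vector FROM embeddings").fetchall().
    Returns (motion_id, score) pairs, best match first.
    """
    scored = [(mid, cosine(json.loads(vec), query_vec)) for mid, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This is O(n) per query, which is fine for the first iteration's motion counts; if it becomes a bottleneck, that is the point to revisit the "no external vector DB" constraint.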