---
date: 2026-03-19
topic: "Stemwijzer AI & DB design"
status: draft
---

## Problem Statement

We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.

**Goal:**

- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).

## Constraints

- Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.

## Approach (chosen)

I'll introduce two small layers:

- **ai_provider**: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired. Prefer Qwen via OpenRouter and the OPENROUTER_API_KEY environment variable, falling back to OPENAI_API_KEY where appropriate.
- **query_dal**: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.

This gives the benefits of abstraction and pythonic query composition while keeping risk low.
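A minimal sketch of the ai_provider adapter described above. This is illustrative only: the class name `AIProvider`, the injected `transport` callable (standing in for the real OpenRouter/OpenAI HTTP client), and the use of `TimeoutError` as the transient-error signal are all assumptions, chosen so the adapter stays mockable in unit tests.

```python
import os
import time


class ProviderError(Exception):
    """Terminal provider failure; callers decide their own retry semantics."""


class AIProvider:
    """Thin adapter over an LLM/embedding backend.

    The HTTP client is injected as `transport`, a callable taking
    (endpoint, payload) and returning a parsed response dict, so unit
    tests can pass a stub instead of a real OpenRouter/OpenAI client.
    """

    def __init__(self, transport, max_retries=3, base_delay=0.5):
        self.transport = transport
        self.max_retries = max_retries
        self.base_delay = base_delay
        # Prefer OpenRouter, fall back to OpenAI, per the design above.
        self.api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")

    def _call(self, endpoint, payload):
        for attempt in range(self.max_retries):
            try:
                return self.transport(endpoint, payload)
            except TimeoutError:  # stand-in for "transient error" detection
                if attempt < self.max_retries - 1:
                    # Exponential backoff between retries.
                    time.sleep(self.base_delay * (2 ** attempt))
        raise ProviderError(f"{endpoint} failed after {self.max_retries} attempts")

    def get_embedding(self, text: str) -> list[float]:
        resp = self._call("embeddings", {"input": text})
        return resp["data"][0]["embedding"]

    def chat_completion(self, messages: list[dict]) -> str:
        resp = self._call("chat/completions", {"messages": messages})
        return resp["choices"][0]["message"]["content"]
```

Callers only ever see plain Python objects (a `list[float]` or a `str`) or a `ProviderError`, which keeps the summarizer and query paths decoupled from any particular provider's SDK.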
## Architecture

High-level components (repo root):

- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generates layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)

Additions:

- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)

## Components and responsibilities

- **ai_provider**: choose the provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- **database (existing)**: add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk.
- **query_dal**: use ibis for the read queries behind the Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- **summarizer**: call ai_provider.chat_completion to get a summary; update motions.layman_explanation; optionally compute an embedding via ai_provider.get_embedding and store it via database.store_embedding.
- **app.py**: replace direct duckdb selects with query_dal functions.

## Data Flow

1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to the motions table. Optionally computes an embedding and writes it to the embeddings table.
3. Query: the Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
4. Semantic search (future): query_dal or app can call database.search_similar with an embedding computed via ai_provider.get_embedding.

## Error Handling

- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.

## Testing Strategy

- Unit tests for ai_provider using mocks for HTTP/OpenAI responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert that the DB writes happen.

## Open Questions

- Store embeddings inside the motions table vs. a separate embeddings table? Recommendation: a separate embeddings table, for clarity and easier upserts.
- Do we want to prefer other providers (Copilot) automatically? This repo currently references OpenRouter. If Copilot preference is wanted, we can add env vars and selection logic later.

## Next steps (short)

1. Add ai_provider.py (adapter) and tests.
2. Add the embeddings table and store/search helpers in database.py, with tests.
3. Add query_dal.py with ibis reads and tests.
4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
5. Update the Streamlit app read paths to use query_dal.
6. Fix housekeeping bugs: reset.py references a missing reset_database(), and the scraper uses an undefined SCRAPING_DELAY — address these small fixes in a separate patch.

I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.
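The summarizer's non-fatal error handling described above might be wired as follows. This is a sketch, not the repo's actual code: `Summarizer` and `update_layman_explanation` are hypothetical names, and the provider and DB handle are injected so the path stays unit-testable as outlined in the Testing Strategy.

```python
class Summarizer:
    """Generate a layman summary and persist it; AI failures are non-fatal."""

    def __init__(self, provider, db):
        self.provider = provider  # exposes chat_completion(messages) -> str
        self.db = db              # exposes update_layman_explanation(id, text)

    def summarize(self, motion_id: str, motion_text: str) -> str:
        try:
            summary = self.provider.chat_completion(
                [{"role": "user", "content": f"Explain for a layman: {motion_text}"}]
            )
        except Exception:
            # Non-fatal per the design: store a fallback so ingestion
            # continues, and let the caller log / surface the failure.
            summary = ""
        self.db.update_layman_explanation(motion_id, summary)
        return summary
```

Because both collaborators are plain attributes, the "mock ai_provider to assert DB writes happen" test reduces to passing two stubs and inspecting what was recorded.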
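The naive in-Python cosine scan behind search_similar could look like this. A sketch under assumptions: `rows` stands in for what a DuckDB `fetchall()` over the embeddings table would return (motion_id plus the JSON-encoded vector), and the function and parameter names are illustrative.

```python
import json
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search_similar(rows, query_vec, top_k=5):
    """Naive scan over embeddings stored as JSON strings.

    `rows` is a list of (motion_id, vector_json) tuples, i.e. the shape of
    conn.execute("SELECT motion_id, vector FROM embeddings").fetchall().
    Returns (motion_id, score) pairs, best match first.
    """
    scored = [(mid, cosine(json.loads(vec), query_vec)) for mid, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

This is O(n) per query, which is fine for the first iteration's motion counts; if it becomes a bottleneck, that is the point to revisit the "no external vector DB" constraint.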