| date | topic | status |
|---|---|---|
| 2026-03-19 | Stemwijzer AI & DB design | draft |
Problem Statement
We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.
Goal:
- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).
Constraints
- Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.
Approach (chosen)
I'll introduce two small layers:
- ai_provider: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired.
- query_dal: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.
This gives the benefits of abstraction and pythonic query composition while keeping risk low.
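As a hedged sketch (all names here are assumptions, not the repo's actual API), the ai_provider abstraction could be a small Protocol plus a fake implementation for tests:

```python
# Hypothetical shape of ai_provider; names are illustrative assumptions.
from typing import Protocol, runtime_checkable


class ProviderError(Exception):
    """Terminal provider failure; callers decide retry semantics."""


@runtime_checkable
class AIProvider(Protocol):
    def get_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for the given text."""
        ...

    def chat_completion(self, messages: list[dict]) -> str:
        """Return the assistant reply for a list of chat messages."""
        ...


class FakeProvider:
    """In-memory stand-in, useful for unit tests without network access."""

    def get_embedding(self, text: str) -> list[float]:
        # Deterministic toy embedding: length plus a char-code checksum.
        return [float(len(text)), float(sum(map(ord, text)) % 997)]

    def chat_completion(self, messages: list[dict]) -> str:
        return "summary: " + messages[-1]["content"][:40]
```

A real OpenRouter/OpenAI-backed implementation would satisfy the same Protocol, so callers never import provider SDKs directly.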
Architecture
High level components (repo root):
- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generate layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)
Additions:
- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)
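A possible shape for the embeddings table and its JSON (de)serialization helpers — the DDL, column names, and `store_embedding` signature are assumptions for illustration, not the final schema:

```python
import json

# Assumed DDL for the minimal embeddings table (DuckDB-compatible).
EMBEDDINGS_DDL = """
CREATE TABLE IF NOT EXISTS embeddings (
    motion_id  VARCHAR NOT NULL,
    model      VARCHAR NOT NULL,
    vector     JSON    NOT NULL,   -- list[float] serialized as JSON text
    created_at TIMESTAMP DEFAULT current_timestamp,
    PRIMARY KEY (motion_id, model)
)
"""


def serialize_vector(vec: list[float]) -> str:
    """Encode an embedding as compact JSON for the vector column."""
    return json.dumps(vec, separators=(",", ":"))


def deserialize_vector(raw: str) -> list[float]:
    """Decode the JSON text stored in the vector column."""
    return [float(x) for x in json.loads(raw)]


def store_embedding(con, motion_id: str, model: str, vec: list[float]) -> None:
    """Upsert one embedding row; `con` is an open DuckDB connection."""
    con.execute(
        "INSERT OR REPLACE INTO embeddings (motion_id, model, vector) "
        "VALUES (?, ?, ?)",
        [motion_id, model, serialize_vector(vec)],
    )
```

The composite primary key on (motion_id, model) makes re-embedding with a new model an insert rather than a conflict.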
Components and responsibilities
- ai_provider: choose provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- database (existing): add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk.
- query_dal: use ibis for read queries used by Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- summarizer: call ai_provider.chat_completion to get summary; update motions.layman_explanation; optionally compute embedding via ai_provider.get_embedding and store via database.store_embedding.
- app.py: replace direct duckdb selects with query_dal functions.
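The naive in-Python cosine scan mentioned for search_similar can be sketched as follows (function and parameter names are illustrative):

```python
import json
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_similar(rows, query: list[float], top_k: int = 5):
    """Rank (motion_id, vector_json) rows by similarity to the query vector.

    `rows` is what a `SELECT motion_id, vector FROM embeddings` fetch would
    return; the scan is O(n) in Python, which is acceptable at this scale.
    """
    scored = [(mid, cosine(json.loads(raw), query)) for mid, raw in rows]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```

If the corpus grows, this can later be swapped for a vector index without changing callers.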
Data Flow
- Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
- Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to motions table. Optionally computes embedding and writes to embeddings table.
- Query: Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
- Semantic search (future): query_dal or app can call database.search_similar by providing an embedding computed with ai_provider.get_embedding.
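The summarize step of this flow might look roughly like the sketch below, with a dict standing in for the database and a stub standing in for ai_provider (all names are assumptions):

```python
class StubProvider:
    """Stand-in for ai_provider in this sketch."""

    def chat_completion(self, messages):
        return "summary: " + messages[-1]["content"]

    def get_embedding(self, text):
        return [float(len(text))]


def summarize_motion(motion: dict, provider, db: dict) -> None:
    """Generate a layman summary, persist it, and optionally store an embedding."""
    prompt = [{"role": "user", "content": f"Explain simply: {motion['text']}"}]
    summary = provider.chat_completion(prompt)
    db[motion["id"]] = {"layman_explanation": summary}
    # Optional second step: embed the summary for later semantic search.
    db[motion["id"]]["embedding"] = provider.get_embedding(summary)
```

The real version would write to the motions table and the embeddings table instead of a dict.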
Error Handling
- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.
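The retry-with-exponential-backoff behavior in ai_provider could be a small helper along these lines (a sketch; the exception types treated as transient are an assumption):

```python
import time


class ProviderError(Exception):
    """Raised after retries are exhausted (terminal failure)."""


def with_retries(fn, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(); on transient errors, retry with exponential backoff.

    `sleep` is injectable so tests can run without real delays.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == attempts - 1:
                raise ProviderError(f"gave up after {attempts} attempts") from exc
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Callers catch ProviderError and decide whether the failure is fatal (e.g. summarizer falls back to an empty summary).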
Testing Strategy
- Unit tests for ai_provider using mocks for HTTP/openai responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert DB writes happen.
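An illustrative unit test for the summarizer path, using unittest.mock in place of the real provider (the wiring shown is a simplified assumption of how summarizer will call ai_provider):

```python
from unittest.mock import Mock


def test_summarizer_writes_summary():
    """Summarizer should write whatever the provider returns to the DB."""
    provider = Mock()
    provider.chat_completion.return_value = "simple summary"
    db = {}

    # Simplified stand-in for summarizer.summarize(motion) using the mock.
    motion = {"id": "m1", "text": "some motion text"}
    db[motion["id"]] = provider.chat_completion(
        [{"role": "user", "content": motion["text"]}]
    )

    provider.chat_completion.assert_called_once()
    assert db["m1"] == "simple summary"
```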
Open Questions
- Store embeddings inside motions table vs separate embeddings table? Recommendation: separate embeddings table for clarity and easier upserts.
- Do we want to prefer other providers (e.g. Copilot) automatically? This repo currently references OPENROUTER. If a Copilot preference is wanted, we can add env vars and selection logic later.
Next steps (short)
- Add ai_provider.py (adapter) and tests.
- Add embeddings table and store/search helpers in database.py and tests.
- Add query_dal.py with ibis reads and tests.
- Refactor summarizer.py to use ai_provider and optionally store embeddings.
- Update Streamlit app read paths to use query_dal.
- Fix housekeeping bugs: reset.py references a missing reset_database(), and scraper.py uses an undefined SCRAPING_DELAY constant — address these small fixes in a separate patch.
I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.