---
date: 2026-03-19
topic: Stemwijzer AI & DB design
status: draft
---

Problem Statement

We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.

Goal:

  • Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
  • Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
  • Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).

Constraints

  • Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
  • Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
  • Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
  • Keep changes small and well-tested; make it easy to roll back.

Approach (chosen)

I'll introduce two small layers:

  • ai_provider: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired. By default it prefers Qwen via OpenRouter (authenticated with the OPENROUTER_API_KEY environment variable), falling back to OPENAI_API_KEY where appropriate.
  • query_dal: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.

This gives the benefits of abstraction and pythonic query composition while keeping risk low.

Architecture

High level components (repo root):

  • api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
  • scraper.py — optional HTML scraping fallback (unchanged)
  • database.py — current writes, schema initialization (add small embeddings table)
  • summarizer.py — generate layman summaries (refactor to use ai_provider)
  • app.py — Streamlit UI (switch read paths to query_dal)
  • scheduler.py — orchestrates ingestion and triggers summarization (unchanged)

Additions:

  • ai_provider.py — single place for LLM/embedding calls and retries
  • query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
  • minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)
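The embeddings table and its store helper could look roughly like this, assuming a DB-API-style connection (column types in DuckDB terms; the helper name is the one proposed above, its exact signature is a sketch):

```python
import json

# Minimal embeddings table as proposed: one row per (motion_id, model).
EMBEDDINGS_DDL = """
CREATE TABLE IF NOT EXISTS embeddings (
    motion_id  VARCHAR NOT NULL,
    model      VARCHAR NOT NULL,
    vector     JSON    NOT NULL,      -- list[float] serialized as JSON
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

def store_embedding(con, motion_id: str, model: str, vector: list[float]) -> None:
    """Upsert-by-delete: keep exactly one row per (motion_id, model)."""
    con.execute(
        "DELETE FROM embeddings WHERE motion_id = ? AND model = ?",
        (motion_id, model))
    con.execute(
        "INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
        (motion_id, model, json.dumps(vector)))
```

The delete-then-insert upsert keeps re-summarization idempotent without depending on a unique constraint.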

Components and responsibilities

  • ai_provider: choose provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
  • database (existing): add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk.
  • query_dal: use ibis for read queries used by Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
  • summarizer: call ai_provider.chat_completion to get summary; update motions.layman_explanation; optionally compute embedding via ai_provider.get_embedding and store via database.store_embedding.
  • app.py: replace direct duckdb selects with query_dal functions.

Data Flow

  1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
  2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to motions table. Optionally computes embedding and writes to embeddings table.
  3. Query: Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
  4. Semantic search (future): query_dal or app can call database.search_similar by providing an embedding computed with ai_provider.get_embedding.
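The naive search_similar scan from step 4 could be a pure-Python cosine pass over the JSON vectors, which is adequate at a few thousand motions:

```python
import json
import math

def search_similar(con, query_vec: list[float], model: str, top_k: int = 5):
    """Naive in-Python cosine scan over the embeddings table.

    Revisit (e.g. DuckDB array functions or a real vector index) only
    if the table outgrows a linear scan.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    rows = con.execute(
        "SELECT motion_id, vector FROM embeddings WHERE model = ?", (model,)
    ).fetchall()
    scored = [(mid, cosine(query_vec, json.loads(vec))) for mid, vec in rows]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```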

Error Handling

  • ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
  • Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
  • DB functions: existing try/except patterns retained; ensure connections are closed on error.
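The retry/ProviderError contract can be sketched as a small wrapper; the `retryable` exception tuple here is illustrative and would map to the HTTP client's actual transient error types:

```python
import random
import time

class ProviderError(Exception):
    """Terminal provider failure; callers decide whether to retry the task."""

def with_retries(fn, *, attempts=3, base_delay=0.5,
                 retryable=(TimeoutError, ConnectionError)):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable as exc:
            if attempt == attempts - 1:
                # Out of attempts: surface a terminal error to the caller.
                raise ProviderError(f"gave up after {attempts} attempts") from exc
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Raising a single ProviderError type keeps the summarizer's "non-fatal on AI failure" branch to one except clause.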

Testing Strategy

  • Unit tests for ai_provider using mocks for HTTP/openai responses.
  • DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
  • query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
  • Summarizer tests mock ai_provider to assert DB writes happen.
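The summarizer test idea above can be sketched with stdlib mocks; `summarize_motion` and the `update_layman_explanation` DB method are hypothetical names standing in for the refactored code:

```python
from unittest.mock import Mock

def summarize_motion(provider, db, motion_id: str, text: str) -> str:
    """Hypothetical summarizer core, shaped like the refactor above."""
    summary = provider.chat_completion(
        [{"role": "user", "content": f"Explain this motion simply: {text}"}])
    db.update_layman_explanation(motion_id, summary)
    return summary

def test_summarizer_writes_summary():
    # Mock ai_provider and the DB layer; assert the write happens.
    provider = Mock()
    provider.chat_completion.return_value = "Simple summary."
    db = Mock()
    result = summarize_motion(provider, db, "m1", "Motie over stikstof")
    assert result == "Simple summary."
    db.update_layman_explanation.assert_called_once_with("m1", "Simple summary.")
```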

Open Questions

  • Store embeddings inside motions table vs separate embeddings table? Recommendation: separate embeddings table for clarity and easier upserts.
  • Do we want to prefer other providers (Copilot) automatically? This repo currently references OPENROUTER. If user wants Copilot preference, we can add env vars and selection logic later.

Next steps (short)

  1. Add ai_provider.py (adapter) and tests.
  2. Add embeddings table and store/search helpers in database.py and tests.
  3. Add query_dal.py with ibis reads and tests.
  4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
  5. Update Streamlit app read paths to use query_dal.
  6. Fix housekeeping bugs: reset.py references a missing reset_database(), and scraper.py uses an undefined SCRAPING_DELAY constant; address these small fixes in a separate patch.

I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.