ARCHITECTURE
============

Overview
--------
- Small Python project that collects, stores and presents Dutch parliamentary motions (Tweede Kamer). It ingests votes (OData API or HTML scraping), stores motions in a DuckDB file, generates short human summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results.

Tech stack
----------
- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db), ibis used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
- HTTP: requests
- HTML parsing: BeautifulSoup (scraper.py)
- Scheduling: schedule (scheduler.py)
- LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config)
- Packaging: pyproject.toml present

Top-level layout (annotated)
----------------------------
./
- app.py: Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py: minimal CLI entry / small script
- database.py: MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py: TweedeKamerAPI: fetch OData voting records and group into motions
- scraper.py: MotionScraper: HTML fallback scraper for motion pages
- summarizer.py: MotionSummarizer: LLM integration to generate layman_explanation
- scheduler.py: DataUpdateScheduler: initial historical loads + periodic scheduled updates
- config.py: Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- read.py: small ibis + duckdb demonstration/utility
- fix_database.py: script to recreate/reset DuckDB schema
- reset.py / verify.py: small maintenance scripts that call into database module
- test.py: ad-hoc test script (manual insert/verification)
- data/: data/motions.db (DuckDB file)
- pyproject.toml: project metadata / dependencies
- .env: environment variables (not printed here)

Core components
---------------
- Streamlit UI (app.py)
  - Presents the voting UI, reads filtered motions from database, creates sessions, writes user votes
  - Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(), database.calculate_party_matches(), summarizer.update_motion_summaries()
- Storage (database.py)
  - MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
  - Exposes a module-level instance `db = MotionDatabase()` used across the codebase
  - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches
- Ingestion (api_client.py + scraper.py)
  - api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
  - scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
  - Both provide structured motion dicts consumed by database.insert_motion()
- Summarization (summarizer.py)
  - Wraps an OpenAI-compatible client to produce short layman explanations and persists them to DB
  - Reads motions without layman_explanation and updates rows
- Orchestration (scheduler.py)
  - Runs initial historical ingestion and schedules periodic updates (using schedule)
  - Calls API client and summarizer and writes to the database

Data flow (high level)
----------------------
1. Ingestion
   - scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
   - Each produced motion dict is passed to MotionDatabase.insert_motion()
   - insert_motion writes to DuckDB (data/motions.db)
2. Enrichment
   - summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client (openai.OpenAI) and writes summary text back to the DB
3. Presentation / Interaction
   - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
   - Users vote; app.py writes votes into the database via db.update_user_vote()
   - app.py calls db.calculate_party_matches() to compute match percentages for parties

External integrations & dependencies
------------------------------------
- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
- DuckDB (database file at data/motions.db)
- ibis (read.py demonstrates an ibis.duckdb connection)
- Streamlit for UI
- OpenAI-compatible LLM client (summarizer.py), configured with environment variables in config.py

Configuration
-------------
- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
  - config.DATABASE_PATH (default "data/motions.db")
  - OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
  - QWEN_MODEL (or other model identifier) referenced in summarizer.py
  - API timeout / batch size constants
- .env file present at repo root (do not commit secrets). See .env.example if present (none observed).
- Packaging metadata: pyproject.toml

Build, run & development notes
------------------------------
- Install dependencies via the project's Python packaging (pyproject.toml). No Dockerfile or CI workflows were detected in the repository.
- Streamlit app: run `streamlit run app.py` from project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: run scheduler.run_once() (script or import) or run scheduler.run_scheduler() for periodic ingestion.

Tests
-----
- There is no test suite using pytest / unittest. One ad-hoc script `test.py` exists for manual insert verification.

Notes / caveats
---------------
- Project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py, scraper.py). Logging is not centralized (print statements used).

Where to look first (for contributors)
--------------------------------------
- app.py: follow the UI flow and see how votes & sessions are used
- database.py: core data model and calculations
- api_client.py: OData ingestion logic
- summarizer.py: LLM usage and environment variables
- scheduler.py: how ingestion is orchestrated over time
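
Appendix: data-flow sketch (illustrative)
-----------------------------------------
The three data-flow stages described earlier (ingestion, enrichment, presentation) can be illustrated with a small, self-contained sketch. It is not the project's code. `Motion`, `InMemoryStore`, the "for" / "against" vote encoding, and all method signatures are assumptions made for illustration, and the store uses plain dicts rather than DuckDB. The matching formula (share of the user's votes on which a party voted the same way) is likewise a guess at what calculate_party_matches might compute; the real logic lives in database.py. The `summarize` callable stands in for the LLM client.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Motion:
    """Hypothetical motion record; the real schema is defined in database.py."""
    motion_id: str
    title: str
    party_votes: Dict[str, str]  # party -> "for" / "against" (assumed encoding)
    layman_explanation: Optional[str] = None


@dataclass
class InMemoryStore:
    """Stand-in for MotionDatabase, backed by dicts instead of DuckDB."""
    motions: Dict[str, Motion] = field(default_factory=dict)
    user_votes: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def insert_motion(self, motion: Motion) -> None:
        # Stage 1 (ingestion): store one motion produced by the API client or scraper.
        self.motions[motion.motion_id] = motion

    def update_user_vote(self, session_id: str, motion_id: str, vote: str) -> None:
        # Stage 3 (interaction): record a user's vote within a session.
        self.user_votes.setdefault(session_id, {})[motion_id] = vote

    def calculate_party_matches(self, session_id: str) -> Dict[str, float]:
        # Assumed formula: percentage of the user's votes matched by each party.
        agree: Dict[str, int] = {}
        total: Dict[str, int] = {}
        for motion_id, vote in self.user_votes.get(session_id, {}).items():
            for party, party_vote in self.motions[motion_id].party_votes.items():
                total[party] = total.get(party, 0) + 1
                agree[party] = agree.get(party, 0) + (party_vote == vote)
        return {party: 100.0 * agree[party] / total[party] for party in total}


def update_motion_summaries(store: InMemoryStore,
                            summarize: Callable[[str], str]) -> int:
    """Stage 2 (enrichment): fill in missing explanations; returns rows updated."""
    updated = 0
    for motion in store.motions.values():
        if motion.layman_explanation is None:
            motion.layman_explanation = summarize(motion.title)
            updated += 1
    return updated


if __name__ == "__main__":
    store = InMemoryStore()
    store.insert_motion(Motion("m1", "Example motion", {"A": "for", "B": "against"}))
    update_motion_summaries(store, lambda title: "Plain-language: " + title)
    store.update_user_vote("session-1", "m1", "for")
    print(store.calculate_party_matches("session-1"))
```

Passing the summarizer in as a callable, rather than importing a module-level singleton, keeps the sketch runnable without network access or API keys.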