# ARCHITECTURE

## Overview

- Small Python project that collects, stores and presents Dutch parliamentary motions (Tweede Kamer). It ingests votes (via the OData API or HTML scraping), stores motions in a DuckDB file, generates short human-readable summaries using an LLM client, and exposes a Streamlit UI where users vote on motions and view their party match results.

## Tech stack

- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db); ibis is used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
- HTTP: requests
- HTML parsing: BeautifulSoup (scraper.py)
- Scheduling: schedule (scheduler.py)
- LLM: QWEN via OpenRouter, accessed through an OpenAI-compatible client (summarizer.py, configured via config.py). Prefer QWEN via OpenRouter where possible.
- Packaging: pyproject.toml present

## Top-level layout (annotated)

./

- app.py: Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py: minimal CLI entry / small script
- database.py: MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py: TweedeKamerAPI: fetch OData voting records and group into motions
- scraper.py: MotionScraper: HTML fallback scraper for motion pages
- summarizer.py: MotionSummarizer: LLM integration to generate layman_explanation
- scheduler.py: DataUpdateScheduler: initial historical loads + periodic scheduled updates
- config.py: Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- read.py: small ibis + duckdb demonstration/utility
- fix_database.py: script to recreate/reset the DuckDB schema
- reset.py / verify.py: small maintenance scripts that call into the database module
- test.py: ad-hoc test script (manual insert/verification)
- data/: data/motions.db (DuckDB file)
- pyproject.toml: project metadata / dependencies
- .env: environment variables (not printed here)

## Core components

- Streamlit UI (app.py)
  - Presents the voting UI, reads filtered motions from the database, creates sessions, and writes user votes
  - Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(), database.calculate_party_matches(), summarizer.update_motion_summaries()
- Storage (database.py)
  - MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
  - Exposes a module-level instance `db = MotionDatabase()` used across the codebase
  - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches
- Ingestion (api_client.py + scraper.py)
  - api_client.py fetches votes via the Tweede Kamer OData API and groups records into motions
  - scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
  - Both provide structured motion dicts consumed by database.insert_motion()
- Summarization (summarizer.py)
  - Wraps an OpenRouter/OpenAI-compatible client (QWEN via OpenRouter recommended) to produce short layman explanations and persists them to the DB
  - Reads motions without layman_explanation and updates rows
- Orchestration (scheduler.py)
  - Runs initial historical ingestion and schedules periodic updates (using schedule)
  - Calls the API client and summarizer and writes to the database

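To make the party-match responsibility concrete, here is a minimal sketch of how a match percentage could be computed in plain Python. It is not the code in database.py, which works against DuckDB. The function name `party_match_percentages`, the input shapes, and the vote labels are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict


def party_match_percentages(
    user_votes: Dict[str, str],
    party_votes: Dict[str, Dict[str, str]],
) -> Dict[str, float]:
    """Return, per party, the share of motions where the party voted like the user.

    user_votes:  {motion_id: vote}            (hypothetical shape)
    party_votes: {motion_id: {party: vote}}   (hypothetical shape)
    """
    agree: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)

    for motion_id, user_vote in user_votes.items():
        # Only parties that actually voted on this motion are counted.
        for party, party_vote in party_votes.get(motion_id, {}).items():
            total[party] += 1
            if party_vote == user_vote:
                agree[party] += 1

    # Percentage rounded to one decimal; parties with no overlap never appear.
    return {p: round(100.0 * agree[p] / total[p], 1) for p in total}
```

Counting only motions that both the user and the party voted on keeps the percentage from being diluted by motions the user skipped. Whether the real calculation does the same is not stated here.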
## Data flow (high level)

1. Ingestion
   - scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
   - Each produced motion dict is passed to MotionDatabase.insert_motion()
   - insert_motion writes to DuckDB (data/motions.db)
2. Enrichment
   - summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client (OpenRouter/OpenAI-compatible), and writes the summary text back to the DB
3. Presentation / Interaction
   - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
   - Users vote; app.py writes votes into the database via db.update_user_vote()
   - app.py calls db.calculate_party_matches() to compute match percentages for parties

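The ingestion and enrichment steps above can be sketched with in-memory stand-ins. This is an illustration of the flow, not the project's code: `InMemoryMotionStore`, `ingest`, `enrich`, and the `id` / `title` keys are hypothetical, and the real store is DuckDB behind MotionDatabase.

```python
from typing import Callable, Dict, Iterable, List

Motion = Dict[str, object]


class InMemoryMotionStore:
    """Hypothetical stand-in for MotionDatabase, keyed by motion id."""

    def __init__(self) -> None:
        self.motions: Dict[str, Motion] = {}

    def insert_motion(self, motion: Motion) -> None:
        # setdefault keeps an existing row, so re-ingesting does not wipe a summary.
        self.motions.setdefault(str(motion["id"]), dict(motion))

    def motions_without_summary(self) -> List[Motion]:
        return [m for m in self.motions.values() if not m.get("layman_explanation")]


def ingest(store: InMemoryMotionStore, fetch: Callable[[], Iterable[Motion]]) -> int:
    """Step 1: pass every fetched motion dict to the store; return how many were seen."""
    count = 0
    for motion in fetch():
        store.insert_motion(motion)
        count += 1
    return count


def enrich(store: InMemoryMotionStore, summarize: Callable[[str], str]) -> int:
    """Step 2: fill layman_explanation only where it is still missing."""
    updated = 0
    for motion in store.motions_without_summary():
        motion["layman_explanation"] = summarize(str(motion["title"]))
        updated += 1
    return updated
```

Passing `fetch` and `summarize` as callables keeps the sketch free of network and LLM calls, so either the API client or the scraper could be plugged in as the source.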
## External integrations & dependencies

- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
- DuckDB (database file at data/motions.db)
- ibis (read.py demonstrates an ibis.duckdb connection)
- Streamlit for UI
- OpenRouter/OpenAI-compatible LLM client (summarizer.py), configured with environment variables in config.py. Prefer OPENROUTER_API_KEY, with OPENAI_API_KEY as a fallback where appropriate.

## Configuration

- config.py: central Config dataclass. Keys and environment variables referenced across the codebase include:
  - config.DATABASE_PATH (default "data/motions.db")
  - OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
  - QWEN_MODEL (or other model identifier) referenced in summarizer.py
  - API timeout / batch size constants
- .env file present at repo root (do not commit secrets). No .env.example was observed in the repository.
- Packaging metadata: pyproject.toml

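As an illustration of the key fallback and the central dataclass described above, here is a minimal sketch. It is not a copy of config.py: the `from_env` helper, the `API_KEY` and `API_TIMEOUT_SECONDS` field names, the timeout default, and reading DATABASE_PATH from the environment are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _resolve_api_key(env: Mapping[str, str]) -> Optional[str]:
    # Prefer OPENROUTER_API_KEY; fall back to OPENAI_API_KEY when it is unset or empty.
    return env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY") or None


@dataclass(frozen=True)
class Config:
    DATABASE_PATH: str = "data/motions.db"
    API_KEY: Optional[str] = None       # hypothetical field name
    QWEN_MODEL: str = ""                # no model id is assumed here
    API_TIMEOUT_SECONDS: int = 30       # hypothetical name and default

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Config":
        """Build a Config from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        return cls(
            DATABASE_PATH=env.get("DATABASE_PATH", "data/motions.db"),
            API_KEY=_resolve_api_key(env),
            QWEN_MODEL=env.get("QWEN_MODEL", ""),
            API_TIMEOUT_SECONDS=int(env.get("API_TIMEOUT_SECONDS", "30")),
        )
```

Accepting the environment as a parameter makes the fallback easy to check without touching real secrets, for example `Config.from_env({"OPENAI_API_KEY": "k"})`.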
## Build, run & development notes

- Install dependencies via the project's Python packaging (pyproject.toml). No Dockerfile or CI workflows were detected in the repository.
- Use uv add and uv run to manage dependencies in this directory and to run scripts.
- Streamlit app: run `uv run streamlit run app.py` from the project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: call scheduler.run_once() (from a script or via import) for a one-off run, or scheduler.run_scheduler() for periodic ingestion.

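The difference between a one-off run and periodic ingestion can be sketched as follows. This is a dependency-free illustration that uses a plain loop instead of the schedule library used by scheduler.py. The function names mirror the ones mentioned above, but the signatures, the `max_runs` limit, and the injectable `sleep` are assumptions.

```python
import time
from typing import Callable, Dict, Optional


def run_once(ingest: Callable[[], int], summarize: Callable[[], int]) -> Dict[str, int]:
    """One pass: ingest new motions, then summarize those still lacking a summary."""
    return {"ingested": ingest(), "summarized": summarize()}


def run_scheduler(
    job: Callable[[], object],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run job repeatedly, sleeping between runs; max_runs=None loops until interrupted."""
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the final sleep so a bounded run returns immediately after its last job.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

The `max_runs` and `sleep` parameters exist only so the loop can be exercised without waiting; a long-running process would leave both at their defaults.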
## Tests

- There is no pytest or unittest test suite. One ad-hoc script, `test.py`, exists for manual insert verification.

## Notes / caveats

- The project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py, scraper.py). Logging is not centralized; print statements are used instead.

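One possible way to address the error-handling caveat is a shared logger plus narrower exception handling. This is a sketch only: the logger name `motions`, the helper `safe_insert`, and the choice of `KeyError` / `ValueError` as the expected failures are assumptions, not something observed in the codebase.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("motions")


def configure_logging(level: int = logging.INFO) -> None:
    """Call once at startup so every module logs through the same handler."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def safe_insert(insert: Callable[[Dict], None], motion: Dict) -> bool:
    """Insert one motion; log and skip expected data errors, let anything else propagate."""
    try:
        insert(motion)
        return True
    except (KeyError, ValueError) as exc:
        # Expected for malformed records; unexpected exception types are not swallowed.
        logger.warning("Skipping malformed motion %r: %s", motion.get("id"), exc)
        return False
```

Letting unexpected exceptions propagate surfaces real bugs instead of hiding them behind a printed message, while the warning still records which motion was skipped.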
## Where to look first (for contributors)

- app.py: follow the UI flow and see how votes & sessions are used
- database.py: core data model and calculations
- api_client.py: OData ingestion logic
- summarizer.py: LLM usage and environment variables
- scheduler.py: how ingestion is orchestrated over time