ARCHITECTURE

Overview

  • A small Python project that collects, stores, and presents motions from the Dutch House of Representatives (Tweede Kamer). It ingests voting records via the Tweede Kamer OData API, stores motions in a DuckDB file, generates short plain-language summaries with an LLM client, and exposes a Streamlit UI where users vote on motions and see how closely each party matches their votes.

Tech stack

  • Language: Python (single-project repository)
  • Data: DuckDB (file: data/motions.db)
  • Web / UI: Streamlit (app.py, pages/)
  • HTTP: requests (ai_provider.py, api_client.py)
  • LLM: OpenAI-compatible client (ai_provider.py); QWEN via OpenRouter is preferred where possible
  • Analysis: scipy (SVD), scikit-learn (clustering), umap-learn (dimensionality reduction)
  • Visualization: Plotly
  • Packaging: pyproject.toml

Top-level layout (annotated)

./

  • app.py — Streamlit UI entrypoint (Home.py routing)
  • Home.py — Thin wrapper with minimal logic
  • database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
  • api_client.py — TweedeKamerAPI: fetch OData voting records and group into motions
  • summarizer.py — MotionSummarizer: LLM integration to generate layman_explanation
  • config.py — Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
  • ai_provider.py — Lightweight HTTP wrapper around OpenRouter/OpenAI-style backends
  • explorer.py — Explorer page logic, tab routing, SVD visualization
  • explorer_helpers.py — Pure functions for chart builders, coordinate computation
  • data/ — data/motions.db (DuckDB file, ~18GB)
  • pyproject.toml — project metadata / dependencies
  • .env — environment variables (not printed here)

Directory structure

  • pages/ — Streamlit pages: 1_Stemwijzer.py, 2_Explorer.py
  • pipeline/ — Data ingestion pipelines: run_pipeline.py, svd_pipeline.py, text_pipeline.py
  • analysis/ — SVD, clustering, trajectory, visualization modules
  • similarity/ — Embedding-based similarity computation
  • scripts/ — Utility scripts for data processing
  • tests/ — Test suite using pytest
  • migrations/ — SQL migration files

Core components

  • Streamlit UI (app.py + pages/)
    • Presents the voting UI, reads filtered motions from the database, creates sessions, and writes user votes
    • Explorer page (explorer.py) provides SVD visualization and party trajectory analysis
  • Storage (database.py)
    • MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
    • Exposes a module-level instance db = MotionDatabase() used across the codebase
    • Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches
  • Ingestion (api_client.py + pipeline/)
    • api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
    • pipeline/ orchestrates the full ingestion and analysis workflow
  • Summarization (summarizer.py)
    • Wraps an OpenRouter/OpenAI-compatible client (QWEN via OpenRouter recommended) to produce short layman explanations and persists them to DB
    • Reads motions without layman_explanation and updates rows
  • Analysis (analysis/)
    • SVD decomposition of voting patterns
    • UMAP for visualization
    • Clustering for motion grouping
    • Trajectory computation for party movement over time
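
The grouping step in the ingestion component can be sketched as follows. This is a minimal illustration, not the code in api_client.py: the record keys (motion_id, title, party, vote) and the function name group_votes_into_motions are hypothetical stand-ins, since the actual OData field names are not shown in this document.

```python
from typing import Any


def group_votes_into_motions(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse flat per-party vote records into one dict per motion.

    The keys used here are assumptions for illustration only.
    """
    grouped: dict[str, dict[str, Any]] = {}
    for rec in records:
        # Create the motion entry the first time its id is seen.
        motion = grouped.setdefault(
            rec["motion_id"],
            {"id": rec["motion_id"], "title": rec.get("title", ""), "votes": {}},
        )
        # Record how this party voted on the motion.
        motion["votes"][rec["party"]] = rec["vote"]
    return list(grouped.values())


# Made-up sample records: one row per (motion, party) pair.
records = [
    {"motion_id": "m1", "title": "Motion A", "party": "P1", "vote": "for"},
    {"motion_id": "m1", "title": "Motion A", "party": "P2", "vote": "against"},
    {"motion_id": "m2", "title": "Motion B", "party": "P1", "vote": "against"},
]
motions = group_votes_into_motions(records)
```

Each resulting dict has the shape that a function like insert_motion could accept, assuming it takes one motion at a time as the data flow section describes.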

Data flow (high level)

  1. Ingestion
    • Pipeline triggers TweedeKamerAPI.get_motions(...)
    • Each produced motion dict is passed to MotionDatabase.insert_motion()
    • insert_motion writes to DuckDB (data/motions.db)
  2. Enrichment
    • summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client, and writes the summary text back to the DB
  3. Analysis
    • pipeline/svd_pipeline.py computes SVD embeddings from the vote matrix
    • Results are stored in the svd_vectors table for visualization
  4. Presentation / Interaction
    • app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
    • Users vote; app.py writes votes into the database via db.update_user_vote()
    • app.py calls db.calculate_party_matches() to compute match percentages for parties
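
One plausible way to compute match percentages is sketched below. The formula (share of agreeing votes among motions both the user and the party voted on) and the input shapes are assumptions; the actual logic in MotionDatabase.calculate_party_matches is not shown in this document and may differ.

```python
def calculate_party_matches(
    user_votes: dict[str, str],
    party_votes: dict[str, dict[str, str]],
) -> dict[str, float]:
    """Return an agreement percentage per party.

    user_votes maps motion id -> vote; party_votes maps party -> {motion id -> vote}.
    Both shapes are hypothetical and chosen for illustration.
    """
    matches: dict[str, float] = {}
    for party, votes in party_votes.items():
        # Only motions that both the user and the party voted on count.
        shared = [motion_id for motion_id in user_votes if motion_id in votes]
        if not shared:
            continue  # no overlap, so no meaningful percentage
        agree = sum(1 for m in shared if user_votes[m] == votes[m])
        matches[party] = round(100.0 * agree / len(shared), 1)
    return matches


# Made-up example data.
user = {"m1": "for", "m2": "against", "m3": "for"}
parties = {
    "P1": {"m1": "for", "m2": "against", "m3": "against"},
    "P2": {"m1": "against", "m2": "against"},
    "P3": {"m9": "for"},
}
result = calculate_party_matches(user, parties)
```

Parties with no overlapping motions are left out rather than reported as 0%, which avoids presenting a lack of data as disagreement.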

External integrations & dependencies

  • Tweede Kamer OData API (api_client.py)
  • HTTP (requests)
  • DuckDB (database file at data/motions.db)
  • Streamlit for UI
  • OpenRouter/OpenAI-compatible LLM client (ai_provider.py) — configured with environment variables in config.py

Configuration

  • config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
    • config.DATABASE_PATH (default "data/motions.db")
    • OPENROUTER_API_KEY / other OPENROUTER_* variables used by ai_provider.py
    • QWEN_MODEL (or other model identifier) referenced in summarizer.py
    • API timeout / batch size constants
  • .env file present at repo root (do not commit secrets)
  • Packaging metadata: pyproject.toml
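
A central Config dataclass along these lines might look like the sketch below. Only DATABASE_PATH and its default, and the names OPENROUTER_API_KEY and QWEN_MODEL, come from the list above; reading QWEN_MODEL from the environment, the empty-string defaults, and the API_TIMEOUT and BATCH_SIZE names and values are assumptions.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Config:
    """Central configuration, read once from the environment."""

    DATABASE_PATH: str = "data/motions.db"
    # Secrets come from the environment (e.g. loaded from .env), never from code.
    OPENROUTER_API_KEY: str = field(
        default_factory=lambda: os.environ.get("OPENROUTER_API_KEY", "")
    )
    # Model identifier; the empty default is a placeholder, not a real model id.
    QWEN_MODEL: str = field(
        default_factory=lambda: os.environ.get("QWEN_MODEL", "")
    )
    API_TIMEOUT: int = 30  # hypothetical name and value, in seconds
    BATCH_SIZE: int = 50   # hypothetical name and value


config = Config()
```

Using default_factory means the environment is read when Config() is instantiated rather than at import time of the class body, which makes the values easier to override in tests.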

Build, run & development notes

  • Dependencies are declared in pyproject.toml and managed with uv: use uv add to add a dependency and uv run to run scripts. Never use pip directly.
  • Streamlit app: from the project root, start the UI with uv run streamlit run app.py (app.py is the intended web entrypoint)
  • Run tests: uv run pytest tests/

Tests

  • Test suite in tests/ using pytest
  • Run with uv run pytest tests/

Notes / caveats

  • Project is synchronous (no async/await patterns detected)
  • Many modules rely on module-level singletons (e.g., db = MotionDatabase(), summarizer = MotionSummarizer())
  • Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py)
  • Logging is not centralized (print statements used)
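
For contrast with the print-based pattern noted above, a centralized alternative could be sketched with the standard logging module. This is not present in the codebase; get_logger, safe_insert, and the logger name "motief.database" are hypothetical names used only for illustration.

```python
import logging
from typing import Callable


def get_logger(name: str) -> logging.Logger:
    """Return a logger with one stream handler, configured on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger("motief.database")


def safe_insert(insert_fn: Callable[[dict], None], motion: dict) -> bool:
    """Call insert_fn and log expected failures instead of printing them."""
    try:
        insert_fn(motion)
        return True
    except (KeyError, ValueError) as exc:
        # Catch specific exception types rather than a broad Exception.
        log.error("insert failed for motion %s: %s", motion.get("id"), exc)
        return False
```

Catching narrow exception types lets unexpected errors surface instead of being silently swallowed, and routing messages through one logger makes level and format configurable in a single place.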

Where to look first (for contributors)

  • app.py + pages/ — follow the UI flow and see how votes & sessions are used
  • database.py — core data model and calculations
  • explorer.py — SVD visualization and party analysis
  • api_client.py — OData ingestion logic
  • summarizer.py — LLM usage and environment variables
  • pipeline/ — how ingestion and analysis are orchestrated