---
title: ARCHITECTURE
---

## Overview

- A small Python project that collects, stores, and presents motions from the Dutch House of Representatives (Tweede Kamer). It ingests voting records through the Tweede Kamer OData API, stores motions in a DuckDB file, generates short human-readable summaries with an LLM client, and exposes a Streamlit UI where users vote on motions and see how well each party matches their votes.

## Tech stack

- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db)
- Web / UI: Streamlit (app.py, pages/)
- HTTP: requests (ai_provider.py, api_client.py)
- LLM: OpenAI-compatible client (ai_provider.py); QWEN via OpenRouter is the preferred backend.
- Analysis: scipy (SVD), scikit-learn (clustering), umap-learn (dimensionality reduction)
- Visualization: Plotly
- Packaging: pyproject.toml

## Top-level layout (annotated)

./
- app.py               — Streamlit UI entrypoint (Home.py routing)
- Home.py              — Thin wrapper with minimal logic
- database.py          — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py        — TweedeKamerAPI: fetch OData voting records and group into motions
- summarizer.py        — MotionSummarizer: LLM integration to generate layman_explanation
- config.py            — Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- ai_provider.py       — Lightweight HTTP wrapper around OpenRouter/OpenAI-style backends
- explorer.py          — Explorer page logic, tab routing, SVD visualization
- explorer_helpers.py  — Pure functions for chart builders, coordinate computation
- data/                — data/motions.db (DuckDB file, ~18GB)
- pyproject.toml       — project metadata / dependencies
- .env                 — environment variables (not printed here)

## Directory structure

- `pages/`             — Streamlit pages: 1_Stemwijzer.py, 2_Explorer.py
- `pipeline/`          — Data ingestion pipelines: run_pipeline.py, svd_pipeline.py, text_pipeline.py
- `analysis/`          — SVD, clustering, trajectory, visualization modules
- `similarity/`        — Embedding-based similarity computation
- `scripts/`           — Utility scripts for data processing
- `tests/`             — Test suite using pytest
- `migrations/`        — SQL migration files

## Core components

- Streamlit UI (app.py + pages/)
  - Presents the voting UI, reads filtered motions from the database, creates sessions, and writes user votes
  - Explorer page (explorer.py) provides SVD visualization and party trajectory analysis
- Storage (database.py)
  - MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
  - Exposes a module-level instance `db = MotionDatabase()` used across the codebase
  - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote, calculate_party_matches
- Ingestion (api_client.py + pipeline/)
  - api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
  - pipeline/ orchestrates the full ingestion and analysis workflow
- Summarization (summarizer.py)
  - Wraps an OpenRouter/OpenAI-compatible client (QWEN via OpenRouter recommended) to produce short layman explanations and persist them to the database
  - Reads motions that have no layman_explanation yet and updates those rows
- Analysis (analysis/)
  - SVD decomposition of voting patterns
  - UMAP for visualization
  - Clustering for motion grouping
  - Trajectory computation for party movement over time
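
As a rough illustration of the SVD step in the analysis component, the sketch below decomposes a tiny party-by-motion vote matrix and derives 2D coordinates per party. The matrix values, the +1/-1/0 vote encoding, and the party labels are assumptions made for this example; the modules in `analysis/` may encode votes and pick the number of components differently. It uses `numpy.linalg.svd` for brevity rather than the scipy routine listed in the tech stack.

```python
import numpy as np

# Hypothetical vote matrix: rows are parties, columns are motions.
# Encoding assumed for illustration: +1 = for, -1 = against, 0 = absent.
party_labels = ["A", "B", "C", "D"]
votes = np.array(
    [
        [1, 1, -1, -1, 1],
        [1, 1, -1, -1, -1],
        [-1, -1, 1, 1, 1],
        [-1, -1, 1, 1, -1],
    ],
    dtype=float,
)


def party_coordinates(matrix: np.ndarray, k: int = 2) -> np.ndarray:
    """Return k-dimensional party coordinates from a truncated SVD."""
    # Center each motion (column) so components capture disagreement
    # between parties rather than the overall outcome of a motion.
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, _vt = np.linalg.svd(centered, full_matrices=False)
    # Scale the left singular vectors by their singular values.
    return u[:, :k] * s[:k]


coords = party_coordinates(votes, k=2)
for label, (x, y) in zip(party_labels, coords):
    print(f"{label}: ({x:+.2f}, {y:+.2f})")
```

In this toy input, parties A and B land on one side of the first axis and C and D on the other. The sign of each axis is arbitrary, so only relative positions carry meaning.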

## Data flow (high level)

1.  Ingestion
    - Pipeline triggers TweedeKamerAPI.get_motions(...)
    - Each produced motion dict is passed to MotionDatabase.insert_motion()
    - insert_motion writes to DuckDB (data/motions.db)
2.  Enrichment
    - summarizer.update_motion_summaries() reads motions lacking layman_explanation, calls the LLM client, and writes the summary text back to the DB
3.  Analysis
    - pipeline/svd_pipeline.py computes SVD embeddings from the vote matrix
    - Results are stored in the svd_vectors table for visualization
4.  Presentation / Interaction
    - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
    - Users vote; app.py writes votes into the database via db.update_user_vote()
    - app.py calls db.calculate_party_matches() to compute match percentages for parties
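
The match step at the end of this flow can be pictured as comparing a user's votes with each party's votes on the same motions. The sketch below is a plain-Python illustration under assumed data shapes: `user_votes` and `party_votes` are hypothetical dictionaries, and the vote values `"voor"` / `"tegen"` are placeholders. It is not the query or logic actually used by `db.calculate_party_matches()`.

```python
from typing import Dict


def match_percentages(
    user_votes: Dict[int, str],
    party_votes: Dict[str, Dict[int, str]],
) -> Dict[str, float]:
    """Percentage of shared motions on which each party agrees with the user."""
    results: Dict[str, float] = {}
    for party, votes in party_votes.items():
        # Only motions that both the user and the party voted on are counted.
        shared = [motion_id for motion_id in user_votes if motion_id in votes]
        if not shared:
            continue  # no overlap, so no meaningful percentage
        agree = sum(1 for m in shared if user_votes[m] == votes[m])
        results[party] = round(100.0 * agree / len(shared), 1)
    return results


# Hypothetical sample data: motion id -> vote.
user = {1: "voor", 2: "tegen", 3: "voor"}
parties = {
    "Party A": {1: "voor", 2: "tegen", 3: "tegen"},
    "Party B": {1: "tegen", 2: "voor"},
    "Party C": {9: "voor"},
}
matches = match_percentages(user, parties)
print(matches)
```

Parties with no motions in common with the user are left out of the result here; whether the real implementation does the same is not stated in this document.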

## External integrations & dependencies

- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- DuckDB (database file at data/motions.db)
- Streamlit for UI
- OpenRouter/OpenAI-compatible LLM client (ai_provider.py) — configured with environment variables in config.py
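
To make the ingestion side of the OData integration more concrete, the sketch below groups flat per-party voting records into one dictionary per motion. The record fields (`motion_id`, `party`, `vote`) and the output shape are hypothetical simplifications; the real OData payload and the dictionaries built by `TweedeKamerAPI` may use different names and carry more fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_votes_into_motions(records: Iterable[dict]) -> List[dict]:
    """Collapse one-row-per-party voting records into one dict per motion."""
    votes_by_motion: Dict[str, Dict[str, str]] = defaultdict(dict)
    for record in records:
        # Field names are assumed for this example, not taken from the API.
        votes_by_motion[record["motion_id"]][record["party"]] = record["vote"]
    return [
        {"motion_id": motion_id, "party_votes": party_votes}
        for motion_id, party_votes in votes_by_motion.items()
    ]


# Hypothetical flat records, as they might look after fetching.
sample_records = [
    {"motion_id": "m-1", "party": "Party A", "vote": "voor"},
    {"motion_id": "m-1", "party": "Party B", "vote": "tegen"},
    {"motion_id": "m-2", "party": "Party A", "vote": "tegen"},
]
motions = group_votes_into_motions(sample_records)
print(motions)
```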

## Configuration

- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
  - config.DATABASE_PATH (default "data/motions.db")
  - OPENROUTER_API_KEY / other OPENROUTER_* variables used by ai_provider.py
  - QWEN_MODEL (or other model identifier) referenced in summarizer.py
  - API timeout / batch size constants
- .env file present at repo root (do not commit secrets)
- Packaging metadata: pyproject.toml
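
A central configuration object of this kind is often written as a dataclass whose defaults are read from the environment. The sketch below is an assumption about the general shape, not a copy of `config.py`. Only `DATABASE_PATH`, `OPENROUTER_API_KEY`, and `QWEN_MODEL` appear in this document; `API_TIMEOUT`, `BATCH_SIZE`, the environment variable names for them, and every default other than the database path are placeholders.

```python
import os
from dataclasses import dataclass, field


def _env(name: str, default: str = "") -> str:
    """Read an environment variable with a fallback value."""
    return os.environ.get(name, default)


@dataclass(frozen=True)
class Config:
    # The default path comes from this document; reading it from the
    # environment is an assumption.
    DATABASE_PATH: str = field(
        default_factory=lambda: _env("DATABASE_PATH", "data/motions.db")
    )
    OPENROUTER_API_KEY: str = field(
        default_factory=lambda: _env("OPENROUTER_API_KEY")
    )
    # Placeholder default; the real model identifier is not given here.
    QWEN_MODEL: str = field(
        default_factory=lambda: _env("QWEN_MODEL", "placeholder-model-id")
    )
    # Hypothetical names and values for the timeout / batch size constants.
    API_TIMEOUT: int = field(
        default_factory=lambda: int(_env("API_TIMEOUT", "30"))
    )
    BATCH_SIZE: int = field(
        default_factory=lambda: int(_env("BATCH_SIZE", "50"))
    )


config = Config()
print(config.DATABASE_PATH)
```

Using `default_factory` means the environment is read when `Config()` is instantiated rather than when the module is imported, which keeps the values overridable in tests.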

## Build, run & development notes

- Dependencies are declared in pyproject.toml and managed with uv: use `uv add` to add a dependency and `uv run` to run scripts in this directory. Never use pip directly.
- Streamlit app: run `uv run streamlit run app.py` from the project root to start the UI (app.py is the intended web entrypoint)
- Run tests: `uv run pytest tests/`

## Tests

- Test suite in `tests/` using pytest
- Run with `uv run pytest tests/`

## Notes / caveats

- Project is synchronous (no async/await patterns detected)
- Many modules rely on module-level singletons (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`)
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py)
- Logging is not centralized (print statements used)

## Where to look first (for contributors)

- app.py + pages/     — follow the UI flow and see how votes & sessions are used
- database.py          — core data model and calculations
- explorer.py          — SVD visualization and party analysis
- api_client.py        — OData ingestion logic
- summarizer.py        — LLM usage and environment variables
- pipeline/            — how ingestion and analysis are orchestrated