feat(pipeline): implement parliamentary embedding pipeline MVP

- Add 4 migration files: mp_votes, mp_metadata, svd_vectors, fused_embeddings
- Extend database.py with 5 new helper methods and table init
- Add pipeline/ package: extract_mp_votes, fetch_mp_metadata, text_pipeline,
  svd_pipeline (with Procrustes alignment), fusion
- Add full test suite (17 tests) covering all pipeline modules and migrations
- Fix Procrustes alignment bug: scipy scale is a norm value, not a multiplier
- Fix DuckDB date type handling in test assertions (datetime.date vs string)
- Remove duckdb.py shim; tests now run against real duckdb + scipy via uv

Ref: thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md
Branch: main
Author: Sven Geboers (1 month ago)
Parent: c498c3467e
Commit: a36e6cba4e
Changed files (68) — lines added per file:

    38  .drone.yml
    10  .gitignore
     1  .python-version
   126  ARCHITECTURE.md
   118  CODE_STYLE.md
    36  Dockerfile
    90  EMBEDDING_ANALYSIS.md
     0  README.md
   188  ai_provider.py
   389  api_client.py
   310  app.py
    51  config.py
   582  database.py
    20  docker-compose.yml
    72  docs/admin/recompute_similarity.md
    67  fix_database.py
     6  main.py
    11  migrations/2026-03-19-add-embeddings.sql
     6  migrations/2026-03-20-add-body-text.sql
    24  migrations/2026-03-22-add-audit-events.sql
    15  migrations/2026-03-22-add-similarity-cache.sql
    13  migrations/2026_03_21__create_fused_embeddings.sql
     9  migrations/2026_03_21__create_mp_metadata.sql
    13  migrations/2026_03_21__create_mp_votes.sql
    13  migrations/2026_03_21__create_svd_vectors.sql
     0  pipeline/__init__.py
    75  pipeline/extract_mp_votes.py
    94  pipeline/fetch_mp_metadata.py
   116  pipeline/fusion.py
   206  pipeline/svd_pipeline.py
   122  pipeline/text_pipeline.py
    18  pyproject.toml
     9  read.py
     3  reset.py
   264  scheduler.py
   183  scraper.py
   128  scripts/compute_test_batch.py
    35  src/types/motion_types.py
   101  summarizer.py
    16  test.py
     1  tests/__init__.py
    63  tests/conftest.py
     1  tests/fixtures/__init__.py
    40  tests/fixtures/sample_voting_results.json
     0  tests/integration/__init__.py
    87  tests/integration/test_pipeline_end_to_end.py
    58  tests/migrations/test_2026_03_22_add_audit_events.py
    85  tests/migrations/test_2026_03_22_add_similarity_cache.py
    29  tests/migrations/test_migration_fixtures_smoke.py
    49  tests/test_ai_provider.py
    74  tests/test_extract_mp_votes.py
   103  tests/test_fetch_mp_metadata.py
    79  tests/test_fusion.py
    31  tests/test_migration_embeddings.py
   219  tests/test_migration_pipeline_tables.py
     5  tests/test_pyproject_deps.py
    63  tests/test_svd_pipeline.py
    80  tests/test_text_pipeline.py
    22  tests/types/test_motion_types.py
    66  tests/utils/migration_fixtures.py
    50  thoughts/ledgers/CONTINUITY_stemwijzer.md
    98  thoughts/shared/designs/2026-03-19-stemwijzer-design.md
   116  thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md
   335  thoughts/shared/plans/2026-03-21-motions-guided-explorer-plan.md
   106  thoughts/thoughts/shared/plans/2026-03-19-stemwijzer-plan.md
   129  tools/query_tk_api.py
  1246  uv.lock
     9  verify.py

.drone.yml
@@ -0,0 +1,38 @@
kind: pipeline
type: docker
name: default
steps:
- name: build
image: docker:24.0.2
environment:
DOCKER_BUILDKIT: "1"
commands:
- docker build -t ${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} .
- docker tag ${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
- name: push
image: docker:24.0.2
commands:
- echo "Logging into registry"
- docker login -u ${DOCKER_USERNAME} -p ${DOCKER_PASSWORD} ${DOCKER_REGISTRY}
- docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA}
- docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
- name: deploy
image: appleboy/drone-ssh
settings:
host: ${DEPLOY_HOST}
port: ${DEPLOY_SSH_PORT}
username: ${DEPLOY_USER}
password: ${DEPLOY_PASSWORD}
script: |
set -e
cd /srv/stemwijzer
docker pull ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
docker-compose pull
docker-compose up -d
trigger:
branch:
- main

.gitignore
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info
# Virtual environments
.venv

ARCHITECTURE.md
@@ -0,0 +1,126 @@
ARCHITECTURE
============
Overview
--------
- Small Python project that collects, stores, and presents Dutch parliamentary motions (Tweede Kamer). It
ingests votes (via the OData API or HTML scraping), stores motions in a DuckDB file, generates short
plain-language summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results.
Tech stack
----------
- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db), ibis used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
- HTTP: requests
- HTML parsing: BeautifulSoup (scraper.py)
- Scheduling: schedule (scheduler.py)
- LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config)
- Packaging: pyproject.toml present
Top-level layout (annotated)
----------------------------
./
- app.py — Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py — minimal CLI entry / small script
- database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py — TweedeKamerAPI: fetch OData voting records and group into motions
- scraper.py — MotionScraper: HTML fallback scraper for motion pages
- summarizer.py — MotionSummarizer: LLM integration to generate layman_explanation
- scheduler.py — DataUpdateScheduler: initial historical loads + periodic scheduled updates
- config.py — Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- read.py — small ibis + duckdb demonstration/utility
- fix_database.py — script to recreate/reset DuckDB schema
- reset.py / verify.py — small maintenance scripts that call into database module
- test.py — ad-hoc test script (manual insert/verification)
- data/ — data/motions.db (DuckDB file)
- pyproject.toml — project metadata / dependencies
- .env — environment variables (not printed here)
Core components
---------------
- Streamlit UI (app.py)
- Presents the voting UI, reads filtered motions from database, creates sessions, writes user votes
- Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(),
database.calculate_party_matches(), summarizer.update_motion_summaries()
- Storage (database.py)
- MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
- Exposes a module-level instance `db = MotionDatabase()` used across the codebase
- Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote,
calculate_party_matches
- Ingestion (api_client.py + scraper.py)
- api_client.py fetches votes via Tweede Kamer OData API and groups records into motions
- scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
- Both provide structured motion dicts consumed by database.insert_motion()
- Summarization (summarizer.py)
- Wraps an OpenAI-compatible client to produce short layman explanations and persists them to DB
- Reads motions without layman_explanation and updates rows
- Orchestration (scheduler.py)
- Runs initial historical ingestion and schedules periodic updates (using schedule)
- Calls API client and summarizer and writes to the database
Data flow (high level)
----------------------
1. Ingestion
- scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
- Each produced motion dict is passed to MotionDatabase.insert_motion()
- insert_motion writes to DuckDB (data/motions.db)
2. Enrichment
- summarizer.update_motion_summaries() reads motions lacking layman_explanation,
calls the LLM client (openai.OpenAI) and writes summary text back to the DB
3. Presentation / Interaction
- app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
- Users vote; app.py writes votes into the database via db.update_user_vote()
- app.py calls db.calculate_party_matches() to compute match percentages for parties
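A condensed sketch of this flow using the module-level singletons named above — a one-off manual run for illustration, not the scheduler's actual code:
```
from api_client import TweedeKamerAPI
from database import db
from summarizer import summarizer

api = TweedeKamerAPI()

# 1. Ingestion: fetch grouped motions and store them
for motion in api.get_motions(limit=100):
    db.insert_motion(motion)

# 2. Enrichment: fill in missing layman explanations via the LLM client
summarizer.update_motion_summaries()

# 3. Presentation: app.py then reads db.get_filtered_motions() and records votes
```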
External integrations & dependencies
-----------------------------------
- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
- DuckDB (database file at data/motions.db)
- ibis (read.py demonstrates an ibis.duckdb connection)
- Streamlit for UI
- OpenAI-compatible LLM client (summarizer.py) — configured with environment variables in config.py
Configuration
-------------
- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
- config.DATABASE_PATH (default "data/motions.db")
- OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
- QWEN_MODEL (or other model identifier) referenced in summarizer.py
- API timeout / batch size constants
- A .env file is present at the repo root (do not commit secrets). No .env.example was observed in the repository.
- Packaging metadata: pyproject.toml
Build, run & development notes
------------------------------
- Install dependencies via the project's Python packaging (pyproject.toml). A Dockerfile and a Drone CI
pipeline (.drone.yml) are included for container builds and deployment.
- Streamlit app: run `streamlit run app.py` from project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: run scheduler.run_once() (script or import) or run scheduler.run_scheduler() for periodic ingestion.
Tests
-----
- A pytest suite lives under tests/ (unit, migration, and integration tests). The ad-hoc script `test.py` also remains for manual insert verification.
Notes / caveats
----------------
- Project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons
(e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py,
scraper.py). Logging is not centralized (print statements used).
Where to look first (for contributors)
-------------------------------------
- app.py — follow the UI flow and see how votes & sessions are used
- database.py — core data model and calculations
- api_client.py — OData ingestion logic
- summarizer.py — LLM usage and environment variables
- scheduler.py — how ingestion is orchestrated over time

CODE_STYLE.md
@@ -0,0 +1,118 @@
CODE STYLE
==========
Purpose
-------
This document records the conventions already in use in the codebase so new contributors and AI
agents can produce code that fits the repository's existing style.
General
-------
- Language: Python (3.x)
- Project uses one file-per-module with descriptive snake_case filenames (e.g., api_client.py, database.py)
- Top-level module singletons are exposed when a single shared instance is desired (e.g. `db = MotionDatabase()`)
- Keep code synchronous unless you introduce async consistently across modules (none currently use async/await)
Naming
------
- Files / modules: snake_case.py (e.g., motion_scraper -> scraper.py, api_client.py)
- Classes: PascalCase (e.g., MotionDatabase, MotionSummarizer, TweedeKamerAPI)
- Functions and methods: snake_case (including private helpers with a single leading underscore)
- Constants / config fields: UPPER_SNAKE_CASE (placed in config.py and referenced via `from config import config`)
File organization
-----------------
- Keep top-level domain modules in the repository root (this repo uses a flat layout)
- Each module should contain one primary responsibility (e.g., database.py for DB logic)
- Module-level singletons: create at module bottom and import from other modules (pattern used widely)
Imports
-------
- Group imports in this order with a blank line between groups:
1. Standard library (datetime, json, typing)
2. Third-party libraries (requests, duckdb, ibis, streamlit)
3. Local imports (from config import config, from database import db)
- Use absolute imports (module name) rather than relative imports
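For instance, a module header following that ordering could look like this (names taken from modules already in this repo):
```
from datetime import datetime
from typing import Dict, List

import duckdb
import requests

from config import config
from database import db
```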
Typing
------
- Add type hints to public function signatures where helpful (project uses typing in several places).
- Use typing.Dict, typing.List, typing.Optional for simple container annotations.
Error handling & logging
------------------------
- Current pattern: functions catch broad Exception and print error messages, then return a safe default
(False, [], None). Examples in database.py and api_client.py.
- When updating code, prefer to:
- Keep the existing behavior (return safe fallback) to avoid breaking call sites
- Consider adding structured logging (use logging module) rather than print, but maintain similar
high-level error flows unless refactoring intentionally.
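A minimal sketch of that preference — the same safe-fallback flow, but using the logging module instead of print (the function and its `_do_query` helper are hypothetical):
```
import logging

logger = logging.getLogger(__name__)

def load_items() -> list:
    try:
        return _do_query()  # hypothetical helper doing the actual work
    except Exception:
        # keep the existing safe fallback, but log with a stack trace instead of printing
        logger.exception("Error loading items")
        return []
```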
LLM / external API calls
------------------------
- OpenAI-compatible client usage is in summarizer.py. Environment variables are read from config.py.
- Do NOT commit API keys or secrets. Use environment variables (OPENROUTER_API_KEY, etc.) and
reference them by name.
- Network calls are synchronous using requests. Keep request timeouts and error handling consistent with
existing patterns (catch requests.exceptions.RequestException and return safe fallback values).
Database patterns
-----------------
- Database is DuckDB stored at data/motions.db. The MotionDatabase class opens short-lived duckdb
connections inside methods (conn = duckdb.connect(self.db_path)). This pattern is used widely.
- Queries and schema initialization happen inside MotionDatabase._init_database(). Keep DDL grouped there.
- When writing methods that modify DB, follow the try/except + conn.close() pattern to guarantee cleanup.
Testing
-------
- Currently the project uses ad-hoc test scripts (test.py). If adding tests, follow pytest conventions:
- Place tests in tests/ directory
- Use filenames test_*.py and functions test_* with assertions
- Mock external APIs (requests, LLM client) via monkeypatch or unittest.mock
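A hypothetical test following these conventions, mocking the HTTP session so no real request is made:
```
# tests/test_api_client.py (illustrative only)
from unittest.mock import MagicMock

from api_client import TweedeKamerAPI

def test_get_motions_returns_empty_list_on_api_error(monkeypatch):
    api = TweedeKamerAPI()
    # Replace the requests.Session so the call fails without touching the network;
    # get_motions catches the error and returns its safe fallback.
    monkeypatch.setattr(api, "session", MagicMock(get=MagicMock(side_effect=Exception("boom"))))
    assert api.get_motions() == []
```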
Patterns observed (use these when adding new code)
-----------------------------------------------
- Singletons: expose module-level instance (e.g. `db = MotionDatabase()`), import it elsewhere
- Private helpers: name with a single leading underscore (e.g., _get_voting_records)
- Config: centralize in config.py and reference via `from config import config` (don't hardcode paths)
Do's and Don'ts
---------------
Do:
- Follow existing naming: snake_case for files/functions
- Add simple type hints for clarity
- Return the same safe fallback values used in existing functions on error
- Use module-level singletons for shared services if helpful
Don't:
- Don't add async/await in a single module without broader coordination
- Don't print secret values or commit .env files
- Don't create circular imports (be careful when modules instantiate singletons at import time)
Example snippets
----------------
Conformant class and method:
```
import typing

import duckdb

from config import config


class ExampleService:
    def __init__(self, param: str = config.DATABASE_PATH):
        self.param = param

    def do_work(self, items: typing.List[dict]) -> bool:
        try:
            # short-lived DB/HTTP usage
            conn = duckdb.connect(config.DATABASE_PATH)
            # ... perform work
            conn.close()
            return True
        except Exception as e:
            print(f"Error in do_work: {e}")
            if "conn" in locals():
                conn.close()
            return False
```
Adding a new module
-------------------
1. Create snake_case file (e.g., new_service.py)
2. Add a PascalCase class implementing the behavior and small helper functions prefixed with _
3. If you need a shared instance, create `service = NewService()` at the module bottom
4. Import via `from new_service import service` in other modules
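Put together, a hypothetical `new_service.py` skeleton following steps 1-4:
```
from typing import List

from config import config


class NewService:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path

    def _load_items(self) -> List[dict]:
        # private helper, single leading underscore
        return []

    def run(self) -> bool:
        try:
            self._load_items()
            return True
        except Exception as e:
            print(f"Error in run: {e}")
            return False


# shared instance, imported elsewhere via `from new_service import service`
service = NewService()
```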

Dockerfile
@@ -0,0 +1,36 @@
FROM python:3.13-slim
# Install minimal system deps
RUN apt-get update \
&& apt-get install -y --no-install-recommends build-essential curl ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user for running the app
RUN useradd -m -s /bin/bash app
WORKDIR /home/app/app
# Copy project files
COPY . /home/app/app
# Upgrade pip and install either pinned requirements or runtime defaults
RUN python -m pip install --upgrade pip
RUN if [ -f requirements.txt ]; then \
pip install -r requirements.txt; \
else \
pip install uv streamlit duckdb; \
fi
# Fix permissions
RUN chown -R app:app /home/app
USER app
ENV PYTHONPATH=/home/app/app
EXPOSE 8501
# Simple healthcheck that queries the Streamlit root
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s CMD curl -f http://localhost:8501/ || exit 1
# Run the Streamlit app via uv as preferred in this project
CMD ["uv", "run", "streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

EMBEDDING_ANALYSIS.md
@@ -0,0 +1,90 @@
# Tweede Kamer Parliamentary Embedding Analysis
## Goal
Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.
## Data
|Source|Content|
|------|-------|
|MP × motion vote matrix|yes / no / abstain per MP per motion|
|Motion text|Dutch-language motion descriptions|
|MP metadata|name, party, entry/exit dates|
|Timestamps|date of each vote|
## Approach: Late Fusion
Two independent embedding signals, combined per motion.
### 1. Vote embeddings (SVD)
- Build a sparse MP × motion matrix per time window
- Apply SVD to get latent vectors for both MPs and motions
- Encodes political alignment from actual voting behavior
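A minimal sketch of this step, assuming votes are encoded as +1 (voor) / -1 (tegen) / 0 (abstain or absent) and using `scipy.sparse.linalg.svds`; names and shapes are illustrative, not the exact `svd_pipeline` API:
```
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(votes, n_mps, n_motions, k=16):
    # votes: iterable of (mp_index, motion_index, value) triples
    rows, cols, vals = zip(*votes)
    matrix = csr_matrix((vals, (rows, cols)), shape=(n_mps, n_motions), dtype=float)
    u, s, vt = svds(matrix, k=k)          # k must be < min(n_mps, n_motions)
    mp_vectors = u * s                    # (n_mps, k) latent MP positions
    motion_vectors = vt.T * s             # (n_motions, k) latent motion positions
    return mp_vectors, motion_vectors
```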
### 2. Text embeddings (Qwen3-0.6B)
- Embed each motion's text using Qwen3-0.6B (multilingual, Dutch supported)
- Encodes semantic/policy topic of the motion
- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
### 3. Fusion
Concatenate (or weighted sum) the SVD motion vector and text vector into a single motion embedding. MPs retain their SVD vectors only.
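A concatenation-based fusion sketch (the weights are illustrative knobs, not values the pipeline necessarily uses):
```
import numpy as np

def fuse(svd_vec, text_vec, svd_weight=1.0, text_weight=1.0):
    # Late fusion by concatenation; the weights let either signal dominate downstream distances.
    return np.concatenate([svd_weight * np.asarray(svd_vec), text_weight * np.asarray(text_vec)])
```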
## Temporal Tracking
### Time windows
- Default: **quarterly** (flexible — can be per half-year or per N votes)
- Adaptive option: fixed number of votes per window (e.g. 200) for stable SVD regardless of parliamentary rhythm
### Procrustes alignment
SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous, using overlapping MPs as anchors.
```
R = argmin || W1[common] - W2[common] @ R ||
W2_aligned = W2 @ R # applied to all MPs, including newcomers
```
- Only overlapping MPs are needed to estimate R
- New MPs are placed into the aligned space via their voting pattern
- High Procrustes disparity score = structural political shift, not just individual drift
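One way to estimate R with SciPy is `scipy.linalg.orthogonal_procrustes` (a sketch under assumed shapes; the actual `svd_pipeline` code may differ). Note that its second return value is a norm (sum of singular values), not a scale multiplier — the bug called out in the commit message:
```
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_windows(w1_common, w2_common, w2_all):
    # w1_common, w2_common: (n_overlapping_mps, k) vectors for MPs present in both windows
    # w2_all: (n_mps, k) all window-2 vectors, including newcomers
    R, _norm = orthogonal_procrustes(w2_common, w1_common)  # minimizes ||w2_common @ R - w1_common||
    return w2_all @ R
```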
### Election transitions
At term boundaries (~60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and first quarter of the new term, using only returning MPs.
## Analysis
|Question|Method|
|--------|------|
|MP drift over time|trajectory of MP vector across aligned windows|
|Political axis|first SVD component, or defined by anchor parties (e.g. VVD vs SP)|
|Swing voters|MPs closest to the boundary between party clusters|
|Thematic clustering|UMAP on fused motion embeddings|
|Cross-party coalitions|motions where party cluster boundaries blur|
|Party cohesion|variance of MP vectors within a party per window|
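For example, party cohesion per window could be computed as within-party variance of aligned MP vectors (a sketch with assumed inputs):
```
import numpy as np

def party_cohesion(mp_vectors, mp_party):
    # mp_vectors: {mp_name: aligned vector}, mp_party: {mp_name: party}
    by_party = {}
    for mp, vec in mp_vectors.items():
        by_party.setdefault(mp_party[mp], []).append(np.asarray(vec))
    # lower total variance = tighter voting bloc in this window
    return {p: float(np.stack(vs).var(axis=0).sum()) for p, vs in by_party.items() if len(vs) > 1}
```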
## Stack
|Component|Tool|
|---------|----|
|Matrix factorization|`scipy.sparse.linalg.svds`|
|Procrustes alignment|`scipy.spatial.procrustes`|
|Text embeddings|Qwen3-0.6B via `sentence-transformers` or vLLM|
|Dimensionality reduction|UMAP|
|Visualization|Plotly (interactive trajectories)|
|Data handling|ibis / pandas|

ai_provider.py
@@ -0,0 +1,188 @@
"""Thin AI provider adapter for OpenRouter-compatible backends.
Provides simple helpers for embeddings and chat completions using requests.
This module is intentionally small and dependency-light to make testing easy.
"""
from __future__ import annotations
import os
import time
import random
from typing import Any
import requests
class ProviderError(Exception):
"""Terminal provider error (non-retryable or configuration issues)."""
def _get_base_url() -> str:
# Support multiple env var names and fall back to OpenRouter default
return os.environ.get(
"OPENROUTER_URL",
os.environ.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
)
def _get_api_key() -> str:
# Accept several common env var names for convenience
for name in ("OPENROUTER_API_KEY", "OPENROUTER_KEY", "OPENAI_API_KEY", "API_KEY"):
key = os.environ.get(name)
if key:
return key
raise ProviderError(
"OPENROUTER_API_KEY (or OPENAI_API_KEY) environment variable is required"
)
def _post_with_retries(
path: str, json: dict[str, Any], retries: int = 3
) -> requests.Response:
"""POST to the provider with a small retry/backoff for transient errors.
Retries on network errors (requests.ConnectionError) and 5xx responses.
"""
url = _get_base_url().rstrip("/") + path
headers = {
"Authorization": f"Bearer {_get_api_key()}",
"Content-Type": "application/json",
}
backoff = 0.5
for attempt in range(1, retries + 1):
try:
resp = requests.post(url, json=json, headers=headers, timeout=10)
except requests.ConnectionError as exc:
if attempt == retries:
raise ProviderError(
f"Connection error when calling provider: {exc}"
) from exc
sleep = backoff * (2 ** (attempt - 1))
sleep = sleep + random.uniform(0, sleep * 0.1)
time.sleep(sleep)
continue
# Treat 5xx as transient
if 500 <= getattr(resp, "status_code", 0) < 600:
if attempt == retries:
raise ProviderError(f"Provider returned HTTP {resp.status_code}")
sleep = backoff * (2 ** (attempt - 1))
sleep = sleep + random.uniform(0, sleep * 0.1)
time.sleep(sleep)
continue
return resp
# Should not reach here
raise ProviderError("Failed to call provider after retries")
def get_embedding(text: str, model: str | None = None) -> list[float]:
"""Return an embedding vector for `text` using the configured provider.
Raises ProviderError for configuration or provider-side failures.
"""
if not isinstance(text, str):
raise ProviderError("text must be a string")
# Resolve model: prefer explicit arg, then env vars, then sensible Qwen default
if model is None:
model = (
os.environ.get("EMBEDDING_MODEL")
or os.environ.get("QWEN_EMBEDDING_MODEL")
or "qwen/qwen3-embedding-4b"
)
resp = _post_with_retries("/embeddings", json={"model": model, "input": text})
try:
data = resp.json()
except Exception as exc:
raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc
# Expecting {"data": [{"embedding": [...]}, ...]}
try:
embedding = data["data"][0]["embedding"]
except Exception as exc:
# If provider returns an error JSON, allow a local fallback when explicitly enabled
fallback = os.environ.get("ALLOW_LOCAL_EMBED_FALLBACK", "false").lower() in (
"1",
"true",
"yes",
)
if fallback:
# choose fallback dim via env or default
dim = int(os.environ.get("LOCAL_EMBED_DIM", "64"))
return _local_embedding(text, dim=dim)
raise ProviderError(f"Unexpected embedding response shape: {data}") from exc
if not isinstance(embedding, list):
raise ProviderError("Embedding is not a list")
return [float(x) for x in embedding]
def _local_embedding(text: str, dim: int = 64) -> list[float]:
"""Deterministic local fallback embedding based on SHA256.
Returns a list of `dim` floats in range [-1, 1]. Not semantically rich but useful
for local testing when provider embeddings are unavailable.
"""
import hashlib
h = hashlib.sha256(text.encode("utf8")).digest()
values = []
i = 0
# Expand digest if needed
while len(values) < dim:
# take 8 bytes -> 64-bit int
chunk = h[i % len(h) : (i % len(h)) + 8]
if len(chunk) < 8:
chunk = chunk.ljust(8, b"\0")
val = int.from_bytes(chunk, "big", signed=False)
# normalize to [-1,1]
valscale = (val / (2**64 - 1)) * 2.0 - 1.0
values.append(valscale)
i += 1
# re-hash occasionally to get more entropy
if i % (len(h) // 2 + 1) == 0:
h = hashlib.sha256(h + chunk).digest()
return values[:dim]
def chat_completion(messages: list[dict], model: str | None = None) -> str:
"""Return the assistant's content string for a chat completion request.
messages should be a list of dicts like {"role": "user", "content": "..."}.
"""
if not isinstance(messages, list):
raise ProviderError("messages must be a list of dicts")
# Resolve chat model: prefer explicit arg, then env var QWEN_MODEL, then a sensible default
if model is None:
model = (
os.environ.get("QWEN_MODEL")
or os.environ.get("CHAT_MODEL")
or "qwen/qwen-3.2"
)
resp = _post_with_retries(
"/chat/completions", json={"model": model, "messages": messages}
)
try:
data = resp.json()
except Exception as exc:
raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc
# Expecting {"choices": [{"message": {"content": "..."}}]}
try:
content = data["choices"][0]["message"]["content"]
except Exception as exc:
raise ProviderError(
f"Unexpected chat completion response shape: {data}"
) from exc
return str(content)

api_client.py
@@ -0,0 +1,389 @@
# api_client.py (complete updated version)
import requests
import json
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from config import config
import time
from collections import defaultdict
class TweedeKamerAPI:
def __init__(self):
self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
self.session = requests.Session()
self.session.headers.update(
{
"Accept": "application/json",
"User-Agent": "Dutch-Political-Compass-Tool/1.0",
}
)
def get_motions(
self, start_date: datetime = None, end_date: datetime = None, limit: int = 500
) -> List[Dict]:
"""Get motions with voting results using OData API"""
if not start_date:
start_date = datetime.now() - timedelta(days=730) # 2 years ago
try:
# Get voting records
voting_records = self._get_voting_records(start_date, end_date, limit)
print(f"Fetched {len(voting_records)} voting records from API")
# Group by Besluit_Id (decision/motion) and get motion details
motions = self._process_voting_records(voting_records)
print(f"Processed into {len(motions)} unique motions")
return motions
except Exception as e:
print(f"Error fetching motions from API: {e}")
return []
def _get_voting_records(
self, start_date: datetime, end_date: datetime = None, limit: int = 500
) -> List[Dict]:
"""Get individual voting records from the API"""
# Format date properly for OData
start_date_str = start_date.strftime("%Y-%m-%d")
filter_query = f"GewijzigdOp ge {start_date_str}T00:00:00Z"
if end_date:
end_date_str = end_date.strftime("%Y-%m-%d")
filter_query += f" and GewijzigdOp le {end_date_str}T23:59:59Z"
# Add filter to exclude deleted records
filter_query += " and Verwijderd eq false"
url = f"{self.odata_base_url}/Stemming"
params = {
"$filter": filter_query,
"$top": limit,
"$orderby": "GewijzigdOp desc",
}
try:
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
data = response.json()
voting_records = data.get("value", [])
# If we got the maximum, there might be more data
if len(voting_records) == limit:
print(
f"Retrieved maximum {limit} records, there might be more data available"
)
return voting_records
except requests.exceptions.RequestException as e:
print(f"API request failed: {e}")
if hasattr(e, "response") and e.response is not None:
print(f"Response status: {e.response.status_code}")
print(f"Response text: {e.response.text[:500]}")
return []
def _process_voting_records(self, records: List[Dict]) -> List[Dict]:
"""Process individual voting records into grouped motions"""
# Group records by Besluit_Id (decision/motion)
motion_groups = defaultdict(
lambda: {"votes": {}, "besluit_id": None, "latest_date": None}
)
for record in records:
besluit_id = record.get("Besluit_Id")
if not besluit_id:
continue
# Extract party and vote information
party_name = record.get("ActorNaam")
vote_type = record.get("Soort", "").lower()
record_date = record.get("GewijzigdOp", "")
if not party_name:
continue
# Map vote types to our format
if vote_type == "voor":
vote = "voor"
elif vote_type == "tegen":
vote = "tegen"
else:
vote = "afwezig"
# Store the vote
motion_groups[besluit_id]["votes"][party_name] = vote
motion_groups[besluit_id]["besluit_id"] = besluit_id
# Track the latest date for this motion
if (
not motion_groups[besluit_id]["latest_date"]
or record_date > motion_groups[besluit_id]["latest_date"]
):
motion_groups[besluit_id]["latest_date"] = record_date
# Now get motion details for each unique Besluit_Id
motions = []
for besluit_id, motion_data in motion_groups.items():
if len(motion_data["votes"]) < 3: # Skip motions with too few votes
continue
# Get motion details
motion_details = self._get_motion_details(besluit_id)
if not motion_details:
# Create basic motion data if we can't get details
motion_details = {
"title": f"Motion {besluit_id[:8]}",
"description": "No description available",
"date": motion_data["latest_date"].split("T")[0]
if motion_data["latest_date"]
else datetime.now().strftime("%Y-%m-%d"),
}
# Calculate winning margin
voting_results = motion_data["votes"]
total_votes = sum(
1 for vote in voting_results.values() if vote in ["voor", "tegen"]
)
if total_votes == 0:
continue
votes_for = sum(1 for vote in voting_results.values() if vote == "voor")
winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes
motion = {
"title": motion_details["title"],
"description": motion_details["description"],
"date": motion_details["date"],
"policy_area": self._determine_policy_area(
motion_details["title"], motion_details["description"]
),
"voting_results": voting_results,
"winning_margin": winning_margin,
"url": f"https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id}",
"externe_identifier": motion_details.get("externe_identifier"),
"body_text": motion_details.get("body_text"),
}
motions.append(motion)
return motions
def _get_motion_details(self, besluit_id: str) -> Optional[Dict]:
"""Get motion details from Besluit endpoint.
Fetches Zaak.Onderwerp for the human-readable title, then follows the
Zaak Document DocumentVersie chain to get the ExterneIdentifier,
which is used to scrape the full motion body text from
zoek.officielebekendmakingen.nl.
"""
try:
# Step 1: Besluit → Zaak (title) + Zaak.Id for document lookup
url = f"{self.odata_base_url}/Besluit({besluit_id})"
params = {"$expand": "Zaak($select=Id,Onderwerp)"}
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
record = response.json()
zaak_list = record.get("Zaak", [])
onderwerp = None
zaak_id = None
if zaak_list:
onderwerp = zaak_list[0].get("Onderwerp")
zaak_id = zaak_list[0].get("Id")
besluit_tekst = record.get("BesluitTekst") or ""
date_str = record.get("GewijzigdOp", "")
date = (
date_str.split("T")[0]
if date_str
else datetime.now().strftime("%Y-%m-%d")
)
title = onderwerp or f"Motion {besluit_id[:8]}"
description = onderwerp or besluit_tekst or "Geen beschrijving beschikbaar"
# Step 2: Fetch ExterneIdentifier via Zaak → Document → DocumentVersie
externe_identifier = None
body_text = None
if zaak_id:
externe_identifier = self._get_externe_identifier(zaak_id)
if externe_identifier:
body_text = self._fetch_body_text(externe_identifier)
return {
"title": title,
"description": body_text or description,
"date": date,
"externe_identifier": externe_identifier,
"body_text": body_text,
}
except Exception as e:
print(f"Error getting motion details for {besluit_id}: {e}")
return None
def _get_externe_identifier(self, zaak_id: str) -> Optional[str]:
"""Fetch the ExterneIdentifier for the first non-deleted DocumentVersie of a Zaak."""
try:
url = f"{self.odata_base_url}/Zaak({zaak_id})"
params = {
"$expand": "Document($expand=DocumentVersie($select=Id,ExterneIdentifier,Extensie,Verwijderd))"
}
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
response.raise_for_status()
data = response.json()
for doc in data.get("Document", []):
for versie in doc.get("DocumentVersie", []):
if versie.get("Verwijderd"):
continue
ext_id = versie.get("ExterneIdentifier")
if ext_id:
return ext_id
except Exception as e:
print(f"Error fetching ExterneIdentifier for zaak {zaak_id}: {e}")
return None
def _fetch_body_text(self, externe_identifier: str) -> Optional[str]:
"""Scrape full motion body text from zoek.officielebekendmakingen.nl."""
try:
url = f"https://zoek.officielebekendmakingen.nl/{externe_identifier}.html"
response = self.session.get(url, timeout=config.API_TIMEOUT)
response.raise_for_status()
html = response.text
# Strip tags
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"&[a-z]+;", " ", text)
text = re.sub(r"\s+", " ", text).strip()
# Find the motion body starting at the first relevant keyword
start_keywords = [
"constaterende",
"overwegende",
"verzoekt",
"spreekt uit",
"roept op",
"de kamer,",
]
start_pos = len(text)
for kw in start_keywords:
pos = text.lower().find(kw)
if pos != -1 and pos < start_pos:
start_pos = pos
if start_pos == len(text):
return None # No motion body found
body = text[start_pos:]
# Trim at end markers
end_markers = [
"gaat over tot de orde van de dag",
"naar boven",
"deze motie is",
"nr.",
]
for marker in end_markers:
pos = body.lower().find(marker)
if pos != -1:
body = body[:pos]
body = body.strip()
return body if len(body) > 50 else None
except Exception as e:
print(f"Error fetching body text for {externe_identifier}: {e}")
return None
def _determine_policy_area(self, title: str, description: str) -> str:
"""Determine policy area from motion title and description"""
text = (title + " " + description).lower()
# Policy area keyword mapping
policy_mapping = {
"Economie": [
"economie",
"belasting",
"budget",
"financiën",
"werkgelegenheid",
"bedrijven",
"economisch",
],
"Klimaat": [
"klimaat",
"co2",
"duurzaam",
"energie",
"milieu",
"uitstoot",
"klimaatverandering",
],
"Immigratie": [
"migratie",
"asiel",
"vreemdeling",
"integratie",
"naturalisatie",
"immigratie",
],
"Zorg": [
"zorg",
"gezondheid",
"ziekenhuis",
"medicijn",
"arts",
"patiënt",
"gezondheidszorg",
],
"Onderwijs": [
"onderwijs",
"school",
"universiteit",
"student",
"leraar",
"educatie",
],
"Defensie": [
"defensie",
"militair",
"veiligheid",
"oorlog",
"leger",
"veiligheidsdienst",
],
}
for area, keywords in policy_mapping.items():
if any(keyword in text for keyword in keywords):
return area
return "Algemeen"
def test_api_connection(self) -> bool:
"""Test if API is accessible"""
try:
url = f"{self.odata_base_url}/Stemming"
params = {"$top": 1}
response = self.session.get(url, params=params, timeout=10)
response.raise_for_status()
data = response.json()
return len(data.get("value", [])) > 0
except Exception as e:
print(f"API connection test failed: {e}")
return False

app.py
@@ -0,0 +1,310 @@
# app.py
import streamlit as st
import pandas as pd
from datetime import datetime
from database import db
from summarizer import summarizer
from config import config
import json
# Page config
st.set_page_config(
page_title="Nederlandse Politieke Kompas", page_icon="🇳🇱", layout="wide"
)
def main():
st.title("🇳🇱 Nederlandse Politieke Kompas")
st.markdown(
"Ontdek welke politieke partij het beste bij jouw idealen past door te stemmen op echte Tweede Kamer moties."
)
# Initialize session state
if "session_id" not in st.session_state:
st.session_state.session_id = None
if "current_motion_index" not in st.session_state:
st.session_state.current_motion_index = 0
if "motions" not in st.session_state:
st.session_state.motions = []
if "show_results" not in st.session_state:
st.session_state.show_results = False
# Sidebar configuration
with st.sidebar:
st.header("Instellingen")
motion_count = st.slider(
"Aantal moties",
min_value=5,
max_value=25,
value=config.DEFAULT_MOTION_COUNT,
)
policy_area = st.selectbox("Beleidsgebied", config.POLICY_AREAS)
margin_range = st.slider(
"Controversiële moties (%)",
min_value=0,
max_value=100,
value=(
config.DEFAULT_WINNING_MARGIN_MIN,
config.DEFAULT_WINNING_MARGIN_MAX,
),
)
if st.button("Start Nieuwe Sessie"):
start_new_session(motion_count, policy_area, margin_range)
if st.button("Genereer AI Samenvattingen"):
with st.spinner("Genereren van samenvattingen..."):
summarizer.update_motion_summaries()
st.success("Samenvattingen bijgewerkt!")
# Main content
if not st.session_state.session_id:
show_welcome_screen(motion_count, policy_area, margin_range)
elif st.session_state.show_results:
show_results()
else:
show_motion_interface()
def start_new_session(motion_count, policy_area, margin_range):
"""Start a new voting session"""
# Get filtered motions
motions = db.get_filtered_motions(
policy_area=policy_area,
min_margin=margin_range[0] / 100,
max_margin=margin_range[1] / 100,
limit=motion_count,
)
if len(motions) < motion_count:
st.warning(
f"Slechts {len(motions)} moties gevonden met de geselecteerde criteria."
)
# Create session
session_id = db.create_session(motion_count)
# Update session state
st.session_state.session_id = session_id
st.session_state.motions = motions[:motion_count]
st.session_state.current_motion_index = 0
st.session_state.show_results = False
st.rerun()
def show_welcome_screen(motion_count, policy_area, margin_range):
"""Show welcome screen with start button"""
col1, col2, col3 = st.columns([1, 2, 1])
with col2:
st.markdown("### Welkom bij de Nederlandse Politieke Kompas!")
st.markdown(f"""
**Jouw instellingen:**
- 📊 **{motion_count} moties** uit het beleidsgebied **{policy_area}**
- 🎯 **Controversiële moties** tussen {margin_range[0]}% en {margin_range[1]}% marge
Klik op "Start Nieuwe Sessie" in de zijbalk om te beginnen met stemmen.
""")
st.info(
"💡 **Tip**: Kies 'Alle' als beleidsgebied voor een breed overzicht van verschillende onderwerpen."
)
def show_motion_interface():
"""Show motion voting interface"""
if not st.session_state.motions:
st.error("Geen moties gevonden. Start een nieuwe sessie.")
return
current_index = st.session_state.current_motion_index
total_motions = len(st.session_state.motions)
# Progress bar
progress = (current_index) / total_motions
st.progress(progress, text=f"Motie {current_index + 1} van {total_motions}")
if current_index >= total_motions:
st.session_state.show_results = True
st.rerun()
return
motion = st.session_state.motions[current_index]
# Motion display
st.header(f"Motie {current_index + 1}: {motion['title']}")
# Policy area tag
st.markdown(f"**Beleidsgebied:** {motion['policy_area']}")
# Layman explanation (prominent)
if motion.get("layman_explanation"):
st.markdown("### 📝 Uitleg in begrijpelijke taal:")
st.markdown(f"*{motion['layman_explanation']}*")
# Original description (collapsible)
motion_text = motion.get("body_text") or motion.get("description", "")
if motion_text:
label = (
"📋 Volledige motietekst"
if motion.get("body_text")
else "📋 Originele motiebeschrijving"
)
with st.expander(label):
st.write(motion_text)
# Voting buttons
st.markdown("### 🗳 Hoe zou jij stemmen?")
col1, col2, col3 = st.columns(3)
with col1:
if st.button("✅ Voor", use_container_width=True, type="primary"):
cast_vote("Voor")
with col2:
if st.button("❌ Tegen", use_container_width=True):
cast_vote("Tegen")
with col3:
if st.button("🚫 Geen stem", use_container_width=True):
cast_vote("Geen stem")
def cast_vote(vote_choice):
"""Record user vote and move to next motion"""
current_motion = st.session_state.motions[st.session_state.current_motion_index]
# Save vote to database
db.update_user_vote(st.session_state.session_id, current_motion["id"], vote_choice)
# Move to next motion
st.session_state.current_motion_index += 1
st.rerun()
def show_results():
"""Show voting results and party matches"""
st.header("🎯 Jouw Resultaten")
# Calculate party matches
party_matches = db.calculate_party_matches(st.session_state.session_id)
if not party_matches:
st.error("Geen resultaten beschikbaar.")
return
# Party ranking table
st.subheader("📊 Partij Overeenkomsten (van hoog naar laag)")
df = pd.DataFrame(party_matches)
df.columns = ["Partij", "Overeenkomst %", "Eens", "Totaal"]
# Style the dataframe
def color_agreement(val):
if val >= 80:
return "background-color: #d4edda"
elif val >= 60:
return "background-color: #fff3cd"
else:
return "background-color: #f8d7da"
styled_df = df.style.applymap(color_agreement, subset=["Overeenkomst %"])
st.dataframe(styled_df, use_container_width=True, hide_index=True)
# Top match highlight
top_match = party_matches[0]
st.success(
f"🏆 **Beste match:** {top_match['party']} ({top_match['agreement_percentage']}% overeenkomst)"
)
# Detailed motion overview
st.subheader("📋 Gedetailleerd Overzicht per Motie")
show_detailed_motion_results()
# New session button
if st.button("🔄 Start Nieuwe Sessie"):
# Clear session state
for key in ["session_id", "motions", "current_motion_index", "show_results"]:
if key in st.session_state:
del st.session_state[key]
st.rerun()
def show_detailed_motion_results():
"""Show detailed voting results for each motion"""
import duckdb
conn = duckdb.connect(config.DATABASE_PATH)
# Get user votes
user_data = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(st.session_state.session_id,),
).fetchone()
if not user_data:
return
user_votes = json.loads(user_data[0])
# Get motion details
motion_ids = list(user_votes.keys())
if motion_ids:
placeholders = ",".join(["?" for _ in motion_ids])
motions = conn.execute(
f"""
SELECT id, title, layman_explanation, body_text, description, voting_results FROM motions
WHERE id IN ({placeholders})
""",
motion_ids,
).fetchall()
for (
motion_id,
title,
layman_explanation,
body_text,
description,
voting_results_json,
) in motions:
voting_results = json.loads(voting_results_json)
user_vote = user_votes[str(motion_id)]
with st.expander(f"**{title}** (Jouw stem: {user_vote})"):
# Show layman explanation prominently
if layman_explanation:
st.markdown("**📝 Uitleg:**")
st.markdown(f"*{layman_explanation}*")
# Show full motion body text if available, otherwise description
motion_text = body_text or description
if motion_text:
st.markdown("**📋 Motiebeschrijving:**")
st.write(motion_text)
# Create voting overview
parties_voor = [p for p, v in voting_results.items() if v == "voor"]
parties_tegen = [p for p, v in voting_results.items() if v == "tegen"]
col1, col2 = st.columns(2)
with col1:
st.markdown("**Voor:**")
st.write(", ".join(parties_voor) if parties_voor else "Geen")
with col2:
st.markdown("**Tegen:**")
st.write(", ".join(parties_tegen) if parties_tegen else "Geen")
conn.close()
if __name__ == "__main__":
main()

config.py
@@ -0,0 +1,51 @@
# config.py (complete updated version)
import os
from dataclasses import dataclass
from typing import List
@dataclass
class Config:
# Database settings
DATABASE_PATH = "data/motions.db"
# API settings (updated)
TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
API_TIMEOUT = 30
API_BATCH_SIZE = 250 # Increased based on API capabilities
API_MAX_LIMIT = 250
# AI settings
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
QWEN_MODEL = "qwen/qwen-2.5-72b-instruct"
# App settings
DEFAULT_MOTION_COUNT = 10
DEFAULT_WINNING_MARGIN_MIN = (
0 # % - include all, filter by layman_explanation instead
)
DEFAULT_WINNING_MARGIN_MAX = 100 # %
SESSION_TIMEOUT_DAYS = 30
# Policy areas
POLICY_AREAS = [
"Alle",
"Economie",
"Klimaat",
"Immigratie",
"Zorg",
"Onderwijs",
"Defensie",
"Sociale Zaken",
"Algemeen",
]
# Scraper defaults (previously missing)
BASE_URL = (
"https://www.tweedekamer.nl/zoeken/zoekresultaten" # base for scraping motions
)
SCRAPING_DELAY = int(os.getenv("SCRAPING_DELAY", "5"))
config = Config()

database.py
@@ -0,0 +1,582 @@
# database.py (final working version)
import duckdb
import json
import uuid
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
from config import config
import logging
_logger = logging.getLogger(__name__)
class MotionDatabase:
def __init__(self, db_path: str = config.DATABASE_PATH):
self.db_path = db_path
self._init_database()
def _init_database(self):
"""Initialize database with required tables"""
# Create directory if it doesn't exist
import os
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
conn = duckdb.connect(self.db_path)
# Create sequence for auto-incrementing IDs
try:
conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
except Exception:
pass
# Create tables with proper ID handling
conn.execute("""
CREATE TABLE IF NOT EXISTS motions (
id INTEGER DEFAULT nextval('motions_id_seq'),
title TEXT NOT NULL,
description TEXT,
date DATE,
policy_area TEXT,
voting_results JSON,
winning_margin FLOAT,
controversy_score FLOAT,
layman_explanation TEXT,
url TEXT UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS user_sessions (
session_id TEXT PRIMARY KEY,
user_votes JSON,
completed_motions INTEGER DEFAULT 0,
total_motions INTEGER DEFAULT 10,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS party_results (
session_id TEXT,
party_name TEXT,
agreement_percentage FLOAT,
agreed_motions JSON,
disagreed_motions JSON,
PRIMARY KEY (session_id, party_name)
)
""")
# New pipeline tables
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS mp_votes (
id INTEGER DEFAULT nextval('mp_votes_id_seq'),
motion_id INTEGER NOT NULL,
mp_name TEXT NOT NULL,
party TEXT,
vote TEXT NOT NULL,
date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS mp_metadata (
mp_name TEXT PRIMARY KEY,
party TEXT,
van DATE,
tot_en_met DATE,
persoon_id TEXT
)
""")
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS svd_vectors (
id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
window_id TEXT NOT NULL,
entity_type TEXT NOT NULL,
entity_id TEXT NOT NULL,
vector JSON NOT NULL,
model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.close()
def reset_database(self):
"""Development helper: drop known tables and re-run initialization.
WARNING: intended for dev/test only. This will remove tables and recreate schema.
"""
conn = duckdb.connect(self.db_path)
try:
# Drop known tables if they exist
for t in ("party_results", "user_sessions", "motions"):
try:
conn.execute(f"DROP TABLE IF EXISTS {t}")
except Exception:
pass
# Recreate schema
conn.close()
self._init_database()
finally:
try:
conn.close()
except Exception:
pass
def insert_motion(self, motion_data: Dict) -> bool:
"""Insert a new motion into database"""
try:
conn = duckdb.connect(self.db_path)
# Check if motion already exists by URL to avoid duplicates
existing = conn.execute(
"""
SELECT COUNT(*) FROM motions WHERE url = ?
""",
(motion_data["url"],),
).fetchone()
if existing and existing[0] > 0:
conn.close()
return False # Motion already exists
# Insert motion - id will be auto-generated by sequence
conn.execute(
"""
INSERT INTO motions
(title, description, date, policy_area, voting_results,
winning_margin, controversy_score, url, externe_identifier, body_text, created_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(
motion_data["title"],
motion_data["description"] or "",
motion_data["date"],
motion_data["policy_area"],
json.dumps(motion_data["voting_results"]),
motion_data["winning_margin"],
1 - motion_data["winning_margin"], # controversy score
motion_data["url"],
motion_data.get("externe_identifier"),
motion_data.get("body_text"),
),
)
conn.close()
return True
except Exception as e:
print(f"Error inserting motion: {e}")
if "conn" in locals():
conn.close()
return False
def get_filtered_motions(
self,
policy_area: str = "Alle",
min_margin: float = 0.2,
max_margin: float = 0.8,
limit: int = 100,
) -> List[Dict]:
"""Get motions filtered by criteria"""
conn = duckdb.connect(self.db_path)
query = """
SELECT * FROM motions
WHERE winning_margin BETWEEN ? AND ?
AND layman_explanation IS NOT NULL
AND layman_explanation != ''
"""
params = [min_margin, max_margin]
if policy_area != "Alle":
query += " AND policy_area = ?"
params.append(policy_area)
query += " ORDER BY controversy_score DESC LIMIT ?"
params.append(limit)
try:
result = conn.execute(query, params).fetchall()
columns = [desc[0] for desc in conn.description]
conn.close()
return [dict(zip(columns, row)) for row in result]
except Exception as e:
print(f"Error querying motions: {e}")
conn.close()
return []
def create_session(self, total_motions: int = 10) -> str:
"""Create new user session"""
session_id = str(uuid.uuid4())
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO user_sessions (session_id, user_votes, total_motions)
VALUES (?, '{}', ?)
""",
(session_id, total_motions),
)
conn.close()
return session_id
def update_user_vote(self, session_id: str, motion_id: int, vote: str):
"""Update user vote for a motion"""
conn = duckdb.connect(self.db_path)
# Get current votes
current_votes = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(session_id,),
).fetchone()
if current_votes:
votes_dict = json.loads(current_votes[0])
votes_dict[str(motion_id)] = vote
conn.execute(
"""
UPDATE user_sessions
SET user_votes = ?,
completed_motions = ?,
last_updated = CURRENT_TIMESTAMP
WHERE session_id = ?
""",
(json.dumps(votes_dict), len(votes_dict), session_id),
)
conn.close()
def calculate_party_matches(self, session_id: str) -> List[Dict]:
"""Calculate party agreement percentages"""
conn = duckdb.connect(self.db_path)
# Get user votes and motion data
user_data = conn.execute(
"""
SELECT user_votes FROM user_sessions WHERE session_id = ?
""",
(session_id,),
).fetchone()
if not user_data:
return []
user_votes = json.loads(user_data[0])
motion_ids = list(user_votes.keys())
if not motion_ids:
return []
# Get motion voting results
placeholders = ",".join(["?" for _ in motion_ids])
motions = conn.execute(
f"""
SELECT id, voting_results FROM motions
WHERE id IN ({placeholders})
""",
motion_ids,
).fetchall()
conn.close()
# Calculate agreements
party_scores = {}
for motion_id, voting_results_json in motions:
voting_results = json.loads(voting_results_json)
user_vote = user_votes[str(motion_id)]
if user_vote == "Geen stem": # Skip abstentions
continue
for party, party_vote in voting_results.items():
# Skip individual MP names (contain comma, e.g. "Yesilgöz-Zegerius, D.")
# Party/fractie names never contain a comma.
if "," in party:
continue
if party not in party_scores:
party_scores[party] = {"agreed": 0, "total": 0}
party_scores[party]["total"] += 1
# Check agreement
if (user_vote == "Voor" and party_vote == "voor") or (
user_vote == "Tegen" and party_vote == "tegen"
):
party_scores[party]["agreed"] += 1
# Convert to percentages and sort
results = []
for party, scores in party_scores.items():
if scores["total"] > 0:
agreement_pct = (scores["agreed"] / scores["total"]) * 100
results.append(
{
"party": party,
"agreement_percentage": round(agreement_pct, 1),
"agreed_motions": scores["agreed"],
"total_motions": scores["total"],
}
)
return sorted(results, key=lambda x: x["agreement_percentage"], reverse=True)
def store_embedding(self, motion_id: int, model: str, vector: List[float]) -> int:
"""Store an embedding for a motion. Returns inserted row id or -1 on failure."""
try:
conn = duckdb.connect(self.db_path)
# store vector as JSON
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
(motion_id, model, json.dumps(vector)),
)
row = conn.execute("SELECT max(id) FROM embeddings").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
print(f"Error storing embedding: {e}")
try:
conn.close()
except Exception:
pass
return -1
def search_similar(
self, query_vector: List[float], top_k: int = 5, model: Optional[str] = None
) -> List[Dict]:
"""Naive in-Python cosine similarity search over stored embeddings.
Returns list of dicts with keys: id, motion_id, model, score, created_at
"""
try:
conn = duckdb.connect(self.db_path)
if model:
rows = conn.execute(
"SELECT id, motion_id, model, vector, created_at FROM embeddings WHERE model = ?",
(model,),
).fetchall()
else:
rows = conn.execute(
"SELECT id, motion_id, model, vector, created_at FROM embeddings"
).fetchall()
conn.close()
results = []
import math
for r in rows:
id_, motion_id, mdl, vector_json, created_at = r
try:
vec = json.loads(vector_json)
except Exception:
continue
# cosine similarity
try:
dot = sum(float(a) * float(b) for a, b in zip(query_vector, vec))
na = math.sqrt(sum(float(a) * float(a) for a in query_vector))
nb = math.sqrt(sum(float(b) * float(b) for b in vec))
score = dot / (na * nb) if na and nb else 0.0
except Exception:
score = 0.0
results.append(
{
"id": id_,
"motion_id": motion_id,
"model": mdl,
"score": score,
"created_at": created_at,
}
)
results.sort(key=lambda x: x["score"], reverse=True)
return results[:top_k]
except Exception as e:
print(f"Error searching embeddings: {e}")
try:
conn.close()
except Exception:
pass
return []
def mp_votes_exists_for_motion(self, motion_id: int) -> bool:
try:
conn = duckdb.connect(self.db_path)
row = conn.execute(
"SELECT COUNT(*) FROM mp_votes WHERE motion_id = ?",
(motion_id,),
).fetchone()
conn.close()
return bool(row and row[0] > 0)
except Exception as e:
_logger.error(f"Error checking mp_votes existence: {e}")
try:
conn.close()
except Exception:
pass
return False
def insert_mp_vote(
self,
motion_id: int,
mp_name: str,
vote: str,
date: Optional[str] = None,
party: Optional[str] = None,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO mp_votes (motion_id, mp_name, party, vote, date, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(motion_id, mp_name, party, vote, date),
)
row = conn.execute("SELECT max(id) FROM mp_votes").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error inserting mp_vote: {e}")
try:
conn.close()
except Exception:
pass
return -1
def upsert_mp_metadata(
self,
mp_name: str,
party: Optional[str],
van: Optional[str],
tot_en_met: Optional[str],
persoon_id: Optional[str],
) -> None:
try:
conn = duckdb.connect(self.db_path)
exists = conn.execute(
"SELECT COUNT(*) FROM mp_metadata WHERE mp_name = ?", (mp_name,)
).fetchone()
if exists and exists[0] > 0:
conn.execute(
"""
UPDATE mp_metadata SET party = ?, van = ?, tot_en_met = ?, persoon_id = ?
WHERE mp_name = ?
""",
(party, van, tot_en_met, persoon_id, mp_name),
)
else:
conn.execute(
"""
INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id)
VALUES (?, ?, ?, ?, ?)
""",
(mp_name, party, van, tot_en_met, persoon_id),
)
conn.close()
except Exception as e:
_logger.error(f"Error upserting mp_metadata: {e}")
try:
conn.close()
except Exception:
pass
def store_svd_vector(
self,
window_id: str,
entity_type: str,
entity_id: str,
vector: List[float],
model: Optional[str] = None,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(window_id, entity_type, entity_id, json.dumps(vector), model),
)
row = conn.execute("SELECT max(id) FROM svd_vectors").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error storing svd_vector: {e}")
try:
conn.close()
except Exception:
pass
return -1
def store_fused_embedding(
self,
motion_id: int,
window_id: str,
vector: List[float],
svd_dims: int,
text_dims: int,
) -> int:
try:
conn = duckdb.connect(self.db_path)
conn.execute(
"""
INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims, created_at)
VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
""",
(motion_id, window_id, json.dumps(vector), svd_dims, text_dims),
)
row = conn.execute("SELECT max(id) FROM fused_embeddings").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
_logger.error(f"Error storing fused_embedding: {e}")
try:
conn.close()
except Exception:
pass
return -1
db = MotionDatabase()

@ -0,0 +1,20 @@
version: '3.8'
services:
stemwijzer:
build: .
image: stemwijzer:latest
container_name: stemwijzer_app
restart: unless-stopped
ports:
- "8501:8501"
volumes:
- ./data:/home/app/app/data:rw
environment:
- PYTHONPATH=/home/app/app
- OPENROUTER_API_KEY
- OTHER_SECRET
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8501/"]
interval: 30s
timeout: 3s
retries: 3

@ -0,0 +1,72 @@
# Recomputing Similarity (Admin)
This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.
## What this does
- Recompute similarity vectors/scores for existing records in the database.
- (Optionally) run the clusterer job that groups similar items based on recomputed vectors.
These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.
## Migration filenames
When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like `recompute_similarity` or `clusterer` in the filename, for example:
- `20260101_001_recompute_similarity.py`
- `20260215_002_clusterer_migration.py`
Check your migrations folder for the exact filenames used in your environment.
## Environment variables
When running the CLI locally you may need to set the following environment variables.
- `TEST_DB_URL` — connection string for a test/development database (used by local runs when you don't want to touch production data).
- `AI_PROVIDER_MOCK` — when set, the AI/embedding provider is mocked so you don't make real API calls during development. Any non-empty value (e.g. `1`, `true`, `yes`) is treated as truthy.
- `SIMILARITY_TOP_N` — default number of top similar items to compute/keep for each record. The CLI `--top-n` flag overrides this value for the duration of the run.
Examples:
- Export in a shell (persistent for your session):
export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
export AI_PROVIDER_MOCK="true"
export SIMILARITY_TOP_N="50"
- Inline for a single command (non-persistent):
TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100
Notes:
- The `--top-n` CLI flag takes precedence over `SIMILARITY_TOP_N` when both are provided (see the sketch below).
- Set `AI_PROVIDER_MOCK` to a truthy value (e.g. `1`, `true`, `yes`) to avoid real external AI calls during local runs.
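As a rough illustration of the precedence rule (the function name and the default of 25 below are assumptions for this sketch, not the actual CLI code), the resolution looks like:

```
import os

def resolve_top_n(cli_top_n=None, default=25):
    """Effective top-n: the --top-n flag wins, then SIMILARITY_TOP_N, then a default."""
    if cli_top_n is not None:
        return int(cli_top_n)
    env = os.environ.get("SIMILARITY_TOP_N")
    return int(env) if env else default
```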
## Running locally (development)
The CLI lives under `src/cli`. Use the module runner to execute the recompute script. Example commands:
Run a dry-run that doesn't persist changes:
```
python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run
```
Run for real (writes results to the DB):
```
python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
```
Common flags
- `--top-n` — override SIMILARITY_TOP_N for this run.
- `--batch-size` — number of records to process per batch.
- `--dry-run` — inspect what would be changed without writing to the DB.
Notes
- Always point `TEST_DB_URL` at a non-production database when experimenting.
- Use `AI_PROVIDER_MOCK=true` to skip external calls and speed up local dev.
- If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.
If you need help or encounter mismatches between migration files and the CLI, check the migrations folder and speak with the team member who authored the change.

@ -0,0 +1,67 @@
# fix_database.py (updated version)
import os
import duckdb
from config import config
def fix_database():
"""Completely reset the database with correct schema"""
# Remove the existing database file completely
if os.path.exists(config.DATABASE_PATH):
os.remove(config.DATABASE_PATH)
print("Removed existing database file")
# Create directory if it doesn't exist
os.makedirs(os.path.dirname(config.DATABASE_PATH), exist_ok=True)
# Initialize with correct schema
conn = duckdb.connect(config.DATABASE_PATH)
# Create sequence for auto-incrementing IDs
conn.execute("CREATE SEQUENCE motions_id_seq START 1")
# Create motions table with sequence-based auto-increment
conn.execute("""
CREATE TABLE motions (
id INTEGER DEFAULT nextval('motions_id_seq'),
title TEXT NOT NULL,
description TEXT,
date DATE,
policy_area TEXT,
voting_results JSON,
winning_margin FLOAT,
controversy_score FLOAT,
layman_explanation TEXT,
url TEXT UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
)
""")
conn.execute("""
CREATE TABLE user_sessions (
session_id TEXT PRIMARY KEY,
user_votes JSON,
completed_motions INTEGER DEFAULT 0,
total_motions INTEGER DEFAULT 10,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE party_results (
session_id TEXT,
party_name TEXT,
agreement_percentage FLOAT,
agreed_motions JSON,
disagreed_motions JSON,
PRIMARY KEY (session_id, party_name)
)
""")
conn.close()
print("Database recreated with correct schema using sequences")
if __name__ == "__main__":
fix_database()

@ -0,0 +1,6 @@
def main():
print("Hello from stemwijzer!")
if __name__ == "__main__":
main()

@ -0,0 +1,11 @@
-- Add a separate embeddings table for semantic search and storage of vectors (DuckDB-compatible)
CREATE TABLE IF NOT EXISTS embeddings (
id INTEGER,
motion_id INTEGER NOT NULL,
model TEXT NOT NULL,
vector JSON NOT NULL,
created_at TIMESTAMP DEFAULT current_timestamp
);
-- DuckDB does not support AUTOINCREMENT; emulate id via a sequence if needed elsewhere
CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1;
-- Populating the id via a trigger-like insert pattern is handled by application code (select nextval when inserting)

@ -0,0 +1,6 @@
-- Migration: add externe_identifier and body_text columns to motions
-- externe_identifier: e.g. "kst-36600-VII-28" from DocumentVersie.ExterneIdentifier
-- body_text: full plain-text motion body scraped from officielebekendmakingen.nl
ALTER TABLE motions ADD COLUMN IF NOT EXISTS externe_identifier VARCHAR;
ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text VARCHAR;

@ -0,0 +1,24 @@
-- Migration: create audit_events table
-- Date: 2026-03-22
-- Description: Placeholder migration to add an audit_events table to record audit logs.
--
-- Decision: The actual SQL is intentionally left commented out to avoid making
-- database changes during test runs. When ready to apply, uncomment and
-- adapt the SQL for your database engine.
/*
CREATE TABLE audit_events (
id UUID PRIMARY KEY,
actor_id UUID NOT NULL,
action TEXT NOT NULL,
target_type TEXT,
target_id UUID,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
-- Add indexes as needed, e.g.:
-- CREATE INDEX ON audit_events (actor_id);
*/
-- End of migration placeholder

@ -0,0 +1,15 @@
-- 2026-03-22-add-similarity-cache.sql
-- Placeholder migration for adding a similarity_cache table
-- Decision: Keep SQL commented out so CI does not accidentally modify databases.
/*
-- Example (commented out):
CREATE TABLE similarity_cache (
id SERIAL PRIMARY KEY,
key TEXT NOT NULL,
vector FLOAT8[] NOT NULL,
created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
*/
-- No executable SQL in this file. Intentionally left as a safe no-op.

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1;
CREATE TABLE IF NOT EXISTS fused_embeddings (
id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
motion_id INTEGER NOT NULL,
window_id TEXT NOT NULL,
vector JSON NOT NULL,
svd_dims INTEGER NOT NULL,
text_dims INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,9 @@
----SQL
CREATE TABLE IF NOT EXISTS mp_metadata (
mp_name TEXT PRIMARY KEY,
party TEXT,
van DATE,
tot_en_met DATE,
persoon_id TEXT
);
----END

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1;
CREATE TABLE IF NOT EXISTS mp_votes (
id INTEGER DEFAULT nextval('mp_votes_id_seq'),
motion_id INTEGER NOT NULL,
mp_name TEXT NOT NULL,
party TEXT,
vote TEXT NOT NULL,
date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1;
CREATE TABLE IF NOT EXISTS svd_vectors (
id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
window_id TEXT NOT NULL,
entity_type TEXT NOT NULL,
entity_id TEXT NOT NULL,
vector JSON NOT NULL,
model TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
----END

@ -0,0 +1,75 @@
import json
import logging
from typing import Optional
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
def extract_mp_votes(db_path: Optional[str] = None, limit: Optional[int] = None):
"""Extract individual MP votes from motions.voting_results and store them
in the mp_votes table.
Returns a dict with summary counts:
- motions_scanned: number of motions inspected
- mp_rows_inserted: number of mp_votes rows inserted
- motions_skipped: number of motions skipped because mp_votes already existed
"""
db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()
conn = duckdb.connect(db.db_path)
try:
# support optional limit to only scan a subset of motions
if limit is not None:
rows = conn.execute(
"SELECT id, voting_results, date FROM motions LIMIT ?", (limit,)
).fetchall()
else:
rows = conn.execute(
"SELECT id, voting_results, date FROM motions"
).fetchall()
finally:
conn.close()
mp_rows_inserted = 0
motions_skipped = 0
motions_scanned = 0
for motion_id, voting_results_json, date in rows:
motions_scanned += 1
try:
if db.mp_votes_exists_for_motion(motion_id):
_logger.debug(
"Skipping motion %s because mp_votes already exist", motion_id
)
motions_skipped += 1
continue
# voting_results may be stored as JSON text or as native JSON; ensure it's a dict
if isinstance(voting_results_json, str):
voting_results = json.loads(voting_results_json)
else:
voting_results = voting_results_json
for actor, vote in (voting_results or {}).items():
# Individual MP names contain a comma (e.g. "Last, F.")
if "," not in actor:
continue
inserted_id = db.insert_mp_vote(
motion_id=motion_id, mp_name=actor, vote=vote, date=date, party=None
)
if inserted_id and inserted_id > 0:
mp_rows_inserted += 1
except Exception as e:
_logger.error("Error processing motion %s: %s", motion_id, e)
return {
"motions_scanned": motions_scanned,
"mp_rows_inserted": mp_rows_inserted,
"motions_skipped": motions_skipped,
}

@ -0,0 +1,94 @@
import logging
from typing import Optional
import requests
from database import MotionDatabase
logger = logging.getLogger(__name__)
def normalize_mp_name(
achternaam: str, initialen: Optional[str], tussenvoegsel: Optional[str]
) -> str:
"""Reconstruct ActorNaam format used in voting_results keys.
Format: "{Tussenvoegsel} {Achternaam}, {Initialen}" with sensible stripping when
tussenvoegsel is missing.
"""
parts = []
if tussenvoegsel:
parts.append(tussenvoegsel)
parts.append(achternaam)
name = " ".join(parts).strip()
# Ensure the displayed name starts with an uppercase letter so
# ORDER BY mp_name behaves predictably across databases that may
# sort uppercase before lowercase. Only change the first character
# to upper-case to avoid lowercasing other letters (e.g. hyphenated
# or already capitalized parts).
if name and name[0].islower():
name = name[0].upper() + name[1:]
if initialen:
name = f"{name}, {initialen}"
return name
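# Examples (matching the test fixtures):
#   normalize_mp_name("Plas", "C.", "van der")           -> "Van der Plas, C."
#   normalize_mp_name("Yesilgöz-Zegerius", "D.", None)   -> "Yesilgöz-Zegerius, D."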
def fetch_mp_metadata(
db_path: str, odata_url: str = "https://odata.example/FractieZetelPersoon"
) -> int:
"""Fetch MP party membership and tenure from OData and upsert into DB.
Returns the number of records processed (inserted or updated).
"""
session = requests.Session()
try:
resp = session.get(odata_url)
resp.raise_for_status()
data = resp.json()
except Exception as e:
logger.error("Failed to fetch MP metadata: %s", e)
raise
values = data.get("value") if isinstance(data, dict) else None
if values is None:
logger.error("Unexpected OData payload; missing 'value' list")
return 0
db = MotionDatabase(db_path)
processed = 0
for item in values:
try:
persoon = item.get("Persoon") or {}
fractiezetel = item.get("FractieZetel") or {}
fractie = fractiezetel.get("Fractie") or {}
achternaam = persoon.get("Achternaam")
initialen = persoon.get("Initialen")
tussenvoegsel = persoon.get("Tussenvoegsel")
persoon_id = persoon.get("Id")
party = fractie.get("NaamNL")
van = item.get("Van")
tot_en_met = item.get("TotEnMet")
if not achternaam:
logger.debug("Skipping record without achternaam: %s", item)
continue
mp_name = normalize_mp_name(achternaam, initialen, tussenvoegsel)
db.upsert_mp_metadata(
mp_name=mp_name,
party=party,
van=van,
tot_en_met=tot_en_met,
persoon_id=persoon_id,
)
processed += 1
except Exception:
logger.exception("Error processing OData item: %s", item)
logger.info("Processed %d MP metadata records", processed)
return processed

@ -0,0 +1,116 @@
import json
import logging
from typing import Dict, Optional
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
def fuse_for_window(
window_id: str, db_path: Optional[str] = None, model: Optional[str] = None
) -> Dict[str, int]:
"""Fuse SVD vectors with text embeddings for motions in a window.
Parameters:
- window_id: id of the window to process
- db_path: optional path to duckdb database (if None MotionDatabase default is used)
- model: optional model name to filter text embeddings
Returns a dict with counts: inserted, skipped_missing_text, skipped_missing_svd, errors
"""
# Create MotionDatabase using provided path if given, otherwise use default
if db_path:
db = MotionDatabase(db_path=db_path)
conn = duckdb.connect(db_path)
else:
db = MotionDatabase()
# MotionDatabase always exposes the path it uses
conn = duckdb.connect(db.db_path)
# Fetch svd vectors for the window and entity_type=motion
rows = conn.execute(
"SELECT entity_id, vector FROM svd_vectors WHERE window_id = ? AND entity_type = ?",
(window_id, "motion"),
).fetchall()
# debug
_logger.debug("Found %d svd rows for window %s", len(rows), window_id)
inserted = 0
skipped_missing_text = 0
skipped_missing_svd = 0
errors = 0
for entity_id, svd_json in rows:
try:
svd_vec = json.loads(svd_json)
except Exception:
_logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
skipped_missing_svd += 1
continue
# Look up text embedding for this motion (most recent). If model is provided
# filter by model as well.
if model:
emb_row = conn.execute(
"SELECT vector FROM embeddings WHERE motion_id = ? AND model = ? ORDER BY created_at DESC LIMIT 1",
(int(entity_id), model),
).fetchone()
else:
emb_row = conn.execute(
"SELECT vector FROM embeddings WHERE motion_id = ? ORDER BY created_at DESC LIMIT 1",
(int(entity_id),),
).fetchone()
if not emb_row:
skipped_missing_text += 1
continue
try:
text_vec = json.loads(emb_row[0])
except Exception:
_logger.exception("Invalid text embedding JSON for motion %s", entity_id)
skipped_missing_text += 1
continue
try:
fused = list(svd_vec) + list(text_vec)
except Exception:
_logger.exception("Error concatenating vectors for motion %s", entity_id)
errors += 1
continue
# store fused embedding and check result
try:
res = db.store_fused_embedding(
int(entity_id),
window_id,
fused,
svd_dims=len(svd_vec),
text_dims=len(text_vec),
)
if res and res > 0:
inserted += 1
else:
errors += 1
_logger.error(
"Failed to store fused embedding for motion %s (db returned %s)",
entity_id,
res,
)
except Exception:
_logger.exception(
"Exception while storing fused embedding for motion %s", entity_id
)
errors += 1
conn.close()
return {
"inserted": inserted,
"skipped_missing_text": skipped_missing_text,
"skipped_missing_svd": skipped_missing_svd,
"errors": errors,
}

@ -0,0 +1,206 @@
import json
import logging
from typing import Optional, Dict, List, Tuple
import numpy as np
try:
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from scipy.linalg import orthogonal_procrustes
_HAS_SCIPY = True
except Exception:
# Provide lightweight fallbacks for environments without scipy
csr_matrix = lambda x: x
def svds(a, k=1):
# fallback to numpy.linalg.svd on dense arrays
U, s, Vt = np.linalg.svd(np.array(a), full_matrices=False)
# numpy.linalg.svd returns components in descending singular-value order, so take
# the first k (the largest) to mimic scipy.sparse.linalg.svds
return U[:, :k], s[:k], Vt[:k, :]
def orthogonal_procrustes(A, B):
# simple orthogonal Procrustes via SVD: find R minimizing ||A R - B||
U, _, Vt = np.linalg.svd(A.T.dot(B))
R = U.dot(Vt)
scale = 1.0
return R, scale
_HAS_SCIPY = False
import duckdb
from database import MotionDatabase
_logger = logging.getLogger(__name__)
# Map textual votes to numeric values for SVD
VOTE_MAP = {
"Voor": 1.0,
"voor": 1.0,
"Tegen": -1.0,
"tegen": -1.0,
"Geen stem": 0.0,
"Onbekend": 0.0,
"Onbekend stem": 0.0,
"Blanco": 0.0,
}
def _safe_k(mat: np.ndarray, k: int) -> int:
"""Return a safe k for svds: must be < min(mat.shape)."""
if mat is None:
return 0
m, n = mat.shape
min_dim = min(m, n)
# svds requires k < min_dim
if min_dim <= 1:
return 0
return min(k, min_dim - 1)
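# Example: for a 5 x 3 vote matrix and k=50, _safe_k returns 2, because svds
# requires k to be strictly smaller than min(n_rows, n_cols).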
def _build_vote_matrix(
db: MotionDatabase, start_date: str, end_date: str
) -> Tuple[np.ndarray, List[str], List[int]]:
"""Build dense vote matrix (mp x motion) for votes between start_date and end_date.
Returns (matrix, mp_names, motion_ids)
"""
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
(start_date, end_date),
).fetchall()
conn.close()
if not rows:
return np.zeros((0, 0)), [], []
motion_ids = sorted({int(r[0]) for r in rows})
mp_names = sorted({r[1] for r in rows})
m = len(mp_names)
n = len(motion_ids)
mat = np.zeros((m, n), dtype=float)
mp_index = {name: i for i, name in enumerate(mp_names)}
motion_index = {mid: j for j, mid in enumerate(motion_ids)}
for motion_id, mp_name, vote in rows:
i = mp_index[mp_name]
j = motion_index[int(motion_id)]
val = VOTE_MAP.get(
vote, VOTE_MAP.get(vote.strip() if isinstance(vote, str) else vote, 0.0)
)
try:
mat[i, j] = float(val)
except Exception:
mat[i, j] = 0.0
return mat, mp_names, motion_ids
def _procrustes_align(
reference_anchor: np.ndarray,
current_anchor: np.ndarray,
min_overlap: int = 3,
) -> np.ndarray:
"""Align current_anchor to reference_anchor using orthogonal Procrustes.
This function will only attempt alignment when there is a reasonable number of
overlapping rows (default: min_overlap). If the overlap is too small or if any
input is invalid, the original current_anchor is returned unchanged.
Returns transformed_current_anchor
"""
# basic validation
if reference_anchor is None or current_anchor is None:
return current_anchor
if not isinstance(reference_anchor, np.ndarray) or not isinstance(
current_anchor, np.ndarray
):
return current_anchor
# Determine overlap by number of available rows. If too small, skip alignment.
n_ref = reference_anchor.shape[0]
n_cur = current_anchor.shape[0]
overlap = min(n_ref, n_cur)
if overlap < min_overlap:
_logger.debug(
"Procrustes alignment skipped: overlap %s < min_overlap %s",
overlap,
min_overlap,
)
return current_anchor
# Use only the overlapping rows to compute the orthogonal transform.
ref_sub = reference_anchor[:overlap, :]
cur_sub = current_anchor[:overlap, :]
try:
# orthogonal_procrustes(A, B) returns (R, scale) where R minimizes ||A @ R - B||_F.
# The scale value is the sum of singular values (a norm), not a multiplicative
# factor, so it is deliberately ignored here. To align current_anchor with
# reference_anchor we call orthogonal_procrustes(cur_sub, ref_sub) and apply only R.
R, _scale = orthogonal_procrustes(cur_sub, ref_sub)
transformed = current_anchor.dot(R)
return transformed
except Exception:
_logger.exception("Procrustes alignment failed")
return current_anchor
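# Illustrative usage (variable names are assumptions): align the anchor rows of a new
# window onto the reference window before comparing vectors across windows, e.g.
#   aligned = _procrustes_align(ref_window_anchors, new_window_anchors)
# All rows of the current matrix are rotated by the same R computed on the overlap.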
def run_svd_for_window(
db: MotionDatabase,
window_id: str,
start_date: str,
end_date: str,
k: int = 50,
) -> Dict:
"""Run SVD on votes in given date window and store vectors in DB.
Returns metadata dict with keys: k_used, stored_mp, stored_motion
"""
mat, mp_names, motion_ids = _build_vote_matrix(db, start_date, end_date)
if mat.size == 0 or mat.shape[0] == 0 or mat.shape[1] == 0:
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}
k_used = _safe_k(mat, k)
if k_used <= 0:
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}
# use sparse svds for efficiency
try:
A = csr_matrix(mat)
U, s, Vt = svds(A, k=k_used)
# svds does not guarantee ordering of singular values; sort descending
idx = np.argsort(s)[::-1]
s = s[idx]
U = U[:, idx]
Vt = Vt[idx, :]
# weight by singular values
mp_vecs = (U * s.reshape(1, -1)).tolist() # m x k
motion_vecs = (Vt.T * s.reshape(1, -1)).tolist() # n x k
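# Scaling rows of U (MPs) and rows of Vt.T (motions) by the singular values is
# equivalent to U @ diag(s) and V @ diag(s), so distances in the resulting
# embedding reflect how much variance each component explains.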
stored_mp = 0
stored_motion = 0
for i, mp_name in enumerate(mp_names):
db.store_svd_vector(window_id, "mp", mp_name, mp_vecs[i])
stored_mp += 1
for j, motion_id in enumerate(motion_ids):
db.store_svd_vector(window_id, "motion", str(motion_id), motion_vecs[j])
stored_motion += 1
return {
"k_used": k_used,
"stored_mp": stored_mp,
"stored_motion": stored_motion,
}
except Exception:
_logger.exception("SVD failed for window")
return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}

@ -0,0 +1,122 @@
import logging
import json
from typing import Optional, List, Tuple
import duckdb
from database import MotionDatabase, db as default_db
import ai_provider
_logger = logging.getLogger(__name__)
DEFAULT_MODEL = "qwen/qwen3-embedding-4b"
def _select_text(
db: MotionDatabase, model: str, limit: Optional[int] = None
) -> List[Tuple[int, Optional[str]]]:
"""Select motions that do not yet have an embedding for `model`.
Returns list of (motion_id, text).
"""
conn = duckdb.connect(db.db_path)
params = [model]
# prefer layman_explanation > description > title (keep compatibility with existing tests)
sql = (
"SELECT m.id, COALESCE(m.layman_explanation, m.description, m.title) AS text"
" FROM motions m"
" LEFT JOIN embeddings e ON e.motion_id = m.id AND e.model = ?"
" WHERE e.id IS NULL"
)
if limit:
sql += " LIMIT ?"
params.append(limit)
try:
rows = conn.execute(sql, params).fetchall()
conn.close()
results: List[Tuple[int, Optional[str]]] = []
for r in rows:
text_val = r[1]
# treat empty strings as no text
if text_val is None:
text = None
else:
text = str(text_val).strip() or None
results.append((int(r[0]), text))
return results
except Exception as exc:
_logger.error("Error selecting motions for embeddings: %s", exc)
try:
conn.close()
except Exception:
pass
return []
def ensure_text_embeddings(
db_path: Optional[str] = None, model: Optional[str] = None
) -> Tuple[int, int, int, int]:
"""Ensure all motions have text embeddings for `model`.
Returns tuple (stored_count, skipped_existing, skipped_no_text, errors).
"""
model = model or DEFAULT_MODEL
db = MotionDatabase(db_path) if db_path else default_db
# motions to process
to_process = _select_text(db, model)
# how many already exist
conn = duckdb.connect(db.db_path)
try:
total_motions = conn.execute("SELECT COUNT(*) FROM motions").fetchone()[0]
except Exception:
total_motions = 0
try:
existing = conn.execute(
"SELECT COUNT(DISTINCT motion_id) FROM embeddings WHERE model = ?", (model,)
).fetchone()[0]
except Exception:
existing = 0
conn.close()
stored = 0
skipped_no_text = 0
errors = 0
for motion_id, text in to_process:
if not text:
_logger.info("Skipping motion %s: no text available", motion_id)
skipped_no_text += 1
continue
try:
vec = ai_provider.get_embedding(text, model=model)
if not isinstance(vec, list):
_logger.warning(
"Embedding provider returned non-list for motion %s", motion_id
)
errors += 1
continue
res = db.store_embedding(motion_id, model, vec)
if res and res > 0:
stored += 1
else:
_logger.error(
"Failed to store embedding for motion %s (store returned %s)",
motion_id,
res,
)
errors += 1
except Exception as exc:
_logger.error(
"Error computing/storing embedding for motion %s: %s", motion_id, exc
)
errors += 1
skipped_existing = int(existing)
return stored, skipped_existing, skipped_no_text, errors

@ -0,0 +1,18 @@
[project]
name = "stemwijzer"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
"duckdb>=1.3.2",
"ibis-framework[duckdb]>=10.8.0",
"openai>=1.99.7",
"scipy>=1.11",
"umap-learn>=0.5",
"plotly>=5.0",
"pytest>=9.0.2",
"requests>=2.32.4",
"schedule>=1.2.2",
"streamlit>=1.48.0",
]

@ -0,0 +1,9 @@
import ibis
con = ibis.duckdb.connect('data/motions.db')
print(con.tables)
for t in con.tables:
print(con.table(t).head().execute().to_string())

@ -0,0 +1,3 @@
# Run this to reset your database
from database import db
db.reset_database()

@ -0,0 +1,264 @@
# scheduler.py (fixed infinite loop issue)
import schedule
import time
import duckdb
from datetime import datetime, timedelta
from api_client import TweedeKamerAPI
from summarizer import summarizer
from database import db
from config import config
class DataUpdateScheduler:
def __init__(self):
self.api_client = TweedeKamerAPI()
def test_api_connection(self) -> bool:
"""Test API connection before proceeding"""
print("Testing API connection...")
if self.api_client.test_api_connection():
print("✅ API connection successful")
return True
else:
print("❌ API connection failed")
return False
def check_database_has_data(self) -> bool:
"""Check if database has any motion data"""
try:
conn = duckdb.connect(config.DATABASE_PATH)
result = conn.execute("SELECT COUNT(*) FROM motions").fetchone()
conn.close()
return result[0] > 0 if result else False
except Exception as e:
print(f"Error checking database: {e}")
return False
def update_motions_data(self, days_back: int = 30, max_records: int = 1000):
"""Fetch new motions from API and update database"""
print(f"Starting motion data update at {datetime.now()}")
if not self.test_api_connection():
return False
try:
# Fetch recent motions from API (respecting API limits)
start_date = datetime.now() - timedelta(days=days_back)
motions = self.api_client.get_motions(
start_date=start_date,
limit=max_records
)
print(f"Fetched {len(motions)} motions from API")
if not motions:
print("No motions received from API")
return False
# Insert new motions into database
successful_inserts = 0
duplicate_count = 0
for motion in motions:
if db.insert_motion(motion):
successful_inserts += 1
else:
duplicate_count += 1
print(f"Successfully inserted {successful_inserts} new motions")
if duplicate_count > 0:
print(f"Skipped {duplicate_count} duplicate motions")
# Generate AI summaries for new motions (only if we have new data)
if successful_inserts > 0:
print("Generating AI summaries for new motions...")
summarizer.update_motion_summaries()
print("Motion data update completed successfully")
return True
except Exception as e:
print(f"Error during motion data update: {e}")
return False
def initial_data_load(self):
"""Perform initial data load with comprehensive data"""
print("Performing initial comprehensive data load...")
if not self.test_api_connection():
return False
try:
# Start from 2 years ago but make sure we don't go into the future
start_date = datetime.now() - timedelta(days=730)
end_date = datetime.now()
print(f"Loading data from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
# Use a single request for recent data first, then expand if needed
chunk_days = 90 # 3-month chunks
current_date = start_date
all_motions = []
chunks_processed = 0
max_chunks = 10 # Safety limit to prevent infinite loops
while current_date < end_date and chunks_processed < max_chunks:
chunk_end_date = min(current_date + timedelta(days=chunk_days), end_date)
print(f"Fetching chunk {chunks_processed + 1}/{max_chunks}: {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}")
try:
# Fetch data for this time chunk
chunk_motions = self.api_client.get_motions(
start_date=current_date,
end_date=chunk_end_date,
limit=250 # Reasonable limit per chunk
)
if chunk_motions:
all_motions.extend(chunk_motions)
print(f"✅ Found {len(chunk_motions)} motions in this chunk (Total: {len(all_motions)})")
else:
print(f" No motions found in chunk {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}")
except Exception as e:
print(f"❌ Error fetching chunk {current_date.strftime('%Y-%m-%d')} to {chunk_end_date.strftime('%Y-%m-%d')}: {e}")
# IMPORTANT: Always increment the date to avoid infinite loop
current_date = chunk_end_date
chunks_processed += 1
# Add delay between chunks
if chunks_processed < max_chunks and current_date < end_date:
time.sleep(2)
print(f"Data collection completed. Total motions fetched: {len(all_motions)}")
if not all_motions:
print("❌ No motions retrieved from API. This might be normal if the API doesn't have recent data.")
print("💡 Try adjusting the date range or check if the API has data for the selected period.")
# Try a broader date range as fallback
print("🔄 Trying broader date range (last 30 days)...")
fallback_start = datetime.now() - timedelta(days=30)
fallback_motions = self.api_client.get_motions(
start_date=fallback_start,
limit=250
)
if fallback_motions:
all_motions = fallback_motions
print(f"✅ Fallback successful: Found {len(fallback_motions)} motions")
else:
print("❌ No data found even with broader date range")
return False
# Insert all motions with progress tracking
successful_inserts = 0
duplicate_count = 0
print(f"Inserting {len(all_motions)} motions into database...")
for i, motion in enumerate(all_motions):
if i % 25 == 0: # Progress indicator every 25 motions
print(f"Processing motion {i+1}/{len(all_motions)} ({((i+1)/len(all_motions)*100):.1f}%)")
if db.insert_motion(motion):
successful_inserts += 1
else:
duplicate_count += 1
print(f"✅ Successfully inserted {successful_inserts} motions")
if duplicate_count > 0:
print(f" Skipped {duplicate_count} duplicate motions")
# Generate summaries if we have data
if successful_inserts > 0:
print("🤖 Generating AI summaries...")
summarizer.update_motion_summaries()
print("🎉 Initial data load completed!")
return successful_inserts > 0
except Exception as e:
print(f"❌ Error during initial data load: {e}")
return False
def weekly_update_job(self):
"""Weekly job to update with new motions"""
print(f"Starting weekly update job at {datetime.now()}")
# Use smaller limits for regular updates
self.update_motions_data(days_back=14, max_records=250)
print("Weekly update job completed")
def run_scheduler(self):
"""Main scheduler function"""
print("=" * 50)
print("Dutch Political Compass Data Scheduler")
print("=" * 50)
# Check if database has data
has_data = self.check_database_has_data()
print(f"Database has existing data: {has_data}")
if not has_data:
print("\n🔄 No data found in database. Running initial data load...")
success = self.initial_data_load()
if success:
print("✅ Initial data load completed successfully!")
else:
print("❌ Initial data load failed or no data available.")
print("💡 You may need to check the API or adjust the date range.")
return
else:
print("✅ Database already contains motion data.")
# Ask if user wants to update anyway
try:
response = input("\nDo you want to fetch recent motions anyway? (y/n): ").lower().strip()
if response in ['y', 'yes']:
print("🔄 Updating with recent motions...")
self.update_motions_data(days_back=7, max_records=250)
except KeyboardInterrupt:
print("\nSkipping manual update.")
# Schedule regular updates
print("\n📅 Scheduling regular updates...")
schedule.every().monday.at("02:00").do(self.weekly_update_job)
schedule.every().thursday.at("14:00").do(lambda: self.update_motions_data(days_back=7, max_records=250))
print("Jobs scheduled:")
print("- Weekly motion update: Every Monday at 02:00")
print("- Mid-week update: Every Thursday at 14:00")
print(f"- API limit per request: {config.API_MAX_LIMIT} records")
print("\n🔄 Scheduler is now running. Press Ctrl+C to stop.")
try:
while True:
schedule.run_pending()
time.sleep(3600) # Check every hour
except KeyboardInterrupt:
print("\n👋 Scheduler stopped by user.")
def run_once():
"""Run data update once and exit"""
scheduler = DataUpdateScheduler()
print("Running one-time data update...")
has_data = scheduler.check_database_has_data()
if not has_data:
print("No existing data found. Running initial data load...")
scheduler.initial_data_load()
else:
print("Updating existing data with recent motions...")
scheduler.update_motions_data(days_back=14, max_records=250)
print("One-time update completed!")
if __name__ == "__main__":
import sys
if len(sys.argv) > 1 and sys.argv[1] == "--once":
run_once()
else:
scheduler = DataUpdateScheduler()
scheduler.run_scheduler()

@ -0,0 +1,183 @@
# scraper.py
import requests
from bs4 import BeautifulSoup
import time
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from database import db
from config import config
class MotionScraper:
def __init__(self):
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
def scrape_motion_list(self, start_date: datetime = None, end_date: datetime = None) -> List[str]:
"""Scrape motion URLs from the main page"""
if not start_date:
start_date = datetime.now() - timedelta(days=730) # 2 years ago
if not end_date:
end_date = datetime.now()
motion_urls = []
page = 1
while True:
try:
url = f"{config.BASE_URL}?page={page}"
response = self.session.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Find motion links (adjust selectors based on actual HTML structure)
motion_links = soup.find_all('a', href=re.compile(r'/stemmingsuitslagen/'))
if not motion_links:
break
for link in motion_links:
href = link.get('href')
if href and href not in motion_urls:
motion_urls.append(href)
page += 1
time.sleep(config.SCRAPING_DELAY)
except Exception as e:
print(f"Error scraping page {page}: {e}")
break
return motion_urls
def parse_motion_detail(self, motion_url: str) -> Optional[Dict]:
"""Parse individual motion details"""
try:
full_url = f"https://www.tweedekamer.nl{motion_url}" if motion_url.startswith('/') else motion_url
response = self.session.get(full_url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Extract motion data (adjust selectors based on actual HTML structure)
title = self._extract_title(soup)
description = self._extract_description(soup)
date = self._extract_date(soup)
policy_area = self._extract_policy_area(soup)
voting_results = self._extract_voting_results(soup)
if not all([title, voting_results]):
return None
# Calculate winning margin
total_votes = sum(1 for vote in voting_results.values() if vote in ['voor', 'tegen'])
if total_votes == 0:
return None
votes_for = sum(1 for vote in voting_results.values() if vote == 'voor')
winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes
return {
'title': title,
'description': description or '',
'date': date,
'policy_area': policy_area or 'Onbekend',
'voting_results': voting_results,
'winning_margin': winning_margin,
'url': full_url
}
except Exception as e:
print(f"Error parsing motion {motion_url}: {e}")
return None
def _extract_title(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion title"""
# Look for common title selectors
selectors = ['h1', '.motion-title', '.title', 'h2']
for selector in selectors:
element = soup.select_one(selector)
if element:
return element.get_text(strip=True)
return None
def _extract_description(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion description"""
# Look for description elements
selectors = ['.motion-description', '.description', '.content', 'p']
for selector in selectors:
elements = soup.select(selector)
if elements:
return ' '.join(el.get_text(strip=True) for el in elements[:3])
return None
def _extract_date(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract motion date"""
# Look for date patterns
date_pattern = re.compile(r'\d{1,2}-\d{1,2}-\d{4}|\d{4}-\d{1,2}-\d{1,2}')
text = soup.get_text()
match = date_pattern.search(text)
if match:
return match.group()
return datetime.now().strftime('%Y-%m-%d')
def _extract_policy_area(self, soup: BeautifulSoup) -> Optional[str]:
"""Extract policy area/category"""
# Look for category indicators
text = soup.get_text().lower()
for area in config.POLICY_AREAS[1:]: # Skip "Alle"
if area.lower() in text:
return area
return "Algemeen"
def _extract_voting_results(self, soup: BeautifulSoup) -> Dict[str, str]:
"""Extract party voting results"""
# This is a simplified extraction - you'll need to adjust based on actual HTML
voting_results = {}
# Look for voting tables or lists
tables = soup.find_all('table')
for table in tables:
rows = table.find_all('tr')
for row in rows:
cells = row.find_all(['td', 'th'])
if len(cells) >= 2:
party = cells[0].get_text(strip=True)
vote = cells[1].get_text(strip=True).lower()
if vote in ['voor', 'tegen', 'afwezig']:
voting_results[party] = vote
# Fallback: simulate some voting data for testing
if not voting_results:
parties = ['VVD', 'PVV', 'CDA', 'D66', 'GL', 'SP', 'PvdA', 'CU', 'PvdD', 'FVD', '50PLUS', 'SGP']
import random
for party in parties:
voting_results[party] = random.choice(['voor', 'tegen', 'afwezig'])
return voting_results
def run_scraping_job(self):
"""Main scraping job"""
print("Starting motion scraping...")
motion_urls = self.scrape_motion_list()
print(f"Found {len(motion_urls)} motion URLs")
successful_scrapes = 0
for i, url in enumerate(motion_urls):
print(f"Processing motion {i+1}/{len(motion_urls)}: {url}")
motion_data = self.parse_motion_detail(url)
if motion_data:
if db.insert_motion(motion_data):
successful_scrapes += 1
time.sleep(config.SCRAPING_DELAY)
print(f"Scraping completed. Successfully scraped {successful_scrapes} motions.")
scraper = MotionScraper()

@ -0,0 +1,128 @@
"""Compute summaries and embeddings for a small test batch of motions.
Usage:
# dry-run (no network calls)
python scripts/compute_test_batch.py --limit 20 --dry-run
# run (will call AI provider; requires OPENROUTER_API_KEY)
python scripts/compute_test_batch.py --limit 20
This script is intentionally simple and intended for manual invocation.
It will update motions.layman_explanation and store embeddings via db.store_embedding if available.
"""
from __future__ import annotations
import argparse
import logging
import sys
from typing import List
import duckdb
from config import config
import ai_provider
from database import db
from summarizer import MotionSummarizer
logger = logging.getLogger("compute_test_batch")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
def fetch_motion_candidates(limit: int) -> List[dict]:
conn = duckdb.connect(config.DATABASE_PATH)
try:
# Prefer motions that still lack a layman_explanation so we don't re-process recent ones
rows = conn.execute(
"SELECT id, title, description FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' ORDER BY created_at DESC LIMIT ?",
(limit,),
).fetchall()
return [{"id": r[0], "title": r[1], "description": r[2] or ""} for r in rows]
finally:
conn.close()
def process_batch(limit: int = 20, dry_run: bool = False):
summarizer = MotionSummarizer()
motions = fetch_motion_candidates(limit)
logger.info("Found %d motions to process", len(motions))
conn = duckdb.connect(config.DATABASE_PATH)
try:
for i, m in enumerate(motions, start=1):
mid = m["id"]
title = m["title"]
desc = m["description"]
logger.info(
"[%d/%d] Processing motion id=%s title=%s", i, len(motions), mid, title
)
if dry_run:
logger.info(
"Dry run: would generate summary and embedding for motion %s", mid
)
continue
# Generate summary
summary = summarizer.generate_layman_explanation(title, desc)
# Update DB
try:
conn.execute(
"UPDATE motions SET layman_explanation = ? WHERE id = ?",
(summary, mid),
)
except Exception as e:
logger.exception("Failed to update motion %s: %s", mid, e)
# Compute embedding and store
try:
emb = ai_provider.get_embedding(summary)
store_fn = getattr(db, "store_embedding", None)
if callable(store_fn):
store_fn(mid, "text-embedding-3-small", emb)
logger.info("Stored embedding for motion %s", mid)
else:
logger.warning(
"No store_embedding available on db; skipping storage"
)
except ai_provider.ProviderError as e:
logger.exception(
"Failed to compute/store embedding for motion %s: %s", mid, e
)
finally:
conn.close()
def main(argv=None):
p = argparse.ArgumentParser()
p.add_argument("--limit", type=int, default=20, help="Number of motions to process")
p.add_argument(
"--dry-run",
action="store_true",
help="Do not call external APIs; just show what would run",
)
args = p.parse_args(argv)
if args.dry_run:
logger.info("Running in dry-run mode; no network calls will be made")
# Safety: confirm when not dry-run
if not args.dry_run:
confirm = (
input(
f"This will call the AI provider for {args.limit} motions and may incur cost. Continue? (y/N): "
)
.strip()
.lower()
)
if confirm not in ("y", "yes"):
logger.info("Aborting per user choice")
sys.exit(0)
process_batch(limit=args.limit, dry_run=args.dry_run)
if __name__ == "__main__":
main()

@ -0,0 +1,35 @@
"""Motion-related simple types and JSON helpers.
Decision: MotionId is an alias for str for simplicity.
"""
from dataclasses import dataclass, asdict
from typing import List
import json
MotionId = str
Embedding = List[float]
@dataclass
class SimilarityNeighbor:
motion_id: MotionId
score: float
def to_json(neighbors: List[SimilarityNeighbor]) -> str:
"""Serialize a list of SimilarityNeighbor to a JSON string.
The format is a JSON list of objects with keys 'motion_id' and 'score'.
"""
list_of_dicts = [asdict(n) for n in neighbors]
return json.dumps(list_of_dicts)
def from_json(json_str: str) -> List[SimilarityNeighbor]:
"""Deserialize a JSON string (list of dicts) into SimilarityNeighbor list."""
parsed = json.loads(json_str)
return [
SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"]))
for item in parsed
]
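# Round-trip example:
#   neighbors = [SimilarityNeighbor(motion_id="m-1", score=0.92)]
#   payload = to_json(neighbors)   # '[{"motion_id": "m-1", "score": 0.92}]'
#   assert from_json(payload) == neighbors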

@ -0,0 +1,101 @@
# summarizer.py (refactored to use ai_provider)
from typing import Optional
import logging
import duckdb
from config import config
import ai_provider
from database import db
logger = logging.getLogger(__name__)
class MotionSummarizer:
def __init__(self):
# Stateless; use ai_provider functions directly
pass
def _build_prompt_messages(self, title: str, body_text: str) -> list[dict]:
prompt = f"""
Leg deze Nederlandse parlementaire motie uit in eenvoudige, toegankelijke taal:
Titel: {title}
Tekst: {body_text}
Geef een uitleg van 2-3 zinnen die:
- Gebruik maakt van alledaagse taal
- De praktische impact op burgers uitlegt
- Politiek jargon vermijdt
- Neutraal en feitelijk blijft
Antwoord alleen met de uitleg, geen introductie of extra tekst.
"""
return [
{
"role": "system",
"content": "Je bent een expert in het uitleggen van politieke onderwerpen in eenvoudige taal voor Nederlandse burgers.",
},
{"role": "user", "content": prompt},
]
def generate_layman_explanation(self, title: str, body_text: str) -> str:
"""Generate a layman-friendly explanation via ai_provider.
Returns an empty string on failure (non-fatal).
"""
messages = self._build_prompt_messages(title, body_text or "")
try:
return ai_provider.chat_completion(messages, model=config.QWEN_MODEL)
except ai_provider.ProviderError:
logger.exception("AI provider failed to generate summary")
return ""
def update_motion_summaries(
self,
compute_embeddings: bool = True,
embedding_model: str = "qwen/qwen3-embedding-4b",
):
"""Find motions missing layman_explanation and generate summaries.
Uses body_text when available, falls back to description, then title only.
If compute_embeddings is True and database provides store_embedding, compute and store embeddings.
"""
conn = duckdb.connect(config.DATABASE_PATH)
try:
rows = conn.execute(
"SELECT id, title, description, body_text FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' LIMIT 50"
).fetchall()
for motion_id, title, description, body_text in rows:
input_text = body_text or description or ""
summary = self.generate_layman_explanation(title, input_text)
if summary is None:
summary = ""
conn.execute(
"UPDATE motions SET layman_explanation = ? WHERE id = ?",
(summary, motion_id),
)
logger.info("Updated summary for motion %s", motion_id)
if compute_embeddings and summary:
logger.info(
"Computing embedding for motion %s using model %s",
motion_id,
embedding_model,
)
# compute embedding and try to store via database helper if available
try:
emb = ai_provider.get_embedding(summary, model=embedding_model)
store_fn = getattr(db, "store_embedding", None)
if callable(store_fn):
store_fn(motion_id, embedding_model, emb)
except ai_provider.ProviderError:
logger.exception(
"Failed to compute/store embedding for motion %s", motion_id
)
finally:
conn.close()
summarizer = MotionSummarizer()

@ -0,0 +1,16 @@
# test_single_insert.py
from database import db
test_motion = {
'title': 'Test Motion',
'description': 'This is a test motion',
'date': '2024-01-01',
'policy_area': 'Test',
'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
'winning_margin': 0.5,
'url': 'https://test.com/motion1'
}
success = db.insert_motion(test_motion)
print(f"Insert successful: {success}")

@ -0,0 +1 @@
"""Make the tests directory a package so test helpers can be imported."""

@ -0,0 +1,63 @@
import tempfile
import pytest
# Load test fixtures from the utils package so pytest can discover them.
pytest_plugins = ["tests.utils.migration_fixtures"]
@pytest.fixture
def tmp_duckdb_path(tmp_path):
p = tmp_path / "test.db"
return str(p)
@pytest.fixture
def tmp_duckdb_conn(tmp_duckdb_path):
# Import duckdb lazily so running pytest doesn't fail on machines
# where duckdb is not installed (CI / contributor machines that don't
# need the duckdb-based fixtures). If duckdb is missing, skip this
# fixture at runtime when it's requested.
try:
import duckdb
except Exception:
pytest.skip("duckdb not installed, skipping duckdb fixtures")
conn = duckdb.connect(database=tmp_duckdb_path)
yield conn
try:
conn.close()
except Exception:
pass
@pytest.fixture
def monkeypatch_ai_provider(monkeypatch):
"""Patch ai_provider.get_embedding to return deterministic 16-dim vector."""
import ai_provider
fake = [0.01] * 16
monkeypatch.setattr(ai_provider, "get_embedding", lambda text, model=None: fake)
return fake
@pytest.fixture
def mock_odata_client(monkeypatch):
"""
Patch requests.Session.get for OData calls.
Returns a configurable mock response; override its json.return_value (or replace MockSession.response) in a test to change the canned payload.
"""
import requests
from unittest.mock import MagicMock
mock_response = MagicMock()
mock_response.raise_for_status.return_value = None
mock_response.json.return_value = {"value": []}
class MockSession:
response = mock_response
def get(self, *args, **kwargs):
return self.response
monkeypatch.setattr(requests, "Session", MockSession)
return mock_response
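# Usage sketch: a test can override the canned OData payload before exercising code
# that calls requests.Session().get(...), e.g.
#   def test_something(mock_odata_client):
#       mock_odata_client.json.return_value = {"value": [{"Persoon": {...}}]}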

@ -0,0 +1 @@
"""Fixtures package for tests."""

@ -0,0 +1,40 @@
[
{
"motion_id": 1,
"date": "2024-01-15",
"voting_results": {
"VVD": "voor",
"PvdA": "tegen",
"CDA": "voor",
"D66": "voor",
"Wilders, G.": "voor",
"Yesilgöz-Zegerius, D.": "voor",
"Jetten, R.A.A.": "voor"
}
},
{
"motion_id": 2,
"date": "2024-02-10",
"voting_results": {
"VVD": "tegen",
"PvdA": "voor",
"CDA": "afwezig",
"D66": "voor",
"Wilders, G.": "tegen",
"Yesilgöz-Zegerius, D.": "tegen",
"Ploumen, L.J.": "voor"
}
},
{
"motion_id": 3,
"date": "2024-03-05",
"voting_results": {
"VVD": "voor",
"SP": "tegen",
"GroenLinks": "voor",
"PVV": "voor",
"Van der Plas, C.": "voor",
"Klever, N.C.": "voor"
}
}
]

@ -0,0 +1,87 @@
import json
import os
import numpy as np
import pytest
# duckdb is an optional dependency in some environments; skip test if not available
duckdb = pytest.importorskip("duckdb")
def test_pipeline_end_to_end(tmp_path, monkeypatch):
# ensure determinism for any random embedding generation
np.random.seed(0)
# prepare temp db
db_path = str(tmp_path / "motions.db")
# create the minimal MotionDatabase schema using existing code where possible
from database import MotionDatabase
db = MotionDatabase(db_path)
# create embeddings table (migration would normally do this)
conn = duckdb.connect(db.db_path)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
)
# insert three motions
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t1", "d1", "u1", "ex1"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t2", "d2", "u2", "ex2"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t3", "d3", "u3", "ex3"),
)
# fetch ids
rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
ids = [r[0] for r in rows]
# insert existing embedding for first motion
vec = json.dumps([0.1] * 16)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
(ids[0], "test-model", vec),
)
conn.close()
# monkeypatch ai_provider.get_embedding to deterministic vector
import ai_provider
def fake_get_embedding(text, model=None):
# produce a deterministic vector based on seeded numpy
return list(np.random.rand(16))
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings
stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
db_path=db_path, model="test-model"
)
assert stored == 2
assert skipped_existing == 1
assert skipped_no_text == 0
assert errors == 0
# verify stored vectors length
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
("test-model",),
).fetchall()
conn.close()
assert len(rows) == 3
for r in rows:
v = json.loads(r[0])
assert len(v) == 16

@ -0,0 +1,58 @@
import os
import pathlib
import sqlite3
import re
import pytest
def test_migration_file_exists_and_name():
migrations_dir = pathlib.Path("migrations")
expected_name = "2026-03-22-add-audit-events.sql"
migration_path = migrations_dir / expected_name
# File must exist
assert migration_path.exists(), f"Migration file {migration_path} does not exist"
# Name sanity check
assert migration_path.name == expected_name
def _strip_sql_comments(sql_text: str) -> str:
# Remove SQL single-line comments -- ... and C-style /* ... */
# Use multiline-aware single-line removal for safety.
no_single = re.sub(r"--.*?$", "", sql_text, flags=re.MULTILINE)
no_block = re.sub(r"/\*.*?\*/", "", no_single, flags=re.DOTALL)
return no_block.strip()
def test_optional_apply_sql_if_db_available():
"""
If TEST_DB_URL is provided, attempt to apply the SQL.
For safety this test will skip applying when the SQL is empty or commented out.
Only sqlite URLs (sqlite:///path/to/db) are attempted here to avoid adding
extra dependencies; other URL schemes will cause the test to be skipped.
"""
db_url = os.environ.get("TEST_DB_URL")
if not db_url:
pytest.skip("TEST_DB_URL not set - skipping DB application")
migration_path = pathlib.Path("migrations") / "2026-03-22-add-audit-events.sql"
sql = migration_path.read_text(encoding="utf8")
stripped = _strip_sql_comments(sql)
if not stripped:
pytest.skip("Migration SQL is empty or commented out - skipping application")
# Only handle sqlite URLs here
if db_url.startswith("sqlite:///"):
db_path = db_url.replace("sqlite:///", "", 1)
try:
conn = sqlite3.connect(db_path)
try:
conn.executescript(sql)
finally:
conn.close()
except Exception as e:
pytest.skip(f"Could not apply SQL to sqlite DB: {e}")
else:
pytest.skip(f"TEST_DB_URL set but scheme not supported by this test: {db_url}")

@ -0,0 +1,85 @@
import os
import re
import pathlib
import pytest
# small migration filename/header tests; keep imports minimal
MIGRATION_FILENAME = "2026-03-22-add-similarity-cache.sql"
MIGRATION_PATH = pathlib.Path("migrations") / MIGRATION_FILENAME
def _strip_sql_comments(sql: str) -> str:
"""Remove SQL single-line (-- ...) and C-style (/* ... */) comments.
This is a best-effort stripper sufficient for the test's purpose.
"""
# remove block comments
sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.S)
# remove line comments
sql = re.sub(r"--.*?$", "", sql, flags=re.M)
return sql.strip()
def test_migration_file_exists_and_header():
# file must exist
assert MIGRATION_PATH.exists(), f"Migration file {MIGRATION_PATH} not found"
text = MIGRATION_PATH.read_text(encoding="utf8")
# header should reference the filename and purpose
assert MIGRATION_FILENAME in text.splitlines()[0], (
"First line should include the filename"
)
assert "similarity" in text.lower(), "Header should mention similarity"
def test_optional_apply_migration_safe():
# If TEST_DB_URL is set, try to apply the SQL only if it contains non-comment statements.
db_url = os.environ.get("TEST_DB_URL")
sql = MIGRATION_PATH.read_text(encoding="utf8")
stripped = _strip_sql_comments(sql)
# If there is no DB url, consider this a filename/header validation test only.
if not db_url:
pytest.skip("TEST_DB_URL not set; skipping DB apply step")
# If the SQL is empty (only comments), nothing to apply — test passes.
if not stripped:
pytest.skip("Migration contains no executable SQL; nothing to apply")
# Otherwise attempt to execute the SQL. Be conservative: if drivers are missing or
# connection fails, skip the test rather than failing CI. Only unexpected errors
# during execution should fail the test.
try:
if db_url.startswith("sqlite:"):
import sqlite3
# sqlite URL might be sqlite:///path or sqlite:///:memory:
path = db_url.split("sqlite:", 1)[1]
# normalize prefixes like ///
path = path.lstrip("/") or ":memory:"
conn = sqlite3.connect(path)
try:
conn.executescript(sql)
finally:
conn.close()
elif db_url.startswith("postgresql:") or db_url.startswith("postgres:"):
try:
import psycopg2
except Exception as e: # pragma: no cover - driver may be absent in CI
pytest.skip(f"psycopg2 not available: {e}")
# psycopg2 accepts a DSN; rely on that here.
conn = psycopg2.connect(db_url)
try:
cur = conn.cursor()
cur.execute(sql)
conn.commit()
finally:
conn.close()
else:
pytest.skip(f"DB URL scheme not supported by this test: {db_url}")
except Exception as exc:
# Unexpected error while applying SQL should fail the test.
raise

@ -0,0 +1,29 @@
"""Smoke test for the migration test_db fixture.
This test imports the `test_db` fixture and asserts expected behavior in two
cases:
- If the environment variable TEST_DB_URL is not set, the fixture should yield
None.
- If TEST_DB_URL is set, the fixture should yield a connection-like object
(we check for an object with a `cursor` attribute or the sqlite3 connection
type).
"""
import os
import types
import pytest
def test_migration_fixture_smoke(test_db):
"""Smoke test ensuring the test_db fixture yields expected values."""
url = os.environ.get("TEST_DB_URL")
if not url:
assert test_db is None
else:
# For sqlite we expect a sqlite3.Connection which has a 'cursor'
# method. Be permissive and accept any object with a 'cursor'
# attribute or callable.
assert test_db is not None
assert hasattr(test_db, "cursor") or hasattr(test_db, "execute")

@ -0,0 +1,49 @@
import os
import types
import pytest
import ai_provider
class DummyResponse:
def __init__(self, status_code=200, json_data=None):
self.status_code = status_code
self._json = json_data or {}
def json(self):
return self._json
def test_get_embedding_success(monkeypatch):
fake = DummyResponse(json_data={"data": [{"embedding": [0.1, 0.2, 0.3]}]})
def fake_post(url, json, headers, timeout):
return fake
monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
monkeypatch.setattr("requests.post", fake_post)
emb = ai_provider.get_embedding("hello world")
assert emb == [0.1, 0.2, 0.3]
def test_chat_completion_success(monkeypatch):
fake = DummyResponse(json_data={"choices": [{"message": {"content": "summary"}}]})
def fake_post(url, json, headers, timeout):
return fake
monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
monkeypatch.setattr("requests.post", fake_post)
out = ai_provider.chat_completion([{"role": "user", "content": "hi"}])
assert out == "summary"
def test_missing_api_key_raises(monkeypatch):
# Ensure env var is not set
monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)
with pytest.raises(ai_provider.ProviderError):
ai_provider.get_embedding("x")

@ -0,0 +1,74 @@
import json
import duckdb
import logging
from pipeline.extract_mp_votes import extract_mp_votes
from database import MotionDatabase
def test_extract_mp_votes(tmp_path):
db_file = tmp_path / "test.db"
# Initialize database
mdb = MotionDatabase(db_path=str(db_file))
# Load fixture
fixture_path = "tests/fixtures/sample_voting_results.json"
with open(fixture_path, "r") as fh:
fixtures = json.load(fh)
# Insert motions into motions table
conn = duckdb.connect(str(db_file))
try:
for item in fixtures:
motion_id = item.get("motion_id")
date = item.get("date")
voting_results = item.get("voting_results")
conn.execute(
"""
INSERT INTO motions (id, title, description, date, policy_area, voting_results, winning_margin, url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""",
(
motion_id,
f"Test Motion {motion_id}",
"",
date,
"Test",
json.dumps(voting_results),
0.5,
f"http://example/{motion_id}",
),
)
finally:
conn.close()
# Run extraction
res = extract_mp_votes(db_path=str(db_file))
# Expected MP rows: count keys that contain a comma in fixtures
expected_mp_count = 0
for item in fixtures:
for k in item.get("voting_results", {}).keys():
if "," in k:
expected_mp_count += 1
assert res["mp_rows_inserted"] == expected_mp_count
assert res["motions_skipped"] == 0
# Verify mp_votes table contains only rows with comma in mp_name and count matches
conn = duckdb.connect(str(db_file))
try:
rows = conn.execute("SELECT mp_name FROM mp_votes").fetchall()
finally:
conn.close()
assert len(rows) == expected_mp_count
for (mp_name,) in rows:
assert "," in mp_name
# Running again should be idempotent: no new mp rows, motions_skipped > 0
res2 = extract_mp_votes(db_path=str(db_file))
assert res2["mp_rows_inserted"] == 0
assert res2["motions_skipped"] > 0

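The test fixes the observable contract of `extract_mp_votes`: only `voting_results` keys containing a comma count as individual MPs (party totals do not), rows land in `mp_votes`, the return value reports `mp_rows_inserted` and `motions_skipped`, and a second run skips motions that were already processed. A rough sketch under those assumptions; the exact column mapping and vote-value handling are guesses rather than the project's actual code:

```python
import json
import duckdb


def extract_mp_votes(db_path: str) -> dict:
    """Explode motions.voting_results into per-MP rows in mp_votes (idempotent)."""
    conn = duckdb.connect(db_path)
    inserted, skipped = 0, 0
    try:
        motions = conn.execute(
            "SELECT id, date, voting_results FROM motions WHERE voting_results IS NOT NULL"
        ).fetchall()
        for motion_id, date, raw in motions:
            already = conn.execute(
                "SELECT count(*) FROM mp_votes WHERE motion_id = ?", (motion_id,)
            ).fetchone()[0]
            if already:
                skipped += 1
                continue
            for name, vote in json.loads(raw).items():
                # Keys with a comma are individual MPs ("Lastname, Initials");
                # keys without one are party aggregates and are ignored here.
                if "," not in name:
                    continue
                conn.execute(
                    "INSERT INTO mp_votes (motion_id, mp_name, vote, date) VALUES (?, ?, ?, ?)",
                    (motion_id, name, vote, date),
                )
                inserted += 1
    finally:
        conn.close()
    return {"mp_rows_inserted": inserted, "motions_skipped": skipped}
```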
@ -0,0 +1,103 @@
import requests
import pytest
try:
import duckdb
except Exception:
pytest.skip(
"duckdb not installed, skipping fetch_mp_metadata tests",
allow_module_level=True,
)
from pipeline.fetch_mp_metadata import fetch_mp_metadata, normalize_mp_name
class MockResponse:
def __init__(self, data, status_code=200):
self._data = data
self.status_code = status_code
def raise_for_status(self):
if not (200 <= self.status_code < 300):
raise requests.HTTPError(f"status {self.status_code}")
def json(self):
return self._data
class MockSession:
def __init__(self, response):
self._response = response
def get(self, url):
return self._response
def test_fetch_mp_metadata_idempotent(tmp_path, monkeypatch):
# Prepare canned OData response with two FractieZetelPersoon records
data = {
"value": [
{
"Persoon": {
"Achternaam": "Yesilgöz-Zegerius",
"Initialen": "D.",
"Tussenvoegsel": None,
"Id": "guid-1",
},
"FractieZetel": {"Fractie": {"NaamNL": "VVD"}},
"Van": "2023-01-01",
"TotEnMet": None,
},
{
"Persoon": {
"Achternaam": "Plas",
"Initialen": "C.",
"Tussenvoegsel": "van der",
"Id": "guid-2",
},
"FractieZetel": {"Fractie": {"NaamNL": "BBB"}},
"Van": "2023-06-01",
"TotEnMet": "2024-01-01",
},
]
}
mock_resp = MockResponse(data)
mock_session = MockSession(mock_resp)
# Patch requests.Session to return our mock session
monkeypatch.setattr(requests, "Session", lambda: mock_session)
db_path = str(tmp_path / "test.db")
# First run
count = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
assert count == 2
# Verify DB contents
conn = duckdb.connect(db_path)
rows = conn.execute(
"SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata ORDER BY mp_name"
).fetchall()
conn.close()
assert len(rows) == 2
# Check normalized names
assert rows[0][0] == normalize_mp_name("Plas", "C.", "van der")
assert rows[0][1] == "BBB"
assert str(rows[0][2]) == "2023-06-01"
assert str(rows[0][3]) == "2024-01-01"
assert rows[0][4] == "guid-2"
assert rows[1][0] == normalize_mp_name("Yesilgöz-Zegerius", "D.", None)
assert rows[1][1] == "VVD"
assert str(rows[1][2]) == "2023-01-01"
assert rows[1][3] is None
assert rows[1][4] == "guid-1"
# Run again to assert idempotence (no exception and same count processed)
count2 = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
assert count2 == 2

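Each record in the canned OData payload flattens to one `mp_metadata` row: a normalised name, the party from `FractieZetel.Fractie.NaamNL`, the `Van` / `TotEnMet` period, and the Persoon GUID. The test never pins down the exact output of `normalize_mp_name` (it compares against the function itself), so the name format below is only one plausible convention; the record-parsing shape is grounded in the fixture:

```python
def normalize_mp_name(achternaam, initialen, tussenvoegsel=None):
    # One plausible normalisation: surname first, then initials, then an
    # optional tussenvoegsel. The real pipeline may use a different format.
    name = f"{achternaam}, {initialen}"
    if tussenvoegsel:
        name = f"{name} {tussenvoegsel}"
    return name


def parse_fractiezetel_record(record: dict) -> tuple:
    """Flatten one OData FractieZetelPersoon record into an mp_metadata row."""
    persoon = record["Persoon"]
    return (
        normalize_mp_name(
            persoon["Achternaam"], persoon["Initialen"], persoon.get("Tussenvoegsel")
        ),
        record["FractieZetel"]["Fractie"]["NaamNL"],  # party
        record.get("Van"),                            # start of the seat period
        record.get("TotEnMet"),                       # end of the seat period, may be None
        persoon["Id"],                                # persoon_id (GUID)
    )
```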
@ -0,0 +1,79 @@
import json
import duckdb
from database import MotionDatabase
def test_fuse_for_window(tmp_path):
db_path = str(tmp_path / "motions.db")
# Create MotionDatabase (this will initialize schema except embeddings)
db = MotionDatabase(db_path=db_path)
# Create embeddings table (migration not run by MotionDatabase)
conn = duckdb.connect(db_path)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"""
CREATE TABLE IF NOT EXISTS embeddings (
id INTEGER DEFAULT nextval('embeddings_id_seq'),
motion_id INTEGER NOT NULL,
model TEXT NOT NULL,
vector JSON NOT NULL,
created_at TIMESTAMP DEFAULT current_timestamp,
PRIMARY KEY (id)
)
"""
)
conn.close()
# Insert 3 synthetic SVD vectors (k=4)
svd1 = [0.1, 0.2, 0.3, 0.4]
svd2 = [0.2, 0.1, 0.0, -0.1]
svd3 = [0.9, 0.8, 0.7, 0.6]
db.store_svd_vector("2024-Q1", "motion", "1", svd1)
db.store_svd_vector("2024-Q1", "motion", "2", svd2)
db.store_svd_vector("2024-Q1", "motion", "3", svd3)
# Insert text embeddings for motions 1 and 2 (16 dims)
text1 = [float(i) / 100.0 for i in range(16)]
text2 = [float(i) / 50.0 for i in range(16)]
conn = duckdb.connect(db_path)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
(1, "text-model-1", json.dumps(text1)),
)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
(2, "text-model-1", json.dumps(text2)),
)
conn.close()
# Import fuse function here to ensure module available
from pipeline.fusion import fuse_for_window
result = fuse_for_window("2024-Q1", db_path=db_path)
assert result["inserted"] == 2
assert result["skipped_missing_text"] == 1
# Verify fused embeddings stored
conn = duckdb.connect(db_path)
rows = conn.execute(
"SELECT motion_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE window_id = ?",
("2024-Q1",),
).fetchall()
conn.close()
# Expect two rows for motions 1 and 2
assert len(rows) == 2
for motion_id, vector_json, svd_dims, text_dims in rows:
vec = json.loads(vector_json)
assert svd_dims == 4
assert text_dims == 16
assert len(vec) == 20

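The assertions above imply that fusion is a plain concatenation: for each motion-level SVD vector in the window, look up the motion's text embedding, append the two lists (here 4 + 16 = 20 dimensions), and record both lengths; motions without a text embedding are counted as skipped. A minimal sketch of that join-and-concatenate step; any weighting or normalisation the real `fuse_for_window` applies is not captured here:

```python
import json
import duckdb


def fuse_for_window(window_id: str, db_path: str) -> dict:
    """Concatenate per-motion SVD vectors with their text embeddings for one window."""
    conn = duckdb.connect(db_path)
    inserted, skipped_missing_text = 0, 0
    try:
        svd_rows = conn.execute(
            "SELECT entity_id, vector FROM svd_vectors "
            "WHERE window_id = ? AND entity_type = 'motion'",
            (window_id,),
        ).fetchall()
        for entity_id, svd_json in svd_rows:
            text_row = conn.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ? ORDER BY created_at DESC LIMIT 1",
                (int(entity_id),),
            ).fetchone()
            if text_row is None:
                skipped_missing_text += 1
                continue
            svd_vec = json.loads(svd_json)
            text_vec = json.loads(text_row[0])
            fused = svd_vec + text_vec  # simple concatenation: svd_dims + text_dims
            conn.execute(
                "INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) "
                "VALUES (?, ?, ?, ?, ?)",
                (int(entity_id), window_id, json.dumps(fused), len(svd_vec), len(text_vec)),
            )
            inserted += 1
    finally:
        conn.close()
    return {"inserted": inserted, "skipped_missing_text": skipped_missing_text}
```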
@ -0,0 +1,31 @@
import pytest
def test_embeddings_migration_creates_table(tmp_path):
try:
import duckdb
except ImportError:
pytest.skip("duckdb is not installed")
db_file = str(tmp_path / "migrations_test.db")
conn = duckdb.connect(database=db_file)
try:
with open("migrations/2026-03-19-add-embeddings.sql", "r") as fh:
sql = fh.read()
conn.execute(sql)
# Use sequence to set id if present, otherwise provide explicit id
try:
next_id = conn.execute("SELECT nextval('embeddings_id_seq')").fetchone()[0]
except Exception:
next_id = 1
conn.execute(
"INSERT INTO embeddings (id, motion_id, model, vector) VALUES (?, ?, ?, ?)",
(next_id, 1, "m1", "[0.1, 0.2]"),
)
res = conn.execute(
"SELECT motion_id, model FROM embeddings WHERE motion_id = 1"
).fetchall()
assert len(res) == 1
assert res[0][1] == "m1"
finally:
conn.close()

@ -0,0 +1,219 @@
from pathlib import Path
try:
import duckdb
DB_BACKEND = "duckdb"
except Exception:
import sqlite3
DB_BACKEND = "sqlite3"
MIGRATIONS = [
(
"migrations/2026_03_21__create_mp_votes.sql",
"mp_votes",
[
"id",
"motion_id",
"mp_name",
"party",
"vote",
"date",
"created_at",
],
),
(
"migrations/2026_03_21__create_mp_metadata.sql",
"mp_metadata",
[
"mp_name",
"party",
"van",
"tot_en_met",
"persoon_id",
],
),
(
"migrations/2026_03_21__create_svd_vectors.sql",
"svd_vectors",
[
"id",
"window_id",
"entity_type",
"entity_id",
"vector",
"model",
"created_at",
],
),
(
"migrations/2026_03_21__create_fused_embeddings.sql",
"fused_embeddings",
[
"id",
"motion_id",
"window_id",
"vector",
"svd_dims",
"text_dims",
"created_at",
],
),
]
def test_run_migrations_and_tables(tmp_path):
db_path = tmp_path / "test.db"
if DB_BACKEND == "duckdb":
conn = duckdb.connect(str(db_path))
else:
conn = sqlite3.connect(str(db_path))
for sql_path, table_name, expected_cols in MIGRATIONS:
p = Path(sql_path)
assert p.exists(), f"Migration file {sql_path} must exist"
sql = p.read_text()
# If using sqlite3, transform SQL to be sqlite compatible
if DB_BACKEND == "sqlite3":
# remove CREATE SEQUENCE lines
lines = [
l
for l in sql.splitlines()
if not l.strip().upper().startswith("CREATE SEQUENCE")
]
sql2 = "\n".join(lines)
# remove DEFAULT nextval(...) occurrences
import re
sql2 = re.sub(
r"DEFAULT\s+nextval\('[^']+'\)", "", sql2, flags=re.IGNORECASE
)
# replace JSON type with TEXT
sql2 = re.sub(r"\bJSON\b", "TEXT", sql2, flags=re.IGNORECASE)
# execute as script (multiple statements)
conn.executescript(sql2)
else:
# execute migration SQL
conn.execute(sql)
# check columns via pragma
if DB_BACKEND == "duckdb":
rows = conn.execute(f"PRAGMA table_info('{table_name}')").fetchall()
col_names = [r[1] for r in rows]
else:
cur = conn.execute(f"PRAGMA table_info('{table_name}')")
rows = cur.fetchall()
col_names = [r[1] for r in rows]
for col in expected_cols:
assert col in col_names, (
f"Column {col} missing in table {table_name}, got {col_names}"
)
# perform a simple insert + select to validate basic round-trip
if table_name == "mp_votes":
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO mp_votes (motion_id, mp_name, party, vote, date) VALUES (1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
)
res = conn.execute(
"SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE motion_id=1"
).fetchone()
# DuckDB returns datetime.date for DATE columns; normalise to string
assert (
res[:4] == (1, "Jane Doe", "PartyX", "Yea")
and str(res[4]) == "2026-03-21"
)
else:
# sqlite: id has no default after transformation, provide id explicitly
conn.execute(
"INSERT INTO mp_votes (id, motion_id, mp_name, party, vote, date) VALUES (1, 1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
)
res = conn.execute(
"SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE id=1"
).fetchone()
assert res == (1, "Jane Doe", "PartyX", "Yea", "2026-03-21")
elif table_name == "mp_metadata":
conn.execute(
"INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id) VALUES ('Jane Doe', 'PartyX', '2020-01-01', '2024-12-31', 'pid-123')"
)
res = conn.execute(
"SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata WHERE mp_name='Jane Doe'"
).fetchone()
# DuckDB returns datetime.date for DATE columns; normalise to string
assert (
res[0] == "Jane Doe"
and res[1] == "PartyX"
and str(res[2]) == "2020-01-01"
and str(res[3]) == "2024-12-31"
and res[4] == "pid-123"
)
elif table_name == "svd_vectors":
# JSON value as text
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model) VALUES ('w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
)
res = conn.execute(
"SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE window_id='w1'"
).fetchone()
# Note: DuckDB may return the JSON column as string; compare string form
assert (
res[0] == "w1"
and res[1] == "typeA"
and res[2] == "e1"
and (str(res[3]) == "[1,2,3]" or res[3] == "[1,2,3]")
and res[4] == "m1"
)
else:
# sqlite: provide id explicitly
conn.execute(
"INSERT INTO svd_vectors (id, window_id, entity_type, entity_id, vector, model) VALUES (1, 'w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
)
res = conn.execute(
"SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE id=1"
).fetchone()
assert (
res[0] == "w1"
and res[1] == "typeA"
and res[2] == "e1"
and str(res[3]) == "[1,2,3]"
and res[4] == "m1"
)
elif table_name == "fused_embeddings":
if DB_BACKEND == "duckdb":
conn.execute(
"INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) VALUES (2, 'w2', '[0.1,0.2]', 16, 128)"
)
res = conn.execute(
"SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE motion_id=2"
).fetchone()
assert (
res[0] == 2
and res[1] == "w2"
and (str(res[2]) == "[0.1,0.2]" or res[2] == "[0.1,0.2]")
and res[3] == 16
and res[4] == 128
)
else:
conn.execute(
"INSERT INTO fused_embeddings (id, motion_id, window_id, vector, svd_dims, text_dims) VALUES (1, 2, 'w2', '[0.1,0.2]', 16, 128)"
)
res = conn.execute(
"SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE id=1"
).fetchone()
assert (
res[0] == 2
and res[1] == "w2"
and str(res[2]) == "[0.1,0.2]"
and res[3] == 16
and res[4] == 128
)
conn.close()

@ -0,0 +1,5 @@
def test_scientific_deps_present():
content = open("pyproject.toml").read()
assert "scipy" in content
assert "umap-learn" in content
assert "plotly" in content

@ -0,0 +1,63 @@
import numpy as np
from database import db as motion_db
from pipeline.svd_pipeline import (
_safe_k,
_build_vote_matrix,
_procrustes_align,
run_svd_for_window,
)
def test_safe_k_and_build_and_run(tmp_path):
np.random.seed(0)
# reset DB file for test
db_path = tmp_path / "test.db"
# point the MotionDatabase to this test DB
motion_db.db_path = str(db_path)
motion_db._init_database()
# Create synthetic dataset: 5 MPs x 6 motions
mps = [f"MP_{i}" for i in range(5)]
motions = list(range(100, 106))
dates = ["2020-01-0" + str(i + 1) for i in range(6)]
votes = ["Voor", "Tegen", "Geen stem"]
# insert votes: fill full matrix using MotionDatabase helper
for j, motion_id in enumerate(motions):
for i, mp in enumerate(mps):
vote = votes[(i + j) % len(votes)]
motion_db.insert_mp_vote(motion_id, mp, vote, date=dates[j])
mat, mp_names, motion_ids = _build_vote_matrix(
motion_db, "2020-01-01", "2020-01-10"
)
assert mat.shape == (5, 6)
# _safe_k: with k=10 -> min_dim=5 -> returns 4
assert _safe_k(mat, 10) == 4
assert _safe_k(mat, 3) == 3
# run_svd_for_window with k=10 -> should use k_used=4
res = run_svd_for_window(motion_db, "w1", "2020-01-01", "2020-01-10", k=10)
assert res["k_used"] == 4
assert res["stored_mp"] == 5
assert res["stored_motion"] == 6
def test_procrustes_align():
np.random.seed(0)
# create reference anchors and current anchors rotated + noise
ref = np.random.randn(10, 3)
# create orthogonal rotation
Q, _ = np.linalg.qr(np.random.randn(3, 3))
cur = ref.dot(Q) + 0.1 * np.random.randn(10, 3)
before = np.linalg.norm(cur - ref)
transformed = _procrustes_align(ref, cur)
after = np.linalg.norm(transformed - ref)
assert after < before

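These tests constrain two helpers: `_safe_k` clamps the requested rank to at most `min(matrix.shape) - 1` (a 5×6 matrix with k=10 yields 4), and `_procrustes_align` must bring the current window's anchor coordinates strictly closer to the reference anchors. A compact sketch of both, assuming scipy's `orthogonal_procrustes` provides the rotation; note that its second return value is the sum of singular values, so a least-squares scale factor has to be derived from it rather than applied directly:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def _safe_k(mat: np.ndarray, k: int) -> int:
    # Truncated SVD needs k strictly below the smaller matrix dimension.
    return min(k, min(mat.shape) - 1)


def _procrustes_align(ref: np.ndarray, cur: np.ndarray) -> np.ndarray:
    """Rotate (and uniformly scale) `cur` so it best matches `ref` in least squares."""
    # R minimises ||cur @ R - ref||_F; `sca` is the sum of singular values of
    # cur.T @ ref, so the optimal uniform scale is sca / ||cur||_F**2, not sca itself.
    R, sca = orthogonal_procrustes(cur, ref)
    scale = sca / (np.linalg.norm(cur) ** 2)
    return scale * cur @ R
```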
@ -0,0 +1,80 @@
import pytest
# duckdb is an optional dependency in some environments; skip test if not available
duckdb = pytest.importorskip("duckdb")
from database import MotionDatabase
def test_ensure_text_embeddings_monkeypatch(tmp_path, monkeypatch):
# prepare temp db
db_path = str(tmp_path / "motions.db")
db = MotionDatabase(db_path)
# create embeddings table (migration would normally do this)
conn = duckdb.connect(db.db_path)
# create embeddings table with a sequence-backed id (DuckDB)
conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
conn.execute(
"CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
)
# insert three motions
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t1", "d1", "u1", "ex1"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t2", "d2", "u2", "ex2"),
)
conn.execute(
"INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
("t3", "d3", "u3", "ex3"),
)
# fetch ids
rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
ids = [r[0] for r in rows]
# insert existing embedding for first motion
import json as _json
vec = _json.dumps([0.1] * 16)
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
(ids[0], "test-model", vec),
)
conn.close()
# monkeypatch ai_provider.get_embedding
def fake_get_embedding(text, model=None):
return [0.1] * 16
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings
stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
db_path=db_path, model="test-model"
)
assert stored == 2
assert skipped_existing == 1
assert skipped_no_text == 0
assert errors == 0
# verify stored vectors length
conn = duckdb.connect(db.db_path)
rows = conn.execute(
"SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
("test-model",),
).fetchall()
conn.close()
assert len(rows) == 3
for r in rows:
v = _json.loads(r[0])
assert len(v) == 16

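This test nails down the shape of `ensure_text_embeddings`: it returns a `(stored, skipped_existing, skipped_no_text, errors)` tuple, skips motions that already have a vector for the given model, and resolves embeddings through the `ai_provider` module attribute, which is why the monkeypatch on `ai_provider.get_embedding` takes effect. A sketch along those lines; exactly which motion fields feed the embedding text is an assumption here:

```python
import json
import duckdb
import ai_provider


def ensure_text_embeddings(db_path: str, model: str):
    """Embed every motion that has text but no embedding for `model` yet."""
    conn = duckdb.connect(db_path)
    stored = skipped_existing = skipped_no_text = errors = 0
    try:
        motions = conn.execute(
            "SELECT id, title, description, layman_explanation FROM motions"
        ).fetchall()
        for motion_id, title, description, layman in motions:
            existing = conn.execute(
                "SELECT count(*) FROM embeddings WHERE motion_id = ? AND model = ?",
                (motion_id, model),
            ).fetchone()[0]
            if existing:
                skipped_existing += 1
                continue
            text = " ".join(part for part in (title, description, layman) if part).strip()
            if not text:
                skipped_no_text += 1
                continue
            try:
                # Looked up on the module so tests can monkeypatch ai_provider.get_embedding.
                vector = ai_provider.get_embedding(text, model=model)
            except Exception:
                errors += 1
                continue
            conn.execute(
                "INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
                (motion_id, model, json.dumps(vector)),
            )
            stored += 1
    finally:
        conn.close()
    return stored, skipped_existing, skipped_no_text, errors
```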
@ -0,0 +1,22 @@
import json
from src.types.motion_types import SimilarityNeighbor, to_json, from_json
def test_similarity_neighbor_json_roundtrip():
neighbors = [
SimilarityNeighbor(motion_id="m1", score=0.9),
SimilarityNeighbor(motion_id="m2", score=0.75),
]
# Serialize to JSON string
json_str = to_json(neighbors)
assert isinstance(json_str, str)
# Ensure it's valid JSON
parsed = json.loads(json_str)
assert isinstance(parsed, list)
# Deserialize back to objects
recovered = from_json(json_str)
assert recovered == neighbors

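The round-trip above needs little more than a dataclass and two helpers; a minimal sketch of what `src/types/motion_types.py` could contain to satisfy it (field names come from the test, the `MotionId` and `Embedding` aliases come from the plan's Task 1.3, everything else is an assumption):

```python
import json
from dataclasses import dataclass, asdict
from typing import List

MotionId = str            # type alias used across similarity code
Embedding = List[float]   # a raw embedding vector


@dataclass
class SimilarityNeighbor:
    motion_id: MotionId
    score: float


def to_json(neighbors: List[SimilarityNeighbor]) -> str:
    # Serialise neighbours as a plain JSON array of objects.
    return json.dumps([asdict(n) for n in neighbors])


def from_json(payload: str) -> List[SimilarityNeighbor]:
    # Dataclass equality makes the round-trip assertion in the test hold.
    return [SimilarityNeighbor(**item) for item in json.loads(payload)]
```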
@ -0,0 +1,66 @@
"""
Test helper fixtures for database migrations.
Provides a pytest fixture `test_db` that inspects the environment variable
`TEST_DB_URL` to decide what to yield:
- If `TEST_DB_URL` is not set, the fixture yields None. This allows tests to
be skipped or operate in a no-database mode in CI or local runs where a
test database is not available.
- If `TEST_DB_URL` is set and starts with "sqlite", an sqlite3 connection is
created via `sqlite3.connect` and yielded. The connection is closed after
the test completes.
Decision: keep this fixture lightweight and focused on sqlite for local
smoke-testing. If other database backends are needed later, expand this
fixture accordingly.
"""
from typing import Optional
import os
import sqlite3
import pytest
@pytest.fixture
def test_db():
"""Yield a test database connection or None.
Behavior:
- If TEST_DB_URL is not set in the environment, yield None.
- If TEST_DB_URL is set and begins with 'sqlite', open an sqlite3
connection and yield it. The connection will be closed when the test
finishes.
"""
url = os.environ.get("TEST_DB_URL")
if not url:
yield None
return
# Only support sqlite URLs in this lightweight fixture.
if url.startswith("sqlite"):
# For sqlite URLs, accept either a bare file path or a sqlite:// /
# sqlite:/// style URL. sqlite3.connect expects a plain file path, so
# strip the URL prefix before connecting.
path = url
if path.startswith("sqlite:///"):
# sqlite:///path => /path
path = path[len("sqlite:///") :]
elif path.startswith("sqlite://"):
path = path[len("sqlite://") :]
conn = sqlite3.connect(path)
try:
yield conn
finally:
try:
conn.close()
except Exception:
# Best-effort close; tests shouldn't fail on close errors.
pass
return
# Unknown or unsupported TEST_DB_URL scheme — yield None to keep tests
# tolerant in environments where the fixture can't create a connection.
yield None

@ -0,0 +1,50 @@
# Session: stemwijzer
Updated: 2026-03-20T00:23:33Z
## Goal
Preserve the minimal session state required to resume work on the stemwijzer project after context clears (success = ledger exists and is kept up-to-date).
## Constraints
- Keep the ledger CONCISE — only essential information
- Focus on WHAT and WHY, not HOW
- Mark uncertain information as UNCONFIRMED
- Include git branch and key file paths
## Progress
### Done
- [x] Create initial continuity ledger file
### In Progress
- [ ] Capture ongoing session context and update ledger after each meaningful change
### Blocked
- None currently
## Key Decisions
- **Session name = "stemwijzer"**: Chosen from repository context (UNCONFIRMED if a different canonical session name is preferred).
- **Do not auto-commit ledger changes**: Commits will only be made when the user explicitly requests it (follows Git Safety Protocol).
## Next Steps
1. Continue updating this ledger when tasks, files, or decisions change
2. Add entries for new branches or major feature work (mark as UNCONFIRMED when unsure)
3. Ask user before creating any git commits that include this ledger
## File Operations
### Read
- `README.md`
- `pyproject.toml`
- `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`
- `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`
### Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (new)
## Critical Context
- Repository branch observed: `main`
- Found project metadata in `pyproject.toml` indicating Python tooling preference
- Existing notes/plans located under `thoughts/shared/` (plans and designs from 2026-03-19)
- No existing continuity ledger was found prior to this creation
## Working Set
- Branch: `main`
- Key files: `README.md`, `pyproject.toml`, `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`, `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`

@ -0,0 +1,98 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB design"
status: draft
---
## Problem Statement
We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.
**Goal:**
- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).
## Constraints
- Work must be incremental and non-disruptive: keep existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.
## Approach (chosen)
I'll introduce two small layers:
- **ai_provider**: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired.
- **query_dal**: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.
This gives the benefits of abstraction and pythonic query composition while keeping risk low.
## Architecture
High level components (repo root):
- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generate layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)
Additions:
- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)
## Components and responsibilities
- **ai_provider**: choose provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- **database (existing)**: add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk. A sketch of the cosine scan follows this list.
- **query_dal**: use ibis for read queries used by Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- **summarizer**: call ai_provider.chat_completion to get summary; update motions.layman_explanation; optionally compute embedding via ai_provider.get_embedding and store via database.store_embedding.
- **app.py**: replace direct duckdb selects with query_dal functions.
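For MVP data sizes a plain in-Python scan is enough; a sketch of the intended shape, assuming embeddings come back from DuckDB as `(motion_id, vector_json)` rows (names here are illustrative, not the final database.py API):

```python
import json
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_similar(query_vector: list[float], rows: list[tuple[int, str]], top_n: int = 10):
    """Naive in-Python scan over (motion_id, vector_json) rows; fine for MVP sizes."""
    scored = [(motion_id, _cosine(query_vector, json.loads(vec))) for motion_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```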
## Data Flow
1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to motions table. Optionally computes embedding and writes to embeddings table.
3. Query: Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
4. Semantic search (future): query_dal or app can call database.search_similar by providing an embedding computed with ai_provider.get_embedding.
## Error Handling
- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.
## Testing Strategy
- Unit tests for ai_provider using mocks for HTTP/openai responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert DB writes happen.
## Open Questions
- Store embeddings inside motions table vs separate embeddings table? Recommendation: separate embeddings table for clarity and easier upserts.
- Do we want to prefer other providers (Copilot) automatically? This repo currently references OPENROUTER. If user wants Copilot preference, we can add env vars and selection logic later.
## Next steps (short)
1. Add ai_provider.py (adapter) and tests.
2. Add embeddings table and store/search helpers in database.py and tests.
3. Add query_dal.py with ibis reads and tests.
4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
5. Update Streamlit app read paths to use query_dal.
6. Fix housekeeping bugs: reset.py references reset_database(), scraper uses undefined SCRAPING_DELAY — address these small fixes in a separate patch.
I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.

@ -0,0 +1,116 @@
---
date: 2026-03-21
topic: "Reuse motions as a guided policy explorer"
status: draft
---
## Problem Statement
We want to repurpose existing "motions" data so it becomes a lightweight, discovery-driven way for users to explore policy positions and discover related content. This is not a full proposal system; it's a guided exploration and bookmarking flow that leverages our existing ingestion, summarization, embeddings, and session voting work.
**Why now:** We already ingest motions, generate layman explanations, compute embeddings, and store per-session votes. Reusing those building blocks gives high user value with modest effort.
## Constraints
**Non-negotiables and technical limits:**
- Use the existing database schema where possible (motions table, embeddings table, user_sessions). Do not require a new external vector DB for MVP.
- Keep the Streamlit UI model (app.py) and session-based votes intact for the initial rollout.
- Avoid breaking migrations: rely on existing migrations and add new ones when necessary (no forced drops).
- Respect current error-handling posture: network calls can fail; system must degrade gracefully.
## Chosen Approach
I'm choosing a "Guided Policy Explorer" approach because it reuses the highest-value existing pieces (summaries, embeddings, session voting) and delivers a clear UX that fits the current codebase. This gives immediate product value with low risk.
**Core idea:** present curated short sessions and motion detail pages that combine the existing layman explanation, party-match results, and semantic "related motions" powered by stored embeddings.
Alternatives considered:
- "Motion-as-Proposal platform": full lifecycle (draft → comment → vote). Rejected for MVP due to high complexity and data model changes.
- "Motion Digest / Research Assistant": read-only pages and newsletters. Lower effort, but less interactive and reuses fewer of our current session features.
## Architecture
High-level view (existing pieces in bold):
- Ingest: **api_client.py** + **scraper.py** gather motions and create motion records in the DB.
- Persist: **database.py** stores motions, embeddings, and user_sessions.
- Enrichment: **summarizer.py** + **ai_provider.py** generate layman explanations and embeddings.
- Background jobs: **scheduler.py** runs ingest, summarization, and periodic clustering.
- UI: **app.py** current Streamlit session flow — extend with "Explore" and "Motion detail" pages.
- New: small **clusterer / similarity API** to compute and cache related-motion lists per motion.
## Key Components & Responsibilities
- Motion Ingest (existing): keep ingest as-is; add metadata flags (e.g., curated, candidate).
- Motion Store (existing): motions table + embeddings table; add an **events/audit** table for user actions and important state transitions.
- Summarizer / Embedding Worker (existing): scheduled job that ensures motions have layman_explanation and embeddings; add retry/backoff and logging.
- Similarity service (new): computes nearest neighbors using stored vectors in-process for MVP and caches results in a small table. Swap to a vector index later if needed.
- Session & Voting (existing): continue using user_sessions JSON blob for individual sessions; add optional event log entries for each vote.
- UI (update): add "Explore" landing, motion detail view with layman text, party-match snapshot, related motions, and bookmark/flag actions. Reuse Streamlit components.
- Admin tooling (new): migration scripts, a CLI to recompute embeddings/similarity, and an audit query helper.
## Data Flow
1. Ingest job (api_client/scraper) produces motion records and calls db.insert_motion.
2. Summarizer worker picks up motions without layman_explanation or embeddings, calls ai_provider, and writes layman_explanation + embeddings.
3. Clusterer/similarity job computes related-motion lists using stored embeddings and writes them to a cache table.
4. UI "Explore" shows curated motion lists; "Motion detail" reads motion, layman_explanation, party-match snapshot, and cached related motions.
5. User vote actions update user_sessions and also append an event to the audit table for traceability.
6. Background analytics (optional) reuses user_events and embeddings for offline insights.
## Error Handling Strategy
- External calls: add retries with exponential backoff for AI provider and external APIs. Failures set a marker (e.g., summary_missing) and the system continues.
- Missing embeddings: UI gracefully disables "related motions" and offers "compute on demand".
- Idempotency: make insert_motion idempotent by URL/external id check at DB layer; use optimistic handling for duplicates.
- Concurrency: avoid read-modify-write races by writing user events (append-only) and deriving session state from events when race-prone updates are detected.
- Observability: replace prints with structured logging (module-level logger) and add basic metrics for worker errors, API failures, and queue lags.
## Testing Strategy
- Unit tests: DB helpers (insert_motion, store_embedding, similarity cache), summarizer functions (mock ai_provider), and session vote logic.
- Migration tests: follow the existing pattern of applying migration SQL in a temp DB and asserting schema.
- Integration tests: end-to-end ingest → summarize → embedding → similarity → UI-read path in CI (use monkeypatch for AI calls).
- Load tests: simulate a few thousand embeddings search calls against the in-process search to validate performance assumptions for MVP.
- Acceptance: confirm UX flows: Explore session, Motion detail, Vote -> party match, Related motions populated.
## High-level Plan & Estimates
Assumptions: one full-stack engineer (Python + Streamlit) and one part-time reviewer. All estimates are rough.
Milestone 0 — Validate & quick discovery (1 day)
- Locate the user's added markdown plan and extract exact requirements. (I'm assuming the file exists under thoughts/shared; if not, validate its location by searching.)
Milestone 1 — MVP (8–12 engineer days)
- Add similarity cache table and migration.
- Summarizer: make embedding generation robust with retries and store vectors.
- Clusterer job: compute and cache related motions.
- UI: Explore landing, Motion detail page, related motion UI, bookmark/flag button.
- Add event/audit table and write events on user votes and bookmarks.
Milestone 2 — Hardening & instrumentation (3–5 engineer days)
- Replace prints with structured logging across touched modules.
- Add migration tests and CI integration tests (mock AI).
- Add health metrics & basic alerting for worker failures.
Milestone 3 — Polish & UX feedback (3–5 engineer days)
- UX tweaks, performance tuning, compute on-demand fallback for embeddings, documentation, admin CLI.
Total MVP + polish: ~2–3 weeks of focused work.
## Risks & Mitigations
- Risk: Naive in-process embedding search will not scale. Mitigation: cache nearest neighbors per motion and plan a migration path to a vector index.
- Risk: AI provider flakiness. Mitigation: retries, timeouts, and clear UI fallback. Tests must mock provider in CI.
- Risk: Race conditions on session votes. Mitigation: append-only event log and derive authoritative session view from events when needed.
- Risk: Schema drift and missing migrations. Mitigation: add migration tests and document required migrations in repo.
## Open Questions
- Which exact user journeys do we want first (single-session discover vs. persistent account/bookmarking)?
- Do we want bookmarks persisted globally or per-session only? (Privacy implications.)
- What's acceptable latency for "related motions" — precomputed nightly vs. near-real-time?
- Any policy/legal ban on storing full body_text or on long-term retention of user votes?
---
I'm proceeding to create the design doc file at thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md and will spawn the implementation planner next. Interrupt if you want changes to the approach or scope now.

@ -0,0 +1,335 @@
# Guided Policy Explorer — Implementation Plan
**Goal:** Implement the Guided Policy Explorer MVP that reuses existing motions, layman summaries, embeddings and session votes to provide an Explore landing, Motion detail view, cached related motions (similarity cache), and accompanying background jobs and admin tooling.
Design: thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md
---
## Dependency Graph
```
Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4, 1.5 [foundation - migrations, types, migration-tests]
Batch 2 (parallel): 2.1, 2.2, 2.3, 2.4 [core - similarity service, cache repo, audit repo, embeddings worker]
Batch 3 (parallel): 3.1, 3.2, 3.3, 3.4 [components - clusterer worker, CLI, API, Streamlit page]
Batch 4 (parallel): 4.1 [integration tests & docs - depends on 2.x & 3.x]
```
---
## Notes on planning choices
- Design requires a similarity cache and a small in-process nearest-neighbor search for MVP. I'm implementing this as: store precomputed top-N neighbor lists (IDs + scores) in a small SQL table and compute neighbors by scanning embeddings in-memory per batch job. Reason: avoids external vector DB and keeps implementation simple and testable.
- Design requires robust embedding generation. I'll implement exponential-backoff retry logic with a configurable retry count and timeouts in embeddings_worker; tests will monkeypatch the ai_provider to simulate failures.
- Migration tests: the design calls for migration tests, but migration SQL content is omitted per instructions. Tests will assert that migration files exist and follow the naming conventions, and will skip applying SQL unless a TEST_DB_URL env var is provided. This keeps CI safe while still giving test coverage and a path for developer verification.
---
## Batch 1: Foundation (parallel - 5 implementers)
All tasks in this batch have NO dependencies and run simultaneously.
### Task 1.1: Add similarity cache migration (placeholder)
**Title:** Migration: add similarity_cache table
**Description:** Add a migration file to create a similarity cache table that stores precomputed related-motion lists per motion (motion_id, neighbors_json, computed_at). SQL content intentionally left out per instructions; file is a placeholder that CI/tests will detect.
**Files:**
- migrations/2026-03-22-add-similarity-cache.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_similarity_cache.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- Migration file exists at migrations/2026-03-22-add-similarity-cache.sql
- test_migration file runs and passes in default mode (it will only check filename & header). If TEST_DB_URL is set in env, test will attempt to run the SQL and must not error (SQL may be empty; test expects a no-op or valid SQL). Test is marked to skip DB application when TEST_DB_URL is unset.
---
### Task 1.2: Add audit/events migration (placeholder)
**Title:** Migration: add audit_events table
**Description:** Add a migration placeholder to create an audit/events table for append-only user events (vote, bookmark, flag). Actual SQL omitted.
**Files:**
- migrations/2026-03-22-add-audit-events.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_audit_events.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- migrations/2026-03-22-add-audit-events.sql exists
- migration test verifies filename and is safe to run in CI (skips DB apply unless TEST_DB_URL provided).
---
### Task 1.3: Shared types for motions & similarity entries
**Title:** Types: motion and similarity types
**Description:** Add a small types module that centralizes typed dataclasses/interfaces used by similarity and cache modules (MotionId, Embedding vector typed alias, SimilarityNeighbor). This reduces coupling and makes tests easier to write.
**Files:**
- src/types/motion_types.py
**Tests:**
- tests/types/test_motion_types.py
**Estimated:** 1.5h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- src/types/motion_types.py defines MotionId, Embedding, SimilarityNeighbor types and basic helpers (e.g., serialize/deserialize neighbors). Tests validate JSON round-trip of neighbors.
---
### Task 1.4: CI migration test helper
**Title:** Test helper: migration test utils
**Description:** Add a small test helper that other migration tests can use. It provides a pytest fixture that reads TEST_DB_URL and yields a DB connection or None and marks tests appropriately.
**Files:**
- tests/utils/migration_fixtures.py
**Tests:**
- tests/migrations/test_migration_fixtures_smoke.py
**Estimated:** 1.0h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- migration_fixtures.py provides `test_db` fixture. The smoke test asserts fixture yields None when TEST_DB_URL unset and yields a connection-like object when set.
---
### Task 1.5: Add README admin docs for recomputing
**Title:** Docs: admin CLI usage and migration notes
**Description:** Add a short markdown doc describing the admin CLI, migration filenames, and how to run recompute/clusterer jobs locally for dev.
**Files:**
- docs/admin/recompute_similarity.md
**Tests:** none (doc only)
**Estimated:** 0.5h
**Priority:** low
**Depends:** none
**Acceptance criteria:**
- docs/admin/recompute_similarity.md exists and documents commands and env vars: TEST_DB_URL, AI_PROVIDER_MOCK, SIMILARITY_TOP_N.
---
## Batch 2: Core Modules (parallel - 4 implementers)
Depends: Batch 1
### Task 2.1: Similarity service (in-process search + utility)
**Title:** Similarity service implementation
**Description:** New service that, given motion embeddings, computes cosine similarity and returns top-N neighbors. Also exposes a convenience function to compute neighbors for one motion and return a list of (motion_id, score). This is pure Python and testable in-memory; a sketch follows the acceptance criteria below.
**Files:**
- src/services/similarity_service.py
**Tests:**
- tests/services/test_similarity_service.py
**Estimated:** 5.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- similarity_service.py exposes compute_neighbors(embedding: list[float], all_embeddings: Dict[motion_id, embedding], top_n: int) -> List[SimilarityNeighbor]
- Unit tests cover exact small matrices and edge cases (empty, identical embeddings). All tests pass with `pytest tests/services/test_similarity_service.py`.
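A sketch matching the acceptance-criteria signature, reusing the SimilarityNeighbor type from Task 1.3; cosine similarity over a plain dict, with the understanding that the real module may swap in numpy or add tie-breaking rules:

```python
import math
from typing import Dict, List

from src.types.motion_types import SimilarityNeighbor


def compute_neighbors(
    embedding: List[float],
    all_embeddings: Dict[str, List[float]],
    top_n: int,
) -> List[SimilarityNeighbor]:
    """Return the top_n motions most cosine-similar to `embedding`."""

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [
        SimilarityNeighbor(motion_id=mid, score=cosine(embedding, vec))
        for mid, vec in all_embeddings.items()
    ]
    scored.sort(key=lambda n: n.score, reverse=True)
    return scored[:top_n]
```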
---
### Task 2.2: DB repo for similarity cache
**Title:** Repo: similarity_cache read/write
**Description:** Provide a small repository abstraction that reads and writes cached neighbor lists to the DB (serialize neighbors as JSON). Keep DB interactions minimal and testable using sqlite in-memory.
**Files:**
- src/db/similarity_cache_repo.py
**Tests:**
- tests/db/test_similarity_cache_repo.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.1, 1.3
**Acceptance criteria:**
- similarity_cache_repo provides functions: get_cached_neighbors(motion_id) -> Optional[List[SimilarityNeighbor]] and upsert_cached_neighbors(motion_id, neighbors, computed_at)
- Unit tests run against sqlite in-memory and assert correct serialization/deserialization.
---
### Task 2.3: Audit/events repository
**Title:** Repo: audit_events append-only writer
**Description:** Small repo to append audit events (user_id, session_id, motion_id, event_type, payload JSON, created_at). Provides an append_event function used by UI and session logic.
**Files:**
- src/db/audit_repo.py
**Tests:**
- tests/db/test_audit_repo.py
**Estimated:** 3.0h
**Priority:** medium
**Depends:** 1.2
**Acceptance criteria:**
- append_event writes a row to sqlite in-memory in test and read-back verifies fields and created_at presence. Functions are well typed and handle JSON payloads.
---
### Task 2.4: Embeddings worker helper (retries/backoff)
**Title:** Worker: robust embedding generator
**Description:** Add a worker helper that ensures embeddings exist for a motion. It calls ai_provider.get_embedding with retry/backoff and writes embedding via an abstracted DB function (the put function will be dependency-injected in tests). This module contains no long-running loop — it's a single-run helper function used by the scheduler. See the sketch after the acceptance criteria.
**Files:**
- src/ai/embeddings_worker.py
**Tests:**
- tests/ai/test_embeddings_worker.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- embeddings_worker.explain_and_embed(motion_id, text, put_embedding_fn) calls ai_provider and retries on simulated transient errors. Tests monkeypatch ai_provider to simulate 2 failing attempts then success and verify put_embedding_fn called exactly once with a vector-like object.
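A sketch of the retry wrapper described above; the retry count, backoff base, and which exception types count as transient are placeholders to settle during implementation, and the ai_provider import location follows the Assumptions section below:

```python
import time
from typing import Callable, List

from src.ai import ai_provider  # assumed location, per the Assumptions section


def explain_and_embed(
    motion_id: int,
    text: str,
    put_embedding_fn: Callable[[int, List[float]], None],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> None:
    """Fetch an embedding with exponential backoff, then hand it to the injected writer."""
    for attempt in range(1, max_attempts + 1):
        try:
            vector = ai_provider.get_embedding(text)
            break
        except Exception:
            # Treats every failure as transient for the sketch; re-raise on the last attempt.
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    put_embedding_fn(motion_id, vector)
```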
---
## Batch 3: Components (parallel - 4 implementers)
Depends: Batch 2
### Task 3.1: Clusterer scheduled job
**Title:** Worker: clusterer job that computes & writes caches
**Description:** Background job module that loads all embeddings, computes top-N neighbors for each motion using similarity_service, and writes cache rows via similarity_cache_repo. Designed to be runnable from CLI. It should respect a MAX runtime parameter (process batch size) for safe operation in dev. A sketch of the batch loop follows the acceptance criteria.
**Files:**
- src/workers/clusterer.py
**Tests:**
- tests/workers/test_clusterer.py
**Estimated:** 6.0h
**Priority:** high
**Depends:** 2.1, 2.2, 2.4
**Acceptance criteria:**
- clusterer.run_batch(batch_size, top_n, load_embeddings_fn, upsert_cache_fn) exists and can be unit-tested by injecting small in-memory embeddings and verifying upsert_cache_fn called with expected neighbor lists.
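One way the injected-function contract above could be satisfied; the batching order and the integer return value are assumptions:

```python
from typing import Callable, Dict, List

from src.services.similarity_service import compute_neighbors
from src.types.motion_types import SimilarityNeighbor


def run_batch(
    batch_size: int,
    top_n: int,
    load_embeddings_fn: Callable[[], Dict[str, List[float]]],
    upsert_cache_fn: Callable[[str, List[SimilarityNeighbor]], None],
) -> int:
    """Compute and cache top-N neighbours for up to `batch_size` motions."""
    all_embeddings = load_embeddings_fn()
    processed = 0
    for motion_id, vector in all_embeddings.items():
        if processed >= batch_size:
            break
        # Exclude the motion itself from its own neighbour list.
        others = {mid: vec for mid, vec in all_embeddings.items() if mid != motion_id}
        upsert_cache_fn(motion_id, compute_neighbors(vector, others, top_n))
        processed += 1
    return processed
```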
---
### Task 3.2: Admin CLI: recompute-similarity
**Title:** CLI: recompute similarity & options
**Description:** Small CLI script (click or argparse) to trigger the clusterer job (full-run or limited). CLI accepts --top-n, --batch-size, --dry-run flags. Tests will monkeypatch clusterer.run_batch.
**Files:**
- src/cli/recompute_similarity.py
**Tests:**
- tests/cli/test_recompute_similarity.py
**Estimated:** 2.5h
**Priority:** medium
**Depends:** 3.1
**Acceptance criteria:**
- CLI parses flags and calls clusterer.run_batch with parsed args. tests assert proper arguments passed and dry-run does not call run_batch.
---
### Task 3.3: HTTP API endpoint for compute-on-demand / cached
**Title:** API: similarity endpoint
**Description:** Small Flask/FastAPI/WSGI handler module that returns cached related motions for a motion_id; if cache missing and a query param compute=true, it calls the similarity service to compute neighbors on demand (without persisting) and returns them. Keep the handler framework-agnostic so it can be wired into existing web framework; tests will call the handler function directly.
**Files:**
- src/api/similarity_api.py
**Tests:**
- tests/api/test_similarity_api.py
**Estimated:** 3.5h
**Priority:** medium
**Depends:** 2.1, 2.2
**Acceptance criteria:**
- Handler get_related(motion_id, compute=False, load_embedding_fn, load_all_embeddings_fn, cache_repo) returns cached neighbors when present and computes on demand when compute=True. Tests cover both code paths.
---
### Task 3.4: Streamlit UI: Explore landing & Motion detail module
**Title:** UI: explore page and motion detail component
**Description:** Add a Streamlit helper module providing functions to render the Explore landing and Motion detail sections. Avoid modifying existing app.py in this MVP; instead provide a module that app.py can import. The module will expose pure functions where possible to ease testing; tests will verify behavior by calling functions and mocking DB/AI calls.
**Files:**
- src/ui/explore_page.py
**Tests:**
- tests/ui/test_explore_page.py
**Estimated:** 5.0h
**Priority:** medium
**Depends:** 2.2, 2.3, 2.4
**Acceptance criteria:**
- explore_page.render_explore(session, load_curated_fn, load_cached_neighbors_fn) returns a data structure (not direct Streamlit calls) that app.py can choose to render. Tests assert correct payload for a sample session and that missing embeddings gracefully remove related motions.
---
## Batch 4: Integration & Docs (parallel - 2 implementers)
Depends: Batch 2 & 3
### Task 4.1: Integration test: ingest → summarize → embed → cluster → UI read
**Title:** Integration test for the end-to-end path (mvp)
**Description:** Add an integration pytest that simulates: create 3 synthetic motions, call embeddings_worker (monkeypatched AI provider), run clusterer on the in-memory dataset, and assert similarity cache rows exist and explore_page returns related motions. Use sqlite in-memory and monkeypatch ai_provider to return deterministic vectors.
**Files:**
- tests/integration/test_end_to_end_explore_flow.py
**Tests:**
- (this is the test file)
**Estimated:** 8.0h
**Priority:** high
**Depends:** 1.3, 2.1, 2.2, 2.4, 3.1, 3.4
**Acceptance criteria:**
- Running `pytest tests/integration/test_end_to_end_explore_flow.py` passes locally with no external network calls when AI provider is monkeypatched via monkeypatch fixture. The test asserts that at least one neighbor exists for a motion and the explore_page data includes it.
---
## CI / Test instructions
- Run unit tests: pytest tests/unit (or full suite: pytest)
- Run a single module test: pytest tests/services/test_similarity_service.py::test_compute_neighbors_basic
- Integration tests: pytest tests/integration/test_end_to_end_explore_flow.py
Monkeypatching AI provider in CI/local tests:
- Use the `monkeypatch` pytest fixture to patch `src.ai.ai_provider.get_embedding` and `src.ai.ai_provider.summarize` (if used). Example in tests: monkeypatch.setattr('src.ai.ai_provider.get_embedding', fake_get_embedding)
- CI should set env var AI_PROVIDER_MOCK=1 for additional safety; tests will check this var and use mocks if present.
Temp DB setup for tests:
- Unit tests should use sqlite in-memory ("sqlite:///:memory:") via a `test_db` fixture in tests/utils/migration_fixtures.py.
- Migration tests: If TEST_DB_URL env var is set, the migration tests will attempt to apply SQL to that DB; otherwise they will run in dry-run / skip-apply mode and only validate filename and header.
Example pytest commands:
- pytest -q
- pytest -q tests/services/test_similarity_service.py -k compute_neighbors
Notes for CI pipeline:
- Ensure Python dependencies include pytest, pytest-mock and any DB driver required (sqlite built-in is fine). No external AI keys required — tests must mock AI provider.
---
## 3-Sprint Schedule (2-week sprints)
Sprint 1 (Weeks 1–2) — Milestone 1: MVP foundation + core similarity
- Goals: Add migrations, types, similarity service, similarity cache repo, audit repo, embeddings worker helper
- Tasks: 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4
Sprint 2 (Weeks 3–4) — Milestone 1 continued: background job, CLI, API, UI
- Goals: Implement clusterer job, CLI, similarity API, explore_page UI module; initial integration smoke tests
- Tasks: 3.1, 3.2, 3.3, 3.4, initial lightweight integration test scaffolding
Sprint 3 (Weeks 5–6) — Milestone 2 & 3: hardening, integration tests, docs
- Goals: Full integration tests, migration tests, docs, logging hardening, small UX polish
- Tasks: 4.1, docs improvements from 1.5, logging conversion across modules (follow-up small PRs as needed)
Notes:
- Estimates assume 1 full-stack engineer + 1 reviewer. Sprint 1 is AMA-heavy; reviewer will focus on migrations and core algorithms. Sprint 2 focuses on wiring and UI; reviewer focuses on integration and UX. Sprint 3 finishes tests and polish.
---
## Assumptions
- The repository uses Python 3.10+ and pytest for tests. If different, adjust test fixtures accordingly.
- Existing DB access helpers exist (a simple execute/connection helper). If not, tests use sqlite3 directly and repository code will accept a DB connection/cursor via dependency injection.
- The project already has an ai_provider abstraction at src/ai/ai_provider.py with functions `get_embedding(text) -> list[float]` and `summarize(text) -> str` — tests will monkeypatch these. If the names differ, adapt imports when implementing.
- Streamlit app remains `app.py` and can import src/ui/explore_page.py — I deliberately do not modify app.py in this plan to keep the change set minimal.
- We will store embeddings as arrays in an embeddings table; similarity modules will load them via an injected loader function to keep unit tests pure.
---
## Open Questions / Implementation Clarifications
1. Bookmarks persistence: design left bookmarks as open (session vs. persistent). For MVP we will record bookmark events in the audit_events table (append-only) and treat them as per-session by default. If persistent bookmarks required later, a new table/migration will be added.
2. Which web framework to wire the similarity_api into? The plan keeps handler framework-agnostic; we need guidance whether app uses Flask/FastAPI/Starlette to add the route. Implementer should wire into existing HTTP routing pattern.
3. Embedding storage format: assume float arrays stored as JSON or array type in DB. If project uses a binary blob, adjust serialization in similarity_cache_repo and tests accordingly.
4. Acceptable top-N neighbor size for caches. Default SIMILARITY_TOP_N = 10; CLI and worker accept override. If product wants 50, increase later.
---
## How a single implementer should proceed (step-by-step)
1. Start with Batch 1 tasks 1.1–1.4. Create migrations placeholders and types module. Run migration filename tests.
2. Implement similarity_service (2.1) and its unit tests. This is the critical algorithm that must be rock-solid.
3. Implement similarity_cache_repo (2.2) and audit_repo (2.3) using sqlite in-memory for tests. Run unit tests.
4. Implement embeddings_worker helper (2.4) and add tests that mock ai_provider. Ensure CI will not call real AI.
5. Implement clusterer (3.1) and test with in-memory data by injecting loader/upsert functions.
6. Add admin CLI (3.2) to run clusterer; add small doc (1.5) describing how to run it locally.
7. Implement API handler (3.3) and UI helper (3.4). Tests should mock DB and AI as needed.
8. Finish with integration test (4.1) to stitch the pieces together. Iterate on bug fixes and reviewer feedback.
---
## Acceptance criteria for the feature (MVP)
- Explore landing exists and can present curated motions (using existing curated flag). Data payload returned by explore_page includes motion metadata and layman_explanation.
- Motion detail returns layman_explanation, party-match snapshot (existing), and related motions computed from cached neighbor lists when available.
- Background clusterer job can recompute cached neighbor lists and the CLI can trigger it.
- Tests cover core algorithm (similarity computation), cache repo serialization, embedders (mocked), and at least one end-to-end smoke integration test.
---
If anything in this plan should be narrowed further (for a smaller initial PR) I recommend focusing Sprint 1 + clusterer CLI (Tasks 1.x + 2.x + 3.1 + 3.2) and deferring UI wiring until clusterer and cache are validated.

@ -0,0 +1,106 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB implementation plan"
status: draft
---
## Summary
Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive and well-tested.
## High-level approach (chosen)
- Add **ai_provider**: adapter exposing get_embedding(text) and chat_completion(messages) with retries and ProviderError.
- Add **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan).
- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
- Refactor summarizer to call ai_provider and optionally store embeddings.
- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.
## Micro-tasks (11 tasks)
All tasks are intentionally small (file-level changes + tests). Estimates assume one developer full-time; see Risk and Calendar section below.
Batch 1 (foundation, parallelizable)
1. Add tests fixtures for temporary DuckDB (tests/conftest.py) — 2h — low risk
2. Add migration SQL to create embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
3. Add ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
4. Add scraper SCRAPING_DELAY default (src/scraper.py) + tests — 1h — low risk
5. Fix reset script to run migrations (src/reset.py) + tests — 2h — low risk
Batch 2 (core modules)
6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
8. Refactor summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk
Batch 3 (integration)
9. Add CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk
Batch 4 (docs/config)
11. Add .env.example entries for new env vars — 1h — low risk
## PR order (recommended, small focused PRs)
1. PR A — tests/conftest (fixtures)
2. PR B — migration SQL (embeddings table)
3. PR C — ai_provider + tests
4. PR D — database store/search helpers + tests
5. PR E — query_dal + tests
6. PR F — summarizer refactor + tests
7. PR G — cli_search + tests
8. PR H — app read changes + tests
9. PR I — scraper/reset small fixes + tests
10. PR J — .env.example
## Estimates & schedule (one dev, full-time ~8h/day)
- Total estimated task effort: ~40 hours (~5 days) from the per-task estimates above; with review and buffer → ~7 calendar days.
- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batch 3 (1 day), Buffer/Review (1 day).
## DB migration steps
- Add migrations/2026-03-19-add-embeddings.sql (additive).
- Apply on staging first; backup DB, run migration, verify `SELECT count(*) FROM embeddings`.
- No changes to motions table in first iteration.
## Testing strategy
- Unit tests for ai_provider (mock HTTP responses). Use monkeypatch to avoid network.
- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse JSON fields.
- Summarizer tests mock ai_provider to assert DB writes (summary and optional embedding).
## Error handling
- ai_provider: retry/backoff for transient errors; raise ProviderError for terminal failures.
- Summarizer: non-fatal on AI failures — write fallback/empty summary, log, and surface message in UI when interactive.
- DB functions: keep try/except patterns and ensure connections closed on error.
## Risks & mitigations
- ai_provider changes: medium risk — mitigate with retries, clear ProviderError, and thorough unit tests.
- Embedding search: medium (naive scan performance) — mitigate by keeping implementation simple and planning for ANN/FAISS later.
- ibis usage: medium — mitigate with tests and keep query_dal narrow.
## Next actions (what I'll do now)
- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
- I will NOT start applying code changes automatically. If you want, I can:
- (A) Create the first PR patch (tests/conftest.py + migration) and open a draft for review, or
- (B) Start implementing Task 3 (ai_provider) next.
Interrupt if you want changes to the plan or a different PR ordering. Otherwise tell me which task to start and I'll create the first patch.

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""Query Tweede Kamer OData endpoints to locate motion body text.

This script performs the API calls described in the task and prints
structured information about responses (status code, keys, candidate
fields that may contain text or content URLs).

File: tools/query_tk_api.py
"""
import json
import sys
from urllib.parse import quote

try:
    import requests
except Exception:
    print("missing requests library", file=sys.stderr)
    raise

BASE = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
ZAAK_ID = "e6fd62f1-29be-4955-9811-03d46da2fc3a"
def try_get(path):
    url = BASE.rstrip("/") + "/" + path.lstrip("/")
    print("\nGET", url)
    r = requests.get(url, headers={"Accept": "application/json"})
    print("->", r.status_code, r.headers.get("Content-Type"))
    # try to print JSON keys or text length
    ct = r.headers.get("Content-Type", "")
    if "application/json" in ct or r.text.strip().startswith("{"):
        try:
            j = r.json()
            print("JSON keys:", list(j.keys()))
            # pretty-print limited
            print("JSON preview:", json.dumps(j, indent=2)[:4000])
            return j
        except Exception as e:
            print("failed to parse json:", e)
    else:
        print("text length:", len(r.content))
        print("headers:", dict(r.headers))
        print("first 800 bytes:\n", r.content[:800])
    return None
def main():
    # 1. Zaak expand Document
    tried = []
    patterns = [
        f"Zaak({ZAAK_ID})?$expand=Document",
        f"Zaak(guid'{ZAAK_ID}')?$expand=Document",
        f"Zaak('{ZAAK_ID}')?$expand=Document",
    ]
    zaak_json = None
    for p in patterns:
        tried.append(p)
        zaak_json = try_get(p)
        # stop at the first response whose entity (or first collection entry)
        # exposes a Document navigation property
        if zaak_json:
            hit = zaak_json.get("value") or zaak_json
            if isinstance(hit, list):
                hit = hit[0] if hit else {}
            if isinstance(hit, dict) and "Document" in hit:
                break
    # If top-level 'value' exists (collection), try to find first
    if zaak_json and "value" in zaak_json:
        # If API returned a collection, pick first
        val = zaak_json["value"]
        if isinstance(val, list) and val:
            zaak = val[0]
        else:
            zaak = None
    else:
        zaak = zaak_json

    print("\n--- Zaak object (extracted) ---")
    print(json.dumps(zaak, indent=2)[:4000])

    docs = []
    if zaak:
        # Document may be navigation property 'Document' or 'Documents'
        for key in ("Document", "Documents"):
            if key in zaak:
                val = zaak[key]
                if isinstance(val, list):
                    docs.extend(val)
                elif isinstance(val, dict):
                    docs.append(val)

    print("\nFound", len(docs), "Document entries")
    for i, d in enumerate(docs):
        print("\n--- Document", i, "---")
        print(json.dumps(d, indent=2)[:4000])
    # 2. Try DocumentVersie endpoint
    # We'll attempt: DocumentVersie?$filter=DocumentId eq guid'...'
    for d in docs:
        doc_id = d.get("Id") or d.get("DocumentId") or d.get("IdDocument")
        if not doc_id:
            # maybe OData provided @odata.id
            if "@odata.id" in d:
                # extract id from URI - last segment
                seg = d["@odata.id"].rstrip("/").split("/")[-1]
                doc_id = seg
        if not doc_id:
            continue
        print("\nQuerying DocumentVersie for Document id:", doc_id)
        q1 = f"DocumentVersie?$filter=DocumentId%20eq%20guid'{doc_id}'"
        j = try_get(q1)
        # also try expanding from Document
        q2 = f"Document({quote(doc_id)})?$expand=DocumentVersie"
        j2 = try_get(q2)
        # try direct DocumentVersie by key
        q3 = f"DocumentVersie(guid'{doc_id}')"
        j3 = try_get(q3)
        # 3. Try content stream patterns
        candidates = [
            f"Document({quote(doc_id)})/Content",
            f"Document({quote(doc_id)})/$value",
            f"Document({quote(doc_id)})/Inhoud",
            f"Resource('{doc_id}')",
            f"Resource({quote(doc_id)})",
        ]
        for c in candidates:
            try_get(c)


if __name__ == "__main__":
    main()

uv.lock (1246): file diff suppressed because it is too large.

@@ -0,0 +1,9 @@
# Quick inspection helper: print the schema of the motions table.
import duckdb

from config import config

conn = duckdb.connect(config.DATABASE_PATH)
result = conn.execute("PRAGMA table_info('motions')").fetchall()
for row in result:
    print(row)
conn.close()