- Add 4 migration files: mp_votes, mp_metadata, svd_vectors, fused_embeddings
- Extend database.py with 5 new helper methods and table init
- Add pipeline/ package: extract_mp_votes, fetch_mp_metadata, text_pipeline, svd_pipeline (with Procrustes alignment), fusion
- Add full test suite (17 tests) covering all pipeline modules and migrations
- Fix Procrustes alignment bug: scipy scale is a norm value, not a multiplier
- Fix DuckDB date type handling in test assertions (datetime.date vs string)
- Remove duckdb.py shim; tests now run against real duckdb + scipy via uv

Ref: thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md

parent c498c3467e
commit a36e6cba4e
@@ -0,0 +1,38 @@
kind: pipeline
type: docker
name: default

steps:
- name: build
  image: docker:24.0.2
  environment:
    DOCKER_BUILDKIT: "1"
  commands:
  # Build directly under the registry-prefixed name so that the SHA tag
  # pushed below actually exists.
  - docker build -t ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} .
  - docker tag ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA} ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest

- name: push
  image: docker:24.0.2
  commands:
  - echo "Logging into registry"
  - docker login -u ${DOCKER_USERNAME} -p ${DOCKER_PASSWORD} ${DOCKER_REGISTRY}
  - docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:${DRONE_COMMIT_SHA}
  - docker push ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest

- name: deploy
  image: appleboy/drone-ssh
  settings:
    host: ${DEPLOY_HOST}
    port: ${DEPLOY_SSH_PORT}
    username: ${DEPLOY_USER}
    password: ${DEPLOY_PASSWORD}
    script: |
      set -e
      cd /srv/stemwijzer
      docker pull ${DOCKER_REGISTRY}/${DRONE_REPO_OWNER}/${DRONE_REPO_NAME}:latest
      docker-compose pull
      docker-compose up -d

trigger:
  branch:
  - main
@@ -0,0 +1,10 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
@@ -0,0 +1 @@
3.13
@@ -0,0 +1,126 @@
ARCHITECTURE
============

Overview
--------
A small Python project that collects, stores, and presents Dutch parliamentary motions (Tweede Kamer). It
ingests votes (OData API or HTML scraping), stores motions in a DuckDB file, generates short human
summaries using an LLM client, and exposes a Streamlit UI for users to vote and view matching results.

Tech stack
----------
- Language: Python (single-project repository)
- Data: DuckDB (file: data/motions.db); ibis used in a small utility (read.py)
- Web / UI: Streamlit (app.py)
- HTTP: requests
- HTML parsing: BeautifulSoup (scraper.py)
- Scheduling: schedule (scheduler.py)
- LLM: OpenAI-compatible client (summarizer.py uses openai.OpenAI configured via config)
- Packaging: pyproject.toml present

Top-level layout (annotated)
----------------------------
./
- app.py — Streamlit UI, main UI flow and session handling (entrypoint for web)
- main.py — minimal CLI entry / small script
- database.py — MotionDatabase: DuckDB schema, insert/query/update, party-match calculations
- api_client.py — TweedeKamerAPI: fetch OData voting records and group into motions
- scraper.py — MotionScraper: HTML fallback scraper for motion pages
- summarizer.py — MotionSummarizer: LLM integration to generate layman_explanation
- scheduler.py — DataUpdateScheduler: initial historical loads + periodic scheduled updates
- config.py — Config dataclass: central configuration (DATABASE_PATH, API/AI settings, constants)
- read.py — small ibis + duckdb demonstration/utility
- fix_database.py — script to recreate/reset the DuckDB schema
- reset.py / verify.py — small maintenance scripts that call into the database module
- test.py — ad-hoc test script (manual insert/verification)
- data/ — data/motions.db (DuckDB file)
- pyproject.toml — project metadata / dependencies
- .env — environment variables (not printed here)

Core components
---------------
- Streamlit UI (app.py)
  - Presents the voting UI, reads filtered motions from the database, creates sessions, writes user votes
  - Calls: database.get_filtered_motions(), database.create_session(), database.update_user_vote(),
    database.calculate_party_matches(), summarizer.update_motion_summaries()

- Storage (database.py)
  - MotionDatabase encapsulates DuckDB schema creation and CRUD for motions and user sessions
  - Exposes a module-level instance `db = MotionDatabase()` used across the codebase
  - Key responsibilities: insert_motion, get_filtered_motions, create_session, update_user_vote,
    calculate_party_matches

- Ingestion (api_client.py + scraper.py)
  - api_client.py fetches votes via the Tweede Kamer OData API and groups records into motions
  - scraper.py is an HTML fallback that scrapes motion pages and extracts vote info
  - Both produce structured motion dicts consumed by database.insert_motion()

- Summarization (summarizer.py)
  - Wraps an OpenAI-compatible client to produce short layman explanations and persists them to the DB
  - Reads motions without layman_explanation and updates the rows

- Orchestration (scheduler.py)
  - Runs the initial historical ingestion and schedules periodic updates (using schedule)
  - Calls the API client and summarizer and writes to the database

Data flow (high level)
----------------------
1. Ingestion
   - scheduler / manual run triggers TweedeKamerAPI.get_motions(...) or MotionScraper.run_scraping_job()
   - Each produced motion dict is passed to MotionDatabase.insert_motion()
   - insert_motion writes to DuckDB (data/motions.db)

2. Enrichment
   - summarizer.update_motion_summaries() reads motions lacking layman_explanation,
     calls the LLM client (openai.OpenAI) and writes the summary text back to the DB

3. Presentation / interaction
   - app.py (Streamlit) queries motions via db.get_filtered_motions() and displays them
   - Users vote; app.py writes votes into the database via db.update_user_vote()
   - app.py calls db.calculate_party_matches() to compute match percentages for parties

External integrations & dependencies
------------------------------------
- Tweede Kamer OData API (api_client.py)
- HTTP (requests)
- HTML parsing (BeautifulSoup) used by scraper.py
- DuckDB (database file at data/motions.db)
- ibis (read.py demonstrates an ibis.duckdb connection)
- Streamlit for the UI
- OpenAI-compatible LLM client (summarizer.py) — configured with environment variables in config.py

Configuration
-------------
- config.py: central Config dataclass. Observed keys / env variables referenced across the codebase include:
  - config.DATABASE_PATH (default "data/motions.db")
  - OPENROUTER_API_KEY / other OPENROUTER_* variables used by summarizer.py
  - QWEN_MODEL (or another model identifier) referenced in summarizer.py
  - API timeout / batch size constants
- A .env file is present at the repo root (do not commit secrets). See .env.example if present (none observed).
- Packaging metadata: pyproject.toml

Build, run & development notes
------------------------------
- Install dependencies via the project's Python packaging (pyproject.toml). No Dockerfile or CI
  workflows were detected in the repository.
- Streamlit app: run `streamlit run app.py` from the project root to start the UI (app.py is the intended web entrypoint).
- Scheduler: run scheduler.run_once() (script or import) or scheduler.run_scheduler() for periodic ingestion.

Tests
-----
- There is no test suite using pytest / unittest. One ad-hoc script, `test.py`, exists for manual insert verification.

Notes / caveats
---------------
- The project is synchronous (no async/await patterns detected). Many modules rely on module-level singletons
  (e.g., `db = MotionDatabase()`, `summarizer = MotionSummarizer()`, `scraper = MotionScraper()`).
- Error handling frequently catches broad Exception and prints to stdout (see database.py, api_client.py,
  scraper.py). Logging is not centralized (print statements are used).

Where to look first (for contributors)
--------------------------------------
- app.py — follow the UI flow and see how votes & sessions are used
- database.py — core data model and calculations
- api_client.py — OData ingestion logic
- summarizer.py — LLM usage and environment variables
- scheduler.py — how ingestion is orchestrated over time
@@ -0,0 +1,118 @@
CODE STYLE
==========

Purpose
-------
This document records the conventions already in use in the codebase so that new contributors and AI
agents can produce code that fits the repository's existing style.

General
-------
- Language: Python (3.x)
- The project uses one file per module with descriptive snake_case filenames (e.g., api_client.py, database.py)
- Top-level module singletons are exposed when a single shared instance is desired (e.g. `db = MotionDatabase()`)
- Keep code synchronous unless you introduce async consistently across modules (none currently use async/await)

Naming
------
- Files / modules: snake_case.py (e.g., motion_scraper -> scraper.py, api_client.py)
- Classes: PascalCase (e.g., MotionDatabase, MotionSummarizer, TweedeKamerAPI)
- Functions and methods: snake_case (including private helpers with a single leading underscore)
- Constants / config fields: UPPER_SNAKE_CASE (placed in config.py and referenced via `from config import config`)

File organization
-----------------
- Keep top-level domain modules in the repository root (this repo uses a flat layout)
- Each module should contain one primary responsibility (e.g., database.py for DB logic)
- Module-level singletons: create them at the module bottom and import them from other modules (a pattern used widely)

Imports
-------
- Group imports in this order, with a blank line between groups:
  1. Standard library (datetime, json, typing)
  2. Third-party libraries (requests, duckdb, ibis, streamlit)
  3. Local imports (from config import config, from database import db)
- Use absolute imports (module name) rather than relative imports

Typing
------
- Add type hints to public function signatures where helpful (the project uses typing in several places).
- Use typing.Dict, typing.List, and typing.Optional for simple container annotations.

Error handling & logging
------------------------
- Current pattern: functions catch broad Exception, print an error message, then return a safe default
  (False, [], None). Examples are in database.py and api_client.py.
- When updating code, prefer to:
  - Keep the existing behavior (return a safe fallback) to avoid breaking call sites
  - Consider adding structured logging (the logging module) rather than print, but maintain similar
    high-level error flows unless refactoring intentionally.

LLM / external API calls
------------------------
- OpenAI-compatible client usage is in summarizer.py. Environment variables are read from config.py.
- Do NOT commit API keys or secrets. Use environment variables (OPENROUTER_API_KEY, etc.) and
  reference them by name.
- Network calls are synchronous using requests. Keep request timeouts and error handling consistent with
  existing patterns (catch requests.exceptions.RequestException and return safe fallback values).

Database patterns
-----------------
- The database is DuckDB stored at data/motions.db. The MotionDatabase class opens short-lived duckdb
  connections inside methods (conn = duckdb.connect(self.db_path)). This pattern is used widely.
- Queries and schema initialization happen inside MotionDatabase._init_database(). Keep DDL grouped there.
- When writing methods that modify the DB, follow the try/except + conn.close() pattern to guarantee cleanup.

Testing
-------
- Currently the project uses ad-hoc test scripts (test.py). If adding tests, follow pytest conventions:
  - Place tests in a tests/ directory
  - Use filenames test_*.py and functions test_* with assertions
  - Mock external APIs (requests, the LLM client) via monkeypatch or unittest.mock

Patterns observed (use these when adding new code)
--------------------------------------------------
- Singletons: expose a module-level instance (e.g. `db = MotionDatabase()`) and import it elsewhere
- Private helpers: name them with a single leading underscore (e.g., _get_voting_records)
- Config: centralize in config.py and reference via `from config import config` (don't hardcode paths)

Do's and Don'ts
---------------
Do:
- Follow existing naming: snake_case for files/functions
- Add simple type hints for clarity
- Return the same safe fallback values used in existing functions on error
- Use module-level singletons for shared services if helpful

Don't:
- Don't add async/await in a single module without broader coordination
- Don't print secret values or commit .env files
- Don't create circular imports (be careful when modules instantiate singletons at import time)

Example snippets
----------------
Conformant class and method:

    import typing

    import duckdb

    from config import config


    class ExampleService:
        def __init__(self, param: str = config.DATABASE_PATH):
            self.param = param

        def do_work(self, items: typing.List[dict]) -> bool:
            try:
                # short-lived DB/HTTP usage
                conn = duckdb.connect(config.DATABASE_PATH)
                # ... perform work
                conn.close()
                return True
            except Exception as e:
                print(f"Error in do_work: {e}")
                if 'conn' in locals():
                    conn.close()
                return False

Adding a new module
-------------------
1. Create a snake_case file (e.g., new_service.py)
2. Add a PascalCase class implementing the behavior, plus small helper functions prefixed with _
3. If you need a shared instance, create `service = NewService()` at the module bottom
4. Import it via `from new_service import service` in other modules
@@ -0,0 +1,36 @@
FROM python:3.13-slim

# Install minimal system deps
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user for running the app
RUN useradd -m -s /bin/bash app

WORKDIR /home/app/app

# Copy project files
COPY . /home/app/app

# Upgrade pip and install uv (needed by the CMD below in both branches),
# then either pinned requirements or runtime defaults
RUN python -m pip install --upgrade pip uv
RUN if [ -f requirements.txt ]; then \
        pip install -r requirements.txt; \
    else \
        pip install streamlit duckdb; \
    fi

# Fix permissions
RUN chown -R app:app /home/app

USER app
ENV PYTHONPATH=/home/app/app

EXPOSE 8501

# Simple healthcheck that queries the Streamlit root
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s CMD curl -f http://localhost:8501/ || exit 1

# Run the Streamlit app via uv as preferred in this project
CMD ["uv", "run", "streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
@@ -0,0 +1,90 @@
# Tweede Kamer Parliamentary Embedding Analysis

## Goal

Track how MPs shift politically over time and map motions onto a meaningful ideological axis, by embedding both MPs and motions into a shared vector space.

## Data

|Source|Content|
|------|-------|
|MP × motion vote matrix|yes / no / abstain per MP per motion|
|Motion text|Dutch-language motion descriptions|
|MP metadata|name, party, entry/exit dates|
|Timestamps|date of each vote|

## Approach: Late Fusion

Two independent embedding signals, combined per motion.

### 1. Vote embeddings (SVD)

- Build a sparse MP × motion matrix per time window
- Apply SVD to get latent vectors for both MPs and motions
- Encodes political alignment from actual voting behavior
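As a sketch, the vote-embedding step might look like this (illustrative only: the +1/-1/0 vote encoding, the `embed_votes` helper name, and the square-root split of the singular values between the two factors are assumptions, not the project's actual pipeline code):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def embed_votes(vote_matrix: np.ndarray, k: int = 2):
    """Factor an MP x motion matrix (+1 voor, -1 tegen, 0 abstain/absent)
    into k-dimensional MP and motion vectors."""
    M = csr_matrix(vote_matrix.astype(float))
    # svds requires k < min(M.shape); singular values come back in ascending order
    U, s, Vt = svds(M, k=k)
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    # split the singular values between the two factors (one common convention)
    mp_vectors = U * np.sqrt(s)
    motion_vectors = Vt.T * np.sqrt(s)
    return mp_vectors, motion_vectors

# toy example: 6 MPs x 8 motions
rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(6, 8))
mps, motions = embed_votes(votes, k=2)
```

With this split, `mps @ motions.T` reproduces the best rank-k approximation of the vote matrix, so distances in either factor remain interpretable.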
### 2. Text embeddings (Qwen3-0.6B)

- Embed each motion's text using Qwen3-0.6B (multilingual, Dutch supported)
- Encodes semantic/policy topic of the motion
- Use a task instruction in English, e.g. `"Retrieve semantically similar Dutch parliamentary motions"`
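A sketch of how the instruction could be attached (the `format_query` helper, the `Qwen/Qwen3-Embedding-0.6B` model id, and the `Instruct:`/`Query:` prompt convention are assumptions to verify against the model card; the model load is kept inside the function so importing this snippet does not trigger a download):

```python
def format_query(task: str, text: str) -> str:
    """Prefix a query with a task instruction; documents would be embedded without a prefix."""
    return f"Instruct: {task}\nQuery: {text}"

TASK = "Retrieve semantically similar Dutch parliamentary motions"

def embed_motions(texts: list[str]):
    # Heavy call: requires `sentence-transformers` and a one-time model download.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    return model.encode([format_query(TASK, t) for t in texts], normalize_embeddings=True)

print(format_query(TASK, "Motie over stikstofbeleid"))
```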
### 3. Fusion

Concatenate (or weighted sum) the SVD motion vector and text vector into a single motion embedding. MPs retain their SVD vectors only.
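The fusion step could be sketched as follows (the `alpha` weight and the per-part L2 normalisation are assumptions; the plan leaves the exact weighting open):

```python
import numpy as np

def fuse(svd_vec: np.ndarray, text_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Concatenate L2-normalised vote and text vectors, weighted by alpha."""
    def norm(v: np.ndarray) -> np.ndarray:
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    return np.concatenate([alpha * norm(svd_vec), (1 - alpha) * norm(text_vec)])

fused = fuse(np.array([1.0, 2.0]), np.array([0.0, 3.0, 4.0]))
# fused dimension = SVD dim + text dim (2 + 3 = 5 here)
```

Normalising each part before weighting keeps the vote signal from dominating purely because SVD and text vectors live on different scales.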
## Temporal Tracking

### Time windows

- Default: **quarterly** (flexible — can be per half-year or per N votes)
- Adaptive option: fixed number of votes per window (e.g. 200) for stable SVD regardless of parliamentary rhythm
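The adaptive option amounts to chunking a date-sorted vote stream, which can be sketched as (the `adaptive_windows` helper and the `(date, motion_id)` tuple shape are assumptions):

```python
from datetime import date

def adaptive_windows(votes: list[tuple[date, str]], per_window: int = 200):
    """Split a date-sorted vote stream into windows holding a fixed number of votes."""
    votes = sorted(votes, key=lambda v: v[0])
    return [votes[i:i + per_window] for i in range(0, len(votes), per_window)]

windows = adaptive_windows(
    [(date(2024, 1, d), f"motion-{d}") for d in range(1, 11)], per_window=4
)
# -> windows of size 4, 4, 2
```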
### Procrustes alignment

SVD axes are arbitrary per window and cannot be compared directly. Procrustes alignment finds the optimal rotation mapping one window's space onto the previous, using overlapping MPs as anchors.

```
R = argmin || W1[common] - W2[common] @ R ||
W2_aligned = W2 @ R   # applied to all MPs, including newcomers
```

- Only overlapping MPs are needed to estimate R
- New MPs are placed into the aligned space via their voting pattern
- High Procrustes disparity score = structural political shift, not just individual drift
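A minimal sketch of the alignment step with scipy (the `align_windows` helper is an assumption, not the project's svd_pipeline code). Note that `scipy.linalg.orthogonal_procrustes` returns `(R, scale)` where `scale` is the sum of singular values of `A.T @ B` (a norm), not a factor to multiply into the result; treating it as a multiplier is the bug called out in the commit message:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_windows(W_prev: np.ndarray, W_curr: np.ndarray, common: np.ndarray) -> np.ndarray:
    """Rotate the current window's MP vectors onto the previous window's space,
    using MPs present in both windows (row indices `common`) as anchors."""
    # R minimises || W_curr[common] @ R - W_prev[common] ||_F.
    # The discarded second return value is a norm, NOT a scale to apply.
    R, _ = orthogonal_procrustes(W_curr[common], W_prev[common])
    return W_curr @ R  # applied to all rows, including newcomers

# toy check: a pure rotation between windows is recovered exactly
rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
W2 = W1 @ rot  # second window = first window in rotated axes
aligned = align_windows(W1, W2, common=np.arange(5))
```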
### Election transitions

At term boundaries (~60% MP overlap), alignment is noisier. Mitigation: chain alignments via the last quarter of the old term and the first quarter of the new term, using only returning MPs.

## Analysis

|Question|Method|
|--------|------|
|MP drift over time|trajectory of MP vector across aligned windows|
|Political axis|first SVD component, or defined by anchor parties (e.g. VVD vs SP)|
|Swing voters|MPs closest to the boundary between party clusters|
|Thematic clustering|UMAP on fused motion embeddings|
|Cross-party coalitions|motions where party cluster boundaries blur|
|Party cohesion|variance of MP vectors within a party per window|

## Stack

|Component|Tool|
|---------|----|
|Matrix factorization|`scipy.sparse.linalg.svds`|
|Procrustes alignment|`scipy.spatial.procrustes`|
|Text embeddings|Qwen3-0.6B via `sentence-transformers` or vLLM|
|Dimensionality reduction|UMAP|
|Visualization|Plotly (interactive trajectories)|
|Data handling|ibis / pandas|
@@ -0,0 +1,188 @@
"""Thin AI provider adapter for OpenRouter-compatible backends.

Provides simple helpers for embeddings and chat completions using requests.
This module is intentionally small and dependency-light to make testing easy.
"""

from __future__ import annotations

import os
import random
import time
from typing import Any

import requests


class ProviderError(Exception):
    """Terminal provider error (non-retryable or configuration issues)."""


def _get_base_url() -> str:
    # Support multiple env var names and fall back to the OpenRouter default
    return os.environ.get(
        "OPENROUTER_URL",
        os.environ.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1"),
    )


def _get_api_key() -> str:
    # Accept several common env var names for convenience
    for name in ("OPENROUTER_API_KEY", "OPENROUTER_KEY", "OPENAI_API_KEY", "API_KEY"):
        key = os.environ.get(name)
        if key:
            return key
    raise ProviderError(
        "OPENROUTER_API_KEY (or OPENAI_API_KEY) environment variable is required"
    )


def _post_with_retries(
    path: str, json: dict[str, Any], retries: int = 3
) -> requests.Response:
    """POST to the provider with a small retry/backoff for transient errors.

    Retries on network errors (requests.ConnectionError) and 5xx responses.
    """
    url = _get_base_url().rstrip("/") + path
    headers = {
        "Authorization": f"Bearer {_get_api_key()}",
        "Content-Type": "application/json",
    }

    backoff = 0.5
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(url, json=json, headers=headers, timeout=10)
        except requests.ConnectionError as exc:
            if attempt == retries:
                raise ProviderError(
                    f"Connection error when calling provider: {exc}"
                ) from exc
            sleep = backoff * (2 ** (attempt - 1))
            sleep = sleep + random.uniform(0, sleep * 0.1)  # jitter
            time.sleep(sleep)
            continue

        # Treat 5xx as transient
        if 500 <= getattr(resp, "status_code", 0) < 600:
            if attempt == retries:
                raise ProviderError(f"Provider returned HTTP {resp.status_code}")
            sleep = backoff * (2 ** (attempt - 1))
            sleep = sleep + random.uniform(0, sleep * 0.1)  # jitter
            time.sleep(sleep)
            continue

        return resp

    # Should not reach here
    raise ProviderError("Failed to call provider after retries")


def get_embedding(text: str, model: str | None = None) -> list[float]:
    """Return an embedding vector for `text` using the configured provider.

    Raises ProviderError for configuration or provider-side failures.
    """
    if not isinstance(text, str):
        raise ProviderError("text must be a string")

    # Resolve model: prefer the explicit arg, then env vars, then a sensible Qwen default
    if model is None:
        model = (
            os.environ.get("EMBEDDING_MODEL")
            or os.environ.get("QWEN_EMBEDDING_MODEL")
            or "qwen/qwen3-embedding-4b"
        )
    resp = _post_with_retries("/embeddings", json={"model": model, "input": text})

    try:
        data = resp.json()
    except Exception as exc:
        raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc

    # Expecting {"data": [{"embedding": [...]}, ...]}
    try:
        embedding = data["data"][0]["embedding"]
    except Exception as exc:
        # If the provider returns an error JSON, allow a local fallback when explicitly enabled
        fallback = os.environ.get("ALLOW_LOCAL_EMBED_FALLBACK", "false").lower() in (
            "1",
            "true",
            "yes",
        )
        if fallback:
            # choose the fallback dim via env or default
            dim = int(os.environ.get("LOCAL_EMBED_DIM", "64"))
            return _local_embedding(text, dim=dim)
        raise ProviderError(f"Unexpected embedding response shape: {data}") from exc

    if not isinstance(embedding, list):
        raise ProviderError("Embedding is not a list")

    return [float(x) for x in embedding]


def _local_embedding(text: str, dim: int = 64) -> list[float]:
    """Deterministic local fallback embedding based on SHA256.

    Returns a list of `dim` floats in the range [-1, 1]. Not semantically rich, but
    useful for local testing when provider embeddings are unavailable.
    """
    import hashlib

    h = hashlib.sha256(text.encode("utf8")).digest()
    values = []
    i = 0
    # Expand the digest if needed
    while len(values) < dim:
        # take 8 bytes -> 64-bit int
        chunk = h[i % len(h) : (i % len(h)) + 8]
        if len(chunk) < 8:
            chunk = chunk.ljust(8, b"\0")
        val = int.from_bytes(chunk, "big", signed=False)
        # normalize to [-1, 1]
        valscale = (val / (2**64 - 1)) * 2.0 - 1.0
        values.append(valscale)
        i += 1
        # re-hash occasionally to get more entropy
        if i % (len(h) // 2 + 1) == 0:
            h = hashlib.sha256(h + chunk).digest()

    return values[:dim]


def chat_completion(messages: list[dict], model: str | None = None) -> str:
    """Return the assistant's content string for a chat completion request.

    messages should be a list of dicts like {"role": "user", "content": "..."}.
    """
    if not isinstance(messages, list):
        raise ProviderError("messages must be a list of dicts")

    # Resolve the chat model: prefer the explicit arg, then env var QWEN_MODEL, then a default
    if model is None:
        model = (
            os.environ.get("QWEN_MODEL")
            or os.environ.get("CHAT_MODEL")
            or "qwen/qwen-3.2"
        )

    resp = _post_with_retries(
        "/chat/completions", json={"model": model, "messages": messages}
    )

    try:
        data = resp.json()
    except Exception as exc:
        raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc

    # Expecting {"choices": [{"message": {"content": "..."}}]}
    try:
        content = data["choices"][0]["message"]["content"]
    except Exception as exc:
        raise ProviderError(
            f"Unexpected chat completion response shape: {data}"
        ) from exc

    return str(content)
@@ -0,0 +1,389 @@
# api_client.py (complete updated version)
import requests
import json
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from config import config
import time
from collections import defaultdict


class TweedeKamerAPI:
    def __init__(self):
        self.odata_base_url = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
        self.session = requests.Session()
        self.session.headers.update(
            {
                "Accept": "application/json",
                "User-Agent": "Dutch-Political-Compass-Tool/1.0",
            }
        )

    def get_motions(
        self,
        start_date: Optional[datetime] = None,
        end_date: Optional[datetime] = None,
        limit: int = 500,
    ) -> List[Dict]:
        """Get motions with voting results using the OData API"""
        if not start_date:
            start_date = datetime.now() - timedelta(days=730)  # 2 years ago

        try:
            # Get voting records
            voting_records = self._get_voting_records(start_date, end_date, limit)
            print(f"Fetched {len(voting_records)} voting records from API")

            # Group by Besluit_Id (decision/motion) and get motion details
            motions = self._process_voting_records(voting_records)
            print(f"Processed into {len(motions)} unique motions")

            return motions

        except Exception as e:
            print(f"Error fetching motions from API: {e}")
            return []

    def _get_voting_records(
        self,
        start_date: datetime,
        end_date: Optional[datetime] = None,
        limit: int = 500,
    ) -> List[Dict]:
        """Get individual voting records from the API"""

        # Format the date properly for OData
        start_date_str = start_date.strftime("%Y-%m-%d")
        filter_query = f"GewijzigdOp ge {start_date_str}T00:00:00Z"

        if end_date:
            end_date_str = end_date.strftime("%Y-%m-%d")
            filter_query += f" and GewijzigdOp le {end_date_str}T23:59:59Z"

        # Add a filter to exclude deleted records
        filter_query += " and Verwijderd eq false"

        url = f"{self.odata_base_url}/Stemming"
        params = {
            "$filter": filter_query,
            "$top": limit,
            "$orderby": "GewijzigdOp desc",
        }

        try:
            response = self.session.get(url, params=params, timeout=config.API_TIMEOUT)
            response.raise_for_status()
            data = response.json()

            voting_records = data.get("value", [])

            # If we got the maximum, there might be more data
            if len(voting_records) == limit:
                print(
                    f"Retrieved maximum {limit} records, there might be more data available"
                )

            return voting_records

        except requests.exceptions.RequestException as e:
            print(f"API request failed: {e}")
            if hasattr(e, "response") and e.response is not None:
                print(f"Response status: {e.response.status_code}")
                print(f"Response text: {e.response.text[:500]}")
            return []

    def _process_voting_records(self, records: List[Dict]) -> List[Dict]:
        """Process individual voting records into grouped motions"""

        # Group records by Besluit_Id (decision/motion)
        motion_groups = defaultdict(
            lambda: {"votes": {}, "besluit_id": None, "latest_date": None}
        )

        for record in records:
            besluit_id = record.get("Besluit_Id")
            if not besluit_id:
                continue

            # Extract party and vote information
            party_name = record.get("ActorNaam")
            vote_type = record.get("Soort", "").lower()
            record_date = record.get("GewijzigdOp", "")

            if not party_name:
                continue

            # Map vote types to our format
            if vote_type == "voor":
                vote = "voor"
            elif vote_type == "tegen":
                vote = "tegen"
            else:
                vote = "afwezig"

            # Store the vote
            motion_groups[besluit_id]["votes"][party_name] = vote
            motion_groups[besluit_id]["besluit_id"] = besluit_id

            # Track the latest date for this motion
            if (
                not motion_groups[besluit_id]["latest_date"]
                or record_date > motion_groups[besluit_id]["latest_date"]
            ):
                motion_groups[besluit_id]["latest_date"] = record_date

        # Now get motion details for each unique Besluit_Id
        motions = []
        for besluit_id, motion_data in motion_groups.items():
||||||
|
if len(motion_data["votes"]) < 3: # Skip motions with too few votes |
||||||
|
continue |
||||||
|
|
||||||
|
# Get motion details |
||||||
|
motion_details = self._get_motion_details(besluit_id) |
||||||
|
|
||||||
|
if not motion_details: |
||||||
|
# Create basic motion data if we can't get details |
||||||
|
motion_details = { |
||||||
|
"title": f"Motion {besluit_id[:8]}", |
||||||
|
"description": "No description available", |
||||||
|
"date": motion_data["latest_date"].split("T")[0] |
||||||
|
if motion_data["latest_date"] |
||||||
|
else datetime.now().strftime("%Y-%m-%d"), |
||||||
|
} |
||||||
|
|
||||||
|
# Calculate winning margin |
||||||
|
voting_results = motion_data["votes"] |
||||||
|
total_votes = sum( |
||||||
|
1 for vote in voting_results.values() if vote in ["voor", "tegen"] |
||||||
|
) |
||||||
|
|
||||||
|
if total_votes == 0: |
||||||
|
continue |
||||||
|
|
||||||
|
votes_for = sum(1 for vote in voting_results.values() if vote == "voor") |
||||||
|
winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes |
||||||
|
|
||||||
|
motion = { |
||||||
|
"title": motion_details["title"], |
||||||
|
"description": motion_details["description"], |
||||||
|
"date": motion_details["date"], |
||||||
|
"policy_area": self._determine_policy_area( |
||||||
|
motion_details["title"], motion_details["description"] |
||||||
|
), |
||||||
|
"voting_results": voting_results, |
||||||
|
"winning_margin": winning_margin, |
||||||
|
"url": f"https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id}", |
||||||
|
"externe_identifier": motion_details.get("externe_identifier"), |
||||||
|
"body_text": motion_details.get("body_text"), |
||||||
|
} |
||||||
|
|
||||||
|
motions.append(motion) |
||||||
|
|
||||||
|
return motions |
||||||
|
|
||||||
|
def _get_motion_details(self, besluit_id: str) -> Optional[Dict]: |
||||||
|
"""Get motion details from Besluit endpoint. |
||||||
|
|
||||||
|
Fetches Zaak.Onderwerp for the human-readable title, then follows the |
||||||
|
Zaak → Document → DocumentVersie chain to get the ExterneIdentifier, |
||||||
|
which is used to scrape the full motion body text from |
||||||
|
zoek.officielebekendmakingen.nl. |
||||||
|
""" |
||||||
|
try: |
||||||
|
# Step 1: Besluit → Zaak (title) + Zaak.Id for document lookup |
||||||
|
url = f"{self.odata_base_url}/Besluit({besluit_id})" |
||||||
|
params = {"$expand": "Zaak($select=Id,Onderwerp)"} |
||||||
|
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT) |
||||||
|
response.raise_for_status() |
||||||
|
record = response.json() |
||||||
|
|
||||||
|
zaak_list = record.get("Zaak", []) |
||||||
|
onderwerp = None |
||||||
|
zaak_id = None |
||||||
|
if zaak_list: |
||||||
|
onderwerp = zaak_list[0].get("Onderwerp") |
||||||
|
zaak_id = zaak_list[0].get("Id") |
||||||
|
|
||||||
|
besluit_tekst = record.get("BesluitTekst") or "" |
||||||
|
date_str = record.get("GewijzigdOp", "") |
||||||
|
date = ( |
||||||
|
date_str.split("T")[0] |
||||||
|
if date_str |
||||||
|
else datetime.now().strftime("%Y-%m-%d") |
||||||
|
) |
||||||
|
|
||||||
|
title = onderwerp or f"Motion {besluit_id[:8]}" |
||||||
|
description = onderwerp or besluit_tekst or "Geen beschrijving beschikbaar" |
||||||
|
|
||||||
|
# Step 2: Fetch ExterneIdentifier via Zaak → Document → DocumentVersie |
||||||
|
externe_identifier = None |
||||||
|
body_text = None |
||||||
|
if zaak_id: |
||||||
|
externe_identifier = self._get_externe_identifier(zaak_id) |
||||||
|
if externe_identifier: |
||||||
|
body_text = self._fetch_body_text(externe_identifier) |
||||||
|
|
||||||
|
return { |
||||||
|
"title": title, |
||||||
|
"description": body_text or description, |
||||||
|
"date": date, |
||||||
|
"externe_identifier": externe_identifier, |
||||||
|
"body_text": body_text, |
||||||
|
} |
||||||
|
|
||||||
|
except Exception as e: |
||||||
|
print(f"Error getting motion details for {besluit_id}: {e}") |
||||||
|
|
||||||
|
return None |
||||||
|
|
||||||
|
def _get_externe_identifier(self, zaak_id: str) -> Optional[str]: |
||||||
|
"""Fetch the ExterneIdentifier for the first non-deleted DocumentVersie of a Zaak.""" |
||||||
|
try: |
||||||
|
url = f"{self.odata_base_url}/Zaak({zaak_id})" |
||||||
|
params = { |
||||||
|
"$expand": "Document($expand=DocumentVersie($select=Id,ExterneIdentifier,Extensie,Verwijderd))" |
||||||
|
} |
||||||
|
response = self.session.get(url, params=params, timeout=config.API_TIMEOUT) |
||||||
|
response.raise_for_status() |
||||||
|
data = response.json() |
||||||
|
|
||||||
|
for doc in data.get("Document", []): |
||||||
|
for versie in doc.get("DocumentVersie", []): |
||||||
|
if versie.get("Verwijderd"): |
||||||
|
continue |
||||||
|
ext_id = versie.get("ExterneIdentifier") |
||||||
|
if ext_id: |
||||||
|
return ext_id |
||||||
|
except Exception as e: |
||||||
|
print(f"Error fetching ExterneIdentifier for zaak {zaak_id}: {e}") |
||||||
|
|
||||||
|
return None |
||||||
|
|
||||||
|
def _fetch_body_text(self, externe_identifier: str) -> Optional[str]: |
||||||
|
"""Scrape full motion body text from zoek.officielebekendmakingen.nl.""" |
||||||
|
try: |
||||||
|
url = f"https://zoek.officielebekendmakingen.nl/{externe_identifier}.html" |
||||||
|
response = self.session.get(url, timeout=config.API_TIMEOUT) |
||||||
|
response.raise_for_status() |
||||||
|
html = response.text |
||||||
|
|
||||||
|
# Strip tags |
||||||
|
text = re.sub(r"<[^>]+>", " ", html) |
||||||
|
text = re.sub(r"&[a-z]+;", " ", text) |
||||||
|
text = re.sub(r"\s+", " ", text).strip() |
||||||
|
|
||||||
|
# Find the motion body starting at the first relevant keyword |
||||||
|
start_keywords = [ |
||||||
|
"constaterende", |
||||||
|
"overwegende", |
||||||
|
"verzoekt", |
||||||
|
"spreekt uit", |
||||||
|
"roept op", |
||||||
|
"de kamer,", |
||||||
|
] |
||||||
|
start_pos = len(text) |
||||||
|
for kw in start_keywords: |
||||||
|
pos = text.lower().find(kw) |
||||||
|
if pos != -1 and pos < start_pos: |
||||||
|
start_pos = pos |
||||||
|
|
||||||
|
if start_pos == len(text): |
||||||
|
return None # No motion body found |
||||||
|
|
||||||
|
body = text[start_pos:] |
||||||
|
|
||||||
|
# Trim at end markers |
||||||
|
end_markers = [ |
||||||
|
"gaat over tot de orde van de dag", |
||||||
|
"naar boven", |
||||||
|
"deze motie is", |
||||||
|
"nr.", |
||||||
|
] |
||||||
|
for marker in end_markers: |
||||||
|
pos = body.lower().find(marker) |
||||||
|
if pos != -1: |
||||||
|
body = body[:pos] |
||||||
|
|
||||||
|
body = body.strip() |
||||||
|
return body if len(body) > 50 else None |
||||||
|
|
||||||
|
except Exception as e: |
||||||
|
print(f"Error fetching body text for {externe_identifier}: {e}") |
||||||
|
|
||||||
|
return None |
||||||
|
|
||||||
|
def _determine_policy_area(self, title: str, description: str) -> str: |
||||||
|
"""Determine policy area from motion title and description""" |
||||||
|
text = (title + " " + description).lower() |
||||||
|
|
||||||
|
# Policy area keyword mapping |
||||||
|
policy_mapping = { |
||||||
|
"Economie": [ |
||||||
|
"economie", |
||||||
|
"belasting", |
||||||
|
"budget", |
||||||
|
"financiën", |
||||||
|
"werkgelegenheid", |
||||||
|
"bedrijven", |
||||||
|
"economisch", |
||||||
|
], |
||||||
|
"Klimaat": [ |
||||||
|
"klimaat", |
||||||
|
"co2", |
||||||
|
"duurzaam", |
||||||
|
"energie", |
||||||
|
"milieu", |
||||||
|
"uitstoot", |
||||||
|
"klimaatverandering", |
||||||
|
], |
||||||
|
"Immigratie": [ |
||||||
|
"migratie", |
||||||
|
"asiel", |
||||||
|
"vreemdeling", |
||||||
|
"integratie", |
||||||
|
"naturalisatie", |
||||||
|
"immigratie", |
||||||
|
], |
||||||
|
"Zorg": [ |
||||||
|
"zorg", |
||||||
|
"gezondheid", |
||||||
|
"ziekenhuis", |
||||||
|
"medicijn", |
||||||
|
"arts", |
||||||
|
"patiënt", |
||||||
|
"gezondheidszorg", |
||||||
|
], |
||||||
|
"Onderwijs": [ |
||||||
|
"onderwijs", |
||||||
|
"school", |
||||||
|
"universiteit", |
||||||
|
"student", |
||||||
|
"leraar", |
||||||
|
"educatie", |
||||||
|
], |
||||||
|
"Defensie": [ |
||||||
|
"defensie", |
||||||
|
"militair", |
||||||
|
"veiligheid", |
||||||
|
"oorlog", |
||||||
|
"leger", |
||||||
|
"veiligheidsdienst", |
||||||
|
], |
||||||
|
} |
||||||
|
|
||||||
|
for area, keywords in policy_mapping.items(): |
||||||
|
if any(keyword in text for keyword in keywords): |
||||||
|
return area |
||||||
|
|
||||||
|
return "Algemeen" |
||||||
|
|
||||||
|
def test_api_connection(self) -> bool: |
||||||
|
"""Test if API is accessible""" |
||||||
|
try: |
||||||
|
url = f"{self.odata_base_url}/Stemming" |
||||||
|
params = {"$top": 1} |
||||||
|
|
||||||
|
response = self.session.get(url, params=params, timeout=10) |
||||||
|
response.raise_for_status() |
||||||
|
|
||||||
|
data = response.json() |
||||||
|
return len(data.get("value", [])) > 0 |
||||||
|
|
||||||
|
except Exception as e: |
||||||
|
print(f"API connection test failed: {e}") |
||||||
|
return False |
||||||
@ -0,0 +1,51 @@
# config.py (complete updated version)
import os
from dataclasses import dataclass
from typing import List


@dataclass
class Config:
    # Database settings
    DATABASE_PATH = "data/motions.db"

    # API settings (updated)
    TWEEDE_KAMER_ODATA_API = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
    API_TIMEOUT = 30
    API_BATCH_SIZE = 250  # Increased based on API capabilities
    API_MAX_LIMIT = 250

    # AI settings
    OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
    OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
    QWEN_MODEL = "qwen/qwen-2.5-72b-instruct"

    # App settings
    DEFAULT_MOTION_COUNT = 10
    DEFAULT_WINNING_MARGIN_MIN = 0  # % - include all, filter by layman_explanation instead
    DEFAULT_WINNING_MARGIN_MAX = 100  # %
    SESSION_TIMEOUT_DAYS = 30

    # Policy areas
    POLICY_AREAS = [
        "Alle",
        "Economie",
        "Klimaat",
        "Immigratie",
        "Zorg",
        "Onderwijs",
        "Defensie",
        "Sociale Zaken",
        "Algemeen",
    ]

    # Scraper defaults (previously missing)
    BASE_URL = "https://www.tweedekamer.nl/zoeken/zoekresultaten"  # base for scraping motions
    SCRAPING_DELAY = int(os.getenv("SCRAPING_DELAY", "5"))


config = Config()
@ -0,0 +1,582 @@
# database.py (final working version)
import duckdb
import json
import uuid
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple
from config import config
import logging

_logger = logging.getLogger(__name__)


class MotionDatabase:
    def __init__(self, db_path: str = config.DATABASE_PATH):
        self.db_path = db_path
        self._init_database()

    def _init_database(self):
        """Initialize database with required tables"""
        # Create directory if it doesn't exist
        import os

        os.makedirs(os.path.dirname(self.db_path), exist_ok=True)

        conn = duckdb.connect(self.db_path)

        # Create sequence for auto-incrementing IDs
        try:
            conn.execute("CREATE SEQUENCE IF NOT EXISTS motions_id_seq START 1")
        except Exception:
            pass

        # Create tables with proper ID handling.
        # externe_identifier and body_text are included here because
        # insert_motion() writes them.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS motions (
                id INTEGER DEFAULT nextval('motions_id_seq'),
                title TEXT NOT NULL,
                description TEXT,
                date DATE,
                policy_area TEXT,
                voting_results JSON,
                winning_margin FLOAT,
                controversy_score FLOAT,
                layman_explanation TEXT,
                url TEXT UNIQUE,
                externe_identifier TEXT,
                body_text TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)

        conn.execute("""
            CREATE TABLE IF NOT EXISTS user_sessions (
                session_id TEXT PRIMARY KEY,
                user_votes JSON,
                completed_motions INTEGER DEFAULT 0,
                total_motions INTEGER DEFAULT 10,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

        conn.execute("""
            CREATE TABLE IF NOT EXISTS party_results (
                session_id TEXT,
                party_name TEXT,
                agreement_percentage FLOAT,
                agreed_motions JSON,
                disagreed_motions JSON,
                PRIMARY KEY (session_id, party_name)
            )
        """)

        # Embeddings table, required by store_embedding() / search_similar()
        conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS embeddings (
                id INTEGER DEFAULT nextval('embeddings_id_seq'),
                motion_id INTEGER NOT NULL,
                model TEXT,
                vector JSON NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)

        # New pipeline tables
        conn.execute("CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS mp_votes (
                id INTEGER DEFAULT nextval('mp_votes_id_seq'),
                motion_id INTEGER NOT NULL,
                mp_name TEXT NOT NULL,
                party TEXT,
                vote TEXT NOT NULL,
                date DATE,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS mp_metadata (
                mp_name TEXT PRIMARY KEY,
                party TEXT,
                van DATE,
                tot_en_met DATE,
                persoon_id TEXT
            )
        """)
        conn.execute("CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS svd_vectors (
                id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
                window_id TEXT NOT NULL,
                entity_type TEXT NOT NULL,
                entity_id TEXT NOT NULL,
                vector JSON NOT NULL,
                model TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)
        conn.execute("CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS fused_embeddings (
                id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
                motion_id INTEGER NOT NULL,
                window_id TEXT NOT NULL,
                vector JSON NOT NULL,
                svd_dims INTEGER NOT NULL,
                text_dims INTEGER NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (id)
            )
        """)

        conn.close()

    def reset_database(self):
        """Development helper: drop known tables and re-run initialization.

        WARNING: intended for dev/test only. This will remove tables and recreate schema.
        """
        conn = duckdb.connect(self.db_path)
        try:
            # Drop known tables if they exist
            for t in (
                "party_results",
                "user_sessions",
                "motions",
                "mp_votes",
                "mp_metadata",
                "svd_vectors",
                "fused_embeddings",
                "embeddings",
            ):
                try:
                    conn.execute(f"DROP TABLE IF EXISTS {t}")
                except Exception:
                    pass
            # Recreate schema
            conn.close()
            self._init_database()
        finally:
            try:
                conn.close()
            except Exception:
                pass

    def insert_motion(self, motion_data: Dict) -> bool:
        """Insert a new motion into the database"""
        try:
            conn = duckdb.connect(self.db_path)

            # Check if motion already exists by URL to avoid duplicates
            existing = conn.execute(
                """
                SELECT COUNT(*) FROM motions WHERE url = ?
                """,
                (motion_data["url"],),
            ).fetchone()

            if existing and existing[0] > 0:
                conn.close()
                return False  # Motion already exists

            # Insert motion - id will be auto-generated by sequence
            conn.execute(
                """
                INSERT INTO motions
                (title, description, date, policy_area, voting_results,
                 winning_margin, controversy_score, url, externe_identifier, body_text, created_at)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
                """,
                (
                    motion_data["title"],
                    motion_data["description"] or "",
                    motion_data["date"],
                    motion_data["policy_area"],
                    json.dumps(motion_data["voting_results"]),
                    motion_data["winning_margin"],
                    1 - motion_data["winning_margin"],  # controversy score
                    motion_data["url"],
                    motion_data.get("externe_identifier"),
                    motion_data.get("body_text"),
                ),
            )

            conn.close()
            return True

        except Exception as e:
            print(f"Error inserting motion: {e}")
            if "conn" in locals():
                conn.close()
            return False

    def get_filtered_motions(
        self,
        policy_area: str = "Alle",
        min_margin: float = 0.2,
        max_margin: float = 0.8,
        limit: int = 100,
    ) -> List[Dict]:
        """Get motions filtered by criteria"""
        conn = duckdb.connect(self.db_path)

        query = """
            SELECT * FROM motions
            WHERE winning_margin BETWEEN ? AND ?
            AND layman_explanation IS NOT NULL
            AND layman_explanation != ''
        """
        params = [min_margin, max_margin]

        if policy_area != "Alle":
            query += " AND policy_area = ?"
            params.append(policy_area)

        query += " ORDER BY controversy_score DESC LIMIT ?"
        params.append(limit)

        try:
            result = conn.execute(query, params).fetchall()
            columns = [desc[0] for desc in conn.description]
            conn.close()

            return [dict(zip(columns, row)) for row in result]
        except Exception as e:
            print(f"Error querying motions: {e}")
            conn.close()
            return []

    def create_session(self, total_motions: int = 10) -> str:
        """Create new user session"""
        session_id = str(uuid.uuid4())
        conn = duckdb.connect(self.db_path)
        conn.execute(
            """
            INSERT INTO user_sessions (session_id, user_votes, total_motions)
            VALUES (?, '{}', ?)
            """,
            (session_id, total_motions),
        )
        conn.close()
        return session_id

    def update_user_vote(self, session_id: str, motion_id: int, vote: str):
        """Update user vote for a motion"""
        conn = duckdb.connect(self.db_path)

        # Get current votes
        current_votes = conn.execute(
            """
            SELECT user_votes FROM user_sessions WHERE session_id = ?
            """,
            (session_id,),
        ).fetchone()

        if current_votes:
            votes_dict = json.loads(current_votes[0])
            votes_dict[str(motion_id)] = vote

            conn.execute(
                """
                UPDATE user_sessions
                SET user_votes = ?,
                    completed_motions = ?,
                    last_updated = CURRENT_TIMESTAMP
                WHERE session_id = ?
                """,
                (json.dumps(votes_dict), len(votes_dict), session_id),
            )

        conn.close()

    def calculate_party_matches(self, session_id: str) -> List[Dict]:
        """Calculate party agreement percentages"""
        conn = duckdb.connect(self.db_path)

        # Get user votes and motion data
        user_data = conn.execute(
            """
            SELECT user_votes FROM user_sessions WHERE session_id = ?
            """,
            (session_id,),
        ).fetchone()

        if not user_data:
            conn.close()
            return []

        user_votes = json.loads(user_data[0])
        motion_ids = list(user_votes.keys())

        if not motion_ids:
            conn.close()
            return []

        # Get motion voting results
        placeholders = ",".join(["?" for _ in motion_ids])
        motions = conn.execute(
            f"""
            SELECT id, voting_results FROM motions
            WHERE id IN ({placeholders})
            """,
            motion_ids,
        ).fetchall()

        conn.close()

        # Calculate agreements
        party_scores = {}

        for motion_id, voting_results_json in motions:
            voting_results = json.loads(voting_results_json)
            user_vote = user_votes[str(motion_id)]

            if user_vote == "Geen stem":  # Skip abstentions
                continue

            for party, party_vote in voting_results.items():
                # Skip individual MP names (contain a comma, e.g. "Yesilgöz-Zegerius, D.").
                # Party/fractie names never contain a comma.
                if "," in party:
                    continue

                if party not in party_scores:
                    party_scores[party] = {"agreed": 0, "total": 0}

                party_scores[party]["total"] += 1

                # Check agreement
                if (user_vote == "Voor" and party_vote == "voor") or (
                    user_vote == "Tegen" and party_vote == "tegen"
                ):
                    party_scores[party]["agreed"] += 1

        # Convert to percentages and sort
        results = []
        for party, scores in party_scores.items():
            if scores["total"] > 0:
                agreement_pct = (scores["agreed"] / scores["total"]) * 100
                results.append(
                    {
                        "party": party,
                        "agreement_percentage": round(agreement_pct, 1),
                        "agreed_motions": scores["agreed"],
                        "total_motions": scores["total"],
                    }
                )

        return sorted(results, key=lambda x: x["agreement_percentage"], reverse=True)

    def store_embedding(self, motion_id: int, model: str, vector: List[float]) -> int:
        """Store an embedding for a motion. Returns inserted row id or -1 on failure."""
        try:
            conn = duckdb.connect(self.db_path)
            # store vector as JSON
            conn.execute(
                "INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
                (motion_id, model, json.dumps(vector)),
            )
            row = conn.execute("SELECT max(id) FROM embeddings").fetchone()
            conn.close()
            if row and row[0] is not None:
                return int(row[0])
            return -1
        except Exception as e:
            print(f"Error storing embedding: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return -1

    def search_similar(
        self, query_vector: List[float], top_k: int = 5, model: Optional[str] = None
    ) -> List[Dict]:
        """Naive in-Python cosine similarity search over stored embeddings.

        Returns list of dicts with keys: id, motion_id, model, score, created_at
        """
        try:
            conn = duckdb.connect(self.db_path)
            if model:
                rows = conn.execute(
                    "SELECT id, motion_id, model, vector, created_at FROM embeddings WHERE model = ?",
                    (model,),
                ).fetchall()
            else:
                rows = conn.execute(
                    "SELECT id, motion_id, model, vector, created_at FROM embeddings"
                ).fetchall()
            conn.close()

            results = []
            import math

            for r in rows:
                id_, motion_id, mdl, vector_json, created_at = r
                try:
                    vec = json.loads(vector_json)
                except Exception:
                    continue

                # cosine similarity
                try:
                    dot = sum(float(a) * float(b) for a, b in zip(query_vector, vec))
                    na = math.sqrt(sum(float(a) * float(a) for a in query_vector))
                    nb = math.sqrt(sum(float(b) * float(b) for b in vec))
                    score = dot / (na * nb) if na and nb else 0.0
                except Exception:
                    score = 0.0

                results.append(
                    {
                        "id": id_,
                        "motion_id": motion_id,
                        "model": mdl,
                        "score": score,
                        "created_at": created_at,
                    }
                )

            results.sort(key=lambda x: x["score"], reverse=True)
            return results[:top_k]
        except Exception as e:
            print(f"Error searching embeddings: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return []

    def mp_votes_exists_for_motion(self, motion_id: int) -> bool:
        try:
            conn = duckdb.connect(self.db_path)
            row = conn.execute(
                "SELECT COUNT(*) FROM mp_votes WHERE motion_id = ?",
                (motion_id,),
            ).fetchone()
            conn.close()
            return bool(row and row[0] > 0)
        except Exception as e:
            _logger.error(f"Error checking mp_votes existence: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return False

    def insert_mp_vote(
        self,
        motion_id: int,
        mp_name: str,
        vote: str,
        date: Optional[str] = None,
        party: Optional[str] = None,
    ) -> int:
        try:
            conn = duckdb.connect(self.db_path)
            conn.execute(
                """
                INSERT INTO mp_votes (motion_id, mp_name, party, vote, date, created_at)
                VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
                """,
                (motion_id, mp_name, party, vote, date),
            )
            row = conn.execute("SELECT max(id) FROM mp_votes").fetchone()
            conn.close()
            if row and row[0] is not None:
                return int(row[0])
            return -1
        except Exception as e:
            _logger.error(f"Error inserting mp_vote: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return -1

    def upsert_mp_metadata(
        self,
        mp_name: str,
        party: Optional[str],
        van: Optional[str],
        tot_en_met: Optional[str],
        persoon_id: Optional[str],
    ) -> None:
        try:
            conn = duckdb.connect(self.db_path)
            exists = conn.execute(
                "SELECT COUNT(*) FROM mp_metadata WHERE mp_name = ?", (mp_name,)
            ).fetchone()
            if exists and exists[0] > 0:
                conn.execute(
                    """
                    UPDATE mp_metadata SET party = ?, van = ?, tot_en_met = ?, persoon_id = ?
                    WHERE mp_name = ?
                    """,
                    (party, van, tot_en_met, persoon_id, mp_name),
                )
            else:
                conn.execute(
                    """
                    INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id)
                    VALUES (?, ?, ?, ?, ?)
                    """,
                    (mp_name, party, van, tot_en_met, persoon_id),
                )
            conn.close()
        except Exception as e:
            _logger.error(f"Error upserting mp_metadata: {e}")
            try:
                conn.close()
            except Exception:
                pass

    def store_svd_vector(
        self,
        window_id: str,
        entity_type: str,
        entity_id: str,
        vector: List[float],
        model: Optional[str] = None,
    ) -> int:
        try:
            conn = duckdb.connect(self.db_path)
            conn.execute(
                """
                INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model, created_at)
                VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
                """,
                (window_id, entity_type, entity_id, json.dumps(vector), model),
            )
            row = conn.execute("SELECT max(id) FROM svd_vectors").fetchone()
            conn.close()
            if row and row[0] is not None:
                return int(row[0])
            return -1
        except Exception as e:
            _logger.error(f"Error storing svd_vector: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return -1

    def store_fused_embedding(
        self,
        motion_id: int,
        window_id: str,
        vector: List[float],
        svd_dims: int,
        text_dims: int,
    ) -> int:
        try:
            conn = duckdb.connect(self.db_path)
            conn.execute(
                """
                INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims, created_at)
                VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)
                """,
                (motion_id, window_id, json.dumps(vector), svd_dims, text_dims),
            )
            row = conn.execute("SELECT max(id) FROM fused_embeddings").fetchone()
            conn.close()
            if row and row[0] is not None:
                return int(row[0])
            return -1
        except Exception as e:
            _logger.error(f"Error storing fused_embedding: {e}")
            try:
                conn.close()
            except Exception:
                pass
            return -1


db = MotionDatabase()
@ -0,0 +1,20 @@
version: '3.8'

services:
  stemwijzer:
    build: .
    image: stemwijzer:latest
    container_name: stemwijzer_app
    restart: unless-stopped
    ports:
      - "8501:8501"
    volumes:
      - ./data:/home/app/app/data:rw
    environment:
      - PYTHONPATH=/home/app/app
      - OPENROUTER_API_KEY
      - OTHER_SECRET
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8501/"]
      interval: 30s
      timeout: 3s
      retries: 3
@ -0,0 +1,72 @@
# Recomputing Similarity (Admin)

This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.

## What this does

- Recompute similarity vectors/scores for existing records in the database.
- (Optionally) run the clusterer job that groups similar items based on the recomputed vectors.

These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.

## Migration filenames

When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like `recompute_similarity` or `clusterer` in the filename, for example:

- `20260101_001_recompute_similarity.py`
- `20260215_002_clusterer_migration.py`

Check your migrations folder for the exact filenames used in your environment.

## Environment variables

When running the CLI locally you may need to set the following environment variables.

- `TEST_DB_URL` — connection string for a test/development database (used by local runs when you don't want to touch production data).
- `AI_PROVIDER_MOCK` — when set, the AI/embedding provider is mocked so you don't make real API calls during development. Any non-empty value (e.g. `1`, `true`, `yes`) is treated as truthy.
- `SIMILARITY_TOP_N` — default number of top similar items to compute/keep for each record. The CLI `--top-n` flag overrides this value for the duration of the run.

Examples:

- Export in a shell (persistent for your session):

      export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
      export AI_PROVIDER_MOCK="true"
      export SIMILARITY_TOP_N="50"

- Inline for a single command (non-persistent):

      TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100

Notes:

- The `--top-n` CLI flag takes precedence over `SIMILARITY_TOP_N` when both are provided.
- `AI_PROVIDER_MOCK` should be set to a truthy value (e.g. `1`, `true`, `yes`) to avoid real external AI calls during local runs.

## Running locally (development)

The CLI lives under `src/cli`. Use the module runner to execute the recompute script. Example commands:

Run a dry-run that doesn't persist changes:

```
python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run
```

Run for real (writes results to the DB):

```
python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
```

Common flags:

- `--top-n` — override `SIMILARITY_TOP_N` for this run.
- `--batch-size` — number of records to process per batch.
- `--dry-run` — inspect what would be changed without writing to the DB.

Notes:

- Always point `TEST_DB_URL` at a non-production database when experimenting.
- Use `AI_PROVIDER_MOCK=true` to skip external calls and speed up local dev.
- If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.

If you need help or encounter mismatches between migration files and the CLI, check the migrations folder and speak with the team member who authored the change.
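The flag/env precedence and truthiness rules described in this document can be sketched in Python. The helper names and the default value below are illustrative, not taken from the actual CLI:

```python
import os


def resolve_top_n(cli_top_n=None, default=20):
    # The CLI flag wins; otherwise fall back to the SIMILARITY_TOP_N env var,
    # then to a default (hypothetical value here).
    if cli_top_n is not None:
        return int(cli_top_n)
    env_val = os.environ.get("SIMILARITY_TOP_N")
    return int(env_val) if env_val else default


def ai_provider_mocked():
    # Any non-empty value of AI_PROVIDER_MOCK is treated as truthy.
    return bool(os.environ.get("AI_PROVIDER_MOCK"))


os.environ["SIMILARITY_TOP_N"] = "50"
print(resolve_top_n())    # 50 (env var used)
print(resolve_top_n(10))  # 10 (CLI flag overrides)
```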
@ -0,0 +1,67 @@
# fix_database.py (updated version)
import os

import duckdb

from config import config


def fix_database():
    """Completely reset the database with the correct schema."""

    # Remove the existing database file completely
    if os.path.exists(config.DATABASE_PATH):
        os.remove(config.DATABASE_PATH)
        print("Removed existing database file")

    # Create the parent directory if it doesn't exist (guard against a bare
    # filename, where dirname would be empty and os.makedirs would fail)
    parent_dir = os.path.dirname(config.DATABASE_PATH)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    # Initialize with correct schema
    conn = duckdb.connect(config.DATABASE_PATH)

    # Create sequence for auto-incrementing IDs
    conn.execute("CREATE SEQUENCE motions_id_seq START 1")

    # Create motions table with sequence-based auto-increment
    conn.execute("""
        CREATE TABLE motions (
            id INTEGER DEFAULT nextval('motions_id_seq'),
            title TEXT NOT NULL,
            description TEXT,
            date DATE,
            policy_area TEXT,
            voting_results JSON,
            winning_margin FLOAT,
            controversy_score FLOAT,
            layman_explanation TEXT,
            url TEXT UNIQUE,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (id)
        )
    """)

    conn.execute("""
        CREATE TABLE user_sessions (
            session_id TEXT PRIMARY KEY,
            user_votes JSON,
            completed_motions INTEGER DEFAULT 0,
            total_motions INTEGER DEFAULT 10,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE party_results (
            session_id TEXT,
            party_name TEXT,
            agreement_percentage FLOAT,
            agreed_motions JSON,
            disagreed_motions JSON,
            PRIMARY KEY (session_id, party_name)
        )
    """)

    conn.close()
    print("Database recreated with correct schema using sequences")


if __name__ == "__main__":
    fix_database()
@ -0,0 +1,6 @@
def main():
    print("Hello from stemwijzer!")


if __name__ == "__main__":
    main()
@ -0,0 +1,11 @@
-- Add a separate embeddings table for semantic search and storage of vectors (DuckDB-compatible)
CREATE TABLE IF NOT EXISTS embeddings (
    id INTEGER,
    motion_id INTEGER NOT NULL,
    model TEXT NOT NULL,
    vector JSON NOT NULL,
    created_at TIMESTAMP DEFAULT current_timestamp
);
-- DuckDB does not support AUTOINCREMENT; emulate id via a sequence if needed elsewhere
CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1;
-- Populating id via a trigger-like insert pattern is handled by application code (select nextval when inserting)
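The application-side insert pattern the comment above describes can be sketched in SQL; the values below are illustrative, and DuckDB can also evaluate `nextval` inline in the insert itself:

```sql
-- Illustrative only: assign the id from the sequence at insert time.
INSERT INTO embeddings (id, motion_id, model, vector)
VALUES (nextval('embeddings_id_seq'), 42, 'example-model', '[0.1, 0.2, 0.3]');
```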
@ -0,0 +1,6 @@
-- Migration: add externe_identifier and body_text columns to motions
-- externe_identifier: e.g. "kst-36600-VII-28" from DocumentVersie.ExterneIdentifier
-- body_text: full plain-text motion body scraped from officielebekendmakingen.nl

ALTER TABLE motions ADD COLUMN IF NOT EXISTS externe_identifier VARCHAR;
ALTER TABLE motions ADD COLUMN IF NOT EXISTS body_text VARCHAR;
@ -0,0 +1,24 @@
-- Migration: create audit_events table
-- Date: 2026-03-22
-- Description: Placeholder migration to add an audit_events table to record audit logs.
--
-- Decision: The actual SQL is intentionally left commented out to avoid making
-- database changes during test runs. When ready to apply, uncomment and
-- adapt the SQL for your database engine.

/*
CREATE TABLE audit_events (
    id UUID PRIMARY KEY,
    actor_id UUID NOT NULL,
    action TEXT NOT NULL,
    target_type TEXT,
    target_id UUID,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);

-- Add indexes as needed, e.g.:
-- CREATE INDEX ON audit_events (actor_id);
*/

-- End of migration placeholder
@ -0,0 +1,15 @@
-- 2026-03-22-add-similarity-cache.sql
-- Placeholder migration for adding a similarity_cache table
-- Decision: Keep SQL commented out so CI does not accidentally modify databases.

/*
-- Example (commented out):
CREATE TABLE similarity_cache (
    id SERIAL PRIMARY KEY,
    key TEXT NOT NULL,
    vector FLOAT8[] NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);
*/

-- No executable SQL in this file. Intentionally left as a safe no-op.
@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS fused_embeddings_id_seq START 1;
CREATE TABLE IF NOT EXISTS fused_embeddings (
    id INTEGER DEFAULT nextval('fused_embeddings_id_seq'),
    motion_id INTEGER NOT NULL,
    window_id TEXT NOT NULL,
    vector JSON NOT NULL,
    svd_dims INTEGER NOT NULL,
    text_dims INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
----END
@ -0,0 +1,9 @@
----SQL
CREATE TABLE IF NOT EXISTS mp_metadata (
    mp_name TEXT PRIMARY KEY,
    party TEXT,
    van DATE,
    tot_en_met DATE,
    persoon_id TEXT
);
----END
@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1;
CREATE TABLE IF NOT EXISTS mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    motion_id INTEGER NOT NULL,
    mp_name TEXT NOT NULL,
    party TEXT,
    vote TEXT NOT NULL,
    date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
----END
@ -0,0 +1,13 @@
----SQL
CREATE SEQUENCE IF NOT EXISTS svd_vectors_id_seq START 1;
CREATE TABLE IF NOT EXISTS svd_vectors (
    id INTEGER DEFAULT nextval('svd_vectors_id_seq'),
    window_id TEXT NOT NULL,
    entity_type TEXT NOT NULL,
    entity_id TEXT NOT NULL,
    vector JSON NOT NULL,
    model TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
----END
@ -0,0 +1,75 @@
import json
import logging
from typing import Optional

import duckdb

from database import MotionDatabase

_logger = logging.getLogger(__name__)


def extract_mp_votes(db_path: Optional[str] = None, limit: Optional[int] = None):
    """Extract individual MP votes from motions.voting_results and store them
    in the mp_votes table.

    Returns a dict with summary counts:
    - motions_scanned: number of motions inspected
    - mp_rows_inserted: number of mp_votes rows inserted
    - motions_skipped: number of motions skipped because mp_votes already existed
    """
    db = MotionDatabase(db_path=db_path) if db_path else MotionDatabase()

    conn = duckdb.connect(db.db_path)
    try:
        # support an optional limit to scan only a subset of motions
        if limit is not None:
            rows = conn.execute(
                "SELECT id, voting_results, date FROM motions LIMIT ?", (limit,)
            ).fetchall()
        else:
            rows = conn.execute(
                "SELECT id, voting_results, date FROM motions"
            ).fetchall()
    finally:
        conn.close()

    mp_rows_inserted = 0
    motions_skipped = 0
    motions_scanned = 0

    for motion_id, voting_results_json, date in rows:
        motions_scanned += 1
        try:
            if db.mp_votes_exists_for_motion(motion_id):
                _logger.debug(
                    "Skipping motion %s because mp_votes already exist", motion_id
                )
                motions_skipped += 1
                continue

            # voting_results may be stored as JSON text or as native JSON; ensure it's a dict
            if isinstance(voting_results_json, str):
                voting_results = json.loads(voting_results_json)
            else:
                voting_results = voting_results_json

            for actor, vote in (voting_results or {}).items():
                # Individual MP names contain a comma (e.g. "Last, F.")
                if "," not in actor:
                    continue

                inserted_id = db.insert_mp_vote(
                    motion_id=motion_id, mp_name=actor, vote=vote, date=date, party=None
                )
                if inserted_id and inserted_id > 0:
                    mp_rows_inserted += 1

        except Exception as e:
            _logger.error("Error processing motion %s: %s", motion_id, e)

    return {
        "motions_scanned": motions_scanned,
        "mp_rows_inserted": mp_rows_inserted,
        "motions_skipped": motions_skipped,
    }
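The comma heuristic used to separate individual MPs from party-level tallies can be illustrated with a standalone sketch; the payload below is hypothetical sample data, not from the real database:

```python
import json

# Sample voting_results payload: party-level tallies have no comma,
# individual MP names are "Achternaam, Initialen" (hypothetical data).
voting_results = json.loads(
    '{"VVD": "Voor", "Last, F.": "Voor", "de Jong, T.": "Tegen"}'
)

# Keep only the keys that look like individual MPs.
mp_votes = {actor: vote for actor, vote in voting_results.items() if "," in actor}
print(sorted(mp_votes))  # ['Last, F.', 'de Jong, T.']
```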
@ -0,0 +1,94 @@
import logging
from typing import Optional

import requests

from database import MotionDatabase

logger = logging.getLogger(__name__)


def normalize_mp_name(
    achternaam: str, initialen: Optional[str], tussenvoegsel: Optional[str]
) -> str:
    """Reconstruct the ActorNaam format used in voting_results keys.

    Format: "{Tussenvoegsel} {Achternaam}, {Initialen}", with sensible stripping when
    tussenvoegsel is missing.
    """
    parts = []
    if tussenvoegsel:
        parts.append(tussenvoegsel)
    parts.append(achternaam)
    name = " ".join(parts).strip()

    # Ensure the displayed name starts with an uppercase letter so
    # ORDER BY mp_name behaves predictably across databases that may
    # sort uppercase before lowercase. Only change the first character
    # to upper-case to avoid lowercasing other letters (e.g. hyphenated
    # or already capitalized parts).
    if name and name[0].islower():
        name = name[0].upper() + name[1:]
    if initialen:
        name = f"{name}, {initialen}"
    return name


def fetch_mp_metadata(
    db_path: str, odata_url: str = "https://odata.example/FractieZetelPersoon"
) -> int:
    """Fetch MP party membership and tenure from OData and upsert into the DB.

    Returns the number of records processed (inserted or updated).
    """
    session = requests.Session()
    try:
        resp = session.get(odata_url)
        resp.raise_for_status()
        data = resp.json()
    except Exception as e:
        logger.error("Failed to fetch MP metadata: %s", e)
        raise

    values = data.get("value") if isinstance(data, dict) else None
    if values is None:
        logger.error("Unexpected OData payload; missing 'value' list")
        return 0

    db = MotionDatabase(db_path)
    processed = 0

    for item in values:
        try:
            persoon = item.get("Persoon") or {}
            fractiezetel = item.get("FractieZetel") or {}
            fractie = fractiezetel.get("Fractie") or {}

            achternaam = persoon.get("Achternaam")
            initialen = persoon.get("Initialen")
            tussenvoegsel = persoon.get("Tussenvoegsel")
            persoon_id = persoon.get("Id")

            party = fractie.get("NaamNL")
            van = item.get("Van")
            tot_en_met = item.get("TotEnMet")

            if not achternaam:
                logger.debug("Skipping record without achternaam: %s", item)
                continue

            mp_name = normalize_mp_name(achternaam, initialen, tussenvoegsel)

            db.upsert_mp_metadata(
                mp_name=mp_name,
                party=party,
                van=van,
                tot_en_met=tot_en_met,
                persoon_id=persoon_id,
            )
            processed += 1
        except Exception:
            logger.exception("Error processing OData item: %s", item)

    logger.info("Processed %d MP metadata records", processed)
    return processed
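The name normalization maps OData person fields back to the ActorNaam key format used in voting_results. A standalone copy of the same logic with sample inputs (the names are hypothetical) shows the expected behavior:

```python
from typing import Optional


def normalize_mp_name(
    achternaam: str, initialen: Optional[str], tussenvoegsel: Optional[str]
) -> str:
    # Mirrors the pipeline's normalize_mp_name:
    # "{Tussenvoegsel} {Achternaam}, {Initialen}" with the first character
    # upper-cased so ORDER BY mp_name sorts predictably.
    parts = []
    if tussenvoegsel:
        parts.append(tussenvoegsel)
    parts.append(achternaam)
    name = " ".join(parts).strip()
    if name and name[0].islower():
        name = name[0].upper() + name[1:]
    if initialen:
        name = f"{name}, {initialen}"
    return name


print(normalize_mp_name("Jong", "T.", "de"))  # → "De Jong, T."
print(normalize_mp_name("Last", "F.", None))  # → "Last, F."
```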
@ -0,0 +1,116 @@
import json
import logging
from typing import Dict, Optional

import duckdb

from database import MotionDatabase

_logger = logging.getLogger(__name__)


def fuse_for_window(
    window_id: str, db_path: Optional[str] = None, model: Optional[str] = None
) -> Dict[str, int]:
    """Fuse SVD vectors with text embeddings for motions in a window.

    Parameters:
    - window_id: id of the window to process
    - db_path: optional path to the duckdb database (if None the MotionDatabase default is used)
    - model: optional model name to filter text embeddings

    Returns a dict with counts: inserted, skipped_missing_text, skipped_missing_svd, errors
    """
    # Create MotionDatabase using the provided path if given, otherwise use the default
    if db_path:
        db = MotionDatabase(db_path=db_path)
        conn = duckdb.connect(db_path)
    else:
        db = MotionDatabase()
        # MotionDatabase always exposes the path it uses
        conn = duckdb.connect(db.db_path)

    # Fetch SVD vectors for the window and entity_type=motion
    rows = conn.execute(
        "SELECT entity_id, vector FROM svd_vectors WHERE window_id = ? AND entity_type = ?",
        (window_id, "motion"),
    ).fetchall()
    _logger.debug("Found %d svd rows for window %s", len(rows), window_id)

    inserted = 0
    skipped_missing_text = 0
    skipped_missing_svd = 0
    errors = 0

    for entity_id, svd_json in rows:
        try:
            svd_vec = json.loads(svd_json)
        except Exception:
            _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
            skipped_missing_svd += 1
            continue

        # Look up the text embedding for this motion (most recent). If model is
        # provided, filter by model as well.
        if model:
            emb_row = conn.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ? AND model = ? ORDER BY created_at DESC LIMIT 1",
                (int(entity_id), model),
            ).fetchone()
        else:
            emb_row = conn.execute(
                "SELECT vector FROM embeddings WHERE motion_id = ? ORDER BY created_at DESC LIMIT 1",
                (int(entity_id),),
            ).fetchone()

        if not emb_row:
            skipped_missing_text += 1
            continue

        try:
            text_vec = json.loads(emb_row[0])
        except Exception:
            _logger.exception("Invalid text embedding JSON for motion %s", entity_id)
            skipped_missing_text += 1
            continue

        try:
            fused = list(svd_vec) + list(text_vec)
        except Exception:
            _logger.exception("Error concatenating vectors for motion %s", entity_id)
            errors += 1
            continue

        # store fused embedding and check result
        try:
            res = db.store_fused_embedding(
                int(entity_id),
                window_id,
                fused,
                svd_dims=len(svd_vec),
                text_dims=len(text_vec),
            )
            if res and res > 0:
                inserted += 1
            else:
                errors += 1
                _logger.error(
                    "Failed to store fused embedding for motion %s (db returned %s)",
                    entity_id,
                    res,
                )
        except Exception:
            _logger.exception(
                "Exception while storing fused embedding for motion %s", entity_id
            )
            errors += 1

    conn.close()

    return {
        "inserted": inserted,
        "skipped_missing_text": skipped_missing_text,
        "skipped_missing_svd": skipped_missing_svd,
        "errors": errors,
    }
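The fusion step itself is a plain concatenation of the stored JSON vectors, with the component lengths recorded as `svd_dims` and `text_dims`. A minimal sketch with made-up vectors:

```python
import json

# Hypothetical stored values: svd_vectors.vector and embeddings.vector are JSON text.
svd_json = "[0.5, -0.2, 0.1]"
text_json = "[0.9, 0.3]"

svd_vec = json.loads(svd_json)
text_vec = json.loads(text_json)

# Fused embedding = SVD part followed by text part.
fused = list(svd_vec) + list(text_vec)
print(fused)  # [0.5, -0.2, 0.1, 0.9, 0.3]
print(len(svd_vec), len(text_vec))  # stored alongside as svd_dims and text_dims
```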
@ -0,0 +1,206 @@
import json
import logging
from typing import Optional, Dict, List, Tuple

import numpy as np

try:
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds
    from scipy.linalg import orthogonal_procrustes

    _HAS_SCIPY = True
except Exception:
    # Provide lightweight fallbacks for environments without scipy
    csr_matrix = lambda x: x

    def svds(a, k=1):
        # fallback to numpy.linalg.svd on dense arrays
        U, s, Vt = np.linalg.svd(np.array(a), full_matrices=False)
        # np.linalg.svd returns singular values in descending order, so the
        # first k components are the largest (scipy's svds also returns the
        # k largest; callers re-sort by magnitude anyway)
        return U[:, :k], s[:k], Vt[:k, :]

    def orthogonal_procrustes(A, B):
        # simple orthogonal Procrustes via SVD: find R minimizing ||A R - B||
        U, _, Vt = np.linalg.svd(A.T.dot(B))
        R = U.dot(Vt)
        scale = 1.0
        return R, scale

    _HAS_SCIPY = False

import duckdb

from database import MotionDatabase

_logger = logging.getLogger(__name__)

# Map textual votes to numeric values for SVD
VOTE_MAP = {
    "Voor": 1.0,
    "voor": 1.0,
    "Tegen": -1.0,
    "tegen": -1.0,
    "Geen stem": 0.0,
    "Onbekend": 0.0,
    "Onbekend stem": 0.0,
    "Blanco": 0.0,
}


def _safe_k(mat: np.ndarray, k: int) -> int:
    """Return a safe k for svds: must be < min(mat.shape)."""
    if mat is None:
        return 0
    m, n = mat.shape
    min_dim = min(m, n)
    # svds requires k < min_dim
    if min_dim <= 1:
        return 0
    return min(k, min_dim - 1)


def _build_vote_matrix(
    db: MotionDatabase, start_date: str, end_date: str
) -> Tuple[np.ndarray, List[str], List[int]]:
    """Build dense vote matrix (mp x motion) for votes between start_date and end_date.

    Returns (matrix, mp_names, motion_ids)
    """
    conn = duckdb.connect(db.db_path)
    rows = conn.execute(
        "SELECT motion_id, mp_name, vote FROM mp_votes WHERE date BETWEEN ? AND ?",
        (start_date, end_date),
    ).fetchall()
    conn.close()

    if not rows:
        return np.zeros((0, 0)), [], []

    motion_ids = sorted({int(r[0]) for r in rows})
    mp_names = sorted({r[1] for r in rows})

    m = len(mp_names)
    n = len(motion_ids)
    mat = np.zeros((m, n), dtype=float)

    mp_index = {name: i for i, name in enumerate(mp_names)}
    motion_index = {mid: j for j, mid in enumerate(motion_ids)}

    for motion_id, mp_name, vote in rows:
        i = mp_index[mp_name]
        j = motion_index[int(motion_id)]
        val = VOTE_MAP.get(
            vote, VOTE_MAP.get(vote.strip() if isinstance(vote, str) else vote, 0.0)
        )
        try:
            mat[i, j] = float(val)
        except Exception:
            mat[i, j] = 0.0

    return mat, mp_names, motion_ids


def _procrustes_align(
    reference_anchor: np.ndarray,
    current_anchor: np.ndarray,
    min_overlap: int = 3,
) -> np.ndarray:
    """Align current_anchor to reference_anchor using orthogonal Procrustes.

    This function will only attempt alignment when there is a reasonable number of
    overlapping rows (default: min_overlap). If the overlap is too small or if any
    input is invalid, the original current_anchor is returned unchanged.

    Returns transformed_current_anchor
    """
    # basic validation
    if reference_anchor is None or current_anchor is None:
        return current_anchor

    if not isinstance(reference_anchor, np.ndarray) or not isinstance(
        current_anchor, np.ndarray
    ):
        return current_anchor

    # Determine overlap by number of available rows. If too small, skip alignment.
    n_ref = reference_anchor.shape[0]
    n_cur = current_anchor.shape[0]
    overlap = min(n_ref, n_cur)
    if overlap < min_overlap:
        _logger.debug(
            "Procrustes alignment skipped: overlap %s < min_overlap %s",
            overlap,
            min_overlap,
        )
        return current_anchor

    # Use only the overlapping rows to compute the orthogonal transform.
    ref_sub = reference_anchor[:overlap, :]
    cur_sub = current_anchor[:overlap, :]

    try:
        # orthogonal_procrustes(A, B) returns (R, scale) where R minimizes
        # ||A @ R - B||. Note that scipy's second return value is the sum of the
        # singular values of A.T @ B (a norm), not a multiplier, so we ignore it.
        # We want to transform current_anchor to align with reference_anchor, so
        # call orthogonal_procrustes(cur_sub, ref_sub) and apply the resulting R.
        R, _scale = orthogonal_procrustes(cur_sub, ref_sub)
        transformed = current_anchor.dot(R)
        return transformed
    except Exception:
        _logger.exception("Procrustes alignment failed")
        return current_anchor


def run_svd_for_window(
    db: MotionDatabase,
    window_id: str,
    start_date: str,
    end_date: str,
    k: int = 50,
) -> Dict:
    """Run SVD on votes in given date window and store vectors in DB.

    Returns metadata dict with keys: k_used, stored_mp, stored_motion
    """
    mat, mp_names, motion_ids = _build_vote_matrix(db, start_date, end_date)

    if mat.size == 0 or mat.shape[0] == 0 or mat.shape[1] == 0:
        return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}

    k_used = _safe_k(mat, k)

    if k_used <= 0:
        return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}

    # use sparse svds for efficiency
    try:
        A = csr_matrix(mat)
        U, s, Vt = svds(A, k=k_used)
        # svds does not guarantee ordering of singular values; sort descending
        idx = np.argsort(s)[::-1]
        s = s[idx]
        U = U[:, idx]
        Vt = Vt[idx, :]

        # weight by singular values
        mp_vecs = (U * s.reshape(1, -1)).tolist()  # m x k
        motion_vecs = (Vt.T * s.reshape(1, -1)).tolist()  # n x k

        stored_mp = 0
        stored_motion = 0
        for i, mp_name in enumerate(mp_names):
            db.store_svd_vector(window_id, "mp", mp_name, mp_vecs[i])
            stored_mp += 1

        for j, motion_id in enumerate(motion_ids):
            db.store_svd_vector(window_id, "motion", str(motion_id), motion_vecs[j])
            stored_motion += 1

        return {
            "k_used": k_used,
            "stored_mp": stored_mp,
            "stored_motion": stored_motion,
        }

    except Exception:
        _logger.exception("SVD failed for window")
        return {"k_used": 0, "stored_mp": 0, "stored_motion": 0}
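As a sanity check on the alignment logic, a self-contained numpy sketch (independent of this module, random data) showing that orthogonal Procrustes recovers a known rotation:

```python
import numpy as np

# Construct a random anchor matrix and a rotated copy of it.
rng = np.random.default_rng(0)
ref = rng.normal(size=(10, 3))

# Build a random orthogonal rotation via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
cur = ref @ Q.T  # rotated version of ref

# Orthogonal Procrustes: find R minimizing ||cur @ R - ref||_F.
U, _, Vt = np.linalg.svd(cur.T @ ref)
R = U @ Vt

aligned = cur @ R
print(np.allclose(aligned, ref, atol=1e-8))  # True: the rotation is recovered
```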
@ -0,0 +1,122 @@ |
|||||||
|
import logging |
||||||
|
import json |
||||||
|
from typing import Optional, List, Tuple |
||||||
|
|
||||||
|
import duckdb |
||||||
|
|
||||||
|
from database import MotionDatabase, db as default_db |
||||||
|
import ai_provider |
||||||
|
|
||||||
|
_logger = logging.getLogger(__name__) |
||||||
|
|
||||||
|
DEFAULT_MODEL = "qwen/qwen3-embedding-4b"


def _select_text(
    db: MotionDatabase, model: str, limit: Optional[int] = None
) -> List[Tuple[int, Optional[str]]]:
    """Select motions that do not yet have an embedding for `model`.

    Returns list of (motion_id, text).
    """
    conn = duckdb.connect(db.db_path)
    params = [model]
    # prefer layman_explanation > description > title (keep compatibility with existing tests)
    sql = (
        "SELECT m.id, COALESCE(m.layman_explanation, m.description, m.title) AS text"
        " FROM motions m"
        " LEFT JOIN embeddings e ON e.motion_id = m.id AND e.model = ?"
        " WHERE e.id IS NULL"
    )
    if limit:
        sql += " LIMIT ?"
        params.append(limit)

    try:
        rows = conn.execute(sql, params).fetchall()
        conn.close()
        results: List[Tuple[int, Optional[str]]] = []
        for r in rows:
            text_val = r[1]
            # treat empty strings as no text
            if text_val is None:
                text = None
            else:
                text = str(text_val).strip() or None
            results.append((int(r[0]), text))
        return results
    except Exception as exc:
        _logger.error("Error selecting motions for embeddings: %s", exc)
        try:
            conn.close()
        except Exception:
            pass
        return []


def ensure_text_embeddings(
    db_path: Optional[str] = None, model: Optional[str] = None
) -> Tuple[int, int, int, int]:
    """Ensure all motions have text embeddings for `model`.

    Returns tuple (stored_count, skipped_existing, skipped_no_text, errors).
    """
    model = model or DEFAULT_MODEL
    db = MotionDatabase(db_path) if db_path else default_db

    # motions to process
    to_process = _select_text(db, model)

    # how many already exist
    conn = duckdb.connect(db.db_path)
    try:
        total_motions = conn.execute("SELECT COUNT(*) FROM motions").fetchone()[0]
    except Exception:
        total_motions = 0

    try:
        existing = conn.execute(
            "SELECT COUNT(DISTINCT motion_id) FROM embeddings WHERE model = ?", (model,)
        ).fetchone()[0]
    except Exception:
        existing = 0

    conn.close()

    stored = 0
    skipped_no_text = 0
    errors = 0

    for motion_id, text in to_process:
        if not text:
            _logger.info("Skipping motion %s: no text available", motion_id)
            skipped_no_text += 1
            continue

        try:
            vec = ai_provider.get_embedding(text, model=model)
            if not isinstance(vec, list):
                _logger.warning(
                    "Embedding provider returned non-list for motion %s", motion_id
                )
                errors += 1
                continue

            res = db.store_embedding(motion_id, model, vec)
            if res and res > 0:
                stored += 1
            else:
                _logger.error(
                    "Failed to store embedding for motion %s (store returned %s)",
                    motion_id,
                    res,
                )
                errors += 1
        except Exception as exc:
            _logger.error(
                "Error computing/storing embedding for motion %s: %s", motion_id, exc
            )
            errors += 1

    skipped_existing = int(existing)
    return stored, skipped_existing, skipped_no_text, errors
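The selection rule in `_select_text` above combines a SQL `COALESCE` with a Python-side normalization of empty strings. A minimal standalone sketch of that rule (illustrative only, not part of the diff; `pick_text` is a hypothetical helper):

```python
from typing import Optional


def pick_text(layman: Optional[str], description: Optional[str], title: Optional[str]) -> Optional[str]:
    # SQL COALESCE returns the first non-NULL candidate...
    text_val = next((c for c in (layman, description, title) if c is not None), None)
    # ...and the Python side then treats empty/whitespace-only text as missing.
    if text_val is None:
        return None
    return str(text_val).strip() or None
```

Note the subtlety this preserves: an empty-string `description` still wins the `COALESCE` over a non-empty `title`, and only afterwards is it normalized to `None`, so the motion is counted as having no text rather than falling back to the title.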
@@ -0,0 +1,18 @@
[project]
name = "stemwijzer"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "duckdb>=1.3.2",
    "ibis-framework[duckdb]>=10.8.0",
    "openai>=1.99.7",
    "scipy>=1.11",
    "umap-learn>=0.5",
    "plotly>=5.0",
    "pytest>=9.0.2",
    "requests>=2.32.4",
    "schedule>=1.2.2",
    "streamlit>=1.48.0",
]
@@ -0,0 +1,9 @@
import ibis

con = ibis.duckdb.connect('data/motions.db')

print(con.tables)

for t in con.tables:
    print(con.table(t).head().execute().to_string())
@@ -0,0 +1,3 @@
# Run this to reset your database
from database import db
db.reset_database()
@@ -0,0 +1,183 @@
# scraper.py
import requests
from bs4 import BeautifulSoup
import time
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from database import db
from config import config


class MotionScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        })

    def scrape_motion_list(self, start_date: Optional[datetime] = None, end_date: Optional[datetime] = None) -> List[str]:
        """Scrape motion URLs from the main page"""
        if not start_date:
            start_date = datetime.now() - timedelta(days=730)  # 2 years ago
        if not end_date:
            end_date = datetime.now()

        motion_urls = []
        page = 1

        while True:
            try:
                url = f"{config.BASE_URL}?page={page}"
                response = self.session.get(url, timeout=30)
                response.raise_for_status()

                soup = BeautifulSoup(response.content, 'html.parser')

                # Find motion links (adjust selectors based on actual HTML structure)
                motion_links = soup.find_all('a', href=re.compile(r'/stemmingsuitslagen/'))

                if not motion_links:
                    break

                for link in motion_links:
                    href = link.get('href')
                    if href and href not in motion_urls:
                        motion_urls.append(href)

                page += 1
                time.sleep(config.SCRAPING_DELAY)

            except Exception as e:
                print(f"Error scraping page {page}: {e}")
                break

        return motion_urls

    def parse_motion_detail(self, motion_url: str) -> Optional[Dict]:
        """Parse individual motion details"""
        try:
            full_url = f"https://www.tweedekamer.nl{motion_url}" if motion_url.startswith('/') else motion_url
            response = self.session.get(full_url, timeout=30)
            response.raise_for_status()

            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract motion data (adjust selectors based on actual HTML structure)
            title = self._extract_title(soup)
            description = self._extract_description(soup)
            date = self._extract_date(soup)
            policy_area = self._extract_policy_area(soup)
            voting_results = self._extract_voting_results(soup)

            if not all([title, voting_results]):
                return None

            # Calculate winning margin
            total_votes = sum(1 for vote in voting_results.values() if vote in ['voor', 'tegen'])
            if total_votes == 0:
                return None

            votes_for = sum(1 for vote in voting_results.values() if vote == 'voor')
            winning_margin = abs(votes_for - (total_votes - votes_for)) / total_votes

            return {
                'title': title,
                'description': description or '',
                'date': date,
                'policy_area': policy_area or 'Onbekend',
                'voting_results': voting_results,
                'winning_margin': winning_margin,
                'url': full_url
            }

        except Exception as e:
            print(f"Error parsing motion {motion_url}: {e}")
            return None

    def _extract_title(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract motion title"""
        # Look for common title selectors
        selectors = ['h1', '.motion-title', '.title', 'h2']
        for selector in selectors:
            element = soup.select_one(selector)
            if element:
                return element.get_text(strip=True)
        return None

    def _extract_description(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract motion description"""
        # Look for description elements
        selectors = ['.motion-description', '.description', '.content', 'p']
        for selector in selectors:
            elements = soup.select(selector)
            if elements:
                return ' '.join(el.get_text(strip=True) for el in elements[:3])
        return None

    def _extract_date(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract motion date"""
        # Look for date patterns
        date_pattern = re.compile(r'\d{1,2}-\d{1,2}-\d{4}|\d{4}-\d{1,2}-\d{1,2}')
        text = soup.get_text()
        match = date_pattern.search(text)
        if match:
            return match.group()
        return datetime.now().strftime('%Y-%m-%d')

    def _extract_policy_area(self, soup: BeautifulSoup) -> Optional[str]:
        """Extract policy area/category"""
        # Look for category indicators
        text = soup.get_text().lower()
        for area in config.POLICY_AREAS[1:]:  # Skip "Alle"
            if area.lower() in text:
                return area
        return "Algemeen"

    def _extract_voting_results(self, soup: BeautifulSoup) -> Dict[str, str]:
        """Extract party voting results"""
        # This is a simplified extraction - you'll need to adjust based on actual HTML
        voting_results = {}

        # Look for voting tables or lists
        tables = soup.find_all('table')
        for table in tables:
            rows = table.find_all('tr')
            for row in rows:
                cells = row.find_all(['td', 'th'])
                if len(cells) >= 2:
                    party = cells[0].get_text(strip=True)
                    vote = cells[1].get_text(strip=True).lower()

                    if vote in ['voor', 'tegen', 'afwezig']:
                        voting_results[party] = vote

        # Fallback: simulate some voting data for testing
        if not voting_results:
            parties = ['VVD', 'PVV', 'CDA', 'D66', 'GL', 'SP', 'PvdA', 'CU', 'PvdD', 'FVD', '50PLUS', 'SGP']
            import random
            for party in parties:
                voting_results[party] = random.choice(['voor', 'tegen', 'afwezig'])

        return voting_results

    def run_scraping_job(self):
        """Main scraping job"""
        print("Starting motion scraping...")

        motion_urls = self.scrape_motion_list()
        print(f"Found {len(motion_urls)} motion URLs")

        successful_scrapes = 0
        for i, url in enumerate(motion_urls):
            print(f"Processing motion {i+1}/{len(motion_urls)}: {url}")

            motion_data = self.parse_motion_detail(url)
            if motion_data:
                if db.insert_motion(motion_data):
                    successful_scrapes += 1

            time.sleep(config.SCRAPING_DELAY)

        print(f"Scraping completed. Successfully scraped {successful_scrapes} motions.")


scraper = MotionScraper()
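The winning-margin computation in `parse_motion_detail` above excludes absent members from the denominator and measures the normalized vote gap. A self-contained sketch of that arithmetic (illustrative; `winning_margin` here is a hypothetical extraction of the inline code):

```python
from typing import Dict, Optional


def winning_margin(voting_results: Dict[str, str]) -> Optional[float]:
    # only 'voor'/'tegen' count toward the total; 'afwezig' is excluded
    total = sum(1 for v in voting_results.values() if v in ('voor', 'tegen'))
    if total == 0:
        return None  # mirrors parse_motion_detail returning None for all-absent motions
    votes_for = sum(1 for v in voting_results.values() if v == 'voor')
    # normalized absolute gap between 'voor' and 'tegen'
    return abs(votes_for - (total - votes_for)) / total
```

For example, with two 'voor', one 'tegen', and one 'afwezig', the margin is |2 - 1| / 3, i.e. one third; a unanimous vote yields 1.0.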
@@ -0,0 +1,128 @@
"""Compute summaries and embeddings for a small test batch of motions.

Usage:
  # dry-run (no network calls)
  python scripts/compute_test_batch.py --limit 20 --dry-run

  # run (will call AI provider; requires OPENROUTER_API_KEY)
  python scripts/compute_test_batch.py --limit 20

This script is intentionally simple and intended for manual invocation.
It will update motions.layman_explanation and store embeddings via db.store_embedding if available.
"""

from __future__ import annotations

import argparse
import logging
import sys
from typing import List

import duckdb

from config import config
import ai_provider
from database import db
from summarizer import MotionSummarizer


logger = logging.getLogger("compute_test_batch")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


def fetch_motion_candidates(limit: int) -> List[dict]:
    conn = duckdb.connect(config.DATABASE_PATH)
    try:
        # Prefer motions that still lack a layman_explanation so we don't re-process recent ones
        rows = conn.execute(
            "SELECT id, title, description FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' ORDER BY created_at DESC LIMIT ?",
            (limit,),
        ).fetchall()
        return [{"id": r[0], "title": r[1], "description": r[2] or ""} for r in rows]
    finally:
        conn.close()


def process_batch(limit: int = 20, dry_run: bool = False):
    summarizer = MotionSummarizer()
    motions = fetch_motion_candidates(limit)
    logger.info("Found %d motions to process", len(motions))

    conn = duckdb.connect(config.DATABASE_PATH)
    try:
        for i, m in enumerate(motions, start=1):
            mid = m["id"]
            title = m["title"]
            desc = m["description"]
            logger.info(
                "[%d/%d] Processing motion id=%s title=%s", i, len(motions), mid, title
            )

            if dry_run:
                logger.info(
                    "Dry run: would generate summary and embedding for motion %s", mid
                )
                continue

            # Generate summary
            summary = summarizer.generate_layman_explanation(title, desc)
            # Update DB
            try:
                conn.execute(
                    "UPDATE motions SET layman_explanation = ? WHERE id = ?",
                    (summary, mid),
                )
            except Exception as e:
                logger.exception("Failed to update motion %s: %s", mid, e)

            # Compute embedding and store
            try:
                emb = ai_provider.get_embedding(summary)
                store_fn = getattr(db, "store_embedding", None)
                if callable(store_fn):
                    store_fn(mid, "text-embedding-3-small", emb)
                    logger.info("Stored embedding for motion %s", mid)
                else:
                    logger.warning(
                        "No store_embedding available on db; skipping storage"
                    )
            except ai_provider.ProviderError as e:
                logger.exception(
                    "Failed to compute/store embedding for motion %s: %s", mid, e
                )

    finally:
        conn.close()


def main(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--limit", type=int, default=20, help="Number of motions to process")
    p.add_argument(
        "--dry-run",
        action="store_true",
        help="Do not call external APIs; just show what would run",
    )
    args = p.parse_args(argv)

    if args.dry_run:
        logger.info("Running in dry-run mode; no network calls will be made")

    # Safety: confirm when not dry-run
    if not args.dry_run:
        confirm = (
            input(
                f"This will call the AI provider for {args.limit} motions and may incur cost. Continue? (y/N): "
            )
            .strip()
            .lower()
        )
        if confirm not in ("y", "yes"):
            logger.info("Aborting per user choice")
            sys.exit(0)

    process_batch(limit=args.limit, dry_run=args.dry_run)


if __name__ == "__main__":
    main()
@@ -0,0 +1,35 @@
"""Motion-related simple types and JSON helpers.

Decision: MotionId is an alias for str for simplicity.
"""

from dataclasses import dataclass, asdict
from typing import List
import json

MotionId = str
Embedding = List[float]


@dataclass
class SimilarityNeighbor:
    motion_id: MotionId
    score: float


def to_json(neighbors: List[SimilarityNeighbor]) -> str:
    """Serialize a list of SimilarityNeighbor to a JSON string.

    The format is a JSON list of objects with keys 'motion_id' and 'score'.
    """
    list_of_dicts = [asdict(n) for n in neighbors]
    return json.dumps(list_of_dicts)


def from_json(json_str: str) -> List[SimilarityNeighbor]:
    """Deserialize a JSON string (list of dicts) into SimilarityNeighbor list."""
    parsed = json.loads(json_str)
    return [
        SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"]))
        for item in parsed
    ]
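The `to_json`/`from_json` helpers above form a lossless round trip for dataclass lists. A self-contained sketch (the dataclass and helpers are re-declared here so the example runs on its own; in the repo they live in the motion types module):

```python
from dataclasses import dataclass, asdict
from typing import List
import json


@dataclass
class SimilarityNeighbor:
    motion_id: str  # MotionId is an alias for str in the module above
    score: float


def to_json(neighbors: List[SimilarityNeighbor]) -> str:
    # serialize to a JSON list of {"motion_id": ..., "score": ...} objects
    return json.dumps([asdict(n) for n in neighbors])


def from_json(json_str: str) -> List[SimilarityNeighbor]:
    # rebuild dataclasses, coercing score back to float
    return [
        SimilarityNeighbor(motion_id=item["motion_id"], score=float(item["score"]))
        for item in json.loads(json_str)
    ]


neighbors = [SimilarityNeighbor("m-1", 0.93), SimilarityNeighbor("m-7", 0.88)]
assert from_json(to_json(neighbors)) == neighbors  # dataclass equality is field-wise
```

Because `@dataclass` generates `__eq__`, the round-trip assertion compares field values, which makes this format convenient for caching similarity results as a JSON column.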
@@ -0,0 +1,101 @@
# summarizer.py (refactored to use ai_provider)
from typing import Optional
import logging

import duckdb

from config import config
import ai_provider
from database import db

logger = logging.getLogger(__name__)


class MotionSummarizer:
    def __init__(self):
        # Stateless; use ai_provider functions directly
        pass

    def _build_prompt_messages(self, title: str, body_text: str) -> list[dict]:
        prompt = f"""
Leg deze Nederlandse parlementaire motie uit in eenvoudige, toegankelijke taal:

Titel: {title}
Tekst: {body_text}

Geef een uitleg van 2-3 zinnen die:
- Gebruik maakt van alledaagse taal
- De praktische impact op burgers uitlegt
- Politiek jargon vermijdt
- Neutraal en feitelijk blijft

Antwoord alleen met de uitleg, geen introductie of extra tekst.
"""
        return [
            {
                "role": "system",
                "content": "Je bent een expert in het uitleggen van politieke onderwerpen in eenvoudige taal voor Nederlandse burgers.",
            },
            {"role": "user", "content": prompt},
        ]

    def generate_layman_explanation(self, title: str, body_text: str) -> str:
        """Generate a layman-friendly explanation via ai_provider.

        Returns an empty string on failure (non-fatal).
        """
        messages = self._build_prompt_messages(title, body_text or "")
        try:
            return ai_provider.chat_completion(messages, model=config.QWEN_MODEL)
        except ai_provider.ProviderError:
            logger.exception("AI provider failed to generate summary")
            return ""

    def update_motion_summaries(
        self,
        compute_embeddings: bool = True,
        embedding_model: str = "qwen/qwen3-embedding-4b",
    ):
        """Find motions missing layman_explanation and generate summaries.

        Uses body_text when available, falls back to description, then title only.
        If compute_embeddings is True and database provides store_embedding, compute and store embeddings.
        """
        conn = duckdb.connect(config.DATABASE_PATH)
        try:
            rows = conn.execute(
                "SELECT id, title, description, body_text FROM motions WHERE layman_explanation IS NULL OR layman_explanation = '' LIMIT 50"
            ).fetchall()

            for motion_id, title, description, body_text in rows:
                input_text = body_text or description or ""
                summary = self.generate_layman_explanation(title, input_text)
                if summary is None:
                    summary = ""
                conn.execute(
                    "UPDATE motions SET layman_explanation = ? WHERE id = ?",
                    (summary, motion_id),
                )
                logger.info("Updated summary for motion %s", motion_id)

                if compute_embeddings and summary:
                    logger.info(
                        "Computing embedding for motion %s using model %s",
                        motion_id,
                        embedding_model,
                    )
                    # compute embedding and try to store via database helper if available
                    try:
                        emb = ai_provider.get_embedding(summary, model=embedding_model)
                        store_fn = getattr(db, "store_embedding", None)
                        if callable(store_fn):
                            store_fn(motion_id, embedding_model, emb)
                    except ai_provider.ProviderError:
                        logger.exception(
                            "Failed to compute/store embedding for motion %s", motion_id
                        )
        finally:
            conn.close()


summarizer = MotionSummarizer()
@@ -0,0 +1,16 @@
# test_single_insert.py
from database import db

test_motion = {
    'title': 'Test Motion',
    'description': 'This is a test motion',
    'date': '2024-01-01',
    'policy_area': 'Test',
    'voting_results': {'VVD': 'voor', 'PvdA': 'tegen'},
    'winning_margin': 0.5,
    'url': 'https://test.com/motion1'
}

success = db.insert_motion(test_motion)
print(f"Insert successful: {success}")
@@ -0,0 +1 @@
"""Make the tests directory a package so test helpers can be imported."""
@@ -0,0 +1,63 @@
import tempfile
import pytest

# Load test fixtures from the utils package so pytest can discover them.
pytest_plugins = ["tests.utils.migration_fixtures"]


@pytest.fixture
def tmp_duckdb_path(tmp_path):
    p = tmp_path / "test.db"
    return str(p)


@pytest.fixture
def tmp_duckdb_conn(tmp_duckdb_path):
    # Import duckdb lazily so running pytest doesn't fail on machines
    # where duckdb is not installed (CI / contributor machines that don't
    # need the duckdb-based fixtures). If duckdb is missing, skip this
    # fixture at runtime when it's requested.
    try:
        import duckdb
    except Exception:
        pytest.skip("duckdb not installed, skipping duckdb fixtures")

    conn = duckdb.connect(database=tmp_duckdb_path)
    yield conn
    try:
        conn.close()
    except Exception:
        pass


@pytest.fixture
def monkeypatch_ai_provider(monkeypatch):
    """Patch ai_provider.get_embedding to return a deterministic 16-dim vector."""
    import ai_provider

    fake = [0.01] * 16
    monkeypatch.setattr(ai_provider, "get_embedding", lambda text, model=None: fake)
    return fake


@pytest.fixture
def mock_odata_client(monkeypatch):
    """
    Patch requests.Session.get for OData calls.
    Returns a configurable mock; set mock_odata_client.response to override.
    """
    import requests
    from unittest.mock import MagicMock

    mock_response = MagicMock()
    mock_response.raise_for_status.return_value = None
    mock_response.json.return_value = {"value": []}

    class MockSession:
        response = mock_response

        def get(self, *args, **kwargs):
            return self.response

    monkeypatch.setattr(requests, "Session", MockSession)
    return mock_response
@@ -0,0 +1 @@
"""Fixtures package for tests."""
@@ -0,0 +1,40 @@
[
  {
    "motion_id": 1,
    "date": "2024-01-15",
    "voting_results": {
      "VVD": "voor",
      "PvdA": "tegen",
      "CDA": "voor",
      "D66": "voor",
      "Wilders, G.": "voor",
      "Yesilgöz-Zegerius, D.": "voor",
      "Jetten, R.A.A.": "voor"
    }
  },
  {
    "motion_id": 2,
    "date": "2024-02-10",
    "voting_results": {
      "VVD": "tegen",
      "PvdA": "voor",
      "CDA": "afwezig",
      "D66": "voor",
      "Wilders, G.": "tegen",
      "Yesilgöz-Zegerius, D.": "tegen",
      "Ploumen, L.J.": "voor"
    }
  },
  {
    "motion_id": 3,
    "date": "2024-03-05",
    "voting_results": {
      "VVD": "voor",
      "SP": "tegen",
      "GroenLinks": "voor",
      "PVV": "voor",
      "Van der Plas, C.": "voor",
      "Klever, N.C.": "voor"
    }
  }
]
@@ -0,0 +1,87 @@
import json
import os
import numpy as np
import pytest

# duckdb is an optional dependency in some environments; skip test if not available
duckdb = pytest.importorskip("duckdb")


def test_pipeline_end_to_end(tmp_path, monkeypatch):
    # ensure determinism for any random embedding generation
    np.random.seed(0)

    # prepare temp db
    db_path = str(tmp_path / "motions.db")

    # create the minimal MotionDatabase schema using existing code where possible
    from database import MotionDatabase

    db = MotionDatabase(db_path)

    # create embeddings table (migration would normally do this)
    conn = duckdb.connect(db.db_path)
    conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
    )

    # insert three motions
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t1", "d1", "u1", "ex1"),
    )
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t2", "d2", "u2", "ex2"),
    )
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t3", "d3", "u3", "ex3"),
    )

    # fetch ids
    rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
    ids = [r[0] for r in rows]

    # insert existing embedding for first motion
    vec = json.dumps([0.1] * 16)
    conn.execute(
        "INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
        (ids[0], "test-model", vec),
    )

    conn.close()

    # monkeypatch ai_provider.get_embedding to a deterministic vector
    import ai_provider

    def fake_get_embedding(text, model=None):
        # produce a deterministic vector based on seeded numpy
        return list(np.random.rand(16))

    monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)

    # run ensure_text_embeddings
    from pipeline.text_pipeline import ensure_text_embeddings

    stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
        db_path=db_path, model="test-model"
    )

    assert stored == 2
    assert skipped_existing == 1
    assert skipped_no_text == 0
    assert errors == 0

    # verify stored vectors length
    conn = duckdb.connect(db.db_path)
    rows = conn.execute(
        "SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
        ("test-model",),
    ).fetchall()
    conn.close()
    assert len(rows) == 3
    for r in rows:
        v = json.loads(r[0])
        assert len(v) == 16
@@ -0,0 +1,58 @@
import os
import pathlib
import sqlite3
import re
import pytest


def test_migration_file_exists_and_name():
    migrations_dir = pathlib.Path("migrations")
    expected_name = "2026-03-22-add-audit-events.sql"
    migration_path = migrations_dir / expected_name

    # File must exist
    assert migration_path.exists(), f"Migration file {migration_path} does not exist"

    # Name sanity check
    assert migration_path.name == expected_name


def _strip_sql_comments(sql_text: str) -> str:
    # Remove SQL single-line comments -- ... and C-style /* ... */
    # Use multiline-aware single-line removal for safety.
    no_single = re.sub(r"--.*?$", "", sql_text, flags=re.MULTILINE)
    no_block = re.sub(r"/\*.*?\*/", "", no_single, flags=re.DOTALL)
    return no_block.strip()


def test_optional_apply_sql_if_db_available():
    """
    If TEST_DB_URL is provided, attempt to apply the SQL.

    For safety this test will skip applying when the SQL is empty or commented out.
    Only sqlite URLs (sqlite:///path/to/db) are attempted here to avoid adding
    extra dependencies; other URL schemes will cause the test to be skipped.
    """
    db_url = os.environ.get("TEST_DB_URL")
    if not db_url:
        pytest.skip("TEST_DB_URL not set - skipping DB application")

    migration_path = pathlib.Path("migrations") / "2026-03-22-add-audit-events.sql"
    sql = migration_path.read_text(encoding="utf8")
    stripped = _strip_sql_comments(sql)
    if not stripped:
        pytest.skip("Migration SQL is empty or commented out - skipping application")

    # Only handle sqlite URLs here
    if db_url.startswith("sqlite:///"):
        db_path = db_url.replace("sqlite:///", "", 1)
        try:
            conn = sqlite3.connect(db_path)
            try:
                conn.executescript(sql)
            finally:
                conn.close()
        except Exception as e:
            pytest.skip(f"Could not apply SQL to sqlite DB: {e}")
    else:
        pytest.skip(f"TEST_DB_URL set but scheme not supported by this test: {db_url}")
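The `_strip_sql_comments` helper used by the migration tests is a best-effort regex stripper. A standalone sketch of the same approach, with its main caveat noted (the regexes do not understand string literals, so a `--` inside a quoted string would be stripped too; acceptable for these test fixtures):

```python
import re


def strip_sql_comments(sql_text: str) -> str:
    # remove single-line comments: "-- ..." to end of line
    no_single = re.sub(r"--.*?$", "", sql_text, flags=re.MULTILINE)
    # remove C-style block comments, including multi-line ones
    no_block = re.sub(r"/\*.*?\*/", "", no_single, flags=re.DOTALL)
    return no_block.strip()


sample = "-- header\nCREATE TABLE t (id INTEGER); /* note */"
print(strip_sql_comments(sample))  # CREATE TABLE t (id INTEGER);
```

A fully commented-out migration strips to the empty string, which is exactly the condition the tests use to decide that there is nothing to apply.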
@ -0,0 +1,85 @@ |
|||||||
|
import os
import re
import pathlib

import pytest

# small migration filename/header tests; keep imports minimal


MIGRATION_FILENAME = "2026-03-22-add-similarity-cache.sql"
MIGRATION_PATH = pathlib.Path("migrations") / MIGRATION_FILENAME


def _strip_sql_comments(sql: str) -> str:
    """Remove SQL single-line (-- ...) and C-style (/* ... */) comments.

    This is a best-effort stripper sufficient for the test's purpose.
    """
    # remove block comments
    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.S)
    # remove line comments
    sql = re.sub(r"--.*?$", "", sql, flags=re.M)
    return sql.strip()


def test_migration_file_exists_and_header():
    # file must exist
    assert MIGRATION_PATH.exists(), f"Migration file {MIGRATION_PATH} not found"

    text = MIGRATION_PATH.read_text(encoding="utf8")

    # header should reference the filename and purpose
    assert MIGRATION_FILENAME in text.splitlines()[0], (
        "First line should include the filename"
    )
    assert "similarity" in text.lower(), "Header should mention similarity"


def test_optional_apply_migration_safe():
    # If TEST_DB_URL is set, try to apply the SQL only if it contains non-comment statements.
    db_url = os.environ.get("TEST_DB_URL")
    sql = MIGRATION_PATH.read_text(encoding="utf8")
    stripped = _strip_sql_comments(sql)

    # If there is no DB url, consider this a filename/header validation test only.
    if not db_url:
        pytest.skip("TEST_DB_URL not set; skipping DB apply step")

    # If the SQL is empty (only comments), there is nothing to apply.
    if not stripped:
        pytest.skip("Migration contains no executable SQL; nothing to apply")

    # Otherwise attempt to execute the SQL. Be conservative: if drivers are missing or
    # the connection fails, skip the test rather than failing CI. Only unexpected errors
    # during execution should fail the test.
    if db_url.startswith("sqlite:"):
        import sqlite3

        # sqlite URL might be sqlite:///path or sqlite:///:memory:
        path = db_url.split("sqlite:", 1)[1]
        # normalize prefixes like ///
        path = path.lstrip("/") or ":memory:"
        conn = sqlite3.connect(path)
        try:
            conn.executescript(sql)
        finally:
            conn.close()
    elif db_url.startswith("postgresql:") or db_url.startswith("postgres:"):
        try:
            import psycopg2
        except Exception as e:  # pragma: no cover - driver may be absent in CI
            pytest.skip(f"psycopg2 not available: {e}")

        # psycopg2 accepts a DSN; rely on that here.
        conn = psycopg2.connect(db_url)
        try:
            cur = conn.cursor()
            cur.execute(sql)
            conn.commit()
        finally:
            conn.close()
    else:
        pytest.skip(f"DB URL scheme not supported by this test: {db_url}")
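For reference, the stripper's behaviour on a small sample (the helper is redefined here so the snippet is self-contained):

```python
import re


def strip_sql_comments(sql: str) -> str:
    # Same best-effort approach as the test helper above:
    # block comments first, then line comments, then trim whitespace.
    sql = re.sub(r"/\*.*?\*/", "", sql, flags=re.S)
    sql = re.sub(r"--.*?$", "", sql, flags=re.M)
    return sql.strip()


sample = "-- migration header\nCREATE TABLE t (x INT); /* trailing note */"
print(strip_sql_comments(sample))  # CREATE TABLE t (x INT);
```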
@@ -0,0 +1,29 @@
"""Smoke test for the migration test_db fixture.

This test imports the `test_db` fixture and asserts expected behavior in two
cases:

- If the environment variable TEST_DB_URL is not set, the fixture should yield
  None.
- If TEST_DB_URL is set, the fixture should yield a connection-like object
  (we check for an object with a `cursor` attribute or the sqlite3 connection
  type).
"""

import os


def test_migration_fixture_smoke(test_db):
    """Smoke test ensuring the test_db fixture yields expected values."""
    url = os.environ.get("TEST_DB_URL")
    if not url:
        assert test_db is None
    else:
        # For sqlite we expect a sqlite3.Connection, which has a 'cursor'
        # method. Be permissive and accept any object with a 'cursor'
        # or 'execute' attribute.
        assert test_db is not None
        assert hasattr(test_db, "cursor") or hasattr(test_db, "execute")
@@ -0,0 +1,49 @@
import pytest

import ai_provider


class DummyResponse:
    def __init__(self, status_code=200, json_data=None):
        self.status_code = status_code
        self._json = json_data or {}

    def json(self):
        return self._json


def test_get_embedding_success(monkeypatch):
    fake = DummyResponse(json_data={"data": [{"embedding": [0.1, 0.2, 0.3]}]})

    def fake_post(url, json, headers, timeout):
        return fake

    monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
    monkeypatch.setattr("requests.post", fake_post)

    emb = ai_provider.get_embedding("hello world")
    assert emb == [0.1, 0.2, 0.3]


def test_chat_completion_success(monkeypatch):
    fake = DummyResponse(json_data={"choices": [{"message": {"content": "summary"}}]})

    def fake_post(url, json, headers, timeout):
        return fake

    monkeypatch.setenv("OPENROUTER_API_KEY", "sk-test")
    monkeypatch.setattr("requests.post", fake_post)

    out = ai_provider.chat_completion([{"role": "user", "content": "hi"}])
    assert out == "summary"


def test_missing_api_key_raises(monkeypatch):
    # Ensure the env var is not set
    monkeypatch.delenv("OPENROUTER_API_KEY", raising=False)

    with pytest.raises(ai_provider.ProviderError):
        ai_provider.get_embedding("x")
@@ -0,0 +1,74 @@
import json

import duckdb

from pipeline.extract_mp_votes import extract_mp_votes
from database import MotionDatabase


def test_extract_mp_votes(tmp_path):
    db_file = tmp_path / "test.db"

    # Initialize the database schema
    MotionDatabase(db_path=str(db_file))

    # Load fixture
    fixture_path = "tests/fixtures/sample_voting_results.json"
    with open(fixture_path, "r") as fh:
        fixtures = json.load(fh)

    # Insert motions into the motions table
    conn = duckdb.connect(str(db_file))
    try:
        for item in fixtures:
            motion_id = item.get("motion_id")
            date = item.get("date")
            voting_results = item.get("voting_results")

            conn.execute(
                """
                INSERT INTO motions (id, title, description, date, policy_area, voting_results, winning_margin, url)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    motion_id,
                    f"Test Motion {motion_id}",
                    "",
                    date,
                    "Test",
                    json.dumps(voting_results),
                    0.5,
                    f"http://example/{motion_id}",
                ),
            )
    finally:
        conn.close()

    # Run extraction
    res = extract_mp_votes(db_path=str(db_file))

    # Expected MP rows: count keys that contain a comma in the fixtures
    expected_mp_count = 0
    for item in fixtures:
        for k in item.get("voting_results", {}).keys():
            if "," in k:
                expected_mp_count += 1

    assert res["mp_rows_inserted"] == expected_mp_count
    assert res["motions_skipped"] == 0

    # Verify mp_votes contains only rows with a comma in mp_name and that the count matches
    conn = duckdb.connect(str(db_file))
    try:
        rows = conn.execute("SELECT mp_name FROM mp_votes").fetchall()
    finally:
        conn.close()

    assert len(rows) == expected_mp_count
    for (mp_name,) in rows:
        assert "," in mp_name

    # Running again should be idempotent: no new mp rows, motions_skipped > 0
    res2 = extract_mp_votes(db_path=str(db_file))
    assert res2["mp_rows_inserted"] == 0
    assert res2["motions_skipped"] > 0
@@ -0,0 +1,103 @@
import requests
import pytest

try:
    import duckdb
except Exception:
    pytest.skip(
        "duckdb not installed, skipping fetch_mp_metadata tests",
        allow_module_level=True,
    )

from pipeline.fetch_mp_metadata import fetch_mp_metadata, normalize_mp_name


class MockResponse:
    def __init__(self, data, status_code=200):
        self._data = data
        self.status_code = status_code

    def raise_for_status(self):
        if not (200 <= self.status_code < 300):
            raise requests.HTTPError(f"status {self.status_code}")

    def json(self):
        return self._data


class MockSession:
    def __init__(self, response):
        self._response = response

    def get(self, url):
        return self._response


def test_fetch_mp_metadata_idempotent(tmp_path, monkeypatch):
    # Prepare a canned OData response with two FractieZetelPersoon records
    data = {
        "value": [
            {
                "Persoon": {
                    "Achternaam": "Yesilgöz-Zegerius",
                    "Initialen": "D.",
                    "Tussenvoegsel": None,
                    "Id": "guid-1",
                },
                "FractieZetel": {"Fractie": {"NaamNL": "VVD"}},
                "Van": "2023-01-01",
                "TotEnMet": None,
            },
            {
                "Persoon": {
                    "Achternaam": "Plas",
                    "Initialen": "C.",
                    "Tussenvoegsel": "van der",
                    "Id": "guid-2",
                },
                "FractieZetel": {"Fractie": {"NaamNL": "BBB"}},
                "Van": "2023-06-01",
                "TotEnMet": "2024-01-01",
            },
        ]
    }

    mock_resp = MockResponse(data)
    mock_session = MockSession(mock_resp)

    # Patch requests.Session to return our mock session
    monkeypatch.setattr(requests, "Session", lambda: mock_session)

    db_path = str(tmp_path / "test.db")

    # First run
    count = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
    assert count == 2

    # Verify DB contents
    conn = duckdb.connect(db_path)
    rows = conn.execute(
        "SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata ORDER BY mp_name"
    ).fetchall()
    conn.close()

    assert len(rows) == 2

    # Check normalized names (DuckDB returns datetime.date for DATE columns; compare as strings)
    assert rows[0][0] == normalize_mp_name("Plas", "C.", "van der")
    assert rows[0][1] == "BBB"
    assert str(rows[0][2]) == "2023-06-01"
    assert str(rows[0][3]) == "2024-01-01"
    assert rows[0][4] == "guid-2"

    assert rows[1][0] == normalize_mp_name("Yesilgöz-Zegerius", "D.", None)
    assert rows[1][1] == "VVD"
    assert str(rows[1][2]) == "2023-01-01"
    assert rows[1][3] is None
    assert rows[1][4] == "guid-1"

    # Run again to assert idempotence (no exception and same count processed)
    count2 = fetch_mp_metadata(db_path=db_path, odata_url="http://example/odata")
    assert count2 == 2
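The name format the test relies on can be illustrated with a hypothetical reconstruction of `normalize_mp_name` (the real implementation lives in `pipeline/fetch_mp_metadata.py`; this sketch only assumes it produces the comma-separated "Achternaam, Initialen tussenvoegsel" form that the voting-result keys use):

```python
def normalize_mp_name(achternaam, initialen, tussenvoegsel):
    # Hypothetical sketch, not the pipeline's actual code: join surname and
    # initials in the "Achternaam, Initialen tussenvoegsel" form, dropping
    # the tussenvoegsel when it is None or empty.
    name = f"{achternaam}, {initialen}"
    if tussenvoegsel:
        name += f" {tussenvoegsel}"
    return name


print(normalize_mp_name("Plas", "C.", "van der"))  # Plas, C. van der
```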
@@ -0,0 +1,79 @@
import json

import duckdb

from database import MotionDatabase
from pipeline.fusion import fuse_for_window


def test_fuse_for_window(tmp_path):
    db_path = str(tmp_path / "motions.db")

    # Create MotionDatabase (this initializes the schema except embeddings)
    db = MotionDatabase(db_path=db_path)

    # Create the embeddings table (migration not run by MotionDatabase)
    conn = duckdb.connect(db_path)
    conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS embeddings (
            id INTEGER DEFAULT nextval('embeddings_id_seq'),
            motion_id INTEGER NOT NULL,
            model TEXT NOT NULL,
            vector JSON NOT NULL,
            created_at TIMESTAMP DEFAULT current_timestamp,
            PRIMARY KEY (id)
        )
        """
    )
    conn.close()

    # Insert 3 synthetic SVD vectors (k=4)
    svd1 = [0.1, 0.2, 0.3, 0.4]
    svd2 = [0.2, 0.1, 0.0, -0.1]
    svd3 = [0.9, 0.8, 0.7, 0.6]

    db.store_svd_vector("2024-Q1", "motion", "1", svd1)
    db.store_svd_vector("2024-Q1", "motion", "2", svd2)
    db.store_svd_vector("2024-Q1", "motion", "3", svd3)

    # Insert text embeddings for motions 1 and 2 (16 dims)
    text1 = [float(i) / 100.0 for i in range(16)]
    text2 = [float(i) / 50.0 for i in range(16)]

    conn = duckdb.connect(db_path)
    conn.execute(
        "INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
        (1, "text-model-1", json.dumps(text1)),
    )
    conn.execute(
        "INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, current_timestamp)",
        (2, "text-model-1", json.dumps(text2)),
    )
    conn.close()

    result = fuse_for_window("2024-Q1", db_path=db_path)

    assert result["inserted"] == 2
    assert result["skipped_missing_text"] == 1

    # Verify fused embeddings were stored
    conn = duckdb.connect(db_path)
    rows = conn.execute(
        "SELECT motion_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE window_id = ?",
        ("2024-Q1",),
    ).fetchall()
    conn.close()

    # Expect two rows, for motions 1 and 2
    assert len(rows) == 2

    for motion_id, vector_json, svd_dims, text_dims in rows:
        vec = json.loads(vector_json)
        assert svd_dims == 4
        assert text_dims == 16
        assert len(vec) == 20
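The dimensionality assertions above follow from simple concatenation. A minimal sketch of what the fusion step presumably does (assuming `fuse_for_window` concatenates each motion's SVD vector with its text embedding; the actual implementation is in `pipeline/fusion.py`):

```python
def fuse(svd_vec, text_vec):
    # Sketch under the assumption that fusion is plain concatenation:
    # a k=4 SVD vector plus a 16-dim text embedding gives 20 fused dims.
    return list(svd_vec) + list(text_vec)


fused = fuse([0.1, 0.2, 0.3, 0.4], [0.0] * 16)
print(len(fused))  # 20
```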
@@ -0,0 +1,31 @@
import pytest


def test_embeddings_migration_creates_table(tmp_path):
    try:
        import duckdb
    except ImportError:
        pytest.skip("duckdb is not installed")

    db_file = str(tmp_path / "migrations_test.db")
    conn = duckdb.connect(database=db_file)
    try:
        with open("migrations/2026-03-19-add-embeddings.sql", "r") as fh:
            sql = fh.read()
        conn.execute(sql)
        # Use the sequence to set the id if present, otherwise provide an explicit id
        try:
            next_id = conn.execute("SELECT nextval('embeddings_id_seq')").fetchone()[0]
        except Exception:
            next_id = 1
        conn.execute(
            "INSERT INTO embeddings (id, motion_id, model, vector) VALUES (?, ?, ?, ?)",
            (next_id, 1, "m1", "[0.1, 0.2]"),
        )
        res = conn.execute(
            "SELECT motion_id, model FROM embeddings WHERE motion_id = 1"
        ).fetchall()
        assert len(res) == 1
        assert res[0][1] == "m1"
    finally:
        conn.close()
@@ -0,0 +1,219 @@
import re
from pathlib import Path

try:
    import duckdb

    DB_BACKEND = "duckdb"
except Exception:
    import sqlite3

    DB_BACKEND = "sqlite3"


MIGRATIONS = [
    (
        "migrations/2026_03_21__create_mp_votes.sql",
        "mp_votes",
        [
            "id",
            "motion_id",
            "mp_name",
            "party",
            "vote",
            "date",
            "created_at",
        ],
    ),
    (
        "migrations/2026_03_21__create_mp_metadata.sql",
        "mp_metadata",
        [
            "mp_name",
            "party",
            "van",
            "tot_en_met",
            "persoon_id",
        ],
    ),
    (
        "migrations/2026_03_21__create_svd_vectors.sql",
        "svd_vectors",
        [
            "id",
            "window_id",
            "entity_type",
            "entity_id",
            "vector",
            "model",
            "created_at",
        ],
    ),
    (
        "migrations/2026_03_21__create_fused_embeddings.sql",
        "fused_embeddings",
        [
            "id",
            "motion_id",
            "window_id",
            "vector",
            "svd_dims",
            "text_dims",
            "created_at",
        ],
    ),
]


def test_run_migrations_and_tables(tmp_path):
    db_path = tmp_path / "test.db"
    if DB_BACKEND == "duckdb":
        conn = duckdb.connect(str(db_path))
    else:
        conn = sqlite3.connect(str(db_path))

    for sql_path, table_name, expected_cols in MIGRATIONS:
        p = Path(sql_path)
        assert p.exists(), f"Migration file {sql_path} must exist"
        sql = p.read_text()

        # If using sqlite3, transform the SQL to be sqlite-compatible
        if DB_BACKEND == "sqlite3":
            # remove CREATE SEQUENCE lines
            lines = [
                line
                for line in sql.splitlines()
                if not line.strip().upper().startswith("CREATE SEQUENCE")
            ]
            sql2 = "\n".join(lines)
            # remove DEFAULT nextval(...) occurrences
            sql2 = re.sub(
                r"DEFAULT\s+nextval\('[^']+'\)", "", sql2, flags=re.IGNORECASE
            )
            # replace the JSON type with TEXT
            sql2 = re.sub(r"\bJSON\b", "TEXT", sql2, flags=re.IGNORECASE)
            # execute as a script (multiple statements)
            conn.executescript(sql2)
        else:
            # execute the migration SQL
            conn.execute(sql)

        # check columns via pragma (both backends support PRAGMA table_info
        # and both return a cursor-like object with fetchall)
        rows = conn.execute(f"PRAGMA table_info('{table_name}')").fetchall()
        col_names = [r[1] for r in rows]

        for col in expected_cols:
            assert col in col_names, (
                f"Column {col} missing in table {table_name}, got {col_names}"
            )

        # perform a simple insert + select to validate a basic round-trip
        if table_name == "mp_votes":
            if DB_BACKEND == "duckdb":
                conn.execute(
                    "INSERT INTO mp_votes (motion_id, mp_name, party, vote, date) VALUES (1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
                )
                res = conn.execute(
                    "SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE motion_id=1"
                ).fetchone()
                # DuckDB returns datetime.date for DATE columns; normalise to string
                assert res[:4] == (1, "Jane Doe", "PartyX", "Yea")
                assert str(res[4]) == "2026-03-21"
            else:
                # sqlite: id has no default after the transformation, so provide it explicitly
                conn.execute(
                    "INSERT INTO mp_votes (id, motion_id, mp_name, party, vote, date) VALUES (1, 1, 'Jane Doe', 'PartyX', 'Yea', '2026-03-21')"
                )
                res = conn.execute(
                    "SELECT motion_id, mp_name, party, vote, date FROM mp_votes WHERE id=1"
                ).fetchone()
                assert res == (1, "Jane Doe", "PartyX", "Yea", "2026-03-21")

        elif table_name == "mp_metadata":
            conn.execute(
                "INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id) VALUES ('Jane Doe', 'PartyX', '2020-01-01', '2024-12-31', 'pid-123')"
            )
            res = conn.execute(
                "SELECT mp_name, party, van, tot_en_met, persoon_id FROM mp_metadata WHERE mp_name='Jane Doe'"
            ).fetchone()
            # DuckDB returns datetime.date for DATE columns; normalise to string
            assert res[0] == "Jane Doe"
            assert res[1] == "PartyX"
            assert str(res[2]) == "2020-01-01"
            assert str(res[3]) == "2024-12-31"
            assert res[4] == "pid-123"

        elif table_name == "svd_vectors":
            # JSON value stored as text
            if DB_BACKEND == "duckdb":
                conn.execute(
                    "INSERT INTO svd_vectors (window_id, entity_type, entity_id, vector, model) VALUES ('w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
                )
                res = conn.execute(
                    "SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE window_id='w1'"
                ).fetchone()
                # Note: DuckDB may return the JSON column as a string; compare the string form
                assert res[0] == "w1"
                assert res[1] == "typeA"
                assert res[2] == "e1"
                assert str(res[3]) == "[1,2,3]"
                assert res[4] == "m1"
            else:
                # sqlite: provide the id explicitly
                conn.execute(
                    "INSERT INTO svd_vectors (id, window_id, entity_type, entity_id, vector, model) VALUES (1, 'w1', 'typeA', 'e1', '[1,2,3]', 'm1')"
                )
                res = conn.execute(
                    "SELECT window_id, entity_type, entity_id, vector, model FROM svd_vectors WHERE id=1"
                ).fetchone()
                assert res == ("w1", "typeA", "e1", "[1,2,3]", "m1")

        elif table_name == "fused_embeddings":
            if DB_BACKEND == "duckdb":
                conn.execute(
                    "INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims) VALUES (2, 'w2', '[0.1,0.2]', 16, 128)"
                )
                res = conn.execute(
                    "SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE motion_id=2"
                ).fetchone()
                assert res[0] == 2
                assert res[1] == "w2"
                assert str(res[2]) == "[0.1,0.2]"
                assert res[3] == 16
                assert res[4] == 128
            else:
                conn.execute(
                    "INSERT INTO fused_embeddings (id, motion_id, window_id, vector, svd_dims, text_dims) VALUES (1, 2, 'w2', '[0.1,0.2]', 16, 128)"
                )
                res = conn.execute(
                    "SELECT motion_id, window_id, vector, svd_dims, text_dims FROM fused_embeddings WHERE id=1"
                ).fetchone()
                assert res == (2, "w2", "[0.1,0.2]", 16, 128)

    conn.close()
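The sqlite compatibility transform used by the test can be exercised on its own. The snippet below applies the same three rewrites to a representative DuckDB-style migration fragment (the SQL here is illustrative, not a real migration file):

```python
import re

sql = """CREATE SEQUENCE IF NOT EXISTS mp_votes_id_seq START 1;
CREATE TABLE mp_votes (
    id INTEGER DEFAULT nextval('mp_votes_id_seq'),
    vector JSON
);"""

# Drop sequence DDL, strip nextval defaults, map JSON to TEXT: the same
# three rewrites the test applies before handing the script to sqlite3.
lines = [
    line
    for line in sql.splitlines()
    if not line.strip().upper().startswith("CREATE SEQUENCE")
]
out = "\n".join(lines)
out = re.sub(r"DEFAULT\s+nextval\('[^']+'\)", "", out, flags=re.IGNORECASE)
out = re.sub(r"\bJSON\b", "TEXT", out, flags=re.IGNORECASE)
print(out)
```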
@@ -0,0 +1,5 @@
def test_scientific_deps_present():
    with open("pyproject.toml") as fh:
        content = fh.read()
    assert "scipy" in content
    assert "umap-learn" in content
    assert "plotly" in content
@@ -0,0 +1,63 @@
import numpy as np

from database import db as motion_db
from pipeline.svd_pipeline import (
    _safe_k,
    _build_vote_matrix,
    _procrustes_align,
    run_svd_for_window,
)


def test_safe_k_and_build_and_run(tmp_path):
    np.random.seed(0)
    # reset the DB file for this test
    db_path = tmp_path / "test.db"
    # point the MotionDatabase at this test DB
    motion_db.db_path = str(db_path)
    motion_db._init_database()

    # Create a synthetic dataset: 5 MPs x 6 motions
    mps = [f"MP_{i}" for i in range(5)]
    motions = list(range(100, 106))
    dates = ["2020-01-0" + str(i + 1) for i in range(6)]

    votes = ["Voor", "Tegen", "Geen stem"]

    # insert votes: fill the full matrix using the MotionDatabase helper
    for j, motion_id in enumerate(motions):
        for i, mp in enumerate(mps):
            vote = votes[(i + j) % len(votes)]
            motion_db.insert_mp_vote(motion_id, mp, vote, date=dates[j])

    mat, mp_names, motion_ids = _build_vote_matrix(
        motion_db, "2020-01-01", "2020-01-10"
    )
    assert mat.shape == (5, 6)

    # _safe_k: with k=10 -> min_dim=5 -> returns 4
    assert _safe_k(mat, 10) == 4
    assert _safe_k(mat, 3) == 3

    # run_svd_for_window with k=10 -> should use k_used=4
    res = run_svd_for_window(motion_db, "w1", "2020-01-01", "2020-01-10", k=10)
    assert res["k_used"] == 4
    assert res["stored_mp"] == 5
    assert res["stored_motion"] == 6


def test_procrustes_align():
    np.random.seed(0)
    # create reference anchors and current anchors rotated + noise
    ref = np.random.randn(10, 3)
    # create an orthogonal rotation
    Q, _ = np.linalg.qr(np.random.randn(3, 3))
    cur = ref.dot(Q) + 0.1 * np.random.randn(10, 3)

    before = np.linalg.norm(cur - ref)
    transformed = _procrustes_align(ref, cur)
    after = np.linalg.norm(transformed - ref)

    assert after < before
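A sketch of the alignment under test, assuming `_procrustes_align` is built on `scipy.linalg.orthogonal_procrustes` (per the commit note, the second return value is a sum of singular values, a norm-like quantity, not a scale factor to multiply by, so it is ignored):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def procrustes_align(ref, cur):
    # Find the orthogonal R minimising ||cur @ R - ref||_F and apply it.
    # The second return value is NOT a multiplier; ignoring it keeps the
    # aligned vectors on the same scale as `cur`.
    R, _ = orthogonal_procrustes(cur, ref)
    return cur @ R


np.random.seed(0)
ref = np.random.randn(10, 3)
Q, _ = np.linalg.qr(np.random.randn(3, 3))
# A pure rotation with no noise should be recovered exactly
aligned = procrustes_align(ref, ref @ Q)
```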
@@ -0,0 +1,80 @@
import json

import pytest

# duckdb is an optional dependency in some environments; skip the test if it is unavailable
duckdb = pytest.importorskip("duckdb")

from database import MotionDatabase


def test_ensure_text_embeddings_monkeypatch(tmp_path, monkeypatch):
    # prepare a temp db
    db_path = str(tmp_path / "motions.db")
    db = MotionDatabase(db_path)

    # create the embeddings table (a migration would normally do this)
    conn = duckdb.connect(db.db_path)
    conn.execute("CREATE SEQUENCE IF NOT EXISTS embeddings_id_seq START 1")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (id INTEGER PRIMARY KEY DEFAULT nextval('embeddings_id_seq'), motion_id INTEGER, model TEXT, vector JSON, created_at TIMESTAMP)"
    )

    # insert three motions
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t1", "d1", "u1", "ex1"),
    )
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t2", "d2", "u2", "ex2"),
    )
    conn.execute(
        "INSERT INTO motions (title, description, url, layman_explanation) VALUES (?, ?, ?, ?)",
        ("t3", "d3", "u3", "ex3"),
    )

    # fetch ids
    rows = conn.execute("SELECT id FROM motions ORDER BY id").fetchall()
    ids = [r[0] for r in rows]

    # insert an existing embedding for the first motion
    vec = json.dumps([0.1] * 16)
    conn.execute(
        "INSERT INTO embeddings (motion_id, model, vector) VALUES (?, ?, ?)",
        (ids[0], "test-model", vec),
    )

    conn.close()

    # monkeypatch ai_provider.get_embedding
    def fake_get_embedding(text, model=None):
        return [0.1] * 16

    monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)

    # run ensure_text_embeddings
    from pipeline.text_pipeline import ensure_text_embeddings

    stored, skipped_existing, skipped_no_text, errors = ensure_text_embeddings(
        db_path=db_path, model="test-model"
    )

    assert stored == 2
    assert skipped_existing == 1
    assert skipped_no_text == 0
    assert errors == 0

    # verify the stored vectors' length
    conn = duckdb.connect(db.db_path)
    rows = conn.execute(
        "SELECT vector FROM embeddings WHERE model = ? ORDER BY motion_id",
        ("test-model",),
    ).fetchall()
    conn.close()
    assert len(rows) == 3
    for r in rows:
        v = json.loads(r[0])
        assert len(v) == 16
@@ -0,0 +1,22 @@
import json

from src.types.motion_types import SimilarityNeighbor, to_json, from_json


def test_similarity_neighbor_json_roundtrip():
    neighbors = [
        SimilarityNeighbor(motion_id="m1", score=0.9),
        SimilarityNeighbor(motion_id="m2", score=0.75),
    ]

    # Serialize to a JSON string
    json_str = to_json(neighbors)
    assert isinstance(json_str, str)

    # Ensure it's valid JSON
    parsed = json.loads(json_str)
    assert isinstance(parsed, list)

    # Deserialize back to objects
    recovered = from_json(json_str)
    assert recovered == neighbors
@@ -0,0 +1,66 @@
"""
Test helper fixtures for database migrations.

Provides a pytest fixture `test_db` that inspects the environment variable
`TEST_DB_URL` to decide what to yield:

- If `TEST_DB_URL` is not set, the fixture yields None. This allows tests to
  be skipped or operate in a no-database mode in CI or local runs where a
  test database is not available.
- If `TEST_DB_URL` is set and starts with "sqlite", an sqlite3 connection is
  created via `sqlite3.connect` and yielded. The connection is closed after
  the test completes.

Decision: keep this fixture lightweight and focused on sqlite for local
smoke-testing. If other database backends are needed later, expand this
fixture accordingly.
"""

import os
import sqlite3

import pytest


@pytest.fixture
def test_db():
    """Yield a test database connection or None.

    Behavior:
    - If TEST_DB_URL is not set in the environment, yield None.
    - If TEST_DB_URL is set and begins with 'sqlite', open an sqlite3
      connection and yield it. The connection will be closed when the test
      finishes.
    """
    url = os.environ.get("TEST_DB_URL")
    if not url:
        yield None
        return

    # Only support sqlite URLs in this lightweight fixture.
    if url.startswith("sqlite"):
        # For sqlite URLs, accept either a bare file path or a sqlite://-style
        # URL. sqlite3.connect handles file paths; if a sqlite:// prefix is
        # present, strip it.
        path = url
        if path.startswith("sqlite:///"):
            # sqlite:///path => path
            path = path[len("sqlite:///"):]
        elif path.startswith("sqlite://"):
            path = path[len("sqlite://"):]

        conn = sqlite3.connect(path)
        try:
            yield conn
        finally:
            try:
                conn.close()
            except Exception:
                # Best-effort close; tests shouldn't fail on close errors.
                pass
        return

    # Unknown or unsupported TEST_DB_URL scheme — yield None to keep tests
    # tolerant in environments where the fixture can't create a connection.
    yield None
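The URL-to-path handling in the fixture follows the common convention that three slashes denote a relative path and four an absolute one; isolated for illustration:

```python
def sqlite_url_to_path(url: str) -> str:
    # Mirror of the fixture's prefix stripping: "sqlite:///rel.db" -> "rel.db",
    # "sqlite:////abs/path.db" -> "/abs/path.db" (the fourth slash survives).
    if url.startswith("sqlite:///"):
        return url[len("sqlite:///"):]
    if url.startswith("sqlite://"):
        return url[len("sqlite://"):]
    return url


print(sqlite_url_to_path("sqlite:///tmp_test.db"))  # tmp_test.db
```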
@@ -0,0 +1,50 @@
# Session: stemwijzer
Updated: 2026-03-20T00:23:33Z

## Goal
Preserve the minimal session state required to resume work on the stemwijzer project after context clears (success = ledger exists and is kept up-to-date).

## Constraints
- Keep the ledger CONCISE — only essential information
- Focus on WHAT and WHY, not HOW
- Mark uncertain information as UNCONFIRMED
- Include git branch and key file paths

## Progress
### Done
- [x] Create initial continuity ledger file

### In Progress
- [ ] Capture ongoing session context and update ledger after each meaningful change

### Blocked
- None currently

## Key Decisions
- **Session name = "stemwijzer"**: Chosen from repository context (UNCONFIRMED if a different canonical session name is preferred).
- **Do not auto-commit ledger changes**: Commits will only be made when the user explicitly requests it (follows Git Safety Protocol).

## Next Steps
1. Continue updating this ledger when tasks, files, or decisions change
2. Add entries for new branches or major feature work (mark as UNCONFIRMED when unsure)
3. Ask user before creating any git commits that include this ledger

## File Operations
### Read
- `README.md`
- `pyproject.toml`
- `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`
- `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`

### Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (new)

## Critical Context
- Repository branch observed: `main`
- Found project metadata in `pyproject.toml` indicating Python tooling preference
- Existing notes/plans located under `thoughts/shared/` (plans and designs from 2026-03-19)
- No existing continuity ledger was found prior to this creation

## Working Set
- Branch: `main`
- Key files: `README.md`, `pyproject.toml`, `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`, `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`
@@ -0,0 +1,98 @@
---
date: 2026-03-19
topic: "Stemwijzer AI & DB design"
status: draft
---

## Problem Statement

We need a clear, low-risk design to improve AI usage and query ergonomics in this repository. The codebase currently ingests motions, stores them in DuckDB, and generates AI-driven layman summaries via an OpenRouter/OpenAI client. There are a few maintenance issues (e.g., missing config keys, a broken reset script) and no embedding/search infrastructure.

**Goal:**
- Centralize AI/LLM usage behind a provider abstraction so we can swap or prefer providers later.
- Introduce minimal embeddings storage and search so we can add semantic features without heavy infra.
- Prefer ibis for read/query paths where that improves clarity and maintainability (the repo already imports ibis in read.py).

## Constraints

- Work must be incremental and non-disruptive: keep the existing DuckDB schema and write paths where possible.
- Do not add external services (vector DB) in the first iteration — store embeddings in DuckDB as JSON for now.
- Secrets must remain environment-driven (no checked-in secrets). Add env var defaults only.
- Keep changes small and well-tested; make it easy to roll back.

## Approach (chosen)

I'll introduce two small layers:
- **ai_provider**: a thin adapter that exposes get_embedding(text) and chat_completion(messages). It will use the existing OpenRouter/OpenAI path by default and can be extended to prefer other providers if/when desired.
- **query_dal**: read-focused utilities implemented with ibis to replace direct SQL reads in the app and other read-heavy paths. Writes (insert_motion, update_user_vote) stay in database.py initially.

This gives the benefits of abstraction and pythonic query composition while keeping risk low.

## Architecture

High-level components (repo root):
- api_client.py — fetches motion data from Tweede Kamer OData (unchanged)
- scraper.py — optional HTML scraping fallback (unchanged)
- database.py — current writes, schema initialization (add small embeddings table)
- summarizer.py — generate layman summaries (refactor to use ai_provider)
- app.py — Streamlit UI (switch read paths to query_dal)
- scheduler.py — orchestrates ingestion and triggers summarization (unchanged)

Additions:
- ai_provider.py — single place for LLM/embedding calls and retries
- query_dal.py — ibis-based read helpers (get_filtered_motions, calculate_party_matches)
- minimal embeddings table in DuckDB (motion_id, model, vector JSON, created_at)

## Components and responsibilities

- **ai_provider**: choose provider, handle retries/backoff, return plain Python objects (list[float] embeddings, str completions). Keep error classes small and testable.
- **database (existing)**: add store_embedding and search_similar helpers (naive in-Python cosine scan). Keep insert_motion/update_user_vote unchanged to minimize risk.
- **query_dal**: use ibis for read queries used by Streamlit paths (get_filtered_motions, session lookups). Return parsed JSON fields.
- **summarizer**: call ai_provider.chat_completion to get a summary; update motions.layman_explanation; optionally compute an embedding via ai_provider.get_embedding and store it via database.store_embedding.
- **app.py**: replace direct duckdb selects with query_dal functions.
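A minimal sketch of the naive in-Python cosine scan behind search_similar (helper names and the exact embeddings-table columns here are assumptions for illustration, not the final API):

```python
import json
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search_similar(conn, query_vec: list[float], top_n: int = 5):
    """Scan all stored embeddings and return the top_n most similar motion ids.

    Assumes an embeddings table with (motion_id, vector JSON) as sketched above.
    """
    rows = conn.execute("SELECT motion_id, vector FROM embeddings").fetchall()
    scored = [(mid, cosine(query_vec, json.loads(vec))) for mid, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

This is O(rows) per query, which is fine at our scale; a vector index only becomes worth it once the scan dominates query time.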

## Data Flow

1. Ingest: scheduler / scraper / api_client fetch motions and call database.insert_motion(motion).
2. Summarize: summarizer calls ai_provider.chat_completion(summary prompt) → writes layman_explanation to the motions table. Optionally computes an embedding and writes it to the embeddings table.
3. Query: the Streamlit app calls query_dal.get_filtered_motions (ibis) to load motions for sessions and query_dal.calculate_party_matches for results.
4. Semantic search (future): query_dal or the app can call database.search_similar by providing an embedding computed with ai_provider.get_embedding.

## Error Handling

- ai_provider: retries with exponential backoff for transient errors; raises a ProviderError for terminal failures so callers can decide retry semantics.
- Summarizer: non-fatal on AI failures — store an empty/fallback summary and log the failure; surface a user-facing message in Streamlit if generating summaries fails interactively.
- DB functions: existing try/except patterns retained; ensure connections are closed on error.

## Testing Strategy

- Unit tests for ai_provider using mocks for HTTP/openai responses.
- DB tests using temporary DuckDB files to verify store_embedding and search_similar behavior.
- query_dal tests using ibis against a temporary DB file; ensure JSON fields parse correctly.
- Summarizer tests mock ai_provider to assert DB writes happen.

## Open Questions

- Store embeddings inside the motions table vs. a separate embeddings table? Recommendation: separate embeddings table for clarity and easier upserts.
- Do we want to prefer other providers (Copilot) automatically? This repo currently references OPENROUTER. If the user wants a Copilot preference, we can add env vars and selection logic later.

## Next steps (short)

1. Add ai_provider.py (adapter) and tests.
2. Add the embeddings table and store/search helpers in database.py, plus tests.
3. Add query_dal.py with ibis reads and tests.
4. Refactor summarizer.py to use ai_provider and optionally store embeddings.
5. Update Streamlit app read paths to use query_dal.
6. Fix housekeeping bugs: reset.py references reset_database(), and the scraper uses an undefined SCRAPING_DELAY — address these small fixes in a separate patch.

I'm proceeding to save this design to thoughts/shared/designs/2026-03-19-stemwijzer-design.md and will spawn the planner to create a detailed implementation plan. Interrupt if you want changes to the design text above.
@@ -0,0 +1,116 @@
---
date: 2026-03-21
topic: "Reuse motions as a guided policy explorer"
status: draft
---

## Problem Statement

We want to repurpose existing "motions" data so it becomes a lightweight, discovery-driven way for users to explore policy positions and discover related content. This is not a full proposal system; it's a guided exploration and bookmarking flow that leverages our existing ingestion, summarization, embeddings, and session voting work.

**Why now:** We already ingest motions, generate layman explanations, compute embeddings, and store per-session votes. Reusing those building blocks gives high user value with modest effort.

## Constraints

**Non-negotiables and technical limits:**
- Use the existing database schema where possible (motions table, embeddings table, user_sessions). Do not require a new external vector DB for the MVP.
- Keep the Streamlit UI model (app.py) and session-based votes intact for the initial rollout.
- Avoid breaking migrations: rely on existing migrations and add new ones when necessary (no forced drops).
- Respect the current error-handling posture: network calls can fail; the system must degrade gracefully.

## Chosen Approach

I'm choosing a "Guided Policy Explorer" approach because it reuses the highest-value existing pieces (summaries, embeddings, session voting) and delivers a clear UX that fits the current codebase. This gives immediate product value with low risk.

**Core idea:** present curated short sessions and motion detail pages that combine the existing layman explanation, party-match results, and semantic "related motions" powered by stored embeddings.

Alternatives considered:
- "Motion-as-Proposal platform": full lifecycle (draft → comment → vote). Rejected for the MVP due to high complexity and data model changes.
- "Motion Digest / Research Assistant": read-only pages and newsletters. Lower effort, but less interactive, and it reuses fewer of our current session features.

## Architecture

High-level view (existing pieces in bold):
- Ingest: **api_client.py** + **scraper.py** gather motions and create motion records in the DB.
- Persist: **database.py** stores motions, embeddings, and user_sessions.
- Enrichment: **summarizer.py** + **ai_provider.py** generate layman explanations and embeddings.
- Background jobs: **scheduler.py** runs ingest, summarization, and periodic clustering.
- UI: **app.py** current Streamlit session flow — extend with "Explore" and "Motion detail" pages.
- New: small **clusterer / similarity API** to compute and cache related-motion lists per motion.

## Key Components & Responsibilities

- Motion Ingest (existing): keep ingest as-is; add metadata flags (e.g., curated, candidate).
- Motion Store (existing): motions table + embeddings table; add an **events/audit** table for user actions and important state transitions.
- Summarizer / Embedding Worker (existing): scheduled job that ensures motions have layman_explanation and embeddings; add retry/backoff and logging.
- Similarity service (new): computes nearest neighbors using stored vectors in-process for the MVP and caches results in a small table. Swap in a vector index later if needed.
- Session & Voting (existing): continue using the user_sessions JSON blob for individual sessions; add optional event log entries for each vote.
- UI (update): add an "Explore" landing and a motion detail view with layman text, party-match snapshot, related motions, and bookmark/flag actions. Reuse Streamlit components.
- Admin tooling (new): migration scripts, a CLI to recompute embeddings/similarity, and an audit query helper.

## Data Flow

1. The ingest job (api_client/scraper) produces motion records and calls db.insert_motion.
2. The summarizer worker picks up motions without layman_explanation or embeddings, calls ai_provider, and writes layman_explanation + embeddings.
3. The clusterer/similarity job computes related-motion lists using stored embeddings and writes them to a cache table.
4. The UI "Explore" page shows curated motion lists; "Motion detail" reads the motion, layman_explanation, party-match snapshot, and cached related motions.
5. User vote actions update user_sessions and also append an event to the audit table for traceability.
6. Background analytics (optional) reuses user_events and embeddings for offline insights.

## Error Handling Strategy

- External calls: add retries with exponential backoff for the AI provider and external APIs. Failures set a marker (e.g., summary_missing) and the system continues.
- Missing embeddings: the UI gracefully disables "related motions" and offers "compute on demand".
- Idempotency: make insert_motion idempotent via a URL/external-id check at the DB layer; use optimistic handling for duplicates.
- Concurrency: avoid read-modify-write races by writing user events (append-only) and deriving session state from events when race-prone updates are detected.
- Observability: replace prints with structured logging (module-level logger) and add basic metrics for worker errors, API failures, and queue lags.

## Testing Strategy

- Unit tests: DB helpers (insert_motion, store_embedding, similarity cache), summarizer functions (mock ai_provider), and session vote logic.
- Migration tests: follow the existing pattern of applying migration SQL in a temp DB and asserting schema.
- Integration tests: end-to-end ingest → summarize → embedding → similarity → UI-read path in CI (use monkeypatch for AI calls).
- Load tests: simulate a few thousand embedding search calls against the in-process search to validate performance assumptions for the MVP.
- Acceptance: confirm the UX flows: Explore session, Motion detail, Vote → party match, Related motions populated.

## High-level Plan & Estimates

Assumptions: one full-stack engineer (Python + Streamlit) and one part-time reviewer. All estimates are rough.

Milestone 0 — Validate & quick discovery (1 day)
- Locate the user's added markdown plan and extract exact requirements. (I'm assuming the file exists in thoughts/shared; if not, we validated by searching.)

Milestone 1 — MVP (8–12 engineer days)
- Add the similarity cache table and migration.
- Summarizer: make embedding generation robust with retries and store vectors.
- Clusterer job: compute and cache related motions.
- UI: Explore landing, Motion detail page, related-motion UI, bookmark/flag button.
- Add the event/audit table and write events on user votes and bookmarks.

Milestone 2 — Hardening & instrumentation (3–5 engineer days)
- Replace prints with structured logging across touched modules.
- Add migration tests and CI integration tests (mock AI).
- Add health metrics & basic alerting for worker failures.

Milestone 3 — Polish & UX feedback (3–5 engineer days)
- UX tweaks, performance tuning, compute-on-demand fallback for embeddings, documentation, admin CLI.

Total MVP + polish: ~2–3 weeks of focused work.

## Risks & Mitigations

- Risk: Naive in-process embedding search will not scale. Mitigation: cache nearest neighbors per motion and plan a migration path to a vector index.
- Risk: AI provider flakiness. Mitigation: retries, timeouts, and a clear UI fallback. Tests must mock the provider in CI.
- Risk: Race conditions on session votes. Mitigation: append-only event log; derive the authoritative session view from events when needed.
- Risk: Schema drift and missing migrations. Mitigation: add migration tests and document required migrations in the repo.

## Open Questions

- Which exact user journeys do we want first (single-session discovery vs. persistent account/bookmarking)?
- Do we want bookmarks persisted globally or per-session only? (Privacy implications.)
- What's acceptable latency for "related motions" — precomputed nightly vs. near-real-time?
- Any policy/legal ban on storing full body_text or on long-term retention of user votes?

---

I'm proceeding to create the design doc file at thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md and will spawn the implementation planner next. Interrupt if you want changes to the approach or scope now.
@@ -0,0 +1,335 @@
# Guided Policy Explorer — Implementation Plan

**Goal:** Implement the Guided Policy Explorer MVP that reuses existing motions, layman summaries, embeddings, and session votes to provide an Explore landing, a Motion detail view, cached related motions (similarity cache), and accompanying background jobs and admin tooling.

Design: thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md

---

## Dependency Graph

```
Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4, 1.5 [foundation - migrations, types, migration-tests]
Batch 2 (parallel): 2.1, 2.2, 2.3, 2.4 [core - similarity service, cache repo, audit repo, embeddings worker]
Batch 3 (parallel): 3.1, 3.2, 3.3, 3.4 [components - clusterer worker, CLI, API, Streamlit page]
Batch 4: 4.1 [integration tests & docs - depends on 2.x & 3.x]
```

---

## Notes on planning choices
- The design requires a similarity cache and a small in-process nearest-neighbor search for the MVP. I'm implementing this as: store precomputed top-N neighbor lists (IDs + scores) in a small SQL table and compute neighbors by scanning embeddings in memory per batch job. Reason: this avoids an external vector DB and keeps the implementation simple and testable.
- The design requires robust embedding generation. I'll implement exponential-backoff retry logic with a configurable retry count and timeouts in embeddings_worker; tests will monkeypatch the ai_provider to simulate failures.
- Migration tests: the design asks for migration tests, but migration SQL content is omitted per instructions. Tests will assert that migration files are present and follow naming conventions, and will be marked to skip applying SQL unless a TEST_DB_URL env var is provided. This keeps CI safe while satisfying test coverage and developer verification.

---

## Batch 1: Foundation (parallel - 5 implementers)
All tasks in this batch have NO dependencies and run simultaneously.

### Task 1.1: Add similarity cache migration (placeholder)
**Title:** Migration: add similarity_cache table
**Description:** Add a migration file to create a similarity cache table that stores precomputed related-motion lists per motion (motion_id, neighbors_json, computed_at). SQL content intentionally left out per instructions; the file is a placeholder that CI/tests will detect.
**Files:**
- migrations/2026-03-22-add-similarity-cache.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_similarity_cache.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- Migration file exists at migrations/2026-03-22-add-similarity-cache.sql
- The migration test runs and passes in default mode (it only checks the filename & header). If TEST_DB_URL is set in the env, the test will attempt to run the SQL and must not error (the SQL may be empty; the test expects a no-op or valid SQL). The test is marked to skip DB application when TEST_DB_URL is unset.

---

### Task 1.2: Add audit/events migration (placeholder)
**Title:** Migration: add audit_events table
**Description:** Add a migration placeholder to create an audit/events table for append-only user events (vote, bookmark, flag). Actual SQL omitted.
**Files:**
- migrations/2026-03-22-add-audit-events.sql
**Tests:**
- tests/migrations/test_2026_03_22_add_audit_events.py
**Estimated:** 1.0h
**Priority:** high
**Depends:** none
**Acceptance criteria:**
- migrations/2026-03-22-add-audit-events.sql exists
- The migration test verifies the filename and is safe to run in CI (skips DB apply unless TEST_DB_URL is provided).

---

### Task 1.3: Shared types for motions & similarity entries
**Title:** Types: motion and similarity types
**Description:** Add a small types module that centralizes the typed dataclasses/interfaces used by the similarity and cache modules (MotionId, Embedding vector type alias, SimilarityNeighbor). This reduces coupling and makes tests easier to write.
**Files:**
- src/types/motion_types.py
**Tests:**
- tests/types/test_motion_types.py
**Estimated:** 1.5h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- src/types/motion_types.py defines MotionId, Embedding, SimilarityNeighbor types and basic helpers (e.g., serialize/deserialize neighbors). Tests validate a JSON round-trip of neighbors.
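A minimal sketch of what src/types/motion_types.py could contain (the type names come from the task; the exact fields and helper signatures are assumptions):

```python
import json
from dataclasses import dataclass

# Type aliases described in Task 1.3.
MotionId = str
Embedding = list[float]


@dataclass(frozen=True)
class SimilarityNeighbor:
    motion_id: MotionId
    score: float


def serialize_neighbors(neighbors: list[SimilarityNeighbor]) -> str:
    """Serialize a neighbor list to JSON for storage in the cache table."""
    return json.dumps([{"motion_id": n.motion_id, "score": n.score} for n in neighbors])


def deserialize_neighbors(raw: str) -> list[SimilarityNeighbor]:
    """Inverse of serialize_neighbors; rebuilds dataclass instances."""
    return [SimilarityNeighbor(d["motion_id"], d["score"]) for d in json.loads(raw)]
```

Frozen dataclasses give value equality for free, which is what makes the JSON round-trip test a one-liner.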

---

### Task 1.4: CI migration test helper
**Title:** Test helper: migration test utils
**Description:** Add a small test helper that other migration tests can use. It provides a pytest fixture that reads TEST_DB_URL and yields a DB connection or None, and marks tests appropriately.
**Files:**
- tests/utils/migration_fixtures.py
**Tests:**
- tests/migrations/test_migration_fixtures_smoke.py
**Estimated:** 1.0h
**Priority:** medium
**Depends:** none
**Acceptance criteria:**
- migration_fixtures.py provides a `test_db` fixture. The smoke test asserts that the fixture yields None when TEST_DB_URL is unset and a connection-like object when it is set.

---

### Task 1.5: Add README admin docs for recomputing
**Title:** Docs: admin CLI usage and migration notes
**Description:** Add a short markdown doc describing the admin CLI, migration filenames, and how to run recompute/clusterer jobs locally for dev.
**Files:**
- docs/admin/recompute_similarity.md
**Tests:** none (doc only)
**Estimated:** 0.5h
**Priority:** low
**Depends:** none
**Acceptance criteria:**
- docs/admin/recompute_similarity.md exists and documents commands and env vars: TEST_DB_URL, AI_PROVIDER_MOCK, SIMILARITY_TOP_N.

---

## Batch 2: Core Modules (parallel - 4 implementers)
Depends: Batch 1

### Task 2.1: Similarity service (in-process search + utility)
**Title:** Similarity service implementation
**Description:** New service that, given motion embeddings, computes cosine similarity and returns the top-N neighbors. It also exposes a convenience function that computes neighbors for one motion and returns a list of (motion_id, score). This is pure Python and testable in memory.
**Files:**
- src/services/similarity_service.py
**Tests:**
- tests/services/test_similarity_service.py
**Estimated:** 5.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- similarity_service.py exposes compute_neighbors(embedding: Embedding, all_embeddings: dict[MotionId, Embedding], top_n: int) -> list[SimilarityNeighbor]
- Unit tests cover exact small matrices and edge cases (empty, identical embeddings). All tests pass with `pytest tests/services/test_similarity_service.py`.
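The core of the service could look like this sketch (plain (motion_id, score) tuples stand in for SimilarityNeighbor so the example is self-contained; everything else is an assumption about the eventual module):

```python
import math

MotionId = str
Embedding = list[float]


def _cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def compute_neighbors(embedding: Embedding,
                      all_embeddings: dict[MotionId, Embedding],
                      top_n: int) -> list[tuple[MotionId, float]]:
    """Score every stored embedding against the query and keep the top_n.

    Returns (motion_id, score) pairs sorted by descending similarity.
    """
    scored = [(mid, _cosine(embedding, vec)) for mid, vec in all_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Callers should exclude the query motion's own id from all_embeddings (or drop the first hit) when computing a motion's related list.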

---

### Task 2.2: DB repo for similarity cache
**Title:** Repo: similarity_cache read/write
**Description:** Provide a small repository abstraction that reads and writes cached neighbor lists to the DB (serialize neighbors as JSON). Keep DB interactions minimal and testable using sqlite in-memory.
**Files:**
- src/db/similarity_cache_repo.py
**Tests:**
- tests/db/test_similarity_cache_repo.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.1, 1.3
**Acceptance criteria:**
- similarity_cache_repo provides the functions get_cached_neighbors(motion_id) -> Optional[List[SimilarityNeighbor]] and upsert_cached_neighbors(motion_id, neighbors, computed_at)
- Unit tests run against sqlite in-memory and assert correct serialization/deserialization.
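A possible shape for the repo, shown against sqlite (the table DDL mirrors what the Task 1.1 migration intends but is an assumption until that SQL lands; neighbors are plain (id, score) tuples here):

```python
import json
import sqlite3


def init_cache(conn: sqlite3.Connection) -> None:
    """Create the cache table (stand-in for the real migration in tests)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS similarity_cache (
               motion_id TEXT PRIMARY KEY,
               neighbors_json TEXT NOT NULL,
               computed_at TEXT NOT NULL
           )"""
    )


def upsert_cached_neighbors(conn, motion_id, neighbors, computed_at):
    """Insert or update the cached neighbor list for one motion."""
    conn.execute(
        "INSERT INTO similarity_cache (motion_id, neighbors_json, computed_at) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(motion_id) DO UPDATE SET "
        "neighbors_json = excluded.neighbors_json, computed_at = excluded.computed_at",
        (motion_id, json.dumps(neighbors), computed_at),
    )


def get_cached_neighbors(conn, motion_id):
    """Return the cached (motion_id, score) list, or None when nothing is cached."""
    row = conn.execute(
        "SELECT neighbors_json FROM similarity_cache WHERE motion_id = ?",
        (motion_id,),
    ).fetchone()
    return None if row is None else [tuple(n) for n in json.loads(row[0])]
```

`ON CONFLICT ... DO UPDATE` needs SQLite 3.24+, which every supported Python ships; DuckDB accepts the same upsert form.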

---

### Task 2.3: Audit/events repository
**Title:** Repo: audit_events append-only writer
**Description:** Small repo to append audit events (user_id, session_id, motion_id, event_type, payload JSON, created_at). Provides an append_event function used by the UI and session logic.
**Files:**
- src/db/audit_repo.py
**Tests:**
- tests/db/test_audit_repo.py
**Estimated:** 3.0h
**Priority:** medium
**Depends:** 1.2
**Acceptance criteria:**
- append_event writes a row to sqlite in-memory in tests, and a read-back verifies the fields and the presence of created_at. Functions are well typed and handle JSON payloads.
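An append_event sketch under the same caveat (the DDL stands in for the Task 1.2 migration; column choices are assumptions):

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_audit(conn: sqlite3.Connection) -> None:
    """Create the append-only events table (test stand-in for the migration)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT, session_id TEXT, motion_id TEXT,
               event_type TEXT NOT NULL,
               payload TEXT, created_at TEXT NOT NULL
           )"""
    )


def append_event(conn, user_id, session_id, motion_id, event_type, payload=None):
    """Append one event; created_at is stamped here so callers can't forget it."""
    conn.execute(
        "INSERT INTO audit_events "
        "(user_id, session_id, motion_id, event_type, payload, created_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (user_id, session_id, motion_id, event_type,
         json.dumps(payload or {}), datetime.now(timezone.utc).isoformat()),
    )
```

There is deliberately no update/delete function: append-only is what makes the event log usable as the authoritative record for race-prone session state.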

---

### Task 2.4: Embeddings worker helper (retries/backoff)
**Title:** Worker: robust embedding generator
**Description:** Add a worker helper that ensures embeddings exist for a motion. It calls ai_provider.get_embedding with retry/backoff and writes the embedding via an abstracted DB function (the put function is dependency-injected in tests). This module contains no long-running loop — it's a single-run helper function used by the scheduler.
**Files:**
- src/ai/embeddings_worker.py
**Tests:**
- tests/ai/test_embeddings_worker.py
**Estimated:** 4.0h
**Priority:** high
**Depends:** 1.3
**Acceptance criteria:**
- embeddings_worker.explain_and_embed(motion_id, text, put_embedding_fn) calls ai_provider and retries on simulated transient errors. Tests monkeypatch ai_provider to simulate 2 failing attempts then success, and verify put_embedding_fn is called exactly once with a vector-like object.
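The retry/backoff loop could be as small as this sketch (get_embedding_fn and the injectable sleep are test-oriented assumptions; in the real module get_embedding_fn would default to ai_provider.get_embedding):

```python
import time


def explain_and_embed(motion_id, text, put_embedding_fn, get_embedding_fn,
                      max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Fetch an embedding with exponential backoff, then persist it exactly once.

    sleep is injectable so tests run instantly instead of waiting out delays.
    """
    for attempt in range(max_retries):
        try:
            vector = get_embedding_fn(text)
        except Exception:
            if attempt == max_retries - 1:
                raise  # terminal failure: surface it and let the scheduler decide
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
        else:
            put_embedding_fn(motion_id, vector)
            return vector
```

Because put_embedding_fn only runs on the success path, the "called exactly once" acceptance criterion holds by construction.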

---

## Batch 3: Components (parallel - 4 implementers)
Depends: Batch 2

### Task 3.1: Clusterer scheduled job
**Title:** Worker: clusterer job that computes & writes caches
**Description:** Background job module that loads all embeddings, computes top-N neighbors for each motion using similarity_service, and writes cache rows via similarity_cache_repo. Designed to be runnable from the CLI. It should respect a max-runtime parameter (process batch size) for safe operation in dev.
**Files:**
- src/workers/clusterer.py
**Tests:**
- tests/workers/test_clusterer.py
**Estimated:** 6.0h
**Priority:** high
**Depends:** 2.1, 2.2, 2.4
**Acceptance criteria:**
- clusterer.run_batch(batch_size, top_n, load_embeddings_fn, upsert_cache_fn) exists and can be unit-tested by injecting small in-memory embeddings and verifying upsert_cache_fn is called with the expected neighbor lists.

---

### Task 3.2: Admin CLI: recompute-similarity
**Title:** CLI: recompute similarity & options
**Description:** Small CLI script (click or argparse) to trigger the clusterer job (full run or limited). The CLI accepts --top-n, --batch-size, and --dry-run flags. Tests will monkeypatch clusterer.run_batch.
**Files:**
- src/cli/recompute_similarity.py
**Tests:**
- tests/cli/test_recompute_similarity.py
**Estimated:** 2.5h
**Priority:** medium
**Depends:** 3.1
**Acceptance criteria:**
- The CLI parses flags and calls clusterer.run_batch with the parsed args. Tests assert the proper arguments are passed and that --dry-run does not call run_batch.
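With argparse this is a few lines; injecting run_batch (an assumption here, matching how the tests plan to monkeypatch it) keeps the test trivial:

```python
import argparse


def main(argv=None, run_batch=None):
    """Parse the recompute flags and dispatch to the injected run_batch callable.

    run_batch is injected so tests can pass a stub instead of the real
    clusterer job; the real entry point would default it to clusterer.run_batch.
    """
    parser = argparse.ArgumentParser(prog="recompute-similarity")
    parser.add_argument("--top-n", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args(argv)

    if args.dry_run:
        # Report what would happen without touching the cache.
        print(f"dry-run: would recompute top_n={args.top_n} "
              f"batch_size={args.batch_size}")
        return None
    return run_batch(batch_size=args.batch_size, top_n=args.top_n)
```

Passing argv explicitly (instead of reading sys.argv) is what makes the flag-parsing paths directly unit-testable.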
||||||
|
|
||||||
|
--- |
||||||
|
|
||||||
|
### Task 3.3: HTTP API endpoint for compute-on-demand / cached |
||||||
|
**Title:** API: similarity endpoint |
||||||
|
**Description:** Small Flask/FastAPI/WSGI handler module that returns cached related motions for a motion_id; if cache missing and a query param compute=true, it calls the similarity service to compute neighbors on demand (without persisting) and returns them. Keep the handler framework-agnostic so it can be wired into existing web framework; tests will call the handler function directly. |
||||||
|
**Files:** |
||||||
|
- src/api/similarity_api.py |
||||||
|
**Tests:** |
||||||
|
- tests/api/test_similarity_api.py |
||||||
|
**Estimated:** 3.5h |
||||||
|
**Priority:** medium |
||||||
|
**Depends:** 2.1, 2.2 |
||||||
|
**Acceptance criteria:** |
||||||
|
- Handler get_related(motion_id, compute=False, load_embedding_fn, load_all_embeddings_fn, cache_repo) returns cached neighbors when present and computes on demand when compute=True. Tests cover both code paths. |
---

### Task 3.4: Streamlit UI: Explore landing & Motion detail module

**Title:** UI: explore page and motion detail component

**Description:** Add a Streamlit helper module providing functions to render the Explore landing and Motion detail sections. Avoid modifying the existing app.py in this MVP; instead provide a module that app.py can import. The module exposes pure functions where possible to ease testing; tests verify behavior by calling the functions and mocking DB/AI calls.

**Files:**

- src/ui/explore_page.py

**Tests:**

- tests/ui/test_explore_page.py

**Estimated:** 5.0h

**Priority:** medium

**Depends:** 2.2, 2.3, 2.4

**Acceptance criteria:**

- explore_page.render_explore(session, load_curated_fn, load_cached_neighbors_fn) returns a data structure (not direct Streamlit calls) that app.py can choose to render. Tests assert the correct payload for a sample session and that motions with missing embeddings gracefully omit related motions.
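One way the render function could shape its payload so that app.py stays in charge of the actual Streamlit calls. Field names here are illustrative, not the project's schema:

```python
def render_explore(session, load_curated_fn, load_cached_neighbors_fn):
    """Build a renderable payload; app.py decides how to draw it with Streamlit."""
    motions = []
    for motion in load_curated_fn(session):
        # a missing cache entry (e.g. no embedding yet) degrades to an empty list
        neighbors = load_cached_neighbors_fn(motion["id"]) or []
        motions.append({
            "id": motion["id"],
            "title": motion.get("title", ""),
            "layman_explanation": motion.get("layman_explanation", ""),
            "related": neighbors,
        })
    return {"session": session, "motions": motions}
```

Tests can assert on the returned dict directly, with no Streamlit import anywhere in the test.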
---

## Batch 4: Integration & Docs (parallel - 2 implementers)

Depends: Batch 2 & 3

### Task 4.1: Integration test: ingest → summarize → embed → cluster → UI read

**Title:** Integration test for the end-to-end path (MVP)

**Description:** Add an integration pytest that simulates the full flow: create 3 synthetic motions, call embeddings_worker (with a monkeypatched AI provider), run the clusterer on the in-memory dataset, and assert that similarity cache rows exist and explore_page returns related motions. Use SQLite in-memory and monkeypatch ai_provider to return deterministic vectors.

**Files:**

- tests/integration/test_end_to_end_explore_flow.py

**Tests:**

- (this is the test file)

**Estimated:** 8.0h

**Priority:** high

**Depends:** 1.3, 2.1, 2.2, 2.4, 3.1, 3.4

**Acceptance criteria:**

- Running `pytest tests/integration/test_end_to_end_explore_flow.py` passes locally with no external network calls when the AI provider is monkeypatched via the monkeypatch fixture. The test asserts that at least one neighbor exists for a motion and that the explore_page data includes it.
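The deterministic-vector requirement can be met with a stand-in like the following (a hypothetical fake, not the project's provider); the integration test would install it with monkeypatch:

```python
def fake_get_embedding(text):
    """Deterministic, network-free stand-in: a crude bag-of-characters vector."""
    vec = [0.0] * 8
    for ch in text.lower():
        vec[ord(ch) % 8] += 1.0
    # normalise so cosine similarity between fakes is well-behaved
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec

# inside the test body, installed via pytest's monkeypatch fixture:
# monkeypatch.setattr("src.ai.ai_provider.get_embedding", fake_get_embedding)
```

Same input, same vector, every run; similar motion texts also land near each other, so neighbor assertions stay stable.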
---

## CI / Test instructions

- Run unit tests: pytest tests/unit (or the full suite: pytest)
- Run a single module test: pytest tests/services/test_similarity_service.py::test_compute_neighbors_basic
- Integration tests: pytest tests/integration/test_end_to_end_explore_flow.py

Monkeypatching the AI provider in CI/local tests:

- Use the `monkeypatch` pytest fixture to patch `src.ai.ai_provider.get_embedding` and `src.ai.ai_provider.summarize` (if used). Example in tests: monkeypatch.setattr('src.ai.ai_provider.get_embedding', fake_get_embedding)
- CI should set the env var AI_PROVIDER_MOCK=1 for additional safety; tests check this var and use mocks when it is present.
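The AI_PROVIDER_MOCK safety net could be honored with a small selector like this sketch (the function name and wiring are assumptions):

```python
import os


def get_provider(real_provider, mock_provider):
    """Select the mock provider when AI_PROVIDER_MOCK=1, as an extra CI safety net."""
    if os.environ.get("AI_PROVIDER_MOCK") == "1":
        return mock_provider
    return real_provider
```

Monkeypatching remains the primary mechanism; the env var just guards against a test that forgets to patch.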
Temp DB setup for tests:

- Unit tests should use SQLite in-memory ("sqlite:///:memory:") via a `test_db` fixture in tests/utils/migration_fixtures.py.
- Migration tests: if the TEST_DB_URL env var is set, the migration tests attempt to apply SQL to that DB; otherwise they run in dry-run / skip-apply mode and only validate the filename and header.
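A minimal sketch of the helper behind such a `test_db` fixture (the schema shown is illustrative); the pytest fixture itself can simply yield make_test_db() and close the connection on teardown:

```python
import sqlite3


def make_test_db():
    """In-memory SQLite connection with the minimal schema unit tests need."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE similarity_cache ("
        " motion_id TEXT PRIMARY KEY, neighbors_json TEXT NOT NULL)"
    )
    return conn
```

Every test gets a fresh, isolated database with no files to clean up.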
Example pytest commands:

- pytest -q
- pytest -q tests/services/test_similarity_service.py -k compute_neighbors

Notes for the CI pipeline:

- Ensure Python dependencies include pytest, pytest-mock and any DB driver required (the built-in sqlite3 is fine). No external AI keys required — tests must mock the AI provider.

---

## 3-Sprint Schedule (2-week sprints)

Sprint 1 (Weeks 1–2) — Milestone 1: MVP foundation + core similarity

- Goals: Add migrations, types, similarity service, similarity cache repo, audit repo, embeddings worker helper
- Tasks: 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4

Sprint 2 (Weeks 3–4) — Milestone 1 continued: background job, CLI, API, UI

- Goals: Implement clusterer job, CLI, similarity API, explore_page UI module; initial integration smoke tests
- Tasks: 3.1, 3.2, 3.3, 3.4, initial lightweight integration test scaffolding

Sprint 3 (Weeks 5–6) — Milestones 2 & 3: hardening, integration tests, docs

- Goals: Full integration tests, migration tests, docs, logging hardening, small UX polish
- Tasks: 4.1, docs improvements from 1.5, logging conversion across modules (follow-up small PRs as needed)

Notes:

- Estimates assume 1 full-stack engineer + 1 reviewer. Sprint 1 is AMA-heavy; the reviewer will focus on migrations and core algorithms. Sprint 2 focuses on wiring and UI; the reviewer focuses on integration and UX. Sprint 3 finishes tests and polish.
---

## Assumptions

- The repository uses Python 3.10+ and pytest for tests. If different, adjust the test fixtures accordingly.
- DB access helpers already exist (a simple execute/connection helper). If not, tests use sqlite3 directly and repository code will accept a DB connection/cursor via dependency injection.
- The project already has an ai_provider abstraction at src/ai/ai_provider.py with functions `get_embedding(text) -> list[float]` and `summarize(text) -> str` — tests will monkeypatch these. If the names differ, adapt the imports when implementing.
- The Streamlit app remains `app.py` and can import src/ui/explore_page.py — I deliberately do not modify app.py in this plan to keep the change set minimal.
- We will store embeddings as arrays in an embeddings table; similarity modules will load them via an injected loader function to keep unit tests pure.

---

## Open Questions / Implementation Clarifications

1. Bookmarks persistence: the design left bookmarks open (session vs. persistent). For the MVP we will record bookmark events in the audit_events table (append-only) and treat them as per-session by default. If persistent bookmarks are required later, a new table/migration will be added.
2. Which web framework should similarity_api be wired into? The plan keeps the handler framework-agnostic; we need guidance on whether the app uses Flask/FastAPI/Starlette to add the route. The implementer should wire it into the existing HTTP routing pattern.
3. Embedding storage format: assume float arrays stored as JSON or an array type in the DB. If the project uses a binary blob, adjust serialization in similarity_cache_repo and the tests accordingly.
4. Acceptable top-N neighbor size for caches: default SIMILARITY_TOP_N = 10; the CLI and worker accept an override. If product wants 50, increase later.
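If JSON storage is chosen, the round-trip in similarity_cache_repo could be as simple as this sketch (field names are assumptions):

```python
import json


def serialize_neighbors(neighbors):
    """Encode (motion_id, score) pairs as JSON text for the cache column."""
    return json.dumps(
        [{"motion_id": mid, "score": round(score, 6)} for mid, score in neighbors]
    )


def deserialize_neighbors(raw):
    """Decode the JSON cache column back into (motion_id, score) pairs."""
    return [(row["motion_id"], row["score"]) for row in json.loads(raw)]
```

Swapping to a binary blob later only changes these two functions, not their callers.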
---

## How a single implementer should proceed (step-by-step)

1. Start with Batch 1 tasks 1.1–1.4. Create the migration placeholders and types module. Run the migration filename tests.
2. Implement similarity_service (2.1) and its unit tests. This is the critical algorithm that must be rock-solid.
3. Implement similarity_cache_repo (2.2) and audit_repo (2.3) using SQLite in-memory for tests. Run the unit tests.
4. Implement the embeddings_worker helper (2.4) and add tests that mock ai_provider. Ensure CI will not call the real AI.
5. Implement the clusterer (3.1) and test it with in-memory data by injecting loader/upsert functions.
6. Add the admin CLI (3.2) to run the clusterer; add a small doc (1.5) describing how to run it locally.
7. Implement the API handler (3.3) and UI helper (3.4). Tests should mock DB and AI as needed.
8. Finish with the integration test (4.1) to stitch the pieces together. Iterate on bug fixes and reviewer feedback.

---

## Acceptance criteria for the feature (MVP)

- The Explore landing exists and can present curated motions (using the existing curated flag). The data payload returned by explore_page includes motion metadata and layman_explanation.
- Motion detail returns layman_explanation, the party-match snapshot (existing), and related motions computed from cached neighbor lists when available.
- The background clusterer job can recompute cached neighbor lists, and the CLI can trigger it.
- Tests cover the core algorithm (similarity computation), cache repo serialization, embedders (mocked), and at least one end-to-end smoke integration test.

---

If this plan should be narrowed further (for a smaller initial PR), I recommend focusing on Sprint 1 plus the clusterer CLI (Tasks 1.x + 2.x + 3.1 + 3.2) and deferring UI wiring until the clusterer and cache are validated.
|
---
date: 2026-03-19
topic: "Stemwijzer AI & DB implementation plan"
status: draft
---

## Summary

Implementation plan derived from thoughts/shared/designs/2026-03-19-stemwijzer-design.md.
Goal: add a provider abstraction for AI calls, minimal embeddings stored in DuckDB (JSON), and an ibis-based read DAL. Keep changes small, additive, and well-tested.

## High-level approach (chosen)

- Add **ai_provider**: an adapter exposing get_embedding(text) and chat_completion(messages) with retries and ProviderError.
- Add an **embeddings** table (DuckDB) and store/search helpers in database.py (naive Python cosine scan).
- Add **query_dal**: ibis-based read helpers for Streamlit (get_filtered_motions, calculate_party_matches).
- Refactor the summarizer to call ai_provider and optionally store embeddings.
- Minimal housekeeping fixes: reset.py and SCRAPING_DELAY in scraper.py.

## Micro-tasks (11 tasks)

All tasks are intentionally small (file-level changes + tests). Estimates assume one developer full-time; see the risks and schedule sections below.

Batch 1 (foundation, parallelizable)

1. Add test fixtures for a temporary DuckDB (tests/conftest.py) — 2h — low risk
2. Add migration SQL to create the embeddings table (migrations/2026-03-19-add-embeddings.sql) — 1h — low risk
3. Add the ai_provider adapter (src/ai_provider.py) + tests (tests/test_ai_provider.py) — 6h — medium risk
4. Add a scraper SCRAPING_DELAY default (src/scraper.py) + tests — 1h — low risk
5. Fix the reset script to run migrations (src/reset.py) + tests — 2h — low risk

Batch 2 (core modules)

6. Add store_embedding and search_similar to src/database.py + tests (tests/test_database_embeddings.py) — 8h — medium risk
7. Add query_dal (src/query_dal.py) with ibis reads + tests (tests/test_query_dal.py) — 6h — medium risk
8. Refactor the summarizer to use ai_provider and optionally store embeddings (src/summarizer.py) + tests (tests/test_summarizer.py) — 6h — medium risk

Batch 3 (integration)

9. Add a CLI semantic search helper (src/cli_search.py) + tests — 4h — low-medium risk
10. Update app read paths to use query_dal (src/app.py) + tests — 3h — low risk

Batch 4 (docs/config)

11. Add .env.example entries for the new env vars — 1h — low risk

## PR order (recommended, small focused PRs)

1. PR A — tests/conftest (fixtures)
2. PR B — migration SQL (embeddings table)
3. PR C — ai_provider + tests
4. PR D — database store/search helpers + tests
5. PR E — query_dal + tests
6. PR F — summarizer refactor + tests
7. PR G — cli_search + tests
8. PR H — app read changes + tests
9. PR I — scraper/reset small fixes + tests
10. PR J — .env.example

## Estimates & schedule (one dev, full-time ~8h/day)

- Total estimated effort: ~50 hours (~6.25 days) + buffer → ~7 calendar days.
- Conservative schedule: Batch 1 (2 days), Batch 2 (3 days), Batch 3 (1 day), Buffer/Review (1 day).

## DB migration steps

- Add migrations/2026-03-19-add-embeddings.sql (additive).
- Apply on staging first; back up the DB, run the migration, verify with `SELECT count(*) FROM embeddings`.
- No changes to the motions table in the first iteration.

## Testing strategy

- Unit tests for ai_provider (mock HTTP responses). Use monkeypatch to avoid network access.
- DB tests use temporary DuckDB files (pytest fixtures) to verify storing and searching embeddings.
- query_dal tests use ibis.duckdb.connect against a temporary DB file and parse JSON fields.
- Summarizer tests mock ai_provider to assert DB writes (summary and optional embedding).
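The summarizer-test style described above, sketched with unittest.mock. summarize_motion and write_summary are hypothetical names standing in for the real summarizer and DB helper:

```python
from unittest import mock


def summarize_motion(motion_text, ai, db):
    """Sketch of the refactored summarizer: AI failures are non-fatal (fallback summary)."""
    try:
        summary = ai.summarize(motion_text)
    except Exception:
        summary = ""  # fallback; real code would also log and surface a UI message
    db.write_summary(summary)
    return summary


# test style: mock the provider, assert the DB write happened with the AI output
ai = mock.Mock()
ai.summarize.return_value = "short summary"
db = mock.Mock()
assert summarize_motion("long motion text", ai, db) == "short summary"
db.write_summary.assert_called_once_with("short summary")
```

The same pattern with side_effect on the mock exercises the fallback path.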

## Error handling

- ai_provider: retry/backoff for transient errors; raise ProviderError for terminal failures.
- Summarizer: non-fatal on AI failures — write a fallback/empty summary, log, and surface a message in the UI when interactive.
- DB functions: keep try/except patterns and ensure connections are closed on error.
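A sketch of the retry/backoff wrapper with a terminal ProviderError, as described in the first bullet (attempt counts and delays are illustrative defaults):

```python
import time


class ProviderError(Exception):
    """Terminal provider failure after retries are exhausted."""


def with_retries(call, attempts=3, base_delay=0.5):
    """Retry transient failures with exponential backoff; raise ProviderError at the end."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == attempts - 1:
                raise ProviderError(str(exc)) from exc
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

ai_provider would wrap its HTTP calls in with_retries so callers only ever see a result or a ProviderError.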

## Risks & mitigations

- ai_provider changes: medium risk — mitigate with retries, a clear ProviderError, and thorough unit tests.
- Embedding search: medium risk (naive scan performance) — mitigate by keeping the implementation simple and planning for ANN/FAISS later.
- ibis usage: medium risk — mitigate with tests and keep query_dal narrow.

## Next actions (what I'll do now)

- I wrote this implementation plan to thoughts/shared/plans/2026-03-19-stemwijzer-plan.md (draft).
- I will NOT start applying code changes automatically. If you want, I can:
  - (A) Create the first PR patch (tests/conftest.py + migration) and open a draft for review, or
  - (B) Start implementing Task 1.1 (ai_provider) next.

Interrupt if you want changes to the plan or a different PR ordering. Otherwise tell me which task to start and I'll create the first patch.
|
#!/usr/bin/env python3
"""Query Tweede Kamer OData endpoints to locate motion body text.

This script performs the API calls described in the task and prints
structured information about responses (status code, keys, candidate
fields that may contain text or content URLs).

File: tools/query_tk_api.py
"""

import json
import sys
from urllib.parse import quote

try:
    import requests
except ImportError:
    print("missing requests library", file=sys.stderr)
    raise


BASE = "https://gegevensmagazijn.tweedekamer.nl/OData/v4/2.0"
ZAAK_ID = "e6fd62f1-29be-4955-9811-03d46da2fc3a"


def try_get(path):
    """GET an OData path and print the status plus JSON keys or raw-body info."""
    url = BASE.rstrip("/") + "/" + path.lstrip("/")
    print("\nGET", url)
    r = requests.get(url, headers={"Accept": "application/json"})
    print("->", r.status_code, r.headers.get("Content-Type"))
    ct = r.headers.get("Content-Type", "")
    if "application/json" in ct or r.text.strip().startswith("{"):
        try:
            j = r.json()
            print("JSON keys:", list(j.keys()))
            print("JSON preview:", json.dumps(j, indent=2)[:4000])  # truncate long bodies
            return j
        except ValueError as e:
            print("failed to parse json:", e)
    else:
        print("text length:", len(r.content))
        print("headers:", dict(r.headers))
        print("first 800 bytes:\n", r.content[:800])
    return None


def extract_entity(j):
    """Return the entity dict from either a single-entity or a collection response."""
    if not isinstance(j, dict):
        return None
    val = j.get("value")
    if isinstance(val, list):
        return val[0] if val else None
    return j


def main():
    # 1. Zaak with expanded Document; try the key syntaxes OData servers accept
    patterns = [
        f"Zaak({ZAAK_ID})?$expand=Document",
        f"Zaak(guid'{ZAAK_ID}')?$expand=Document",
        f"Zaak('{ZAAK_ID}')?$expand=Document",
    ]
    zaak = None
    for p in patterns:
        zaak = extract_entity(try_get(p))
        if zaak and ("Document" in zaak or "Documents" in zaak):
            break

    print("\n--- Zaak object (extracted) ---")
    print(json.dumps(zaak, indent=2)[:4000])

    docs = []
    if zaak:
        # the navigation property may be named 'Document' or 'Documents'
        for key in ("Document", "Documents"):
            val = zaak.get(key)
            if isinstance(val, list):
                docs.extend(val)
            elif isinstance(val, dict):
                docs.append(val)

    print("\nFound", len(docs), "Document entries")
    for i, d in enumerate(docs):
        print("\n--- Document", i, "---")
        print(json.dumps(d, indent=2)[:4000])

    # 2. Try the DocumentVersie endpoint for each document
    for d in docs:
        doc_id = d.get("Id") or d.get("DocumentId") or d.get("IdDocument")
        if not doc_id and "@odata.id" in d:
            # fall back to the last segment of the @odata.id URI
            doc_id = d["@odata.id"].rstrip("/").split("/")[-1]
        if not doc_id:
            continue
        print("\nQuerying DocumentVersie for Document id:", doc_id)
        try_get(f"DocumentVersie?$filter=DocumentId%20eq%20guid'{doc_id}'")
        # also try expanding from Document, and a direct DocumentVersie lookup by key
        try_get(f"Document({quote(doc_id)})?$expand=DocumentVersie")
        try_get(f"DocumentVersie(guid'{doc_id}')")

        # 3. Try content-stream patterns for the raw document body
        candidates = [
            f"Document({quote(doc_id)})/Content",
            f"Document({quote(doc_id)})/$value",
            f"Document({quote(doc_id)})/Inhoud",
            f"Resource('{doc_id}')",
            f"Resource({quote(doc_id)})",
        ]
        for c in candidates:
            try_get(c)


if __name__ == "__main__":
    main()