feat: complete parliamentary embedding pipeline with full historical coverage

- Add fused (SVD + text) embedding pipeline for annual windows 2016-2026
- Fix store_fused_embedding duplicate bug: DELETE before INSERT (idempotent)
- Add --text-batch-size CLI flag to run_pipeline.py (default 200)
- Add explicit --start-date/--end-date to download_past_year.py
- Backfill mp_votes for all motions (party-level votes, 111k new rows)
- Add similarity cache recompute: 212k rows across 9 annual windows
- Improve ai_provider retry logic, text_pipeline batching
- Improve analysis/political_axis PCA handling and visualizations
- Add diagnostic/utility scripts: compare_svd, generate_compass, inspect_axis, etc.
- Untrack data/motions.db (3.6GB binary), add to .gitignore with outputs/
- Update continuity ledger with full session state
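
Example invocations of the new flags (illustrative; dates and batch size are placeholders):

    uv run python pipeline/run_pipeline.py --text-batch-size 200
    uv run python scripts/download_past_year.py --start-date 2016-01-01 --end-date 2026-01-01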
Branch: main · Sven Geboers, 1 month ago
parent a78bee9b0a · commit daa22c5e2b
31 changed files (lines changed · path):

   9  .gitignore
  88  ai_provider.py
  76  analysis/political_axis.py
 131  analysis/visualize.py
   3  api_client.py
 BIN  data/motions.db
  13  database.py
   7  outputs/anchor_axis_2025_Q2.html
   7  outputs/anchor_axis_2025_Q3.html
   7  outputs/anchor_axis_2025_Q4.html
   7  outputs/anchor_axis_2026_Q1.html
   7  outputs/political_axis_2025_Q1.html
   7  outputs/political_axis_2025_Q2.html
   7  outputs/political_axis_2025_Q3.html
   7  outputs/political_axis_2025_Q4.html
   7  outputs/political_axis_2026_Q1.html
   7  outputs/trajectories_normalized_top15.html
   7  outputs/trajectories_party_aligned.html
   7  outputs/trajectories_top15.html
   8  pipeline/run_pipeline.py
  61  pipeline/text_pipeline.py
 204  scripts/compare_svd_exclude_parties.py
  33  scripts/download_past_year.py
 277  scripts/fill_mp_votes_parties.py
 157  scripts/generate_compass.py
 137  scripts/inspect_axis.py
 167  scripts/recompute_svd.py
 214  scripts/svd_diagnostics.py
  10  tests/integration/test_pipeline_end_to_end.py
   8  tests/test_text_pipeline.py
 111  thoughts/ledgers/CONTINUITY_stemwijzer.md

.gitignore (vendored) · 9 lines changed

@@ -8,3 +8,12 @@ wheels/
# Virtual environments
.venv
# Database files (large binary, not suited for git)
data/*.db
data/*.bak
data/*.json
# Generated output files
outputs/
outputs_*/

ai_provider.py
@@ -110,30 +110,6 @@ def _post_with_retries(
time.sleep(sleep)
continue
# Treat 429 (rate limiting) as transient and respect Retry-After header when present
if status == 429:
if attempt == retries:
raise ProviderError(f"Provider returned HTTP {resp.status_code}")
retry_after = None
try:
# header may be present as int seconds or as string
retry_after = resp.headers.get("Retry-After")
except Exception:
retry_after = None
if retry_after is not None:
try:
sleep = float(retry_after)
except Exception:
# fallback to exponential backoff if header unparsable
sleep = backoff * (2 ** (attempt - 1))
else:
sleep = backoff * (2 ** (attempt - 1))
sleep = sleep + random.uniform(0, sleep * 0.1)
time.sleep(sleep)
continue
return resp
# Should not reach here
@@ -184,6 +160,70 @@ def get_embedding(text: str, model: str | None = None) -> list[float]:
return [float(x) for x in embedding]
def get_embeddings_batch(
texts: list[str], model: str | None = None, batch_size: int = 50
) -> list[list[float]]:
"""Return embedding vectors for multiple texts using batched API calls.
The OpenAI/OpenRouter /embeddings endpoint accepts an array of inputs.
This sends texts in chunks of `batch_size` and returns one embedding per input,
preserving order. Raises ProviderError on failure.
"""
if not texts:
return []
if model is None:
model = (
os.environ.get("EMBEDDING_MODEL")
or os.environ.get("QWEN_EMBEDDING_MODEL")
or "qwen/qwen3-embedding-4b"
)
all_embeddings: list[list[float]] = []
for start in range(0, len(texts), batch_size):
chunk = texts[start : start + batch_size]
resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
try:
data = resp.json()
except Exception as exc:
raise ProviderError(f"Invalid JSON response from provider: {exc}") from exc
try:
items = data["data"]
except Exception as exc:
# Check local fallback
fallback = os.environ.get(
"ALLOW_LOCAL_EMBED_FALLBACK", "false"
).lower() in ("1", "true", "yes")
if fallback:
dim = int(os.environ.get("LOCAL_EMBED_DIM", "64"))
all_embeddings.extend(_local_embedding(t, dim=dim) for t in chunk)
continue
raise ProviderError(
f"Unexpected batch embedding response shape: {data}"
) from exc
# Sort by index to guarantee order (API spec says index field is present)
items_sorted = sorted(items, key=lambda x: x.get("index", 0))
if len(items_sorted) != len(chunk):
raise ProviderError(
f"Expected {len(chunk)} embeddings, got {len(items_sorted)}"
)
for item in items_sorted:
emb = item.get("embedding")
if not isinstance(emb, list):
raise ProviderError(
f"Embedding at index {item.get('index')} is not a list"
)
all_embeddings.append([float(x) for x in emb])
return all_embeddings
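# Illustrative usage sketch (not part of the diff): embeddings come back in
# input order, one vector per text, chunked transparently across HTTP calls.
#
#     vecs = get_embeddings_batch(["tekst a", "tekst b", "tekst c"], batch_size=2)
#     assert len(vecs) == 3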
def _local_embedding(text: str, dim: int = 64) -> list[float]:
"""Deterministic local fallback embedding based on SHA256.

analysis/political_axis.py
@@ -161,6 +161,11 @@ def compute_2d_axes(
to load and align windows so the returned coordinates are consistent
across windows.
"""
# Import trajectory helper at runtime so tests can monkeypatch sys.modules
import importlib
_trajectory = importlib.import_module("analysis.trajectory")
if window_ids is None:
window_ids = _trajectory._load_window_ids(db_path)
@@ -238,6 +243,77 @@ def compute_2d_axes(
"pca_residual_used": bool(pca_residual or evr1 > 0.85),
}
# Ensure consistent left/right and progressive/conservative orientation
# by checking canonical party centroids and flipping axis signs if needed.
try:
right_parties = {"PVV", "VVD", "FVD", "BBB", "JA21"}
left_parties = {"SP", "PvdA", "GroenLinks", "GroenLinks-PvdA", "DENK"}
cons_parties = {"PVV", "VVD", "FVD", "CDA", "SGP", "BBB", "JA21"}
prog_parties = {
"GroenLinks",
"PvdA",
"PvdD",
"SP",
"GroenLinks-PvdA",
"DENK",
}
# Build mapping of entity -> vector from stacked matrix M
ent_to_vec = {ent: vec for (wid, ent), vec in zip(entity_index, M)}
def _centroid_for_party_set(party_set):
vecs = []
for p in party_set:
if p in ent_to_vec:
vecs.append(ent_to_vec[p])
try:
conn = duckdb.connect(db_path)
rows = conn.execute(
"SELECT mp_name, party FROM mp_metadata"
).fetchall()
conn.close()
except Exception:
rows = []
for mp_name, party in rows:
if party in party_set and mp_name in ent_to_vec:
vecs.append(ent_to_vec[mp_name])
if not vecs:
return None
return np.mean(np.vstack(vecs), axis=0)
# X-axis: left vs right
left_cent = _centroid_for_party_set(left_parties)
right_cent = _centroid_for_party_set(right_parties)
if left_cent is not None and right_cent is not None:
left_proj = float(np.dot(left_cent - M.mean(axis=0), comp1_hat))
right_proj = float(np.dot(right_cent - M.mean(axis=0), comp1_hat))
if right_proj < left_proj:
_logger.info(
"Flipping PCA x-axis to match canonical left/right orientation (right_proj=%.3f left_proj=%.3f)",
right_proj,
left_proj,
)
axes["x_axis"] = -axes["x_axis"]
# Y-axis: progressive vs conservative — flip so positive = progressive
prog_cent = _centroid_for_party_set(prog_parties)
cons_cent = _centroid_for_party_set(cons_parties)
if prog_cent is not None and cons_cent is not None:
prog_proj = float(np.dot(prog_cent - M.mean(axis=0), comp2_hat))
cons_proj = float(np.dot(cons_cent - M.mean(axis=0), comp2_hat))
# We want positive Y to mean 'progressive'. If the progressive
# centroid currently projects lower than the conservative centroid,
# flip the sign so progressive > conservative.
if prog_proj < cons_proj:
_logger.info(
"Flipping PCA y-axis so positive Y corresponds to progressive (prog_proj=%.3f cons_proj=%.3f)",
prog_proj,
cons_proj,
)
axes["y_axis"] = -axes["y_axis"]
except Exception:
_logger.debug("Could not auto-orient PCA axes; leaving signs as-is")
# warn if PCA is effectively 1-D
if evr1 > 0.85 and not pca_residual:
_logger.warning(

analysis/visualize.py
@@ -27,6 +27,58 @@ def _require_plotly():
raise ImportError("plotly is not installed. Install it with: uv add plotly")
def _load_party_map(db_path: str = "data/motions.db") -> Dict[str, str]:
"""Build a party mapping mp_name -> party.
Prefers mp_metadata where available; otherwise uses majority-party from mp_votes.
Returns a dict of mp_name -> party (strings).
"""
try:
import duckdb
except Exception:
_logger.debug("duckdb not available when building party map")
return {}
conn = duckdb.connect(db_path)
try:
# metadata-based mapping
rows = conn.execute(
"SELECT mp_name, party FROM mp_metadata WHERE party IS NOT NULL"
).fetchall()
meta_map = {r[0]: r[1] for r in rows}
# majority-party heuristic from mp_votes
rows = conn.execute(
"""
SELECT mp_name, party, COUNT(*) as n
FROM mp_votes
WHERE party IS NOT NULL
GROUP BY mp_name, party
"""
).fetchall()
counts: Dict[str, List[tuple]] = {}
for mp_name, party, n in rows:
counts.setdefault(mp_name, []).append((party, n))
maj_map: Dict[str, str] = {}
for mp_name, arr in counts.items():
maj_map[mp_name] = max(arr, key=lambda x: x[1])[0]
merged = dict(maj_map)
# prefer metadata mapping when available
merged.update(meta_map)
_logger.info(
"Built party map: %d from mp_votes majority, %d from mp_metadata",
len(maj_map),
len(meta_map),
)
return merged
finally:
try:
conn.close()
except Exception:
pass
def plot_umap_scatter(
motion_ids: List[int],
coords: List[List[float]],
@@ -194,6 +246,7 @@ def plot_political_compass(
try:
import duckdb # type: ignore
conn = None
try:
conn = duckdb.connect(database="data/motions.db", read_only=True)
df = conn.execute("SELECT mp_name, party FROM mp_metadata").fetchdf()
@@ -206,6 +259,7 @@ def plot_political_compass(
len(party_of),
)
finally:
if conn is not None:
try:
conn.close()
except Exception:
@@ -221,8 +275,18 @@ def plot_political_compass(
scaled_ys = ys
if axis_def and y_scale is None:
evr = axis_def.get("explained_variance_ratio") if axis_def else None
if evr and isinstance(evr, (list, tuple)) and len(evr) >= 2:
evr1, evr2 = evr[0], evr[1]
# Accept lists/tuples or numpy arrays; avoid ambiguous truth checks
evr_list = None
if evr is not None:
try:
evr_list = list(evr)
except Exception:
try:
evr_list = [float(evr)]
except Exception:
evr_list = None
if evr_list is not None and len(evr_list) >= 2:
evr1, evr2 = float(evr_list[0]), float(evr_list[1])
if evr2 < 1e-6:
scale_guess = 1.0
else:
@@ -237,30 +301,42 @@ def plot_political_compass(
elif axis_def and y_scale is not None:
scaled_ys = [y * float(y_scale) for y in ys]
# mark unknowns differently
unknown_flags = [1 if parties[i] == "Unknown" else 0 for i in range(len(names))]
# mark unknowns differently: use descriptive labels so the legend doesn't
# show numeric symbol values like "PVV, 0" when color and symbol combine.
unknown_labels = [
"Unknown" if parties[i] == "Unknown" else "Known" for i in range(len(names))
]
fig = px.scatter(
x=xs,
y=scaled_ys,
color=parties,
symbol=unknown_flags,
symbol=unknown_labels,
hover_name=names,
title=f"Political Compass ({window_id})",
labels={
"x": "Left ← — → Right",
"y": "Progressive ← — → Conservative",
"color": "Party",
"symbol": "Unknown",
"symbol": "Known?",
},
)
fig.update_traces(marker=dict(size=8, opacity=0.85))
# annotate explained variance if available
if axis_def and axis_def.get("method") == "pca":
evr = axis_def.get("explained_variance_ratio")
if evr and len(evr) >= 2:
evr_list = None
if evr is not None:
try:
evr_list = list(evr)
except Exception:
try:
evr_list = [float(evr)]
except Exception:
evr_list = None
if evr_list is not None and len(evr_list) >= 2:
fig.update_layout(
title=f"Political Compass ({window_id}) — PCA EVR PC1={evr[0] * 100:.1f}%, PC2={evr[1] * 100:.1f}%"
title=f"Political Compass ({window_id}) — PCA EVR PC1={evr_list[0] * 100:.1f}%, PC2={evr_list[1] * 100:.1f}%"
)
fig.write_html(output_path, include_plotlyjs="cdn")
_logger.info("Political compass written to %s", output_path)
@@ -309,6 +385,45 @@ def plot_2d_trajectories(
)
)
# Add an arrow indicating the final direction (only one arrow per MP to
# avoid clutter). Use an annotation with an arrowhead from the penultimate
# to the last point and label the endpoint with the MP name.
try:
if len(xs) >= 2:
x0, y0 = xs[-2], ys[-2]
x1, y1 = xs[-1], ys[-1]
# small style choices — subtle arrow and a short label
fig.add_annotation(
x=x1,
y=y1,
ax=x0,
ay=y0,
xref="x",
yref="y",
axref="x",
ayref="y",
showarrow=True,
arrowhead=3,
arrowsize=1.0,
arrowwidth=1.2,
arrowcolor="rgba(0,0,0,0.6)",
opacity=0.8,
)
# endpoint label slightly offset to reduce overlap with marker
fig.add_annotation(
x=x1,
y=y1,
xref="x",
yref="y",
text=mp,
showarrow=False,
xanchor="left",
yanchor="bottom",
font=dict(size=10, color="rgba(0,0,0,0.8)"),
)
except Exception:
_logger.exception("Failed to add arrow/label for MP %s", mp)
fig.update_layout(
title="MP Trajectories on Political Compass",
xaxis_title="Left ← — → Right",

api_client.py
@@ -178,7 +178,8 @@ class TweedeKamerAPI:
# Extract party and vote information
party_name = record.get("ActorNaam")
vote_type = record.get("Soort", "").lower()
# Some records have Soort explicitly set to None; guard against that
vote_type = str(record.get("Soort") or "").lower()
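# Note (illustrative): dict.get("Soort", "") only falls back when the key is
# absent; a key present with value None still returns None, and None.lower()
# raises AttributeError. str(record.get("Soort") or "") covers both cases.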
record_date = record.get("GewijzigdOp", "")
if not party_name:

data/motions.db: Binary file not shown.

database.py
@@ -464,18 +464,18 @@ class MotionDatabase:
"""Store an embedding for a motion. Returns inserted row id or -1 on failure."""
try:
conn = duckdb.connect(self.db_path)
# store vector as JSON
# Use explicit nextval for id since older tables may lack DEFAULT
conn.execute(
"INSERT INTO embeddings (motion_id, model, vector, created_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
"INSERT INTO embeddings (id, motion_id, model, vector, created_at) VALUES (nextval('embeddings_id_seq'), ?, ?, ?, CURRENT_TIMESTAMP)",
(motion_id, model, json.dumps(vector)),
)
row = conn.execute("SELECT max(id) FROM embeddings").fetchone()
row = conn.execute("SELECT currval('embeddings_id_seq')").fetchone()
conn.close()
if row and row[0] is not None:
return int(row[0])
return -1
except Exception as e:
print(f"Error storing embedding: {e}")
_logger.error("Error storing embedding: %s", e)
try:
conn.close()
except Exception:
@@ -685,6 +685,11 @@ class MotionDatabase:
) -> int:
try:
conn = duckdb.connect(self.db_path)
# Delete any existing row for this (motion_id, window_id) to prevent duplicates
conn.execute(
"DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
(motion_id, window_id),
)
conn.execute(
"""
INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims, created_at)

outputs/*.html (12 regenerated files): file diffs suppressed because one or more lines are too long.

pipeline/run_pipeline.py
@@ -174,7 +174,7 @@ def run(args: argparse.Namespace) -> int:
from pipeline.text_pipeline import ensure_text_embeddings
stored, existing, no_text, errors = ensure_text_embeddings(
db_path=db_path, model=args.text_model
db_path=db_path, model=args.text_model, batch_size=args.text_batch_size
)
_logger.info(
" embeddings: stored=%d existing=%d no_text=%d errors=%d",
@@ -240,6 +240,12 @@ def build_parser() -> argparse.ArgumentParser:
default=None,
help="Text embedding model (default: ai_provider default)",
)
parser.add_argument(
"--text-batch-size",
type=int,
default=200,
help="Number of texts per embedding API call (default: 200)",
)
parser.add_argument(
"--skip-metadata", action="store_true", help="Skip MP metadata fetch"
)

pipeline/text_pipeline.py
@@ -55,10 +55,11 @@ def _select_text(
def ensure_text_embeddings(
db_path: Optional[str] = None, model: Optional[str] = None
db_path: Optional[str] = None, model: Optional[str] = None, batch_size: int = 50
) -> Tuple[int, int, int, int]:
"""Ensure all motions have text embeddings for `model`.
Uses batched API calls (batch_size texts per HTTP request) for speed.
Returns tuple (stored_count, skipped_existing, skipped_no_text, errors).
"""
model = model or DEFAULT_MODEL
@@ -87,14 +88,54 @@ def ensure_text_embeddings(
skipped_no_text = 0
errors = 0
# Separate motions with text from those without
with_text: List[Tuple[int, str]] = []
for motion_id, text in to_process:
if not text:
_logger.info("Skipping motion %s: no text available", motion_id)
skipped_no_text += 1
continue
else:
with_text.append((motion_id, text))
_logger.info(
"Processing %d motions in batches of %d (%d skipped no text, %d already exist)",
len(with_text),
batch_size,
skipped_no_text,
existing,
)
# Process in batches
for batch_start in range(0, len(with_text), batch_size):
batch = with_text[batch_start : batch_start + batch_size]
batch_ids = [mid for mid, _ in batch]
batch_texts = [txt for _, txt in batch]
try:
vec = ai_provider.get_embedding(text, model=model)
vecs = ai_provider.get_embeddings_batch(
batch_texts, model=model, batch_size=batch_size
)
except Exception as exc:
_logger.error(
"Batch embedding failed for motions %s..%s: %s",
batch_ids[0],
batch_ids[-1],
exc,
)
errors += len(batch)
continue
if len(vecs) != len(batch):
_logger.error(
"Batch size mismatch: expected %d, got %d embeddings",
len(batch),
len(vecs),
)
errors += len(batch)
continue
batch_stored = 0
for (motion_id, _text), vec in zip(batch, vecs):
if not isinstance(vec, list):
_logger.warning(
"Embedding provider returned non-list for motion %s", motion_id
@@ -102,9 +143,11 @@ ensure_text_embeddings(
errors += 1
continue
try:
res = db.store_embedding(motion_id, model, vec)
if res and res > 0:
stored += 1
batch_stored += 1
else:
_logger.error(
"Failed to store embedding for motion %s (store returned %s)",
@@ -114,9 +157,19 @@ def ensure_text_embeddings(
errors += 1
except Exception as exc:
_logger.error(
"Error computing/storing embedding for motion %s: %s", motion_id, exc
"Error storing embedding for motion %s: %s", motion_id, exc
)
errors += 1
_logger.info(
"Batch %d-%d: stored %d/%d (total: %d/%d)",
batch_start,
batch_start + len(batch),
batch_stored,
len(batch),
stored + existing,
total_motions,
)
skipped_existing = int(existing)
return stored, skipped_existing, skipped_no_text, errors
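# Illustrative usage sketch (assumes provider credentials are configured via
# the environment; counts are returned, nothing is printed):
#
#     stored, existing, no_text, errors = ensure_text_embeddings(
#         db_path="data/motions.db", batch_size=200
#     )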

scripts/compare_svd_exclude_parties.py
@@ -0,0 +1,204 @@
"""Compare PCA axes with and without party-level vectors present.
Generates diagnostics and HTML plots (when plotly available) into outputs/.
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import sys
from typing import Dict, List
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
sys.path.insert(0, ROOT)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("compare_svd_exclude_parties")
def main(argv: List[str] | None = None):
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db")
p.add_argument("--out", default="outputs")
args = p.parse_args(argv)
os.makedirs(args.out, exist_ok=True)
try:
from analysis import trajectory as traj
from analysis.visualize import (
_load_party_map,
plot_political_compass,
plot_2d_trajectories,
)
import numpy as np
except Exception as e:
logger.exception("Failed to import analysis modules: %s", e)
raise
window_ids = traj._load_window_ids(args.db)
if not window_ids:
logger.error("No SVD windows found")
return 1
latest = sorted(window_ids)[-1]
# load raw vectors for latest window
conn = None
try:
# build party name set from mp_metadata
import duckdb
conn = duckdb.connect(args.db)
rows = conn.execute(
"SELECT DISTINCT party FROM mp_metadata WHERE party IS NOT NULL"
).fetchall()
party_names = set(r[0] for r in rows if r[0])
finally:
if conn:
try:
conn.close()
except Exception:
pass
raw = traj._load_mp_vectors_for_window(args.db, latest)
# group by vector JSON-like key
groups: Dict[str, List[str]] = {}
for ent, vec in raw.items():
key = tuple([round(float(x), 8) for x in vec.tolist()])
groups.setdefault(str(key), []).append(ent)
group_list = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
top_groups = [(len(v), v[:8]) for k, v in group_list[:20]]
logger.info("Top duplicate groups (count, sample entities): %s", top_groups)
# entities that are party names
party_entities = [ent for ent in raw.keys() if ent in party_names]
logger.info(
"Found %d party-like entities in svd_vectors for %s",
len(party_entities),
latest,
)
# Build aligned windows excluding party-level entities
raw_window_vecs = {
wid: traj._load_mp_vectors_for_window(args.db, wid) for wid in window_ids
}
# create filtered copy that removes party-level entity ids
filtered_window_vecs = {
wid: {ent: vec for ent, vec in d.items() if ent not in party_names}
for wid, d in raw_window_vecs.items()
}
aligned_filtered = traj._procrustes_align_windows(filtered_window_vecs)
# stack and compute PCA
all_vecs = []
entity_index = []
for wid, d in aligned_filtered.items():
for ent, v in d.items():
n = np.linalg.norm(v)
all_vecs.append(v / n if n > 1e-10 else v)
entity_index.append((wid, ent))
if not all_vecs:
logger.error("No vectors left after excluding parties — aborting")
return 2
M = np.vstack(all_vecs)
Mc = M - M.mean(axis=0)
try:
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
except Exception:
logger.exception("SVD failed on filtered data")
return 3
sv2 = s**2
evr = sv2 / (sv2.sum() + 1e-20)
logger.info("Filtered PCA EVR top2: %s", evr[:2].tolist())
comp1 = Vt[0]
comp1_hat = comp1 / (np.linalg.norm(comp1) + 1e-12)
comp2 = Vt[1] if Vt.shape[0] > 1 else np.zeros_like(comp1)
comp2_hat = comp2 / (np.linalg.norm(comp2) + 1e-12)
# project filtered entities for latest window
filtered_positions = {}
global_mean = M.mean(axis=0)
for (wid, ent), vec in zip(entity_index, M):
if wid != latest:
continue
v_centered = vec - global_mean
x = float(np.dot(v_centered, comp1_hat))
y = float(np.dot(v_centered, comp2_hat))
filtered_positions[ent] = (x, y)
# save JSON and small report
out_json = os.path.join(args.out, "svd_filtered_positions.json")
with open(out_json, "w", encoding="utf-8") as f:
json.dump(
{
"latest": latest,
"positions": filtered_positions,
"evr": evr[:2].tolist(),
},
f,
indent=2,
)
logger.info("Wrote filtered positions to %s", out_json)
# Also generate plots if plotly available
try:
party_map = _load_party_map(args.db)
# positions_by_window format expected by plot functions — include only latest
positions_by_window = {latest: filtered_positions}
pcomp_out = os.path.join(args.out, f"political_compass_filtered_{latest}.html")
plot_political_compass(
positions_by_window,
window_id=latest,
party_of=party_map,
axis_def={"method": "pca", "explained_variance_ratio": evr[:2]},
output_path=pcomp_out,
)
logger.info("Wrote filtered compass to %s", pcomp_out)
# simple trajectory plotting for filtered set — top movers by count
traj_out = os.path.join(args.out, f"trajectories_filtered_{latest}.html")
# Build simple per-MP coords across windows for filtered set
mp_coords = {}
for wid in window_ids:
for ent, coord in aligned_filtered.get(wid, {}).items():
if ent not in mp_coords:
mp_coords[ent] = []
mp_coords[ent].append((wid, tuple(coord.tolist())))
# pick MPs with at least 2 windows
names = [n for n, v in mp_coords.items() if len(v) >= 2]
plot_2d_trajectories(
{
wid: {
n: mp_coords[n][i][1]
for n in names
for i, (w, _) in enumerate(mp_coords[n])
if w == wid
}
for wid in window_ids
},
mp_names=names[:50],
output_path=traj_out,
)
logger.info("Wrote filtered trajectories to %s", traj_out)
except Exception:
logger.exception("Plotting filtered results failed — plots skipped")
# console summary
print("Top duplicate groups (count, sample):")
for k, v in group_list[:20]:
print(len(v), v[:6])
return 0
if __name__ == "__main__":
raise SystemExit(main())

scripts/download_past_year.py
@@ -1,12 +1,13 @@
"""download_past_year.py — One-shot data download: past year of parliamentary motions.
"""download_past_year.py — One-shot data download: parliamentary motions for a date range.
Fetches Stemming records from the OData API in quarterly chunks (90-day windows),
Fetches Stemming records from the OData API in chunks (default 90-day windows),
stores motions into data/motions.db using MotionDatabase.insert_motion().
Skips AI summarisation: this is a raw data fetch for the embedding pipeline.
Usage:
uv run python scripts/download_past_year.py [--db-path data/motions.db] [--days 365]
uv run python scripts/download_past_year.py --start-date 2019-01-01 --end-date 2022-01-01
"""
import argparse
@@ -21,10 +22,25 @@ from database import MotionDatabase
def main():
parser = argparse.ArgumentParser(description="Download past year of motions")
parser = argparse.ArgumentParser(description="Download motions for a date range")
parser.add_argument("--db-path", default="data/motions.db")
parser.add_argument(
"--days", type=int, default=365, help="How many days back to fetch"
"--days",
type=int,
default=365,
help="How many days back to fetch (ignored if --start-date given)",
)
parser.add_argument(
"--start-date",
type=str,
default=None,
help="Explicit start date YYYY-MM-DD (overrides --days)",
)
parser.add_argument(
"--end-date",
type=str,
default=None,
help="Explicit end date YYYY-MM-DD (default: today)",
)
parser.add_argument("--chunk-days", type=int, default=90, help="Days per API chunk")
parser.add_argument(
@@ -41,7 +57,14 @@ def main():
api = TweedeKamerAPI()
db = MotionDatabase(args.db_path)
end_date = datetime.now()
end_date = (
datetime.strptime(args.end_date, "%Y-%m-%d")
if args.end_date
else datetime.now()
)
if args.start_date:
start_date = datetime.strptime(args.start_date, "%Y-%m-%d")
else:
start_date = end_date - timedelta(days=args.days)
print(

scripts/fill_mp_votes_parties.py
@@ -0,0 +1,277 @@
"""Backfill missing mp_votes.party values from mp_metadata and co-voting inference.
Multi-tier strategy:
1) Tussenvoegsel-aware name match against mp_metadata.
2) Majority party already recorded in mp_votes for the same MP.
3) Looser last-name-token match against mp_metadata.
4) Co-voting inference: for MPs still unresolved, find which party's MPs
they vote identically with most often, using a Jaccard-style overlap.
Usage:
uv run python3 scripts/fill_mp_votes_parties.py --db data/motions.db
"""
from __future__ import annotations
import argparse
import logging
import re
import unicodedata
from collections import defaultdict
from datetime import datetime
import duckdb
logger = logging.getLogger("fill_mp_votes_parties")
_TUSSENVOEGSEL = {
"van de",
"van den",
"van der",
"van het",
"van",
"de",
"den",
"der",
"het",
"ter",
"ten",
"el",
"al",
"in 't",
}
# Build a regex that matches any known tussenvoegsel (longest first to avoid
# partial matches like "van" eating the "van" in "van der").
_TV_PATTERN = re.compile(
r"\b("
+ "|".join(re.escape(tv) for tv in sorted(_TUSSENVOEGSEL, key=len, reverse=True))
+ r")\b",
re.IGNORECASE,
)
def normalize_mp_key(name: str) -> str:
"""Produce a canonical key that matches regardless of tussenvoegsel position.
Both "Burg van der, E." (mp_votes style) and "Van der Burg, E."
(mp_metadata style) should produce the same key. Also strips diacritics
so "Kostić, I." matches "Kostic, I.".
Strategy: split into pre-comma and post-comma parts. From the pre-comma
part, extract any tussenvoegsel tokens and the remaining lastname.
Canonical key = "lastname tussenvoegsel initials", all lowercased.
"""
if not name:
return ""
# Strip diacritics: NFD decompose then drop combining marks
s = unicodedata.normalize("NFD", name)
s = "".join(c for c in s if unicodedata.category(c) != "Mn")
# remove parenthetical fullnames e.g. "(Christine)"
s = re.sub(r"\s*\(.*?\)", "", s).strip()
# remove dots and commas for splitting but keep the comma position
# Split on first comma: last_part, initials_part
parts = s.split(",", 1)
last_part = parts[0].strip()
initials_part = parts[1].strip() if len(parts) > 1 else ""
# Clean initials: remove dots
initials = re.sub(r"\.", "", initials_part).strip().lower()
# From last_part, extract tussenvoegsel and lastname
last_lower = last_part.lower()
# Find all tussenvoegsel matches
found_tv = []
remaining = last_lower
for m in _TV_PATTERN.finditer(last_lower):
found_tv.append(m.group(0).lower())
# Remove tussenvoegsel tokens from remaining to get the pure lastname
remaining = _TV_PATTERN.sub("", last_lower).strip()
remaining = re.sub(r"\s+", " ", remaining).strip()
# Sort tussenvoegsel to canonical order
tv_str = " ".join(sorted(found_tv)) if found_tv else ""
# Build canonical key: "lastname tv initials"
key_parts = [remaining]
if tv_str:
key_parts.append(tv_str)
if initials:
key_parts.append(initials)
return " ".join(key_parts)
def pick_preferred_party(records: list) -> str | None:
# records: list of dicts with keys party, van, tot
# prefer active membership
for r in records:
if r.get("tot") is None and r.get("party"):
return r.get("party")
# otherwise pick most recent van
best = None
best_date = None
for r in records:
van = r.get("van")
try:
d = datetime.fromisoformat(van).date() if van else None
except Exception:
d = None
if d and (best_date is None or d > best_date):
best_date = d
best = r
if best:
return best.get("party")
# fallback to any party present
for r in records:
if r.get("party"):
return r.get("party")
return None
def _infer_party_by_covoting(conn, mp_name: str, min_overlap: int = 10) -> str | None:
"""Infer party by finding which known-party MPs vote identically most often.
For each motion where *mp_name* voted, find all other MPs who cast the
same vote AND already have a party assigned. The party with the highest
agreement count wins, provided the overlap exceeds *min_overlap*.
"""
rows = conn.execute(
"""
SELECT other.party, COUNT(*) AS agreement
FROM mp_votes me
JOIN mp_votes other
ON me.motion_id = other.motion_id
AND me.vote = other.vote
WHERE me.mp_name = ?
AND other.mp_name != ?
AND other.party IS NOT NULL
AND other.party != ''
AND other.mp_name LIKE '%,%'
GROUP BY other.party
ORDER BY agreement DESC
LIMIT 5
""",
(mp_name, mp_name),
).fetchall()
if not rows:
return None
best_party, best_count = rows[0]
if best_count < min_overlap:
return None
# Require meaningful margin over second-best to avoid ambiguous assignment
if len(rows) > 1:
second_count = rows[1][1]
# Best must have at least 20% more agreement than runner-up
if best_count < second_count * 1.2:
logger.debug(
"Co-voting ambiguous for %s: %s=%d vs %s=%d",
mp_name,
best_party,
best_count,
rows[1][0],
second_count,
)
return None
logger.info(
"Co-voting inferred %s -> %s (agreement=%d)",
mp_name,
best_party,
best_count,
)
return best_party
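# Worked example (illustrative): best=120 vs runner-up=110 agreements means
# 120 < 110 * 1.2 == 132.0, so the match is ambiguous and None is returned;
# best=150 vs runner-up=110 (150 >= 132) would be accepted.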
def main(argv=None) -> int:
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db")
args = p.parse_args(argv)
conn = duckdb.connect(args.db)
# Load mp_metadata
md_rows = conn.execute(
"SELECT mp_name, party, van, tot_en_met FROM mp_metadata"
).fetchall()
metadata = defaultdict(list)
for mp_name, party, van, tot in md_rows:
key = normalize_mp_key(mp_name)
metadata[key].append(
{"mp_name": mp_name, "party": party, "van": van, "tot": tot}
)
# Build majority-party mapping from existing mp_votes (non-null parties)
party_counts = defaultdict(lambda: defaultdict(int))
rows_counts = conn.execute(
"SELECT mp_name, party, COUNT(*) FROM mp_votes WHERE party IS NOT NULL AND party != '' GROUP BY mp_name, party"
).fetchall()
for mp_name, party, cnt in rows_counts:
key = normalize_mp_key(mp_name)
party_counts[key][party] += cnt
majority_by_norm = {
k: max(v.items(), key=lambda kv: kv[1])[0] for k, v in party_counts.items()
}
# Target mp_votes rows: individual MPs (contain comma) with NULL or empty party
target_rows = conn.execute(
"SELECT id, mp_name FROM mp_votes WHERE (party IS NULL OR party = '') AND mp_name LIKE '%,%'"
).fetchall()
updated = 0
# Track MPs that need co-voting inference (tier 4) — collect after tiers 1-3
covote_candidates: dict[str, list[int]] = defaultdict(list) # mp_name -> [ids]
for id_, mp_name in target_rows:
key = normalize_mp_key(mp_name)
chosen_party = None
# 1) exact normalized metadata match
if key in metadata:
chosen_party = pick_preferred_party(metadata[key])
# 2) fallback to majority observed in mp_votes
if not chosen_party:
chosen_party = majority_by_norm.get(key)
# 3) try looser substring matches on lastname token
if not chosen_party:
tokens = key.split()
if tokens:
lastname = tokens[0]
# find metadata keys that start with lastname
for meta_key, recs in metadata.items():
if meta_key.split()[0] == lastname:
chosen_party = pick_preferred_party(recs)
if chosen_party:
break
if chosen_party:
conn.execute(
"UPDATE mp_votes SET party = ? WHERE id = ?", (chosen_party, id_)
)
updated += 1
else:
covote_candidates[mp_name].append(id_)
# 4) Co-voting inference for remaining unresolved MPs
for mp_name, ids in covote_candidates.items():
inferred = _infer_party_by_covoting(conn, mp_name)
if inferred:
for id_ in ids:
conn.execute(
"UPDATE mp_votes SET party = ? WHERE id = ?", (inferred, id_)
)
updated += 1
conn.close()
logger.info("Updated %d mp_votes rows with party info", updated)
return 0
if __name__ == "__main__":
raise SystemExit(main())

scripts/generate_compass.py
@@ -0,0 +1,157 @@
"""Generate political compass and 2D trajectories HTML outputs.
This script computes 2D axes using residual-PCA (or anchor), applies the
party-fill helper to colour MPs, and writes self-contained HTML files into
an outputs/ directory.
Usage:
python scripts/generate_compass.py --db data/motions.db --out outputs --method pca --pca-residual
The script is defensive: if required optional libraries (duckdb, plotly,
scipy) are missing it will log and exit without raising an uncaught exception.
"""
from __future__ import annotations
import argparse
import logging
import os
import sys
from typing import Optional
# Ensure project root is on sys.path so `import analysis.*` works when the
# script is executed from the repository root or from scripts/ directly.
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
sys.path.insert(0, ROOT)
logger = logging.getLogger("generate_compass")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
def main(argv: Optional[list] = None):
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db", help="Path to duckdb database")
p.add_argument("--out", default="outputs", help="Output directory")
p.add_argument("--method", choices=["pca", "anchor"], default="pca")
p.add_argument(
"--pca-residual", action="store_true", help="Use residual PCA for second axis"
)
p.add_argument(
"--y-scale",
type=float,
default=None,
help="Optional manual y-axis scale multiplier",
)
args = p.parse_args(argv)
# Lazy imports so the script exits gracefully if deps missing
try:
from analysis.political_axis import compute_2d_axes
from analysis.visualize import (
plot_political_compass,
plot_2d_trajectories,
_load_party_map,
)
except Exception as e: # pragma: no cover - runtime helper
logger.exception("Required analysis modules could not be imported: %s", e)
sys.exit(1)
# Ensure output dir exists
os.makedirs(args.out, exist_ok=True)
logger.info(
"Computing 2D axes (method=%s pca_residual=%s)", args.method, args.pca_residual
)
try:
positions_by_window, axis_def = compute_2d_axes(
args.db,
method=args.method,
pca_residual=args.pca_residual,
normalize_vectors=True,
)
except Exception as e: # defensive
logger.exception("compute_2d_axes failed: %s", e)
sys.exit(1)
if not positions_by_window:
logger.error("No positions produced — aborting")
sys.exit(1)
# pick latest window (lexicographic order is used elsewhere in codebase)
window_id = sorted(positions_by_window.keys())[-1]
# Build party mapping to colour points
try:
party_map = _load_party_map(args.db)
except Exception:
logger.exception("Failed to build party map; proceeding without it")
party_map = None
# Output files
compass_out = os.path.join(
args.out, f"political_compass_{args.method}_{window_id}.html"
)
traj_out = os.path.join(args.out, f"trajectories_compass_{args.method}_top50.html")
try:
plot_political_compass(
positions_by_window,
window_id=window_id,
party_of=party_map,
axis_def=axis_def,
y_scale=args.y_scale,
output_path=compass_out,
)
logger.info("Wrote compass to %s", compass_out)
except Exception:
logger.exception("Failed to write political compass")
try:
# Build 2D trajectories from the already-computed positions_by_window so
# we keep the same PCA/anchor axes (compute_2d_trajectories would call
# compute_2d_axes again which may use different defaults).
import numpy as _np
window_ids = sorted(positions_by_window.keys())
mp_data = {}
for wid in window_ids:
pos = positions_by_window.get(wid, {})
for mp_name, coord in pos.items():
mp_data.setdefault(mp_name, {"windows": [], "coords": []})
mp_data[mp_name]["windows"].append(wid)
mp_data[mp_name]["coords"].append(tuple(coord))
trajs = {}
for mp_name, data in mp_data.items():
if len(data["windows"]) < 2:
continue
coords = [_np.array(c, dtype=float) for c in data["coords"]]
step_vecs = [coords[i + 1] - coords[i] for i in range(len(coords) - 1)]
mags = [float(_np.linalg.norm(v)) for v in step_vecs]
trajs[mp_name] = {
"windows": data["windows"],
"coords": [[float(c[0]), float(c[1])] for c in coords],
"step_vectors": [[float(v[0]), float(v[1])] for v in step_vecs],
"step_magnitudes": mags,
"total_magnitude": float(sum(mags)),
}
ranked = sorted(
trajs.items(), key=lambda kv: kv[1]["total_magnitude"], reverse=True
)
top_names = [mp for mp, _ in ranked[:50]] if ranked else None
plot_2d_trajectories(
positions_by_window, mp_names=top_names, output_path=traj_out
)
logger.info("Wrote trajectories to %s", traj_out)
except Exception:
logger.exception("Failed to compute/write trajectories")
if __name__ == "__main__":
main()

scripts/inspect_axis.py
@@ -0,0 +1,137 @@
"""Inspect PCA axes and per-MP projections for diagnostics.
Usage:
uv run python3 scripts/inspect_axis.py --db data/motions.db --out outputs
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import sys
from typing import Dict, List
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
sys.path.insert(0, ROOT)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("inspect_axis")
def main(argv: List[str] | None = None):
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db")
p.add_argument("--out", default="outputs")
p.add_argument("--method", choices=["pca", "anchor"], default="pca")
p.add_argument("--pca-residual", action="store_true")
p.add_argument("--normalize", action="store_true", default=True)
args = p.parse_args(argv)
os.makedirs(args.out, exist_ok=True)
try:
from analysis.political_axis import compute_2d_axes
from analysis.visualize import _load_party_map
except Exception as e:
logger.exception("Failed to import analysis modules: %s", e)
raise
positions_by_window, axes = compute_2d_axes(
args.db,
method=args.method,
pca_residual=args.pca_residual,
normalize_vectors=args.normalize,
)
if not positions_by_window:
logger.error("No positions produced")
return 2
latest = sorted(positions_by_window.keys())[-1]
pos = positions_by_window[latest]
names = list(pos.keys())
coords = list(pos.values())
xs = [c[0] for c in coords]
ys = [c[1] for c in coords]
import numpy as _np
x_std = float(_np.std(xs))
y_std = float(_np.std(ys))
x_min, x_max = min(xs), max(xs)
y_min, y_max = min(ys), max(ys)
party_map = _load_party_map(args.db)
# load mp_votes counts
try:
import duckdb
conn = duckdb.connect(args.db)
rows = conn.execute(
"SELECT mp_name, COUNT(*) FROM mp_votes GROUP BY mp_name"
).fetchall()
conn.close()
vote_counts = {r[0]: int(r[1]) for r in rows}
except Exception:
vote_counts = {}
# extremes
sorted_by_x = sorted(pos.items(), key=lambda kv: kv[1][0])
sorted_by_y = sorted(pos.items(), key=lambda kv: kv[1][1])
def info_for(name: str):
party = party_map.get(name)
count = vote_counts.get(name, None)
x, y = pos.get(name, (None, None))
return {"name": name, "party": party, "count": count, "x": x, "y": y}
report = {
"db": args.db,
"latest_window": latest,
"n_entities": len(names),
"x_std": x_std,
"y_std": y_std,
"x_min": x_min,
"x_max": x_max,
"y_min": y_min,
"y_max": y_max,
"evr": axes.get("explained_variance_ratio") if axes else None,
"top_left_by_x": [info_for(n) for n, _ in sorted_by_x[:10]],
"top_right_by_x": [info_for(n) for n, _ in sorted_by_x[-10:]],
"top_by_y": [info_for(n) for n, _ in sorted_by_y[-10:]],
"bottom_by_y": [info_for(n) for n, _ in sorted_by_y[:10]],
}
# count how many are near-center along x within small fraction of std
threshold = 0.2 * x_std if x_std > 0 else 0.01
near_center = [n for n, (x, y) in pos.items() if abs(x) < threshold]
report["near_center_count"] = len(near_center)
report["near_center_sample"] = near_center[:40]
# check duplicate coordinate pairs
coord_pairs = [(_np.round(c[0], 6), _np.round(c[1], 6)) for c in coords]
unique_coords = set(coord_pairs)
report["n_unique_coords"] = len(unique_coords)
report["n_total_entities"] = len(names)
# look up particular MPs
for q in ("Ouwehand", "Keijzer", "Mona"):
found = [n for n in names if q.lower() in n.lower()]
report[f"matches_{q}"] = [info_for(n) for n in found]
out_json = os.path.join(args.out, "inspect_axis.json")
with open(out_json, "w", encoding="utf-8") as f:
json.dump(report, f, indent=2)
logger.info("Wrote inspection to %s", out_json)
print(json.dumps(report, indent=2))
return 0
if __name__ == "__main__":
raise SystemExit(main())

scripts/recompute_svd.py
@@ -0,0 +1,167 @@
"""Recompute per-window SVD into a fresh DB copy and re-run 2D axes.
This script copies the current data/motions.db to a new file (data/motions_recompute.db),
clears any existing svd_vectors rows for the target windows in the new DB, runs
SVD on each window, then computes 2D axes and writes compass + trajectories into
outputs_recomputed/ for inspection.
Usage:
uv run python3 scripts/recompute_svd.py --db data/motions.db --out outputs_recomputed
"""
from __future__ import annotations
import argparse
import calendar
import logging
import os
import shutil
import sys
from datetime import date
from typing import List, Tuple
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
sys.path.insert(0, ROOT)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("recompute_svd")
def quarter_bounds(window_id: str) -> Tuple[str, str]:
# window_id like '2026-Q1'
year, q = window_id.split("-Q")
y = int(year)
qn = int(q)
starts = {1: (1, 1), 2: (4, 1), 3: (7, 1), 4: (10, 1)}
ends = {1: (3, 31), 2: (6, 30), 3: (9, 30), 4: (12, 31)}
s_m, s_d = starts[qn]
e_m, e_d = ends[qn]
start = date(y, s_m, s_d).isoformat()
end = date(y, e_m, e_d).isoformat()
return start, end
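# e.g. quarter_bounds("2026-Q1") -> ("2026-01-01", "2026-03-31")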
def main(argv: List[str] | None = None) -> int:
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db")
p.add_argument("--out", default="outputs_recomputed")
p.add_argument("--k", type=int, default=50)
args = p.parse_args(argv)
os.makedirs(args.out, exist_ok=True)
# Copy DB to a new file so we don't clobber originals
src = args.db
dst = os.path.splitext(src)[0] + "_recompute.db"
logger.info("Copying %s -> %s", src, dst)
shutil.copyfile(src, dst)
# Lazy imports
try:
from database import MotionDatabase
from pipeline.svd_pipeline import run_svd_for_window
from analysis.political_axis import compute_2d_axes
from analysis.visualize import (
plot_political_compass,
plot_2d_trajectories,
_load_party_map,
)
from analysis import trajectory as traj
except Exception as e:
logger.exception("Import failed: %s", e)
return 2
# build MotionDatabase pointing to new file
db = MotionDatabase(dst)
# find windows from original DB via trajectory helper
window_ids = traj._load_window_ids(src)
if not window_ids:
logger.error("No windows found in source DB %s", src)
return 3
logger.info("Will recompute SVD for windows: %s", window_ids)
# clear existing svd_vectors rows for these windows in dst DB
import duckdb
conn = duckdb.connect(dst)
try:
conn.execute(
"DELETE FROM svd_vectors WHERE window_id IN ({})".format(
",".join([f"'{w}'" for w in window_ids])
)
)
conn.commit()
logger.info("Cleared existing svd_vectors rows for windows in %s", dst)
finally:
conn.close()
# Run SVD per window
for wid in window_ids:
start, end = quarter_bounds(wid)
logger.info("Running SVD for %s (%s -> %s) k=%d", wid, start, end, args.k)
res = run_svd_for_window(
db=db, window_id=wid, start_date=start, end_date=end, k=args.k
)
logger.info("SVD result for %s: %s", wid, res)
# Recompute 2D axes and plots from the recomputed DB
logger.info("Computing 2D axes (pca_residual=True) from recomputed DB")
positions_by_window, axes = compute_2d_axes(
dst, method="pca", pca_residual=True, normalize_vectors=True
)
if not positions_by_window:
logger.error("No positions returned from compute_2d_axes on recomputed DB")
return 5
latest = sorted(positions_by_window.keys())[-1]
party_map = _load_party_map(dst)
compass_out = os.path.join(args.out, f"political_compass_recomputed_{latest}.html")
traj_out = os.path.join(args.out, f"trajectories_recomputed_{latest}_top50.html")
plot_political_compass(
positions_by_window,
window_id=latest,
party_of=party_map,
axis_def=axes,
output_path=compass_out,
)
logger.info("Wrote recomputed compass to %s", compass_out)
# compute simple trajectories from positions_by_window
# build per-MP coords
mp_coords = {}
for wid in sorted(positions_by_window.keys()):
for mp, coord in positions_by_window[wid].items():
mp_coords.setdefault(mp, []).append((wid, coord))
names = [n for n, v in mp_coords.items() if len(v) >= 2]
plot_2d_trajectories(positions_by_window, mp_names=names[:50], output_path=traj_out)
logger.info("Wrote recomputed trajectories to %s", traj_out)
# write a short diagnostic JSON (convert numpy arrays to lists)
import json
import numpy as _np
def _to_serializable(o):
if isinstance(o, _np.ndarray):
return o.tolist()
if isinstance(o, (_np.floating, _np.integer)):
return float(o)
raise TypeError(f"Object of type {type(o)} is not JSON serializable")
diag = {"windows": window_ids, "axes": axes}
with open(
os.path.join(args.out, "recompute_diag.json"), "w", encoding="utf-8"
) as f:
json.dump(diag, f, indent=2, default=_to_serializable)
logger.info("Recompute complete; outputs in %s and DB copy at %s", args.out, dst)
return 0
if __name__ == "__main__":
raise SystemExit(main())

scripts/svd_diagnostics.py
@@ -0,0 +1,214 @@
"""SVD and PCA diagnostics for the political compass pipeline.
Produces a small text report and JSON summary in the outputs/ directory.
Usage:
uv run python3 scripts/svd_diagnostics.py --db data/motions.db --out outputs
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import sys
from statistics import mean
from typing import Dict, List, Optional, Tuple
ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if ROOT not in sys.path:
sys.path.insert(0, ROOT)
logger = logging.getLogger("svd_diagnostics")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
def find_by_substring(names: List[str], query: str) -> List[str]:
q = query.lower()
return [n for n in names if q in n.lower()]
def main(argv: Optional[list] = None):
p = argparse.ArgumentParser()
p.add_argument("--db", default="data/motions.db")
p.add_argument("--out", default="outputs")
args = p.parse_args(argv)
os.makedirs(args.out, exist_ok=True)
try:
from analysis import trajectory as traj
from analysis.political_axis import compute_2d_axes
from analysis.visualize import _load_party_map
except Exception as e: # pragma: no cover - runtime
logger.exception("Could not import analysis modules: %s", e)
raise
# Load windows and aligned vectors
window_ids = traj._load_window_ids(args.db)
if not window_ids:
logger.error("No SVD windows found in DB %s", args.db)
return 1
logger.info("Found windows: %s", window_ids)
raw_window_vecs = {
wid: traj._load_mp_vectors_for_window(args.db, wid) for wid in window_ids
}
aligned_window_vecs = traj._procrustes_align_windows(raw_window_vecs)
# Compute global PCA axes (residual and non-residual) for comparison
positions_residual, axes_residual = compute_2d_axes(
args.db,
window_ids=window_ids,
method="pca",
normalize_vectors=True,
pca_residual=True,
)
positions_plain, axes_plain = compute_2d_axes(
args.db,
window_ids=window_ids,
method="pca",
normalize_vectors=True,
pca_residual=False,
)
out_report = []
def add(line: str):
out_report.append(line)
logger.info(line)
add("PCA diagnostics report")
add(f"DB: {args.db}")
add(f"Windows: {window_ids}")
add("")
evr_res = axes_residual.get("explained_variance_ratio") if axes_residual else None
evr_plain = axes_plain.get("explained_variance_ratio") if axes_plain else None
add(f"Residual PCA EVR: {evr_res}")
add(f"Plain PCA EVR: {evr_plain}")
# pick latest window for detailed inspection
latest = sorted(window_ids)[-1]
add("")
add(f"Inspecting latest window: {latest}")
pos = positions_residual.get(latest, {})
names = list(pos.keys())
xs = [v[0] for v in pos.values()]
ys = [v[1] for v in pos.values()]
def stats(arr: List[float]) -> Tuple[float, float]:
if not arr:
return 0.0, 0.0
mn = min(arr)
mx = max(arr)
return mn, mx
add(f"Entities in latest window: {len(names)}")
add(f"X range (left-right): {stats(xs)}")
add(f"Y range (prog-cons): {stats(ys)}")
# stdevs
try:
import numpy as _np
x_std = float(_np.std(xs))
y_std = float(_np.std(ys))
except Exception:
x_std = 0.0
y_std = 0.0
add(
f"Std dev X: {x_std:.6f}, Std dev Y: {y_std:.6f} (ratio Y/X = {y_std / (x_std + 1e-12):.3f})"
)
# show extremes on X and Y
sorted_by_x = sorted(pos.items(), key=lambda kv: kv[1][0])
sorted_by_y = sorted(pos.items(), key=lambda kv: kv[1][1])
add("")
add("Left-most (by X):")
for name, (x, y) in sorted_by_x[:8]:
add(f" {name:40s} x={x:.4f} y={y:.4f}")
add("")
add("Right-most (by X):")
for name, (x, y) in sorted_by_x[-8:]:
add(f" {name:40s} x={x:.4f} y={y:.4f}")
add("")
add("Top (conservative) (by Y):")
for name, (x, y) in sorted_by_y[-8:]:
add(f" {name:40s} x={x:.4f} y={y:.4f}")
add("")
add("Bottom (progressive) (by Y):")
for name, (x, y) in sorted_by_y[:8]:
add(f" {name:40s} x={x:.4f} y={y:.4f}")
# Find specific MPs mentioned by user
matches_ouwehand = find_by_substring(names, "ouwehand")
matches_mona = find_by_substring(names, "mona")
add("")
add(f"Matches for 'Ouwehand': {matches_ouwehand}")
for n in matches_ouwehand:
x, y = pos.get(n)
add(f" {n} -> x={x:.4f} y={y:.4f}")
add(f"Matches for 'Mona': {matches_mona}")
for n in matches_mona:
x, y = pos.get(n)
add(f" {n} -> x={x:.4f} y={y:.4f}")
# Party centroids
party_map = _load_party_map(args.db)
parties: Dict[str, List[Tuple[float, float]]] = {}
for mp, coord in pos.items():
party = party_map.get(mp)
if party:
parties.setdefault(party, []).append(coord)
party_centroids: Dict[str, Tuple[float, float]] = {}
for party, coords in parties.items():
xs_p = [c[0] for c in coords]
ys_p = [c[1] for c in coords]
party_centroids[party] = (mean(xs_p), mean(ys_p))
add("")
add(f"Computed {len(party_centroids)} party centroids (from mp_metadata majority)")
sorted_parties_by_x = sorted(party_centroids.items(), key=lambda kv: kv[1][0])
add("Party centroids left→right:")
for p, (x, y) in sorted_parties_by_x:
add(f" {p:20s} x={x:.4f} y={y:.4f}")
sorted_parties_by_y = sorted(party_centroids.items(), key=lambda kv: kv[1][1])
add("")
add("Party centroids prog→cons:")
for p, (x, y) in sorted_parties_by_y:
add(f" {p:20s} x={x:.4f} y={y:.4f}")
# Save report and a small JSON summary
report_path = os.path.join(args.out, "svd_diagnostics.txt")
summary_path = os.path.join(args.out, "svd_diagnostics.json")
with open(report_path, "w", encoding="utf-8") as f:
f.write("\n".join(out_report))
summary = {
"db": args.db,
"windows": window_ids,
"latest_window": latest,
"evr_residual": evr_res,
"evr_plain": evr_plain,
"n_entities_latest": len(names),
"x_std": x_std,
"y_std": y_std,
"party_centroids": party_centroids,
}
with open(summary_path, "w", encoding="utf-8") as f:
json.dump(summary, f, indent=2)
logger.info("Diagnostic report written to %s and %s", report_path, summary_path)
return 0
if __name__ == "__main__":
raise SystemExit(main())

tests/integration/test_pipeline_end_to_end.py
@@ -53,14 +53,14 @@ def test_pipeline_end_to_end(tmp_path, monkeypatch):
conn.close()
# monkeypatch ai_provider.get_embedding to deterministic vector
# monkeypatch ai_provider.get_embeddings_batch to deterministic vectors
import ai_provider
def fake_get_embedding(text, model=None):
# produce a deterministic vector based on seeded numpy
return list(np.random.rand(16))
def fake_get_embeddings_batch(texts, model=None, batch_size=50):
# produce a deterministic vector per text based on seeded numpy
return [list(np.random.rand(16)) for _ in texts]
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
monkeypatch.setattr("ai_provider.get_embeddings_batch", fake_get_embeddings_batch)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings

tests/test_text_pipeline.py
@@ -49,11 +49,11 @@ def test_ensure_text_embeddings_monkeypatch(tmp_path, monkeypatch):
conn.close()
# monkeypatch ai_provider.get_embedding
def fake_get_embedding(text, model=None):
return [0.1] * 16
# monkeypatch ai_provider.get_embeddings_batch (used by batched pipeline)
def fake_get_embeddings_batch(texts, model=None, batch_size=50):
return [[0.1] * 16 for _ in texts]
monkeypatch.setattr("ai_provider.get_embedding", fake_get_embedding)
monkeypatch.setattr("ai_provider.get_embeddings_batch", fake_get_embeddings_batch)
# run ensure_text_embeddings
from pipeline.text_pipeline import ensure_text_embeddings

thoughts/ledgers/CONTINUITY_stemwijzer.md
@@ -1,50 +1,79 @@
# Session: stemwijzer
Updated: 2026-03-20T00:23:33Z
# Session: stemwijzer — Parliamentary Embedding Pipeline
Updated: 2026-03-22T16:00:00Z
## Goal
Preserve the minimal session state required to resume work on the stemwijzer project after context clears (success = ledger exists and is kept up-to-date).
2D political compass + motion similarity search from parliamentary votes + motion text.
Full historical coverage 2016–2026, precomputed similarity cache, fused (SVD + text) embeddings.
## Constraints
- Keep the ledger CONCISE — only essential information
- Focus on WHAT and WHY, not HOW
- Mark uncertain information as UNCONFIRMED
- Include git branch and key file paths
- DuckDB only (`data/motions.db`); open/close `duckdb.connect(self.db_path)` per method
- Vectors stored as JSON text (no external vector DB)
- Logging via `logging.getLogger(__name__)`; no `print()` in library modules
- Tests run offline (network monkeypatched) — use `.venv/bin/python -m pytest -q`
- Do NOT modify `app.py` or `scheduler.py`
- Use `.venv/bin/python` (Arch Linux system Python is externally managed)
## Progress
### Done
- [x] Create initial continuity ledger file
## Current DB State (verified 2026-03-22 ~16:00)
### In Progress
- [ ] Capture ongoing session context and update ledger after each meaningful change
| Table | Rows |
|---|---|
| motions | 10,613 |
| embeddings | 10,753 |
| svd_vectors | 24,528 |
| fused_embeddings | **10,613** (1:1 with motions, 0 duplicates) |
| similarity_cache | **212,206** (top_k=20, all annual windows) |
| mp_votes | 199,967 |
| mp_metadata | 798 |
### Blocked
- None currently
## Annual Window Coverage
| Year | Motions | Fused | Similarity |
|---|---|---|---|
| 2016 | 132 | 132 | 2,640 |
| 2017 | 30 | 30 | 600 |
| 2018 | 100 | 100 | 2,000 |
| 2019 | 3 | 3 | 6 |
| 2020 | 0 | 0 | 0 (no data) |
| 2021 | 0 | 0 | 0 (no data) |
| 2022 | 4,116 | 4,116 | 82,320 |
| 2023 | 621 | 621 | 12,420 |
| 2024 | 948 | 948 | 18,960 |
| 2025 | 3,715 | 3,715 | 74,300 |
| 2026 | 948 | 948 | 18,960 |

Similarity row counts equal motions × top_k (20), capped by the number of other motions in the window; hence 2019's 3 motions yield only 3 × 2 = 6 rows.
## Completed This Session
- [x] Text embeddings: ran with real OpenRouter API at batch_size=200 → 10,753 embedding rows
- [x] Re-ran `extract_mp_votes` on all motions → 111,978 new rows (party-level votes backfilled)
- [x] SVD re-run (annual 2016–2026) with full vote data → 24,528 svd_vector rows
- [x] Fixed `store_fused_embedding` double-counting bug: added DELETE before INSERT
- [x] Cleaned and re-ran fusion → 10,613 fused rows, zero duplicates
- [x] Re-ran similarity cache top_k=20 for all 9 active windows → 212,206 rows
- [x] Test suite: **34 passed, 2 skipped**
## Key Decisions
- **Session name = "stemwijzer"**: Chosen from repository context (UNCONFIRMED if a different canonical session name is preferred).
- **Do not auto-commit ledger changes**: Commits will only be made when the user explicitly requests it (follows Git Safety Protocol).
## Next Steps
1. Continue updating this ledger when tasks, files, or decisions change
2. Add entries for new branches or major feature work (mark as UNCONFIRMED when unsure)
3. Ask user before creating any git commits that include this ledger
## File Operations
### Read
- `README.md`
- `pyproject.toml`
- `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`
- `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`
### Modified
- `thoughts/ledgers/CONTINUITY_stemwijzer.md` (new)
## Critical Context
- Repository branch observed: `main`
- Found project metadata in `pyproject.toml` indicating Python tooling preference
- Existing notes/plans located under `thoughts/shared/` (plans and designs from 2026-03-19)
- No existing continuity ledger was found prior to this creation
## Working Set
- Branch: `main`
- Key files: `README.md`, `pyproject.toml`, `thoughts/shared/plans/2026-03-19-stemwijzer-plan.md`, `thoughts/shared/designs/2026-03-19-stemwijzer-design.md`, `thoughts/ledgers/CONTINUITY_stemwijzer.md`
- `store_fused_embedding` (database.py line 686): Now does DELETE+INSERT instead of plain INSERT to prevent duplicates on re-runs (see the sketch after this list).
- Annual windows chosen for historical political compass (2016–2026).
- top_k=20 for similarity cache.
- Party-level votes (e.g. `{"PVV": "voor"}`) handled in `extract_mp_votes` — actor without comma → `party=actor_name`.
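
A minimal sketch of that idempotent upsert pattern (the standalone wrapper and its signature are illustrative; the real method lives on `MotionDatabase`):

```python
import duckdb
import json

def upsert_fused(conn: duckdb.DuckDBPyConnection, motion_id: int, window_id: str,
                 vector: list[float], svd_dims: int, text_dims: int) -> None:
    # Remove any stale row for this (motion_id, window_id) so re-runs are idempotent
    conn.execute(
        "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
        (motion_id, window_id),
    )
    # Vectors are stored as JSON text, per the DuckDB-only constraint above
    conn.execute(
        "INSERT INTO fused_embeddings (motion_id, window_id, vector, svd_dims, text_dims, created_at) "
        "VALUES (?, ?, ?, ?, ?, CURRENT_TIMESTAMP)",
        (motion_id, window_id, json.dumps(vector), svd_dims, text_dims),
    )
```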
## Open Items (not blocking, data coverage gaps)
1. **2020–2021 data gap**: No motions in DB at all. Need to run downloader with `--start-date 2019-01-01 --end-date 2021-12-31` if data exists in API.
2. **2024 gap ~3,020 motions**: OData API has ~3,968 2024 motions, only 948 in DB. Root cause unclear — needs investigation of URL-based dedup in `insert_motion`.
3. **"Verworpen." dedup**: Short-text motions (title="Verworpen.") get spurious similarity=1.0. UI/query layer should filter `score < 0.999 OR title != 'Verworpen.'`.
4. **svd_vectors has duplicates**: 2025 has 7,430 rows for 3,715 motions (2x). Doesn't affect fused_embeddings (DELETE+INSERT handles it) but wastes space. Low priority.
## Key File Paths
- DB: `data/motions.db`
- Venv: `.venv/bin/python`
- Pipeline entry: `pipeline/run_pipeline.py`
- Fusion: `pipeline/fusion.py`
- SVD: `pipeline/svd_pipeline.py`
- Text embeddings: `pipeline/text_pipeline.py`
- MP votes extraction: `pipeline/extract_mp_votes.py`
- Database layer: `database.py`
- Similarity compute: `similarity/compute.py`
- Similarity lookup: `similarity/lookup.py`
- Tests: `tests/` (pytest, offline)
## Branch
`main`
