11 KiB
| date | topic | status |
|---|---|---|
| 2026-03-21 | Parliamentary Embedding Pipeline (Late Fusion) | validated |
Problem Statement
We want to implement the late-fusion embedding system described in EMBEDDING_ANALYSIS.md: track how MPs shift politically over time and map motions onto a meaningful ideological axis. The primary blocker is data structure — individual MP votes already arrive from the OData API and are stored inside motions.voting_results as a mixed JSON blob (party names + MP names together). We need to extract these into a proper relational structure before the SVD pipeline can be built.
Why this is the right next step: We already have motion text, layman explanations, text embeddings infrastructure (Qwen3 via ai_provider), and DuckDB. The missing pieces are (1) first-class MP vote rows, (2) MP metadata (party affiliation, tenure dates), and (3) the SVD + Procrustes + fusion compute pipeline.
Constraints
- DuckDB only — no pgvector, no external vector store. In-Python compute (scipy) is correct.
- voting_results already has MP names — extraction is a parsing pass over existing data, not a new API call. Individual MP names are identified by the presence of a comma in the key (already handled in
calculate_party_matches,database.py:264). - Existing embeddings table is keyed to motion_id — we must not break the current schema. SVD and fused vectors go into new tables.
- ai_provider.get_embedding already works — use it as-is for text embeddings; no model changes needed for MVP.
- ibis/DuckDB preferred over raw SQL for analysis queries (per project preferences).
- uv for dependency management; add
scipy,umap-learn,plotly,sentence-transformers(or use existing ai_provider for embeddings).
Approach
Late-fusion pipeline in four phases:
- Extract — parse MP-level votes out of
voting_resultsJSON into anmp_votestable; fetch MP metadata from OData intomp_metadata. - Compute SVD — per time window, build sparse MP × motion matrix → SVD → Procrustes-align windows sequentially.
- Text embeddings — ensure every motion has a text embedding (existing path; just fill gaps).
- Fuse — concatenate aligned SVD motion vector + text embedding → store in
fused_embeddingstable.
Alternatives considered:
- Pure text embeddings only: easier but loses the behavioral (voting) signal entirely. Rejected because the whole point of the plan is the fused representation.
- Store aligned SVD + rotation matrices separately: more flexible for recomputing, but adds complexity. MVP will store aligned vectors directly; rotation matrices are logged for debugging but not persisted.
Architecture
Data layer (DB):
motions (existing)
embeddings (existing — text vectors keyed to motion_id)
mp_votes (NEW — one row per MP per motion)
mp_metadata (NEW — MP name, party, entry/exit dates)
svd_vectors (NEW — per window, per entity: MP or motion)
fused_embeddings (NEW — per motion, per window: SVD + text concatenated)
Pipeline modules (new, in pipeline/):
extract_mp_votes.py — JSON blob → mp_votes rows
fetch_mp_metadata.py — OData /Kamerlid → mp_metadata rows
svd_pipeline.py — time windows → SVD → Procrustes alignment → svd_vectors
text_pipeline.py — ensure embeddings coverage, delegates to existing summarizer
fusion.py — join svd_vectors + embeddings → fused_embeddings
Analysis modules (new, in analysis/):
political_axis.py — first SVD component / anchor-party axis
trajectory.py — MP drift across aligned windows
clustering.py — UMAP on fused motion embeddings, thematic clusters
visualize.py — Plotly interactive trajectory and cluster plots
CLI entry points (new):
pipeline/run_pipeline.py — orchestrate all phases with flags
Key Components & Responsibilities
mp_votes table
- Schema:
(id, motion_id, mp_name, party, vote ENUM(voor/tegen/afwezig), date, created_at) - Populated by
extract_mp_votes.pydoing a one-time parse ofmotions.voting_resultsJSON. - Idempotent: skip motion_id if already extracted (upsert or EXISTS check).
partyfield is left NULL initially; backfilled frommp_metadataafter that table is populated.
mp_metadata table
- Schema:
(mp_name, party, entry_date, exit_date, source_id) - Fetched from OData
/Kamerlidendpoint (needs verification — see Open Questions). - Fallback: derive approximate party affiliation from
mp_votesrows (majority-party heuristic) if OData metadata is unavailable.
svd_vectors table
- Schema:
(window_id, entity_type ENUM(mp/motion), entity_id, vector JSON, model TEXT, created_at) - Stores both MP and motion SVD vectors per time window, after Procrustes alignment.
window_idis a string like2024-Q1.
fused_embeddings table
- Schema:
(motion_id, window_id, vector JSON, svd_dims INT, text_dims INT, created_at) - Separate from the existing
embeddingstable to avoid schema conflicts. - Vector is the concatenation of the SVD motion vector and the text embedding.
svd_pipeline.py
- Groups motions by time window (quarterly default).
- Builds a sparse
scipy.sparse.csr_matrix(MPs as rows, motions as columns, vote values encoded as +1/−1/0). - Calls
scipy.sparse.linalg.svds(matrix, k=dims)—kis configurable (default 50). - Applies Procrustes alignment between consecutive windows using overlapping MPs as anchors.
- Logs Procrustes disparity score per transition; flags high disparity (election transitions).
extract_mp_votes.py
- Reads all motions with
voting_resultsJSON, parses keys: if comma in key → individual MP name, else → party/fraction name. - Writes MP-level rows to
mp_votes; party-level rows are ignored here (they're already used by the existingcalculate_party_matchesflow). - Handles the three vote values:
voor(+1),tegen(−1),afwezig(0).
fusion.py
- For each motion in a window: lookup SVD motion vector from
svd_vectors; lookup text embedding fromembeddings. - Concatenates vectors (simple
list + list); stores infused_embeddings. - Skips motion if either vector is missing; logs counts.
analysis/ modules
- All read-only from DB; write only to output files (HTML/PNG plots).
political_axis.py: project all MP SVD vectors onto the first principal component; optionally define axis by anchor parties (e.g. VVD vs SP).trajectory.py: collect MP's aligned SVD vector per window → compute drift distance → plot trajectory over time.clustering.py: run UMAP onfused_embeddingsper window → label with policy_area or thematic cluster.visualize.py: Plotly interactive scatter/line plots; outputs self-contained HTML.
Data Flow
Phase 1 — Extract
motions.voting_results (JSON, existing)
→ extract_mp_votes.py
→ INSERT mp_votes rows (motion_id, mp_name, vote, date)
OData /Kamerlid
→ fetch_mp_metadata.py
→ INSERT mp_metadata rows (mp_name, party, entry_date, exit_date)
→ UPDATE mp_votes.party via JOIN (backfill)
Phase 2 — SVD
mp_votes (date-filtered per window)
→ sparse MP × motion matrix
→ scipy svds(k=50)
→ raw SVD vectors per window
Procrustes alignment:
window[t-1] aligned vectors + window[t] raw vectors
→ overlapping MPs as anchors
→ scipy.spatial.procrustes → rotation R
→ window[t] aligned vectors
→ INSERT svd_vectors rows
Phase 3 — Text embeddings (fill gaps)
motions without embedding in embeddings table
→ text_pipeline.py → ai_provider.get_embedding(body_text or description)
→ INSERT embeddings rows (existing schema)
Phase 4 — Fusion
svd_vectors (motion, window) + embeddings (motion)
→ fusion.py
→ INSERT fused_embeddings rows
Phase 5 — Analysis (on demand)
fused_embeddings + mp_metadata + svd_vectors
→ analysis modules
→ HTML plots output
Error Handling Strategy
- Extraction idempotency:
extract_mp_voteschecksSELECT COUNT(*) FROM mp_votes WHERE motion_id = ?before inserting; re-runs are safe. - Sparse windows: if a time window has fewer than
MIN_MOTIONS(default 20) orMIN_MPs(default 10), skip SVD for that window and log a warning. Do not crash. - Procrustes at election transitions: chain alignment via the last quarter of the old term and first quarter of the new term using only returning MPs. If overlap < 30%, log as HIGH_DISPARITY and store the window but flag it.
- Missing text embeddings: log motions skipped in fusion; the SVD-only path remains valid for those motions.
- OData metadata unavailable: fall back to heuristic party assignment (mp_votes majority-party per MP name). Log which MPs used fallback.
- Replace prints with structured logging: all pipeline modules use
logging.getLogger(__name__)— notprint().
Testing Strategy
-
Unit:
- Vote parser: given sample
voting_resultsJSON, assert correct MP rows extracted and party rows ignored. - Sparse matrix builder: inject 5 MPs × 10 motions → assert matrix shape and values.
- Procrustes wrapper: inject two small aligned-then-rotated matrices → assert recovered rotation close to identity.
- Fusion: inject matching SVD and text vectors → assert concatenated output length = svd_dims + text_dims.
- Vote parser: given sample
-
Integration:
- Extract → SVD → Fusion on a fixture of 50 motions (stored in
tests/fixtures/). Monkeypatch ai_provider for text embeddings. Assertfused_embeddingstable populated and vector dimensions correct.
- Extract → SVD → Fusion on a fixture of 50 motions (stored in
-
Regression:
- Run pipeline on a fixed 100-motion snapshot. Assert output dimensions and row counts stable across runs.
-
Migration tests:
- Follow existing pattern (
tests/test_migration_embeddings.py): apply new migration SQL to a temp DuckDB, assert expected tables and columns.
- Follow existing pattern (
Open Questions
- OData
/Kamerlidendpoint availability: does it expose party affiliation and tenure dates with the same API key/base URL we already use? If not, we need a scraping fallback formp_metadata. - Store rotation matrices?: MVP stores aligned vectors directly. Should we also persist the Procrustes R matrix per window transition so we can re-project new MPs added later without full recomputation?
- Output target: CLI producing HTML plots (simplest) vs. new Streamlit page vs. Jupyter notebook. Recommendation: CLI first, Streamlit page in a follow-up.
- Time window granularity: quarterly is the default. Should we validate this empirically first with an annual window (larger, more stable matrices) and switch to quarterly once the pipeline is proven?
- SVD dimensions k: default 50 dims for SVD. This needs to be validated against the actual data size (number of unique MPs × motions per window). A window with 100 MPs and 50 motions cannot have k=50 — needs to be
k < min(n_mps, n_motions). Pipeline must enforce this dynamically.