commit
fd73da3752
@ -0,0 +1,184 @@ |
||||
--- |
||||
date: 2026-03-21 |
||||
topic: "Parliamentary Embedding Pipeline (Late Fusion)" |
||||
status: validated |
||||
--- |
||||
|
||||
## Problem Statement |
||||
|
||||
We want to implement the late-fusion embedding system described in EMBEDDING_ANALYSIS.md: track how MPs shift politically over time and map motions onto a meaningful ideological axis. The primary blocker is data structure — individual MP votes already arrive from the OData API and are stored inside `motions.voting_results` as a mixed JSON blob (party names + MP names together). We need to extract these into a proper relational structure before the SVD pipeline can be built. |
||||
|
||||
**Why this is the right next step:** We already have motion text, layman explanations, text embeddings infrastructure (Qwen3 via ai_provider), and DuckDB. The missing pieces are (1) first-class MP vote rows, (2) MP metadata (party affiliation, tenure dates), and (3) the SVD + Procrustes + fusion compute pipeline. |
||||
|
||||
## Constraints |
||||
|
||||
- **DuckDB only** — no pgvector, no external vector store. In-Python compute (scipy) is correct. |
||||
- **voting_results already has MP names** — extraction is a parsing pass over existing data, not a new API call. Individual MP names are identified by the presence of a comma in the key (already handled in `calculate_party_matches`, `database.py:264`). |
||||
- **Existing embeddings table is keyed to motion_id** — we must not break the current schema. SVD and fused vectors go into new tables. |
||||
- **ai_provider.get_embedding already works** — use it as-is for text embeddings; no model changes needed for MVP. |
||||
- **ibis/DuckDB preferred** over raw SQL for analysis queries (per project preferences). |
||||
- **uv** for dependency management; add `scipy`, `umap-learn`, `plotly`, `sentence-transformers` (or use existing ai_provider for embeddings). |
||||
|
||||
## Approach |
||||
|
||||
**Late-fusion pipeline in four phases:** |
||||
|
||||
1. **Extract** — parse MP-level votes out of `voting_results` JSON into an `mp_votes` table; fetch MP metadata from OData into `mp_metadata`. |
||||
2. **Compute SVD** — per time window, build sparse MP × motion matrix → SVD → Procrustes-align windows sequentially. |
||||
3. **Text embeddings** — ensure every motion has a text embedding (existing path; just fill gaps). |
||||
4. **Fuse** — concatenate aligned SVD motion vector + text embedding → store in `fused_embeddings` table. |
||||
|
||||
Alternatives considered: |
||||
- **Pure text embeddings only**: easier but loses the behavioral (voting) signal entirely. Rejected because the whole point of the plan is the fused representation. |
||||
- **Store aligned SVD + rotation matrices separately**: more flexible for recomputing, but adds complexity. MVP will store aligned vectors directly; rotation matrices are logged for debugging but not persisted. |
||||
|
||||
## Architecture |
||||
|
||||
``` |
||||
Data layer (DB): |
||||
motions (existing) |
||||
embeddings (existing — text vectors keyed to motion_id) |
||||
mp_votes (NEW — one row per MP per motion) |
||||
mp_metadata (NEW — MP name, party, entry/exit dates) |
||||
svd_vectors (NEW — per window, per entity: MP or motion) |
||||
fused_embeddings (NEW — per motion, per window: SVD + text concatenated) |
||||
|
||||
Pipeline modules (new, in pipeline/): |
||||
extract_mp_votes.py — JSON blob → mp_votes rows |
||||
fetch_mp_metadata.py — OData /Kamerlid → mp_metadata rows |
||||
svd_pipeline.py — time windows → SVD → Procrustes alignment → svd_vectors |
||||
text_pipeline.py — ensure embeddings coverage, delegates to existing summarizer |
||||
fusion.py — join svd_vectors + embeddings → fused_embeddings |
||||
|
||||
Analysis modules (new, in analysis/): |
||||
political_axis.py — first SVD component / anchor-party axis |
||||
trajectory.py — MP drift across aligned windows |
||||
clustering.py — UMAP on fused motion embeddings, thematic clusters |
||||
visualize.py — Plotly interactive trajectory and cluster plots |
||||
|
||||
CLI entry points (new): |
||||
pipeline/run_pipeline.py — orchestrate all phases with flags |
||||
``` |
||||
|
||||
## Key Components & Responsibilities |
||||
|
||||
**mp_votes table** |
||||
- Schema: `(id, motion_id, mp_name, party, vote ENUM(voor/tegen/afwezig), date, created_at)` |
||||
- Populated by `extract_mp_votes.py` doing a one-time parse of `motions.voting_results` JSON. |
||||
- Idempotent: skip motion_id if already extracted (upsert or EXISTS check). |
||||
- `party` field is left NULL initially; backfilled from `mp_metadata` after that table is populated. |
||||
|
||||
**mp_metadata table** |
||||
- Schema: `(mp_name, party, entry_date, exit_date, source_id)` |
||||
- Fetched from OData `/Kamerlid` endpoint (needs verification — see Open Questions). |
||||
- Fallback: derive approximate party affiliation from `mp_votes` rows (majority-party heuristic) if OData metadata is unavailable. |
||||
|
||||
**svd_vectors table** |
||||
- Schema: `(window_id, entity_type ENUM(mp/motion), entity_id, vector JSON, model TEXT, created_at)` |
||||
- Stores both MP and motion SVD vectors per time window, after Procrustes alignment. |
||||
- `window_id` is a string like `2024-Q1`. |
||||
|
||||
**fused_embeddings table** |
||||
- Schema: `(motion_id, window_id, vector JSON, svd_dims INT, text_dims INT, created_at)` |
||||
- Separate from the existing `embeddings` table to avoid schema conflicts. |
||||
- Vector is the concatenation of the SVD motion vector and the text embedding. |
||||
|
||||
**svd_pipeline.py** |
||||
- Groups motions by time window (quarterly default). |
||||
- Builds a sparse `scipy.sparse.csr_matrix` (MPs as rows, motions as columns, vote values encoded as +1/−1/0). |
||||
- Calls `scipy.sparse.linalg.svds(matrix, k=dims)` — `k` is configurable (default 50). |
||||
- Applies Procrustes alignment between consecutive windows using overlapping MPs as anchors. |
||||
- Logs Procrustes disparity score per transition; flags high disparity (election transitions). |
||||
|
||||
**extract_mp_votes.py** |
||||
- Reads all motions with `voting_results` JSON, parses keys: if comma in key → individual MP name, else → party/fraction name. |
||||
- Writes MP-level rows to `mp_votes`; party-level rows are ignored here (they're already used by the existing `calculate_party_matches` flow). |
||||
- Handles the three vote values: `voor` (+1), `tegen` (−1), `afwezig` (0). |
||||
|
||||
**fusion.py** |
||||
- For each motion in a window: lookup SVD motion vector from `svd_vectors`; lookup text embedding from `embeddings`. |
||||
- Concatenates vectors (simple `list + list`); stores in `fused_embeddings`. |
||||
- Skips motion if either vector is missing; logs counts. |
||||
|
||||
**analysis/ modules** |
||||
- All read-only from DB; write only to output files (HTML/PNG plots). |
||||
- `political_axis.py`: project all MP SVD vectors onto the first principal component; optionally define axis by anchor parties (e.g. VVD vs SP). |
||||
- `trajectory.py`: collect MP's aligned SVD vector per window → compute drift distance → plot trajectory over time. |
||||
- `clustering.py`: run UMAP on `fused_embeddings` per window → label with policy_area or thematic cluster. |
||||
- `visualize.py`: Plotly interactive scatter/line plots; outputs self-contained HTML. |
||||
|
||||
## Data Flow |
||||
|
||||
``` |
||||
Phase 1 — Extract |
||||
motions.voting_results (JSON, existing) |
||||
→ extract_mp_votes.py |
||||
→ INSERT mp_votes rows (motion_id, mp_name, vote, date) |
||||
|
||||
OData /Kamerlid |
||||
→ fetch_mp_metadata.py |
||||
→ INSERT mp_metadata rows (mp_name, party, entry_date, exit_date) |
||||
→ UPDATE mp_votes.party via JOIN (backfill) |
||||
|
||||
Phase 2 — SVD |
||||
mp_votes (date-filtered per window) |
||||
→ sparse MP × motion matrix |
||||
→ scipy svds(k=50) |
||||
→ raw SVD vectors per window |
||||
|
||||
Procrustes alignment: |
||||
window[t-1] aligned vectors + window[t] raw vectors |
||||
→ overlapping MPs as anchors |
||||
→ scipy.spatial.procrustes → rotation R |
||||
→ window[t] aligned vectors |
||||
→ INSERT svd_vectors rows |
||||
|
||||
Phase 3 — Text embeddings (fill gaps) |
||||
motions without embedding in embeddings table |
||||
→ text_pipeline.py → ai_provider.get_embedding(body_text or description) |
||||
→ INSERT embeddings rows (existing schema) |
||||
|
||||
Phase 4 — Fusion |
||||
svd_vectors (motion, window) + embeddings (motion) |
||||
→ fusion.py |
||||
→ INSERT fused_embeddings rows |
||||
|
||||
Phase 5 — Analysis (on demand) |
||||
fused_embeddings + mp_metadata + svd_vectors |
||||
→ analysis modules |
||||
→ HTML plots output |
||||
``` |
||||
|
||||
## Error Handling Strategy |
||||
|
||||
- **Extraction idempotency**: `extract_mp_votes` checks `SELECT COUNT(*) FROM mp_votes WHERE motion_id = ?` before inserting; re-runs are safe. |
||||
- **Sparse windows**: if a time window has fewer than `MIN_MOTIONS` (default 20) or `MIN_MPs` (default 10), skip SVD for that window and log a warning. Do not crash. |
||||
- **Procrustes at election transitions**: chain alignment via the last quarter of the old term and first quarter of the new term using only returning MPs. If overlap < 30%, log as HIGH_DISPARITY and store the window but flag it. |
||||
- **Missing text embeddings**: log motions skipped in fusion; the SVD-only path remains valid for those motions. |
||||
- **OData metadata unavailable**: fall back to heuristic party assignment (mp_votes majority-party per MP name). Log which MPs used fallback. |
||||
- **Replace prints with structured logging**: all pipeline modules use `logging.getLogger(__name__)` — not `print()`. |
||||
|
||||
## Testing Strategy |
||||
|
||||
- **Unit**: |
||||
- Vote parser: given sample `voting_results` JSON, assert correct MP rows extracted and party rows ignored. |
||||
- Sparse matrix builder: inject 5 MPs × 10 motions → assert matrix shape and values. |
||||
- Procrustes wrapper: inject two small aligned-then-rotated matrices → assert recovered rotation close to identity. |
||||
- Fusion: inject matching SVD and text vectors → assert concatenated output length = svd_dims + text_dims. |
||||
|
||||
- **Integration**: |
||||
- Extract → SVD → Fusion on a fixture of 50 motions (stored in `tests/fixtures/`). Monkeypatch ai_provider for text embeddings. Assert `fused_embeddings` table populated and vector dimensions correct. |
||||
|
||||
- **Regression**: |
||||
- Run pipeline on a fixed 100-motion snapshot. Assert output dimensions and row counts stable across runs. |
||||
|
||||
- **Migration tests**: |
||||
- Follow existing pattern (`tests/test_migration_embeddings.py`): apply new migration SQL to a temp DuckDB, assert expected tables and columns. |
||||
|
||||
## Open Questions |
||||
|
||||
1. **OData `/Kamerlid` endpoint availability**: does it expose party affiliation and tenure dates with the same API key/base URL we already use? If not, we need a scraping fallback for `mp_metadata`. |
||||
2. **Store rotation matrices?**: MVP stores aligned vectors directly. Should we also persist the Procrustes R matrix per window transition so we can re-project new MPs added later without full recomputation? |
||||
3. **Output target**: CLI producing HTML plots (simplest) vs. new Streamlit page vs. Jupyter notebook. Recommendation: CLI first, Streamlit page in a follow-up. |
||||
4. **Time window granularity**: quarterly is the default. Should we validate this empirically first with an annual window (larger, more stable matrices) and switch to quarterly once the pipeline is proven? |
||||
5. **SVD dimensions k**: default 50 dims for SVD. This needs to be validated against the actual data size (number of unique MPs × motions per window). A window with 100 MPs and 50 motions cannot have k=50 — needs to be `k < min(n_mps, n_motions)`. Pipeline must enforce this dynamically. |
||||
Loading…
Reference in new issue