---
date: 2026-03-29
topic: "Honest PCA Axis Classification"
status: validated
---

# Axis Classification Design

## Problem Statement

The political compass always labels its X-axis "Links–Rechts" and its Y-axis "Progressief–Conservatief", regardless of what the PCA actually found.

In coalition years, the first principal component captures **coalition membership**, not ideology. The dominant axis of voting variation in Rutte II (VVD+PvdA) and Rutte III/IV (VVD+CDA+D66+CU) is "are you in the governing coalition?" PvdA and PVV end up at the same position because both were in opposition: technically correct voting similarity, but the label "Links–Rechts" is a lie.

The fix: after each PCA, validate what the axes actually capture by correlating party positions against a small reference dataset of known ideological scores. Assign labels honestly.

## Constraints

- No changes to the PCA computation itself (`compute_2d_axes` is unchanged)
- No new runtime dependencies (scipy is already optional; pandas is already present)
- `party_ideologies.csv` and `coalition_membership.csv` are static data files, not derived from the DB
- Backward-compatible: the compass still renders even when reference files are missing (falls back to current hardcoded labels silently)

## Approach

Reference-validated PCA with dynamic labeling. For each time window, correlate the per-party PCA positions against known ideological scores. Assign a label based on which correlation is strongest. Surface the finding as a one-line caption in the UI when the axis diverges from "Links–Rechts".

Rejected alternatives:

- **Fixed anchor compass**: replaces honest complexity with comfortable fiction; loses behavioral information entirely
- **Dual view (behavioral + ideological)**: too much UI complexity for V1; can be done later

## Architecture Overview

A thin axis classification layer sits between `compute_2d_axes` (unchanged) and the compass UI.

```
compute_2d_axes()
    ↓
positions_by_window + axes dict
    ↓
classify_axes(positions_by_window, axes, db_path)
    ↓
axes dict enriched with:
  - x_label, y_label (global, most-common label across annual windows)
  - x_quality (dict: window_id → float, max |r|)
  - y_quality (dict: window_id → float, max |r|)
  - x_interpretation (dict: window_id → Dutch str)
  - y_interpretation (dict: window_id → Dutch str)
    ↓
compass renderer uses labels + per-year quality captions
```

## Components

### 1. Reference data files

**`data/party_ideologies.csv`**

One row per party. Party names must match entity IDs in the `svd_vectors` table exactly.

```
party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
GroenLinks,-0.70,0.90
GL,-0.70,0.90
GroenLinks-PvdA,-0.70,0.82
ChristenUnie,0.10,-0.55
SGP,0.35,-0.95
PVV,0.90,-0.50
DENK,-0.40,0.55
50Plus,-0.05,-0.10
FVD,0.90,-0.75
PvdD,-0.60,0.85
Volt,-0.20,0.80
JA21,0.70,-0.30
BBB,0.50,-0.35
NSC,0.20,-0.20
Nieuw Sociaal Contract,0.20,-0.20
BVNL,0.85,-0.55
Bij1,-0.90,0.90
```

Scores: left_right = −1 (far left) to +1 (far right). progressive = −1 (conservative) to +1 (progressive). These are expert judgments based on party programs and voting records, not derived algorithmically.

**`data/coalition_membership.csv`**

One row per (window_id, party) where that party held a government seat. Annual windows only; quarterly windows inherit from their year.

```
window_id,party
2012,VVD
2012,PvdA
2013,VVD
2013,PvdA
2014,VVD
2014,PvdA
2015,VVD
2015,PvdA
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
2018,VVD
2018,CDA
2018,D66
2018,ChristenUnie
2019,VVD
2019,CDA
2019,D66
2019,ChristenUnie
2020,VVD
2020,CDA
2020,D66
2020,ChristenUnie
2021,VVD
2021,CDA
2021,D66
2021,ChristenUnie
2022,VVD
2022,D66
2022,CDA
2022,ChristenUnie
2023,VVD
2023,D66
2023,CDA
2023,ChristenUnie
2024,PVV
2024,VVD
2024,NSC
2024,BBB
2025,PVV
2025,VVD
2025,NSC
2025,BBB
2026,PVV
2026,VVD
2026,NSC
2026,BBB
```

### 2. `analysis/axis_classifier.py` (new module)

Single public function: `classify_axes(positions_by_window, axes, db_path)`.

The function is pure except for reading two CSV files (cached at module level after first load). CSV paths are derived from `db_path`: `Path(db_path).parent / "party_ideologies.csv"` and `Path(db_path).parent / "coalition_membership.csv"`. Both files live in the same `data/` directory as the database.

**Algorithm per window:**

1. Collect parties that appear in both `positions_by_window[window_id]` and `party_ideologies.csv`. Skip windows with fewer than 5 overlapping parties.
2. Build vectors:
   - `party_x`: per-party X positions from this window
   - `party_y`: per-party Y positions from this window
   - `ref_lr`: left_right scores from CSV
   - `ref_pc`: progressive scores from CSV
   - `coalition_dummy`: +1 if party is in government for this window's year, −1 otherwise (quarterly windows: strip suffix to get year, e.g., `2016-Q3` → `2016`)
3. Compute Pearson r for X against each reference dimension:
   - `r_lr_x = pearsonr(party_x, ref_lr)[0]`
   - `r_pc_x = pearsonr(party_x, ref_pc)[0]`
   - `r_co_x = pearsonr(party_x, coalition_dummy)[0]`
4. Assign label and interpretation using priority order (first threshold that fires wins):
   - `|r_lr_x| ≥ 0.65` → label = `"Links–Rechts"`, flip sign if r < 0
   - `|r_co_x| ≥ 0.65` → label = `"Coalitie–Oppositie"`
   - `|r_pc_x| ≥ 0.65` → label = `"Progressief–Conservatief"`, flip sign if r < 0
   - fallback → label = `"Stempatroon As 1"`
5. Quality score for this window's X-axis: `max(|r_lr_x|, |r_pc_x|, |r_co_x|)`
6. Repeat steps 3–5 for Y-axis using `party_y`.
7. After processing all windows, pick global X label = modal label across annual windows only (quarterly windows participate in quality tracking but not in the modal vote, to avoid over-weighting).
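
The per-window steps above can be sketched as follows. This is a minimal illustration, not the module's actual code: `pearson_r`, `classify_axis` and `modal_label` are hypothetical helper names, the correlation is computed in plain Python so the sketch runs without scipy, returning 0.0 for a constant vector is an assumption, and annual window IDs are assumed to be all-digit strings such as `"2016"`.

```python
import math
from collections import Counter

THRESHOLD = 0.65  # cut-off used in step 4


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0  # assumption: an undefined correlation counts as "no signal"
    return sum(a * b for a, b in zip(dx, dy)) / denom


def classify_axis(positions, ref_lr, ref_pc, coalition_dummy,
                  fallback="Stempatroon As 1"):
    """Return (label, flip, quality) for one axis of one window (steps 3-5)."""
    r_lr = pearson_r(positions, ref_lr)
    r_pc = pearson_r(positions, ref_pc)
    r_co = pearson_r(positions, coalition_dummy)
    quality = max(abs(r_lr), abs(r_pc), abs(r_co))
    # Priority order: the first threshold that fires wins.
    if abs(r_lr) >= THRESHOLD:
        return "Links–Rechts", r_lr < 0, quality
    if abs(r_co) >= THRESHOLD:
        return "Coalitie–Oppositie", False, quality
    if abs(r_pc) >= THRESHOLD:
        return "Progressief–Conservatief", r_pc < 0, quality
    return fallback, False, quality


def modal_label(labels_by_window):
    """Most common label across annual windows only (step 7)."""
    annual = [lab for wid, lab in labels_by_window.items() if wid.isdigit()]
    return Counter(annual).most_common(1)[0][0] if annual else None


# Synthetic 2016-style window: VVD, PvdA, SP, PVV, CDA, D66.
party_x = [1.0, 0.9, -0.8, -1.0, -0.6, -0.5]        # made-up PCA positions
ref_lr = [0.65, -0.70, -0.90, 0.90, 0.25, -0.10]    # from party_ideologies.csv
ref_pc = [0.10, 0.75, 0.50, -0.50, -0.45, 0.85]
coalition_dummy = [1, 1, -1, -1, -1, -1]            # VVD and PvdA govern
label, flip, quality = classify_axis(party_x, ref_lr, ref_pc, coalition_dummy)
print(label, flip, round(quality, 2))
```

With these made-up positions the coalition parties sit together on the positive side, so the coalition correlation is the only one that clears the threshold and the axis is labelled "Coalitie–Oppositie" rather than "Links–Rechts".
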
The `current_parliament` window is excluded from modal voting entirely and from the coalition dimension (no year to look up); it still gets x_quality and x_interpretation based on the left_right and progressive correlations.

**Interpretation strings (Dutch):**

| label | interpretation |
|---|---|
| Links–Rechts | "De horizontale as weerspiegelt de klassieke links-rechts tegenstelling." |
| Coalitie–Oppositie | "De horizontale as weerspiegelt stemgedrag van coalitie- versus oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar." |
| Progressief–Conservatief | "De horizontale as weerspiegelt de progressief-conservatieve tegenstelling." |
| Stempatroon As 1 | "De horizontale as weerspiegelt een empirisch stempatroon zonder duidelijke ideologische richting." |

Y-axis interpretations follow the same template with "verticale" instead of "horizontale".

**Return value:** the input `axes` dict with six new keys added: `x_label`, `y_label`, `x_quality` (dict), `y_quality` (dict), `x_interpretation` (dict), `y_interpretation` (dict).

### 3. `explorer.py` changes

**`load_positions()`**: after calling `compute_2d_axes`, call `classify_axes` and store the enriched axes dict. If `classify_axes` raises for any reason, catch and log; use the original axes dict.

**Compass renderer**: two changes only:

1. Replace hardcoded `"Links–Rechts"` / `"Progressief–Conservatief"` axis title strings with `axes.get("x_label", "Links–Rechts")` and `axes.get("y_label", "Progressief–Conservatief")`.
2. Add a caption below the compass for the selected year. Show when either axis quality < 0.65:
   > *"In 2016 weerspiegelt de horizontale as coalitie–oppositie stemgedrag (r=0.71)."*

   Source: `axes["x_interpretation"].get(selected_window_id, "")`.

No other UI changes. The compass layout is untouched.

## Data Flow

```
load_positions(db_path, window_size)
  → compute_2d_axes(...)        [unchanged; returns positions_by_window, axes]
  → classify_axes(              [new]
        positions_by_window, axes,
        db_path=db_path
    )
      reads:  data/party_ideologies.csv      (module-level cache)
      reads:  data/coalition_membership.csv  (module-level cache)
      uses:   positions_by_window already in memory
      writes: new keys into axes dict (no mutation of positions)
  → return positions_by_window, axes_enriched

compass render (existing function)
  → axes["x_label"]                        [was hardcoded "Links–Rechts"]
  → axes["y_label"]                        [was hardcoded "Progressief–Conservatief"]
  → axes["x_interpretation"][window_id]    [new caption]
```

No DB writes. No new DB queries. Pure in-memory correlation over data that's already loaded. CSV reads are negligible for files this small and are cached after the first call.

## Error Handling

| Failure | Behaviour |
|---|---|
| `data/party_ideologies.csv` missing | Log WARNING, return `axes` unchanged (current labels preserved) |
| `data/coalition_membership.csv` missing | Log WARNING, coalition dimension skipped; other correlations still computed |
| Party in positions but not in CSV | Skip silently; log once at DEBUG per session |
| Window has fewer than 5 overlapping parties | Skip classification for that window; use fallback label |
| All correlations < 0.65 | Fallback label is always safe; no crash |
| Any unexpected exception in `classify_axes` | Caller (`load_positions`) catches, logs, returns original `axes` dict |

## Testing Strategy

Three new tests added to `tests/test_political_compass.py`:

**`test_axis_label_left_right`**

Construct synthetic per-party positions where X values correlate strongly (r > 0.8) with the left_right column of a minimal inline CSV. Assert that `classify_axes` returns `x_label == "Links–Rechts"` and `x_quality[window] > 0.65`.

**`test_axis_label_coalition_dominant`**

Construct synthetic positions where X values match coalition membership pattern but NOT left-right. (E.g., coalition parties [VVD, PvdA] cluster at x=+1, opposition [PVV, SP] at x=−1, which is historically coherent for 2016.)
Assert `x_label == "Coalitie–Oppositie"` and that the interpretation string contains "coalitie".

**`test_axis_classifier_missing_csv`**

Call `classify_axes` with a db_path pointing to a nonexistent directory so CSV loading fails. Assert that the function returns the axes dict unchanged and does not raise.

All three tests use monkeypatching to inject CSV content as in-memory StringIO, following the existing pattern in `tests/test_political_compass.py` of patching module-level imports.

## Deployment

The CSV files (`data/party_ideologies.csv` and `data/coalition_membership.csv`) are **static reference data committed to git**. They are baked into the Docker image at build time alongside the application code. No rsync or volume mount is needed.

The `.gitignore` excludes `data/*.db`, `data/*.bak`, `data/*.json` but not `data/*.csv`, so they can be tracked without change to the ignore rules.

The data volume mount (`DATA_DIR:/home/app/app/data`) only contains the database file and does not overwrite the baked-in CSVs.

When party compositions change (e.g., a new party enters parliament), update the CSV, commit, and redeploy. Typical frequency: once per parliament formation (~4 years).

## Open Questions

None.