Add design for honest PCA axis labeling: validates each compass axis against a party ideology reference CSV and labels dynamically (Links–Rechts, Coalitie–Oppositie, or fallback) instead of always hardcoding Left–Right.

main, parent 50f8a06c6d, commit bed911b92c

@@ -0,0 +1,279 @@

---
date: 2026-03-29
topic: "Honest PCA Axis Classification"
status: validated
---

# Axis Classification Design

## Problem Statement

The political compass always labels its X-axis "Links–Rechts" and its Y-axis "Progressief–Conservatief",
regardless of what the PCA actually found. In coalition years, the first principal component captures
**coalition membership**, not ideology. The dominant axis of voting variation in Rutte II (VVD+PvdA)
and Rutte III/IV (VVD+CDA+D66+CU) is "are you in the governing coalition?" Under Rutte III/IV, PvdA and
PVV end up at the same position because both were in opposition. That is technically correct voting
similarity, but the label "Links–Rechts" is a lie.

The fix: after each PCA, validate what the axes actually capture by correlating party positions against
a small reference dataset of known ideological scores and coalition membership. Assign labels honestly.

## Constraints

- No changes to the PCA computation itself (`compute_2d_axes` is unchanged)
- No new runtime dependencies (scipy is already optional; pandas is already present)
- `party_ideologies.csv` and `coalition_membership.csv` are static data files, not derived from the DB
- Backward-compatible: the compass still renders even when reference files are missing (falls back to
  the current hardcoded labels without a user-visible error)

## Approach

Reference-validated PCA with dynamic labeling. For each time window, correlate the per-party PCA
positions against known ideological scores and coalition membership. Assign the label of the first
reference dimension whose |r| clears the threshold, checked in a fixed priority order.
Surface the finding as a one-line caption in the UI when the axis diverges from "Links–Rechts".

Rejected alternatives:
- **Fixed anchor compass**: replaces honest complexity with comfortable fiction; loses behavioral
  information entirely
- **Dual view (behavioral + ideological)**: too much UI complexity for V1; can be done later

## Architecture Overview

A thin axis classification layer sits between `compute_2d_axes` (unchanged) and the compass UI.

```
compute_2d_axes()
    ↓
positions_by_window + axes dict
    ↓
classify_axes(positions_by_window, axes, db_path)
    ↓
axes dict enriched with:
  - x_label, y_label      (global, most-common label across annual windows)
  - x_quality             (dict: window_id → float, max |r|)
  - y_quality             (dict: window_id → float, max |r|)
  - x_interpretation      (dict: window_id → Dutch str)
  - y_interpretation      (dict: window_id → Dutch str)
    ↓
compass renderer uses labels + per-year quality captions
```

## Components

### 1. Reference data files

**`data/party_ideologies.csv`**

One row per party name; aliases such as `GroenLinks`/`GL` and `NSC`/`Nieuw Sociaal Contract` each get
their own row. Party names must match entity IDs in the `svd_vectors` table exactly.

```
party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
GroenLinks,-0.70,0.90
GL,-0.70,0.90
GroenLinks-PvdA,-0.70,0.82
ChristenUnie,0.10,-0.55
SGP,0.35,-0.95
PVV,0.90,-0.50
DENK,-0.40,0.55
50Plus,-0.05,-0.10
FVD,0.90,-0.75
PvdD,-0.60,0.85
Volt,-0.20,0.80
JA21,0.70,-0.30
BBB,0.50,-0.35
NSC,0.20,-0.20
Nieuw Sociaal Contract,0.20,-0.20
BVNL,0.85,-0.55
Bij1,-0.90,0.90
```

Scores: left_right = −1 (far left) to +1 (far right). progressive = −1 (conservative) to +1 (progressive).
These are expert judgments based on party programs and voting records, not derived algorithmically.
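
A minimal sketch of how this file might be parsed, using only the standard library. The function name `parse_party_ideologies` and the range check are assumptions made for illustration; the design does not specify them, and the real module may well use pandas instead.

```python
import csv
import io


def parse_party_ideologies(stream):
    """Return {party: (left_right, progressive)} from an open text stream.

    Hypothetical helper: the name and the validation are illustrative only.
    """
    scores = {}
    for row in csv.DictReader(stream):
        left_right = float(row["left_right"])
        progressive = float(row["progressive"])
        # Both scales are documented as running from -1 to +1.
        if not (-1.0 <= left_right <= 1.0 and -1.0 <= progressive <= 1.0):
            raise ValueError(f"score out of range for {row['party']!r}")
        scores[row["party"]] = (left_right, progressive)
    return scores


# Two rows copied from the reference file above.
sample = io.StringIO("party,left_right,progressive\nVVD,0.65,0.10\nSP,-0.90,0.50\n")
scores = parse_party_ideologies(sample)
```

Accepting a stream rather than a path keeps the parser easy to feed from an in-memory `StringIO`, which is the style the testing section relies on.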

**`data/coalition_membership.csv`**

One row per (window_id, party) where that party held a government seat. Annual windows only; quarterly
windows inherit from their year.

```
window_id,party
2012,VVD
2012,PvdA
2013,VVD
2013,PvdA
2014,VVD
2014,PvdA
2015,VVD
2015,PvdA
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
2018,VVD
2018,CDA
2018,D66
2018,ChristenUnie
2019,VVD
2019,CDA
2019,D66
2019,ChristenUnie
2020,VVD
2020,CDA
2020,D66
2020,ChristenUnie
2021,VVD
2021,CDA
2021,D66
2021,ChristenUnie
2022,VVD
2022,D66
2022,CDA
2022,ChristenUnie
2023,VVD
2023,D66
2023,CDA
2023,ChristenUnie
2024,PVV
2024,VVD
2024,NSC
2024,BBB
2025,PVV
2025,VVD
2025,NSC
2025,BBB
2026,PVV
2026,VVD
2026,NSC
2026,BBB
```
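
The year inheritance for quarterly windows and the ±1 coding could be sketched as follows. All three function names are hypothetical, and splitting on the first `-` assumes quarterly ids always look like `2016-Q3`.

```python
import csv
import io


def load_coalition_membership(stream):
    """Return {year: set of governing parties} from window_id,party rows."""
    members = {}
    for row in csv.DictReader(stream):
        members.setdefault(row["window_id"], set()).add(row["party"])
    return members


def window_year(window_id):
    """Quarterly windows inherit from their year: '2016-Q3' -> '2016'.

    Assumption: the year is everything before the first '-'.
    """
    return window_id.split("-", 1)[0]


def coalition_dummy(parties, window_id, members):
    """+1.0 for parties in government in this window's year, -1.0 otherwise."""
    governing = members.get(window_year(window_id), set())
    return [1.0 if party in governing else -1.0 for party in parties]


rows = io.StringIO("window_id,party\n2016,VVD\n2016,PvdA\n")
members = load_coalition_membership(rows)
dummy = coalition_dummy(["VVD", "PvdA", "PVV", "SP"], "2016-Q3", members)
```

Note that a year with no rows in the file yields an all −1 vector. A constant vector has no defined Pearson correlation, so a caller would need to skip the coalition dimension for such a window.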

### 2. `analysis/axis_classifier.py` (new module)

Single public function: `classify_axes(positions_by_window, axes, db_path)`.

The function is pure except for reading two CSV files (cached at module level after the first load).

**Algorithm per window:**

1. Collect parties that appear in both `positions_by_window[window_id]` and `party_ideologies.csv`.
   Skip windows with fewer than 5 overlapping parties.
2. Build vectors:
   - `party_x`: per-party X positions from this window
   - `party_y`: per-party Y positions from this window
   - `ref_lr`: left_right scores from CSV
   - `ref_pc`: progressive scores from CSV
   - `coalition_dummy`: +1 if party is in government for this window's year, −1 otherwise
     (quarterly windows: strip suffix to get year, e.g., `2016-Q3` → `2016`)
3. Compute Pearson r for X against each reference dimension:
   - `r_lr_x = pearsonr(party_x, ref_lr)[0]`
   - `r_pc_x = pearsonr(party_x, ref_pc)[0]`
   - `r_co_x = pearsonr(party_x, coalition_dummy)[0]`
4. Assign label and interpretation using priority order (first threshold that fires wins):
   - `|r_lr_x| ≥ 0.65` → label = `"Links–Rechts"`, flip sign if r < 0
   - `|r_co_x| ≥ 0.65` → label = `"Coalitie–Oppositie"`
   - `|r_pc_x| ≥ 0.65` → label = `"Progressief–Conservatief"`, flip sign if r < 0
   - fallback → label = `"Stempatroon As 1"`
5. Quality score for this window's X-axis: `max(|r_lr_x|, |r_pc_x|, |r_co_x|)`
6. Repeat steps 3–5 for Y-axis using `party_y`.
7. After processing all windows, pick the global X label (and likewise the global Y label) as the modal
   label across annual windows only (quarterly windows participate in quality tracking but not in the
   modal vote, to avoid over-weighting).
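
Steps 3 to 5 and step 7 can be sketched for a single axis as below. This is an illustration under stated assumptions, not the module itself: it swaps `scipy.stats.pearsonr` for a small pure-Python Pearson helper so it runs without scipy, it omits the sign flip, and it treats any window id containing `-` as quarterly. The names `pearson_r`, `classify_axis`, and `modal_label` are hypothetical.

```python
import math
from collections import Counter

THRESHOLD = 0.65  # threshold named in the algorithm above


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def classify_axis(positions, ref_lr, ref_pc, coalition, fallback="Stempatroon As 1"):
    """Return (label, quality) for one axis of one window."""
    r_lr = pearson_r(positions, ref_lr)
    r_pc = pearson_r(positions, ref_pc)
    r_co = pearson_r(positions, coalition)
    quality = max(abs(r_lr), abs(r_pc), abs(r_co))
    # Priority order: the first threshold that fires wins.
    if abs(r_lr) >= THRESHOLD:
        label = "Links–Rechts"
    elif abs(r_co) >= THRESHOLD:
        label = "Coalitie–Oppositie"
    elif abs(r_pc) >= THRESHOLD:
        label = "Progressief–Conservatief"
    else:
        label = fallback
    return label, quality


def modal_label(labels_by_window):
    """Most common label across annual windows (ids without '-' are assumed annual)."""
    annual = [label for wid, label in labels_by_window.items() if "-" not in wid]
    return Counter(annual).most_common(1)[0][0] if annual else None


# Synthetic 2016-style window: VVD, PvdA, SP, CDA, PVV, D66.
party_x = [1.0, 0.9, -1.0, -0.8, -0.9, -0.7]          # coalition parties on the right
ref_lr = [0.65, -0.70, -0.90, 0.25, 0.90, -0.10]      # from party_ideologies.csv
ref_pc = [0.10, 0.75, 0.50, -0.45, -0.50, 0.85]
coalition = [1.0, 1.0, -1.0, -1.0, -1.0, -1.0]        # VVD + PvdA in government
label, quality = classify_axis(party_x, ref_lr, ref_pc, coalition)
```

Because the check is ordered rather than a maximum, an axis that clears the threshold on both left-right and coalition membership is labeled "Links–Rechts" even if the coalition correlation is larger.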

**Interpretation strings (Dutch):**

| label | interpretation |
|---|---|
| Links–Rechts | "De horizontale as weerspiegelt de klassieke links-rechts tegenstelling." |
| Coalitie–Oppositie | "De horizontale as weerspiegelt stemgedrag van coalitie- versus oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar." |
| Progressief–Conservatief | "De horizontale as weerspiegelt de progressief-conservatieve tegenstelling." |
| Stempatroon As 1 | "De horizontale as weerspiegelt een empirisch stempatroon zonder duidelijke ideologische richting." |

Y-axis interpretations follow the same template with "verticale" instead of "horizontale".

**Return value:** the input `axes` dict with six new keys added:
`x_label`, `y_label`, `x_quality` (dict), `y_quality` (dict), `x_interpretation` (dict),
`y_interpretation` (dict).

### 3. `explorer.py` changes

**`load_positions()`**: after calling `compute_2d_axes`, call `classify_axes` and store the enriched
axes dict. If `classify_axes` raises for any reason, catch and log; use the original axes dict.
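
The catch-and-fall-back behaviour might look like the sketch below. The wrapper name `classify_axes_safely` is hypothetical, and the classifier is passed in as an argument only so the example stays self-contained.

```python
import logging

logger = logging.getLogger(__name__)


def classify_axes_safely(classify, positions_by_window, axes, db_path):
    """Call `classify`; on any exception, log it and return the original axes."""
    try:
        return classify(positions_by_window, axes, db_path)
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("axis classification failed; using default labels")
        return axes


def _broken_classifier(positions_by_window, axes, db_path):
    """Stand-in that always fails, to exercise the fallback path."""
    raise RuntimeError("simulated failure")


original_axes = {"x": "pc1", "y": "pc2"}
result = classify_axes_safely(_broken_classifier, {}, original_axes, "unused.db")
```

One design point to keep in mind: if the classifier writes keys into `axes` in place and then raises partway through, the "original" dict already carries partial results. Having the classifier work on a copy would avoid that.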

**Compass renderer** (two changes only):
1. Replace hardcoded `"Links–Rechts"` / `"Progressief–Conservatief"` axis title strings with
   `axes.get("x_label", "Links–Rechts")` and `axes.get("y_label", "Progressief–Conservatief")`.
2. Add a caption below the compass for the selected year. Show it when the selected window's axis
   label differs from the default label, or when either axis quality is below 0.65:
   > *"In 2016 weerspiegelt de horizontale as coalitie–oppositie stemgedrag (r=0.71)."*

   Source: `axes.get("x_interpretation", {}).get(selected_window_id, "")`, so the lookup also works
   when classification failed and `axes` was left unenriched.

No other UI changes. The compass layout is untouched.
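
A short sketch of the two renderer lookups, written so that they also work on an `axes` dict that was never enriched. The helper names `axis_titles` and `x_caption` are assumptions for illustration.

```python
DEFAULT_X_LABEL = "Links–Rechts"
DEFAULT_Y_LABEL = "Progressief–Conservatief"


def axis_titles(axes):
    """Axis titles, falling back to the previously hardcoded labels."""
    return (
        axes.get("x_label", DEFAULT_X_LABEL),
        axes.get("y_label", DEFAULT_Y_LABEL),
    )


def x_caption(axes, window_id):
    """Caption text for the selected window, or '' when none is available."""
    return axes.get("x_interpretation", {}).get(window_id, "")


# An axes dict that was never enriched (reference files missing or classifier failed).
plain_axes = {}

# An enriched axes dict for a coalition-dominated year.
enriched_axes = {
    "x_label": "Coalitie–Oppositie",
    "x_interpretation": {
        "2016": "De horizontale as weerspiegelt stemgedrag van coalitie- versus "
                "oppositiepartijen (r=0.71). Links-rechts is minder dominant dit jaar."
    },
}
```

With `plain_axes` the compass renders exactly as before, which is the backward-compatibility requirement from the constraints.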

## Data Flow

```
load_positions(db_path, window_size)
  → compute_2d_axes(...)          [unchanged; returns positions_by_window, axes]
  → classify_axes(                [new]
        positions_by_window,
        axes,
        db_path=db_path
    )
      reads: data/party_ideologies.csv      (module-level cache)
      reads: data/coalition_membership.csv  (module-level cache)
      uses: positions_by_window already in memory
      writes: new keys into axes dict (no mutation of positions)
  → return positions_by_window, axes_enriched

compass render (existing function)
  → axes["x_label"]                       [was hardcoded "Links–Rechts"]
  → axes["y_label"]                       [was hardcoded "Progressief–Conservatief"]
  → axes["x_interpretation"][window_id]   [new caption]
```

No DB writes. No new DB queries. Pure in-memory correlation over data that's already loaded.
The two CSV files are small, so reading them is cheap, and they are cached after the first call.

## Error Handling

| Failure | Behaviour |
|---|---|
| `data/party_ideologies.csv` missing | Log WARNING, return `axes` unchanged (current labels preserved) |
| `data/coalition_membership.csv` missing | Log WARNING, coalition dimension skipped; other correlations still computed |
| Party in positions but not in CSV | Skip silently; log once at DEBUG per session |
| Window has fewer than 5 overlapping parties | Skip classification for that window; use fallback label |
| All \|r\| < 0.65 | Fallback label is assigned; no crash |
| Any unexpected exception in `classify_axes` | Caller (`load_positions`) catches, logs, returns original `axes` dict |

## Testing Strategy

Three new tests added to `tests/test_political_compass.py`:

**`test_axis_label_left_right`**
Construct synthetic per-party positions where X values correlate strongly (r > 0.8) with the left_right
column of a minimal inline CSV. Assert that `classify_axes` returns `x_label == "Links–Rechts"` and
`x_quality[window] > 0.65`.

**`test_axis_label_coalition_dominant`**
Construct synthetic positions where X values match the coalition membership pattern but NOT left-right.
(E.g., coalition parties [VVD, PvdA] cluster at x=+1 and opposition parties [PVV, SP] at x=−1, which is
historically coherent for 2016. The fixture needs at least five parties in total, otherwise the window
is skipped by the minimum-overlap check.) Assert `x_label == "Coalitie–Oppositie"` and that the
interpretation string contains "coalitie".

**`test_axis_classifier_missing_csv`**
Call `classify_axes` with a db_path pointing to a nonexistent directory so CSV loading fails. Assert
that the function returns the axes dict unchanged and does not raise.

The first two tests use monkeypatching to inject CSV content as in-memory StringIO, following the
existing pattern in `tests/test_political_compass.py` of patching module-level imports; the third
relies on the nonexistent path instead.
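
As an illustration of the fixture for `test_axis_label_left_right`, the sketch below builds six synthetic X positions from an inline CSV and checks that they correlate strongly with `left_right`. It does not call `classify_axes`, whose implementation is outside this document; `left_right_fixture`, `pearson_r`, and the hand-picked offsets are assumptions.

```python
import csv
import io
import math

INLINE_CSV = """party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
PVV,0.90,-0.50
"""


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom


def left_right_fixture():
    """Return ({party: x}, [left_right scores]) with x close to left_right."""
    rows = list(csv.DictReader(io.StringIO(INLINE_CSV)))
    # Fixed offsets instead of random noise keep the fixture deterministic.
    offsets = [0.05, -0.10, 0.08, -0.04, 0.10, -0.06]
    positions = {
        row["party"]: float(row["left_right"]) + offset
        for row, offset in zip(rows, offsets)
    }
    reference = [float(row["left_right"]) for row in rows]
    return positions, reference


positions, reference = left_right_fixture()
r = pearson_r(list(positions.values()), reference)
```

Six parties are used so the fixture clears the five-party minimum overlap with one to spare.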

## Open Questions

None.