Add design for honest PCA axis labeling: validates each compass axis against a party ideology reference CSV and labels it dynamically (Links–Rechts, Coalitie–Oppositie, or fallback) instead of always hardcoding Left–Right.
main
parent 50f8a06c6d
commit bed911b92c
@ -0,0 +1,279 @@
---
date: 2026-03-29
topic: "Honest PCA Axis Classification"
status: validated
---

# Axis Classification Design

## Problem Statement

The political compass always labels its X-axis "Links–Rechts" and Y-axis "Progressief–Conservatief"
regardless of what the PCA actually found. In coalition years, the first principal component captures
**coalition membership**, not ideology. The dominant axis of voting variation in Rutte II (VVD+PvdA)
and Rutte III/IV (VVD+CDA+D66+CU) is "are you in the governing coalition?" Under Rutte III/IV, for
example, PvdA and PVV end up at the same position because both were in opposition. That is a
technically correct picture of voting similarity, but the label "Links–Rechts" is a lie.

The fix: after each PCA, validate what the axes actually capture by correlating party positions against
a small reference dataset of known ideological scores. Assign labels honestly.

## Constraints

- No changes to the PCA computation itself (`compute_2d_axes` is unchanged)
- No new runtime dependencies (scipy is already optional; pandas is already present)
- `party_ideologies.csv` and `coalition_membership.csv` are static data files, not derived from the DB
- Backward-compatible: the compass still renders even when reference files are missing (it falls back
  to the current hardcoded labels without raising or showing an error in the UI; only a WARNING is logged)

## Approach

Reference-validated PCA with dynamic labeling. For each time window, correlate the per-party PCA
positions against known ideological scores. Assign a label based on which correlation is strongest.
Surface the finding as a one-line caption in the UI when the axis diverges from "Links–Rechts".

Rejected alternatives:
- **Fixed anchor compass**: replaces honest complexity with comfortable fiction; loses behavioral
  information entirely
- **Dual view (behavioral + ideological)**: too much UI complexity for V1; can be done later

## Architecture Overview

A thin axis classification layer sits between `compute_2d_axes` (unchanged) and the compass UI.

```
compute_2d_axes()
        ↓
positions_by_window + axes dict
        ↓
classify_axes(positions_by_window, axes, db_path)
        ↓
axes dict enriched with:
  - x_label, y_label (global, most-common label across annual windows)
  - x_quality (dict: window_id → float, max |r|)
  - y_quality (dict: window_id → float, max |r|)
  - x_interpretation (dict: window_id → Dutch str)
  - y_interpretation (dict: window_id → Dutch str)
        ↓
compass renderer uses labels + per-year quality captions
```

## Components

### 1. Reference data files

**`data/party_ideologies.csv`**

One row per party. Party names must match entity IDs in the `svd_vectors` table exactly.

```
party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
GroenLinks,-0.70,0.90
GL,-0.70,0.90
GroenLinks-PvdA,-0.70,0.82
ChristenUnie,0.10,-0.55
SGP,0.35,-0.95
PVV,0.90,-0.50
DENK,-0.40,0.55
50Plus,-0.05,-0.10
FVD,0.90,-0.75
PvdD,-0.60,0.85
Volt,-0.20,0.80
JA21,0.70,-0.30
BBB,0.50,-0.35
NSC,0.20,-0.20
Nieuw Sociaal Contract,0.20,-0.20
BVNL,0.85,-0.55
Bij1,-0.90,0.90
```

Scores: `left_right` = −1 (far left) to +1 (far right). `progressive` = −1 (conservative) to +1 (progressive).
These are expert judgments based on party programs and voting records, not derived algorithmically.

**`data/coalition_membership.csv`**

One row per (window_id, party) where that party held a government seat. Annual windows only; quarterly
windows inherit from their year.

```
window_id,party
2012,VVD
2012,PvdA
2013,VVD
2013,PvdA
2014,VVD
2014,PvdA
2015,VVD
2015,PvdA
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
2018,VVD
2018,CDA
2018,D66
2018,ChristenUnie
2019,VVD
2019,CDA
2019,D66
2019,ChristenUnie
2020,VVD
2020,CDA
2020,D66
2020,ChristenUnie
2021,VVD
2021,CDA
2021,D66
2021,ChristenUnie
2022,VVD
2022,D66
2022,CDA
2022,ChristenUnie
2023,VVD
2023,D66
2023,CDA
2023,ChristenUnie
2024,PVV
2024,VVD
2024,NSC
2024,BBB
2025,PVV
2025,VVD
2025,NSC
2025,BBB
2026,PVV
2026,VVD
2026,NSC
2026,BBB
```

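The lookup this file implies (quarterly windows inheriting from their year, and a ±1 membership vector per window) can be sketched as follows. This is a minimal illustration rather than the module's actual code: `window_year`, `load_coalition`, and `coalition_dummy` are hypothetical helper names, and the inline CSV is a shortened excerpt of the listing above.

```python
import csv
import io
from collections import defaultdict

# Shortened excerpt of data/coalition_membership.csv, inlined for illustration.
SAMPLE_CSV = """window_id,party
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
"""


def window_year(window_id: str) -> str:
    """Map a quarterly id such as '2016-Q3' to '2016'; annual ids pass through."""
    return window_id.split("-", 1)[0]


def load_coalition(handle) -> dict:
    """Read (window_id, party) rows into a year -> set-of-parties mapping."""
    members = defaultdict(set)
    for row in csv.DictReader(handle):
        members[row["window_id"]].add(row["party"])
    return dict(members)


def coalition_dummy(window_id: str, parties: list, members: dict) -> list:
    """+1 for parties in government during the window's year, -1 otherwise."""
    in_government = members.get(window_year(window_id), set())
    return [1 if party in in_government else -1 for party in parties]


members = load_coalition(io.StringIO(SAMPLE_CSV))
print(coalition_dummy("2016-Q3", ["VVD", "PvdA", "PVV", "SP"], members))
```

A year that is absent from the file yields a vector of all −1, for which a Pearson correlation is undefined, so a real implementation would likely need to skip the coalition dimension in that case.
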
### 2. `analysis/axis_classifier.py` (new module)

Single public function: `classify_axes(positions_by_window, axes, db_path)`.

Apart from reading the two CSV files (cached at module level after the first load) and adding keys to
the `axes` dict, the function has no side effects.

**Algorithm per window:**

1. Collect parties that appear in both `positions_by_window[window_id]` and `party_ideologies.csv`.
   Skip windows with fewer than 5 overlapping parties.
2. Build vectors:
   - `party_x`: per-party X positions from this window
   - `party_y`: per-party Y positions from this window
   - `ref_lr`: left_right scores from CSV
   - `ref_pc`: progressive scores from CSV
   - `coalition_dummy`: +1 if party is in government for this window's year, −1 otherwise
     (quarterly windows: strip suffix to get year, e.g., `2016-Q3` → `2016`)
3. Compute Pearson r for X against each reference dimension:
   - `r_lr_x = pearsonr(party_x, ref_lr)[0]`
   - `r_pc_x = pearsonr(party_x, ref_pc)[0]`
   - `r_co_x = pearsonr(party_x, coalition_dummy)[0]`
4. Assign label and interpretation using priority order (first threshold that fires wins):
   - `|r_lr_x| ≥ 0.65` → label = `"Links–Rechts"`, flip sign if r < 0
   - `|r_co_x| ≥ 0.65` → label = `"Coalitie–Oppositie"`
   - `|r_pc_x| ≥ 0.65` → label = `"Progressief–Conservatief"`, flip sign if r < 0
   - fallback → label = `"Stempatroon As 1"`
5. Quality score for this window's X-axis: `max(|r_lr_x|, |r_pc_x|, |r_co_x|)`
6. Repeat steps 3–5 for Y-axis using `party_y`.
7. After processing all windows, pick the global X label as the modal label across annual windows
   only, and likewise for the global Y label (quarterly windows participate in quality tracking but
   not in the modal vote, to avoid over-weighting).

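Steps 3–5 and step 7 can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: a hand-written Pearson function stands in for `pearsonr` so the sketch runs without scipy, the sign flip is modelled as a boolean flag, `classify_axis` and `global_label` are hypothetical names, and window ids containing `-` are assumed to be quarterly (as in `2016-Q3`).

```python
import math
from collections import Counter

THRESHOLD = 0.65  # cut-off used by the priority rules in step 4


def pearson(xs, ys):
    """Plain Pearson r; returns 0.0 when either vector is constant (r undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def classify_axis(positions, ref_lr, ref_pc, coalition, fallback="Stempatroon As 1"):
    """Return (label, flip, quality) for one axis of one window."""
    r_lr = pearson(positions, ref_lr)
    r_pc = pearson(positions, ref_pc)
    r_co = pearson(positions, coalition)
    quality = max(abs(r_lr), abs(r_pc), abs(r_co))
    # Priority order: left-right first, then coalition, then progressive-conservative.
    if abs(r_lr) >= THRESHOLD:
        return "Links–Rechts", r_lr < 0, quality
    if abs(r_co) >= THRESHOLD:
        return "Coalitie–Oppositie", False, quality
    if abs(r_pc) >= THRESHOLD:
        return "Progressief–Conservatief", r_pc < 0, quality
    return fallback, False, quality


def global_label(labels_by_window):
    """Modal label across annual windows only (assumption: quarterly ids contain '-')."""
    annual = [label for wid, label in labels_by_window.items() if "-" not in wid]
    return Counter(annual).most_common(1)[0][0] if annual else None
```

Because left-right is tested first, an axis that clears the threshold on both left-right and coalition membership is still labelled "Links–Rechts"; the quality score, by contrast, always reports the strongest of the three correlations.
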
**Interpretation strings (Dutch):**

| label | interpretation |
|---|---|
| Links–Rechts | "De horizontale as weerspiegelt de klassieke links-rechts tegenstelling." |
| Coalitie–Oppositie | "De horizontale as weerspiegelt stemgedrag van coalitie- versus oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar." |
| Progressief–Conservatief | "De horizontale as weerspiegelt de progressief-conservatieve tegenstelling." |
| Stempatroon As 1 | "De horizontale as weerspiegelt een empirisch stempatroon zonder duidelijke ideologische richting." |

Y-axis interpretations follow the same template with "verticale" instead of "horizontale".

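One way to keep the X and Y variants in a single table is to parameterise the axis word, as in this sketch. The `TEMPLATES` dict and `interpretation` helper are hypothetical names; falling back to the "Stempatroon" sentence for an unknown label is an assumption, as is passing `r` through unchanged.

```python
# Dutch templates from the table above; {axis} is "horizontale" or "verticale".
TEMPLATES = {
    "Links–Rechts": "De {axis} as weerspiegelt de klassieke links-rechts tegenstelling.",
    "Coalitie–Oppositie": (
        "De {axis} as weerspiegelt stemgedrag van coalitie- versus "
        "oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar."
    ),
    "Progressief–Conservatief": (
        "De {axis} as weerspiegelt de progressief-conservatieve tegenstelling."
    ),
    "Stempatroon As 1": (
        "De {axis} as weerspiegelt een empirisch stempatroon zonder "
        "duidelijke ideologische richting."
    ),
}


def interpretation(label: str, r: float, axis: str = "horizontale") -> str:
    """Fill in the template for a label; unknown labels use the fallback sentence."""
    template = TEMPLATES.get(label, TEMPLATES["Stempatroon As 1"])
    # str.format ignores the unused r keyword for templates without an {r} field.
    return template.format(axis=axis, r=r)


print(interpretation("Coalitie–Oppositie", 0.714))
```
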
**Return value:** the input `axes` dict with six new keys added:
`x_label`, `y_label`, `x_quality` (dict), `y_quality` (dict), `x_interpretation` (dict),
`y_interpretation` (dict).

### 3. `explorer.py` changes

**`load_positions()`**: after calling `compute_2d_axes`, call `classify_axes` and store the enriched
axes dict. If `classify_axes` raises for any reason, catch and log; use the original axes dict.

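The catch-and-fall-back behaviour might look like the sketch below. `enrich_axes` is a hypothetical helper, and the classifier is passed in as a parameter only to keep the example self-contained. Handing the classifier a shallow copy is a design choice of this sketch, not something the design specifies: it keeps a classifier that fails midway from leaving half-written keys on the caller's dict.

```python
import logging

logger = logging.getLogger(__name__)


def enrich_axes(positions_by_window, axes, db_path, classify):
    """Call the classifier; on any failure, log and keep the original axes dict."""
    try:
        # Shallow copy: a partial failure cannot modify the caller's dict.
        return classify(positions_by_window, dict(axes), db_path)
    except Exception:
        logger.exception("classify_axes failed; keeping unclassified axes")
        return axes
```

With this shape, the renderer's `axes.get("x_label", "Links–Rechts")` default takes over whenever classification fails.
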
**Compass renderer** (two changes only):
1. Replace hardcoded `"Links–Rechts"` / `"Progressief–Conservatief"` axis title strings with
   `axes.get("x_label", "Links–Rechts")` and `axes.get("y_label", "Progressief–Conservatief")`.
2. Add a caption below the compass for the selected year. Show it when an axis diverges from its
   default label for that window, for example when the X-axis is classified as Coalitie–Oppositie:
   > *"In 2016 weerspiegelt de horizontale as coalitie–oppositie stemgedrag (r=0.71)."*

   Source: `axes["x_interpretation"].get(selected_window_id, "")`.

No other UI changes. The compass layout is untouched.

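A sketch of the two renderer lookups follows. `axis_titles` and `compass_caption` are hypothetical helpers. The caption rule here is an assumption taken from the Approach section (show a caption only when the axis diverges from "Links–Rechts"), and because the enriched dict carries no per-window label, the sketch detects the default case by comparing against the default interpretation sentence.

```python
# Default X-axis sentence; windows with this interpretation need no caption.
DEFAULT_X = "De horizontale as weerspiegelt de klassieke links-rechts tegenstelling."


def axis_titles(axes: dict) -> tuple:
    """Axis titles, falling back to the old hardcoded labels when keys are absent."""
    return (
        axes.get("x_label", "Links–Rechts"),
        axes.get("y_label", "Progressief–Conservatief"),
    )


def compass_caption(axes: dict, window_id: str) -> str:
    """Caption for the selected window, or '' when nothing needs explaining."""
    text = axes.get("x_interpretation", {}).get(window_id, "")
    return "" if text == DEFAULT_X else text
```

Both helpers tolerate an unenriched `axes` dict, so the compass still renders when classification was skipped.
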
## Data Flow

```
load_positions(db_path, window_size)
  → compute_2d_axes(...)            [unchanged; returns positions_by_window, axes]
  → classify_axes(                  [new]
        positions_by_window,
        axes,
        db_path=db_path
    )
      reads:  data/party_ideologies.csv      (module-level cache)
      reads:  data/coalition_membership.csv  (module-level cache)
      uses:   positions_by_window already in memory
      writes: new keys into axes dict (no mutation of positions)
  → return positions_by_window, axes_enriched

compass render (existing function)
  → axes["x_label"]                       [was hardcoded "Links–Rechts"]
  → axes["y_label"]                       [was hardcoded "Progressief–Conservatief"]
  → axes["x_interpretation"][window_id]   [new caption]
```

No DB writes. No new DB queries. Pure in-memory correlation over data that's already loaded.
Both CSV files are small, so reading them costs next to nothing, and they are cached after the first call.

## Error Handling

| Failure | Behaviour |
|---|---|
| `data/party_ideologies.csv` missing | Log WARNING, return `axes` unchanged (current labels preserved) |
| `data/coalition_membership.csv` missing | Log WARNING, coalition dimension skipped; other correlations still computed |
| Party in positions but not in CSV | Skip silently; log once at DEBUG per session |
| Window has fewer than 5 overlapping parties | Skip classification for that window; use fallback label |
| All absolute correlations below 0.65 | Fallback label is assigned; no crash |
| Any unexpected exception in `classify_axes` | Caller (`load_positions`) catches, logs, returns original `axes` dict |

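The missing-file row and the module-level cache could be combined in a loader like this sketch. `load_ideologies` is a hypothetical name, and using `functools.lru_cache` as the cache is an assumption; the design only says the files are cached at module level.

```python
import csv
import logging
from functools import lru_cache
from pathlib import Path

logger = logging.getLogger(__name__)


@lru_cache(maxsize=None)
def load_ideologies(path: str):
    """Return {party: (left_right, progressive)}, or None when the file is missing."""
    file = Path(path)
    if not file.is_file():
        logger.warning("Reference file %s not found; axis labels left unchanged", path)
        return None
    with file.open(newline="", encoding="utf-8") as handle:
        return {
            row["party"]: (float(row["left_right"]), float(row["progressive"]))
            for row in csv.DictReader(handle)
        }
```

One consequence of caching the `None` result is that the warning is logged once per path, and a file created later is not picked up until `load_ideologies.cache_clear()` is called or the process restarts.
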
## Testing Strategy

Three new tests added to `tests/test_political_compass.py`:

**`test_axis_label_left_right`**
Construct synthetic per-party positions where X values correlate strongly (r > 0.8) with the left_right
column of a minimal inline CSV. Assert that `classify_axes` returns `x_label == "Links–Rechts"` and
`x_quality[window] > 0.65`.

**`test_axis_label_coalition_dominant`**
Construct synthetic positions where X values match the coalition membership pattern but NOT left-right.
(E.g., coalition parties [VVD, PvdA] cluster at x=+1 and opposition parties [PVV, SP, CDA] at x=−1,
which is historically coherent for 2016. The fixture needs at least five parties, because windows with
fewer than 5 overlapping parties are skipped.) Assert `x_label == "Coalitie–Oppositie"` and that the
interpretation string contains "coalitie".

**`test_axis_classifier_missing_csv`**
Call `classify_axes` with a db_path pointing to a nonexistent directory so CSV loading fails. Assert
that the function returns the axes dict unchanged and does not raise.

The first two tests use monkeypatching to inject CSV content as in-memory StringIO, following the
existing pattern in `tests/test_political_compass.py` of patching module-level imports. The third
relies on the nonexistent path instead, so that it exercises the missing-file branch.

## Open Questions

None.