| date | topic | status |
|---|---|---|
| 2026-03-29 | Honest PCA Axis Classification | validated |
Axis Classification Design
Problem Statement
The political compass always labels its X-axis "Links–Rechts" and Y-axis "Progressief–Conservatief" regardless of what the PCA actually found. In coalition years, the first principal component captures coalition membership, not ideology. The dominant axis of voting variation in Rutte II (VVD+PvdA) and Rutte III/IV (VVD+CDA+D66+CU) is "are you in the governing coalition?" PvdA and PVV end up at the same position because both were in opposition — technically correct voting similarity, but the label "Links–Rechts" is a lie.
The fix: after each PCA, validate what the axes actually capture by correlating party positions against a small reference dataset of known ideological scores. Assign labels honestly.
Constraints
- No changes to the PCA computation itself (compute_2d_axes is unchanged)
- No new runtime dependencies (scipy is already optional; pandas is already present)
- party_ideologies.csv and coalition_membership.csv are static data files, not derived from the DB
- Backward-compatible: the compass still renders even when reference files are missing (falls back to current hardcoded labels silently)
Approach
Reference-validated PCA with dynamic labeling. For each time window, correlate the per-party PCA positions against known ideological scores. Assign a label based on which correlation is strongest. Surface the finding as a one-line caption in the UI when the axis diverges from "Links–Rechts".
Rejected alternatives:
- Fixed anchor compass: replaces honest complexity with comfortable fiction; loses behavioral information entirely
- Dual view (behavioral + ideological): too much UI complexity for V1; can be done later
Architecture Overview
A thin axis classification layer sits between compute_2d_axes (unchanged) and the compass UI.
compute_2d_axes()
↓
positions_by_window + axes dict
↓
classify_axes(positions_by_window, axes, db_path)
↓
axes dict enriched with:
- x_label, y_label (global, most-common label across annual windows)
- x_quality (dict: window_id → float, max |r|)
- y_quality (dict: window_id → float, max |r|)
- x_interpretation (dict: window_id → Dutch str)
- y_interpretation (dict: window_id → Dutch str)
↓
compass renderer uses labels + per-year quality captions
Components
1. Reference data files
data/party_ideologies.csv
One row per party. Party names must match entity IDs in the svd_vectors table exactly.
party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
GroenLinks,-0.70,0.90
GL,-0.70,0.90
GroenLinks-PvdA,-0.70,0.82
ChristenUnie,0.10,-0.55
SGP,0.35,-0.95
PVV,0.90,-0.50
DENK,-0.40,0.55
50Plus,-0.05,-0.10
FVD,0.90,-0.75
PvdD,-0.60,0.85
Volt,-0.20,0.80
JA21,0.70,-0.30
BBB,0.50,-0.35
NSC,0.20,-0.20
Nieuw Sociaal Contract,0.20,-0.20
BVNL,0.85,-0.55
Bij1,-0.90,0.90
Scores: left_right = −1 (far left) to +1 (far right). progressive = −1 (conservative) to +1 (progressive). These are expert judgments based on party programs and voting records, not derived algorithmically.
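A minimal sketch of how this file could be parsed and range-checked, using only the standard library. The helper name parse_party_ideologies and the range check are illustrative assumptions, not part of the design above.

```python
import csv
import io

def parse_party_ideologies(text):
    """Map party name -> (left_right, progressive). Hypothetical helper."""
    scores = {}
    for row in csv.DictReader(io.StringIO(text)):
        lr = float(row["left_right"])
        pc = float(row["progressive"])
        # Both scores are documented as lying in [-1, +1].
        if not (-1.0 <= lr <= 1.0 and -1.0 <= pc <= 1.0):
            raise ValueError(f"score out of range for {row['party']}")
        scores[row["party"]] = (lr, pc)
    return scores

# Two rows copied from the table above.
sample = "party,left_right,progressive\nVVD,0.65,0.10\nSGP,0.35,-0.95\n"
ideologies = parse_party_ideologies(sample)
```

Alias rows such as GroenLinks / GL simply become two dictionary keys with the same scores, so a lookup works whichever name the svd_vectors table uses.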
data/coalition_membership.csv
One row per (window_id, party) where that party held a government seat. Annual windows only; quarterly windows inherit from their year.
window_id,party
2012,VVD
2012,PvdA
2013,VVD
2013,PvdA
2014,VVD
2014,PvdA
2015,VVD
2015,PvdA
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
2018,VVD
2018,CDA
2018,D66
2018,ChristenUnie
2019,VVD
2019,CDA
2019,D66
2019,ChristenUnie
2020,VVD
2020,CDA
2020,D66
2020,ChristenUnie
2021,VVD
2021,CDA
2021,D66
2021,ChristenUnie
2022,VVD
2022,D66
2022,CDA
2022,ChristenUnie
2023,VVD
2023,D66
2023,CDA
2023,ChristenUnie
2024,PVV
2024,VVD
2024,NSC
2024,BBB
2025,PVV
2025,VVD
2025,NSC
2025,BBB
2026,PVV
2026,VVD
2026,NSC
2026,BBB
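The year inheritance for quarterly windows can be sketched as follows. The function names are hypothetical, and returning None for a window whose year has no rows in the file is an assumption made here so the caller can skip the coalition dimension.

```python
import csv
import io

def parse_coalition(text):
    """Map window_id (a year string) -> set of governing parties."""
    members = {}
    for row in csv.DictReader(io.StringIO(text)):
        members.setdefault(row["window_id"], set()).add(row["party"])
    return members

def window_year(window_id):
    """'2016-Q3' -> '2016'; an annual id such as '2016' is returned as-is."""
    return window_id.split("-", 1)[0]

def coalition_dummy(window_id, parties, members):
    """+1.0 for governing parties, -1.0 otherwise; None if the year is unknown."""
    year = window_year(window_id)
    if year not in members:
        return None  # e.g. 'current_parliament' has no year to look up
    return [1.0 if party in members[year] else -1.0 for party in parties]

sample = "window_id,party\n2016,VVD\n2016,PvdA\n"
members = parse_coalition(sample)
dummy = coalition_dummy("2016-Q3", ["VVD", "PvdA", "SP", "PVV"], members)
```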
2. analysis/axis_classifier.py (new module)
Single public function: classify_axes(positions_by_window, axes, db_path).
The function is pure except for reading two CSV files (cached at module level after the first load).
CSV paths are derived from db_path: Path(db_path).parent / "party_ideologies.csv" and
Path(db_path).parent / "coalition_membership.csv". Both files live in the same data/ directory
as the database.
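One possible shape for the path derivation and the module-level cache is sketched below. The names reference_paths, load_rows and _CSV_CACHE are hypothetical; not caching a missing file (so that a file added later is picked up) is a choice made for this sketch, not something the design specifies.

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Module-level cache: path string -> list of row dicts.
_CSV_CACHE = {}

def reference_paths(db_path):
    """Both reference files sit next to the database file."""
    data_dir = Path(db_path).parent
    return data_dir / "party_ideologies.csv", data_dir / "coalition_membership.csv"

def load_rows(path):
    """Read a CSV once and reuse the parsed rows; None if the file is missing."""
    key = str(path)
    if key not in _CSV_CACHE:
        try:
            with open(path, newline="", encoding="utf-8") as fh:
                _CSV_CACHE[key] = list(csv.DictReader(fh))
        except FileNotFoundError:
            logger.warning("reference file missing: %s", path)
            return None  # not cached, so a later call can retry
    return _CSV_CACHE[key]
```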
Algorithm per window:
1. Collect parties that appear in both positions_by_window[window_id] and party_ideologies.csv. Skip windows with fewer than 5 overlapping parties.
2. Build vectors:
   - party_x: per-party X positions from this window
   - party_y: per-party Y positions from this window
   - ref_lr: left_right scores from CSV
   - ref_pc: progressive scores from CSV
   - coalition_dummy: +1 if party is in government for this window's year, −1 otherwise (quarterly windows: strip suffix to get year, e.g., 2016-Q3 → 2016)
3. Compute Pearson r for X against each reference dimension:
   - r_lr_x = pearsonr(party_x, ref_lr)[0]
   - r_pc_x = pearsonr(party_x, ref_pc)[0]
   - r_co_x = pearsonr(party_x, coalition_dummy)[0]
4. Assign label and interpretation using priority order (first threshold that fires wins):
   - |r_lr_x| ≥ 0.65 → label = "Links–Rechts", flip sign if r < 0
   - |r_co_x| ≥ 0.65 → label = "Coalitie–Oppositie"
   - |r_pc_x| ≥ 0.65 → label = "Progressief–Conservatief", flip sign if r < 0
   - fallback → label = "Stempatroon As 1"
5. Quality score for this window's X-axis: max(|r_lr_x|, |r_pc_x|, |r_co_x|)
6. Repeat steps 3–5 for Y-axis using party_y.
7. After processing all windows, pick global X label = modal label across annual windows only (quarterly windows participate in quality tracking but not in the modal vote, to avoid over-weighting). The current_parliament window is excluded from modal voting entirely and from the coalition dimension (no year to look up); it still gets x_quality and x_interpretation based on the left_right and progressive correlations.
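The per-window correlation, priority order, and quality score can be sketched for a single axis as follows. To keep the example dependency-free, pearson_r is a hand-written stand-in for scipy's pearsonr; classify_axis, the returned flip flag, and treating a constant vector as r = 0 are assumptions of this sketch.

```python
import math

THRESHOLD = 0.65

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient (stand-in for pearsonr(...)[0])."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def classify_axis(positions, ref_lr, ref_pc, coalition=None,
                  fallback="Stempatroon As 1"):
    """Return (label, flip, quality) for one axis in one window."""
    r_lr = pearson_r(positions, ref_lr)
    r_pc = pearson_r(positions, ref_pc)
    # The coalition dimension is skipped when no dummy vector is available.
    r_co = pearson_r(positions, coalition) if coalition is not None else 0.0
    quality = max(abs(r_lr), abs(r_pc), abs(r_co))
    # Priority order: the first threshold that fires wins.
    if abs(r_lr) >= THRESHOLD:
        return "Links–Rechts", r_lr < 0, quality
    if abs(r_co) >= THRESHOLD:
        return "Coalitie–Oppositie", False, quality
    if abs(r_pc) >= THRESHOLD:
        return "Progressief–Conservatief", r_pc < 0, quality
    return fallback, False, quality

# Five parties (VVD, PvdA, SP, CDA, PVV) with reference scores from the CSV
# and the 2016 coalition (VVD, PvdA); the X positions are made up.
ref_lr = [0.65, -0.70, -0.90, 0.25, 0.90]
ref_pc = [0.10, 0.75, 0.50, -0.45, -0.50]
coalition = [1.0, 1.0, -1.0, -1.0, -1.0]
party_x = [1.0, 0.9, -1.0, -0.8, -1.1]
label, flip, quality = classify_axis(party_x, ref_lr, ref_pc, coalition)
```

With these made-up positions the coalition correlation dominates, so the label comes out as "Coalitie–Oppositie" even though the left-right check runs first.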
Interpretation strings (Dutch):
| label | interpretation |
|---|---|
| Links–Rechts | "De horizontale as weerspiegelt de klassieke links-rechts tegenstelling." |
| Coalitie–Oppositie | "De horizontale as weerspiegelt stemgedrag van coalitie- versus oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar." |
| Progressief–Conservatief | "De horizontale as weerspiegelt de progressief-conservatieve tegenstelling." |
| Stempatroon As 1 | "De horizontale as weerspiegelt een empirisch stempatroon zonder duidelijke ideologische richting." |
Y-axis interpretations follow the same template with "verticale" instead of "horizontale".
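A small sketch of how the templates could be shared between both axes. The TEMPLATES dict and the interpretation helper are illustrative names; the strings are the ones from the table, with the axis word turned into a placeholder.

```python
# Dutch UI strings from the table above; {axis} is "horizontale" or "verticale".
TEMPLATES = {
    "Links–Rechts":
        "De {axis} as weerspiegelt de klassieke links-rechts tegenstelling.",
    "Coalitie–Oppositie":
        "De {axis} as weerspiegelt stemgedrag van coalitie- versus "
        "oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar.",
    "Progressief–Conservatief":
        "De {axis} as weerspiegelt de progressief-conservatieve tegenstelling.",
    "Stempatroon As 1":
        "De {axis} as weerspiegelt een empirisch stempatroon zonder "
        "duidelijke ideologische richting.",
}

AXIS_WORD = {"x": "horizontale", "y": "verticale"}

def interpretation(label, axis, r):
    """Fill in the axis word and, where the template uses it, the correlation."""
    # str.format ignores the unused r keyword for templates without {r}.
    return TEMPLATES[label].format(axis=AXIS_WORD[axis], r=r)

text = interpretation("Coalitie–Oppositie", "y", 0.714)
```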
Return value: the input axes dict with six new keys added:
x_label, y_label, x_quality (dict), y_quality (dict), x_interpretation (dict),
y_interpretation (dict).
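The modal vote and the assembly of the returned dict might look like the sketch below. It assumes annual window ids are four-digit year strings; is_annual, modal_label and enrich_axes are hypothetical helper names.

```python
from collections import Counter

def is_annual(window_id):
    # Assumption: annual window ids are four-digit year strings such as "2016".
    return len(window_id) == 4 and window_id.isdigit()

def modal_label(labels_by_window, default):
    """Most common label across annual windows; quarterly and
    current_parliament windows do not vote."""
    votes = Counter(
        label for window_id, label in labels_by_window.items()
        if is_annual(window_id)
    )
    return votes.most_common(1)[0][0] if votes else default

def enrich_axes(axes, x_labels, y_labels, x_quality, y_quality,
                x_interpretation, y_interpretation):
    """Add the six keys to the input axes dict and return it."""
    axes.update(
        x_label=modal_label(x_labels, "Links–Rechts"),
        y_label=modal_label(y_labels, "Progressief–Conservatief"),
        x_quality=x_quality,
        y_quality=y_quality,
        x_interpretation=x_interpretation,
        y_interpretation=y_interpretation,
    )
    return axes

x_labels = {
    "2011": "Links–Rechts",
    "2015": "Coalitie–Oppositie",
    "2016": "Coalitie–Oppositie",
    "2016-Q3": "Links–Rechts",
    "2016-Q4": "Links–Rechts",
    "current_parliament": "Links–Rechts",
}
global_x = modal_label(x_labels, "Links–Rechts")
```

Here three non-annual windows say "Links–Rechts", but only the annual windows vote, so the global label is "Coalitie–Oppositie" (two votes against one).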
3. explorer.py changes
load_positions() — after calling compute_2d_axes, call classify_axes and store the enriched
axes dict. If classify_axes raises for any reason, catch and log; use the original axes dict.
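The catch-and-log behaviour could be wrapped roughly like this. classify_with_fallback is a hypothetical helper that takes the classifier as an argument so the sketch stays self-contained; in explorer.py the call would presumably sit inline in load_positions.

```python
import logging

logger = logging.getLogger(__name__)

def classify_with_fallback(classify, positions_by_window, axes, db_path):
    """Return the enriched axes dict, or the original one if classification fails."""
    try:
        return classify(positions_by_window, axes, db_path)
    except Exception:
        # Log the traceback and keep rendering with the hardcoded labels.
        logger.exception("axis classification failed; using original axes")
        return axes
```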
Compass renderer — two changes only:
- Replace hardcoded "Links–Rechts" / "Progressief–Conservatief" axis title strings with axes.get("x_label", "Links–Rechts") and axes.get("y_label", "Progressief–Conservatief").
- Add a caption below the compass for the selected year. Show when either axis quality < 0.65:
  "In 2016 weerspiegelt de horizontale as coalitie–oppositie stemgedrag (r=0.71)."
  Source: axes["x_interpretation"].get(selected_window_id, "").
No other UI changes. The compass layout is untouched.
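A sketch of the two renderer lookups, written so that an un-enriched axes dict still yields the old titles and no caption. The helper names and the handling of a window with no quality entry (no caption) are assumptions.

```python
def axis_titles(axes):
    """Axis titles, falling back to the previously hardcoded strings."""
    return (
        axes.get("x_label", "Links–Rechts"),
        axes.get("y_label", "Progressief–Conservatief"),
    )

def caption_for(axes, window_id, threshold=0.65):
    """Caption text for the selected window, or '' when none should be shown."""
    x_q = axes.get("x_quality", {}).get(window_id)
    y_q = axes.get("y_quality", {}).get(window_id)
    qualities = [q for q in (x_q, y_q) if q is not None]
    # Show the caption only when either axis quality is below the threshold.
    if not any(q < threshold for q in qualities):
        return ""
    return axes.get("x_interpretation", {}).get(window_id, "")
```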
Data Flow
load_positions(db_path, window_size)
→ compute_2d_axes(...) [unchanged; returns positions_by_window, axes]
→ classify_axes( [new]
positions_by_window,
axes,
db_path=db_path
)
reads: data/party_ideologies.csv (module-level cache)
reads: data/coalition_membership.csv (module-level cache)
uses: positions_by_window already in memory
writes: new keys into axes dict (no mutation of positions)
→ return positions_by_window, axes_enriched
compass render (existing function)
→ axes["x_label"] [was hardcoded "Links–Rechts"]
→ axes["y_label"] [was hardcoded "Progressief–Conservatief"]
→ axes["x_interpretation"][window_id] [new caption]
No DB writes. No new DB queries. Pure in-memory correlation over data that's already loaded. The CSV files are small (well under 100 rows each), so reads are cheap, and they are cached after the first call.
Error Handling
| Failure | Behaviour |
|---|---|
| data/party_ideologies.csv missing | Log WARNING, return axes unchanged (current labels preserved) |
| data/coalition_membership.csv missing | Log WARNING, coalition dimension skipped; other correlations still computed |
| Party in positions but not in CSV | Skip silently; log once at DEBUG per session |
| Window has fewer than 5 overlapping parties | Skip classification for that window; use fallback label |
| All correlations below 0.65 in absolute value | Fallback label is always safe; no crash |
| Any unexpected exception in classify_axes | Caller (load_positions) catches, logs, returns original axes dict |
Testing Strategy
Three new tests added to tests/test_political_compass.py:
test_axis_label_left_right
Construct synthetic per-party positions where X values correlate strongly (r > 0.8) with the left_right
column of a minimal inline CSV. Assert that classify_axes returns x_label == "Links–Rechts" and
x_quality[window] > 0.65.
test_axis_label_coalition_dominant
Construct synthetic positions where X values match the coalition membership pattern but NOT left-right.
(E.g., coalition parties [VVD, PvdA] cluster at x=+1 and opposition parties [PVV, SP] at x=−1, which is
historically coherent for 2016; the fixture needs at least five parties in total to clear the
minimum-overlap check.) Assert x_label == "Coalitie–Oppositie" and that the interpretation
string contains "coalitie".
test_axis_classifier_missing_csv
Call classify_axes with a db_path pointing to a nonexistent directory so CSV loading fails. Assert
that the function returns the axes dict unchanged and does not raise.
All three tests use monkeypatching to inject CSV content as in-memory StringIO, following the existing
pattern in tests/test_political_compass.py of patching module-level imports.
Deployment
The CSV files (data/party_ideologies.csv and data/coalition_membership.csv) are static reference
data committed to git. They are baked into the Docker image at build time alongside the application
code. No rsync or volume mount is needed.
The .gitignore excludes data/*.db, data/*.bak, data/*.json but not data/*.csv, so they can
be tracked without change to the ignore rules. The data volume mount (DATA_DIR:/home/app/app/data)
only contains the database file and does not overwrite the baked-in CSVs.
When party compositions change (e.g., a new party enters parliament), update the CSV, commit, and redeploy. Typical frequency: once per parliament formation (~4 years).
Open Questions
None.