date	topic	status
2026-03-29	Honest PCA Axis Classification	validated

Axis Classification Design

Problem Statement

The political compass always labels its X-axis "Links–Rechts" and its Y-axis "Progressief–Conservatief", regardless of what the PCA actually found. In coalition years, the first principal component captures coalition membership, not ideology: the dominant axis of voting variation in Rutte II (VVD+PvdA) and Rutte III/IV (VVD+CDA+D66+CU) is "are you in the governing coalition?" Under Rutte III/IV, PvdA and PVV end up at the same position because both were in opposition. That is a technically correct reflection of voting similarity, but labeling the axis "Links–Rechts" is a lie.

The fix: after each PCA, validate what the axes actually capture by correlating party positions against a small reference dataset of known ideological scores. Assign labels honestly.

Constraints

No changes to the PCA computation itself (compute_2d_axes is unchanged)
No new runtime dependencies (scipy is already optional; pandas is already present)
party_ideologies.csv and coalition_membership.csv are static data files — not derived from the DB
Backward-compatible: the compass still renders even when reference files are missing (it falls back to the current hardcoded labels, and the user sees no error)

Approach

Reference-validated PCA with dynamic labeling. For each time window, correlate the per-party PCA positions against known ideological scores. Assign a label based on which correlation is strongest. Surface the finding as a one-line caption in the UI when the axis diverges from "Links–Rechts".

Rejected alternatives:

Fixed anchor compass: replaces honest complexity with comfortable fiction; loses behavioral information entirely
Dual view (behavioral + ideological): too much UI complexity for V1; can be done later

Architecture Overview

A thin axis classification layer sits between compute_2d_axes (unchanged) and the compass UI.

compute_2d_axes()
       ↓
 positions_by_window  +  axes dict
       ↓
classify_axes(positions_by_window, axes, db_path)
       ↓
 axes dict enriched with:
   - x_label, y_label      (global, most-common label across annual windows)
   - x_quality             (dict: window_id → float, max |r|)
   - y_quality             (dict: window_id → float, max |r|)
   - x_interpretation      (dict: window_id → Dutch str)
   - y_interpretation      (dict: window_id → Dutch str)
       ↓
 compass renderer uses labels + per-year quality captions

Components

1. Reference data files

data/party_ideologies.csv

One row per party. Party names must match entity IDs in the svd_vectors table exactly.

party,left_right,progressive
VVD,0.65,0.10
PvdA,-0.70,0.75
SP,-0.90,0.50
CDA,0.25,-0.45
D66,-0.10,0.85
GroenLinks,-0.70,0.90
GL,-0.70,0.90
GroenLinks-PvdA,-0.70,0.82
ChristenUnie,0.10,-0.55
SGP,0.35,-0.95
PVV,0.90,-0.50
DENK,-0.40,0.55
50Plus,-0.05,-0.10
FVD,0.90,-0.75
PvdD,-0.60,0.85
Volt,-0.20,0.80
JA21,0.70,-0.30
BBB,0.50,-0.35
NSC,0.20,-0.20
Nieuw Sociaal Contract,0.20,-0.20
BVNL,0.85,-0.55
Bij1,-0.90,0.90

Scores: left_right = −1 (far left) to +1 (far right). progressive = −1 (conservative) to +1 (progressive). These are expert judgments based on party programs and voting records, not derived algorithmically.
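A minimal sketch of how this file could be parsed and range-checked, using only the standard library `csv` module (the design allows pandas as well). The function name `parse_party_ideologies` and the three-row sample are illustrative assumptions, not part of the design.

```python
import csv
import io


def parse_party_ideologies(text: str) -> dict[str, tuple[float, float]]:
    """Parse party_ideologies.csv content into {party: (left_right, progressive)}."""
    scores: dict[str, tuple[float, float]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        left_right = float(row["left_right"])
        progressive = float(row["progressive"])
        # Both scales are documented as running from -1 to +1.
        if not (-1.0 <= left_right <= 1.0 and -1.0 <= progressive <= 1.0):
            raise ValueError(f"score out of range for {row['party']!r}")
        scores[row["party"]] = (left_right, progressive)
    return scores


# Hypothetical excerpt of the file, including an alias row for the same party.
SAMPLE = """party,left_right,progressive
VVD,0.65,0.10
GroenLinks,-0.70,0.90
GL,-0.70,0.90
"""

ideologies = parse_party_ideologies(SAMPLE)
```

Alias rows such as GL and GroenLinks simply become separate keys with the same scores, which matches the requirement that names equal the entity IDs exactly.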

data/coalition_membership.csv

One row per (window_id, party) where that party held a government seat. Annual windows only; quarterly windows inherit from their year.

window_id,party
2012,VVD
2012,PvdA
2013,VVD
2013,PvdA
2014,VVD
2014,PvdA
2015,VVD
2015,PvdA
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
2017,D66
2017,ChristenUnie
2018,VVD
2018,CDA
2018,D66
2018,ChristenUnie
2019,VVD
2019,CDA
2019,D66
2019,ChristenUnie
2020,VVD
2020,CDA
2020,D66
2020,ChristenUnie
2021,VVD
2021,CDA
2021,D66
2021,ChristenUnie
2022,VVD
2022,D66
2022,CDA
2022,ChristenUnie
2023,VVD
2023,D66
2023,CDA
2023,ChristenUnie
2024,PVV
2024,VVD
2024,NSC
2024,BBB
2025,PVV
2025,VVD
2025,NSC
2025,BBB
2026,PVV
2026,VVD
2026,NSC
2026,BBB
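The lookup this file supports can be sketched as follows. The helper names `parse_coalition_membership` and `window_year` are assumptions made for illustration, and the quarterly ID format is inferred from the single example `2016-Q3` given in this document.

```python
import csv
import io
import re
from typing import Optional


def parse_coalition_membership(text: str) -> dict[str, set[str]]:
    """Parse coalition_membership.csv content into {year: {party, ...}}."""
    members: dict[str, set[str]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        members.setdefault(row["window_id"], set()).add(row["party"])
    return members


def window_year(window_id: str) -> Optional[str]:
    """Return the year an annual or quarterly window belongs to, else None."""
    # Assumed formats: "2016" (annual) or "2016-Q3" (quarterly).
    match = re.fullmatch(r"(\d{4})(?:-Q[1-4])?", window_id)
    return match.group(1) if match else None


# Hypothetical excerpt of the file.
SAMPLE = """window_id,party
2016,VVD
2016,PvdA
2017,VVD
2017,CDA
"""

coalitions = parse_coalition_membership(SAMPLE)
in_government = coalitions.get(window_year("2016-Q3") or "", set())
```

A window without a year, such as `current_parliament`, yields `None`, so the caller can skip the coalition dimension for it.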

2. `analysis/axis_classifier.py` (new module)

Single public function: classify_axes(positions_by_window, axes, db_path).

The function is pure except for reading two CSV files, which are cached at module level after the first load.

CSV paths are derived from db_path: Path(db_path).parent / "party_ideologies.csv" and Path(db_path).parent / "coalition_membership.csv". Both files live in the same data/ directory as the database.

Algorithm per window:

1. Collect parties that appear in both positions_by_window[window_id] and party_ideologies.csv. Skip windows with fewer than 5 overlapping parties.
2. Build vectors:
   - party_x: per-party X positions from this window
   - party_y: per-party Y positions from this window
   - ref_lr: left_right scores from CSV
   - ref_pc: progressive scores from CSV
   - coalition_dummy: +1 if party is in government for this window's year, −1 otherwise (quarterly windows: strip suffix to get year, e.g., 2016-Q3 → 2016)
3. Compute Pearson r for X against each reference dimension:
   - r_lr_x = pearsonr(party_x, ref_lr)[0]
   - r_pc_x = pearsonr(party_x, ref_pc)[0]
   - r_co_x = pearsonr(party_x, coalition_dummy)[0]
4. Assign label and interpretation using priority order (first threshold that fires wins):
   - |r_lr_x| ≥ 0.65 → label = "Links–Rechts", flip sign if r < 0
   - |r_co_x| ≥ 0.65 → label = "Coalitie–Oppositie"
   - |r_pc_x| ≥ 0.65 → label = "Progressief–Conservatief", flip sign if r < 0
   - fallback → label = "Stempatroon As 1"
5. Quality score for this window's X-axis: max(|r_lr_x|, |r_pc_x|, |r_co_x|)
6. Repeat steps 3–5 for Y-axis using party_y.
7. After processing all windows, pick global X label = modal label across annual windows only (quarterly windows participate in quality tracking but not in the modal vote, to avoid over-weighting). The current_parliament window is excluded from modal voting entirely and from the coalition dimension (no year to look up); it still gets x_quality and x_interpretation based on the left_right and progressive correlations.
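The correlation, priority, and quality steps for a single axis can be sketched as below. The design names `pearsonr` (scipy); the `pearson_r` function here is a dependency-free stand-in, and `classify_axis` is a hypothetical helper name. Treating a constant vector as uncorrelated is an assumption, since the design does not cover that case.

```python
import math

THRESHOLD = 0.65


def pearson_r(xs, ys):
    """Plain Pearson correlation; stand-in for scipy.stats.pearsonr(...)[0]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def classify_axis(positions, ref_lr, ref_pc, coalition_dummy=None,
                  fallback="Stempatroon As 1"):
    """Return (label, flip, quality) for one axis in one window."""
    r_lr = pearson_r(positions, ref_lr)
    r_pc = pearson_r(positions, ref_pc)
    # No coalition vector (e.g. current_parliament): skip that dimension.
    r_co = pearson_r(positions, coalition_dummy) if coalition_dummy else 0.0
    quality = max(abs(r_lr), abs(r_pc), abs(r_co))
    # Priority order: the first threshold that fires wins.
    if abs(r_lr) >= THRESHOLD:
        return "Links–Rechts", r_lr < 0, quality
    if abs(r_co) >= THRESHOLD:
        return "Coalitie–Oppositie", False, quality
    if abs(r_pc) >= THRESHOLD:
        return "Progressief–Conservatief", r_pc < 0, quality
    return fallback, False, quality


# Synthetic 2016-style window: VVD, PvdA (coalition); PVV, SP, CDA (opposition).
ref_lr = [0.65, -0.70, 0.90, -0.90, 0.25]
ref_pc = [0.10, 0.75, -0.50, 0.50, -0.45]
dummy = [1, 1, -1, -1, -1]
label, flip, quality = classify_axis([1, 1, -1, -1, -1], ref_lr, ref_pc, dummy)
```

With these synthetic positions the left-right correlation is weak, so the coalition threshold is the first to fire. Feeding in positions equal to the negated left-right scores would instead return "Links–Rechts" with the flip flag set.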

Interpretation strings (Dutch):

label	interpretation
Links–Rechts	"De horizontale as weerspiegelt de klassieke links-rechts tegenstelling."
Coalitie–Oppositie	"De horizontale as weerspiegelt stemgedrag van coalitie- versus oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar."
Progressief–Conservatief	"De horizontale as weerspiegelt de progressief-conservatieve tegenstelling."
Stempatroon As 1	"De horizontale as weerspiegelt een empirisch stempatroon zonder duidelijke ideologische richting."

Y-axis interpretations follow the same template with "verticale" instead of "horizontale".
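The table and the horizontal/vertical substitution could be encoded as templates with two placeholders. This is a sketch: the names `TEMPLATES`, `FALLBACK`, and `interpretation` are assumptions, and whether `r` is passed signed or as an absolute value is not specified in this document.

```python
# Interpretation templates keyed by label; {axis} is "horizontale" or "verticale".
TEMPLATES = {
    "Links–Rechts": "De {axis} as weerspiegelt de klassieke links-rechts tegenstelling.",
    "Coalitie–Oppositie": (
        "De {axis} as weerspiegelt stemgedrag van coalitie- versus "
        "oppositiepartijen (r={r:.2f}). Links-rechts is minder dominant dit jaar."
    ),
    "Progressief–Conservatief": (
        "De {axis} as weerspiegelt de progressief-conservatieve tegenstelling."
    ),
}
# Used for the fallback label, whatever its exact name is.
FALLBACK = (
    "De {axis} as weerspiegelt een empirisch stempatroon "
    "zonder duidelijke ideologische richting."
)


def interpretation(label: str, axis: str, r: float) -> str:
    """Render the Dutch caption for axis 'x' or 'y'."""
    axis_word = {"x": "horizontale", "y": "verticale"}[axis]
    # str.format ignores keyword arguments a template does not use.
    return TEMPLATES.get(label, FALLBACK).format(axis=axis_word, r=r)


caption = interpretation("Coalitie–Oppositie", "x", 0.714)
```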

Return value: the input axes dict with six new keys added: x_label, y_label, x_quality (dict), y_quality (dict), x_interpretation (dict), y_interpretation (dict).
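The global x_label and y_label are modal labels over annual windows only. A minimal sketch of that vote, assuming annual window IDs are exactly four digits; `modal_label` is a hypothetical helper, and resolving ties in favour of the label seen first is an assumption the design does not state.

```python
import re
from collections import Counter


def modal_label(labels_by_window: dict[str, str], default: str) -> str:
    """Most common label across annual windows; quarterly and other IDs are ignored."""
    annual = [
        label
        for window_id, label in labels_by_window.items()
        if re.fullmatch(r"\d{4}", window_id)  # assumed annual ID format
    ]
    if not annual:
        return default
    # most_common orders equal counts by first occurrence.
    return Counter(annual).most_common(1)[0][0]


per_window = {
    "2015": "Coalitie–Oppositie",
    "2016": "Coalitie–Oppositie",
    "2016-Q3": "Links–Rechts",
    "2024": "Links–Rechts",
    "current_parliament": "Links–Rechts",
}
x_label = modal_label(per_window, default="Links–Rechts")
```

Here the quarterly and current_parliament entries do not vote, so two annual "Coalitie–Oppositie" windows outweigh one annual "Links–Rechts" window.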

3. `explorer.py` changes

load_positions() — after calling compute_2d_axes, call classify_axes and store the enriched axes dict. If classify_axes raises for any reason, catch and log; use the original axes dict.
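The catch-and-fall-back behaviour might be wrapped like this. The wrapper name `enrich_axes_safely` is hypothetical, and the classifier is passed in as a parameter only to keep the sketch self-contained.

```python
import logging

logger = logging.getLogger(__name__)


def enrich_axes_safely(classify, positions_by_window, axes, db_path):
    """Call the classifier; on any failure, log and keep the original axes."""
    try:
        # Hand over a shallow copy so a failure midway cannot leave
        # half-written keys in the caller's dict.
        return classify(positions_by_window, dict(axes), db_path)
    except Exception:
        logger.exception("classify_axes failed; keeping original axis labels")
        return axes


def failing_classifier(positions_by_window, axes, db_path):
    axes["x_label"] = "partial"  # simulate a partial write before the crash
    raise RuntimeError("simulated failure")


original = {"x_axis": "pc1"}
result = enrich_axes_safely(failing_classifier, {}, original, "data/example.db")
```

Passing a copy is a design choice of this sketch, not something the design requires. It guarantees that "use the original axes dict" really means an unmodified dict.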

Compass renderer — two changes only:

Replace hardcoded "Links–Rechts" / "Progressief–Conservatief" axis title strings with axes.get("x_label", "Links–Rechts") and axes.get("y_label", "Progressief–Conservatief").
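Add a caption below the compass for the selected year. Show it when an axis diverges from its default label in the selected window, as in the example below, or when either axis quality is < 0.65: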

"In 2016 weerspiegelt de horizontale as coalitie–oppositie stemgedrag (r=0.71)."

Source: axes["x_interpretation"].get(selected_window_id, "").

No other UI changes. The compass layout is untouched.
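On the renderer side, the lookups with their backward-compatible defaults could be collected in one place. `axis_titles_and_caption` is a hypothetical helper name; only the `axes.get(...)` expressions come from this document.

```python
def axis_titles_and_caption(axes: dict, selected_window_id: str) -> tuple[str, str, str]:
    """Return (x_title, y_title, caption), falling back to the old hardcoded labels."""
    x_title = axes.get("x_label", "Links–Rechts")
    y_title = axes.get("y_label", "Progressief–Conservatief")
    # An un-enriched axes dict has no interpretations, so the caption is empty.
    caption = axes.get("x_interpretation", {}).get(selected_window_id, "")
    return x_title, y_title, caption


# Un-enriched dict: the compass renders exactly as before.
legacy = axis_titles_and_caption({}, "2016")

# Enriched dict (values here are illustrative only).
enriched = axis_titles_and_caption(
    {
        "x_label": "Coalitie–Oppositie",
        "y_label": "Links–Rechts",
        "x_interpretation": {"2016": "voorbeeldtekst"},
    },
    "2016",
)
```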

Data Flow

load_positions(db_path, window_size)
  → compute_2d_axes(...)          [unchanged; returns positions_by_window, axes]
  → classify_axes(               [new]
       positions_by_window,
       axes,
       db_path=db_path
    )
       reads: data/party_ideologies.csv   (module-level cache)
       reads: data/coalition_membership.csv  (module-level cache)
       uses: positions_by_window already in memory
       writes: new keys into axes dict (no mutation of positions)
  → return positions_by_window, axes_enriched

compass render (existing function)
  → axes["x_label"]                       [was hardcoded "Links–Rechts"]
  → axes["y_label"]                       [was hardcoded "Progressief–Conservatief"]
  → axes["x_interpretation"][window_id]   [new caption]

No DB writes. No new DB queries. Pure in-memory correlation over data that is already loaded. The two CSVs are small, so reading them is cheap, and they are cached after the first call.

Error Handling

Failure	Behaviour
`data/party_ideologies.csv` missing	Log WARNING, return `axes` unchanged (current labels preserved)
`data/coalition_membership.csv` missing	Log WARNING, coalition dimension skipped; other correlations still computed
Party in positions but not in CSV	Skip silently; log once at DEBUG per session
Window has fewer than 5 overlapping parties	Skip classification for that window; use fallback label
All |r| < 0.65	Fallback label is assigned; no crash
Any unexpected exception in `classify_axes`	Caller (`load_positions`) catches, logs, returns original `axes` dict

Testing Strategy

Three new tests added to tests/test_political_compass.py:

test_axis_label_left_right Construct synthetic per-party positions where X values correlate strongly (r > 0.8) with the left_right column of a minimal inline CSV. Assert that classify_axes returns x_label == "Links–Rechts" and x_quality[window] > 0.65.

test_axis_label_coalition_dominant Construct synthetic positions where X values match the coalition membership pattern but NOT left-right. (E.g., coalition parties [VVD, PvdA] cluster at x=+1 and opposition parties [PVV, SP, CDA] at x=−1, which is historically coherent for 2016. At least five parties are needed, otherwise the window is skipped.) Assert x_label == "Coalitie–Oppositie" and that the interpretation string contains "coalitie".

test_axis_classifier_missing_csv Call classify_axes with a db_path pointing to a nonexistent directory so CSV loading fails. Assert that the function returns the axes dict unchanged and does not raise.

The first two tests use monkeypatching to inject CSV content as in-memory StringIO, following the existing pattern in tests/test_political_compass.py of patching module-level imports. The third injects no content, because its db_path points at a directory that does not exist.

Deployment

The CSV files (data/party_ideologies.csv and data/coalition_membership.csv) are static reference data committed to git. They are baked into the Docker image at build time alongside the application code. No rsync or volume mount is needed.

The .gitignore excludes data/*.db, data/*.bak, data/*.json but not data/*.csv, so they can be tracked without change to the ignore rules. The data volume mount (DATA_DIR:/home/app/app/data) only contains the database file and does not overwrite the baked-in CSVs.

When party compositions change (e.g., a new party enters parliament), update the CSV, commit, and redeploy. Typical frequency: once per parliament formation (~4 years).

Open Questions

None.
