date	topic	status
2026-03-29	Bootstrap confidence intervals and data enrichment	validated

Bootstrap Confidence Intervals & Data Enrichment

Problem Statement

The SVD axis charts show party centroid scores as point estimates with no indication of reliability: Volt (N=1) and D66 (N=49) appear equally certain. Two further issues:

- 2016–2018 motions lack body text, which weakens embedding quality for those windows
- party_svd_scores.json is a stale ad-hoc file that is missing NSC and should be deleted

Constraints

No re-SVD per bootstrap replicate — too expensive, only centroid uncertainty needed
Single-window bootstrap only — party scores come from current_parliament raw SVD vectors, not the Procrustes pipeline
Functional Python, using existing patterns (uv, duckdb, numpy)
Don't break existing Streamlit rendering — error bars are additive
Fixed random seed for reproducibility

Approach

Single-window centroid bootstrap. For each party, resample its N MPs with replacement 1000 times, recompute the centroid for each replicate, and take percentile CIs across the replicates. This is cheap (no re-SVD needed) and directly answers "how reliable is this score?".

Rejected alternatives:

Multi-window Procrustes bootstrap: 1000× SVD cost, requires orientation canonicalization. Overkill.
Analytical SE (std/sqrt(N)): the symmetric interval built from it assumes an approximately normal sampling distribution, so it misses skewed distributions.

Components

A. Download Script Enhancement (`scripts/download_past_year.py`)

Add two CLI flags:

--skip-details (default: True, matching current hardcoded behavior) — when False, fetches body text via _get_motion_details → _fetch_body_text
--update-existing (default: False) — when True, re-processes motions already in DB to fetch missing body_text and update the record
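The two flags could be wired up roughly as follows. This is a minimal sketch, not the script's actual parser: the `build_parser` name, the help strings, and the choice of `argparse.BooleanOptionalAction` (Python 3.9+, which also generates `--no-skip-details` as the way to set the flag to False) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for scripts/download_past_year.py."""
    parser = argparse.ArgumentParser(description="Download motions for a date range")
    parser.add_argument("--start-date", required=True, help="YYYY-MM-DD")
    parser.add_argument("--end-date", required=True, help="YYYY-MM-DD")
    # Default True keeps the current hardcoded behavior (no body text fetch).
    # BooleanOptionalAction also accepts --no-skip-details to turn it off.
    parser.add_argument(
        "--skip-details",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Skip fetching body text for each motion",
    )
    # Plain opt-in switch: absent means False.
    parser.add_argument(
        "--update-existing",
        action="store_true",
        help="Re-process motions already in the DB that lack body_text",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.skip_details, args.update_existing)
```

With this shape, omitting both flags reproduces today's behavior, and body text is fetched only when the caller explicitly opts in.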

The update-existing flow:

Query motions table for rows WHERE date BETWEEN start_date AND end_date AND (body_text IS NULL OR body_text = '')
Extract besluit_id from the URL column (format: https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id} — take last path segment)
For each such motion, call api._get_motion_details(besluit_id) to fetch body_text
UPDATE the motions row with the new body_text (and title/description if also missing)

Note: the motions table has no besluit_id column — it's only embedded in the URL. The update flow must parse it from the URL.
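Parsing the id out of the URL might look like the sketch below. The helper name `extract_besluit_id` is hypothetical, and the check on the parent path segment is an added safeguard rather than something the design specifies.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_besluit_id(url: Optional[str]) -> Optional[str]:
    """Return the last path segment of a stemmingsuitslagen URL, or None."""
    if not url:
        return None
    # Split the path into non-empty segments so trailing slashes are ignored.
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Only accept URLs of the form .../stemmingsuitslagen/{besluit_id}.
    if len(parts) >= 2 and parts[-2] == "stemmingsuitslagen":
        return parts[-1]
    return None
```

Returning None for anything that does not match lets the update flow skip and log those motions instead of calling the API with a bogus id.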

Run once after implementation with: --start-date 2016-01-01 --end-date 2018-12-31 --update-existing. There is no need to change --skip-details when using --update-existing, because the update flow always fetches details for the targeted rows.

B. Bootstrap Computation (`analysis/political_axis.py`)

New function:

compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42
) -> Dict[str, Dict]

Input: party_vectors is a dict mapping party name → list of individual MP vectors (each a numpy array of length 50). The caller (explorer.py) builds this from DB queries using existing mp→party mapping logic.

Returns per-party:

{
    "PVV": {
        "centroid": [50 floats],
        "ci_lower": [50 floats],
        "ci_upper": [50 floats],
        "std": [50 floats],
        "n_mps": 19
    },
    ...
}

Algorithm:

Receive pre-grouped party_vectors from caller
For each party with N >= 2:
- Create numpy Generator with fixed seed
- For each of n_boot replicates: sample N indices with replacement, compute mean vector
- Compute percentile CIs at (alpha/2, 100 - alpha/2), where alpha = 100 - ci, and std across replicates per dimension
For parties with N = 1: set ci_lower = ci_upper = centroid, std = 0, and report n_mps = 1
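One way the algorithm could be written is sketched below. It follows the signature and return structure given above. The vectorized index sampling and the choice to skip parties with an empty vector list are assumptions, not part of the design.

```python
from typing import Dict, List

import numpy as np


def compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42,
) -> Dict[str, Dict]:
    alpha = 100.0 - ci  # e.g. 5.0 for a 95% interval
    results: Dict[str, Dict] = {}
    for party, vectors in party_vectors.items():
        if not vectors:
            continue  # assumption: parties without vectors are skipped
        mat = np.vstack(vectors).astype(float)  # shape (N, dim)
        n = mat.shape[0]
        centroid = mat.mean(axis=0)
        if n == 1:
            # Nothing to resample: degenerate interval, zero spread.
            lower, upper = centroid, centroid
            std = np.zeros_like(centroid)
        else:
            # Fresh generator per party so results do not depend on dict order.
            rng = np.random.default_rng(seed)
            # One row of N resampled MP indices per replicate.
            idx = rng.integers(0, n, size=(n_boot, n))
            boot_means = mat[idx].mean(axis=1)  # shape (n_boot, dim)
            lower, upper = np.percentile(
                boot_means, [alpha / 2, 100 - alpha / 2], axis=0
            )
            std = boot_means.std(axis=0)
        results[party] = {
            "centroid": centroid.tolist(),
            "ci_lower": lower.tolist(),
            "ci_upper": upper.tolist(),
            "std": std.tolist(),
            "n_mps": n,
        }
    return results
```

Converting the arrays to lists keeps the result JSON-serializable and matches the per-party structure shown earlier.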

Dependencies: numpy, duckdb (read_only), json.

Import issue: _PARTY_NORMALIZE and CURRENT_PARLIAMENT_PARTIES live in explorer.py (a Streamlit app). The bootstrap function in analysis/political_axis.py can't import from there. Solution: the bootstrap function accepts party_vectors: Dict[str, List[np.ndarray]] as input — the caller (explorer.py) handles the mp→party mapping and passes grouped vectors in. This keeps the analysis module independent of Streamlit app constants and avoids duplicating the normalization logic.

Alternatively, the caller can pass the already-computed party_scores dict from load_party_axis_scores plus raw per-party MP vector lists. The simplest approach: add a helper in explorer.py that loads grouped MP vectors per party (reusing existing mapping logic) and pass that to the bootstrap function.
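The grouping step on the caller side could be as small as the sketch below. The function name `group_vectors_by_party` and both input shapes (an MP-to-vector mapping and an already normalized MP-to-party mapping) are hypothetical; the real helper would build these from the DB using the existing mapping logic in explorer.py.

```python
from collections import defaultdict
from typing import Dict, List, Mapping

import numpy as np


def group_vectors_by_party(
    mp_vectors: Mapping[str, np.ndarray],
    mp_to_party: Mapping[str, str],
) -> Dict[str, List[np.ndarray]]:
    """Group individual MP vectors under their (already normalized) party."""
    grouped: Dict[str, List[np.ndarray]] = defaultdict(list)
    for mp, vec in mp_vectors.items():
        party = mp_to_party.get(mp)
        if party is None:
            continue  # MP without a known party: leave out of all centroids
        grouped[party].append(np.asarray(vec, dtype=float))
    return dict(grouped)
```

The result has the `Dict[str, List[np.ndarray]]` shape that the bootstrap function expects, so the analysis module never needs to see the Streamlit-side constants.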

C. Chart Enhancement (`explorer.py`)

Modify _render_party_axis_chart to accept an optional bootstrap_data: Optional[Dict[str, Dict]] = None.

When bootstrap_data is provided:

For each party, compute error magnitude: (ci_upper[axis_idx] - ci_lower[axis_idx]) / 2
When flip is True, error magnitude stays the same (symmetric around the negated centroid)
Add error_x=dict(type="data", array=error_array, visible=True) to the party marker Scatter trace
Parties with N=1: render with a distinct marker (diamond shape instead of circle) as visual unreliability warning
Add N={n_mps} to hover text for all parties
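The per-party inputs for the trace can be derived without touching Plotly itself, as in this sketch. The helper name `build_error_bar_inputs` and the hover text format are assumptions. No flip argument is needed because the half-width is unchanged when the centroid and both bounds are negated.

```python
from typing import Dict, List, Tuple


def build_error_bar_inputs(
    bootstrap_data: Dict[str, Dict],
    parties: List[str],
    axis_idx: int,
) -> Tuple[dict, List[str], List[str]]:
    """Return (error_x dict, marker symbols, hover labels) in party order."""
    error_array: List[float] = []
    symbols: List[str] = []
    hover: List[str] = []
    for party in parties:
        entry = bootstrap_data[party]
        # Half the CI width on the selected axis.
        half_width = (entry["ci_upper"][axis_idx] - entry["ci_lower"][axis_idx]) / 2
        error_array.append(half_width)
        # Diamond flags single-MP parties whose interval is degenerate.
        symbols.append("diamond" if entry["n_mps"] == 1 else "circle")
        hover.append(f"{party} (N={entry['n_mps']})")
    error_x = dict(type="data", array=error_array, visible=True)
    return error_x, symbols, hover
```

The returned `error_x` dict can be passed straight to the Scatter trace, and the symbol list can be used as `marker=dict(symbol=symbols)`, which Plotly accepts as a per-point array.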

The bootstrap computation should be cached alongside party scores using @st.cache_data.

D. Delete Stale JSON File

Remove thoughts/explorer/party_svd_scores.json. The app never reads this file — load_party_axis_scores always computes live from the DB. The file was generated ad-hoc during analysis and is missing NSC.

Also remove thoughts/explorer/axis_analysis_data.json — same situation, ad-hoc analysis artifact not used by the app.

Data Flow

DB (svd_vectors, mp_metadata)
  │
  ├──→ load_party_axis_scores()
  │      returns Dict[str, List[float]]  (party → 50-dim centroid)
  │
  └──→ load_party_mp_vectors()  [NEW helper in explorer.py]
         returns Dict[str, List[np.ndarray]]  (party → list of individual MP vectors)
         reuses same mp→party mapping as load_party_axis_scores
  │
  ↓
compute_party_bootstrap_cis(party_vectors, n_boot=1000, ci=95, seed=42)
  │ returns Dict[str, Dict]  (party → {centroid, ci_lower, ci_upper, std, n_mps})
  ↓
_render_party_axis_chart(party_scores, comp_sel, theme, bootstrap_data=None)
  │ indexes [comp_sel - 1] from centroid and CIs
  │ applies flip (negate score AND CI bounds; lower and upper swap)
  │ adds error_x to Plotly Scatter trace
  ↓
Streamlit renders chart with error bars

Both functions are cached via @st.cache_data with the same TTL.

Error Handling

N=1 parties (Volt, Lid Keijzer): Return centroid as both CI bounds, std=0. Chart renders diamond marker. Hover says "N=1, geen betrouwbaarheidsinterval" (Dutch for "N=1, no confidence interval").
N=2 parties (50PLUS): CIs will be wide. That is the correct result for two MPs, so no special handling is applied.
SVD vector parsing failures: Skip MP, log warning (same as existing pattern).
Download/scraping failures: Per-chunk try/except already handles this. _fetch_body_text returns None on failure (existing behavior).
update-existing with no besluit_id: Skip motion, log. Not all motions have a besluit_id traceable to body text.

Testing Strategy

Unit Tests

test_bootstrap_fixed_seed: Synthetic data (5 parties, varying N), fixed seed. Verify:
- Output shape matches expected structure
- CI bounds bracket centroid for all parties
- N=1 party has ci_lower == ci_upper == centroid
- Same seed produces identical output
- Larger N produces narrower CIs

Integration Tests

test_bootstrap_real_db: Run against actual DB, verify:
- Returns data for all 17 current parliament parties (+NSC)
- n_mps values match known party sizes
- CI width for D66 (N=49) << CI width for SP (N=3)

Visual Validation

Run Streamlit app, verify error bars appear on SVD axis charts
Verify N=1 parties have distinct marker style
Verify hover text includes party size

Open Questions

None — design is straightforward. The only future enhancement would be multi-window bootstrap for axis stability testing, but that's a separate project.
