date: 2026-03-29
topic: Bootstrap confidence intervals and data enrichment
status: validated

Bootstrap Confidence Intervals & Data Enrichment

Problem Statement

The SVD axis charts show party centroid scores as point estimates with no indication of reliability. Volt (N=1) and D66 (N=49) look equally confident. Additionally:

  • 2016–2018 motions lack body text, weakening embedding quality for those windows
  • party_svd_scores.json is a stale ad-hoc file missing NSC — should be deleted

Constraints

  • No re-SVD per bootstrap replicate — too expensive, only centroid uncertainty needed
  • Single-window bootstrap only — party scores come from current_parliament raw SVD vectors, not the Procrustes pipeline
  • Functional Python, using existing patterns (uv, duckdb, numpy)
  • Don't break existing Streamlit rendering — error bars are additive
  • Fixed random seed for reproducibility

Approach

Single-window centroid bootstrap. For each party, resample its N MPs with replacement 1000×, recompute centroid per replicate, take percentile CIs. Cheap (no re-SVD needed), directly answers "how reliable is this score?".

Rejected alternatives:

  • Multi-window Procrustes bootstrap: 1000× SVD cost, requires orientation canonicalization. Overkill.
  • Analytical SE (std/sqrt(N)): assumes normality, misses skewed distributions.

Components

A. Download Script Enhancement (scripts/download_past_year.py)

Add two CLI flags:

  • --skip-details (default: True, matching current hardcoded behavior): when False, fetches body text via _get_motion_details and _fetch_body_text
  • --update-existing (default: False) — when True, re-processes motions already in DB to fetch missing body_text and update the record
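
A minimal sketch of how the two flags might be declared with argparse, assuming the script already accepts --start-date and --end-date. The names build_parser and fetch_details are hypothetical, and argparse.BooleanOptionalAction (Python 3.9+) is one possible way to let a default-True flag be turned off via --no-skip-details; the actual script may handle this differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for scripts/download_past_year.py."""
    parser = argparse.ArgumentParser(description="Download motions for a date range")
    parser.add_argument("--start-date", required=True, help="ISO date, e.g. 2016-01-01")
    parser.add_argument("--end-date", required=True, help="ISO date, e.g. 2018-12-31")
    # BooleanOptionalAction also generates --no-skip-details, which is how a
    # flag that defaults to True can be switched off from the command line.
    parser.add_argument(
        "--skip-details",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Skip fetching body text (matches the current hardcoded behavior)",
    )
    parser.add_argument(
        "--update-existing",
        action="store_true",
        help="Re-process stored motions that are missing body_text",
    )
    return parser


args = build_parser().parse_args(
    ["--start-date", "2016-01-01", "--end-date", "2018-12-31", "--update-existing"]
)
# The update flow always fetches details for the rows it targets,
# regardless of the --skip-details setting.
fetch_details = args.update_existing or not args.skip_details
```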

The update-existing flow:

  1. Query motions table for rows WHERE date BETWEEN start_date AND end_date AND (body_text IS NULL OR body_text = '')
  2. Extract besluit_id from the URL column (format: https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id} — take last path segment)
  3. For each such motion, call api._get_motion_details(besluit_id) to fetch body_text
  4. UPDATE the motions row with the new body_text (and title/description if also missing)

Note: the motions table has no besluit_id column — it's only embedded in the URL. The update flow must parse it from the URL.
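
The URL parsing in step 2 could be sketched as follows. The helper name extract_besluit_id is hypothetical, and the handling of trailing slashes and of the bare listing URL is an assumption about what malformed rows might look like.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_besluit_id(url: Optional[str]) -> Optional[str]:
    """Return the last path segment of a stemmingsuitslagen URL, or None."""
    if not url:
        return None
    # urlparse drops any query string or fragment from the path.
    path = urlparse(url).path.rstrip("/")
    segment = path.rsplit("/", 1)[-1]
    # An empty segment, or the bare listing page, carries no besluit_id.
    if not segment or segment == "stemmingsuitslagen":
        return None
    return segment
```

Rows for which this returns None would be skipped and logged, in line with the error handling described for update-existing.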

Run once after implementation:

--start-date 2016-01-01 --end-date 2018-12-31 --update-existing

There is no need to change --skip-details when using --update-existing: the update flow always fetches details for the targeted rows.

B. Bootstrap Computation (analysis/political_axis.py)

New function:

compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42
) -> Dict[str, Dict]

Input: party_vectors is a dict mapping party name → list of individual MP vectors (each a numpy array of length 50). The caller (explorer.py) builds this from DB queries using existing mp→party mapping logic.

Returns per-party:

{
    "PVV": {
        "centroid": [50 floats],
        "ci_lower": [50 floats],
        "ci_upper": [50 floats],
        "std": [50 floats],
        "n_mps": 19
    },
    ...
}

Algorithm:

  1. Receive pre-grouped party_vectors from caller
  2. For each party with N >= 2:
    • Create numpy Generator with fixed seed
    • For each of n_boot replicates: sample N indices with replacement, compute mean vector
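    • Compute percentile CIs at the (alpha/2, 100 - alpha/2) percentiles, where alpha = 100 - ci (2.5 and 97.5 for ci = 95), and the std across replicates, per dimension
  3. For parties with N = 1: set ci_lower = ci_upper = centroid and std = 0; n_mps = 1 marks the party as having no usable interval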
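
The algorithm above can be sketched with numpy as follows. This is one possible implementation under stated assumptions, not the final code: it vectorizes the replicates with a single index matrix, re-creates the Generator per party so that each party's result does not depend on which other parties are present, and skips parties with an empty vector list (a case the design does not specify).

```python
from typing import Dict, List

import numpy as np


def compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42,
) -> Dict[str, Dict]:
    alpha = 100.0 - ci  # 5.0 for a 95% interval
    results: Dict[str, Dict] = {}
    for party, vectors in party_vectors.items():
        if not vectors:
            continue  # assumption: parties without vectors are omitted
        data = np.vstack(vectors)  # shape (N, n_dims)
        n_mps = data.shape[0]
        centroid = data.mean(axis=0)
        if n_mps == 1:
            # No resampling possible: bounds collapse onto the centroid.
            lower, upper = centroid, centroid
            std = np.zeros_like(centroid)
        else:
            rng = np.random.default_rng(seed)
            # Each row holds N MP indices drawn with replacement.
            idx = rng.integers(0, n_mps, size=(n_boot, n_mps))
            replicates = data[idx].mean(axis=1)  # shape (n_boot, n_dims)
            lower, upper = np.percentile(
                replicates, [alpha / 2, 100.0 - alpha / 2], axis=0
            )
            std = replicates.std(axis=0)
        results[party] = {
            "centroid": centroid.tolist(),
            "ci_lower": lower.tolist(),
            "ci_upper": upper.tolist(),
            "std": std.tolist(),
            "n_mps": n_mps,
        }
    return results
```

Returning plain lists rather than arrays keeps the result JSON-serializable and matches the per-party structure shown above.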

Dependencies: numpy, duckdb (read_only), json.

Import issue: _PARTY_NORMALIZE and CURRENT_PARLIAMENT_PARTIES live in explorer.py (a Streamlit app). The bootstrap function in analysis/political_axis.py can't import from there. Solution: the bootstrap function accepts party_vectors: Dict[str, List[np.ndarray]] as input — the caller (explorer.py) handles the mp→party mapping and passes grouped vectors in. This keeps the analysis module independent of Streamlit app constants and avoids duplicating the normalization logic.

Alternatively, the caller can pass the already-computed party_scores dict from load_party_axis_scores plus raw per-party MP vector lists. The simplest approach: add a helper in explorer.py that loads grouped MP vectors per party (reusing existing mapping logic) and pass that to the bootstrap function.
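
A rough sketch of the grouping step such a helper might perform once rows have been read from the DB. The function name group_vectors_by_party, the (mp_name, vector_json) row shape, and the assumption that vectors are stored as JSON text are all illustrative; the real helper would reuse the existing mp→party mapping in explorer.py.

```python
import json
import logging
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

import numpy as np

logger = logging.getLogger(__name__)


def group_vectors_by_party(
    rows: Iterable[Tuple[str, Optional[str]]],
    mp_to_party: Dict[str, str],
) -> Dict[str, List[np.ndarray]]:
    """Group per-MP vectors under their (already normalized) party name."""
    grouped: Dict[str, List[np.ndarray]] = defaultdict(list)
    for mp_name, vector_json in rows:
        party = mp_to_party.get(mp_name)
        if party is None:
            continue  # MP is not in the current mapping
        try:
            vector = np.asarray(json.loads(vector_json), dtype=float)
        except (TypeError, ValueError):
            # json.JSONDecodeError is a ValueError; None input raises TypeError.
            logger.warning("Skipping %s: unparseable SVD vector", mp_name)
            continue
        if vector.ndim != 1:
            logger.warning("Skipping %s: vector is not one-dimensional", mp_name)
            continue
        grouped[party].append(vector)
    return dict(grouped)
```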

C. Chart Enhancement (explorer.py)

Modify _render_party_axis_chart to accept an optional argument bootstrap_data: Optional[Dict[str, Dict]] = None.

When bootstrap_data is provided:

  • For each party, compute error magnitude: (ci_upper[axis_idx] - ci_lower[axis_idx]) / 2
  • When flip is True, the error magnitude is unchanged: negating both CI bounds swaps them but preserves the half-width, so the bar stays centered on the negated centroid
  • Add error_x=dict(type="data", array=error_array, visible=True) to the party marker Scatter trace
  • Parties with N=1: render with a distinct marker (diamond shape instead of circle) as visual unreliability warning
  • Add N={n_mps} to hover text for all parties
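
The per-party inputs for the chart could be assembled roughly as below. The helper name build_error_bar_inputs is hypothetical, and the fallback for parties missing from bootstrap_data (zero-length bar, plain hover text) is an assumption; "circle" and "diamond" are standard Plotly marker symbols.

```python
from typing import Dict, List, Tuple


def build_error_bar_inputs(
    parties: List[str],
    bootstrap_data: Dict[str, Dict],
    axis_idx: int,
) -> Tuple[List[float], List[str], List[str]]:
    """Return error half-widths, marker symbols and hover labels per party."""
    errors: List[float] = []
    symbols: List[str] = []
    hover: List[str] = []
    for party in parties:
        entry = bootstrap_data.get(party)
        if entry is None:
            # Assumption: parties without bootstrap data get no error bar.
            errors.append(0.0)
            symbols.append("circle")
            hover.append(party)
            continue
        half_width = (entry["ci_upper"][axis_idx] - entry["ci_lower"][axis_idx]) / 2
        n_mps = entry["n_mps"]
        errors.append(half_width)
        if n_mps == 1:
            symbols.append("diamond")
            hover.append(f"{party} (N=1, geen betrouwbaarheidsinterval)")
        else:
            symbols.append("circle")
            hover.append(f"{party} (N={n_mps})")
    return errors, symbols, hover


example = {
    "D66": {"ci_lower": [-1.0], "ci_upper": [1.0], "n_mps": 49},
    "Volt": {"ci_lower": [0.3], "ci_upper": [0.3], "n_mps": 1},
}
errors, symbols, hover = build_error_bar_inputs(["D66", "Volt"], example, axis_idx=0)
# This dict is what would be passed as error_x on the party marker trace.
error_x = dict(type="data", array=errors, visible=True)
```

Because only the half-width is used, the same error array applies whether or not the axis is flipped.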

The bootstrap computation should be cached alongside party scores using @st.cache_data.

D. Delete Stale JSON File

Remove thoughts/explorer/party_svd_scores.json. The app never reads this file — load_party_axis_scores always computes live from the DB. The file was generated ad-hoc during analysis and is missing NSC.

Also remove thoughts/explorer/axis_analysis_data.json — same situation, ad-hoc analysis artifact not used by the app.

Data Flow

DB (svd_vectors, mp_metadata)
  │
  ├──→ load_party_axis_scores()
  │      returns Dict[str, List[float]]  (party → 50-dim centroid)
  │
  └──→ load_party_mp_vectors()  [NEW helper in explorer.py]
         returns Dict[str, List[np.ndarray]]  (party → list of individual MP vectors)
         reuses same mp→party mapping as load_party_axis_scores
  │
  ↓
compute_party_bootstrap_cis(party_vectors, n_boot=1000, ci=95, seed=42)
  │ returns Dict[str, Dict]  (party → {centroid, ci_lower, ci_upper, std, n_mps})
  ↓
_render_party_axis_chart(party_scores, comp_sel, theme, bootstrap_data=None)
  │ indexes [comp_sel - 1] from centroid and CIs
  │ applies flip (negate score and CI bounds; the bounds swap, half-width unchanged)
  │ adds error_x to Plotly Scatter trace
  ↓
Streamlit renders chart with error bars

Both functions cached via @st.cache_data with same TTL.

Error Handling

  • N=1 parties (Volt, Lid Keijzer): Return centroid as both CI bounds, std=0. Chart renders diamond marker. Hover says "N=1, geen betrouwbaarheidsinterval" (Dutch for "N=1, no confidence interval").
  • N=2 parties (50PLUS): CIs will be wide — that's correct, let data speak.
  • SVD vector parsing failures: Skip MP, log warning (same as existing pattern).
  • Download/scraping failures: Per-chunk try/except already handles this. _fetch_body_text returns None on failure (existing behavior).
  • update-existing with no besluit_id: Skip motion, log. Not all motions have a besluit_id traceable to body text.

Testing Strategy

Unit Tests

  • test_bootstrap_fixed_seed: Synthetic data (5 parties, varying N), fixed seed. Verify:
    • Output shape matches expected structure
    • CI bounds bracket centroid for all parties
    • N=1 party has ci_lower == ci_upper == centroid
    • Same seed produces identical output
    • Larger N produces narrower CIs

Integration Tests

  • test_bootstrap_real_db: Run against actual DB, verify:
    • Returns data for all 17 current parliament parties (+NSC)
    • n_mps values match known party sizes
    • CI width for D66 (N=49) << CI width for SP (N=3)

Visual Validation

  • Run Streamlit app, verify error bars appear on SVD axis charts
  • Verify N=1 parties have distinct marker style
  • Verify hover text includes party size

Open Questions

None — design is straightforward. The only future enhancement would be multi-window bootstrap for axis stability testing, but that's a separate project.