parent
db9a61094b
commit
ff4ce0f9b2
@ -0,0 +1,168 @@ |
---
date: 2026-03-29
topic: "Bootstrap confidence intervals and data enrichment"
status: validated
---

# Bootstrap Confidence Intervals & Data Enrichment

## Problem Statement

The SVD axis charts show party centroid scores as point estimates with no indication of reliability. Volt (N=1) and D66 (N=49) look equally confident. Additionally:
- 2016–2018 motions lack body text, weakening embedding quality for those windows
- `party_svd_scores.json` is a stale ad-hoc file missing NSC and should be deleted

## Constraints

- No re-SVD per bootstrap replicate: too expensive, only centroid uncertainty needed
- Single-window bootstrap only: party scores come from `current_parliament` raw SVD vectors, not the Procrustes pipeline
- Functional Python, using existing patterns (uv, duckdb, numpy)
- Don't break existing Streamlit rendering; error bars are additive
- Fixed random seed for reproducibility

## Approach

**Single-window centroid bootstrap.** For each party, resample its N MPs with replacement 1000×, recompute the centroid per replicate, and take percentile CIs. This is cheap (no re-SVD needed) and directly answers "how reliable is this score?".

Rejected alternatives:
- Multi-window Procrustes bootstrap: 1000× SVD cost, requires orientation canonicalization. Overkill.
- Analytical SE (std/sqrt(N)): assumes normality, misses skewed distributions.

## Components

### A. Download Script Enhancement (`scripts/download_past_year.py`)

Add two CLI flags:
- `--skip-details` (default: `True`, matching current hardcoded behavior): when `False`, fetches body text via `_get_motion_details` → `_fetch_body_text`
- `--update-existing` (default: `False`): when `True`, re-processes motions already in the DB to fetch missing `body_text` and update the record

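A minimal sketch of how the two flags could be declared with `argparse`. The `build_parser` name and the use of `argparse.BooleanOptionalAction` (Python 3.9+) are assumptions made for illustration; the existing script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for scripts/download_past_year.py."""
    parser = argparse.ArgumentParser(description="Download motions for a date range")
    parser.add_argument("--start-date", required=True, help="ISO date, e.g. 2016-01-01")
    parser.add_argument("--end-date", required=True, help="ISO date, e.g. 2018-12-31")
    # BooleanOptionalAction also generates --no-skip-details, which is how a
    # default-True flag can be switched off from the command line.
    parser.add_argument(
        "--skip-details",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Skip fetching body text (matches the current hardcoded behavior)",
    )
    parser.add_argument(
        "--update-existing",
        action="store_true",
        help="Re-process motions already in the DB that lack body_text",
    )
    return parser


args = build_parser().parse_args(
    ["--start-date", "2016-01-01", "--end-date", "2018-12-31", "--update-existing"]
)
# --update-existing always fetches details for the rows it targets.
fetch_details = args.update_existing or not args.skip_details
```

With this shape, a plain download run that should fetch body text would pass `--no-skip-details`.
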
The update-existing flow:
1. Query the motions table for rows `WHERE date BETWEEN start_date AND end_date AND (body_text IS NULL OR body_text = '')`
2. Extract `besluit_id` from the URL column (format: `https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit_id}`; take the last path segment)
3. For each such motion, call `api._get_motion_details(besluit_id)` to fetch `body_text`
4. `UPDATE` the motions row with the new `body_text` (and title/description if also missing)

Note: the motions table has no `besluit_id` column; the ID is only embedded in the URL, so the update flow must parse it from there.

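The URL parsing in step 2 could look roughly like the sketch below. The helper name `extract_besluit_id` is hypothetical, and the ID in the usage example is a placeholder, not a real motion.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_besluit_id(url: Optional[str]) -> Optional[str]:
    """Return the last path segment of a motion URL, or None if it is absent."""
    if not url:
        return None
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Only accept URLs that follow the documented stemmingsuitslagen format.
    if len(segments) < 2 or segments[-2] != "stemmingsuitslagen":
        return None
    return segments[-1]


example = extract_besluit_id(
    "https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/abc-123"
)
```

Returning `None` for anything that does not match the expected format lets the update loop skip and log such motions instead of calling the API with a bogus ID.
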
Run once after implementation: `--start-date 2016-01-01 --end-date 2018-12-31 --update-existing`
(There is no need to turn off `--skip-details` when using `--update-existing`; it always fetches details for the targeted rows.)

### B. Bootstrap Computation (`analysis/political_axis.py`)

New function:
```
def compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42,
) -> Dict[str, Dict]:
    ...
```

Input: `party_vectors` is a dict mapping party name → list of individual MP vectors (each a numpy array of length 50). The caller (explorer.py) builds this from DB queries using existing mp→party mapping logic.

Returns per-party:
```
{
  "PVV": {
    "centroid": [50 floats],
    "ci_lower": [50 floats],
    "ci_upper": [50 floats],
    "std": [50 floats],
    "n_mps": 19
  },
  ...
}
```

Algorithm:
1. Receive pre-grouped `party_vectors` from caller
2. For each party with N >= 2:
   - Create a numpy `Generator` with the fixed seed
   - For each of `n_boot` replicates: sample N indices with replacement, compute the mean vector
   - Compute percentile CIs at (alpha/2, 100 - alpha/2), where alpha = 100 - `ci`, and std across replicates per dimension
3. For parties with N = 1: set ci_lower = ci_upper = centroid, std = 0, and report n_mps = 1

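One way the algorithm above might be written with numpy is sketched here. Drawing all resample indices in a single `rng.integers` call and skipping parties with an empty vector list are choices made for this sketch, not requirements from the design.

```python
from typing import Dict, List

import numpy as np


def compute_party_bootstrap_cis(
    party_vectors: Dict[str, List[np.ndarray]],
    n_boot: int = 1000,
    ci: float = 95.0,
    seed: int = 42,
) -> Dict[str, Dict]:
    """Percentile bootstrap CIs for each party centroid (illustrative sketch)."""
    alpha = 100.0 - ci
    results: Dict[str, Dict] = {}
    for party, vectors in party_vectors.items():
        if not vectors:
            continue  # assumption: parties without vectors are skipped
        data = np.vstack(vectors).astype(float)  # shape (N, n_dims)
        n_mps = data.shape[0]
        centroid = data.mean(axis=0)
        if n_mps == 1:
            # Nothing to resample: degenerate interval, zero spread.
            lower, upper, std = centroid, centroid, np.zeros_like(centroid)
        else:
            # Re-seeding per party keeps each result independent of dict order.
            rng = np.random.default_rng(seed)
            idx = rng.integers(0, n_mps, size=(n_boot, n_mps))
            replicates = data[idx].mean(axis=1)  # shape (n_boot, n_dims)
            lower, upper = np.percentile(
                replicates, [alpha / 2, 100 - alpha / 2], axis=0
            )
            std = replicates.std(axis=0)
        results[party] = {
            "centroid": centroid.tolist(),
            "ci_lower": lower.tolist(),
            "ci_upper": upper.tolist(),
            "std": std.tolist(),
            "n_mps": n_mps,
        }
    return results
```

Because the generator is rebuilt from the same seed for every party, adding or removing a party does not change the intervals of the others.
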
Dependencies: numpy, duckdb (read_only), json.

**Import issue**: `_PARTY_NORMALIZE` and `CURRENT_PARLIAMENT_PARTIES` live in `explorer.py` (a Streamlit app). The bootstrap function in `analysis/political_axis.py` can't import from there. Solution: the bootstrap function accepts `party_vectors: Dict[str, List[np.ndarray]]` as input; the caller (explorer.py) handles the mp→party mapping and passes grouped vectors in. This keeps the analysis module independent of Streamlit app constants and avoids duplicating the normalization logic.

Alternatively, the caller can pass the already-computed `party_scores` dict from `load_party_axis_scores` plus raw per-party MP vector lists. The simplest approach: add a helper in explorer.py that loads grouped MP vectors per party (reusing existing mapping logic) and pass that to the bootstrap function.

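The grouping step inside such a helper might look like the sketch below. It assumes, without confirmation from the codebase, that the DB query yields `(mp_name, vector_json)` rows with JSON-encoded vectors and that the existing mapping logic can be passed in as a callable; `group_mp_vectors` and `party_of` are hypothetical names.

```python
import json
import logging
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

import numpy as np

logger = logging.getLogger(__name__)


def group_mp_vectors(
    rows: Iterable[Tuple[str, str]],
    party_of: Callable[[str], Optional[str]],
) -> Dict[str, List[np.ndarray]]:
    """Group (mp_name, vector_json) rows into party -> list of MP vectors."""
    grouped: Dict[str, List[np.ndarray]] = defaultdict(list)
    for mp_name, vector_json in rows:
        party = party_of(mp_name)
        if party is None:
            continue  # MP is not mapped to a current-parliament party
        try:
            vector = np.asarray(json.loads(vector_json), dtype=float)
        except (TypeError, ValueError) as exc:
            # Same policy as elsewhere in the design: skip the MP, log a warning.
            logger.warning("Skipping %s: unparseable SVD vector (%s)", mp_name, exc)
            continue
        grouped[party].append(vector)
    return dict(grouped)
```

Keeping the mapping as an injected callable means the analysis module never needs to import `_PARTY_NORMALIZE` or `CURRENT_PARLIAMENT_PARTIES` itself.
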
### C. Chart Enhancement (`explorer.py`)

Modify `_render_party_axis_chart` to accept an optional `bootstrap_data: Optional[Dict[str, Dict]] = None`.

When `bootstrap_data` is provided:
- For each party, compute the error magnitude as the CI half-width: `(ci_upper[axis_idx] - ci_lower[axis_idx]) / 2`
- When flip is True, the error magnitude stays the same: negating both bounds swaps them but leaves the width unchanged, and the bar is drawn symmetrically around the negated centroid
- Add `error_x=dict(type="data", array=error_array, visible=True)` to the party marker Scatter trace
- Parties with N=1: render with a distinct marker (diamond shape instead of circle) as a visual unreliability warning
- Add `N={n_mps}` to hover text for all parties

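The per-party values described above can be derived without touching Plotly, as in this sketch. The function name `error_bar_arrays` and the shape of its return value are assumptions for illustration; only the `error_x` dict, the diamond/circle markers, and the hover strings come from the design.

```python
from typing import Dict, List


def error_bar_arrays(
    bootstrap_data: Dict[str, Dict],
    parties: List[str],
    axis_idx: int,
    flip: bool = False,
) -> Dict[str, object]:
    """Build x values, error half-widths, marker symbols and hover text."""
    xs, errors, symbols, hover = [], [], [], []
    for party in parties:
        entry = bootstrap_data[party]
        score = entry["centroid"][axis_idx]
        half_width = (entry["ci_upper"][axis_idx] - entry["ci_lower"][axis_idx]) / 2
        xs.append(-score if flip else score)
        # Flipping negates and swaps the bounds, so the half-width is unchanged.
        errors.append(half_width)
        if entry["n_mps"] == 1:
            symbols.append("diamond")
            hover.append(f"{party}: N=1, geen betrouwbaarheidsinterval")
        else:
            symbols.append("circle")
            hover.append(f"{party}: N={entry['n_mps']}")
    return {
        "x": xs,
        "error_x": dict(type="data", array=errors, visible=True),
        "marker_symbol": symbols,
        "hovertext": hover,
    }
```

The returned dict is shaped so that it could be unpacked into a Plotly `go.Scatter(y=parties, mode="markers", **kwargs)` call, though the real `_render_party_axis_chart` may assemble its trace differently.
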
The bootstrap computation should be cached alongside party scores using `@st.cache_data`.

### D. Delete Stale JSON File

Remove `thoughts/explorer/party_svd_scores.json`. The app never reads this file; `load_party_axis_scores` always computes live from the DB. The file was generated ad hoc during analysis and is missing NSC.

Also remove `thoughts/explorer/axis_analysis_data.json`: same situation, an ad-hoc analysis artifact not used by the app.

## Data Flow

```
DB (svd_vectors, mp_metadata)
  │
  ├──→ load_party_axis_scores()
  │      returns Dict[str, List[float]]  (party → 50-dim centroid)
  │
  └──→ load_party_mp_vectors()  [NEW helper in explorer.py]
         returns Dict[str, List[np.ndarray]]  (party → list of individual MP vectors)
         reuses same mp→party mapping as load_party_axis_scores
         │
         ↓
       compute_party_bootstrap_cis(party_vectors, n_boot=1000, ci=95, seed=42)
         │  returns Dict[str, Dict]  (party → {centroid, ci_lower, ci_upper, std, n_mps})
         ↓
       _render_party_axis_chart(party_scores, comp_sel, theme, bootstrap_data=None)
         │  indexes [comp_sel - 1] from centroid and CIs
         │  applies flip (negate score AND CI bounds)
         │  adds error_x to Plotly Scatter trace
         ↓
       Streamlit renders chart with error bars
```

Both functions are cached via `@st.cache_data` with the same TTL.

## Error Handling

- **N=1 parties (Volt, Lid Keijzer)**: Return centroid as both CI bounds, std=0. Chart renders a diamond marker. Hover says "N=1, geen betrouwbaarheidsinterval" ("N=1, no confidence interval").
- **N=2 parties (50PLUS)**: CIs will be wide; that's correct, let the data speak.
- **SVD vector parsing failures**: Skip MP, log warning (same as existing pattern).
- **Download/scraping failures**: Per-chunk try/except already handles this. `_fetch_body_text` returns None on failure (existing behavior).
- **update-existing with no besluit_id**: Skip motion, log. Not all motions have a besluit_id traceable to body text.

## Testing Strategy

### Unit Tests
- `test_bootstrap_fixed_seed`: Synthetic data (5 parties, varying N), fixed seed. Verify:
  - Output shape matches expected structure
  - CI bounds bracket centroid for all parties
  - N=1 party has ci_lower == ci_upper == centroid
  - Same seed produces identical output
  - Larger N produces narrower CIs

### Integration Tests
- `test_bootstrap_real_db`: Run against actual DB, verify:
  - Returns data for all 17 current parliament parties (+NSC)
  - n_mps values match known party sizes
  - CI width for D66 (N=49) << CI width for SP (N=3)

### Visual Validation
- Run Streamlit app, verify error bars appear on SVD axis charts
- Verify N=1 parties have distinct marker style
- Verify hover text includes party size

## Open Questions

None; the design is straightforward. The only future enhancement would be multi-window bootstrap for axis stability testing, but that's a separate project.