---
date: 2026-03-24
topic: "Welk tweede kamerlid ben jij?"
status: draft
---

## Problem Statement

We need a new Streamlit tab in explorer.py titled **"Welk tweede kamerlid ben jij?"** (Dutch for "Which MP are you?") that interactively narrows the list of 2026 MPs by asking the user a sequence of yes/no/abstain questions (motions). The goal: find a small set of motions (questions), ideally minimal, that uniquely identifies a single MP, or determine that no unique MP exists because two or more MPs have identical voting records.

**Why:** This is a guided identification quiz that helps users discover which MP they most resemble by iteratively comparing their answers to historical MP votes.

## Constraints

- Work inside the existing Streamlit explorer (single-file UI: **explorer.py**).
- Use existing data models/tables: **mp_votes**, **mp_metadata**, **motions** (DuckDB / MotionDatabase). No new external services.
- Keep reads read-only: do not modify the DB from the UI flow.
- YAGNI: minimal viable UX first (linear question flow, basic results table), extensible later.

## Approach (chosen)

I recommend a two-stage approach that balances simplicity and correctness:

- **Stage A (Batch-match + ranking):** Ask the user a small curated set of motions (e.g., those with a high controversy or discriminative score). Collect the answers into a map motion_id -> vote and compute per-MP agreement counts using a new read-only DB helper. Show the ranked candidates and whether any of them is unique.
- **Stage B (Small distinguishing set):** If more than one candidate remains (including ties), greedily select additional motions that best split the remaining candidate set and present them as follow-up questions until a unique MP is found or no remaining motion separates the candidates. Greedy selection keeps the set small but does not guarantee a truly minimal one.

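To make "best split" concrete, one candidate score (an assumption here, not a settled choice) is the entropy of the remaining candidates' votes on a motion. Let $C$ be the remaining candidate set, $V$ the set of vote tokens, and $n_v(m)$ the number of candidates in $C$ whose recorded vote on motion $m$ is $v$, with a missing vote counted as its own token so that the counts sum to $|C|$:

$$
H(m) = -\sum_{v \in V,\; n_v(m) > 0} \frac{n_v(m)}{|C|} \log_2 \frac{n_v(m)}{|C|}
$$

$H(m) = 0$ when every candidate voted the same way, so the motion cannot narrow the set; it is largest when the candidates are spread evenly over the tokens. Under this assumption, Stage B would ask the motion with the highest $H(m)$ next and stop once one candidate remains or every unasked motion scores 0.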
Alternatives considered (rejected):

- Asking motions adaptively from the start using an information-gain search over the entire motion space. Rejected because it is heavier to implement and harder to explain to users; a greedy information-gain variant can be added later.
- Building a full decision tree offline for all MPs. Rejected for now because the dataset and party churn make maintenance cumbersome.

Effort estimate (rough):

- Backend: add one DB method to MotionDatabase (match_mps_for_votes) plus a helper to compute split scores: 2–4 hours.
- Frontend: add the new Streamlit builder, UI state, and wiring into tabs: 2–4 hours.
- Testing & polish: 2–3 hours.

Risks & dependencies

- **Data quality:** If mp_votes.party or mp_metadata is incomplete, matching may be imperfect. We rely on the existing backfill scripts to improve party fields.
- **Performance:** Joins over mp_votes can be large; we'll limit the candidate motion set and use read-only DuckDB queries, with caching where appropriate.

## Architecture

High-level components (all inside the in-process Streamlit app):

- **Explorer UI (explorer.py):** new tab builder **build_mp_quiz_tab**. Presents questions and displays results.
- **MotionDatabase (database.py):** new read-only method **match_mps_for_votes(user_votes, limit)** that returns per-MP agreement and overlap counts, plus a helper **choose_discriminating_motions(candidates, excluded_motion_ids, k=1)** that scores motions by how well they split the candidate MPs.
- **DuckDB (data):** existing tables: motions, mp_votes, mp_metadata.

All calls stay local: the Streamlit UI instantiates MotionDatabase(db_path) and calls the new read methods.

## Components and Responsibilities

- **build_mp_quiz_tab (explorer.py)**
  - Render intro and instructions.
  - Load an initial pool of candidate motions (curated by controversy or SVD components via the existing load_motions_df).
  - Present one question at a time and store answers in st.session_state (motion_id -> vote).
  - After each answer (or on demand), call MotionDatabase.match_mps_for_votes to get ranked candidates.
  - If multiple candidates remain, call the discriminating-motion helper to pick the next question.
  - Show the final result (a unique MP, or a note that multiple MPs are indistinguishable).

- **MotionDatabase.match_mps_for_votes (database.py)**
  - Input: user_votes dict {motion_id: vote_str}, plus an optional limit on the number of MPs returned.
  - Output: ordered list of {mp_name, party, matched, total, agreement_pct}, where total is the overlap count (answered motions on which the MP has a recorded vote).
  - Implementation: create an in-memory relation of user_votes, join it with mp_votes where mp_name LIKE '%,%', and aggregate matched / overlap counts. Order by agreement_pct desc, then matched desc.

- **MotionDatabase.choose_discriminating_motions (database.py)**
  - Input: remaining candidate mp_names, excluded_motion_ids, and k (number of motions to return, default 1).
  - Output: motion_id(s) ranked by split score (e.g., entropy or max-min split).
  - Implementation: for a small candidate set, count how many MPs vote 'voor'/'tegen'/'onthouden' on each motion and pick the motion with the best split.

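The agreement computation described for match_mps_for_votes can be sketched in plain Python. This is only an illustration of the matched / total / agreement_pct logic, not the DuckDB query itself: the function name `match_mps`, the tuple layout of `mp_vote_rows`, and the MP names and parties in the example are all made up for the sketch.

```python
from collections import defaultdict


def match_mps(user_votes, mp_vote_rows, limit=None):
    """Rank MPs by agreement with the user's answers.

    user_votes: {motion_id: vote_token}
    mp_vote_rows: iterable of (mp_name, party, motion_id, vote_token) tuples,
        a stand-in for rows of the mp_votes table (assumed layout).
    """
    stats = defaultdict(lambda: {"party": None, "matched": 0, "total": 0})
    for mp_name, party, motion_id, vote in mp_vote_rows:
        if motion_id not in user_votes:
            continue  # only motions the user answered count as overlap
        entry = stats[mp_name]
        entry["party"] = party
        entry["total"] += 1
        if vote == user_votes[motion_id]:
            entry["matched"] += 1

    # MPs without any overlap never get an entry, so total is always >= 1 here.
    ranked = [
        {
            "mp_name": name,
            "party": s["party"],
            "matched": s["matched"],
            "total": s["total"],
            "agreement_pct": 100.0 * s["matched"] / s["total"],
        }
        for name, s in stats.items()
    ]
    # agreement_pct desc, then matched desc, then name for a stable order
    ranked.sort(key=lambda r: (-r["agreement_pct"], -r["matched"], r["mp_name"]))
    return ranked if limit is None else ranked[:limit]


# Synthetic rows; names and parties are fictional.
rows = [
    ("Jansen, A.", "P1", 1, "voor"),
    ("Jansen, A.", "P1", 2, "tegen"),
    ("Bakker, B.", "P2", 1, "voor"),
    ("Bakker, B.", "P2", 2, "voor"),
    ("Visser, C.", "P3", 3, "voor"),  # no overlap with the user's answers
]
user_votes = {1: "voor", 2: "tegen"}
ranked = match_mps(user_votes, rows)
```

In this example "Jansen, A." matches both answers, "Bakker, B." matches one of two, and "Visser, C." is left out because there is no overlapping motion. In the real method the same aggregation would presumably happen inside the SQL join rather than in Python.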
Files to modify (concrete)

- explorer.py
  - Add function: build_mp_quiz_tab(...) near the other build_*_tab functions (e.g., after build_svd_components_tab).
  - Add the new tab label to the tab_labels list and wire it into the st.tabs and fallback radio branches. (See the existing tab pattern in explorer.py around lines 626-779.)

- database.py
  - Add methods: match_mps_for_votes and choose_discriminating_motions near the calculate_party_matches / mp_votes helpers.

## Data Flow

1. The UI loads the candidate motion list via the existing load_motions_df(db_path).
2. The user answers a question; the answer is stored in st.session_state['mp_quiz_votes'], a mapping motion_id -> vote_token.
3. The UI calls MotionDatabase.match_mps_for_votes(user_votes) (read-only DuckDB), which returns candidate MPs sorted by agreement, with matched/total/agreement_pct.
4. If more than one candidate remains, the UI calls MotionDatabase.choose_discriminating_motions(candidates, excluded) to pick the next motion(s).
5. Repeat until one candidate remains OR no motion splits the candidates (a tie caused by identical voting histories).

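Steps 4 and 5 can be sketched as follows, using entropy as the split score (one of the options named for choose_discriminating_motions; the final metric is still open). The extra `votes_by_mp` argument is an assumption of this sketch: it stands in for the DB lookup and maps each candidate to their {motion_id: vote_token} record. A missing vote is counted as its own category (`None`).

```python
import math
from collections import Counter


def split_score(votes_by_mp, candidates, motion_id):
    """Entropy in bits of the candidates' votes on one motion; 0 means no split."""
    counts = Counter(votes_by_mp[mp].get(motion_id) for mp in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def choose_discriminating_motions(votes_by_mp, candidates, excluded_motion_ids, k=1):
    """Return up to k unasked motion ids, best split first; [] if nothing splits."""
    motion_ids = {m for mp in candidates for m in votes_by_mp[mp]}
    motion_ids -= set(excluded_motion_ids)
    scored = [(split_score(votes_by_mp, candidates, m), m) for m in motion_ids]
    # Drop motions on which all candidates agree: they cannot narrow the set.
    scored = [(s, m) for s, m in scored if s > 0]
    scored.sort(key=lambda sm: (-sm[0], sm[1]))
    return [m for _, m in scored[:k]]


# Synthetic voting records; "A".."D" are placeholder MP names.
votes_by_mp = {
    "A": {1: "voor", 2: "voor", 3: "tegen"},
    "B": {1: "voor", 2: "tegen", 3: "tegen"},
    "C": {1: "voor", 2: "onthouden", 3: "tegen"},
    "D": {1: "voor", 2: "onthouden", 3: "tegen"},
}
next_motions = choose_discriminating_motions(votes_by_mp, ["A", "B", "C", "D"], {1})
stuck = choose_discriminating_motions(votes_by_mp, ["C", "D"], {1, 2})
```

Here motion 2 is the only motion that separates the four candidates, so it is asked next. "C" and "D" have identical records, so once only they remain the helper returns an empty list, which is the signal for the UI to report that the MPs are indistinguishable.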
## Error Handling

- Validation: normalize UI votes to the canonical tokens used in mp_votes (lowercase Dutch tokens such as 'voor', 'tegen', 'onthouden', 'afwezig').
- Empty or missing data: if user_votes is empty or no overlaps exist, show a helpful message and fall back to the top-ranked MPs by similarity.
- Division by zero: in match computations, exclude zero-overlap MPs from the ranking and surface a clear message.
- Timeouts / heavy queries: restrict the candidate set, use read-only DuckDB connections, and cache results (@st.cache_data) to avoid repeating heavy queries.

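A minimal sketch of the validation and zero-overlap rules above. The canonical tokens come from this document; the alias table (`UI_ALIASES`) and both function names are hypothetical and would need to match whatever labels the quiz UI actually shows.

```python
CANONICAL_TOKENS = {"voor", "tegen", "onthouden", "afwezig"}

# Hypothetical UI labels mapped to canonical mp_votes tokens (assumption).
UI_ALIASES = {
    "yes": "voor",
    "ja": "voor",
    "no": "tegen",
    "nee": "tegen",
    "abstain": "onthouden",
}


def normalize_vote(raw):
    """Map a UI answer to a canonical token, or raise ValueError if unknown."""
    token = raw.strip().lower()
    token = UI_ALIASES.get(token, token)
    if token not in CANONICAL_TOKENS:
        raise ValueError(f"Unknown vote token: {raw!r}")
    return token


def agreement_pct(matched, total):
    """Return None for zero overlap instead of dividing by zero."""
    if total == 0:
        return None
    return 100.0 * matched / total
```

Returning `None` for zero overlap lets the caller filter such MPs out of the ranking and show a message, rather than ranking them at 0%.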
## Testing Strategy

- Unit tests for database methods (new tests/test_match_mps.py):
  - small synthetic mp_votes fixture to assert matched/total/agreement_pct logic.
  - tests for choose_discriminating_motions producing expected splits.
- Integration test for explorer tab (tests/test_explorer_quiz.py): render the builder function in a headless mode and assert UI state updates and DB calls succeed (similar to existing tests/test_explorer_import.py).

## Open Questions

1. Do we want an initial curated motion set (top-10 controversial), or should we start fully adaptive? I'll implement a small curated seed and make adaptive/discovery optional.
2. UX: Should we let users skip a question (abstain) and count abstain as a valid token? I assume yes and will treat abstain as a normal vote that matches mp_votes 'onthouden' or 'afwezig' values.
3. Performance limits: how many motions should we allow the user to answer (an arbitrary cap, e.g., 20)? I suggest 20 to keep interactions snappy.

## Next steps

I'm proceeding to create the design doc file at thoughts/shared/designs/2026-03-24-welk-tweede-kamerlid-ben-jij-design.md and commit it. Interrupt if you want changes. After that I'll spawn the planner to create a detailed implementation plan based on this design.