---
date: 2026-04-13
topic: topic-derived-svd-axis-labels
---

# Topic-Derived SVD Axis Labels

## Problem Frame

The current SVD axis labels in `SVD_THEMES` (config.py) describe which parties land where, not which policy dimension the axis captures. This produces misleading labels:

- **Axis 1**: labeled "Links: PvdD, GL-PvdA", but PvdD and D66 vote the same way on the defining motions (Israel, rent, antipersonnel mines, gas extraction). D66 is known as centrist, not left, so the label reflects party positions rather than the actual policy divide.
- The negative pole is named after parties that *coincidentally* vote together, not parties that define the axis.

**Users** want to understand what policy dimension each axis represents. A good label should therefore be derived from the topics of the motions that define each axis.

## Requirements

### Label Derivation

- **R1** Labels are derived from the **content of the motions** that define each axis, not from party positions.
- **R2** Use **50 motions per component** (top 25 positive + top 25 negative, ranked by absolute loading) to capture the full topic breadth, not just the top 10 (which can show a misleadingly narrow slice).
- **R3** Derive the label using **TF-IDF keyword extraction** on motion titles (Dutch stopwords removed). Use the 3-5 most distinctive keywords to form a short label.
- **R4** Also consider the `policy_area` field to validate or supplement the keyword-derived label.
- **R5** Labels should be **reviewed manually** before being applied to `SVD_THEMES`. The script outputs suggestions; a human validates them before committing.
- **R6** For each component, the output includes:
  - Suggested short label (≤60 chars)
  - Top 10 representative motions (5 from the positive pole + 5 from the negative pole)
  - Top 10 TF-IDF keywords
  - Dominant `policy_area`
  - Current `SVD_THEMES` label for reference

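The selection in R2 and the `policy_area` check in R4 could be sketched roughly as follows. This is an illustration only: the `MotionLoading` record, its field names, and both helper functions are hypothetical, and the example rows are made up rather than taken from the database.

```python
from collections import Counter
from typing import NamedTuple, Optional


class MotionLoading(NamedTuple):
    """One motion's loading on a single SVD component (hypothetical shape)."""
    motion_id: int
    title: str
    policy_area: Optional[str]
    loading: float


def select_pole_motions(rows, per_pole=25):
    """Return (positive, negative) motions, strongest loading first in each list."""
    positive = sorted((r for r in rows if r.loading > 0),
                      key=lambda r: r.loading, reverse=True)[:per_pole]
    # Ascending sort puts the most negative loadings first.
    negative = sorted((r for r in rows if r.loading < 0),
                      key=lambda r: r.loading)[:per_pole]
    return positive, negative


def dominant_policy_area(rows):
    """Most common non-empty policy_area, or None if no row has one."""
    counts = Counter(r.policy_area for r in rows if r.policy_area)
    return counts.most_common(1)[0][0] if counts else None


# Made-up rows, only to show the call pattern.
example_rows = [
    MotionLoading(1, "Voorbeeldmotie A", "Klimaat", 0.9),
    MotionLoading(2, "Voorbeeldmotie B", "Klimaat", 0.4),
    MotionLoading(3, "Voorbeeldmotie C", "Wonen", -0.7),
    MotionLoading(4, "Voorbeeldmotie D", None, -0.2),
]
positive, negative = select_pole_motions(example_rows, per_pole=2)
print([m.motion_id for m in positive], [m.motion_id for m in negative])
print(dominant_policy_area(positive + negative))
```

With `per_pole=25` this yields the 50 motions per component that R2 asks for, provided each pole has at least 25 motions with a non-zero loading.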
### Tooling

- **R7** Create a new script `scripts/derive_svd_labels.py` that generates a **review report** (markdown) with label suggestions per component.
- **R8** The report is generated by running:
  ```bash
  uv run python3 scripts/derive_svd_labels.py --db data/motions.db --window current_parliament
  ```
- **R9** After review, the validated labels are written to `analysis/config.py` (updating `SVD_THEMES`).

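A minimal sketch of the command-line surface implied by R8, using the standard-library `argparse` module. Only `--db` and `--window` come from the command above; the `--output` flag and its default filename are assumptions added for illustration.

```python
import argparse


def build_parser():
    """Argument parser for scripts/derive_svd_labels.py (sketch)."""
    parser = argparse.ArgumentParser(
        prog="derive_svd_labels.py",
        description="Generate a markdown review report with SVD axis label suggestions.",
    )
    parser.add_argument("--db", required=True,
                        help="Path to the motions database, e.g. data/motions.db")
    parser.add_argument("--window", required=True,
                        help="Analysis window, e.g. current_parliament")
    # Hypothetical flag: the requirements do not say where the report is written.
    parser.add_argument("--output", default="svd_label_review.md",
                        help="Where to write the review report (assumed option)")
    return parser


# Parse the exact arguments from R8 to show the resulting namespace.
args = build_parser().parse_args(
    ["--db", "data/motions.db", "--window", "current_parliament"]
)
print(args.db, args.window, args.output)
```

The script deliberately stops at producing the report; writing to `SVD_THEMES` stays a manual step after review, as R5 and R9 require.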
### Output Report Format

For each component (1-10), the review report includes:
- Suggested label
- TF-IDF keyword list
- Dominant policy area
- Top 5 positive-pole motion titles
- Top 5 negative-pole motion titles
- Current label for comparison

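One possible shape for a per-component report section is sketched below. The function name, heading levels, and field order are assumptions; the requirements fix only which items appear, not the exact markdown layout. The motion titles in the example are placeholders.

```python
def render_component_section(component, suggested_label, keywords, policy_area,
                             positive_titles, negative_titles, current_label):
    """Render one component's review section as markdown (hypothetical layout)."""
    lines = [
        f"## Component {component}",
        "",
        f"- **Suggested label**: {suggested_label}",
        f"- **TF-IDF keywords**: {', '.join(keywords)}",
        f"- **Dominant policy area**: {policy_area or 'unknown'}",
        f"- **Current label**: {current_label}",
        "",
        "### Positive pole",
    ]
    # Only the five strongest motions per pole are shown in the report.
    lines += [f"{i}. {t}" for i, t in enumerate(positive_titles[:5], start=1)]
    lines += ["", "### Negative pole"]
    lines += [f"{i}. {t}" for i, t in enumerate(negative_titles[:5], start=1)]
    return "\n".join(lines) + "\n"


section = render_component_section(
    component=1,
    suggested_label="Buitenlandbeleid & Klimaat",
    keywords=["keyword-a", "keyword-b", "keyword-c"],
    policy_area=None,
    positive_titles=[f"Positive motion title {i}" for i in range(1, 7)],
    negative_titles=[f"Negative motion title {i}" for i in range(1, 7)],
    current_label="Links: PvdD, GL-PvdA",
)
print(section)
```

Placing the current label next to the suggestion keeps the comparison visible without the reviewer having to open `config.py`.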
## Success Criteria

- Each axis label reflects the actual policy topics that define that axis
- Labels are consistent and interpretable (e.g., "Buitenlandbeleid & Klimaat" not "Links vs Rechts")
- The scores of PvdD and D66 on axis 1 make sense given the derived label
- The review report makes it easy for a human to validate or correct labels

## Scope Boundaries

- **In scope**: Label derivation for axes 1-10, review workflow, updating config
- **Out of scope**: Automatically applying labels without review, changing the SVD computation, modifying the UI
- **Not changing**: The `positive_pole` / `negative_pole` fields in `SVD_THEMES` (those describe party coalitions, not topics, which is acceptable as-is)

## Key Decisions

- **TF-IDF over LLM**: TF-IDF is deterministic, fast, and sufficient for keyword extraction. No LLM dependency. The reviewer still validates the output.
- **Static labels in config**: After review, labels go into `SVD_THEMES` in config.py. This keeps the current architecture (no runtime derivation).
- **Large motion sample (≥50)**: 10 motions per component is too few. Axis 1's top 10 motions show a mix of Israel, rent, mines, and gas that looks incoherent; ≥50 gives a clearer picture of what the axis truly captures.

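To illustrate the TF-IDF decision, here is a dependency-free sketch. It rests on several assumptions not stated in the requirements: each component's titles are treated as one document, IDF is smoothed as `log((1 + N) / (1 + df)) + 1`, ties are broken alphabetically to keep the output deterministic, and the stopword list is a tiny illustrative subset (including the boilerplate word "motie"). The titles are invented examples.

```python
import math
import re
from collections import Counter

# Illustrative subset only; a real run would need a fuller Dutch stopword list.
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "voor", "over", "op",
                   "in", "te", "met", "om", "aan", "bij", "naar", "motie"}


def tokenize(title):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[^\W\d_]+", title.lower())
    return [t for t in tokens if t not in DUTCH_STOPWORDS and len(t) > 2]


def tfidf_keywords(titles_by_component, top_n=10):
    """Return the top_n keywords per component, one 'document' per component."""
    term_counts = {
        comp: Counter(tok for title in titles for tok in tokenize(title))
        for comp, titles in titles_by_component.items()
    }
    n_docs = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    keywords = {}
    for comp, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # Sort by score descending, then alphabetically for stable output.
        keywords[comp] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return keywords


def suggest_label(keywords, n=3, max_len=60):
    """Join the first n keywords into a draft label of at most max_len chars."""
    return " / ".join(k.capitalize() for k in keywords[:n])[:max_len]


example_titles = {
    1: ["Motie over gaswinning in Groningen",
        "Motie over huurverhoging en gaswinning"],
    2: ["Motie over defensie en begroting",
        "Motie over begroting van defensie"],
}
top = tfidf_keywords(example_titles)
print(top[1], suggest_label(top[1]))
```

The draft label is only a starting point for the reviewer; raw keywords rarely read as a finished axis name.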
## Dependencies / Assumptions

- Motion titles in the `motions` table are in Dutch and sufficiently descriptive
- The `policy_area` field has meaningful coverage
- The `svd_vectors` table contains all motion loadings for the window

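The `policy_area` coverage assumption could be checked with a query along these lines. This sketch assumes the database is SQLite and that `policy_area` is a column of the `motions` table; the `title` column name and the in-memory demo rows are made up for illustration.

```python
import sqlite3


def policy_area_coverage(conn):
    """Fraction of motions with a non-empty policy_area (0.0 for an empty table)."""
    total, filled = conn.execute(
        "SELECT COUNT(*), "
        "SUM(CASE WHEN policy_area IS NOT NULL AND TRIM(policy_area) <> '' "
        "THEN 1 ELSE 0 END) "
        "FROM motions"
    ).fetchone()
    # SUM() yields NULL on an empty table, hence the `or 0`.
    return (filled or 0) / total if total else 0.0


# In-memory stand-in for data/motions.db; the schema here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE motions (id INTEGER PRIMARY KEY, title TEXT, policy_area TEXT)")
conn.executemany(
    "INSERT INTO motions (title, policy_area) VALUES (?, ?)",
    [("Voorbeeldmotie A", "Klimaat"),
     ("Voorbeeldmotie B", "Wonen"),
     ("Voorbeeldmotie C", None),
     ("Voorbeeldmotie D", "Klimaat")],
)
print(f"policy_area coverage: {policy_area_coverage(conn):.0%}")
```

If coverage turns out to be low, R4 can still be applied, but the dominant `policy_area` should then be treated as a weak signal next to the keywords.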
## Outstanding Questions

### Resolve Before Planning
(none)

### Deferred to Planning
- **Tooling approach**: Use parallel subagents (one per axis) to analyze 50 motions each and derive labels, rather than a single sequential script. Each subagent produces a suggested label independently.

## Next Steps

→ `/ce:plan` for structured implementation planning