---
title: "Motion semantic drift analysis over time"
type: feat
status: active
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md
---

# Motion Semantic Drift Analysis Over Time

## Overview

Add a new analysis script that tracks how the semantic content of motions on each SVD axis evolves across annual windows (2016-2024). The script produces a markdown report with charts showing axis stability, semantic drift timelines, party voting trajectories, and cross-ideological voting patterns. This is Phase 1 (script + report); a future phase will integrate this into the Streamlit explorer.

## Problem Frame

The SVD explorer shows where parties and motions sit on axes at a point in time, but doesn't reveal how the semantic content evolves. Users can't answer: did "right-wing" motions become more extreme over time? Are the SVD axes themselves stable across windows? Do left-wing parties increasingly vote for right-wing motions? (see origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)

## Requirements Trace

- R1. Compute cosine similarity between SVD component vectors (or motion projection patterns) across all annual windows
- R2. Generate a stability heatmap showing which axes are comparable across time
- R3. Detect axis reordering across windows
- R4. Flag unstable axes
- R5. For each stable axis, compute average fused embedding centroid of top N motions per window
- R6. Track semantic drift using cosine distance between consecutive window centroids
- R7. Identify inflection points where drift accelerated (threshold-based)
- R8. Show example motions before/after inflection points
- R9. For each party, compute voting centroid per window along each stable axis
- R10. Track party trajectories over time
- R11. Detect cross-ideological voting patterns
- R12. Show concrete examples of parties voting against ideological alignment
- R13. Script produces markdown report with embedded charts
- R14. Report includes: stability heatmap, drift timelines, party trajectories, inflection analysis
- R15. Script is parameterized: `--db`, `--windows`, `--top-n`, `--output`

## Scope Boundaries

- Annual windows only (2016-2024); quarterly windows too sparse
- Script + report only; no UI/explorer integration in this phase
- No statistical significance testing beyond basic change-point detection
- SVD component vectors (V^T matrix) not currently stored; must be added to pipeline or computed indirectly

## Context & Research

### Relevant Code and Patterns

- `scripts/generate_svd_json.py`: script structure pattern with `main(argv) -> int`, argparse, ROOT path setup, logger
- `scripts/svd_diagnostics.py`: generates markdown + JSON report from SVD analysis
- `analysis/explorer_data.py`: DuckDB data loading patterns (read_only, try/finally, vector parsing), `load_mp_vectors_by_party_for_window()` for date-aware party normalization
- `analysis/trajectory.py`: existing cross-window drift computation using `_procrustes_align_windows()`
- `pipeline/svd_pipeline.py`: SVD computation; V^T available as `Vt` variable before scaling
- `tests/test_analysis.py`: test patterns with `tmp_path` fixture, `_setup_svd_vectors()` helper, class-based tests
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for cross-ideological voting detection

### Key Technical Decisions

- **matplotlib for static charts**: no matplotlib usage exists in the codebase, so this introduces a new dependency. Alternative: Plotly static image export (already in stack). Decision: use matplotlib for markdown-embedded PNGs, which is simpler for static reports.
- **V^T storage via dedicated entity_type**: store the raw V^T matrix as an `entity_type='vt_matrix'` row in `svd_vectors`. Historical windows won't have V^T, so the motion-ranking correlation fallback is the primary approach for this phase.
- **Axis stability via motion projection patterns with Procrustes alignment**: since V^T may not be available for historical windows, compute axis stability indirectly. First apply Procrustes alignment (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`) to motion vectors across windows, then correlate top-N motion rankings per component. This handles SVD sign ambiguity and rotation.
- **Threshold-based change-point detection**: simple drift rate threshold (no new dependencies). Flag an inflection point when the drift between consecutive windows exceeds 2× the median drift rate.
- **Stability threshold**: cosine similarity > 0.7 classifies an axis as stable. The threshold is parameterized via `--stability-threshold` (default 0.7). The distribution of similarity values is reported in the output for sensitivity assessment.
- **Cross-ideological voting**: use `CANONICAL_RIGHT` from `analysis.config` to identify right-wing motions (high positive loading on axis 1), then detect left-wing parties voting "voor" on those motions. Axis polarity is determined per window using canonical party scores, not global constants.
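
The 2× median rule for change-point detection can be sketched with the standard library alone. This is a minimal illustration, not the planned implementation: the function name `find_inflection_points`, the `factor` parameter, and the sample drift values are all hypothetical.

```python
from statistics import median


def find_inflection_points(drift, factor=2.0):
    """Return indices of drift values that exceed `factor` times the median drift."""
    if len(drift) < 2:
        return []  # a single step has no baseline to compare against
    threshold = factor * median(drift)
    return [i for i, value in enumerate(drift) if value > threshold]


# Hypothetical drift values between consecutive annual windows.
drift_series = [0.05, 0.06, 0.04, 0.21, 0.05]
inflections = find_inflection_points(drift_series)
print(inflections)
```

Because the median is robust to a single outlier, one abrupt jump does not raise its own threshold much. With only a handful of annual windows the median is still a coarse baseline, which is one reason to keep the factor adjustable.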

## Open Questions

### Resolved During Planning

- **Charting library**: matplotlib for static PNG embedding in markdown. Add to `pyproject.toml`.
- **Change-point detection**: Simple threshold on drift rate (2× median). No new dependencies.
- **Party-motion linkage**: Use `mp_votes` table (party voted "voor" on motion). This measures voting alignment, not sponsorship.
- **Axis stability approach**: Two-tier: (a) if V^T available, use cosine similarity; (b) fallback: Procrustes-align motion vectors, then correlate top-N motion rankings per component across windows.
- **Top N for centroids**: Default N=20, parameterized via `--top-n`. Test during execution.

### Deferred to Implementation

- Exact optimal N for top motions per axis: will test N=10, 20, 50 during execution and pick the one with the clearest signal
- Cross-ideological voting threshold (provisional): party voting "voor" on motions where canonical opposite-wing parties have high absolute loadings; will calibrate against baseline

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*

```
┌─────────────────────────────────────────────────────────────────┐
│                     scripts/motion_drift.py                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Load Data                                                    │
│     ├── fused_embeddings (per window, per motion)               │
│     ├── svd_vectors (motion projections per window)             │
│     ├── mp_votes (party voting records)                         │
│     └── motions (text for examples)                             │
│                                                                  │
│  2. Axis Stability                                               │
│     ├── Procrustes-align motion vectors across windows          │
│     ├── Option A: cosine similarity of V^T vectors (if stored)  │
│     ├── Option B: correlate top-N motion rankings per component │
│     └── Output: stability heatmap (window × component matrix)   │
│                                                                  │
│  3. Semantic Drift                                               │
│     ├── For each stable axis:                                   │
│     │   ├── Get top N motions by |loading| per window           │
│     │   ├── Compute fused embedding centroid per window         │
│     │   └── Cosine distance between consecutive windows         │
│     └── Output: drift timeline per axis + inflection points     │
│                                                                  │
│  4. Party Voting Analysis                                        │
│     ├── For each party (with date-aware name normalization):    │
│     │   ├── Get motions party voted "voor" on per window        │
│     │   └── Compute voting centroid along each stable axis      │
│     ├── Cross-ideological detection (per-window axis polarity): │
│     │   ├── Left parties voting "voor" on right-wing motions    │
│     │   └── Right parties voting "voor" on left-wing motions    │
│     └── Output: party trajectory plots + cross-voting examples  │
│                                                                  │
│  5. Report Generation                                            │
│     ├── Markdown with embedded matplotlib PNGs                  │
│     ├── Axis stability heatmap                                  │
│     ├── Semantic drift timelines                                │
│     ├── Party trajectory plots                                  │
│     └── Inflection point analysis with motion examples          │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Units

- [ ] **Unit 1: Add matplotlib dependency and script scaffolding**

**Goal:** Set up the new script with proper structure and dependencies.

**Requirements:** R15

**Dependencies:** None

**Files:**
- Modify: `pyproject.toml` (add matplotlib)
- Create: `scripts/motion_drift.py`
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `matplotlib>=3.8` to `pyproject.toml` dependencies
- Create `scripts/motion_drift.py` following established script pattern: `main(argv) -> int`, argparse with `--db`, `--windows`, `--top-n`, `--output`, ROOT path setup, module logger
- Add schema validation at startup: check for required tables (`svd_vectors`, `fused_embeddings`, `mp_votes`, `motions`)
- Create minimal `tests/test_motion_drift.py` with import test, argument parsing test, and schema validation test using in-memory DuckDB fixture
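
A possible shape for the scaffolding, as a sketch only. The flag names come from this plan; the default values for `--db` and `--output`, the `nargs="*"` format of `--windows`, and the helpers `build_parser` and `missing_tables` are assumptions for illustration. The database is not opened here, so the table check works on a plain list of table names.

```python
import argparse
import logging
from pathlib import Path

logger = logging.getLogger("motion_drift")

# Tables the plan lists as required at startup.
REQUIRED_TABLES = {"svd_vectors", "fused_embeddings", "mp_votes", "motions"}


def build_parser():
    parser = argparse.ArgumentParser(description="Motion semantic drift analysis")
    # Defaults below are assumptions, not values fixed by the plan.
    parser.add_argument("--db", default="data/motions.db", help="path to the DuckDB file")
    parser.add_argument("--windows", nargs="*", default=None, help="annual windows to analyse")
    parser.add_argument("--top-n", type=int, default=20, help="top motions per axis and window")
    parser.add_argument("--stability-threshold", type=float, default=0.7)
    parser.add_argument("--output", default="output", help="report output directory")
    return parser


def missing_tables(existing):
    """Return required tables absent from `existing`, sorted for stable error messages."""
    return sorted(REQUIRED_TABLES - set(existing))


def main(argv=None):
    args = build_parser().parse_args(argv)
    if not Path(args.db).is_file():
        logger.error("database not found: %s", args.db)
        return 1
    # Real code would open the database read-only here, list its tables,
    # and return 1 when missing_tables(...) is non-empty.
    return 0
```

Note that argparse handles `--help` by raising `SystemExit(0)` rather than returning, so a test for that scenario would wrap the call in `pytest.raises(SystemExit)` and check the exit code.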

**Patterns to follow:**
- `scripts/generate_svd_json.py`: script structure, argparse, entry point
- `scripts/svd_diagnostics.py`: report generation pattern
- `tests/test_analysis.py`: `tmp_path` fixture, `_setup_svd_vectors()` helper

**Test scenarios:**
- Happy path: `main(["--help"])` exits with code 0 and prints usage
- Happy path: `main(["--db", "data/motions.db", "--output", "/tmp/test"])` runs without error
- Edge case: `main(["--db", "nonexistent.db"])` handles missing database gracefully (exit code 1)
- Edge case: database with missing tables produces clear error message

**Verification:**
- `uv run python scripts/motion_drift.py --help` shows all arguments
- `uv run python -m pytest tests/test_motion_drift.py -q` passes

- [ ] **Unit 2: Axis stability analysis**

**Goal:** Compute axis stability across annual windows and generate stability heatmap.

**Requirements:** R1, R2, R3, R4

**Dependencies:** Unit 1

**Files:**
- Create: `analysis/motion_drift.py` (core analysis module)
- Modify: `scripts/motion_drift.py` (call axis stability)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Create `analysis/motion_drift.py` with `compute_axis_stability(db_path, windows)` function
- Two-tier approach:
  - Try loading V^T from `svd_vectors` where `entity_type='vt_matrix'` (if stored by pipeline)
  - Fallback: apply Procrustes alignment to motion vectors across windows (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`), then for each window get top N motions per component by absolute score and compute pairwise cosine similarity of motion ranking vectors
- Generate stability heatmap as matplotlib figure (window × component matrix, color-coded by similarity)
- Return stability report: which axes are stable (similarity > 0.7), which are reordered (high similarity to different component index), which are unstable (low similarity to any component)
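
The stable / reordered / unstable classification can be sketched as follows, given a similarity matrix between the components of two windows. The function name `classify_axes` and the sample matrix are hypothetical, and taking absolute values is an assumption that guards against any sign flip left after alignment.

```python
def classify_axes(similarity, threshold=0.7):
    """Classify each component of window A against the components of window B.

    similarity[i][j] is the similarity between component i in window A and
    component j in window B. Absolute values are compared because the sign
    of an SVD component is arbitrary.
    """
    labels = {}
    for i, row in enumerate(similarity):
        best_j = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best_j]) <= threshold:
            labels[i] = ("unstable", None)     # no component in B matches well
        elif best_j == i:
            labels[i] = ("stable", i)          # same index matches best
        else:
            labels[i] = ("reordered", best_j)  # matches a different index
    return labels


# Hypothetical 3x3 similarity matrix between two annual windows.
sim = [
    [0.92, 0.10, 0.05],
    [0.08, 0.12, -0.85],
    [0.20, 0.30, 0.25],
]
labels = classify_axes(sim)
```

In this toy matrix, component 0 keeps its index, component 1 reappears as component 2 with flipped sign, and component 2 has no good match in the other window.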

**Patterns to follow:**
- `analysis/explorer_data.py`: DuckDB loading patterns, vector parsing
- `analysis/trajectory.py`: `_procrustes_align_windows()` for cross-window comparison

**Test scenarios:**
- Happy path: `compute_axis_stability` returns stability matrix for 3+ windows with synthetic data
- Happy path: stability matrix is symmetric and values are in [-1, 1]
- Happy path: Procrustes alignment corrects sign flips between windows
- Edge case: single window returns empty stability report (no comparison possible)
- Edge case: windows with no motion vectors handled gracefully (warning logged, skipped)
- Integration: run against real `data/motions.db` annual windows, verify heatmap is generated

**Verification:**
- Stability heatmap PNG generated with correct dimensions (windows × components)
- Stability report identifies at least some axes as stable (similarity > 0.7)

- [ ] **Unit 3: Semantic drift analysis**

**Goal:** Compute semantic drift timelines for stable axes and detect inflection points.

**Requirements:** R5, R6, R7, R8

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add drift functions)
- Modify: `scripts/motion_drift.py` (call drift analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_semantic_drift(db_path, stable_axes, windows, top_n)` function
- For each stable axis:
  - Get top N motions per window by absolute SVD loading
  - Compute average fused embedding centroid per window
  - Compute cosine distance between consecutive window centroids
- Detect inflection points: where drift rate exceeds 2× median drift rate
- For each inflection point, extract example motions (top 3 before/after by loading)
- Generate drift timeline plot per axis (line chart with inflection point markers)
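
The centroid and consecutive-distance steps can be sketched in plain Python. This assumes the top-N embeddings for one axis have already been selected per window; the helper names `centroid`, `cosine_distance`, and `drift_series`, and the toy two-dimensional embeddings, are illustrative only. The real module would more likely operate on arrays.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity, which lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return 1.0 - dot / norm


def drift_series(embeddings_by_window):
    """Cosine distance between centroids of consecutive windows, in sorted window order."""
    windows = sorted(embeddings_by_window)
    centroids = [centroid(embeddings_by_window[w]) for w in windows]
    return [
        (windows[i], windows[i + 1], cosine_distance(centroids[i], centroids[i + 1]))
        for i in range(len(windows) - 1)
    ]


# Toy embeddings for the top motions on one axis (hypothetical values).
toy = {
    "2019": [[1.0, 0.0], [1.0, 0.0]],
    "2020": [[1.0, 0.0], [0.0, 1.0]],
    "2021": [[0.0, 1.0], [0.0, 1.0]],
}
series = drift_series(toy)
```

Each entry pairs two adjacent windows with their centroid distance, so a series over k windows has k - 1 values. That is why an axis with only two windows yields a single drift value and no inflection point.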

**Patterns to follow:**
- `analysis/trajectory.py`: `compute_trajectories()` for cross-window drift computation
- `scripts/svd_diagnostics.py`: markdown report generation

**Test scenarios:**
- Happy path: `compute_semantic_drift` returns drift series for each stable axis
- Happy path: drift values are in [0, 2] (cosine distance range)
- Happy path: inflection points detected when synthetic data has abrupt change
- Edge case: axis with only 2 windows returns drift but no inflection points
- Edge case: axis with uniform (constant-rate) drift returns no inflection points
- Integration: run against real data, verify drift timelines are plausible

**Verification:**
- Drift timeline PNG generated per stable axis
- Inflection points (if any) are marked on timeline with motion examples in report

- [ ] **Unit 4: Party voting analysis**

**Goal:** Compute party voting centroids and detect cross-ideological voting patterns.

**Requirements:** R9, R10, R11, R12

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add party analysis functions)
- Modify: `scripts/motion_drift.py` (call party analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_party_voting(db_path, stable_axes, windows)` function
- For each party:
  - Query `mp_votes` for motions party voted "voor" on per window, using date-aware party name normalization (reuse `load_mp_vectors_by_party_for_window()` pattern from `analysis/explorer_data.py`)
  - For each motion, get its SVD scores from `svd_vectors`
  - Compute unweighted mean score along each stable axis (voting centroid)
- Track party trajectories: plot party centroid position per window along each axis
- Detect cross-ideological voting:
  - For each window, independently determine axis polarity by checking where canonical right-wing parties (CANONICAL_RIGHT) score on each axis
  - Identify "right-wing" motions (high positive loading on axis where PVV/FVD/JA21/SGP score high after polarity check)
  - Find left-wing parties (SP, PvdA, GL, etc.) voting "voor" on right-wing motions
  - Compute cross-voting rate per party per window
  - Detect trends: is cross-voting increasing or decreasing over time?
- Generate party trajectory plots and cross-voting summary table
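
A rough sketch of the per-window polarity check and the cross-voting rate. Several things here are assumptions: the party sets are literal stand-ins for `CANONICAL_RIGHT`/`CANONICAL_LEFT` (whose actual contents are not shown in this plan), `min_loading=0.5` is a placeholder for the threshold still to be calibrated, and the rate is defined as the share of right-wing motions a party supported, which is only one of several reasonable definitions.

```python
def axis_polarity(party_scores, right_parties, left_parties):
    """Return +1 if right-wing parties sit on the positive side of the axis, else -1."""
    right = [party_scores[p] for p in right_parties if p in party_scores]
    left = [party_scores[p] for p in left_parties if p in party_scores]
    if not right or not left:
        raise ValueError("need at least one scored party on each side")
    right_mean = sum(right) / len(right)
    left_mean = sum(left) / len(left)
    return 1 if right_mean >= left_mean else -1


def cross_voting_rate(voted_for, motion_loadings, polarity, min_loading=0.5):
    """Share of right-wing motions (after the polarity flip) that a party voted "voor" on."""
    right_motions = {
        motion for motion, loading in motion_loadings.items()
        if polarity * loading >= min_loading
    }
    if not right_motions:
        return float("nan")  # no right-wing motions in this window
    return len(right_motions & set(voted_for)) / len(right_motions)


# Hypothetical scores for one window: right-wing parties land on the negative side.
scores = {"PVV": -0.8, "SGP": -0.4, "SP": 0.7, "GL": 0.6}
polarity = axis_polarity(scores, {"PVV", "FVD", "JA21", "SGP"}, {"SP", "PvdA", "GL"})

loadings = {"m1": -0.9, "m2": -0.6, "m3": 0.8, "m4": -0.1}
sp_rate = cross_voting_rate(["m1", "m3"], loadings, polarity)
```

Multiplying loadings by the polarity means "right-wing" is always the positive direction downstream, regardless of how the SVD oriented the axis in a given window.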

**Patterns to follow:**
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for party classification
- `analysis/explorer_data.py`: `mp_votes` query patterns, `load_mp_vectors_by_party_for_window()` for party normalization

**Test scenarios:**
- Happy path: `compute_party_voting` returns voting centroids for parties with sufficient data
- Happy path: cross-ideological voting detected when synthetic data has left party voting on right motions
- Happy path: party name normalization maps historical names (GL, PvdA → GroenLinks-PvdA) correctly
- Edge case: party with no "voor" votes in a window handled gracefully (centroid = NaN, skipped)
- Edge case: window with no voting data handled gracefully
- Integration: run against real data, verify party trajectories are plausible

**Verification:**
- Party trajectory PNG generated showing party movement across windows
- Cross-voting summary table in report with at least one example

- [ ] **Unit 5: Report generation**

**Goal:** Assemble all analysis outputs into a markdown report with embedded charts.

**Requirements:** R13, R14, R15

**Dependencies:** Units 2, 3, 4

**Files:**
- Modify: `scripts/motion_drift.py` (orchestrate report generation)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `_generate_report(output_dir, stability_result, drift_result, party_result)` function
- Generate markdown with sections:
  - Summary (key findings, number of stable axes, inflection points, cross-voting trends)
  - Axis Stability (heatmap + interpretation)
  - Semantic Drift (timeline per axis + inflection point analysis with motion examples)
  - Party Voting Analysis (trajectory plots + cross-voting summary + examples)
  - Methodology (brief description of approach, parameters used)
- Save all matplotlib figures as PNGs in output directory
- Embed PNGs in markdown using relative paths
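
One way the markdown assembly might look, leaving chart rendering aside. The function `write_report`, the report title, the section bodies, and the PNG file names are hypothetical; only `report.md`, the section names, and the relative-path embedding come from this plan.

```python
import tempfile
from pathlib import Path


def write_report(output_dir, sections):
    """Write report.md into output_dir; `sections` holds (title, body, png_name) tuples."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # covers the "directory does not exist" case
    lines = ["# Motion Semantic Drift Report", ""]
    for title, body, png_name in sections:
        lines += [f"## {title}", "", body, ""]
        if png_name is not None:
            # Relative path: the report stays portable while the PNG sits next to it.
            lines += [f"![{title}]({png_name})", ""]
    report_path = out / "report.md"
    report_path.write_text("\n".join(lines), encoding="utf-8")
    return report_path


# Usage with placeholder content; figures would be saved separately into the same directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_report(
        Path(tmp) / "nested" / "out",
        [
            ("Summary", "No stable axes found.", None),
            ("Axis Stability", "Heatmap of pairwise similarity.", "stability_heatmap.png"),
            ("Methodology", "Annual windows, top N = 20.", None),
        ],
    )
    text = path.read_text(encoding="utf-8")
    report_name = path.name
```

Passing `None` as the image name lets a section render as text only, which fits the case where no stable axes exist and the drift and party sections are skipped.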

**Patterns to follow:**
- `scripts/svd_diagnostics.py`: markdown report structure
- `scripts/generate_svd_json.py`: `_generate_markdown_report()` function

**Test scenarios:**
- Happy path: report generated with all sections and embedded images
- Happy path: all PNG files exist in output directory
- Edge case: no stable axes → report notes this and skips drift/party sections
- Edge case: output directory creation when it doesn't exist

**Verification:**
- `output/report.md` exists and contains all expected sections
- All referenced PNG files exist in output directory
- Report is readable in a markdown viewer

## System-Wide Impact

- **Interaction graph:** New script reads from existing DuckDB tables; no writes to production data. Pipeline change needed to store V^T matrix (optional, for future windows).
- **Unchanged invariants:** SVD computation unchanged. Explorer unchanged. Existing analysis modules unchanged.
- **New dependency:** `matplotlib` added to `pyproject.toml`. First use of matplotlib in codebase.

## Risks & Dependencies

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| matplotlib introduces new dependency burden | Low | Low | Widely used, well-maintained library. Alternative: use Plotly static export if team prefers single viz stack. |
| V^T matrix not available for historical windows | High | Medium | Fallback to Procrustes-aligned motion ranking correlation (works with existing data). Store V^T going forward. |
| Sparse data in early windows (2016-2018: 124-162 motions) | Medium | Medium | Script warns about low-coverage windows; analysis focuses on 2019+ where data is richer. |
| Cross-ideological voting detection threshold too sensitive/insensitive | Medium | Low | Threshold is parameterized; will calibrate during execution against baseline drift rates. |
| Script exceeds 2-minute runtime on full dataset | Low | Low | JSON parsing of fused embeddings is the bottleneck. Will batch-load and cache if needed. |

## Documentation / Operational Notes

- New script: `scripts/motion_drift.py`, usage documented in module docstring
- New analysis module: `analysis/motion_drift.py`, functions documented with docstrings
- Report output: markdown with embedded PNGs, shareable without running the script
- Future: integrate analysis into Streamlit explorer tab (separate plan)

## Sources & References

- **Origin document:** [docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md](docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)
- Related code: `scripts/generate_svd_json.py`, `scripts/svd_diagnostics.py`, `analysis/trajectory.py`, `analysis/explorer_data.py`
- Party sets: `analysis/config.py` (CANONICAL_RIGHT, CANONICAL_LEFT)