---
title: "Motion semantic drift analysis over time"
type: feat
status: active
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md
---
# Motion Semantic Drift Analysis Over Time
## Overview
Add a new analysis script that tracks how the semantic content of motions on each SVD axis evolves across annual windows (2016-2024). The script produces a markdown report with charts showing axis stability, semantic drift timelines, party voting trajectories, and cross-ideological voting patterns. This is Phase 1 (script + report); a future phase will integrate this into the Streamlit explorer.
## Problem Frame
The SVD explorer shows where parties and motions sit on axes at a point in time, but doesn't reveal how the semantic content evolves. Users can't answer: did "right-wing" motions become more extreme over time? Are the SVD axes themselves stable across windows? Do left-wing parties increasingly vote for right-wing motions? (see origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)
## Requirements Trace
- R1. Compute cosine similarity between SVD component vectors (or motion projection patterns) across all annual windows
- R2. Generate a stability heatmap showing which axes are comparable across time
- R3. Detect axis reordering across windows
- R4. Flag unstable axes
- R5. For each stable axis, compute average fused embedding centroid of top N motions per window
- R6. Track semantic drift using cosine distance between consecutive window centroids
- R7. Identify inflection points where drift accelerated (threshold-based)
- R8. Show example motions before/after inflection points
- R9. For each party, compute voting centroid per window along each stable axis
- R10. Track party trajectories over time
- R11. Detect cross-ideological voting patterns
- R12. Show concrete examples of parties voting against ideological alignment
- R13. Script produces markdown report with embedded charts
- R14. Report includes: stability heatmap, drift timelines, party trajectories, inflection analysis
- R15. Script is parameterized: `--db`, `--windows`, `--top-n`, `--output`
## Scope Boundaries
- Annual windows only (2016-2024); quarterly windows too sparse
- Script + report only — no UI/explorer integration in this phase
- No statistical significance testing beyond basic change-point detection
- SVD component vectors (V^T matrix) not currently stored — must be added to pipeline or computed indirectly
## Context & Research
### Relevant Code and Patterns
- `scripts/generate_svd_json.py` — script structure pattern: `main(argv) -> int`, argparse, ROOT path setup, logger
- `scripts/svd_diagnostics.py` — generates markdown + JSON report from SVD analysis
- `analysis/explorer_data.py` — DuckDB data loading patterns (read_only, try/finally, vector parsing), `load_mp_vectors_by_party_for_window()` for date-aware party normalization
- `analysis/trajectory.py` — existing cross-window drift computation using `_procrustes_align_windows()`
- `pipeline/svd_pipeline.py` — SVD computation; V^T available as `Vt` variable before scaling
- `tests/test_analysis.py` — test patterns: `tmp_path` fixture, `_setup_svd_vectors()` helper, class-based tests
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for cross-ideological voting detection
### Key Technical Decisions
- **matplotlib for static charts** — no matplotlib usage exists in codebase; this introduces a new dependency. Alternative: Plotly static image export (already in stack). Decision: use matplotlib for markdown-embedded PNGs; simpler for static reports.
- **V^T storage via dedicated entity_type** — store raw V^T matrix as `entity_type='vt_matrix'` row in `svd_vectors`. Historical windows won't have V^T; motion-ranking correlation fallback is the primary approach for this phase.
- **Axis stability via motion projection patterns with Procrustes alignment** — since V^T may not be available for historical windows, compute axis stability indirectly. First apply Procrustes alignment (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`) to motion vectors across windows, then correlate top-N motion rankings per component. This handles SVD sign ambiguity and rotation.
- **Threshold-based change-point detection** — simple drift rate threshold (no new dependencies). Detect when consecutive drift exceeds 2× median drift rate.
- **Stability threshold** — cosine similarity > 0.7 classifies an axis as stable. The threshold is exposed as `--stability-threshold` (default 0.7), and the distribution of similarity values is reported in the output for sensitivity assessment.
- **Cross-ideological voting** — use `CANONICAL_RIGHT` from `analysis.config` to identify right-wing motions (high positive loading on axis 1), then detect left-wing parties voting "voor" on those motions. Axis polarity determined per-window using canonical party scores, not global constants.
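The per-window polarity check in the last decision can be sketched as follows. This is a minimal illustration, not the planned implementation: the party sets below are hypothetical stand-ins for `CANONICAL_RIGHT`/`CANONICAL_LEFT` in `analysis/config.py`, and `axis_polarity` is an assumed helper name.

```python
from statistics import mean

# Hypothetical stand-ins; the real sets live in analysis/config.py.
CANONICAL_RIGHT = {"PVV", "FVD", "JA21", "SGP"}
CANONICAL_LEFT = {"SP", "PvdA", "GL"}


def axis_polarity(party_scores: dict) -> int:
    """Orient one axis for one window.

    Returns +1 if canonical right-wing parties sit on the positive side,
    -1 if they sit on the negative side, and 0 if the window cannot be
    oriented (a canonical group is missing or the two means coincide).
    """
    right = [s for p, s in party_scores.items() if p in CANONICAL_RIGHT]
    left = [s for p, s in party_scores.items() if p in CANONICAL_LEFT]
    if not right or not left:
        return 0
    diff = mean(right) - mean(left)
    if diff == 0:
        return 0
    return 1 if diff > 0 else -1
```

Multiplying a motion's loading by the returned sign gives a value where "positive" always means "right-wing" for that window, so a sign flip of the SVD axis between windows does not change which motions are treated as right-wing.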
## Open Questions
### Resolved During Planning
- **Charting library**: matplotlib for static PNG embedding in markdown. Add to `pyproject.toml`.
- **Change-point detection**: Simple threshold on drift rate (2× median). No new dependencies.
- **Party-motion linkage**: Use `mp_votes` table — party voted "voor" on motion. This measures voting alignment, not sponsorship.
- **Axis stability approach**: Two-tier — (a) if V^T available, use cosine similarity; (b) fallback: Procrustes-align motion vectors, then correlate top-N motion rankings per component across windows.
- **Top N for centroids**: Default N=20, parameterized via `--top-n`. Test during execution.
### Deferred to Implementation
- Exact optimal N for top motions per axis — will test N=10, 20, 50 during execution and pick the one with clearest signal
- Cross-ideological voting threshold — provisional: party voting "voor" on motions where canonical opposite-wing parties have high absolute loadings; will calibrate against baseline
## High-Level Technical Design
> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*
```
┌─────────────────────────────────────────────────────────────────┐
│ scripts/motion_drift.py │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Load Data │
│ ├── fused_embeddings (per window, per motion) │
│ ├── svd_vectors (motion projections per window) │
│ ├── mp_votes (party voting records) │
│ └── motions (text for examples) │
│ │
│ 2. Axis Stability │
│ ├── Procrustes-align motion vectors across windows │
│ ├── Option A: cosine similarity of V^T vectors (if stored) │
│ └── Option B: correlate top-N motion rankings per component │
│ └── Output: stability heatmap (window × component matrix) │
│ │
│ 3. Semantic Drift │
│ ├── For each stable axis: │
│ │ ├── Get top N motions by |loading| per window │
│ │ ├── Compute fused embedding centroid per window │
│ │ └── Cosine distance between consecutive windows │
│ └── Output: drift timeline per axis + inflection points │
│ │
│ 4. Party Voting Analysis │
│ ├── For each party (with date-aware name normalization): │
│ │ ├── Get motions party voted "voor" on per window │
│ │ └── Compute voting centroid along each stable axis │
│ ├── Cross-ideological detection (per-window axis polarity): │
│ │ ├── Left parties voting "voor" on right-wing motions │
│ │ └── Right parties voting "voor" on left-wing motions │
│ └── Output: party trajectory plots + cross-voting examples │
│ │
│ 5. Report Generation │
│ ├── Markdown with embedded matplotlib PNGs │
│ ├── Axis stability heatmap │
│ ├── Semantic drift timelines │
│ ├── Party trajectory plots │
│ └── Inflection point analysis with motion examples │
└─────────────────────────────────────────────────────────────────┘
```
## Implementation Units
- [ ] **Unit 1: Add matplotlib dependency and script scaffolding**
**Goal:** Set up the new script with proper structure and dependencies.
**Requirements:** R15
**Dependencies:** None
**Files:**
- Modify: `pyproject.toml` (add matplotlib)
- Create: `scripts/motion_drift.py`
- Test: `tests/test_motion_drift.py`
**Approach:**
- Add `matplotlib>=3.8` to `pyproject.toml` dependencies
- Create `scripts/motion_drift.py` following established script pattern: `main(argv) -> int`, argparse with `--db`, `--windows`, `--top-n`, `--output`, ROOT path setup, module logger
- Add schema validation at startup: check for required tables (`svd_vectors`, `fused_embeddings`, `mp_votes`, `motions`)
- Create minimal `tests/test_motion_drift.py` with import test, argument parsing test, and schema validation test using in-memory DuckDB fixture
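A minimal scaffolding sketch for the steps above, using only the standard library. The flag names and required tables come from this plan; the default values, the `build_parser` and `missing_tables` helper names, and the space-separated window format are assumptions. Listing tables from DuckDB is left out so the sketch stays dependency-free.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

logger = logging.getLogger(__name__)

REQUIRED_TABLES = ("svd_vectors", "fused_embeddings", "mp_votes", "motions")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Motion semantic drift analysis")
    # Defaults below are assumptions, not values fixed by the plan.
    parser.add_argument("--db", default="data/motions.db", help="Path to the DuckDB file")
    parser.add_argument("--windows", nargs="+", default=None, help="Annual windows to analyse")
    parser.add_argument("--top-n", type=int, default=20, help="Top motions per axis")
    parser.add_argument("--output", default="output", help="Report output directory")
    return parser


def missing_tables(existing: set) -> list:
    """Return the required tables that are absent from `existing`."""
    return [name for name in REQUIRED_TABLES if name not in existing]


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if not Path(args.db).is_file():
        logger.error("Database not found: %s", args.db)
        return 1
    # Schema validation (via missing_tables) and the analysis steps would follow here.
    return 0
```

Keeping `missing_tables` separate from the database call makes the schema check testable with a plain set of table names.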
**Patterns to follow:**
- `scripts/generate_svd_json.py` — script structure, argparse, entry point
- `scripts/svd_diagnostics.py` — report generation pattern
- `tests/test_analysis.py`: `tmp_path` fixture, `_setup_svd_vectors()` helper
**Test scenarios:**
- Happy path: `main(["--help"])` exits with code 0 and prints usage
- Happy path: `main(["--db", "data/motions.db", "--output", "/tmp/test"])` runs without error
- Edge case: `main(["--db", "nonexistent.db"])` handles missing database gracefully (exit code 1)
- Edge case: database with missing tables produces clear error message
**Verification:**
- `uv run python scripts/motion_drift.py --help` shows all arguments
- `uv run python -m pytest tests/test_motion_drift.py -q` passes
- [ ] **Unit 2: Axis stability analysis**
**Goal:** Compute axis stability across annual windows and generate stability heatmap.
**Requirements:** R1, R2, R3, R4
**Dependencies:** Unit 1
**Files:**
- Create: `analysis/motion_drift.py` (core analysis module)
- Modify: `scripts/motion_drift.py` (call axis stability)
- Test: `tests/test_motion_drift.py`
**Approach:**
- Create `analysis/motion_drift.py` with `compute_axis_stability(db_path, windows)` function
- Two-tier approach:
- Try loading V^T from `svd_vectors` where `entity_type='vt_matrix'` (if stored by pipeline)
- Fallback: apply Procrustes alignment to motion vectors across windows (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`), then for each window get top N motions per component by absolute score and compute pairwise cosine similarity of motion ranking vectors
- Generate stability heatmap as matplotlib figure (window × component matrix, color-coded by similarity)
- Return stability report: which axes are stable (similarity > 0.7), which are reordered (high similarity to different component index), which are unstable (low similarity to any component)
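The stable / reordered / unstable classification can be sketched for one pair of windows as below. This is an illustration under stated assumptions: `classify_axes` is a hypothetical helper, the input vectors stand for whatever per-component representation is compared (V^T rows or ranking vectors), and taking the absolute cosine similarity is a simplification that tolerates sign flips without the Procrustes step described above.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_axes(components_a, components_b, threshold=0.7):
    """Label each component of window A against the components of window B.

    Returns {index: (label, matched_index)} where label is "stable" (same
    index matches), "reordered" (a different index matches) or "unstable"
    (no component exceeds the threshold).
    """
    labels = {}
    for i, vec_a in enumerate(components_a):
        # abs() treats a sign-flipped axis as the same axis (assumption).
        sims = [abs(cosine_similarity(vec_a, vec_b)) for vec_b in components_b]
        if not sims:
            labels[i] = ("unstable", None)
            continue
        best_j = max(range(len(sims)), key=sims.__getitem__)
        if i < len(sims) and sims[i] > threshold:
            labels[i] = ("stable", i)
        elif sims[best_j] > threshold:
            labels[i] = ("reordered", best_j)
        else:
            labels[i] = ("unstable", None)
    return labels
```

Running this for every pair of consecutive windows yields the per-window, per-component values that the heatmap would display.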
**Patterns to follow:**
- `analysis/explorer_data.py` — DuckDB loading patterns, vector parsing
- `analysis/trajectory.py`: `_procrustes_align_windows()` for cross-window comparison
**Test scenarios:**
- Happy path: `compute_axis_stability` returns stability matrix for 3+ windows with synthetic data
- Happy path: stability matrix is symmetric and values are in [-1, 1]
- Happy path: Procrustes alignment corrects sign flips between windows
- Edge case: single window returns empty stability report (no comparison possible)
- Edge case: windows with no motion vectors handled gracefully (warning logged, skipped)
- Integration: run against real `data/motions.db` annual windows, verify heatmap is generated
**Verification:**
- Stability heatmap PNG generated with correct dimensions (windows × components)
- Stability report identifies at least some axes as stable (similarity > 0.7)
- [ ] **Unit 3: Semantic drift analysis**
**Goal:** Compute semantic drift timelines for stable axes and detect inflection points.
**Requirements:** R5, R6, R7, R8
**Dependencies:** Unit 2 (needs stable axis list)
**Files:**
- Modify: `analysis/motion_drift.py` (add drift functions)
- Modify: `scripts/motion_drift.py` (call drift analysis)
- Test: `tests/test_motion_drift.py`
**Approach:**
- Add `compute_semantic_drift(db_path, stable_axes, windows, top_n)` function
- For each stable axis:
- Get top N motions per window by absolute SVD loading
- Compute average fused embedding centroid per window
- Compute cosine distance between consecutive window centroids
- Detect inflection points: where drift rate exceeds 2× median drift rate
- For each inflection point, extract example motions (top 3 before/after by loading)
- Generate drift timeline plot per axis (line chart with inflection point markers)
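The centroid, drift, and inflection steps above can be sketched in plain Python. The function names and the input shape (a dict mapping a window label to the embeddings of that window's top-N motions) are assumptions for illustration; loading embeddings from `fused_embeddings` is not shown.

```python
import math
from statistics import median


def centroid(vectors):
    """Unweighted mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_series(window_vectors):
    """Return [(window, drift from the previous window)] in window order."""
    windows = sorted(window_vectors)
    cents = [centroid(window_vectors[w]) for w in windows]
    return [
        (windows[i + 1], cosine_distance(cents[i], cents[i + 1]))
        for i in range(len(windows) - 1)
    ]


def inflection_points(drift, factor=2.0):
    """Windows whose drift exceeds `factor` times the median drift."""
    if not drift:
        return []
    baseline = median([d for _, d in drift])
    return [w for w, d in drift if d > factor * baseline]
```

With only two windows there is a single drift value, which equals the median and is therefore never flagged. If the median drift is exactly zero, any positive drift is flagged, so a minimum absolute drift may be worth adding during calibration.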
**Patterns to follow:**
- `analysis/trajectory.py`: `compute_trajectories()` for cross-window drift computation
- `scripts/svd_diagnostics.py` — markdown report generation
**Test scenarios:**
- Happy path: `compute_semantic_drift` returns drift series for each stable axis
- Happy path: drift values are in [0, 2] (cosine distance range)
- Happy path: inflection points detected when synthetic data has abrupt change
- Edge case: axis with only 2 windows returns drift but no inflection points
- Edge case: axis with steady drift (constant rate between windows) returns no inflection points
- Integration: run against real data, verify drift timelines are plausible
**Verification:**
- Drift timeline PNG generated per stable axis
- Inflection points (if any) are marked on timeline with motion examples in report
- [ ] **Unit 4: Party voting analysis**
**Goal:** Compute party voting centroids and detect cross-ideological voting patterns.
**Requirements:** R9, R10, R11, R12
**Dependencies:** Unit 2 (needs stable axis list)
**Files:**
- Modify: `analysis/motion_drift.py` (add party analysis functions)
- Modify: `scripts/motion_drift.py` (call party analysis)
- Test: `tests/test_motion_drift.py`
**Approach:**
- Add `compute_party_voting(db_path, stable_axes, windows)` function
- For each party:
- Query `mp_votes` for motions party voted "voor" on per window, using date-aware party name normalization (reuse `load_mp_vectors_by_party_for_window()` pattern from `analysis/explorer_data.py`)
- For each motion, get its SVD scores from `svd_vectors`
- Compute unweighted mean score along each stable axis (voting centroid)
- Track party trajectories: plot party centroid position per window along each axis
- Detect cross-ideological voting:
- For each window, independently determine axis polarity by checking where canonical right-wing parties (`CANONICAL_RIGHT`) score on each axis
- Identify "right-wing" motions (high positive loading on axis where PVV/FVD/JA21/SGP score high after polarity check)
- Find left-wing parties (SP, PvdA, GL, etc.) voting "voor" on right-wing motions
- Compute cross-voting rate per party per window
- Detect trends: is cross-voting increasing or decreasing over time?
- Generate party trajectory plots and cross-voting summary table
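A sketch of the voting centroid and cross-voting rate for one party, one axis, and one window. Several choices here are assumptions, since the plan leaves them open: the rate is defined as the share of the party's scored "voor" votes that fall on right-wing motions, `threshold` is a provisional loading cutoff, and `polarity` is the +1/-1 orientation produced by the per-window polarity check.

```python
import math


def voting_centroid(voor_motion_ids, motion_scores) -> float:
    """Unweighted mean axis score of the motions a party voted "voor" on.

    Returns NaN when the party has no scored "voor" votes in the window.
    """
    scored = [motion_scores[m] for m in voor_motion_ids if m in motion_scores]
    return sum(scored) / len(scored) if scored else math.nan


def cross_voting_rate(voor_motion_ids, motion_scores, polarity, threshold) -> float:
    """Share of a left-wing party's "voor" votes cast on right-wing motions.

    A motion counts as right-wing when its polarity-corrected score exceeds
    `threshold`. Returns NaN when the window cannot be oriented or the
    party has no scored "voor" votes.
    """
    scored = [motion_scores[m] for m in voor_motion_ids if m in motion_scores]
    if not scored or polarity == 0:
        return math.nan
    right_wing = [s for s in scored if polarity * s > threshold]
    return len(right_wing) / len(scored)
```

The mirrored case (right-wing parties voting "voor" on left-wing motions) follows by passing `-polarity`. Comparing the rate across windows gives the trend asked for in the last detection step.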
**Patterns to follow:**
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for party classification
- `analysis/explorer_data.py`: `mp_votes` query patterns, `load_mp_vectors_by_party_for_window()` for party normalization
**Test scenarios:**
- Happy path: `compute_party_voting` returns voting centroids for parties with sufficient data
- Happy path: cross-ideological voting detected when synthetic data has left party voting on right motions
- Happy path: party name normalization maps historical names (GL, PvdA → GroenLinks-PvdA) correctly
- Edge case: party with no "voor" votes in a window handled gracefully (centroid = NaN, skipped)
- Edge case: window with no voting data handled gracefully
- Integration: run against real data, verify party trajectories are plausible
**Verification:**
- Party trajectory PNG generated showing party movement across windows
- Cross-voting summary table in report with at least one example
- [ ] **Unit 5: Report generation**
**Goal:** Assemble all analysis outputs into a markdown report with embedded charts.
**Requirements:** R13, R14, R15
**Dependencies:** Units 2, 3, 4
**Files:**
- Modify: `scripts/motion_drift.py` (orchestrate report generation)
- Test: `tests/test_motion_drift.py`
**Approach:**
- Add `_generate_report(output_dir, stability_result, drift_result, party_result)` function
- Generate markdown with sections:
- Summary (key findings, number of stable axes, inflection points, cross-voting trends)
- Axis Stability (heatmap + interpretation)
- Semantic Drift (timeline per axis + inflection point analysis with motion examples)
- Party Voting Analysis (trajectory plots + cross-voting summary + examples)
- Methodology (brief description of approach, parameters used)
- Save all matplotlib figures as PNGs in output directory
- Embed PNGs in markdown using relative paths
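The report assembly can be sketched as below, assuming the figures have already been saved as PNGs in the output directory. `write_report` and its argument shapes are hypothetical and simpler than the planned `_generate_report(...)` signature; the report title is a placeholder.

```python
from pathlib import Path


def write_report(output_dir, sections, figures) -> Path:
    """Write report.md with one "##" section per entry in `sections`.

    sections: list of (title, body_markdown) pairs, in report order.
    figures:  dict mapping a section title to PNG file names that are
              relative to `output_dir`, so the report stays portable.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    lines = ["# Motion Semantic Drift Report", ""]
    for title, body in sections:
        lines += [f"## {title}", "", body, ""]
        for fig in figures.get(title, []):
            lines += [f"![{title}]({fig})", ""]
    report = out / "report.md"
    report.write_text("\n".join(lines), encoding="utf-8")
    return report
```

Because the image links are relative file names, the output directory can be zipped or moved and the report still renders its charts.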
**Patterns to follow:**
- `scripts/svd_diagnostics.py` — markdown report structure
- `scripts/generate_svd_json.py`: `_generate_markdown_report()` function
**Test scenarios:**
- Happy path: report generated with all sections and embedded images
- Happy path: all PNG files exist in output directory
- Edge case: no stable axes → report notes this and skips drift/party sections
- Edge case: output directory creation when it doesn't exist
**Verification:**
- `output/report.md` exists and contains all expected sections
- All referenced PNG files exist in output directory
- Report is readable in a markdown viewer
## System-Wide Impact
- **Interaction graph:** New script reads from existing DuckDB tables; no writes to production data. Pipeline change needed to store V^T matrix (optional, for future windows).
- **Unchanged invariants:** SVD computation unchanged. Explorer unchanged. Existing analysis modules unchanged.
- **New dependency:** `matplotlib` added to `pyproject.toml`. First use of matplotlib in codebase.
## Risks & Dependencies
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| matplotlib introduces new dependency burden | Low | Low | Already common library; well-maintained. Alternative: use Plotly static export if team prefers single viz stack. |
| V^T matrix not available for historical windows | High | Medium | Fallback to Procrustes-aligned motion ranking correlation (works with existing data). Store V^T going forward. |
| Sparse data in early windows (2016-2018: 124-162 motions) | Medium | Medium | Script warns about low-coverage windows; analysis focuses on 2019+ where data is richer. |
| Cross-ideological voting detection threshold too sensitive/insensitive | Medium | Low | Threshold is parameterized; will calibrate during execution against baseline drift rates. |
| Script exceeds 2-minute runtime on full dataset | Low | Low | JSON parsing of fused embeddings is the bottleneck. Will batch-load and cache if needed. |
## Documentation / Operational Notes
- New script: `scripts/motion_drift.py` — usage documented in module docstring
- New analysis module: `analysis/motion_drift.py` — functions documented with docstrings
- Report output: markdown with embedded PNGs, shareable without running the script
- Future: integrate analysis into Streamlit explorer tab (separate plan)
## Sources & References
- **Origin document:** [docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md](docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)
- Related code: `scripts/generate_svd_json.py`, `scripts/svd_diagnostics.py`, `analysis/trajectory.py`, `analysis/explorer_data.py`
- Party sets: `analysis/config.py` (`CANONICAL_RIGHT`, `CANONICAL_LEFT`)