---
title: "Motion semantic drift analysis over time"
type: feat
status: active
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md
---

# Motion Semantic Drift Analysis Over Time

## Overview

Add a new analysis script that tracks how the semantic content of motions on each SVD axis evolves across annual windows (2016-2024). The script produces a markdown report with charts showing axis stability, semantic drift timelines, party voting trajectories, and cross-ideological voting patterns. This is Phase 1 (script + report); a future phase will integrate this into the Streamlit explorer.

## Problem Frame

The SVD explorer shows where parties and motions sit on axes at a point in time, but doesn't reveal how the semantic content evolves. Users can't answer: did "right-wing" motions become more extreme over time? Are the SVD axes themselves stable across windows? Do left-wing parties increasingly vote for right-wing motions? (see origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)

## Requirements Trace

- R1. Compute cosine similarity between SVD component vectors (or motion projection patterns) across all annual windows
- R2. Generate a stability heatmap showing which axes are comparable across time
- R3. Detect axis reordering across windows
- R4. Flag unstable axes
- R5. For each stable axis, compute average fused embedding centroid of top N motions per window
- R6. Track semantic drift using cosine distance between consecutive window centroids
- R7. Identify inflection points where drift accelerated (threshold-based)
- R8. Show example motions before/after inflection points
- R9. For each party, compute voting centroid per window along each stable axis
- R10. Track party trajectories over time
- R11. Detect cross-ideological voting patterns
- R12. Show concrete examples of parties voting against ideological alignment
- R13. Script produces markdown report with embedded charts
- R14. Report includes: stability heatmap, drift timelines, party trajectories, inflection analysis
- R15. Script is parameterized: `--db`, `--windows`, `--top-n`, `--output`

## Scope Boundaries

- Annual windows only (2016-2024); quarterly windows too sparse
- Script + report only — no UI/explorer integration in this phase
- No statistical significance testing beyond basic change-point detection
- SVD component vectors (V^T matrix) not currently stored — must be added to pipeline or computed indirectly

## Context & Research

### Relevant Code and Patterns

- `scripts/generate_svd_json.py` — script structure pattern: `main(argv) -> int`, argparse, ROOT path setup, logger
- `scripts/svd_diagnostics.py` — generates markdown + JSON report from SVD analysis
- `analysis/explorer_data.py` — DuckDB data loading patterns (read_only, try/finally, vector parsing), `load_mp_vectors_by_party_for_window()` for date-aware party normalization
- `analysis/trajectory.py` — existing cross-window drift computation using `_procrustes_align_windows()`
- `pipeline/svd_pipeline.py` — SVD computation; V^T available as `Vt` variable before scaling
- `tests/test_analysis.py` — test patterns: `tmp_path` fixture, `_setup_svd_vectors()` helper, class-based tests
- `analysis/config.py` — `CANONICAL_RIGHT`/`CANONICAL_LEFT` for cross-ideological voting detection

### Key Technical Decisions

- **matplotlib for static charts** — no matplotlib usage exists in codebase; this introduces a new dependency. Alternative: Plotly static image export (already in stack). Decision: use matplotlib for markdown-embedded PNGs; simpler for static reports.
- **V^T storage via dedicated entity_type** — store raw V^T matrix as `entity_type='vt_matrix'` row in `svd_vectors`. Historical windows won't have V^T; motion-ranking correlation fallback is the primary approach for this phase.
- **Axis stability via motion projection patterns with Procrustes alignment** — since V^T may not be available for historical windows, compute axis stability indirectly. First apply Procrustes alignment (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`) to motion vectors across windows, then correlate top-N motion rankings per component. This handles SVD sign ambiguity and rotation.
- **Threshold-based change-point detection** — simple drift rate threshold (no new dependencies). Detect when consecutive drift exceeds 2× median drift rate.
- **Stability threshold** — an axis is classified as stable when its cross-window cosine similarity exceeds 0.7. The cutoff is exposed as `--stability-threshold` (default 0.7), and the distribution of similarity values is reported in the output so that sensitivity to the cutoff can be assessed.
- **Cross-ideological voting** — use `CANONICAL_RIGHT` from `analysis.config` to identify right-wing motions (high positive loading on axis 1), then detect left-wing parties voting "voor" on those motions. Axis polarity determined per-window using canonical party scores, not global constants.
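
The threshold-based change-point rule listed above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: it assumes drift arrives as a list of cosine distances between consecutive windows, and `find_inflections` and its `factor` parameter are hypothetical names.

```python
from statistics import median


def find_inflections(drift: list[float], factor: float = 2.0) -> list[int]:
    """Return indices of drift steps that exceed `factor` x the median step.

    drift[i] is the cosine distance between window i and window i + 1.
    """
    if len(drift) < 2:
        # A single step equals its own median, so it can never exceed it.
        return []
    baseline = median(drift)
    return [i for i, d in enumerate(drift) if d > factor * baseline]


# Hypothetical drift series for one axis: 9 annual windows give 8 steps.
steps = [0.05, 0.06, 0.04, 0.05, 0.21, 0.06, 0.05, 0.04]
inflections = find_inflections(steps)
print(inflections)  # [4]: only the 0.21 step exceeds 2 x the median of 0.05
```

Because the comparison is strict, a series with a constant drift rate yields no inflection points. If the median is zero, any positive step is flagged, so a small absolute floor may be worth adding during calibration.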

## Open Questions

### Resolved During Planning

- **Charting library**: matplotlib for static PNG embedding in markdown. Add to `pyproject.toml`.
- **Change-point detection**: Simple threshold on drift rate (2× median). No new dependencies.
- **Party-motion linkage**: Use `mp_votes` table — party voted "voor" on motion. This measures voting alignment, not sponsorship.
- **Axis stability approach**: Two-tier — (a) if V^T available, use cosine similarity; (b) fallback: Procrustes-align motion vectors, then correlate top-N motion rankings per component across windows.
- **Top N for centroids**: Default N=20, parameterized via `--top-n`. Test during execution.

### Deferred to Implementation

- Exact optimal N for top motions per axis — will test N=10, 20, 50 during execution and pick the one with clearest signal
- Cross-ideological voting threshold — provisional: party voting "voor" on motions where canonical opposite-wing parties have high absolute loadings; will calibrate against baseline

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*

```
┌─────────────────────────────────────────────────────────────────┐
│                    scripts/motion_drift.py                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Load Data                                                    │
│     ├── fused_embeddings (per window, per motion)                │
│     ├── svd_vectors (motion projections per window)              │
│     ├── mp_votes (party voting records)                          │
│     └── motions (text for examples)                              │
│                                                                  │
│  2. Axis Stability                                               │
│     ├── Procrustes-align motion vectors across windows           │
│     ├── Option A: cosine similarity of V^T vectors (if stored)   │
│     ├── Option B: correlate top-N motion rankings per component  │
│     └── Output: stability heatmap (window × component matrix)    │
│                                                                  │
│  3. Semantic Drift                                               │
│     ├── For each stable axis:                                    │
│     │   ├── Get top N motions by |loading| per window            │
│     │   ├── Compute fused embedding centroid per window          │
│     │   └── Cosine distance between consecutive windows          │
│     └── Output: drift timeline per axis + inflection points      │
│                                                                  │
│  4. Party Voting Analysis                                        │
│     ├── For each party (with date-aware name normalization):     │
│     │   ├── Get motions party voted "voor" on per window         │
│     │   └── Compute voting centroid along each stable axis       │
│     ├── Cross-ideological detection (per-window axis polarity):  │
│     │   ├── Left parties voting "voor" on right-wing motions     │
│     │   └── Right parties voting "voor" on left-wing motions     │
│     └── Output: party trajectory plots + cross-voting examples   │
│                                                                  │
│  5. Report Generation                                            │
│     ├── Markdown with embedded matplotlib PNGs                   │
│     ├── Axis stability heatmap                                   │
│     ├── Semantic drift timelines                                 │
│     ├── Party trajectory plots                                   │
│     └── Inflection point analysis with motion examples           │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Units

- [ ] **Unit 1: Add matplotlib dependency and script scaffolding**

**Goal:** Set up the new script with proper structure and dependencies.

**Requirements:** R15

**Dependencies:** None

**Files:**
- Modify: `pyproject.toml` (add matplotlib)
- Create: `scripts/motion_drift.py`
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `matplotlib>=3.8` to `pyproject.toml` dependencies
- Create `scripts/motion_drift.py` following the established script pattern: `main(argv) -> int`, argparse with `--db`, `--windows`, `--top-n`, `--output` (plus `--stability-threshold`, per Key Technical Decisions), ROOT path setup, module logger
- Add schema validation at startup: check for required tables (`svd_vectors`, `fused_embeddings`, `mp_votes`, `motions`)
- Create minimal `tests/test_motion_drift.py` with import test, argument parsing test, and schema validation test using in-memory DuckDB fixture
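
A minimal sketch of the scaffolding described above, using only the standard library. The default values, the space-separated form of `--windows`, and the helper names `build_parser` and `missing_tables` are assumptions for illustration. The DuckDB table listing is left as a comment rather than guessed at.

```python
import argparse
import logging
from pathlib import Path
from typing import Optional, Sequence

logger = logging.getLogger(__name__)

REQUIRED_TABLES = ("svd_vectors", "fused_embeddings", "mp_votes", "motions")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Motion semantic drift analysis")
    parser.add_argument("--db", type=Path, default=Path("data/motions.db"))
    # Window labels; passing them as space-separated years is an assumption.
    parser.add_argument("--windows", nargs="+",
                        default=[str(y) for y in range(2016, 2025)])
    parser.add_argument("--top-n", type=int, default=20)
    parser.add_argument("--stability-threshold", type=float, default=0.7)
    parser.add_argument("--output", type=Path, default=Path("output"))
    return parser


def missing_tables(existing: set[str]) -> list[str]:
    """Return required tables absent from the database, in a stable order."""
    return [t for t in REQUIRED_TABLES if t not in existing]


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if not args.db.exists():
        logger.error("Database not found: %s", args.db)
        return 1
    # A real run would list the tables via DuckDB here, pass them to
    # missing_tables(), and return 1 with a clear message if any are absent.
    logger.info("Analysing %d windows, top_n=%d", len(args.windows), args.top_n)
    return 0
```

Note that argparse handles `--help` by raising `SystemExit(0)` rather than returning, so the `--help` test scenario would need `pytest.raises(SystemExit)` instead of checking the return value of `main`.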

**Patterns to follow:**
- `scripts/generate_svd_json.py` — script structure, argparse, entry point
- `scripts/svd_diagnostics.py` — report generation pattern
- `tests/test_analysis.py` — `tmp_path` fixture, `_setup_svd_vectors()` helper

**Test scenarios:**
- Happy path: `main(["--help"])` exits with code 0 and prints usage
- Happy path: `main(["--db", "data/motions.db", "--output", "/tmp/test"])` runs without error
- Edge case: `main(["--db", "nonexistent.db"])` handles missing database gracefully (exit code 1)
- Edge case: database with missing tables produces clear error message

**Verification:**
- `uv run python scripts/motion_drift.py --help` shows all arguments
- `uv run python -m pytest tests/test_motion_drift.py -q` passes

- [ ] **Unit 2: Axis stability analysis**

**Goal:** Compute axis stability across annual windows and generate stability heatmap.

**Requirements:** R1, R2, R3, R4

**Dependencies:** Unit 1

**Files:**
- Create: `analysis/motion_drift.py` (core analysis module)
- Modify: `scripts/motion_drift.py` (call axis stability)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Create `analysis/motion_drift.py` with `compute_axis_stability(db_path, windows)` function
- Two-tier approach:
  - Try loading V^T from `svd_vectors` where `entity_type='vt_matrix'` (if stored by pipeline)
  - Fallback: apply Procrustes alignment to motion vectors across windows (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`), then for each window get top N motions per component by absolute score and compute pairwise cosine similarity of motion ranking vectors
- Generate stability heatmap as matplotlib figure (window × component matrix, color-coded by similarity)
- Return stability report: which axes are stable (similarity > 0.7), which are reordered (high similarity to different component index), which are unstable (low similarity to any component)
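
The stable / reordered / unstable classification can be illustrated with a small pure-Python sketch. It assumes each component is already expressed as a vector over entities shared by both windows (which the V^T path or the aligned fallback would have to provide), and `classify_axes` is a hypothetical helper name. Absolute cosine is used here as a simple way to ignore SVD sign flips; it is not a substitute for the Procrustes step.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_axes(prev: list[list[float]], curr: list[list[float]],
                  threshold: float = 0.7) -> dict[int, str]:
    """Label each component of `prev` by its best match among `curr`."""
    labels = {}
    for i, u in enumerate(prev):
        # Sign-invariant similarity to every component of the next window.
        sims = [abs(cosine(u, v)) for v in curr]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] <= threshold:
            labels[i] = "unstable"
        elif best == i:
            labels[i] = "stable"
        else:
            labels[i] = f"reordered->{best}"
    return labels


# Toy data: component 0 flips sign, component 2 moves to index 1,
# and component 1 has no close counterpart.
prev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
curr = [[-0.98, 0.1, 0.0], [0.0, 0.1, 0.99], [0.6, 0.6, 0.5]]
labels = classify_axes(prev, curr)
print(labels)  # {0: 'stable', 1: 'unstable', 2: 'reordered->1'}
```

This greedy best-match can assign two components of one window to the same component of the next; a one-to-one matching may be preferable if that occurs in real data.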

**Patterns to follow:**
- `analysis/explorer_data.py` — DuckDB loading patterns, vector parsing
- `analysis/trajectory.py` — `_procrustes_align_windows()` for cross-window comparison

**Test scenarios:**
- Happy path: `compute_axis_stability` returns stability matrix for 3+ windows with synthetic data
- Happy path: the pairwise window × window similarity matrix for each component is symmetric and its values are in [-1, 1]
- Happy path: Procrustes alignment corrects sign flips between windows
- Edge case: single window returns empty stability report (no comparison possible)
- Edge case: windows with no motion vectors handled gracefully (warning logged, skipped)
- Integration: run against real `data/motions.db` annual windows, verify heatmap is generated

**Verification:**
- Stability heatmap PNG generated with correct dimensions (windows × components)
- Stability report identifies at least some axes as stable (similarity > 0.7)

- [ ] **Unit 3: Semantic drift analysis**

**Goal:** Compute semantic drift timelines for stable axes and detect inflection points.

**Requirements:** R5, R6, R7, R8

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add drift functions)
- Modify: `scripts/motion_drift.py` (call drift analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_semantic_drift(db_path, stable_axes, windows, top_n)` function
- For each stable axis:
  - Get top N motions per window by absolute SVD loading
  - Compute average fused embedding centroid per window
  - Compute cosine distance between consecutive window centroids
  - Detect inflection points: where drift rate exceeds 2× median drift rate
- For each inflection point, extract example motions (top 3 before/after by loading)
- Generate drift timeline plot per axis (line chart with inflection point markers)
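
The centroid and drift steps above can be sketched as follows. This is a toy illustration with 2-D embeddings and in-memory dictionaries. The planned function reads from DuckDB, and the helper names and data layout here (`loadings[window][motion_id]`, `embeddings[motion_id]`) are assumptions.

```python
import math


def top_n_ids(loadings: dict[str, float], n: int) -> list[str]:
    """IDs of the n motions with the largest absolute loading on an axis."""
    return sorted(loadings, key=lambda m: abs(loadings[m]), reverse=True)[:n]


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_distance(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_series(loadings, embeddings, n):
    """Cosine distance between top-n centroids of consecutive windows."""
    windows = sorted(loadings)  # year labels sort chronologically
    centroids = [
        centroid([embeddings[m] for m in top_n_ids(loadings[w], n)])
        for w in windows
    ]
    return [
        (windows[i], windows[i + 1], cosine_distance(centroids[i], centroids[i + 1]))
        for i in range(len(windows) - 1)
    ]


loadings = {
    "2016": {"a": 0.9, "b": -0.8, "c": 0.1},
    "2017": {"d": 0.7, "e": -0.9, "f": 0.2},
    "2018": {"g": 0.8, "h": 0.6, "i": -0.1},
}
embeddings = {
    "a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0],
    "d": [1.0, 0.0], "e": [0.8, 0.6], "f": [0.0, 1.0],
    "g": [0.0, 1.0], "h": [0.0, 1.0], "i": [1.0, 0.0],
}
series = drift_series(loadings, embeddings, n=2)
for start, end, dist in series:
    print(f"{start}->{end}: {dist:.3f}")
```

In this toy data the 2017 to 2018 step is much larger than the 2016 to 2017 step, which is the kind of jump the inflection rule is meant to surface.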

**Patterns to follow:**
- `analysis/trajectory.py` — `compute_trajectories()` for cross-window drift computation
- `scripts/svd_diagnostics.py` — markdown report generation

**Test scenarios:**
- Happy path: `compute_semantic_drift` returns drift series for each stable axis
- Happy path: drift values are in [0, 2] (cosine distance range)
- Happy path: inflection points detected when synthetic data has abrupt change
- Edge case: axis with only 2 windows returns drift but no inflection points
- Edge case: axis with a constant drift rate returns no inflection points (no step exceeds 2× the median)
- Integration: run against real data, verify drift timelines are plausible

**Verification:**
- Drift timeline PNG generated per stable axis
- Inflection points (if any) are marked on timeline with motion examples in report

- [ ] **Unit 4: Party voting analysis**

**Goal:** Compute party voting centroids and detect cross-ideological voting patterns.

**Requirements:** R9, R10, R11, R12

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add party analysis functions)
- Modify: `scripts/motion_drift.py` (call party analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_party_voting(db_path, stable_axes, windows)` function
- For each party:
  - Query `mp_votes` for motions party voted "voor" on per window, using date-aware party name normalization (reuse `load_mp_vectors_by_party_for_window()` pattern from `analysis/explorer_data.py`)
  - For each motion, get its SVD scores from `svd_vectors`
  - Compute unweighted mean score along each stable axis (voting centroid)
- Track party trajectories: plot party centroid position per window along each axis
- Detect cross-ideological voting:
  - For each window, independently determine axis polarity by checking where canonical right-wing parties (CANONICAL_RIGHT) score on each axis
  - Identify "right-wing" motions (high positive loading on axis where PVV/FVD/JA21/SGP score high after polarity check)
  - Find left-wing parties (SP, PvdA, GL, etc.) voting "voor" on right-wing motions
  - Compute cross-voting rate per party per window
  - Detect trends: is cross-voting increasing or decreasing over time?
- Generate party trajectory plots and cross-voting summary table
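
A small sketch of the per-window logic above, for a single axis and a single window. The party sets are stand-ins for `CANONICAL_RIGHT` and `CANONICAL_LEFT` (the real sets live in `analysis/config.py`), the `cutoff` for what counts as a right-wing motion is a hypothetical parameter, and votes are modelled as `(party, motion_id, vote)` tuples purely for illustration.

```python
from statistics import mean

RIGHT = {"PVV", "FVD"}  # stand-in for analysis.config.CANONICAL_RIGHT
LEFT = {"SP", "GL"}     # stand-in for analysis.config.CANONICAL_LEFT


def party_centroids(scores, votes):
    """Unweighted mean axis score of the motions each party voted 'voor' on."""
    by_party = {}
    for party, motion_id, vote in votes:
        if vote == "voor" and motion_id in scores:
            by_party.setdefault(party, []).append(scores[motion_id])
    return {p: mean(v) for p, v in by_party.items()}


def axis_polarity(centroids):
    """+1 if canonical right parties sit on the positive side, else -1."""
    right = [c for p, c in centroids.items() if p in RIGHT]
    if not right:
        return 1  # no canonical party present: keep the axis as-is
    return 1 if mean(right) >= 0 else -1


def cross_voting_rate(party, scores, votes, polarity, cutoff=0.5):
    """Share of a party's 'voor' votes cast on right-wing motions."""
    voor = [m for p, m, v in votes if p == party and v == "voor" and m in scores]
    if not voor:
        return float("nan")  # no data for this party in this window
    hits = sum(1 for m in voor if polarity * scores[m] > cutoff)
    return hits / len(voor)


# Toy window in which the right wing happens to land on the negative side.
scores = {"m1": -0.8, "m2": -0.6, "m3": 0.7, "m4": 0.1}
votes = [
    ("PVV", "m1", "voor"), ("PVV", "m2", "voor"), ("PVV", "m3", "tegen"),
    ("FVD", "m1", "voor"),
    ("SP", "m2", "voor"), ("SP", "m3", "voor"), ("SP", "m4", "voor"),
    ("GL", "m3", "voor"),
]
polarity = axis_polarity(party_centroids(scores, votes))
rates = {p: cross_voting_rate(p, scores, votes, polarity) for p in sorted(LEFT)}
print(polarity, rates)
```

Multiplying scores by the per-window polarity before applying the cutoff keeps "right-wing motion" meaningful even when SVD flips the sign of an axis between windows.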

**Patterns to follow:**
- `analysis/config.py` — `CANONICAL_RIGHT`/`CANONICAL_LEFT` for party classification
- `analysis/explorer_data.py` — `mp_votes` query patterns, `load_mp_vectors_by_party_for_window()` for party normalization

**Test scenarios:**
- Happy path: `compute_party_voting` returns voting centroids for parties with sufficient data
- Happy path: cross-ideological voting detected when synthetic data has left party voting on right motions
- Happy path: party name normalization maps historical names (GL, PvdA → GroenLinks-PvdA) correctly
- Edge case: party with no "voor" votes in a window handled gracefully (centroid = NaN, skipped)
- Edge case: window with no voting data handled gracefully
- Integration: run against real data, verify party trajectories are plausible

**Verification:**
- Party trajectory PNG generated showing party movement across windows
- Cross-voting summary table in report with at least one example

- [ ] **Unit 5: Report generation**

**Goal:** Assemble all analysis outputs into a markdown report with embedded charts.

**Requirements:** R13, R14, R15

**Dependencies:** Units 2, 3, 4

**Files:**
- Modify: `scripts/motion_drift.py` (orchestrate report generation)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `_generate_report(output_dir, stability_result, drift_result, party_result)` function
- Generate markdown with sections:
  - Summary (key findings, number of stable axes, inflection points, cross-voting trends)
  - Axis Stability (heatmap + interpretation)
  - Semantic Drift (timeline per axis + inflection point analysis with motion examples)
  - Party Voting Analysis (trajectory plots + cross-voting summary + examples)
  - Methodology (brief description of approach, parameters used)
- Save all matplotlib figures as PNGs in output directory
- Embed PNGs in markdown using relative paths
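
The report assembly described above might look roughly like the sketch below. It assumes the PNGs have already been saved into the output directory by the plotting code. `write_report`, the section tuple layout, and the figure filenames are hypothetical.

```python
import tempfile
from pathlib import Path


def write_report(output_dir: Path, sections) -> Path:
    """Write report.md from (heading, body_text, [png_filename, ...]) tuples."""
    output_dir.mkdir(parents=True, exist_ok=True)
    lines = ["# Motion Semantic Drift Report", ""]
    for heading, body, figures in sections:
        lines += [f"## {heading}", "", body, ""]
        for name in figures:
            # Relative link: the PNG is expected to sit next to report.md.
            lines += [f"![{heading}]({name})", ""]
    report = output_dir / "report.md"
    report.write_text("\n".join(lines), encoding="utf-8")
    return report


sections = [
    ("Summary", "No stable axes were found; drift and party sections are skipped.", []),
    ("Axis Stability", "Similarity of each component across windows.",
     ["stability_heatmap.png"]),
    ("Methodology", "Parameters: top_n=20, stability_threshold=0.7.", []),
]
with tempfile.TemporaryDirectory() as tmp:
    path = write_report(Path(tmp) / "out", sections)
    text = path.read_text(encoding="utf-8")
print(text)
```

Building the section list in the caller keeps the "no stable axes" edge case simple: the orchestrator omits the drift and party sections and adds a note to the summary.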

**Patterns to follow:**
- `scripts/svd_diagnostics.py` — markdown report structure
- `scripts/generate_svd_json.py` — `_generate_markdown_report()` function

**Test scenarios:**
- Happy path: report generated with all sections and embedded images
- Happy path: all PNG files exist in output directory
- Edge case: no stable axes → report notes this and skips drift/party sections
- Edge case: output directory creation when it doesn't exist

**Verification:**
- `output/report.md` exists and contains all expected sections
- All referenced PNG files exist in output directory
- Report is readable in a markdown viewer

## System-Wide Impact

- **Interaction graph:** New script reads from existing DuckDB tables; no writes to production data. Pipeline change needed to store V^T matrix (optional, for future windows).
- **Unchanged invariants:** SVD computation unchanged. Explorer unchanged. Existing analysis modules unchanged.
- **New dependency:** `matplotlib` added to `pyproject.toml`. First use of matplotlib in codebase.

## Risks & Dependencies

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| matplotlib introduces new dependency burden | Low | Low | Already common library; well-maintained. Alternative: use Plotly static export if team prefers single viz stack. |
| V^T matrix not available for historical windows | High | Medium | Fallback to Procrustes-aligned motion ranking correlation (works with existing data). Store V^T going forward. |
| Sparse data in early windows (2016-2018: 124-162 motions) | Medium | Medium | Script warns about low-coverage windows; analysis focuses on 2019+ where data is richer. |
| Cross-ideological voting detection threshold too sensitive/insensitive | Medium | Low | Threshold is parameterized; will calibrate during execution against baseline cross-voting rates. |
| Script exceeds 2-minute runtime on full dataset | Low | Low | JSON parsing of fused embeddings is the bottleneck. Will batch-load and cache if needed. |

## Documentation / Operational Notes

- New script: `scripts/motion_drift.py` — usage documented in module docstring
- New analysis module: `analysis/motion_drift.py` — functions documented with docstrings
- Report output: markdown with embedded PNGs, shareable without running the script
- Future: integrate analysis into Streamlit explorer tab (separate plan)

## Sources & References

- **Origin document:** [docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md](docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)
- Related code: `scripts/generate_svd_json.py`, `scripts/svd_diagnostics.py`, `analysis/trajectory.py`, `analysis/explorer_data.py`
- Party sets: `analysis/config.py` (CANONICAL_RIGHT, CANONICAL_LEFT)