---
title: "Motion semantic drift analysis over time"
type: feat
status: active
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md
---

# Motion Semantic Drift Analysis Over Time

## Overview

Add a new analysis script that tracks how the semantic content of motions on each SVD axis evolves across annual windows (2016-2024). The script produces a markdown report with charts showing axis stability, semantic drift timelines, party voting trajectories, and cross-ideological voting patterns. This is Phase 1 (script + report); a future phase will integrate this into the Streamlit explorer.

## Problem Frame

The SVD explorer shows where parties and motions sit on axes at a point in time, but doesn't reveal how the semantic content evolves. Users can't answer: did "right-wing" motions become more extreme over time? Are the SVD axes themselves stable across windows? Do left-wing parties increasingly vote for right-wing motions? (see origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)

## Requirements Trace

- R1. Compute cosine similarity between SVD component vectors (or motion projection patterns) across all annual windows
- R2. Generate a stability heatmap showing which axes are comparable across time
- R3. Detect axis reordering across windows
- R4. Flag unstable axes
- R5. For each stable axis, compute average fused embedding centroid of top N motions per window
- R6. Track semantic drift using cosine distance between consecutive window centroids
- R7. Identify inflection points where drift accelerated (threshold-based)
- R8. Show example motions before/after inflection points
- R9. For each party, compute voting centroid per window along each stable axis
- R10. Track party trajectories over time
- R11. Detect cross-ideological voting patterns
- R12. Show concrete examples of parties voting against ideological alignment
- R13. Script produces markdown report with embedded charts
- R14. Report includes: stability heatmap, drift timelines, party trajectories, inflection analysis
- R15. Script is parameterized: `--db`, `--windows`, `--top-n`, `--output`

## Scope Boundaries

- Annual windows only (2016-2024); quarterly windows too sparse
- Script + report only; no UI/explorer integration in this phase
- No statistical significance testing beyond basic change-point detection
- SVD component vectors (V^T matrix) not currently stored; must be added to pipeline or computed indirectly

## Context & Research

### Relevant Code and Patterns

- `scripts/generate_svd_json.py`: script structure pattern (`main(argv) -> int`, argparse, ROOT path setup, logger)
- `scripts/svd_diagnostics.py`: generates markdown + JSON report from SVD analysis
- `analysis/explorer_data.py`: DuckDB data loading patterns (read_only, try/finally, vector parsing), `load_mp_vectors_by_party_for_window()` for date-aware party normalization
- `analysis/trajectory.py`: existing cross-window drift computation using `_procrustes_align_windows()`
- `pipeline/svd_pipeline.py`: SVD computation; V^T available as `Vt` variable before scaling
- `tests/test_analysis.py`: test patterns (`tmp_path` fixture, `_setup_svd_vectors()` helper, class-based tests)
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for cross-ideological voting detection

### Key Technical Decisions

- **matplotlib for static charts**: no matplotlib usage exists in codebase; this introduces a new dependency. Alternative: Plotly static image export (already in stack). Decision: use matplotlib for markdown-embedded PNGs; simpler for static reports.
- **V^T storage via dedicated entity_type**: store raw V^T matrix as `entity_type='vt_matrix'` row in `svd_vectors`. Historical windows won't have V^T; motion-ranking correlation fallback is the primary approach for this phase.
- **Axis stability via motion projection patterns with Procrustes alignment**: since V^T may not be available for historical windows, compute axis stability indirectly. First apply Procrustes alignment (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`) to motion vectors across windows, then correlate top-N motion rankings per component. This handles SVD sign ambiguity and rotation.
- **Threshold-based change-point detection**: simple drift rate threshold (no new dependencies). Flag an inflection point when the drift between consecutive windows exceeds 2× the median drift rate.
- **Stability threshold**: cosine similarity > 0.7 classifies axes as stable. Parameterized via `--stability-threshold` (default 0.7). The distribution of similarity values is reported in the output for sensitivity assessment.
- **Cross-ideological voting**: use `CANONICAL_RIGHT` from `analysis.config` to identify right-wing motions (high positive loading on axis 1), then detect left-wing parties voting "voor" on those motions. Axis polarity is determined per window using canonical party scores, not global constants.

## Open Questions

### Resolved During Planning

- **Charting library**: matplotlib for static PNG embedding in markdown. Add to `pyproject.toml`.
- **Change-point detection**: Simple threshold on drift rate (2× median). No new dependencies.
- **Party-motion linkage**: Use `mp_votes` table (party voted "voor" on motion). This measures voting alignment, not sponsorship.
- **Axis stability approach**: Two-tier: (a) if V^T is available, use cosine similarity; (b) fallback: Procrustes-align motion vectors, then correlate top-N motion rankings per component across windows.
- **Top N for centroids**: Default N=20, parameterized via `--top-n`. Test during execution.
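
The 2× median rule chosen for change-point detection can be illustrated with a minimal sketch. The function name `detect_inflection_points`, the `factor` parameter, and the sample drift values are assumptions made for illustration; none of them exist in the codebase yet.

```python
from statistics import median


def detect_inflection_points(drift, factor=2.0):
    """Return indices of drift values that exceed `factor` times the median.

    `drift[i]` is assumed to be the cosine distance between the centroids
    of window i and window i + 1 for a single stable axis.
    """
    if len(drift) < 2:
        # A single transition has nothing to be compared against.
        return []
    threshold = factor * median(drift)
    return [i for i, value in enumerate(drift) if value > threshold]


# Hypothetical drift series for eight consecutive window transitions.
drift = [0.05, 0.04, 0.06, 0.21, 0.05, 0.07, 0.04, 0.05]
print(detect_inflection_points(drift))  # index 3 is the only value above 2x the median
```

Note that a median of zero would flag every positive drift value, so an implementation may want a small floor on the threshold.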
### Deferred to Implementation

- Exact optimal N for top motions per axis: will test N=10, 20, 50 during execution and pick the value with the clearest signal
- Cross-ideological voting threshold: provisionally, a party voting "voor" on motions where canonical opposite-wing parties have high absolute loadings; will calibrate against baseline

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*

```
┌─────────────────────────────────────────────────────────────────┐
│                     scripts/motion_drift.py                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Load Data                                                   │
│     ├── fused_embeddings (per window, per motion)               │
│     ├── svd_vectors (motion projections per window)             │
│     ├── mp_votes (party voting records)                         │
│     └── motions (text for examples)                             │
│                                                                 │
│  2. Axis Stability                                              │
│     ├── Procrustes-align motion vectors across windows          │
│     ├── Option A: cosine similarity of V^T vectors (if stored)  │
│     ├── Option B: correlate top-N motion rankings per component │
│     └── Output: stability heatmap (window × component matrix)   │
│                                                                 │
│  3. Semantic Drift                                              │
│     ├── For each stable axis:                                   │
│     │   ├── Get top N motions by |loading| per window           │
│     │   ├── Compute fused embedding centroid per window         │
│     │   └── Cosine distance between consecutive windows         │
│     └── Output: drift timeline per axis + inflection points     │
│                                                                 │
│  4. Party Voting Analysis                                       │
│     ├── For each party (with date-aware name normalization):    │
│     │   ├── Get motions party voted "voor" on per window        │
│     │   └── Compute voting centroid along each stable axis      │
│     ├── Cross-ideological detection (per-window axis polarity): │
│     │   ├── Left parties voting "voor" on right-wing motions    │
│     │   └── Right parties voting "voor" on left-wing motions    │
│     └── Output: party trajectory plots + cross-voting examples  │
│                                                                 │
│  5. Report Generation                                           │
│     ├── Markdown with embedded matplotlib PNGs                  │
│     ├── Axis stability heatmap                                  │
│     ├── Semantic drift timelines                                │
│     ├── Party trajectory plots                                  │
│     └── Inflection point analysis with motion examples          │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Units

- [ ] **Unit 1: Add matplotlib dependency and script scaffolding**

**Goal:** Set up the new script with proper structure and dependencies.

**Requirements:** R15

**Dependencies:** None

**Files:**
- Modify: `pyproject.toml` (add matplotlib)
- Create: `scripts/motion_drift.py`
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `matplotlib>=3.8` to `pyproject.toml` dependencies
- Create `scripts/motion_drift.py` following the established script pattern: `main(argv) -> int`, argparse with `--db`, `--windows`, `--top-n`, `--output`, ROOT path setup, module logger
- Add schema validation at startup: check for required tables (`svd_vectors`, `fused_embeddings`, `mp_votes`, `motions`)
- Create minimal `tests/test_motion_drift.py` with import test, argument parsing test, and schema validation test using in-memory DuckDB fixture

**Patterns to follow:**
- `scripts/generate_svd_json.py`: script structure, argparse, entry point
- `scripts/svd_diagnostics.py`: report generation pattern
- `tests/test_analysis.py`: `tmp_path` fixture, `_setup_svd_vectors()` helper

**Test scenarios:**
- Happy path: `main(["--help"])` exits with code 0 and prints usage
- Happy path: `main(["--db", "data/motions.db", "--output", "/tmp/test"])` runs without error
- Edge case: `main(["--db", "nonexistent.db"])` handles missing database gracefully (exit code 1)
- Edge case: database with missing tables produces clear error message

**Verification:**
- `uv run python scripts/motion_drift.py --help` shows all arguments
- `uv run python -m pytest tests/test_motion_drift.py -q` passes

- [ ] **Unit 2: Axis stability analysis**

**Goal:** Compute axis stability across annual windows and generate stability heatmap.

**Requirements:** R1, R2, R3, R4

**Dependencies:** Unit 1

**Files:**
- Create: `analysis/motion_drift.py` (core analysis module)
- Modify: `scripts/motion_drift.py` (call axis stability)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Create `analysis/motion_drift.py` with `compute_axis_stability(db_path, windows)` function
- Two-tier approach:
  - Try loading V^T from `svd_vectors` where `entity_type='vt_matrix'` (if stored by pipeline)
  - Fallback: apply Procrustes alignment to motion vectors across windows (reuse `_procrustes_align_windows()` from `analysis/trajectory.py`), then for each window get top N motions per component by absolute score and compute pairwise cosine similarity of motion ranking vectors
- Generate stability heatmap as matplotlib figure (window × component matrix, color-coded by similarity)
- Return stability report: which axes are stable (similarity > 0.7), which are reordered (high similarity to a different component index), which are unstable (low similarity to any component)

**Patterns to follow:**
- `analysis/explorer_data.py`: DuckDB loading patterns, vector parsing
- `analysis/trajectory.py`: `_procrustes_align_windows()` for cross-window comparison

**Test scenarios:**
- Happy path: `compute_axis_stability` returns stability matrix for 3+ windows with synthetic data
- Happy path: stability matrix is symmetric and values are in [-1, 1]
- Happy path: Procrustes alignment corrects sign flips between windows
- Edge case: single window returns empty stability report (no comparison possible)
- Edge case: windows with no motion vectors handled gracefully (warning logged, skipped)
- Integration: run against real `data/motions.db` annual windows, verify heatmap is generated

**Verification:**
- Stability heatmap PNG generated with correct dimensions (windows × components)
- Stability report identifies at least some axes as stable (similarity > 0.7)

- [ ] **Unit 3: Semantic drift analysis**

**Goal:** Compute semantic drift timelines for stable axes and detect inflection points.

**Requirements:** R5, R6, R7, R8

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add drift functions)
- Modify: `scripts/motion_drift.py` (call drift analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_semantic_drift(db_path, stable_axes, windows, top_n)` function
- For each stable axis:
  - Get top N motions per window by absolute SVD loading
  - Compute average fused embedding centroid per window
  - Compute cosine distance between consecutive window centroids
- Detect inflection points: where drift rate exceeds 2× median drift rate
- For each inflection point, extract example motions (top 3 before/after by loading)
- Generate drift timeline plot per axis (line chart with inflection point markers)

**Patterns to follow:**
- `analysis/trajectory.py`: `compute_trajectories()` for cross-window drift computation
- `scripts/svd_diagnostics.py`: markdown report generation

**Test scenarios:**
- Happy path: `compute_semantic_drift` returns drift series for each stable axis
- Happy path: drift values are in [0, 2] (cosine distance range)
- Happy path: inflection points detected when synthetic data has abrupt change
- Edge case: axis with only 2 windows returns drift but no inflection points
- Edge case: axis with a constant drift rate returns no inflection points
- Integration: run against real data, verify drift timelines are plausible

**Verification:**
- Drift timeline PNG generated per stable axis
- Inflection points (if any) are marked on timeline with motion examples in report

- [ ] **Unit 4: Party voting analysis**

**Goal:** Compute party voting centroids and detect cross-ideological voting patterns.

**Requirements:** R9, R10, R11, R12

**Dependencies:** Unit 2 (needs stable axis list)

**Files:**
- Modify: `analysis/motion_drift.py` (add party analysis functions)
- Modify: `scripts/motion_drift.py` (call party analysis)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `compute_party_voting(db_path, stable_axes, windows)` function
- For each party:
  - Query `mp_votes` for motions the party voted "voor" on per window, using date-aware party name normalization (reuse `load_mp_vectors_by_party_for_window()` pattern from `analysis/explorer_data.py`)
  - For each motion, get its SVD scores from `svd_vectors`
  - Compute unweighted mean score along each stable axis (voting centroid)
- Track party trajectories: plot party centroid position per window along each axis
- Detect cross-ideological voting:
  - For each window, independently determine axis polarity by checking where canonical right-wing parties (`CANONICAL_RIGHT`) score on each axis
  - Identify "right-wing" motions (high positive loading on the axis where PVV/FVD/JA21/SGP score high after the polarity check)
  - Find left-wing parties (SP, PvdA, GL, etc.) voting "voor" on right-wing motions
  - Compute cross-voting rate per party per window
  - Detect trends: is cross-voting increasing or decreasing over time?
- Generate party trajectory plots and cross-voting summary table

**Patterns to follow:**
- `analysis/config.py`: `CANONICAL_RIGHT`/`CANONICAL_LEFT` for party classification
- `analysis/explorer_data.py`: `mp_votes` query patterns, `load_mp_vectors_by_party_for_window()` for party normalization

**Test scenarios:**
- Happy path: `compute_party_voting` returns voting centroids for parties with sufficient data
- Happy path: cross-ideological voting detected when synthetic data has a left party voting "voor" on right-wing motions
- Happy path: party name normalization maps historical names (GL, PvdA → GroenLinks-PvdA) correctly
- Edge case: party with no "voor" votes in a window handled gracefully (centroid = NaN, skipped)
- Edge case: window with no voting data handled gracefully
- Integration: run against real data, verify party trajectories are plausible

**Verification:**
- Party trajectory PNG generated showing party movement across windows
- Cross-voting summary table in report with at least one example

- [ ] **Unit 5: Report generation**

**Goal:** Assemble all analysis outputs into a markdown report with embedded charts.

**Requirements:** R13, R14, R15

**Dependencies:** Units 2, 3, 4

**Files:**
- Modify: `scripts/motion_drift.py` (orchestrate report generation)
- Test: `tests/test_motion_drift.py`

**Approach:**
- Add `_generate_report(output_dir, stability_result, drift_result, party_result)` function
- Generate markdown with sections:
  - Summary (key findings, number of stable axes, inflection points, cross-voting trends)
  - Axis Stability (heatmap + interpretation)
  - Semantic Drift (timeline per axis + inflection point analysis with motion examples)
  - Party Voting Analysis (trajectory plots + cross-voting summary + examples)
  - Methodology (brief description of approach, parameters used)
- Save all matplotlib figures as PNGs in output directory
- Embed PNGs in markdown using relative paths

**Patterns to follow:**
- `scripts/svd_diagnostics.py`: markdown report structure
- `scripts/generate_svd_json.py`: `_generate_markdown_report()` function

**Test scenarios:**
- Happy path: report generated with all sections and embedded images
- Happy path: all PNG files exist in output directory
- Edge case: no stable axes → report notes this and skips drift/party sections
- Edge case: output directory is created when it doesn't exist

**Verification:**
- `output/report.md` exists and contains all expected sections
- All referenced PNG files exist in output directory
- Report is readable in a markdown viewer

## System-Wide Impact

- **Interaction graph:** New script reads from existing DuckDB tables; no writes to production data. Pipeline change needed to store V^T matrix (optional, for future windows).
- **Unchanged invariants:** SVD computation unchanged. Explorer unchanged. Existing analysis modules unchanged.
- **New dependency:** `matplotlib` added to `pyproject.toml`. First use of matplotlib in codebase.
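
For reviewers, the centroid drift computation described in Unit 3 could look roughly like the following stdlib-only sketch. The helper names (`centroid`, `cosine_distance`, `drift_series`) and the toy two-dimensional embeddings are hypothetical; the actual module would load fused embeddings from DuckDB and would probably rely on NumPy instead.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity, which lies in [0, 2]; NaN for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return float("nan")
    return 1.0 - dot / norm


def drift_series(embeddings_by_window):
    """Cosine distance between the centroids of consecutive windows.

    `embeddings_by_window` maps a window label to the embeddings of the
    top-N motions on one axis for that window (assumed input shape).
    """
    windows = sorted(embeddings_by_window)
    centroids = [centroid(embeddings_by_window[w]) for w in windows]
    return [cosine_distance(a, b) for a, b in zip(centroids, centroids[1:])]


# Toy data: no change from 2016 to 2017, orthogonal shift from 2017 to 2018.
toy = {
    "2016": [[1.0, 0.0], [1.0, 0.0]],
    "2017": [[1.0, 0.0], [1.0, 0.0]],
    "2018": [[0.0, 1.0], [0.0, 1.0]],
}
series = drift_series(toy)
print(series)  # one value per window transition
```

The resulting series has one entry per pair of consecutive windows, which is the input the 2× median inflection rule operates on.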
## Risks & Dependencies

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| matplotlib introduces new dependency burden | Low | Low | Already common library; well-maintained. Alternative: use Plotly static export if team prefers single viz stack. |
| V^T matrix not available for historical windows | High | Medium | Fallback to Procrustes-aligned motion ranking correlation (works with existing data). Store V^T going forward. |
| Sparse data in early windows (2016-2018: 124-162 motions) | Medium | Medium | Script warns about low-coverage windows; analysis focuses on 2019+ where data is richer. |
| Cross-ideological voting detection threshold too sensitive/insensitive | Medium | Low | Threshold is parameterized; will calibrate during execution against baseline drift rates. |
| Script exceeds 2-minute runtime on full dataset | Low | Low | JSON parsing of fused embeddings is the bottleneck. Will batch-load and cache if needed. |

## Documentation / Operational Notes

- New script: `scripts/motion_drift.py`; usage documented in module docstring
- New analysis module: `analysis/motion_drift.py`; functions documented with docstrings
- Report output: markdown with embedded PNGs, shareable without running the script
- Future: integrate analysis into Streamlit explorer tab (separate plan)

## Sources & References

- **Origin document:** [docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md](docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)
- Related code: `scripts/generate_svd_json.py`, `scripts/svd_diagnostics.py`, `analysis/trajectory.py`, `analysis/explorer_data.py`
- Party sets: `analysis/config.py` (CANONICAL_RIGHT, CANONICAL_LEFT)