title: Motion semantic drift analysis over time
type: feat
status: active
date: 2026-04-05
origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md

Motion Semantic Drift Analysis Over Time

Overview

Add a new analysis script that tracks how the semantic content of motions on each SVD axis evolves across annual windows (2016-2024). The script produces a markdown report with charts showing axis stability, semantic drift timelines, party voting trajectories, and cross-ideological voting patterns. This is Phase 1 (script + report); a future phase will integrate this into the Streamlit explorer.

Problem Frame

The SVD explorer shows where parties and motions sit on axes at a point in time, but doesn't reveal how the semantic content evolves. Users can't answer: did "right-wing" motions become more extreme over time? Are the SVD axes themselves stable across windows? Do left-wing parties increasingly vote for right-wing motions? (see origin: docs/brainstorms/2026-04-05-motion-semantic-drift-over-time-requirements.md)

Requirements Trace

  • R1. Compute cosine similarity between SVD component vectors (or motion projection patterns) across all annual windows
  • R2. Generate a stability heatmap showing which axes are comparable across time
  • R3. Detect axis reordering across windows
  • R4. Flag unstable axes
  • R5. For each stable axis, compute average fused embedding centroid of top N motions per window
  • R6. Track semantic drift using cosine distance between consecutive window centroids
  • R7. Identify inflection points where drift accelerated (threshold-based)
  • R8. Show example motions before/after inflection points
  • R9. For each party, compute voting centroid per window along each stable axis
  • R10. Track party trajectories over time
  • R11. Detect cross-ideological voting patterns
  • R12. Show concrete examples of parties voting against ideological alignment
  • R13. Script produces markdown report with embedded charts
  • R14. Report includes: stability heatmap, drift timelines, party trajectories, inflection analysis
  • R15. Script is parameterized: --db, --windows, --top-n, --output

Scope Boundaries

  • Annual windows only (2016-2024); quarterly windows too sparse
  • Script + report only — no UI/explorer integration in this phase
  • No statistical significance testing beyond basic change-point detection
  • SVD component vectors (V^T matrix) not currently stored — must be added to pipeline or computed indirectly

Context & Research

Relevant Code and Patterns

  • scripts/generate_svd_json.py — script structure pattern: main(argv) -> int, argparse, ROOT path setup, logger
  • scripts/svd_diagnostics.py — generates markdown + JSON report from SVD analysis
  • analysis/explorer_data.py — DuckDB data loading patterns (read_only, try/finally, vector parsing), load_mp_vectors_by_party_for_window() for date-aware party normalization
  • analysis/trajectory.py — existing cross-window drift computation using _procrustes_align_windows()
  • pipeline/svd_pipeline.py — SVD computation; V^T available as Vt variable before scaling
  • tests/test_analysis.py — test patterns: tmp_path fixture, _setup_svd_vectors() helper, class-based tests
  • analysis/config.py: CANONICAL_RIGHT/CANONICAL_LEFT for cross-ideological voting detection

Key Technical Decisions

  • matplotlib for static charts — no matplotlib usage exists in codebase; this introduces a new dependency. Alternative: Plotly static image export (already in stack). Decision: use matplotlib for markdown-embedded PNGs; simpler for static reports.
  • V^T storage via dedicated entity_type — store raw V^T matrix as entity_type='vt_matrix' row in svd_vectors. Historical windows won't have V^T; motion-ranking correlation fallback is the primary approach for this phase.
  • Axis stability via motion projection patterns with Procrustes alignment — since V^T may not be available for historical windows, compute axis stability indirectly. First apply Procrustes alignment (reuse _procrustes_align_windows() from analysis/trajectory.py) to motion vectors across windows, then correlate top-N motion rankings per component. This handles SVD sign ambiguity and rotation.
  • Threshold-based change-point detection — simple drift rate threshold (no new dependencies). Detect when consecutive drift exceeds 2× median drift rate.
  • Stability threshold: an axis is classified as stable when its cross-window cosine similarity exceeds 0.7. The cutoff is configurable via --stability-threshold (default 0.7). The distribution of similarity values is reported in the output so the sensitivity of this cutoff can be assessed.
  • Cross-ideological voting — use CANONICAL_RIGHT from analysis.config to identify right-wing motions (high positive loading on axis 1), then detect left-wing parties voting "voor" on those motions. Axis polarity determined per-window using canonical party scores, not global constants.

Open Questions

Resolved During Planning

  • Charting library: matplotlib for static PNG embedding in markdown. Add to pyproject.toml.
  • Change-point detection: Simple threshold on drift rate (2× median). No new dependencies.
  • Party-motion linkage: Use mp_votes table — party voted "voor" on motion. This measures voting alignment, not sponsorship.
  • Axis stability approach: Two-tier — (a) if V^T available, use cosine similarity; (b) fallback: Procrustes-align motion vectors, then correlate top-N motion rankings per component across windows.
  • Top N for centroids: Default N=20, parameterized via --top-n. Test during execution.

Deferred to Implementation

  • Exact optimal N for top motions per axis — will test N=10, 20, 50 during execution and pick the one with clearest signal
  • Cross-ideological voting threshold — provisional: party voting "voor" on motions where canonical opposite-wing parties have high absolute loadings; will calibrate against baseline

High-Level Technical Design

This illustrates the intended approach and is directional guidance for review, not implementation specification.

┌─────────────────────────────────────────────────────────────────┐
│                    scripts/motion_drift.py                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Load Data                                                    │
│     ├── fused_embeddings (per window, per motion)                │
│     ├── svd_vectors (motion projections per window)              │
│     ├── mp_votes (party voting records)                          │
│     └── motions (text for examples)                              │
│                                                                  │
│  2. Axis Stability                                               │
│     ├── Procrustes-align motion vectors across windows           │
│     ├── Option A: cosine similarity of V^T vectors (if stored)   │
│     ├── Option B: correlate top-N motion rankings per component  │
│     └── Output: stability heatmap (window × component matrix)    │
│                                                                  │
│  3. Semantic Drift                                               │
│     ├── For each stable axis:                                     │
│     │   ├── Get top N motions by |loading| per window            │
│     │   ├── Compute fused embedding centroid per window          │
│     │   └── Cosine distance between consecutive windows          │
│     └── Output: drift timeline per axis + inflection points      │
│                                                                  │
│  4. Party Voting Analysis                                        │
│     ├── For each party (with date-aware name normalization):     │
│     │   ├── Get motions party voted "voor" on per window         │
│     │   └── Compute voting centroid along each stable axis       │
│     ├── Cross-ideological detection (per-window axis polarity):  │
│     │   ├── Left parties voting "voor" on right-wing motions     │
│     │   └── Right parties voting "voor" on left-wing motions     │
│     └── Output: party trajectory plots + cross-voting examples   │
│                                                                  │
│  5. Report Generation                                            │
│     ├── Markdown with embedded matplotlib PNGs                   │
│     ├── Axis stability heatmap                                   │
│     ├── Semantic drift timelines                                 │
│     ├── Party trajectory plots                                   │
│     └── Inflection point analysis with motion examples           │
└─────────────────────────────────────────────────────────────────┘

Implementation Units

  • Unit 1: Add matplotlib dependency and script scaffolding

Goal: Set up the new script with proper structure and dependencies.

Requirements: R15

Dependencies: None

Files:

  • Modify: pyproject.toml (add matplotlib)
  • Create: scripts/motion_drift.py
  • Test: tests/test_motion_drift.py

Approach:

  • Add matplotlib>=3.8 to pyproject.toml dependencies
  • Create scripts/motion_drift.py following established script pattern: main(argv) -> int, argparse with --db, --windows, --top-n, --output, ROOT path setup, module logger
  • Add schema validation at startup: check for required tables (svd_vectors, fused_embeddings, mp_votes, motions)
  • Create minimal tests/test_motion_drift.py with import test, argument parsing test, and schema validation test using in-memory DuckDB fixture
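
The scaffolding described above could look roughly like the following minimal sketch. It assumes only the standard library; the default values for --db and --output, the window id format (one id per year), and the helper names build_parser and missing_tables are illustrative assumptions, not names from the existing codebase. The table check takes a plain list of table names so the sketch stays independent of how the DuckDB connection is opened.

```python
import argparse
import logging
from pathlib import Path
from typing import Iterable, List, Optional, Sequence

logger = logging.getLogger(__name__)

# Tables the plan lists as required at startup.
REQUIRED_TABLES = ("svd_vectors", "fused_embeddings", "mp_votes", "motions")


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser with the options named in R15 (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Motion semantic drift analysis")
    parser.add_argument("--db", default="data/motions.db", help="Path to the DuckDB file")
    parser.add_argument(
        "--windows",
        nargs="+",
        default=[str(year) for year in range(2016, 2025)],
        help="Annual window ids to analyse (assumed: one id per year)",
    )
    parser.add_argument("--top-n", type=int, default=20, help="Top motions per axis")
    parser.add_argument("--stability-threshold", type=float, default=0.7)
    parser.add_argument("--output", default="output", help="Report output directory")
    return parser


def missing_tables(existing: Iterable[str]) -> List[str]:
    """Return the required tables that are absent from the given table names."""
    present = set(existing)
    return [name for name in REQUIRED_TABLES if name not in present]


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point: returns 0 on success and 1 when the database file is missing."""
    args = build_parser().parse_args(argv)
    if not Path(args.db).exists():
        logger.error("Database not found: %s", args.db)
        return 1
    # Placeholder: the stability, drift and party analyses would be called here.
    return 0
```

Keeping the parser in its own function makes the argument-parsing test independent of the database, and returning an exit code from main(argv) matches the script pattern the plan refers to.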

Patterns to follow:

  • scripts/generate_svd_json.py — script structure, argparse, entry point
  • scripts/svd_diagnostics.py — report generation pattern
  • tests/test_analysis.py: tmp_path fixture, _setup_svd_vectors() helper

Test scenarios:

  • Happy path: main(["--help"]) exits with code 0 and prints usage
  • Happy path: main(["--db", "data/motions.db", "--output", "/tmp/test"]) runs without error
  • Edge case: main(["--db", "nonexistent.db"]) handles missing database gracefully (exit code 1)
  • Edge case: database with missing tables produces clear error message

Verification:

  • uv run python scripts/motion_drift.py --help shows all arguments

  • uv run python -m pytest tests/test_motion_drift.py -q passes

  • Unit 2: Axis stability analysis

Goal: Compute axis stability across annual windows and generate stability heatmap.

Requirements: R1, R2, R3, R4

Dependencies: Unit 1

Files:

  • Create: analysis/motion_drift.py (core analysis module)
  • Modify: scripts/motion_drift.py (call axis stability)
  • Test: tests/test_motion_drift.py

Approach:

  • Create analysis/motion_drift.py with compute_axis_stability(db_path, windows) function
  • Two-tier approach:
    • Try loading V^T from svd_vectors where entity_type='vt_matrix' (if stored by pipeline)
    • Fallback: apply Procrustes alignment to motion vectors across windows (reuse _procrustes_align_windows() from analysis/trajectory.py), then for each window get top N motions per component by absolute score and compute pairwise cosine similarity of motion ranking vectors
  • Generate stability heatmap as matplotlib figure (window × component matrix, color-coded by similarity)
  • Return stability report: which axes are stable (similarity > 0.7), which are reordered (high similarity to different component index), which are unstable (low similarity to any component)
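
The stable / reordered / unstable classification in the last step can be sketched as follows. This is a hedged illustration, not the planned implementation: it assumes a square similarity matrix for one pair of windows, where sim[i][j] is the similarity between component i in the earlier window and component j in the later one, and it takes absolute values on the assumption that any sign flips remaining after alignment should not count as instability. The function name classify_axes is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def classify_axes(
    sim: List[List[float]], threshold: float = 0.7
) -> Dict[int, Tuple[str, Optional[int]]]:
    """Label each component of the earlier window by its best match in the later one.

    Returns {component_index: (label, matched_index)} where label is
    "stable", "reordered" or "unstable".
    """
    labels: Dict[int, Tuple[str, Optional[int]]] = {}
    for i, row in enumerate(sim):
        # Best-matching component in the later window, ignoring sign.
        best_j = max(range(len(row)), key=lambda j: abs(row[j]))
        best = abs(row[best_j])
        if best <= threshold:
            labels[i] = ("unstable", None)      # no component is similar enough
        elif best_j == i:
            labels[i] = ("stable", i)           # same index, high similarity
        else:
            labels[i] = ("reordered", best_j)   # high similarity at another index
    return labels


example = [
    [0.90, 0.10, 0.00],
    [0.20, 0.10, -0.85],
    [0.30, 0.40, 0.20],
]
result = classify_axes(example)
```

In the example, component 0 is stable, component 1 matches component 2 of the later window with a flipped sign and is reported as reordered, and component 2 has no match above the threshold.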

Patterns to follow:

  • analysis/explorer_data.py — DuckDB loading patterns, vector parsing
  • analysis/trajectory.py: _procrustes_align_windows() for cross-window comparison

Test scenarios:

  • Happy path: compute_axis_stability returns stability matrix for 3+ windows with synthetic data
  • Happy path: stability matrix is symmetric and values are in [-1, 1]
  • Happy path: Procrustes alignment corrects sign flips between windows
  • Edge case: single window returns empty stability report (no comparison possible)
  • Edge case: windows with no motion vectors handled gracefully (warning logged, skipped)
  • Integration: run against real data/motions.db annual windows, verify heatmap is generated

Verification:

  • Stability heatmap PNG generated with correct dimensions (windows × components)

  • Stability report identifies at least some axes as stable (similarity > 0.7)

  • Unit 3: Semantic drift analysis

Goal: Compute semantic drift timelines for stable axes and detect inflection points.

Requirements: R5, R6, R7, R8

Dependencies: Unit 2 (needs stable axis list)

Files:

  • Modify: analysis/motion_drift.py (add drift functions)
  • Modify: scripts/motion_drift.py (call drift analysis)
  • Test: tests/test_motion_drift.py

Approach:

  • Add compute_semantic_drift(db_path, stable_axes, windows, top_n) function
  • For each stable axis:
    • Get top N motions per window by absolute SVD loading
    • Compute average fused embedding centroid per window
    • Compute cosine distance between consecutive window centroids
    • Detect inflection points: where drift rate exceeds 2× median drift rate
  • For each inflection point, extract example motions (top 3 before/after by loading)
  • Generate drift timeline plot per axis (line chart with inflection point markers)
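
The centroid, drift, and inflection steps above can be sketched in plain Python as shown below. This is an illustration under stated assumptions: embeddings are given as lists of floats, windows are keyed by sortable ids such as "2019", and the helper names (centroid, cosine_distance, drift_series, inflection_points) are hypothetical. The real module would load the top-N fused embeddings from DuckDB first.

```python
import math
from statistics import median
from typing import Dict, List, Tuple


def centroid(vectors: List[List[float]]) -> List[float]:
    """Unweighted mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; NaN when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")
    return 1.0 - dot / (norm_a * norm_b)


def drift_series(centroids: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Drift between consecutive windows, labelled by the later window."""
    windows = sorted(centroids)
    return [
        (later, cosine_distance(centroids[earlier], centroids[later]))
        for earlier, later in zip(windows, windows[1:])
    ]


def inflection_points(series: List[Tuple[str, float]], factor: float = 2.0) -> List[str]:
    """Windows whose drift exceeds factor x the median drift of the series."""
    values = [d for _, d in series if not math.isnan(d)]
    if len(values) < 2:
        return []  # a single step cannot exceed a multiple of its own median
    cutoff = factor * median(values)
    return [window for window, d in series if d > cutoff]


# Synthetic example: slow drift followed by an abrupt change in 2022.
centroids = {"2019": [1.0, 0.0], "2020": [1.0, 0.1], "2021": [1.0, 0.2], "2022": [0.0, 1.0]}
series = drift_series(centroids)
flagged = inflection_points(series)
```

With the synthetic centroids, only the step into 2022 is flagged, which mirrors the "abrupt change" test scenario listed for this unit.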

Patterns to follow:

  • analysis/trajectory.py: compute_trajectories() for cross-window drift computation
  • scripts/svd_diagnostics.py — markdown report generation

Test scenarios:

  • Happy path: compute_semantic_drift returns drift series for each stable axis
  • Happy path: drift values are in [0, 2] (cosine distance range)
  • Happy path: inflection points detected when synthetic data has abrupt change
  • Edge case: axis with only 2 windows returns drift but no inflection points
  • Edge case: axis with a steady drift rate (no step exceeding 2× the median) returns no inflection points
  • Integration: run against real data, verify drift timelines are plausible

Verification:

  • Drift timeline PNG generated per stable axis

  • Inflection points (if any) are marked on timeline with motion examples in report

  • Unit 4: Party voting analysis

Goal: Compute party voting centroids and detect cross-ideological voting patterns.

Requirements: R9, R10, R11, R12

Dependencies: Unit 2 (needs stable axis list)

Files:

  • Modify: analysis/motion_drift.py (add party analysis functions)
  • Modify: scripts/motion_drift.py (call party analysis)
  • Test: tests/test_motion_drift.py

Approach:

  • Add compute_party_voting(db_path, stable_axes, windows) function
  • For each party:
    • Query mp_votes for motions party voted "voor" on per window, using date-aware party name normalization (reuse load_mp_vectors_by_party_for_window() pattern from analysis/explorer_data.py)
    • For each motion, get its SVD scores from svd_vectors
    • Compute unweighted mean score along each stable axis (voting centroid)
  • Track party trajectories: plot party centroid position per window along each axis
  • Detect cross-ideological voting:
    • For each window, independently determine axis polarity by checking where canonical right-wing parties (CANONICAL_RIGHT) score on each axis
    • Identify "right-wing" motions (high positive loading on axis where PVV/FVD/JA21/SGP score high after polarity check)
    • Find left-wing parties (SP, PvdA, GL, etc.) voting "voor" on right-wing motions
    • Compute cross-voting rate per party per window
    • Detect trends: is cross-voting increasing or decreasing over time?
  • Generate party trajectory plots and cross-voting summary table
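
The per-window polarity check and cross-voting rate can be sketched as follows, for a single axis and a single window. This is a simplified illustration: the party set stands in for CANONICAL_RIGHT (its real contents in analysis/config.py are not shown in this plan), the loading cutoff of 0.3 is an arbitrary placeholder for the threshold that is deferred to implementation, and all function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Iterable


def axis_polarity(party_scores: Dict[str, float], canonical_right: Iterable[str]) -> int:
    """+1 if canonical right-wing parties sit on the positive side, -1 if negative, 0 if unknown."""
    scores = [party_scores[p] for p in canonical_right if p in party_scores]
    if not scores:
        return 0
    return 1 if mean(scores) >= 0 else -1


def voting_centroid(voor_motion_ids: Iterable[str], motion_scores: Dict[str, float]) -> float:
    """Unweighted mean axis score of the motions a party voted 'voor' on (NaN if none)."""
    scored = [motion_scores[m] for m in voor_motion_ids if m in motion_scores]
    return mean(scored) if scored else float("nan")


def cross_voting_rate(
    voor_motion_ids: Iterable[str],
    motion_scores: Dict[str, float],
    polarity: int,
    cutoff: float,
) -> float:
    """Share of a party's 'voor' votes cast on right-wing motions in one window."""
    scored = [motion_scores[m] for m in voor_motion_ids if m in motion_scores]
    if not scored or polarity == 0:
        return float("nan")
    # Multiplying by polarity orients the axis so that right-wing is positive.
    right_wing = [s for s in scored if polarity * s > cutoff]
    return len(right_wing) / len(scored)


# Synthetic window in which the right-wing parties load negatively on the axis.
party_scores = {"PVV": -0.8, "FVD": -0.6, "SP": 0.7}
motion_scores = {"m1": -0.9, "m2": 0.5, "m3": -0.4, "m4": 0.1}
polarity = axis_polarity(party_scores, {"PVV", "FVD", "JA21", "SGP"})
sp_rate = cross_voting_rate(["m1", "m2", "m4", "m5"], motion_scores, polarity, cutoff=0.3)
```

Because polarity is computed per window, a sign flip of the axis between windows does not turn left-wing motions into apparent right-wing ones. Motions without a score (m5 in the example) are ignored rather than counted against the party.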

Patterns to follow:

  • analysis/config.py: CANONICAL_RIGHT/CANONICAL_LEFT for party classification
  • analysis/explorer_data.py: mp_votes query patterns, load_mp_vectors_by_party_for_window() for party normalization

Test scenarios:

  • Happy path: compute_party_voting returns voting centroids for parties with sufficient data
  • Happy path: cross-ideological voting detected when synthetic data has left party voting on right motions
  • Happy path: party name normalization maps historical names (GL, PvdA → GroenLinks-PvdA) correctly
  • Edge case: party with no "voor" votes in a window handled gracefully (centroid = NaN, skipped)
  • Edge case: window with no voting data handled gracefully
  • Integration: run against real data, verify party trajectories are plausible

Verification:

  • Party trajectory PNG generated showing party movement across windows

  • Cross-voting summary table in report with at least one example

  • Unit 5: Report generation

Goal: Assemble all analysis outputs into a markdown report with embedded charts.

Requirements: R13, R14, R15

Dependencies: Units 2, 3, 4

Files:

  • Modify: scripts/motion_drift.py (orchestrate report generation)
  • Test: tests/test_motion_drift.py

Approach:

  • Add _generate_report(output_dir, stability_result, drift_result, party_result) function
  • Generate markdown with sections:
    • Summary (key findings, number of stable axes, inflection points, cross-voting trends)
    • Axis Stability (heatmap + interpretation)
    • Semantic Drift (timeline per axis + inflection point analysis with motion examples)
    • Party Voting Analysis (trajectory plots + cross-voting summary + examples)
    • Methodology (brief description of approach, parameters used)
  • Save all matplotlib figures as PNGs in output directory
  • Embed PNGs in markdown using relative paths
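
Assembling the markdown and embedding figures by relative path can be sketched as below. This is a minimal illustration that assumes the PNG files have already been saved into the output directory. The function name, the report title, and the (title, body) section format are assumptions; the planned _generate_report takes the three result objects instead.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def generate_report(
    output_dir: Path,
    sections: List[Tuple[str, str]],
    figures: Dict[str, List[str]],
) -> Path:
    """Write report.md with one heading per section and its figures embedded.

    figures maps a section title to PNG file names located in output_dir;
    only the file name is written, so links stay relative to the report.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing

    lines = ["# Motion Semantic Drift Report", ""]
    for title, body in sections:
        lines += [f"## {title}", "", body, ""]
        for figure in figures.get(title, []):
            lines += [f"![{title}]({Path(figure).name})", ""]

    report_path = out / "report.md"
    report_path.write_text("\n".join(lines), encoding="utf-8")
    return report_path
```

Writing only the file name in each image link keeps the report portable: the output directory can be moved or shared as a whole without breaking the embedded charts.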

Patterns to follow:

  • scripts/svd_diagnostics.py — markdown report structure
  • scripts/generate_svd_json.py: _generate_markdown_report() function

Test scenarios:

  • Happy path: report generated with all sections and embedded images
  • Happy path: all PNG files exist in output directory
  • Edge case: no stable axes → report notes this and skips drift/party sections
  • Edge case: output directory creation when it doesn't exist

Verification:

  • output/report.md exists and contains all expected sections
  • All referenced PNG files exist in output directory
  • Report is readable in a markdown viewer

System-Wide Impact

  • Interaction graph: New script reads from existing DuckDB tables; no writes to production data. Pipeline change needed to store V^T matrix (optional, for future windows).
  • Unchanged invariants: SVD computation unchanged. Explorer unchanged. Existing analysis modules unchanged.
  • New dependency: matplotlib added to pyproject.toml. First use of matplotlib in codebase.

Risks & Dependencies

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| matplotlib introduces new dependency burden | Low | Low | Already common library; well-maintained. Alternative: use Plotly static export if team prefers single viz stack. |
| V^T matrix not available for historical windows | High | Medium | Fallback to Procrustes-aligned motion ranking correlation (works with existing data). Store V^T going forward. |
| Sparse data in early windows (2016-2018: 124-162 motions) | Medium | Medium | Script warns about low-coverage windows; analysis focuses on 2019+ where data is richer. |
| Cross-ideological voting detection threshold too sensitive/insensitive | Medium | Low | Threshold is parameterized; will calibrate during execution against baseline drift rates. |
| Script exceeds 2-minute runtime on full dataset | Low | Low | JSON parsing of fused embeddings is the bottleneck. Will batch-load and cache if needed. |

Documentation / Operational Notes

  • New script: scripts/motion_drift.py — usage documented in module docstring
  • New analysis module: analysis/motion_drift.py — functions documented with docstrings
  • Report output: markdown with embedded PNGs, shareable without running the script
  • Future: integrate analysis into Streamlit explorer tab (separate plan)

Sources & References