# Motion Explorer Implementation Plan

**Goal:** Regenerate analyses (compass + similarity cache), add an interactive Streamlit explorer (explorer.py) exposing the political compass, party trajectories, motion search and a motion browser, and update the blog post with real counts and vector-dimension facts.

**Design doc:** thoughts/shared/designs/2026-03-22-motion-explorer-design.md

---

## Summary / Architecture
We'll perform three high-level workstreams in dependency order:

1. **Analysis rerun:** after the running pipeline releases the DB lock, run the minimal pipeline steps to (re)compute fused vectors, then recompute the similarity cache for all quarterly windows 2019-Q1 → 2024-Q4. Also run the static compass generator for verification.
2. **explorer.py:** a single-file Streamlit app at the project root. It uses the existing analysis.* modules for heavy computations (cached via @st.cache_data) and read-only DuckDB connections for all DB reads. Figures are produced with Plotly and rendered inline in Streamlit.
3. **Blog post update:** update thoughts/blog-post-political-compass.md with real DB numbers, updated similarity-cache counts and correct fused vector dimensions.
Key implementation decisions (gap-filling):

- The explorer is a single import-safe module: top-level definitions only, no expensive work at import time. Running the UI triggers the computations.
- Use @st.cache_data for the expensive functions: load_positions (wrapping compute_2d_axes), load_party_map, load_motions_df.
- All DuckDB access in explorer.py uses duckdb.connect(database=..., read_only=True).
- For similarity lookups we query similarity_cache directly via read-only DuckDB rather than calling MotionDatabase (which opens non-read-only connections), to respect the "DB may be running" constraint.
- The UI filters out motions titled exactly "Verworpen." by default; a sidebar toggle allows showing them.
- Tests: the explorer is a UI script, so behavioural TDD is not practical. We add a minimal import/sanity test ensuring the module is import-safe and that key functions exist. Blog-post updates are manual, but the plan includes a small helper script to compute the exact counts to paste into the markdown.
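The import-safety plus caching decision can be illustrated without Streamlit: @st.cache_data memoizes a function on its arguments, much like functools.lru_cache. A stdlib-only sketch of the pattern (illustrative only; load_positions_stub and its return value are hypothetical stand-ins, and the real explorer uses @st.cache_data):

```python
# Illustration of the import-safe + cached-loader pattern.
# Nothing below runs at import time; the expensive body runs on first call only.
from functools import lru_cache

CALLS = {"load": 0}  # counts how often the "expensive" body actually executes

@lru_cache(maxsize=None)
def load_positions_stub(db_path: str = "data/motions.db") -> dict:
    """Hypothetical stand-in for a cached loader: body runs once per db_path."""
    CALLS["load"] += 1
    # ... the real code would open a read-only DuckDB connection here ...
    return {"2024-Q4": {"MP A": (0.1, -0.2)}}

first = load_positions_stub()
second = load_positions_stub()          # served from cache, body not re-run
other = load_positions_stub("alt.db")   # different argument, so recomputed
```

The same property is what makes the explorer cheap to re-render: subsequent Streamlit reruns with unchanged arguments hit the cache.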
---

## Dependency Graph

```
Batch 1 (parallel): 1.1 [analysis rerun - single operator task] (depends: none)
Batch 2 (parallel): 2.1, 2.2 [explorer implementation + test] (depends: 1.1 for verification; code can be implemented earlier)
Batch 3 (serial):   3.1 [blog post update] (depends: 1.1)
```

NOTE: The critical dependency is that the DB lock must be released before running the analysis rerun (Batch 1). The explorer code (Batch 2) can be implemented while the pipeline is running — it only attempts DB reads at runtime and uses read-only connections.
---

## Batch 1: Analysis rerun (operator tasks — no repo files changed)

These are operational steps to run after the pipeline finishes and the DB lock is released. Run them from the repository root.

### Task 1.1: Regenerate compass outputs and fused vectors

**What:** Run generate_compass.py, then run the pipeline to (re)fuse vectors for quarterly windows covering 2019-Q1 → 2024-Q4. We will not re-run the expensive fetch/extract/SVD/text steps if they are already up to date; only fusion (phase 5) must run so fused_embeddings exists for all windows.
**Commands (run after the pipeline finishes and the DB is unlocked):**

- Verify the DB file exists:

.venv/bin/python -c "import os; p='data/motions.db'; print('exists' if os.path.exists(p) else 'MISSING')"

- Run the static compass for a quick visual check (produces HTML output):

.venv/bin/python scripts/generate_compass.py --db data/motions.db --out outputs --method pca --pca-residual

- Run the pipeline orchestrator so Phase 5 (fusion) runs for quarterly windows 2019-01-01 → 2025-01-01. We explicitly skip metadata/extract/svd/text since those may already be present; this minimizes rework and avoids mixing read/write connections in the current process.
.venv/bin/python -m pipeline.run_pipeline \
  --db-path data/motions.db \
  --start-date 2019-01-01 --end-date 2025-01-01 \
  --window-size quarterly \
  --skip-metadata --skip-extract --skip-svd --skip-text

**Notes:** run_pipeline.py also has a --skip-fusion flag; we MUST NOT pass --skip-fusion here because we want fusion to execute. The script supports exactly the flags shown.
**Verify:**

- After run_pipeline completes, verify that fused_embeddings rows exist for the expected windows:

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
print(conn.execute("SELECT window_id, COUNT(*) FROM fused_embeddings GROUP BY window_id ORDER BY window_id DESC").fetchall())
conn.close()
PY
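The window ids used throughout this plan follow a YYYY-Qn convention (e.g. 2019-Q1). For reference, a date maps to its quarterly window like this (a sketch of the assumed convention; window_id_for is illustrative, not pipeline code):

```python
from datetime import date

def window_id_for(d: date) -> str:
    """Map a date to the quarterly window-id convention assumed in this plan."""
    quarter = (d.month - 1) // 3 + 1
    return f"{d.year}-Q{quarter}"

# The rerun covers 2019-01-01 up to (but not including) 2025-01-01,
# which yields 24 quarterly windows: 2019-Q1 .. 2024-Q4.
windows = sorted({window_id_for(date(y, m, 1)) for y in range(2019, 2025) for m in range(1, 13)})
```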
### Task 1.2: Recompute similarity cache for all quarterly windows 2019-Q1 → 2024-Q4

**What:** Compute top-20 similarities per motion per window for the fused vectors and insert rows into similarity_cache, running similarity.compute.compute_similarities once per window. The repository's similarity/compute.py exposes compute_similarities(vector_type='fused', window_id=..., top_k=20).

**Command (loop over all windows):**

.venv/bin/python - <<'PY'
from similarity.compute import compute_similarities

# 2019-Q1 .. 2024-Q4
windows = [f"{y}-Q{q}" for y in range(2019, 2025) for q in (1, 2, 3, 4)]
total = 0
for wid in windows:
    inserted = compute_similarities(vector_type='fused', window_id=wid, top_k=20, db_path='data/motions.db')
    print(f"window={wid} inserted={inserted}")
    total += inserted
print('DONE total_inserted=', total)
PY
**Notes & decisions:**

- compute_similarities already clears existing rows for (vector_type, window_id) before inserting new ones, so the loop is safe to re-run.
- If compute_similarities hits memory pressure on large windows, split the windows further and run on subsets — but try the simple loop first.
**Verify:**

- Basic counts per window:

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
print(conn.execute("SELECT window_id, COUNT(*) FROM similarity_cache WHERE vector_type = 'fused' GROUP BY window_id ORDER BY window_id").fetchall())
print('total', conn.execute("SELECT COUNT(*) FROM similarity_cache WHERE vector_type = 'fused'").fetchone())
conn.close()
PY
- Spot-check the top neighbours for a known motion id (the snippet below simply uses the first id in the motions table):

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
src = conn.execute("SELECT id FROM motions ORDER BY id LIMIT 1").fetchone()[0]
print('example source id=', src)
print(conn.execute("SELECT target_motion_id, score FROM similarity_cache WHERE source_motion_id = ? AND vector_type = 'fused' ORDER BY score DESC LIMIT 10", (src,)).fetchall())
conn.close()
PY
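Conceptually, the recompute step above ranks every motion's neighbours within a window by cosine similarity over the fused vectors and keeps the top k. A minimal pure-Python sketch of that ranking (illustrative only; the real implementation lives in similarity/compute.py, and the vectors here are made up):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two dense vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_neighbours(source_id: int, vectors: dict[int, list[float]], k: int = 20) -> list[tuple[int, float]]:
    """Rank all other motions in the window by similarity to source_id."""
    src = vectors[source_id]
    scores = [(mid, cosine(src, v)) for mid, v in vectors.items() if mid != source_id]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

# toy window: motion 2 is nearly parallel to motion 1, motion 3 is orthogonal
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
ranked = top_k_neighbours(1, vectors, k=2)
```

The (source_motion_id, target_motion_id, score) triples this produces correspond to the rows stored in similarity_cache.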
---

## Batch 2: Explorer implementation (code + test) — parallel implementers

All tasks in this batch are independent and can be worked on in parallel. The single file to add is explorer.py at the project root. A small unit test ensures import-safety.

Decision: explorer.py is placed at the project root (same level as app.py) as requested by the design. It avoids performing DB work at import time so tests and other scripts can import it safely.

### Task 2.1: explorer.py

**File:** explorer.py

**Test:** tests/test_explorer_import.py

**Depends:** none (safe to implement while the pipeline runs)

Implementation (copy-paste-ready). This is a minimal, well-documented, import-safe Streamlit app that follows the design requirements. It uses @st.cache_data on heavy functions, opens DuckDB with read_only=True for all reads, and uses the existing analysis modules to compute the 2D axes.
```python
# explorer.py
"""Streamlit motion explorer.

Import-safe: heavy computations are behind functions guarded by @st.cache_data
and only run when the user opens the app (streamlit run explorer.py).
"""

from __future__ import annotations

import logging
from typing import Dict, List, Optional, Tuple

import duckdb
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st

# keep a module-level logger
logger = logging.getLogger(__name__)


# ---------- Cached data loaders ----------


@st.cache_data
def load_positions(db_path: str = "data/motions.db", window_size: str = "annual") -> Tuple[Dict[str, Dict[str, Tuple[float, float]]], Optional[Dict]]:
    """Load positions_by_window and axis_def via analysis.political_axis.compute_2d_axes.

    Delegates the heavy computation to the analysis module and caches the result
    in Streamlit. db_path is a parameter so callers (tests) can pass a different
    path; window_size also participates in the cache key.
    """
    try:
        from analysis.political_axis import compute_2d_axes
    except Exception as e:
        logger.exception("analysis.political_axis not available: %s", e)
        return {}, None

    # compute_2d_axes may be expensive; the analysis module handles the internals
    positions_by_window, axis_def = compute_2d_axes(
        db_path, method="pca", pca_residual=True, normalize_vectors=True
    )
    return positions_by_window, axis_def


@st.cache_data
def load_party_map(db_path: str = "data/motions.db") -> Dict[str, str]:
    """Return an mp_name -> party mapping.

    Uses the helper in analysis.visualize, which already knows the heuristics.
    """
    try:
        from analysis.visualize import _load_party_map

        return _load_party_map(db_path)
    except Exception:
        logger.exception("Failed to load party map")
        return {}


@st.cache_data
def load_motions_df(db_path: str = "data/motions.db") -> pd.DataFrame:
    """Load the motions table into a cached pandas DataFrame (read-only connection).

    Columns returned: id, title, description, date, policy_area, voting_results,
    layman_explanation, winning_margin, controversy_score.
    """
    conn = None
    try:
        conn = duckdb.connect(database=db_path, read_only=True)
        df = conn.execute(
            "SELECT id, title, description, date, policy_area, voting_results, layman_explanation, winning_margin, controversy_score FROM motions"
        ).fetchdf()
        return df
    finally:
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass


def query_similar_from_cache(db_path: str, source_motion_id: int, vector_type: str = "fused", window_id: Optional[str] = None, top_k: int = 10) -> List[Dict]:
    """Query the similarity_cache table using a read-only connection.

    Returns a list of dicts with keys target_motion_id, score, id, window_id.
    """
    conn = None
    try:
        conn = duckdb.connect(database=db_path, read_only=True)
        params = [source_motion_id, vector_type]
        query = "SELECT target_motion_id, score, id, window_id FROM similarity_cache WHERE source_motion_id = ? AND vector_type = ?"
        if window_id is not None:
            query += " AND window_id = ?"
            params.append(window_id)
        query += " ORDER BY score DESC LIMIT ?"
        params.append(top_k)
        rows = conn.execute(query, params).fetchall()
        cols = [c[0] for c in conn.description]
        return [dict(zip(cols, r)) for r in rows]
    finally:
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass


# ---------- UI builders ----------


def build_compass_tab(db_path: str, window_size: str, show_rejected: bool):
    # show_rejected is accepted for API symmetry; the compass plots MPs, not motions.
    positions_by_window, axis_def = load_positions(db_path, window_size)
    party_map = load_party_map(db_path)

    if not positions_by_window:
        st.error("No position data available. Run the pipeline or check data/motions.db")
        return

    windows = sorted(positions_by_window.keys())
    # default to the latest window; a slider needs at least two values
    if len(windows) > 1:
        idx = st.slider("Window", 0, len(windows) - 1, len(windows) - 1)
    else:
        idx = 0
    window_id = windows[idx]

    pos = positions_by_window.get(window_id, {})
    names = list(pos.keys())
    xs = [p[0] for p in pos.values()]
    ys = [p[1] for p in pos.values()]
    parties = [party_map.get(n, "Unknown") for n in names]

    fig = px.scatter(x=xs, y=ys, color=parties, hover_name=names, title=f"Political Compass ({window_id})")
    st.plotly_chart(fig, use_container_width=True)


def build_trajectories_tab(db_path: str, window_size: str):
    positions_by_window, _ = load_positions(db_path, window_size)
    if not positions_by_window:
        st.error("No trajectories available")
        return

    window_ids = sorted(positions_by_window.keys())
    party_map = load_party_map(db_path)

    # user controls
    show_mps = st.checkbox("Show MPs (individual trajectories)", value=False)
    selected_parties = st.multiselect("Parties (select to restrict)", options=sorted(set(party_map.values())))

    fig = go.Figure()
    if show_mps:
        # plot a bounded subset by default to avoid clutter
        mp_limit = 200
        # build per-MP coordinate sequences across windows
        mp_coords = {}
        for wid in window_ids:
            for mp, coord in positions_by_window.get(wid, {}).items():
                mp_coords.setdefault(mp, []).append((wid, coord))

        # optionally restrict to the selected parties
        mps = [m for m in mp_coords if (not selected_parties) or (party_map.get(m) in selected_parties)]
        for mp in sorted(mps)[:mp_limit]:
            items = sorted(mp_coords[mp], key=lambda it: window_ids.index(it[0]))
            xs = [c[1][0] for c in items]
            ys = [c[1][1] for c in items]
            fig.add_scatter(x=xs, y=ys, mode='lines+markers', name=mp)
    else:
        # party centroids per window
        party_centroids = {}
        for wid in window_ids:
            coords_by_party = {}
            for mp, coord in positions_by_window.get(wid, {}).items():
                party = party_map.get(mp)
                if party is None:
                    continue
                coords_by_party.setdefault(party, []).append(coord)
            for party, coords in coords_by_party.items():
                centroid = (np.mean([c[0] for c in coords]), np.mean([c[1] for c in coords]))
                party_centroids.setdefault(party, {'windows': [], 'coords': []})
                party_centroids[party]['windows'].append(wid)
                party_centroids[party]['coords'].append(centroid)

        for party, data in party_centroids.items():
            if selected_parties and party not in selected_parties:
                continue
            xs = [c[0] for c in data['coords']]
            ys = [c[1] for c in data['coords']]
            fig.add_scatter(x=xs, y=ys, mode='lines+markers', name=party)

    st.plotly_chart(fig, use_container_width=True)


def build_search_tab(db_path: str, show_rejected: bool):
    df = load_motions_df(db_path)
    if df is None or df.empty:
        st.info("No motions table available")
        return

    # filters
    years = sorted(pd.to_datetime(df['date']).dt.year.dropna().unique().tolist())
    if years:
        start_year, end_year = min(years), max(years)
    else:
        start_year, end_year = 2019, 2024

    year_range = st.slider("Year range", int(start_year), int(end_year), (int(start_year), int(end_year)))
    policy_areas = sorted(df['policy_area'].dropna().unique().tolist())
    policy_filter = st.multiselect("Policy areas", options=policy_areas)
    query = st.text_input("Search text (title / layman_explanation)")

    # in-memory filtering
    working = df.copy()
    # hide rejected motions by default
    if not show_rejected:
        working = working[working['title'].str.strip() != 'Verworpen.']

    working['y'] = pd.to_datetime(working['date']).dt.year
    working = working[(working['y'] >= year_range[0]) & (working['y'] <= year_range[1])]
    if policy_filter:
        working = working[working['policy_area'].isin(policy_filter)]
    if query:
        q = query.lower()
        mask = working['title'].fillna('').str.lower().str.contains(q) | working['layman_explanation'].fillna('').str.lower().str.contains(q)
        working = working[mask]

    st.write(f"{len(working)} results")
    for _, row in working.sort_values(by='controversy_score', ascending=False).head(50).iterrows():
        with st.expander(f"{row['title']} — {row['date']}"):
            st.write(row.get('layman_explanation') or row.get('description') or '')
            st.write('Policy area:', row.get('policy_area'))
            st.write('Controversy score:', row.get('controversy_score'))
            # similar motions from the precomputed cache
            similar = query_similar_from_cache(db_path, int(row['id']), vector_type='fused', top_k=10)
            if similar:
                st.write('Vergelijkbare moties:')
                for s in similar:
                    st.write(f"- id={s['target_motion_id']} score={s['score']:.3f} window={s.get('window_id')}")
            else:
                st.info('Nog geen vergelijkbare moties beschikbaar')


def build_browser_tab(db_path: str, show_rejected: bool):
    df = load_motions_df(db_path)
    if df is None or df.empty:
        st.info("No motions table available")
        return

    if not show_rejected:
        df = df[df['title'].str.strip() != 'Verworpen.']

    df_display = df[['id', 'title', 'date', 'policy_area', 'controversy_score', 'winning_margin']].copy()
    df_display = df_display.sort_values(by=['date'], ascending=False)

    # st.dataframe replaces the deprecated st.experimental_data_editor (display only)
    st.dataframe(df_display, use_container_width=True)
    # row selection: the user picks an index and presses a button
    st.write('Pick a row index and click "Show details"')
    sel_row_idx = st.number_input('Select row index (0-based)', min_value=0, max_value=max(0, len(df_display) - 1), value=0)
    if st.button('Show details'):
        row = df_display.iloc[int(sel_row_idx)]
        st.subheader(row['title'])
        st.write(df.loc[df['id'] == row['id']].iloc[0].get('description') or '')
        similar = query_similar_from_cache(db_path, int(row['id']), vector_type='fused', top_k=10)
        if similar:
            st.write('Top similar:')
            for s in similar:
                st.write(f"- id={s['target_motion_id']} score={s['score']:.3f} window={s.get('window_id')}")
        else:
            st.info('Nog geen vergelijkbare moties beschikbaar')


def run_app():
    st.set_page_config(layout='wide', page_title='Parlement Explorer')

    st.sidebar.title('Explorer settings')
    db_path = st.sidebar.text_input('DuckDB path', value='data/motions.db')
    window_granularity = st.sidebar.selectbox('Window granularity', ['annual', 'quarterly'], index=0)
    show_rejected = st.sidebar.checkbox('Toon verworpen', value=False)

    tabs = st.tabs(['Politiek Kompas', 'Partij Trajectories', 'Motie Zoeken', 'Motie Browser'])
    with tabs[0]:
        build_compass_tab(db_path, window_granularity, show_rejected)
    with tabs[1]:
        build_trajectories_tab(db_path, window_granularity)
    with tabs[2]:
        build_search_tab(db_path, show_rejected)
    with tabs[3]:
        build_browser_tab(db_path, show_rejected)


if __name__ == '__main__':
    run_app()
```
**Verify (local/dev):**

- Run the app once the DB is available: streamlit run explorer.py
- Verify that Tab 1 loads, the window slider works, and the plot renders inline
- Verify Tab 3 search returns results and shows similar motions
- Verify that long-running operations are cached (first call slow, subsequent calls fast)
### Task 2.2: Test for explorer import-safety

**File:** tests/test_explorer_import.py

**Depends:** none

Minimal pytest to ensure the module can be imported without triggering heavy work, and that run_app and the key functions exist.

```python
# tests/test_explorer_import.py
import importlib


def test_explorer_importable():
    mod = importlib.import_module('explorer')
    assert hasattr(mod, 'run_app')
    assert callable(mod.run_app)
    # key helpers
    assert hasattr(mod, 'load_positions')
    assert hasattr(mod, 'load_motions_df')
```

**Verify:**

- Run the test (no DB required for the import test):

.venv/bin/python -m pytest tests/test_explorer_import.py -q
---

## Batch 3: Blog post update (manual / single-file edit)

The blog post at thoughts/blog-post-political-compass.md contains placeholder numbers for motion counts, similarity-cache totals and the fused vector dimension claim. After the analysis rerun completes, update the markdown with exact numbers.

### Task 3.1: Update blog post with real numbers

**File to modify:** thoughts/blog-post-political-compass.md

**Depends:** 1.1, 1.2 (analysis rerun and similarity-cache recompute must finish first)

Steps to compute authoritative numbers (run after Batch 1 completes):

1. Motion counts per year (SQL):

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
rows = conn.execute("SELECT EXTRACT(year FROM date) AS y, COUNT(*) FROM motions GROUP BY y ORDER BY y").fetchall()
print(rows)
conn.close()
PY
2. Similarity cache total count (fused vectors):

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
total = conn.execute("SELECT COUNT(*) FROM similarity_cache WHERE vector_type = 'fused'").fetchone()[0]
print('similarity_cache_fused_total=', total)
conn.close()
PY
3. Verify the fused vector dimensions claim by computing the distinct lengths of fused_embeddings.vector (the query below assumes a DuckDB LIST column, where CARDINALITY returns the length; if the column is instead stored as JSON text, use json_array_length):

.venv/bin/python - <<'PY'
import duckdb
conn = duckdb.connect(database='data/motions.db', read_only=True)
lens = conn.execute("SELECT DISTINCT CARDINALITY(vector) FROM fused_embeddings ORDER BY 1 DESC").fetchall()
print('distinct_fused_lengths=', lens)
conn.close()
PY
Replace the placeholder table and counts in thoughts/blog-post-political-compass.md with the outputs above. Also correct the fused-dimensions claim (the line that currently reads "fused = [svd_dims (10)] + [text_dims (2560)] = 2570") by pasting the real dimensions found.

Verification: after editing, spell-check and run a quick search to ensure the old placeholder numbers are gone (any matches printed are placeholders that still remain):

grep -n "212,206\|2570\|~450 (newly backfilled)" thoughts/blog-post-political-compass.md || echo "No placeholders remain"
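The same check can be sketched in pure Python if the grep semantics feel backwards (grep exits non-zero, triggering the echo, only when nothing matches). The PLACEHOLDERS list mirrors the grep pattern above; the before/after strings are hypothetical examples:

```python
import re

# Placeholder patterns from the grep above (regex-escaped where needed).
PLACEHOLDERS = [r"212,206", r"2570", r"~450 \(newly backfilled\)"]

def remaining_placeholders(text: str) -> list[str]:
    """Return the placeholder patterns still present in the blog post text."""
    return [p for p in PLACEHOLDERS if re.search(p, text)]

before = "fused = 2570 dims, 212,206 rows"                 # pre-edit text
after = "fused dims and row counts updated with real data"  # hypothetical post-edit text
```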
Commit message suggestions (to use when committing these changes):

- feat(explorer): add initial Streamlit explorer (explorer.py) + import test
- chore(analysis): recompute fused embeddings + similarity cache for 2019-Q1..2024-Q4 (instructions)
- docs(blog): update political compass blog post with real counts and vector dims
---

## Rollout / verification checklist (final acceptance)

- [ ] Analysis rerun finished without errors; fused_embeddings rows present for 2019-Q1..2024-Q4
- [ ] similarity_cache contains top-k neighbours for each window (spot-check 3 windows)
- [ ] explorer.py runs: streamlit run explorer.py renders tabs and figures inline
- [ ] explorer uses read-only DuckDB connections (manual code review + spot-check)
- [ ] thoughts/blog-post-political-compass.md updated with real numbers and vector dims
- [ ] All tests still pass: .venv/bin/python -m pytest -q
---

## Appendix: reasoning & decisions

- The design requires read-only DB access: MotionDatabase methods often open connections without the read_only flag. To guarantee read-only behaviour while the pipeline runs, explorer.py queries DuckDB directly with read_only=True for all SELECTs. This avoids accidentally holding write locks.
- The design requires using the existing analysis.* modules. compute_2d_axes is used as-is and wrapped by @st.cache_data; we rely on it to perform the heavy PCA/SVD logic.
- The similarity recompute step uses similarity.compute.compute_similarities per window. The design referenced recompute_all_windows, which does not exist in the repo; we use a small loop (shown above) to call compute_similarities per window.

*** End Plan