docs(plan): subagent-based two-dimensional extremity rescoring plan

4 weeks ago · b6612d834a
parent bf37f84a8b
commit b6612d834a
1 changed files with 346 additions and 0 deletions
--- a/docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md
+++ b/docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md
@@ -0,0 +1,346 @@
+---
+title: Subagent-based two-dimensional extremity rescoring and mechanism analysis
+type: feat
+status: active
+date: 2026-05-08
+origin: docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md
+---
+
+# Subagent-Based Two-Dimensional Extremity Rescoring and Mechanism Analysis
+
+## Summary
+
+The current Overton analysis has a known weakness: the LLM extremity score conflates stylistic radicalism (inflammatory language) with material policy impact (rights restricted, groups affected). The manual audit of 20 motions suggested 75% agreement — enough to trust the broad findings but not the fine-grained extremity-stratified analysis. This plan replaces the OpenRouter-based scoring pipeline with project-local subagents (deepseek v4 flash) that score motions via native reasoning. The subagent skill is a durable project asset usable for future LLM analyses. Additionally, we analyze which specific types of right-wing motions gained centrist support post-2024 (mechanism analysis), and compare content extremity shifts over time with the new dual-dimension scores.
+
+---
+
+## Problem Frame
+
+The current `extremity_scorer.py` calls OpenRouter (mistral-small) with a single-dimension prompt asking "how radical is this?" on a 1-5 scale. This conflates two dimensions:
+
+- **Stylistic extremity**: How inflammatory/harsh is the language?
+- **Material impact**: How much would this policy actually restrict rights, affect groups, or reshape institutions?
+
+The current scores cannot separate "rude but harmless" from "measured but devastating." The findings report flags this as the primary measurement concern (LLM audit at 75% agreement, systematic overrating of anti-institutional language).
+
+Additionally, the Overton analysis tells us *that* centrist support rose post-2024 but not *which kinds* of right-wing motions drove this shift. Mechanism analysis fills this gap.
+
+---
+
+## Requirements
+
+- R1. Write a project-local skill at `.opencode/skills/score-extremity/` that defines a two-dimensional scoring prompt, JSON output schema, and subagent-spawning workflow. The skill is a durable asset for future LLM analyses.
+- R2. Score a stratified sample of 100 right-wing motions (25 per extremity bucket, 1-2 / 2-3 / 3-4 / 4-5) for both stylistic extremity and material impact. Compute correlation between the two dimensions.
+- R3. If r > 0.7, confirm the single-dimensional scores are directionally usable. If r < 0.7, flag that separate dimensions matter and extend the sample.
+- R4. Mechanism analysis: classify the 2,986 right-wing motions by policy mechanism (what specific institutional change the motion proposes) and compute which mechanisms gained the most centrist support post-2024.
+- R5. Build the scoring infrastructure test-first. Each scoring subagent and the orchestration layer have unit tests mocking the subagent dispatch.
+- R6. Update the findings report with dual-dimension correlation, mechanism analysis results, and refreshed content extremity narrative.
+
+---
+
+## Scope Boundaries
+
+- In scope: Writing the skill, stratified 100-motion sample, mechanism classification, test infrastructure, report update.
+- Out of scope: Re-scoring all 2,986 motions (deferred until r is measured). Interactive dashboard. Streamlit UI changes.
+- The skill lives at `.opencode/skills/score-extremity/SKILL.md` — one file, no Python dependencies.
+
+---
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `analysis/right_wing/extremity_scorer.py` — current single-dimension scoring (prompt template, JSON schema, batch orchestration)
+- `analysis/right_wing/direction3_migration_antidemocratic.py` — analysis script pattern (DuckDB queries, matplotlib charts, markdown output)
+- `reports/overton_window/findings_report.md` — current report with Section 8 next steps
+- `tests/right_wing/` — empty directory, target for new test files
+
+### Institutional Learnings
+
+- `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md` — Step 7 describes 2D rescoring and manual audit
+- `docs/solutions/insights/llm-motion-classification-prompt-design.md` — prior work on orthogonal prompt dimensions
+
+### Key Technical Decisions
+
+- **Subagents, not OpenRouter API calls.** deepseek v4 flash subagents score motions natively via reasoning. No API keys, no rate limits, no cost. The orchestrating script spawns subagents via the `task` tool and collects structured JSON.
+- **Skill as prompt artifact, not code.** The `.opencode/skills/score-extremity/SKILL.md` defines the scoring prompt, JSON schema, and subagent-spawning instructions in natural language. The orchestrating Python script reads the skill, formats prompts, and spawns subagents.
+- **Batch size: 10 motions per subagent.** Each subagent scores 10 motions for both dimensions. 100 motions = 10 subagents. Parallel dispatch via one `task` call per batch.
+- **Stratified sample across all 4 extremity buckets.** 25 per bucket from the existing LLM scores. This tests whether the two dimensions diverge more in high-extremity buckets (where inflammatory language may dominate).
+- **Mechanism taxonomy derived from the data.** The subagent derives mechanism categories from the motion text (e.g., "detention/removal", "benefit restriction", "institutional bypass", "symbolic/declarative", "rights limitation", "procedural hurdle"). No pre-defined taxonomy.
+- **Storage: two DB tables.** `extremity_scores_2d` (motion_id, stylistic_score, material_score, stylistic_rationale, material_rationale) and `motion_mechanisms` (motion_id, mechanism_category, centrist_support_delta). Existing tables unchanged.
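The batch arithmetic in the decisions above (100 motions, 10 per subagent, 10 subagents) can be sketched as follows. This is a minimal illustration only: the plan's `format_batches()` is also meant to build prompts, which this sketch leaves out, and the integer motion IDs are placeholders.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def format_batches(motions: Sequence[T], batch_size: int = 10) -> list[list[T]]:
    """Split motions into consecutive batches; only the last batch may be short."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(motions[i:i + batch_size]) for i in range(0, len(motions), batch_size)]


# Hypothetical motion IDs standing in for the 100-motion sample.
batches = format_batches(list(range(1, 101)))
print(len(batches), {len(b) for b in batches})
```

A sample size that is not a multiple of the batch size simply yields a shorter final batch, so the same helper would still work if the sample is extended later.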
+
+### Open Questions
+
+#### Resolved During Planning
+
+- **Q: How to test subagent spawning?** Mock the subagent dispatch layer. The skill produces a JSON contract; test that the orchestrator correctly parses and stores results.
+- **Q: Which 100 motions to sample?** Stratified random from the 2,986 classified right-wing motions, 25 per extremity bucket, seeded for reproducibility.
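The seeded, stratified draw described in the sampling answer above might look like the sketch below. It works on in-memory `(motion_id, existing_score)` pairs rather than the database. The name `sample_by_bucket` is hypothetical, and the bucket boundaries are an assumption, since the plan does not say which bucket a score of exactly 2, 3, or 4 falls into.

```python
import random
from collections import defaultdict


def bucket_of(score: float) -> str:
    # Assumption: half-open buckets [1,2), [2,3), [3,4), with a score of
    # exactly 5 folded into "4-5". The plan leaves the boundaries unspecified.
    lower = min(max(int(score), 1), 4)
    return f"{lower}-{lower + 1}"


def sample_by_bucket(scored, n_per_bucket=25, seed=42):
    """Return {bucket: [motion_id, ...]} from (motion_id, existing_score) pairs."""
    by_bucket = defaultdict(list)
    for motion_id, score in scored:
        by_bucket[bucket_of(score)].append(motion_id)

    rng = random.Random(seed)  # private RNG, so global random state is untouched
    sample = {}
    for bucket in sorted(by_bucket):
        ids = sorted(by_bucket[bucket])  # sort first so input order cannot change the draw
        # A bucket with fewer than n_per_bucket motions contributes all of them.
        sample[bucket] = rng.sample(ids, min(n_per_bucket, len(ids)))
    return sample
```

Sorting the IDs before drawing is a design choice: it makes the sample depend only on the seed and the set of motions, not on the order in which the query happens to return rows.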
+
+#### Deferred to Implementation
+
+- **Q: Exact mechanism taxonomy** — derived by the subagent from the data, not pre-specified.
+- **Q: Whether to extend the sample beyond 100** — depends on the r value between dimensions.
+
+---
+
+## Output Structure
+
+```
+.opencode/skills/score-extremity/
+    SKILL.md                  # Scoring prompt, JSON schema, subagent workflow
+
+analysis/right_wing/
+    extremity_rescore_2d.py    # Orchestrator: reads skill, spawns 10 subagents, collects results
+    mechanism_analysis.py      # Mechanism classification + centrist support breakdown
+
+tests/right_wing/
+    test_extremity_rescore_2d.py  # Unit tests for orchestrator
+    test_mechanism_analysis.py    # Unit tests for mechanism pipeline
+```
+
+---
+
+## High-Level Technical Design
+
+> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*
+
+```
+┌──────────────────────┐
+│ .opencode/skills/     │
+│  score-extremity/     │  ← Read by orchestrator
+│  SKILL.md             │
+│  - Prompt template    │
+│  - JSON schema        │
+│  - Subagent workflow  │
+└──────────┬───────────┘
+           │ read
+┌──────────▼───────────┐
+│ extremity_rescore_2d  │
+│ .py (orchestrator)    │
+│                       │
+│ 1. Query 100 motions  │
+│ 2. Format 10 batches  │
+│ 3. Spawn 10 subagents │──→ subagent scores 10 motions
+│ 4. Collect JSON       │    returns {motion_id: {stylistic_score, material_score}}
+│ 5. Validate + store   │
+└──────────────────────┘
+```
+
+---
+
+## Implementation Units
+
+### U1. Write the Scoring Skill
+
+**Goal:** Create a project-local skill that an orchestrator can read to configure subagent-based two-dimensional extremity scoring.
+
+**Requirements:** R1
+
+**Dependencies:** None
+
+**Files:**
+- Create: `.opencode/skills/score-extremity/SKILL.md`
+
+**Approach:**
+- YAML frontmatter: `name: score-extremity`, `description: "Two-dimensional extremity scoring for Dutch parliamentary motions. Use when scoring policy radicalism along stylistic vs material impact dimensions."`
+- Body: the two-dimensional scoring prompt (in Dutch, matching the existing PROMPT_TEMPLATE style). Define two scores: `stijl_extremiteit` (1–5, inflammatory language) and `materiele_impact` (1–5, substantive rights/policy effect).
+- Body: the JSON output schema matching the prompt.
+- Body: instructions for how the orchestrator should spawn subagents (batch size 10, parallel dispatch, collect results, validate JSON).
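A rough sketch of how an orchestrator could read such a skill file, assuming the frontmatter contains only flat `key: value` pairs like the `name` and `description` fields above. The helper `split_frontmatter` is hypothetical, and it is not a YAML parser. A real implementation would more likely use a YAML library.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a '---'-delimited frontmatter block from the Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("missing closing '---' frontmatter delimiter") from None

    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            # Split on the first colon only, so colons inside the value survive.
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


def load_skill(path):
    """Read SKILL.md and return (frontmatter, body), failing loudly if it is unusable."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"skill file not found: {path}")
    meta, body = split_frontmatter(path.read_text(encoding="utf-8"))
    missing = {"name", "description"} - meta.keys()
    if missing:
        raise ValueError(f"frontmatter is missing required fields: {sorted(missing)}")
    return meta, body
```

Extracting the prompt and JSON schema from the body would depend on the section headings chosen for SKILL.md, which the plan does not fix, so that step is left out here.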
+
+**Patterns to follow:**
+- Existing PROMPT_TEMPLATE in `analysis/right_wing/extremity_scorer.py` for prompt structure
+- `~/.config/opencode/skills/ce-work/SKILL.md` for YAML frontmatter conventions
+
+**Test scenarios:**
+- Edge case: skill file has valid YAML frontmatter with required `name` and `description` fields.
+- Edge case: skill body contains the expected sections (prompt template, JSON schema, usage instructions).
+- Happy path: orchestrator can read the skill file and extract prompt + schema.
+
+**Verification:**
+- `opencode` detects the skill at startup (listed in available_skills).
+- The skill contains a clear two-dimensional scoring prompt in Dutch.
+
+---
+
+### U2. Build Orchestrator + Subagent Scoring Pipeline (TDD)
+
+**Goal:** Build the orchestrating script that reads the skill, queries 100 motions, spawns subagents, collects and validates results, and stores two-dimensional scores in the database.
+
+**Requirements:** R2, R5
+
+**Dependencies:** U1
+
+**Files:**
+- Create: `analysis/right_wing/extremity_rescore_2d.py`
+- Create: `tests/right_wing/test_extremity_rescore_2d.py`
+
+**Approach:**
+1. **Test-first:** Write tests for the orchestrator before implementation:
+   - Test that `load_skill()` returns prompt and schema from SKILL.md
+   - Test that `format_batches(motions, batch_size=10)` splits correctly
+   - Test that `validate_subagent_result(result, schema)` catches malformed JSON
+   - Test that `store_scores(db_path, results)` writes to `extremity_scores_2d` table
+   - Mock the subagent dispatch to return synthetic JSON
+2. **Implementation:** 
+   - `load_skill()` — reads `.opencode/skills/score-extremity/SKILL.md`, parses YAML frontmatter, returns body
+   - `sample_motions(db_path, n_per_bucket=25, seed=42)` — stratified query from `right_wing_motions` JOIN `extremity_scores`
+   - `format_batches()` — groups motions into batches of 10, builds prompts with motion text + layman explanation
+   - `spawn_and_collect()` — orchestrator reads the skill, manually formats context for each subagent batch, spawns via `task` tool with return JSON contract
+   - `validate_and_store()` — validates each result against the schema, writes to DB
+3. **Database:** `CREATE TABLE IF NOT EXISTS extremity_scores_2d (motion_id INTEGER PRIMARY KEY, stylistic_score INTEGER, material_score INTEGER, stylistic_rationale TEXT, material_rationale TEXT)`
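The database step could be sketched as below, using the `CREATE TABLE` statement from the plan. Two assumptions are worth flagging: `sqlite3` stands in for whatever engine the project actually uses, and `INSERT OR REPLACE` is SQLite syntax chosen here so that re-running a batch does not fail on the primary key. The function also takes an open connection rather than the `db_path` named in the plan, which keeps the sketch easy to test in memory.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS extremity_scores_2d (
    motion_id INTEGER PRIMARY KEY,
    stylistic_score INTEGER,
    material_score INTEGER,
    stylistic_rationale TEXT,
    material_rationale TEXT
)
"""

INSERT = """
INSERT OR REPLACE INTO extremity_scores_2d
VALUES (:motion_id, :stylistic_score, :material_score,
        :stylistic_rationale, :material_rationale)
"""


def store_scores(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Create the table if needed, upsert the rows, and return the table's row count."""
    conn.execute(DDL)
    with conn:  # commits on success, rolls back if an insert raises
        conn.executemany(INSERT, rows)
    return conn.execute("SELECT COUNT(*) FROM extremity_scores_2d").fetchone()[0]


# Synthetic rows for illustration only; these are not real scores.
conn = sqlite3.connect(":memory:")
rows = [
    {"motion_id": 1, "stylistic_score": 4, "material_score": 2,
     "stylistic_rationale": "placeholder", "material_rationale": "placeholder"},
    {"motion_id": 2, "stylistic_score": 2, "material_score": 5,
     "stylistic_rationale": "placeholder", "material_rationale": "placeholder"},
]
stored = store_scores(conn, rows)
```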
+
+**Execution note:** Implement test-first. Write failing tests, then implementation.
+
+**Patterns to follow:**
+- `analysis/right_wing/extremity_scorer.py` — existing DB write patterns
+- `tests/agent_tools/test_database_tools.py` — temp DB fixture patterns
+
+**Test scenarios:**
+- Happy path: load_skill returns non-empty prompt and schema.
+- Happy path: format_batches with 100 motions produces 10 batches of 10.
+- Happy path: validate_and_store with valid JSON inserts 10 rows into extremity_scores_2d.
+- Edge case: missing SKILL.md raises clear error.
+- Edge case: a bucket with fewer than 25 motions contributes every motion it has.
+- Edge case: subagent returns missing field in JSON — validator rejects.
+- Edge case: subagent returns score outside 1–5 range — validator rejects.
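The two validator edge cases above (missing field, score outside 1–5) can be sketched as a single function. This assumes the reply shape `{"motions": [...]}` with the `stijl_extremiteit` and `materiele_impact` keys used elsewhere in the plan. Unlike the planned `validate_subagent_result(result, schema)`, this version hard-codes the checks instead of taking a schema argument.

```python
import json

SCORE_KEYS = ("stijl_extremiteit", "materiele_impact")


def validate_subagent_result(raw: str) -> list[dict]:
    """Parse one subagent reply and return its motion entries, or raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    motions = payload.get("motions") if isinstance(payload, dict) else None
    if not isinstance(motions, list):
        raise ValueError("reply must be an object with a 'motions' list")

    for item in motions:
        if not isinstance(item, dict) or "motion_id" not in item:
            raise ValueError(f"entry without motion_id: {item!r}")
        for key in SCORE_KEYS:
            if key not in item:
                raise ValueError(f"motion {item['motion_id']}: missing {key}")
            value = item[key]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(
                    f"motion {item['motion_id']}: {key}={value!r} is not an integer 1-5"
                )
    return motions
```

Raising on the first problem keeps the sketch short. An implementation that retries individual batches might prefer to collect all problems in a reply before deciding whether to retry.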
+
+**Verification:**
+- All tests pass before any subagent is spawned.
+- `extremity_scores_2d` table exists with correct schema.
+- Orchestrator can be configured with a `--dry-run` flag that validates the pipeline without spawning subagents.
+
+---
+
+### U3. Execute the 100-Motion Rescoring
+
+**Goal:** Run the orchestrator to score 100 motions, compute the correlation between stylistic and material extremity, and report the results.
+
+**Requirements:** R2, R3
+
+**Dependencies:** U2
+
+**Files:**
+- Modify: `analysis/right_wing/extremity_rescore_2d.py` (any fixes from live run)
+- Output: `reports/overton_window/extremity_2d_correlation.md`
+
+**Approach:**
+1. Run the orchestrator with actual subagent dispatch (no `--dry-run`)
+2. Spawn 10 subagents in parallel, each scoring 10 motions
+3. Collect all results, validate against schema
+4. Compute Pearson r between stylistic_score and material_score
+5. Write a short correlation report with:
+   - Overall r and per-bucket r
+   - Scatter plot of stylistic vs material scores
+   - Conclusion: "dimensions are separable" if r < 0.7, "single score sufficient" if r > 0.7
+   - Recommendation for next steps (extend sample, re-score all, or proceed)
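Steps 4 and 5 above could be sketched with a hand-written Pearson correlation and the 0.7 decision rule. One assumption is labeled in the code: the plan does not say what happens at exactly r = 0.7, so this sketch arbitrarily treats that case as "single score sufficient".

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one dimension is constant")
    return cov / sqrt(var_x * var_y)


def interpret(r: float, threshold: float = 0.7) -> str:
    # Assumption: r exactly at the threshold counts as "single score sufficient";
    # the plan only covers r > 0.7 and r < 0.7.
    return "single score sufficient" if r >= threshold else "dimensions are separable"


# Synthetic scores for illustration only; these are not results.
stylistic = [1, 2, 3, 4, 5, 5]
material = [2, 1, 4, 3, 5, 4]
r = pearson_r(stylistic, material)
print(f"r = {r:.2f}: {interpret(r)}")
```

The per-bucket r in the report would apply the same function to each bucket's 25 pairs. With so few points per bucket, the constant-dimension guard is more likely to trigger there than on the full sample.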
+
+**Technical design:**
+The orchestrator calls the `task` tool with the skill's prompt and each batch's motion data. Each subagent returns:
+```json
+{
+  "motions": [
+    {"motion_id": 123, "stijl_extremiteit": 3, "materiele_impact": 4, "rationale": "..."}
+  ]
+}
+```
+
+**Verification:**
+- 100 motions have both stylistic_score and material_score.
+- Correlation report written with clear r value.
+- All scores are integers 1–5.
+
+---
+
+### U4. Mechanism Analysis
+
+**Goal:** Classify right-wing motions by policy mechanism and compute which mechanisms gained the most centrist support post-2024.
+
+**Requirements:** R4, R5
+
+**Dependencies:** None (reads existing DB tables)
+
+**Files:**
+- Create: `analysis/right_wing/mechanism_analysis.py`
+- Create: `tests/right_wing/test_mechanism_analysis.py`
+- Output: `reports/overton_window/mechanism_analysis.md`
+
+**Approach:**
+1. **Subagent-based classification:** Spawn subagents to classify motions by mechanism. Each subagent receives 25 motions and returns JSON mapping `motion_id -> mechanism_category`. The subagent derives categories from the data (not a pre-defined taxonomy).
+2. **Test-first:** Write tests for the orchestration layer (query, batch formatting, table creation, result validation).
+3. **Compute centrist support per mechanism:** Using `centrist_support_strict` from `right_wing_motions`, compute pre/post-2024 centrist support and delta per mechanism category.
+4. **Report:** Table of mechanism categories ranked by centrist support delta, with N per category. Top-5 mechanisms visualization.
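The per-mechanism support table in steps 3 and 4 might be computed roughly as follows. The input format is an assumption: each row is `(mechanism, period, centrist_support)` with `period` already labeled `"pre"` or `"post"`, because the plan does not give the exact 2024 cutoff date. The `min_n=5` default mirrors the "fewer than 5 motions is unreliable" test scenario, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def support_delta_by_mechanism(rows, min_n=5):
    """Return one dict per mechanism, sorted by centrist support delta (largest first)."""
    grouped = defaultdict(lambda: {"pre": [], "post": []})
    for mechanism, period, support in rows:
        grouped[mechanism][period].append(support)

    table = []
    for mechanism, periods in grouped.items():
        pre = mean(periods["pre"]) if periods["pre"] else None
        post = mean(periods["post"]) if periods["post"] else None
        n = len(periods["pre"]) + len(periods["post"])
        table.append({
            "mechanism": mechanism,
            "n": n,
            "pre_cs": pre,
            "post_cs": post,
            # Delta is only defined when both periods have at least one motion.
            "delta": post - pre if pre is not None and post is not None else None,
            "unreliable": n < min_n,
        })
    # Mechanisms without a delta sort last; the rest in descending delta order.
    table.sort(key=lambda row: (row["delta"] is None, -(row["delta"] or 0.0)))
    return table
```

Keeping the unreliable categories in the table with a flag, rather than dropping them, lets the report show N alongside every delta.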
+
+**Execution note:** Implement test-first. Mock subagent dispatch in tests.
+
+**Patterns to follow:**
+- `analysis/right_wing/direction3_migration_antidemocratic.py` — category breakdown patterns
+- `analysis/right_wing/overton_breakpoint_analysis.py` — pre/post comparison patterns
+
+**Test scenarios:**
+- Happy path: mechanism analysis script runs on real DB and produces a markdown report.
+- Happy path: table has mechanism categories with N, pre-CS, post-CS, delta columns.
+- Edge case: subagent returns unknown mechanism category — orchestrator normalizes or flags.
+- Edge case: mechanism category with <5 motions flagged as unreliable.
+
+**Verification:**
+- `reports/overton_window/mechanism_analysis.md` exists with mechanism breakdown.
+- Report includes centrist support delta per mechanism.
+- Top mechanism is identified with supporting evidence from motion titles.
+
+---
+
+### U5. Update Findings Report
+
+**Goal:** Integrate dual-dimension correlation and mechanism analysis into the Overton findings report.
+
+**Requirements:** R6
+
+**Dependencies:** U3, U4
+
+**Files:**
+- Modify: `reports/overton_window/findings_report.md`
+
+**Approach:**
+1. Add a new Section 3b (or update Section 3 Content Extremity) with:
+   - Two-dimensional scoring results and correlation
+   - Whether the single-dimensional scores are confirmed or need revision
+   - Updated content extremity narrative with caveats refined by dual-dimension insight
+2. Add a new Section 7 (Mechanism Analysis) with:
+   - Which mechanisms drove the centrist support surge
+   - Migration vs non-migration mechanism differences
+3. Update Section 8 (Next Steps) to reflect completed 2D rescoring and mechanism work
+
+**Verification:**
+- Report is internally consistent.
+- New sections reference the right figures and tables.
+- Next steps don't list work that's already done.
+
+---
+
+## System-Wide Impact
+
+- **New DB tables:** `extremity_scores_2d`, `motion_mechanisms` — additive, no existing data modified.
+- **New skill:** `.opencode/skills/score-extremity/SKILL.md` — no code impact, only prompt artifact.
+- **No UI changes, no agent_tools changes, no pipeline changes.**
+- **Tests:** New tests in `tests/right_wing/` do not affect existing test suite.
+
+---
+
+## Risks & Dependencies
+
+| Risk | Mitigation |
+|---|---|
+| Subagent capacity limits (too many parallel dispatches) | At batch size 10, the 100-motion sample needs only 10 parallel subagents, well within limits. If rescoring is extended to all 2,986 motions, use a hybrid approach (larger batches or fallback to API). |
+| Subagent returns malformed JSON | Validator layer rejects and retries individual batches (max 2 retries). |
+| Two dimensions correlate highly (r > 0.9) | Confirms the single-dimensional scores are directionally valid. Write this finding up as a confirmatory result — still valuable. |
+| Mechanism taxonomy is too coarse to discriminate | Subagent derives from data, not pre-defined taxonomy. Iterative refinement in the subagent prompt if first pass is noisy. |
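The malformed-JSON mitigation in the table above (reject, then retry a batch at most twice) could be sketched like this. `dispatch` and `validate` are injected callables, so the same loop runs against a mock in tests. How the real `task`-tool dispatch is invoked is not specified in the plan and is not modeled here.

```python
import json


def score_batch_with_retry(batch, dispatch, validate, max_retries=2):
    """Call dispatch(batch) until validate accepts the reply or retries run out."""
    last_error = None
    for _attempt in range(1 + max_retries):  # one initial try plus max_retries retries
        raw = dispatch(batch)
        try:
            return validate(raw)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"batch failed after {1 + max_retries} attempts") from last_error


# Mock dispatcher: two malformed replies, then a valid one.
replies = iter(["not json", "{broken", '{"motions": []}'])
result = score_batch_with_retry([101, 102], lambda batch: next(replies), json.loads)
```

Only the failing batch is retried, so one bad subagent reply does not force the other nine batches to be rescored.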
+
+---
+
+## Sources & References
+
+- Origin plan: `docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md`
+- Findings report: `reports/overton_window/findings_report.md`
+- Methodology doc: `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md`
+- Existing scorer: `analysis/right_wing/extremity_scorer.py`
+- Skill format reference: `~/.config/opencode/skills/ce-work/SKILL.md`