---
title: Subagent-based two-dimensional extremity rescoring and mechanism analysis
type: feat
status: active
date: 2026-05-08
origin: docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md
---

# Subagent-Based Two-Dimensional Extremity Rescoring and Mechanism Analysis

## Summary

The current Overton analysis has a known weakness: the LLM extremity score conflates stylistic radicalism (inflammatory language) with material policy impact (rights restricted, groups affected). A manual audit of 20 motions suggested 75% agreement, which is enough to trust the broad findings but not the fine-grained extremity-stratified analysis. This plan replaces the OpenRouter-based scoring pipeline with project-local subagents (deepseek v4 flash) that score motions via native reasoning on two separate dimensions. The subagent skill is a durable project asset that can be reused for future LLM analyses. The plan also adds a mechanism analysis, which identifies the specific types of right-wing motions that gained centrist support post-2024, and uses the new dual-dimension scores to compare content extremity shifts over time.

---

## Problem Frame

The current `extremity_scorer.py` calls OpenRouter (mistral-small) with a single-dimension prompt asking "how radical is this?" on a 1-5 scale. This conflates two dimensions:

- **Stylistic extremity**: How inflammatory/harsh is the language?
- **Material impact**: How much would this policy actually restrict rights, affect groups, or reshape institutions?

The current scores cannot separate "rude but harmless" from "measured but devastating." The findings report flags this as the primary measurement concern (LLM audit at 75% agreement, systematic overrating of anti-institutional language).

Additionally, the Overton analysis tells us *that* centrist support rose post-2024 but not *which kinds* of right-wing motions drove this shift. Mechanism analysis fills this gap.

---

## Requirements

- R1. Write a project-local skill at `.opencode/skills/score-extremity/` that defines a two-dimensional scoring prompt, JSON output schema, and subagent-spawning workflow. The skill is a durable asset for future LLM analyses.
- R2. Score a stratified sample of 100 right-wing motions (25 per extremity bucket, 1-2 / 2-3 / 3-4 / 4-5) for both stylistic extremity and material impact. Compute correlation between the two dimensions.
- R3. If r ≥ 0.7, treat the single-dimensional scores as directionally usable. If r < 0.7, flag that the two dimensions carry separate information and extend the sample.
- R4. Mechanism analysis: classify the 2,986 right-wing motions by policy mechanism (what specific institutional change the motion proposes) and compute which mechanisms gained the most centrist support post-2024.
- R5. Build the scoring infrastructure test-first. The orchestration layer of each subagent pipeline has unit tests that mock the subagent dispatch.
- R6. Update the findings report with dual-dimension correlation, mechanism analysis results, and refreshed content extremity narrative.

---

## Scope Boundaries

- In scope: Writing the skill, stratified 100-motion sample, mechanism classification, test infrastructure, report update.
- Out of scope: Re-scoring all 2,986 motions (deferred until r is measured). Interactive dashboard. Streamlit UI changes.
- The skill lives at `.opencode/skills/score-extremity/SKILL.md` — one file, no Python dependencies.

---

## Context & Research

### Relevant Code and Patterns

- `analysis/right_wing/extremity_scorer.py` — current single-dimension scoring (prompt template, JSON schema, batch orchestration)
- `analysis/right_wing/direction3_migration_antidemocratic.py` — analysis script pattern (DuckDB queries, matplotlib charts, markdown output)
- `reports/overton_window/findings_report.md` — current report with Section 8 next steps
- `tests/right_wing/` — empty directory, target for new test files

### Institutional Learnings

- `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md` — Step 7 describes 2D rescoring and manual audit
- `docs/solutions/insights/llm-motion-classification-prompt-design.md` — prior work on orthogonal prompt dimensions

### Key Technical Decisions

- **Subagents, not OpenRouter API calls.** deepseek v4 flash subagents score motions natively via reasoning. No API keys, no rate limits, no cost. The orchestrating script spawns subagents via the `task` tool and collects structured JSON.
- **Skill as prompt artifact, not code.** The `.opencode/skills/score-extremity/SKILL.md` defines the scoring prompt, JSON schema, and subagent-spawning instructions in natural language. The orchestrating Python script reads the skill, formats prompts, and spawns subagents.
- **Batch size: 10 motions per subagent.** Each subagent scores 10 motions for both dimensions. 100 motions = 10 subagents. Parallel dispatch via one `task` call per batch.
- **Stratified sample across all 4 extremity buckets.** 25 per bucket from the existing LLM scores. This tests whether the two dimensions diverge more in high-extremity buckets (where inflammatory language may dominate).
- **Mechanism taxonomy derived from the data.** The subagent derives mechanism categories from the motion text (e.g., "detention/removal", "benefit restriction", "institutional bypass", "symbolic/declarative", "rights limitation", "procedural hurdle"). No pre-defined taxonomy.
- **Storage: two DB tables.** `extremity_scores_2d` (motion_id, stylistic_score, material_score, stylistic_rationale, material_rationale) and `motion_mechanisms` (motion_id, mechanism_category, centrist_support_delta). Existing tables unchanged.

### Open Questions

#### Resolved During Planning

- **Q: How to test subagent spawning?** Mock the subagent dispatch layer. The skill produces a JSON contract; test that the orchestrator correctly parses and stores results.
- **Q: Which 100 motions to sample?** Stratified random from the 2,986 classified right-wing motions, 25 per extremity bucket, seeded for reproducibility.
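
The seeded stratified draw could look roughly like the sketch below. It assumes motions are available as dicts with `motion_id` and `extremity_score` keys (hypothetical names), and it assumes half-open buckets [1,2), [2,3), [3,4), [4,5], since the plan names the buckets only as 1-2 / 2-3 / 3-4 / 4-5 and does not say where a boundary score belongs.

```python
import random
from collections import defaultdict

# Assumed bucket edges: [1,2), [2,3), [3,4), [4,5]. The plan does not say
# which bucket a boundary score such as 2.0 falls into.
BUCKET_EDGES = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]


def bucket_of(score):
    """Return the index of the bucket containing score, or None if out of range."""
    for i, (lo, hi) in enumerate(BUCKET_EDGES):
        is_last = i == len(BUCKET_EDGES) - 1
        if lo <= score < hi or (is_last and score == hi):
            return i
    return None


def stratified_sample(motions, n_per_bucket=25, seed=42):
    """Draw up to n_per_bucket motions per bucket using a fixed seed."""
    by_bucket = defaultdict(list)
    for motion in motions:
        idx = bucket_of(motion["extremity_score"])
        if idx is not None:
            by_bucket[idx].append(motion)

    rng = random.Random(seed)  # local RNG, global random state is untouched
    sample = []
    for idx in range(len(BUCKET_EDGES)):
        # Sort first so the draw does not depend on input order.
        pool = sorted(by_bucket[idx], key=lambda m: m["motion_id"])
        k = min(n_per_bucket, len(pool))  # short buckets yield what they have
        sample.extend(rng.sample(pool, k))
    return sample
```

Sorting each pool before sampling keeps the draw reproducible even if the database returns rows in a different order between runs.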

#### Deferred to Implementation

- **Q: Exact mechanism taxonomy** — derived by the subagent from the data, not pre-specified.
- **Q: Whether to extend the sample beyond 100** — depends on the r value between dimensions.

---

## Output Structure

```
.opencode/skills/score-extremity/
    SKILL.md                  # Scoring prompt, JSON schema, subagent workflow

analysis/right_wing/
    extremity_rescore_2d.py    # Orchestrator: reads skill, spawns 10 subagents, collects results
    mechanism_analysis.py      # Mechanism classification + centrist support breakdown

tests/right_wing/
    test_extremity_rescore_2d.py  # Unit tests for orchestrator
    test_mechanism_analysis.py    # Unit tests for mechanism pipeline
```

---

## High-Level Technical Design

> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*

```
┌──────────────────────┐
│ .opencode/skills/     │
│  score-extremity/     │  ← Read by orchestrator
│  SKILL.md             │
│  - Prompt template    │
│  - JSON schema        │
│  - Subagent workflow  │
└──────────┬───────────┘
           │ read
┌──────────▼───────────┐
│ extremity_rescore_2d  │
│ .py (orchestrator)    │
│                       │
│ 1. Query 100 motions  │
│ 2. Format 10 batches  │
│ 3. Spawn 10 subagents │──→ subagent scores 10 motions
│ 4. Collect JSON       │    returns {motion_id: {stylistic_score, material_score}}
│ 5. Validate + store   │
└──────────────────────┘
```

---

## Implementation Units

### U1. Write the Scoring Skill

**Goal:** Create a project-local skill that an orchestrator can read to configure subagent-based two-dimensional extremity scoring.

**Requirements:** R1

**Dependencies:** None

**Files:**
- Create: `.opencode/skills/score-extremity/SKILL.md`

**Approach:**
- YAML frontmatter: `name: score-extremity`, `description: "Two-dimensional extremity scoring for Dutch parliamentary motions. Use when scoring policy radicalism along stylistic vs material impact dimensions."`
- Body: the two-dimensional scoring prompt (in Dutch, matching the existing PROMPT_TEMPLATE style). Define two scores: `stijl_extremiteit` (1–5, inflammatory language) and `materiele_impact` (1–5, substantive rights/policy effect).
- Body: the JSON output schema matching the prompt.
- Body: instructions for how the orchestrator should spawn subagents (batch size 10, parallel dispatch, collect results, validate JSON).
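
As an illustration of the file layout described above, a reader for it might be sketched as follows. `parse_skill` is a hypothetical helper name, and the frontmatter handling is deliberately simplified to flat `key: value` lines; a real implementation would more likely use a YAML parser.

```python
def parse_skill(text):
    """Split a SKILL.md string into (frontmatter dict, body string).

    Simplification: only flat `key: value` lines are handled, which is
    enough for `name` and `description` but not for nested YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("SKILL.md frontmatter is not closed with '---'") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"Cannot parse frontmatter line: {line!r}")
        meta[key.strip()] = value.strip().strip('"')

    for required in ("name", "description"):
        if not meta.get(required):
            raise ValueError(f"Missing required frontmatter field: {required}")

    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```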

**Patterns to follow:**
- Existing PROMPT_TEMPLATE in `analysis/right_wing/extremity_scorer.py` for prompt structure
- `~/.config/opencode/skills/ce-work/SKILL.md` for YAML frontmatter conventions

**Test scenarios:**
- Edge case: skill file has valid YAML frontmatter with required `name` and `description` fields.
- Edge case: skill body contains the expected sections (prompt template, JSON schema, usage instructions).
- Happy path: orchestrator can read the skill file and extract prompt + schema.

**Verification:**
- `opencode` detects the skill at startup (listed in available_skills).
- The skill contains a clear two-dimensional scoring prompt in Dutch.

---

### U2. Build Orchestrator + Subagent Scoring Pipeline (TDD)

**Goal:** Build the orchestrating script that reads the skill, queries 100 motions, spawns subagents, collects and validates results, and stores two-dimensional scores in the database.

**Requirements:** R2, R5

**Dependencies:** U1

**Files:**
- Create: `analysis/right_wing/extremity_rescore_2d.py`
- Create: `tests/right_wing/test_extremity_rescore_2d.py`

**Approach:**
1. **Test-first:** Write tests for the orchestrator before implementation:
   - Test that `load_skill()` returns prompt and schema from SKILL.md
   - Test that `format_batches(motions, batch_size=10)` splits correctly
   - Test that `validate_subagent_result(result, schema)` catches malformed JSON
   - Test that `store_scores(db_path, results)` writes to `extremity_scores_2d` table
   - Mock the subagent dispatch to return synthetic JSON
2. **Implementation:**
   - `load_skill()` — reads `.opencode/skills/score-extremity/SKILL.md`, parses YAML frontmatter, returns body
   - `sample_motions(db_path, n_per_bucket=25, seed=42)` — stratified query from `right_wing_motions` JOIN `extremity_scores`
   - `format_batches()` — groups motions into batches of 10, builds prompts with motion text + layman explanation
   - `spawn_and_collect()` — orchestrator reads the skill, manually formats context for each subagent batch, spawns via `task` tool with return JSON contract
   - `validate_and_store()` — validates each result against the schema, writes to DB
3. **Database:** `CREATE TABLE IF NOT EXISTS extremity_scores_2d (motion_id INTEGER PRIMARY KEY, stylistic_score INTEGER, material_score INTEGER, stylistic_rationale TEXT, material_rationale TEXT)`
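
A minimal sketch of the batching and storage steps follows. It uses the standard-library `sqlite3` module as a stand-in, because the plan does not name the database engine for this table; `INSERT OR REPLACE` is SQLite syntax and may need adapting. Unlike the `store_scores(db_path, results)` signature listed in the tests, this version takes an open connection so it can run against an in-memory database, and the row dict keys are assumed to match the column names.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS extremity_scores_2d (
    motion_id INTEGER PRIMARY KEY,
    stylistic_score INTEGER,
    material_score INTEGER,
    stylistic_rationale TEXT,
    material_rationale TEXT
)
"""


def format_batches(motions, batch_size=10):
    """Split motions into consecutive batches; the last one may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [motions[i:i + batch_size] for i in range(0, len(motions), batch_size)]


def store_scores(conn, rows):
    """Create the table if needed and upsert one row per scored motion."""
    conn.execute(DDL)
    conn.executemany(
        "INSERT OR REPLACE INTO extremity_scores_2d VALUES (?, ?, ?, ?, ?)",
        [
            (
                r["motion_id"],
                r["stylistic_score"],
                r["material_score"],
                r.get("stylistic_rationale"),  # rationales are optional here
                r.get("material_rationale"),
            )
            for r in rows
        ],
    )
    conn.commit()
```

Upserting on `motion_id` means a retried batch overwrites its earlier rows instead of failing on the primary key.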

**Execution note:** Implement test-first. Write failing tests, then implementation.

**Patterns to follow:**
- `analysis/right_wing/extremity_scorer.py` — existing DB write patterns
- `tests/agent_tools/test_database_tools.py` — temp DB fixture patterns

**Test scenarios:**
- Happy path: load_skill returns non-empty prompt and schema.
- Happy path: format_batches with 100 motions produces 10 batches of 10.
- Happy path: validate_and_store with valid JSON inserts 10 rows into extremity_scores_2d.
- Edge case: missing SKILL.md raises clear error.
- Edge case: a bucket with fewer than 25 motions contributes all the motions it has.
- Edge case: subagent returns missing field in JSON — validator rejects.
- Edge case: subagent returns score outside 1–5 range — validator rejects.
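
One way to make these scenarios testable without spawning anything is to inject the dispatch function, so a test can swap in a `unittest.mock.Mock`. The sketch below assumes a hypothetical `score_batches` helper and a payload with a top-level `motions` list; neither name is fixed by this plan.

```python
from unittest.mock import Mock


def score_batches(batches, dispatch):
    """Send each batch through `dispatch` and flatten the returned motions.

    `dispatch` is injected so tests can replace the real subagent call.
    """
    scored = []
    for batch in batches:
        payload = dispatch(batch)
        scored.extend(payload["motions"])
    return scored


def test_score_batches_uses_dispatch_once_per_batch():
    # Synthetic subagent: echoes each motion_id with fixed scores.
    fake_dispatch = Mock(side_effect=lambda batch: {
        "motions": [
            {"motion_id": m["motion_id"], "stijl_extremiteit": 2,
             "materiele_impact": 3, "rationale": "synthetic"}
            for m in batch
        ]
    })
    batches = [[{"motion_id": 1}, {"motion_id": 2}], [{"motion_id": 3}]]

    scored = score_batches(batches, fake_dispatch)

    assert fake_dispatch.call_count == 2
    assert [s["motion_id"] for s in scored] == [1, 2, 3]
```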

**Verification:**
- All tests pass before any subagent is spawned.
- `extremity_scores_2d` table exists with correct schema.
- Orchestrator can be configured with a `--dry-run` flag that validates the pipeline without spawning subagents.
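
The `--dry-run` behaviour might be wired up along these lines. This is a sketch only: `build_parser` and `run` are hypothetical names, and the real orchestrator would do sampling and batching before reaching this point.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Two-dimensional extremity rescoring")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="prepare batches but skip subagent dispatch",
    )
    return parser


def run(argv, batches, dispatch):
    """Dispatch each batch unless --dry-run is set; return collected payloads."""
    args = build_parser().parse_args(argv)
    if args.dry_run:
        print(f"dry run: {len(batches)} batches prepared, nothing dispatched")
        return []
    return [dispatch(batch) for batch in batches]
```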

---

### U3. Execute the 100-Motion Rescoring

**Goal:** Run the orchestrator to score 100 motions, compute the correlation between stylistic and material extremity, and report the results.

**Requirements:** R2, R3

**Dependencies:** U2

**Files:**
- Modify: `analysis/right_wing/extremity_rescore_2d.py` (any fixes from live run)
- Output: `reports/overton_window/extremity_2d_correlation.md`

**Approach:**
1. Run the orchestrator with actual subagent dispatch (no `--dry-run`)
2. Spawn 10 subagents in parallel, each scoring 10 motions
3. Collect all results, validate against schema
4. Compute Pearson r between stylistic_score and material_score
5. Write a short correlation report with:
   - Overall r and per-bucket r
   - Scatter plot of stylistic vs material scores
   - Conclusion: "dimensions are separable" if r < 0.7, "single score sufficient" if r ≥ 0.7
   - Recommendation for next steps (extend sample, re-score all, or proceed)
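
The correlation step and the 0.7 decision rule can be sketched in plain Python as below. The conclusion strings come from the list above; treating r exactly equal to the threshold as "single score sufficient" is an assumption of this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # r is undefined when one dimension has no variation at all.
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def interpret(r, threshold=0.7):
    """Map r to the report conclusion; r == threshold counts as sufficient."""
    return "single score sufficient" if r >= threshold else "dimensions are separable"
```

Per-bucket r on 25 motions can fail the constant-sequence check if a subagent gives every motion in a bucket the same score, so the report code should handle that error explicitly.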

**Technical design:**
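The orchestrator calls the `task` tool with the skill's prompt and each batch's motion data. Each subagent returns JSON using the Dutch field names defined in the skill; the orchestrator maps `stijl_extremiteit` to `stylistic_score` and `materiele_impact` to `material_score` before storing: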
```json
{
  "motions": [
    {"motion_id": 123, "stijl_extremiteit": 3, "materiele_impact": 4, "rationale": "..."}
  ]
}
```
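
A validator for this payload shape might look like the following sketch. `validate_payload` and its `expected_ids` parameter are hypothetical; the field names and the 1 to 5 integer range are taken from the example above and the verification criteria.

```python
REQUIRED_FIELDS = ("motion_id", "stijl_extremiteit", "materiele_impact", "rationale")


def validate_payload(payload, expected_ids):
    """Return a list of error strings; an empty list means the payload is valid."""
    motions = payload.get("motions") if isinstance(payload, dict) else None
    if not isinstance(motions, list):
        return ["payload must be an object with a 'motions' list"]

    errors = []
    seen = set()
    for i, item in enumerate(motions):
        if not isinstance(item, dict):
            errors.append(f"motions[{i}] is not an object")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in item]
        if missing:
            errors.append(f"motions[{i}] is missing {missing}")
            continue
        for field in ("stijl_extremiteit", "materiele_impact"):
            value = item[field]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                errors.append(f"motions[{i}].{field} must be an integer from 1 to 5")
        seen.add(item["motion_id"])

    unscored = set(expected_ids) - seen
    if unscored:
        errors.append(f"no scores returned for motion_ids {sorted(unscored)}")
    return errors
```

Returning a list of errors instead of raising on the first problem gives the retry logic and the logs a full picture of what a subagent got wrong.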

**Verification:**
- 100 motions have both stylistic_score and material_score.
- Correlation report written with clear r value.
- All scores are integers 1–5.

---

### U4. Mechanism Analysis

**Goal:** Classify right-wing motions by policy mechanism and compute which mechanisms gained the most centrist support post-2024.

**Requirements:** R4, R5

**Dependencies:** None (reads existing DB tables)

**Files:**
- Create: `analysis/right_wing/mechanism_analysis.py`
- Create: `tests/right_wing/test_mechanism_analysis.py`
- Output: `reports/overton_window/mechanism_analysis.md`

**Approach:**
1. **Subagent-based classification:** Spawn subagents to classify motions by mechanism. Each subagent receives 25 motions and returns JSON mapping `motion_id -> mechanism_category`. The subagent derives categories from the data (not a pre-defined taxonomy).
2. **Test-first:** Write tests for the orchestration layer (query, batch formatting, table creation, result validation).
3. **Compute centrist support per mechanism:** Using `centrist_support_strict` from `right_wing_motions`, compute pre/post-2024 centrist support and delta per mechanism category.
4. **Report:** Table of mechanism categories ranked by centrist support delta, with N per category. Top-5 mechanisms visualization.
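
The per-mechanism aggregation in step 3 could be sketched as below. Each row is assumed to carry `mechanism_category`, `centrist_support_strict`, and a boolean `is_post` marking post-2024 motions; `is_post` is a hypothetical field, since the plan does not define the exact cutoff. Applying the fewer-than-5 reliability flag to the combined pre and post count is also an assumption.

```python
from collections import defaultdict
from statistics import mean


def support_delta_by_mechanism(rows, min_n=5):
    """Aggregate pre/post centrist support per mechanism, ranked by delta."""
    groups = defaultdict(lambda: {"pre": [], "post": []})
    for row in rows:
        period = "post" if row["is_post"] else "pre"
        groups[row["mechanism_category"]][period].append(row["centrist_support_strict"])

    table = []
    for category, g in groups.items():
        n = len(g["pre"]) + len(g["post"])
        pre = mean(g["pre"]) if g["pre"] else None
        post = mean(g["post"]) if g["post"] else None
        # A delta needs motions on both sides of the cutoff.
        delta = post - pre if pre is not None and post is not None else None
        table.append({
            "mechanism": category,
            "n": n,
            "pre_cs": pre,
            "post_cs": post,
            "delta": delta,
            "unreliable": n < min_n,  # assumed to apply to the combined count
        })

    # Largest delta first; categories without a delta go last.
    table.sort(key=lambda r: (r["delta"] is None, -(r["delta"] or 0.0)))
    return table
```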

**Execution note:** Implement test-first. Mock subagent dispatch in tests.

**Patterns to follow:**
- `analysis/right_wing/direction3_migration_antidemocratic.py` — category breakdown patterns
- `analysis/right_wing/overton_breakpoint_analysis.py` — pre/post comparison patterns

**Test scenarios:**
- Happy path: mechanism analysis script runs on real DB and produces a markdown report.
- Happy path: table has mechanism categories with N, pre-CS, post-CS, delta columns.
- Edge case: subagent returns unknown mechanism category — orchestrator normalizes or flags.
- Edge case: mechanism category with <5 motions flagged as unreliable.

**Verification:**
- `reports/overton_window/mechanism_analysis.md` exists with mechanism breakdown.
- Report includes centrist support delta per mechanism.
- Top mechanism is identified with supporting evidence from motion titles.

---

### U5. Update Findings Report

**Goal:** Integrate dual-dimension correlation and mechanism analysis into the Overton findings report.

**Requirements:** R6

**Dependencies:** U3, U4

**Files:**
- Modify: `reports/overton_window/findings_report.md`

**Approach:**
1. Add a new Section 3b (or update Section 3 Content Extremity) with:
   - Two-dimensional scoring results and correlation
   - Whether the single-dimensional scores are confirmed or need revision
   - Updated content extremity narrative with caveats refined by dual-dimension insight
2. Add a new Section 7 (Mechanism Analysis) with:
   - Which mechanisms drove the centrist support surge
   - Migration vs non-migration mechanism differences
3. Update Section 8 (Next Steps) to reflect completed 2D rescoring and mechanism work

**Verification:**
- Report is internally consistent.
- New sections reference the right figures and tables.
- Next steps don't list work that's already done.

---

## System-Wide Impact

- **New DB tables:** `extremity_scores_2d`, `motion_mechanisms` — additive, no existing data modified.
- **New skill:** `.opencode/skills/score-extremity/SKILL.md` — no code impact, only prompt artifact.
- **No UI changes, no agent_tools changes, no pipeline changes.**
- **Tests:** New tests in `tests/right_wing/` do not affect existing test suite.

---

## Risks & Dependencies

| Risk | Mitigation |
|---|---|
| Subagent capacity limits (too many parallel dispatches) | Batch size 10 means 10 parallel subagents for 100 motions, which is well within limits. If extending to all 2,986 motions, use a hybrid approach (larger batches or fallback to API). |
| Subagent returns malformed JSON | Validator layer rejects and retries individual batches (max 2 retries). |
| Two dimensions correlate highly (r > 0.9) | Confirms the single-dimensional scores are directionally valid. Write this finding up as a confirmatory result — still valuable. |
| Mechanism taxonomy is too coarse to discriminate | Subagent derives from data, not pre-defined taxonomy. Iterative refinement in the subagent prompt if first pass is noisy. |
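
The retry mitigation for malformed JSON could be sketched as follows. `dispatch_with_retries` is a hypothetical helper; `dispatch` and `validate` are injected callables, with `validate` assumed to return a list of error strings that is empty on success.

```python
def dispatch_with_retries(batch, dispatch, validate, max_retries=2):
    """Call dispatch until validate reports no errors or retries run out.

    Returns (payload, attempts). Raises RuntimeError after the final failure.
    """
    last_errors = []
    for attempt in range(1, max_retries + 2):  # 1 initial try + max_retries
        payload = dispatch(batch)
        last_errors = validate(payload)
        if not last_errors:
            return payload, attempt
    raise RuntimeError(
        f"batch failed after {max_retries + 1} attempts: {last_errors}"
    )
```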

---

## Sources & References

- Origin plan: `docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md`
- Findings report: `reports/overton_window/findings_report.md`
- Methodology doc: `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md`
- Existing scorer: `analysis/right_wing/extremity_scorer.py`
- Skill format reference: `~/.config/opencode/skills/ce-work/SKILL.md`