From b6612d834ae552b77de4d31498772630f3e65f56 Mon Sep 17 00:00:00 2001
From: Sven Geboers
Date: Sun, 24 May 2026 23:13:54 +0200
Subject: [PATCH] docs(plan): subagent-based two-dimensional extremity rescoring plan

---
 ...-feat-subagent-extremity-rescoring-plan.md | 346 ++++++++++++++++++
 1 file changed, 346 insertions(+)
 create mode 100644 docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md

diff --git a/docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md b/docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md
new file mode 100644
index 0000000..336478c
--- /dev/null
+++ b/docs/plans/2026-05-08-004-feat-subagent-extremity-rescoring-plan.md
@@ -0,0 +1,346 @@
+---
+title: Subagent-based two-dimensional extremity rescoring and mechanism analysis
+type: feat
+status: active
+date: 2026-05-08
+origin: docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md
+---
+
+# Subagent-Based Two-Dimensional Extremity Rescoring and Mechanism Analysis
+
+## Summary
+
+The current Overton analysis has a known weakness: the LLM extremity score conflates stylistic radicalism (inflammatory language) with material policy impact (rights restricted, groups affected). The manual audit of 20 motions suggested 75% agreement — enough to trust the broad findings but not the fine-grained extremity-stratified analysis. This plan replaces the OpenRouter-based scoring pipeline with project-local subagents (deepseek v4 flash) that score motions via native reasoning. The subagent skill is a durable project asset usable for future LLM analyses. Additionally, we analyze which specific types of right-wing motions gained centrist support post-2024 (mechanism analysis), and compare content extremity shifts over time with the new dual-dimension scores.
+
+---
+
+## Problem Frame
+
+The current `extremity_scorer.py` calls OpenRouter (mistral-small) with a single-dimension prompt asking "how radical is this?" on a 1-5 scale.
+This conflates two dimensions:
+
+- **Stylistic extremity**: How inflammatory/harsh is the language?
+- **Material impact**: How much would this policy actually restrict rights, affect groups, or reshape institutions?
+
+The current scores cannot separate "rude but harmless" from "measured but devastating." The findings report flags this as the primary measurement concern (LLM audit at 75% agreement, systematic overrating of anti-institutional language).
+
+Additionally, the Overton analysis tells us *that* centrist support rose post-2024 but not *which kinds* of right-wing motions drove this shift. Mechanism analysis fills this gap.
+
+---
+
+## Requirements
+
+- R1. Write a project-local skill at `.opencode/skills/score-extremity/` that defines a two-dimensional scoring prompt, JSON output schema, and subagent-spawning workflow. The skill is a durable asset for future LLM analyses.
+- R2. Score a stratified sample of 100 right-wing motions (25 per extremity bucket, 1-2 / 2-3 / 3-4 / 4-5) for both stylistic extremity and material impact. Compute correlation between the two dimensions.
+- R3. If r > 0.7, confirm the single-dimensional scores are directionally usable. If r < 0.7, flag that separate dimensions matter and extend the sample.
+- R4. Mechanism analysis: classify the 2,986 right-wing motions by policy mechanism (what specific institutional change the motion proposes) and compute which mechanisms gained the most centrist support post-2024.
+- R5. Build the scoring infrastructure test-first. Each scoring subagent and the orchestration layer have unit tests mocking the subagent dispatch.
+- R6. Update the findings report with dual-dimension correlation, mechanism analysis results, and refreshed content extremity narrative.
+
+---
+
+## Scope Boundaries
+
+- In scope: Writing the skill, stratified 100-motion sample, mechanism classification, test infrastructure, report update.
+- Out of scope: Re-scoring all 2,986 motions (deferred until r is measured).
+Interactive dashboard. Streamlit UI changes.
+- The skill lives at `.opencode/skills/score-extremity/SKILL.md` — one file, no Python dependencies.
+
+---
+
+## Context & Research
+
+### Relevant Code and Patterns
+
+- `analysis/right_wing/extremity_scorer.py` — current single-dimension scoring (prompt template, JSON schema, batch orchestration)
+- `analysis/right_wing/direction3_migration_antidemocratic.py` — analysis script pattern (DuckDB queries, matplotlib charts, markdown output)
+- `reports/overton_window/findings_report.md` — current report with Section 8 next steps
+- `tests/right_wing/` — empty directory, target for new test files
+
+### Institutional Learnings
+
+- `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md` — Step 7 describes 2D rescoring and manual audit
+- `docs/solutions/insights/llm-motion-classification-prompt-design.md` — prior work on orthogonal prompt dimensions
+
+### Key Technical Decisions
+
+- **Subagents, not OpenRouter API calls.** deepseek v4 flash subagents score motions natively via reasoning. No API keys, no rate limits, no cost. The orchestrating script spawns subagents via the `task` tool and collects structured JSON.
+- **Skill as prompt artifact, not code.** The `.opencode/skills/score-extremity/SKILL.md` defines the scoring prompt, JSON schema, and subagent-spawning instructions in natural language. The orchestrating Python script reads the skill, formats prompts, and spawns subagents.
+- **Batch size: 10 motions per subagent.** Each subagent scores 10 motions for both dimensions. 100 motions = 10 subagents. Parallel dispatch via one `task` call per batch.
+- **Stratified sample across all 4 extremity buckets.** 25 per bucket from the existing LLM scores. This tests whether the two dimensions diverge more in high-extremity buckets (where inflammatory language may dominate).
+- **Mechanism taxonomy derived from the data.** The subagent derives mechanism categories from the motion text (e.g., "detention/removal", "benefit restriction", "institutional bypass", "symbolic/declarative", "rights limitation", "procedural hurdle"). No pre-defined taxonomy.
+- **Storage: two DB tables.** `extremity_scores_2d` (motion_id, stylistic_score, material_score, stylistic_rationale, material_rationale) and `motion_mechanisms` (motion_id, mechanism_category, centrist_support_delta). Existing tables unchanged.
+
+### Open Questions
+
+#### Resolved During Planning
+
+- **Q: How to test subagent spawning?** Mock the subagent dispatch layer. The skill produces a JSON contract; test that the orchestrator correctly parses and stores results.
+- **Q: Which 100 motions to sample?** Stratified random from the 2,986 classified right-wing motions, 25 per extremity bucket, seeded for reproducibility.
+
+#### Deferred to Implementation
+
+- **Q: Exact mechanism taxonomy** — derived by the subagent from the data, not pre-specified.
+- **Q: Whether to extend the sample beyond 100** — depends on the r value between dimensions.
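
The sampling decision resolved above (25 motions per extremity bucket, seeded for reproducibility) could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned `sample_motions()` implementation: the in-memory `(motion_id, score)` pairs stand in for the `right_wing_motions` JOIN `extremity_scores` query, the names `BUCKETS`, `bucket_of`, and `stratified_sample` are hypothetical, and the half-open bucket edges are an assumption because the plan does not say which bucket a boundary score such as 2.0 belongs to.

```python
import random
from collections import defaultdict

# Assumed bucket edges: [1,2), [2,3), [3,4), [4,5]. The plan names the buckets
# "1-2 / 2-3 / 3-4 / 4-5" but does not assign boundary scores to a side.
BUCKETS = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]


def bucket_of(score):
    """Return the index of the bucket containing `score`, or None if out of range."""
    for i, (lo, hi) in enumerate(BUCKETS):
        is_last = i == len(BUCKETS) - 1
        if lo <= score < hi or (is_last and score == hi):
            return i
    return None


def stratified_sample(scored_motions, n_per_bucket=25, seed=42):
    """Pick up to `n_per_bucket` motion ids per bucket, reproducibly."""
    by_bucket = defaultdict(list)
    for motion_id, score in scored_motions:
        idx = bucket_of(score)
        if idx is not None:
            by_bucket[idx].append(motion_id)

    rng = random.Random(seed)  # local RNG, so global random state is untouched
    sample = {}
    for idx in range(len(BUCKETS)):
        ids = sorted(by_bucket[idx])  # sort first so input order cannot change the draw
        k = min(n_per_bucket, len(ids))  # small bucket: take what is available
        sample[idx] = rng.sample(ids, k)
    return sample
```

Sorting each bucket before drawing means the same seed gives the same sample even if the database returns rows in a different order, which is one plausible reading of "seeded for reproducibility".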
+
+---
+
+## Output Structure
+
+```
+.opencode/skills/score-extremity/
+  SKILL.md                        # Scoring prompt, JSON schema, subagent workflow
+
+analysis/right_wing/
+  extremity_rescore_2d.py         # Orchestrator: reads skill, spawns 10 subagents, collects results
+  mechanism_analysis.py           # Mechanism classification + centrist support breakdown
+
+tests/right_wing/
+  test_extremity_rescore_2d.py    # Unit tests for orchestrator
+  test_mechanism_analysis.py      # Unit tests for mechanism pipeline
+```
+
+---
+
+## High-Level Technical Design
+
+> *This illustrates the intended approach and is directional guidance for review, not implementation specification.*
+
+```
+┌──────────────────────┐
+│ .opencode/skills/    │
+│ score-extremity/     │ ← Read by orchestrator
+│ SKILL.md             │
+│ - Prompt template    │
+│ - JSON schema        │
+│ - Subagent workflow  │
+└──────────┬───────────┘
+           │ read
+┌──────────▼───────────┐
+│ extremity_rescore_2d │
+│ .py (orchestrator)   │
+│                      │
+│ 1. Query 100 motions │
+│ 2. Format 10 batches │
+│ 3. Spawn 10 subagents │──→ subagent scores 10 motions
+│ 4. Collect JSON      │     returns {motion_id: {stylistic_score, material_score}}
+│ 5. Validate + store  │
+└──────────────────────┘
+```
+
+---
+
+## Implementation Units
+
+### U1. Write the Scoring Skill
+
+**Goal:** Create a project-local skill that an orchestrator can read to configure subagent-based two-dimensional extremity scoring.
+
+**Requirements:** R1
+
+**Dependencies:** None
+
+**Files:**
+- Create: `.opencode/skills/score-extremity/SKILL.md`
+
+**Approach:**
+- YAML frontmatter: `name: score-extremity`, `description: "Two-dimensional extremity scoring for Dutch parliamentary motions. Use when scoring policy radicalism along stylistic vs material impact dimensions."`
+- Body: the two-dimensional scoring prompt (in Dutch, matching the existing PROMPT_TEMPLATE style). Define two scores: `stijl_extremiteit` (1–5, inflammatory language) and `materiele_impact` (1–5, substantive rights/policy effect).
+- Body: the JSON output schema matching the prompt.
+- Body: instructions for how the orchestrator should spawn subagents (batch size 10, parallel dispatch, collect results, validate JSON).
+
+**Patterns to follow:**
+- Existing PROMPT_TEMPLATE in `analysis/right_wing/extremity_scorer.py` for prompt structure
+- `~/.config/opencode/skills/ce-work/SKILL.md` for YAML frontmatter conventions
+
+**Test scenarios:**
+- Edge case: skill file has valid YAML frontmatter with required `name` and `description` fields.
+- Edge case: skill body contains the expected sections (prompt template, JSON schema, usage instructions).
+- Happy path: orchestrator can read the skill file and extract prompt + schema.
+
+**Verification:**
+- `opencode` detects the skill at startup (listed in available_skills).
+- The skill contains a clear two-dimensional scoring prompt in Dutch.
+
+---
+
+### U2. Build Orchestrator + Subagent Scoring Pipeline (TDD)
+
+**Goal:** Build the orchestrating script that reads the skill, queries 100 motions, spawns subagents, collects and validates results, and stores two-dimensional scores in the database.
+
+**Requirements:** R2, R5
+
+**Dependencies:** U1
+
+**Files:**
+- Create: `analysis/right_wing/extremity_rescore_2d.py`
+- Create: `tests/right_wing/test_extremity_rescore_2d.py`
+
+**Approach:**
+1. **Test-first:** Write tests for the orchestrator before implementation:
+   - Test that `load_skill()` returns prompt and schema from SKILL.md
+   - Test that `format_batches(motions, batch_size=10)` splits correctly
+   - Test that `validate_subagent_result(result, schema)` catches malformed JSON
+   - Test that `store_scores(db_path, results)` writes to `extremity_scores_2d` table
+   - Mock the subagent dispatch to return synthetic JSON
+2.
+**Implementation:**
+   - `load_skill()` — reads `.opencode/skills/score-extremity/SKILL.md`, parses YAML frontmatter, returns body
+   - `sample_motions(db_path, n_per_bucket=25, seed=42)` — stratified query from `right_wing_motions` JOIN `extremity_scores`
+   - `format_batches()` — groups motions into batches of 10, builds prompts with motion text + layman explanation
+   - `spawn_and_collect()` — orchestrator reads the skill, manually formats context for each subagent batch, spawns via `task` tool with return JSON contract
+   - `validate_and_store()` — validates each result against the schema, writes to DB
+3. **Database:** `CREATE TABLE IF NOT EXISTS extremity_scores_2d (motion_id INTEGER PRIMARY KEY, stylistic_score INTEGER, material_score INTEGER, stylistic_rationale TEXT, material_rationale TEXT)`
+
+**Execution note:** Implement test-first. Write failing tests, then implementation.
+
+**Patterns to follow:**
+- `analysis/right_wing/extremity_scorer.py` — existing DB write patterns
+- `tests/agent_tools/test_database_tools.py` — temp DB fixture patterns
+
+**Test scenarios:**
+- Happy path: load_skill returns non-empty prompt and schema.
+- Happy path: format_batches with 100 motions produces 10 batches of 10.
+- Happy path: validate_and_store with valid JSON inserts 10 rows into extremity_scores_2d.
+- Edge case: missing SKILL.md raises clear error.
+- Edge case: a bucket with fewer than 25 motions contributes only what is available.
+- Edge case: subagent returns missing field in JSON — validator rejects.
+- Edge case: subagent returns score outside 1–5 range — validator rejects.
+
+**Verification:**
+- All tests pass before any subagent is spawned.
+- `extremity_scores_2d` table exists with correct schema.
+- Orchestrator can be configured with a `--dry-run` flag that validates the pipeline without spawning subagents.
+
+---
+
+### U3.
+Execute the 100-Motion Rescoring
+
+**Goal:** Run the orchestrator to score 100 motions, compute the correlation between stylistic and material extremity, and report the results.
+
+**Requirements:** R2, R3
+
+**Dependencies:** U2
+
+**Files:**
+- Modify: `analysis/right_wing/extremity_rescore_2d.py` (any fixes from live run)
+- Output: `reports/overton_window/extremity_2d_correlation.md`
+
+**Approach:**
+1. Run the orchestrator with actual subagent dispatch (no `--dry-run`)
+2. Spawn 10 subagents in parallel, each scoring 10 motions
+3. Collect all results, validate against schema
+4. Compute Pearson r between stylistic_score and material_score
+5. Write a short correlation report with:
+   - Overall r and per-bucket r
+   - Scatter plot of stylistic vs material scores
+   - Conclusion: "dimensions are separable" if r < 0.7, "single score sufficient" if r > 0.7
+   - Recommendation for next steps (extend sample, re-score all, or proceed)
+
+**Technical design:**
+The orchestrator calls the `task` tool with the skill's prompt and each batch's motion data. Each subagent returns:
+```json
+{
+  "motions": [
+    {"motion_id": 123, "stijl_extremiteit": 3, "materiele_impact": 4, "rationale": "..."}
+  ]
+}
+```
+
+**Verification:**
+- 100 motions have both stylistic_score and material_score.
+- Correlation report written with clear r value.
+- All scores are integers 1–5.
+
+---
+
+### U4. Mechanism Analysis
+
+**Goal:** Classify right-wing motions by policy mechanism and compute which mechanisms gained the most centrist support post-2024.
+
+**Requirements:** R4, R5
+
+**Dependencies:** None (reads existing DB tables)
+
+**Files:**
+- Create: `analysis/right_wing/mechanism_analysis.py`
+- Create: `tests/right_wing/test_mechanism_analysis.py`
+- Output: `reports/overton_window/mechanism_analysis.md`
+
+**Approach:**
+1. **Subagent-based classification:** Spawn subagents to classify motions by mechanism.
+Each subagent receives 25 motions and returns JSON mapping `motion_id -> mechanism_category`. The subagent derives categories from the data (not a pre-defined taxonomy).
+2. **Test-first:** Write tests for the orchestration layer (query, batch formatting, table creation, result validation).
+3. **Compute centrist support per mechanism:** Using `centrist_support_strict` from `right_wing_motions`, compute pre/post-2024 centrist support and delta per mechanism category.
+4. **Report:** Table of mechanism categories ranked by centrist support delta, with N per category. Top-5 mechanisms visualization.
+
+**Execution note:** Implement test-first. Mock subagent dispatch in tests.
+
+**Patterns to follow:**
+- `analysis/right_wing/direction3_migration_antidemocratic.py` — category breakdown patterns
+- `analysis/right_wing/overton_breakpoint_analysis.py` — pre/post comparison patterns
+
+**Test scenarios:**
+- Happy path: mechanism analysis script runs on real DB and produces a markdown report.
+- Happy path: table has mechanism categories with N, pre-CS, post-CS, delta columns.
+- Edge case: subagent returns unknown mechanism category — orchestrator normalizes or flags.
+- Edge case: mechanism category with <5 motions flagged as unreliable.
+
+**Verification:**
+- `reports/overton_window/mechanism_analysis.md` exists with mechanism breakdown.
+- Report includes centrist support delta per mechanism.
+- Top mechanism is identified with supporting evidence from motion titles.
+
+---
+
+### U5. Update Findings Report
+
+**Goal:** Integrate dual-dimension correlation and mechanism analysis into the Overton findings report.
+
+**Requirements:** R6
+
+**Dependencies:** U3, U4
+
+**Files:**
+- Modify: `reports/overton_window/findings_report.md`
+
+**Approach:**
+1.
+Add a new Section 3b (or update Section 3 Content Extremity) with:
+   - Two-dimensional scoring results and correlation
+   - Whether the single-dimensional scores are confirmed or need revision
+   - Updated content extremity narrative with caveats refined by dual-dimension insight
+2. Add a new Section 7 (Mechanism Analysis) with:
+   - Which mechanisms drove the centrist support surge
+   - Migration vs non-migration mechanism differences
+3. Update Section 8 (Next Steps) to reflect completed 2D rescoring and mechanism work
+
+**Verification:**
+- Report is internally consistent.
+- New sections reference the right figures and tables.
+- Next steps don't list work that's already done.
+
+---
+
+## System-Wide Impact
+
+- **New DB tables:** `extremity_scores_2d`, `motion_mechanisms` — additive, no existing data modified.
+- **New skill:** `.opencode/skills/score-extremity/SKILL.md` — no code impact, only prompt artifact.
+- **No UI changes, no agent_tools changes, no pipeline changes.**
+- **Tests:** New tests in `tests/right_wing/` do not affect existing test suite.
+
+---
+
+## Risks & Dependencies
+
+| Risk | Mitigation |
+|---|---|
+| Subagent capacity limits (too many parallel dispatches) | Batch size 10 = 10 parallel subagents. Well within limits for 100 motions. If extending to 2,986, use hybrid approach (larger batches or fallback to API). |
+| Subagent returns malformed JSON | Validator layer rejects and retries individual batches (max 2 retries). |
+| Two dimensions correlate highly (r > 0.9) | Confirms the single-dimensional scores are directionally valid. Write this finding up as a confirmatory result — still valuable. |
+| Mechanism taxonomy is too coarse to discriminate | Subagent derives from data, not pre-defined taxonomy. Iterative refinement in the subagent prompt if first pass is noisy.
+|
+
+---
+
+## Sources & References
+
+- Origin plan: `docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md`
+- Findings report: `reports/overton_window/findings_report.md`
+- Methodology doc: `docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md`
+- Existing scorer: `analysis/right_wing/extremity_scorer.py`
+- Skill format reference: `~/.config/opencode/skills/ce-work/SKILL.md`
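
As a closing illustration, the validation contract from U2 and U3 (reject a missing field, reject a score outside 1–5) and the "max 2 retries" mitigation from the risk table might look roughly like the sketch below. Everything here is an assumption made for illustration: `validate_batch`, `score_with_retries`, and the `dispatch` callable are hypothetical names, `dispatch` stands in for spawning a subagent via the `task` tool, and the required fields are taken from the example JSON in U3 rather than from a finished SKILL.md schema.

```python
# Field names follow the example subagent response shown in U3.
REQUIRED_FIELDS = ("motion_id", "stijl_extremiteit", "materiele_impact", "rationale")
SCORE_FIELDS = ("stijl_extremiteit", "materiele_impact")


def validate_batch(payload, expected_ids):
    """Return a list of error strings; an empty list means the batch is valid."""
    errors = []
    motions = payload.get("motions") if isinstance(payload, dict) else None
    if not isinstance(motions, list):
        return ["payload must be an object with a 'motions' list"]

    seen = set()
    for i, item in enumerate(motions):
        if not isinstance(item, dict):
            errors.append(f"motions[{i}]: not an object")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in item]
        if missing:
            errors.append(f"motions[{i}]: missing {missing}")
            continue
        for field in SCORE_FIELDS:
            value = item[field]
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                errors.append(f"motions[{i}].{field}: expected integer 1-5, got {value!r}")
        seen.add(item["motion_id"])

    unscored = set(expected_ids) - seen
    if unscored:
        errors.append(f"no scores returned for motion ids {sorted(unscored)}")
    return errors


def score_with_retries(dispatch, batch_ids, max_retries=2):
    """Call `dispatch` until its result validates, at most 1 + max_retries times."""
    errors = []
    for _ in range(1 + max_retries):
        payload = dispatch(batch_ids)
        errors = validate_batch(payload, batch_ids)
        if not errors:
            return payload
    raise ValueError(f"batch failed validation after retries: {errors}")
```

Because `dispatch` is just a callable, a unit test can pass in a function that returns synthetic JSON, which matches the plan's intent to mock the subagent dispatch layer.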