title	type	status	date	origin
Subagent-based two-dimensional extremity rescoring and mechanism analysis	feat	active	2026-05-08	docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md

Subagent-Based Two-Dimensional Extremity Rescoring and Mechanism Analysis

Summary

The current Overton analysis has a known weakness: the LLM extremity score conflates stylistic radicalism (inflammatory language) with material policy impact (rights restricted, groups affected). The manual audit of 20 motions suggested 75% agreement — enough to trust the broad findings but not the fine-grained extremity-stratified analysis. This plan replaces the OpenRouter-based scoring pipeline with project-local subagents (deepseek v4 flash) that score motions via native reasoning. The subagent skill is a durable project asset usable for future LLM analyses. Additionally, we analyze which specific types of right-wing motions gained centrist support post-2024 (mechanism analysis), and compare content extremity shifts over time with the new dual-dimension scores.

Problem Frame

The current extremity_scorer.py calls OpenRouter (mistral-small) with a single-dimension prompt asking "how radical is this?" on a 1-5 scale. This conflates two dimensions:

Stylistic extremity: How inflammatory/harsh is the language?
Material impact: How much would this policy actually restrict rights, affect groups, or reshape institutions?

The current scores cannot separate "rude but harmless" from "measured but devastating." The findings report flags this as the primary measurement concern (LLM audit at 75% agreement, systematic overrating of anti-institutional language).

Additionally, the Overton analysis tells us that centrist support rose post-2024 but not which kinds of right-wing motions drove this shift. Mechanism analysis fills this gap.

Requirements

R1. Write a project-local skill at .opencode/skills/score-extremity/ that defines a two-dimensional scoring prompt, JSON output schema, and subagent-spawning workflow. The skill is a durable asset for future LLM analyses.
R2. Score a stratified sample of 100 right-wing motions (25 per extremity bucket, 1-2 / 2-3 / 3-4 / 4-5) for both stylistic extremity and material impact. Compute correlation between the two dimensions.
R3. If r > 0.7, confirm the single-dimensional scores are directionally usable. If r < 0.7, flag that separate dimensions matter and extend the sample.
R4. Mechanism analysis: classify the 2,986 right-wing motions by policy mechanism (what specific institutional change the motion proposes) and compute which mechanisms gained the most centrist support post-2024.
R5. Build the scoring infrastructure test-first. The orchestration layer for each subagent workflow (extremity scoring and mechanism classification) has unit tests that mock the subagent dispatch.
R6. Update the findings report with dual-dimension correlation, mechanism analysis results, and refreshed content extremity narrative.

Scope Boundaries

In scope: Writing the skill, stratified 100-motion sample, mechanism classification, test infrastructure, report update.
Out of scope: Re-scoring all 2,986 motions (deferred until r is measured). Interactive dashboard. Streamlit UI changes.
The skill lives at .opencode/skills/score-extremity/SKILL.md — one file, no Python dependencies.

Context & Research

Relevant Code and Patterns

analysis/right_wing/extremity_scorer.py — current single-dimension scoring (prompt template, JSON schema, batch orchestration)
analysis/right_wing/direction3_migration_antidemocratic.py — analysis script pattern (DuckDB queries, matplotlib charts, markdown output)
reports/overton_window/findings_report.md — current report with Section 8 next steps
tests/right_wing/ — empty directory, target for new test files

Institutional Learnings

docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md — Step 7 describes 2D rescoring and manual audit
docs/solutions/insights/llm-motion-classification-prompt-design.md — prior work on orthogonal prompt dimensions

Key Technical Decisions

Subagents, not OpenRouter API calls. deepseek v4 flash subagents score motions natively via reasoning. No API keys, no rate limits, no cost. The orchestrating script spawns subagents via the task tool and collects structured JSON.
Skill as prompt artifact, not code. The .opencode/skills/score-extremity/SKILL.md defines the scoring prompt, JSON schema, and subagent-spawning instructions in natural language. The orchestrating Python script reads the skill, formats prompts, and spawns subagents.
Batch size: 10 motions per subagent. Each subagent scores 10 motions for both dimensions. 100 motions = 10 subagents. Parallel dispatch via one task call per batch.
Stratified sample across all 4 extremity buckets. 25 per bucket from the existing LLM scores. This tests whether the two dimensions diverge more in high-extremity buckets (where inflammatory language may dominate).
Mechanism taxonomy derived from the data. The subagent derives mechanism categories from the motion text (e.g., "detention/removal", "benefit restriction", "institutional bypass", "symbolic/declarative", "rights limitation", "procedural hurdle"). No pre-defined taxonomy.
Storage: two DB tables. extremity_scores_2d (motion_id, stylistic_score, material_score, stylistic_rationale, material_rationale) and motion_mechanisms (motion_id, mechanism_category, centrist_support_delta). Existing tables unchanged.
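
The batch-size decision above can be sketched as a plain splitting helper. This is a minimal illustration, not the planned implementation: the planned format_batches() also builds prompts, which is omitted here, and the dictionary shape of a motion is an assumption.

```python
from typing import Sequence


def format_batches(motions: Sequence[dict], batch_size: int = 10) -> list[list[dict]]:
    """Split motions into consecutive batches; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(motions[i:i + batch_size]) for i in range(0, len(motions), batch_size)]


# Hypothetical motion records; only the count matters for this illustration.
motions = [{"motion_id": i} for i in range(100)]
batches = format_batches(motions)
print(len(batches), "batches, first batch size", len(batches[0]))
```

With 100 motions and the default batch size, one subagent is dispatched per batch, which gives the 10 parallel subagents named above.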

Open Questions

Resolved During Planning

Q: How to test subagent spawning? Mock the subagent dispatch layer. The skill produces a JSON contract; test that the orchestrator correctly parses and stores results.
Q: Which 100 motions to sample? Stratified random from the 2,986 classified right-wing motions, 25 per extremity bucket, seeded for reproducibility.
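
A seeded stratified sample of this kind could look like the sketch below. It works on in-memory (motion_id, score) pairs instead of a database query, and the half-open bucket boundaries are an assumption, since the plan lists the buckets as 1-2 / 2-3 / 3-4 / 4-5 without saying where a boundary score falls.

```python
import random
from collections import defaultdict


def bucket_of(score: float) -> str:
    """Assign a 1-5 score to a bucket (assumed half-open: [1,2), [2,3), [3,4), [4,5])."""
    if score < 2:
        return "1-2"
    if score < 3:
        return "2-3"
    if score < 4:
        return "3-4"
    return "4-5"


def sample_stratified(scored, n_per_bucket: int = 25, seed: int = 42) -> dict:
    """Draw up to n_per_bucket motion ids per bucket, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for motion_id, score in scored:
        by_bucket[bucket_of(score)].append(motion_id)
    sample = {}
    for bucket in sorted(by_bucket):
        ids = sorted(by_bucket[bucket])  # fixed order so the seed fully determines the draw
        sample[bucket] = rng.sample(ids, min(n_per_bucket, len(ids)))
    return sample
```

Sorting the ids before sampling keeps the result independent of the order in which rows arrive, so the same seed reproduces the same 100 motions.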

Deferred to Implementation

Q: Exact mechanism taxonomy — derived by the subagent from the data, not pre-specified.
Q: Whether to extend the sample beyond 100 — depends on the r value between dimensions.

Output Structure

.opencode/skills/score-extremity/
    SKILL.md                  # Scoring prompt, JSON schema, subagent workflow

analysis/right_wing/
    extremity_rescore_2d.py    # Orchestrator: reads skill, spawns 10 subagents, collects results
    mechanism_analysis.py      # Mechanism classification + centrist support breakdown

tests/right_wing/
    test_extremity_rescore_2d.py  # Unit tests for orchestrator
    test_mechanism_analysis.py    # Unit tests for mechanism pipeline

High-Level Technical Design

This illustrates the intended approach and is directional guidance for review, not implementation specification.

┌──────────────────────┐
│ .opencode/skills/     │
│  score-extremity/     │  ← Read by orchestrator
│  SKILL.md             │
│  - Prompt template    │
│  - JSON schema        │
│  - Subagent workflow  │
└──────────┬───────────┘
           │ read
┌──────────▼───────────┐
│ extremity_rescore_2d  │
│ .py (orchestrator)    │
│                       │
│ 1. Query 100 motions  │
│ 2. Format 10 batches  │
│ 3. Spawn 10 subagents │──→ subagent scores 10 motions
│ 4. Collect JSON       │    returns {motion_id: {stylistic_score, material_score}}
│ 5. Validate + store   │
└──────────────────────┘

Implementation Units

U1. Write the Scoring Skill

Goal: Create a project-local skill that an orchestrator can read to configure subagent-based two-dimensional extremity scoring.

Requirements: R1

Dependencies: None

Files:

Create: .opencode/skills/score-extremity/SKILL.md

Approach:

YAML frontmatter: name: score-extremity, description: "Two-dimensional extremity scoring for Dutch parliamentary motions. Use when scoring policy radicalism along stylistic vs material impact dimensions."
Body: the two-dimensional scoring prompt (in Dutch, matching the existing PROMPT_TEMPLATE style). Define two scores: stijl_extremiteit (1–5, inflammatory language) and materiele_impact (1–5, substantive rights/policy effect).
Body: the JSON output schema matching the prompt.
Body: instructions for how the orchestrator should spawn subagents (batch size 10, parallel dispatch, collect results, validate JSON).
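
To make the file layout concrete, here is a sketch of how a reader could separate the frontmatter from the body. It assumes the frontmatter is a flat list of key: value lines between two '---' markers and deliberately avoids a YAML library; parse_skill is a hypothetical helper name, not something the plan defines.

```python
def parse_skill(text: str) -> tuple[dict, str]:
    """Return (frontmatter fields, body) from SKILL.md content."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a '---' frontmatter line")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed with '---'") from None
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            raise ValueError(f"cannot parse frontmatter line: {line!r}")
        meta[key.strip()] = value.strip().strip('"')
    for required in ("name", "description"):
        if not meta.get(required):
            raise ValueError(f"frontmatter is missing required field: {required}")
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


EXAMPLE = """---
name: score-extremity
description: "Two-dimensional extremity scoring for Dutch parliamentary motions."
---
Prompt body goes here.
"""
meta, body = parse_skill(EXAMPLE)
print(meta["name"])
```

A real skill file with nested or multi-line YAML values would need a proper YAML parser; this sketch only covers the two flat fields listed above.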

Patterns to follow:

Existing PROMPT_TEMPLATE in analysis/right_wing/extremity_scorer.py for prompt structure
~/.config/opencode/skills/ce-work/SKILL.md for YAML frontmatter conventions

Test scenarios:

Edge case: skill file has valid YAML frontmatter with required name and description fields.
Edge case: skill body contains the expected sections (prompt template, JSON schema, usage instructions).
Happy path: orchestrator can read the skill file and extract prompt + schema.

Verification:

opencode detects the skill at startup (listed in available_skills).
The skill contains a clear two-dimensional scoring prompt in Dutch.

U2. Build Orchestrator + Subagent Scoring Pipeline (TDD)

Goal: Build the orchestrating script that reads the skill, queries 100 motions, spawns subagents, collects and validates results, and stores two-dimensional scores in the database.

Requirements: R2, R5

Dependencies: U1

Files:

Create: analysis/right_wing/extremity_rescore_2d.py
Create: tests/right_wing/test_extremity_rescore_2d.py

Approach:

Test-first: Write tests for the orchestrator before implementation:
- Test that load_skill() returns prompt and schema from SKILL.md
- Test that format_batches(motions, batch_size=10) splits correctly
- Test that validate_subagent_result(result, schema) catches malformed JSON
- Test that store_scores(db_path, results) writes to extremity_scores_2d table
- Mock the subagent dispatch to return synthetic JSON
Implementation:
- load_skill() — reads .opencode/skills/score-extremity/SKILL.md, parses YAML frontmatter, returns body
- sample_motions(db_path, n_per_bucket=25, seed=42) — stratified query from right_wing_motions JOIN extremity_scores
- format_batches() — groups motions into batches of 10, builds prompts with motion text + layman explanation
- spawn_and_collect() — orchestrator reads the skill, manually formats context for each subagent batch, spawns via task tool with return JSON contract
- validate_and_store() — validates each result against the schema via validate_subagent_result(), then writes accepted rows to the DB via store_scores()
Database: CREATE TABLE IF NOT EXISTS extremity_scores_2d (motion_id INTEGER PRIMARY KEY, stylistic_score INTEGER, material_score INTEGER, stylistic_rationale TEXT, material_rationale TEXT)
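
The table definition above can be exercised against an in-memory database. The sketch below uses the standard-library sqlite3 module purely as a stand-in for the project database, and it takes an open connection instead of the db_path argument named in the plan; both choices are assumptions made for illustration.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS extremity_scores_2d (
    motion_id INTEGER PRIMARY KEY,
    stylistic_score INTEGER,
    material_score INTEGER,
    stylistic_rationale TEXT,
    material_rationale TEXT
)
"""


def store_scores(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Create the table if needed, upsert the rows, and return the table's row count."""
    conn.execute(DDL)
    conn.executemany(
        # INSERT OR REPLACE is SQLite syntax; re-running a batch overwrites instead of failing.
        "INSERT OR REPLACE INTO extremity_scores_2d VALUES (?, ?, ?, ?, ?)",
        [
            (
                row["motion_id"],
                row["stylistic_score"],
                row["material_score"],
                row.get("stylistic_rationale"),
                row.get("material_rationale"),
            )
            for row in rows
        ],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM extremity_scores_2d").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [
    {"motion_id": 1, "stylistic_score": 3, "material_score": 4},
    {"motion_id": 2, "stylistic_score": 2, "material_score": 2},
]
print(store_scores(conn, rows))
```

Keying on motion_id means a retried batch replaces its earlier rows, which keeps the table at one row per motion.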

Execution note: Implement test-first. Write failing tests, then implementation.

Patterns to follow:

analysis/right_wing/extremity_scorer.py — existing DB write patterns
tests/agent_tools/test_database_tools.py — temp DB fixture patterns

Test scenarios:

Happy path: load_skill returns non-empty prompt and schema.
Happy path: format_batches with 100 motions produces 10 batches of 10.
Happy path: validate_and_store with valid JSON inserts 10 rows into extremity_scores_2d.
Edge case: missing SKILL.md raises clear error.
Edge case: a bucket with fewer than 25 motions samples what's available.
Edge case: subagent returns missing field in JSON — validator rejects.
Edge case: subagent returns score outside 1–5 range — validator rejects.
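
The two validator edge cases above could be covered by a check along these lines. It is a sketch with the required fields hard-coded; the planned validate_subagent_result(result, schema) takes a schema argument, which is omitted here, and the field names follow the subagent JSON contract used elsewhere in this plan.

```python
SCORE_FIELDS = ("stijl_extremiteit", "materiele_impact")


def validate_subagent_result(result) -> list[str]:
    """Return a list of problems; an empty list means the result is acceptable."""
    if not isinstance(result, dict) or not isinstance(result.get("motions"), list):
        return ["result must be an object with a 'motions' list"]
    errors = []
    for idx, item in enumerate(result["motions"]):
        if not isinstance(item, dict):
            errors.append(f"motions[{idx}] is not an object")
            continue
        if "motion_id" not in item:
            errors.append(f"motions[{idx}] is missing motion_id")
        for field in SCORE_FIELDS:
            if field not in item:
                errors.append(f"motions[{idx}] is missing {field}")
                continue
            value = item[field]
            # bool is a subclass of int in Python, so it is excluded explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
                errors.append(f"motions[{idx}].{field} must be an integer from 1 to 5")
    return errors
```

Returning a list of messages instead of raising on the first problem lets the orchestrator log every defect in a batch before deciding whether to retry it.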

Verification:

All tests pass before any subagent is spawned.
extremity_scores_2d table exists with correct schema.
Orchestrator can be configured with a --dry-run flag that validates the pipeline without spawning subagents.
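
One way to wire the --dry-run flag is sketched below with argparse. The run() function and its summary dictionary are hypothetical; the point is only that the dry-run path walks the batches without ever calling the dispatch function.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Two-dimensional extremity rescoring (sketch)")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="validate the pipeline without spawning subagents",
    )
    return parser


def run(batches, dispatch, dry_run: bool = False) -> dict:
    """Dispatch every batch unless dry_run is set; return a small summary."""
    if dry_run:
        return {"batches": len(batches), "dispatched": 0}
    results = [dispatch(batch) for batch in batches]
    return {"batches": len(batches), "dispatched": len(results)}


args = build_parser().parse_args(["--dry-run"])
print(run([[1, 2], [3, 4]], dispatch=lambda batch: batch, dry_run=args.dry_run))
```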

U3. Execute the 100-Motion Rescoring

Goal: Run the orchestrator to score 100 motions, compute the correlation between stylistic and material extremity, and report the results.

Requirements: R2, R3

Dependencies: U2

Files:

Modify: analysis/right_wing/extremity_rescore_2d.py (any fixes from live run)
Output: reports/overton_window/extremity_2d_correlation.md

Approach:

Run the orchestrator with actual subagent dispatch (no --dry-run)
Spawn 10 subagents in parallel, each scoring 10 motions
Collect all results, validate against schema
Compute Pearson r between stylistic_score and material_score
Write a short correlation report with:
- Overall r and per-bucket r
- Scatter plot of stylistic vs material scores
- Conclusion: "dimensions are separable" if r < 0.7, "single score sufficient" if r > 0.7
- Recommendation for next steps (extend sample, re-score all, or proceed)
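
The correlation step can be sketched without external dependencies. The function below computes Pearson r from its definition and applies the 0.7 threshold; treating an r exactly equal to the threshold as "single score sufficient" is a choice made for this sketch only, and the input values in the usage lines are made up.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return sxy / sqrt(sxx * syy)


def verdict(r: float, threshold: float = 0.7) -> str:
    """Map r onto the two conclusions used in the correlation report."""
    return "dimensions are separable" if r < threshold else "single score sufficient"


# Made-up scores, for illustration only.
stylistic = [1, 2, 3, 4, 5, 3]
material = [2, 1, 4, 3, 5, 2]
r = pearson_r(stylistic, material)
print(round(r, 3), verdict(r))
```

The constant-sequence guard matters for the per-bucket r: a bucket where every motion received the same score on one dimension has no defined correlation and should be reported as such.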

Technical design: The orchestrator calls the task tool with the skill's prompt and each batch's motion data. Each subagent returns:

{
  "motions": [
    {"motion_id": 123, "stijl_extremiteit": 3, "materiele_impact": 4, "rationale": "..."}
  ]
}
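
The JSON keys are Dutch while the database columns are English, so a small mapping step sits between the two. The sketch below shows one possible shape; to_rows is a hypothetical helper, and how the single rationale field relates to the rationale columns in the table is left open here, so the sketch simply carries it through unchanged.

```python
def to_rows(result: dict) -> list[dict]:
    """Translate one subagent result into row dictionaries keyed by column-style names."""
    rows = []
    for item in result["motions"]:
        rows.append(
            {
                "motion_id": item["motion_id"],
                "stylistic_score": item["stijl_extremiteit"],
                "material_score": item["materiele_impact"],
                # Assumption: the rationale is optional and kept as a single text field.
                "rationale": item.get("rationale", ""),
            }
        )
    return rows


example = {
    "motions": [
        {"motion_id": 123, "stijl_extremiteit": 3, "materiele_impact": 4, "rationale": "..."}
    ]
}
print(to_rows(example))
```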

Verification:

100 motions have both stylistic_score and material_score.
Correlation report written with clear r value.
All scores are integers 1–5.

U4. Mechanism Analysis

Goal: Classify right-wing motions by policy mechanism and compute which mechanisms gained the most centrist support post-2024.

Requirements: R4, R5

Dependencies: None (reads existing DB tables)

Files:

Create: analysis/right_wing/mechanism_analysis.py
Create: tests/right_wing/test_mechanism_analysis.py
Output: reports/overton_window/mechanism_analysis.md

Approach:

Subagent-based classification: Spawn subagents to classify motions by mechanism. Each subagent receives 25 motions (120 batches for the 2,986 motions) and returns JSON mapping motion_id -> mechanism_category. The subagent derives categories from the data (not a pre-defined taxonomy).
Test-first: Write tests for the orchestration layer (query, batch formatting, table creation, result validation).
Compute centrist support per mechanism: Using centrist_support_strict from right_wing_motions, compute pre/post-2024 centrist support and delta per mechanism category.
Report: Table of mechanism categories ranked by centrist support delta, with N per category. Top-5 mechanisms visualization.
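
The per-mechanism comparison can be sketched as a grouped mean difference. The row shape (mechanism_category, is_post_2024, centrist_support) and the function name are assumptions for illustration; the planned script reads these values from the database, and the sample rows below are made up.

```python
from collections import defaultdict
from statistics import mean


def support_delta_by_mechanism(rows, min_n: int = 5) -> list[dict]:
    """Rank mechanisms by post-2024 minus pre-2024 mean centrist support."""
    grouped = defaultdict(lambda: {"pre": [], "post": []})
    for mechanism, is_post, support in rows:
        grouped[mechanism]["post" if is_post else "pre"].append(support)
    table = []
    for mechanism, periods in grouped.items():
        if not periods["pre"] or not periods["post"]:
            continue  # the delta is undefined without motions in both periods
        n = len(periods["pre"]) + len(periods["post"])
        pre, post = mean(periods["pre"]), mean(periods["post"])
        table.append(
            {
                "mechanism": mechanism,
                "n": n,
                "pre": pre,
                "post": post,
                "delta": post - pre,
                "unreliable": n < min_n,  # small categories are flagged, not dropped
            }
        )
    return sorted(table, key=lambda row: row["delta"], reverse=True)


# Made-up rows, for illustration only.
rows = [
    ("benefit restriction", False, 0.25),
    ("benefit restriction", True, 0.75),
    ("symbolic/declarative", False, 0.5),
    ("symbolic/declarative", True, 0.5),
]
for row in support_delta_by_mechanism(rows):
    print(row["mechanism"], row["delta"], row["unreliable"])
```

Categories that appear in only one period are skipped in this sketch; the report could list them separately so that newly emerging mechanisms are not silently lost.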

Execution note: Implement test-first. Mock subagent dispatch in tests.

Patterns to follow:

analysis/right_wing/direction3_migration_antidemocratic.py — category breakdown patterns
analysis/right_wing/overton_breakpoint_analysis.py — pre/post comparison patterns

Test scenarios:

Happy path: mechanism analysis script runs on real DB and produces a markdown report.
Happy path: table has mechanism categories with N, pre-CS, post-CS, delta columns.
Edge case: subagent returns unknown mechanism category — orchestrator normalizes or flags.
Edge case: mechanism category with <5 motions flagged as unreliable.

Verification:

reports/overton_window/mechanism_analysis.md exists with mechanism breakdown.
Report includes centrist support delta per mechanism.
Top mechanism is identified with supporting evidence from motion titles.

U5. Update Findings Report

Goal: Integrate dual-dimension correlation and mechanism analysis into the Overton findings report.

Requirements: R6

Dependencies: U3, U4

Files:

Modify: reports/overton_window/findings_report.md

Approach:

Add a new Section 3b (or update Section 3 Content Extremity) with:
- Two-dimensional scoring results and correlation
- Whether the single-dimensional scores are confirmed or need revision
- Updated content extremity narrative with caveats refined by dual-dimension insight
Add a new Section 7 (Mechanism Analysis) with:
- Which mechanisms drove the centrist support surge
- Migration vs non-migration mechanism differences
Update Section 8 (Next Steps) to reflect completed 2D rescoring and mechanism work

Verification:

Report is internally consistent.
New sections reference the right figures and tables.
Next steps don't list work that's already done.

System-Wide Impact

New DB tables: extremity_scores_2d, motion_mechanisms — additive, no existing data modified.
New skill: .opencode/skills/score-extremity/SKILL.md — no code impact, only prompt artifact.
No UI changes, no agent_tools changes, no pipeline changes.
Tests: New tests in tests/right_wing/ do not affect existing test suite.

Risks & Dependencies

Risk	Mitigation
Subagent capacity limits (too many parallel dispatches)	Batch size 10 = 10 parallel subagents. Well within limits for 100 motions. If extending to 2,986, use hybrid approach (larger batches or fallback to API).
Subagent returns malformed JSON	Validator layer rejects and retries individual batches (max 2 retries).
Two dimensions correlate highly (r > 0.7)	Confirms the single-dimensional scores are directionally valid. Write this finding up as a confirmatory result — still valuable.
Mechanism taxonomy is too coarse to discriminate	Subagent derives from data, not pre-defined taxonomy. Iterative refinement in the subagent prompt if first pass is noisy.
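
The retry mitigation for malformed JSON can be sketched as a thin wrapper around the dispatch call. The dispatch and validate callables are injected, which is also what makes the wrapper easy to unit-test with a fake dispatcher; the function name and error type are hypothetical.

```python
def dispatch_with_retries(dispatch, batch, validate, max_retries: int = 2):
    """Call dispatch(batch) until validate() reports no errors, up to 1 + max_retries attempts."""
    last_errors = []
    for _attempt in range(1 + max_retries):
        result = dispatch(batch)
        last_errors = validate(result)
        if not last_errors:
            return result
    raise RuntimeError(f"batch rejected after {1 + max_retries} attempts: {last_errors}")


# Fake dispatcher for illustration: the first two answers are malformed, the third is fine.
calls = []


def flaky(batch):
    calls.append(batch)
    return {"ok": len(calls) >= 3}


def check(result):
    return [] if result["ok"] else ["malformed"]


print(dispatch_with_retries(flaky, [1, 2], check), "after", len(calls), "calls")
```

Retrying a single batch keeps a bad response from one subagent from invalidating the other nine batches.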

Sources & References

Origin plan: docs/plans/2026-05-08-002-feat-overton-window-shift-plan.md
Findings report: reports/overton_window/findings_report.md
Methodology doc: docs/solutions/best-practices/overton-window-shift-methodology-2026-05-24.md
Existing scorer: analysis/right_wing/extremity_scorer.py
Skill format reference: ~/.config/opencode/skills/ce-work/SKILL.md
