---
title: Large-scale subagent-based 2D extremity scoring
date: 2026-06-05
category: best-practices
module: analysis/right_wing
problem_type: best_practice
component: development_workflow
severity: medium
applies_when:
- "scaling LLM scoring from hundreds to tens of thousands of items"
- "using subagent dispatch as a replacement for API-based batch scoring"
- "parallel batch processing with stateful incremental storage"
tags:
- extremity-scoring
- subagent-dispatch
- parallelism
- duckdb
- llm-workflow
---
# Large-scale subagent-based 2D extremity scoring
## Context
After scoring 117 right-wing motions on two extremity dimensions (stijl-extremiteit, i.e. stylistic extremity, and materiele impact, i.e. material impact) using deepseek v4 flash subagents, we needed to scale to all 29,570 motions in the database. The existing OpenRouter-based batch pipeline (`chat_completion_json_parallel`) would have been too expensive and too slow at that scale, so we used subagent dispatch via the `task` tool instead.
## Guidance
### 1. Batch file generation
Generate fixed-size batch files (20 motions each) containing filled prompt templates with all motion context upfront. This avoids repeated DB queries per subagent:
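```python
from pathlib import Path

for i, chunk in enumerate(chunks):  # each chunk is a list of 20 motion dicts
    batch_content = ""
    for motion in chunk:
        # Lead every entry with the exact motion ID so results can be matched back.
        batch_content += f"MOTION_ID: {motion['id']}\n{prompt_template.format(...)}\n\n"
    Path(f"/tmp/all_batch_{i:04d}.txt").write_text(batch_content, encoding="utf-8")
```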
Always write exact motion IDs in each batch file so results can be matched back without ambiguity.
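Matching results back by ID could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the `MOTION_ID:` line format is taken from the batch files above, while the result field name `motion_id` and the helper names `batch_motion_ids` and `match_results` are hypothetical.

```python
import json
import re

# Matches the "MOTION_ID: <id>" header line written for each motion.
MOTION_ID_LINE = re.compile(r"^MOTION_ID:\s*(\S+)\s*$", re.MULTILINE)


def batch_motion_ids(batch_text: str) -> list[str]:
    """Return the motion IDs listed in a batch file, in file order."""
    return MOTION_ID_LINE.findall(batch_text)


def match_results(batch_text: str, result_json: str) -> tuple[dict[str, dict], list[str]]:
    """Pair each scored item with a batch ID and report IDs that got no score.

    Assumes the subagent returns a JSON array of objects that each carry a
    "motion_id" key (hypothetical field name).
    """
    expected = batch_motion_ids(batch_text)
    expected_set = set(expected)
    scored: dict[str, dict] = {}
    for item in json.loads(result_json):
        motion_id = str(item.get("motion_id"))
        if motion_id in expected_set:  # ignore IDs the batch never contained
            scored[motion_id] = item
    missing = [mid for mid in expected if mid not in scored]
    return scored, missing
```

Any ID that ends up in `missing` can be queued for a later wave rather than silently dropped.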
### 2. Politically neutral prompt
When scoring motions across the full political spectrum (not just right-wing), adjust the material impact scale to be politically symmetric:
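- Scale point 5 should describe "fundamentele herstructurering van rechten, instituties of economische systemen" ("fundamental restructuring of rights, institutions or economic systems"), not only right-wing actions such as "inperking van rechten" ("restriction of rights")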
- Include examples from both left and right: high-impact left motions (nationalization, wealth taxes, climate mandates) and right motions (asylum cessation, EU exit) should both reach the top of the scale
The SKILL.md file is read at runtime via `load_skill()`, so prompt changes take effect immediately without code changes.
### 3. Subagent dispatch pattern
Dispatch subagents in parallel waves of 5-8, each handling 5 batch files (100 motions):
```
For each wave of 5-8 subagents (in parallel):
    For each subagent (handling 5 batch files):
        task(score-extremity skill, "Score these motions: {batch_content}")
    Wait for all to complete
    Collect results from /tmp/all_result_*.json
    Validate and store to DB incrementally
```
Key: store results to the DB after each wave, not once after all waves. The system can clean up files in /tmp, and a subagent timeout can lose any results that were never written.
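The grouping of batch files into waves can be sketched in plain Python. This is a minimal illustration: `chunked` and `plan_waves` are hypothetical helper names, and the defaults (5 files per subagent, 8 subagents per wave) are taken from the upper end of the numbers above. The actual dispatch goes through the `task` tool and is not shown.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def plan_waves(batch_files: list[str], files_per_subagent: int = 5,
               subagents_per_wave: int = 8) -> list[list[list[str]]]:
    """Group batch files into waves; each wave is a list of per-subagent file lists."""
    assignments = chunked(batch_files, files_per_subagent)
    return list(chunked(assignments, subagents_per_wave))
```

An orchestrator would then iterate over `plan_waves(...)`, dispatch one subagent per inner list, wait for the wave to finish, and write that wave's validated results to the DB before starting the next one.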
### 4. Anti-scripting guard
Subagents sometimes write Python scripts to batch-score motions instead of scoring directly in their reasoning. Add explicit instructions:
```
IMPORTANT: Do NOT write Python scripts to score these motions. Score them
directly in your reasoning, returning the JSON array. Do not use code
to automate this — your reasoning and judgment IS the scoring mechanism.
```
### 5. Incremental storage
Use `INSERT OR REPLACE` for idempotent writes:
```sql
INSERT OR REPLACE INTO extremity_scores_all
(motion_id, stijl_extremiteit, stijl_toelichting, materiele_impact, materiele_toelichting)
VALUES (?, ?, ?, ?, ?)
```
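Re-running a wave then overwrites the existing rows instead of raising duplicate-key errors, which makes the pipeline resumable. This relies on `motion_id` being declared as a primary key or unique column; without such a constraint the statement has no conflict to resolve and does not deduplicate.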
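The idempotent write can be tried out with the sketch below. Note the assumptions: it uses Python's standard-library `sqlite3` as a stand-in because it accepts the same `INSERT OR REPLACE` statement, whereas the real pipeline's database driver may differ; the table schema (in particular `motion_id TEXT PRIMARY KEY`) and the helper names `open_store` and `store_wave` are hypothetical.

```python
import sqlite3

UPSERT_SQL = """
INSERT OR REPLACE INTO extremity_scores_all
    (motion_id, stijl_extremiteit, stijl_toelichting, materiele_impact, materiele_toelichting)
VALUES (?, ?, ?, ?, ?)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store; the PRIMARY KEY is what lets OR REPLACE overwrite a row."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS extremity_scores_all (
            motion_id TEXT PRIMARY KEY,  -- assumed schema, not from the source
            stijl_extremiteit INTEGER,
            stijl_toelichting TEXT,
            materiele_impact INTEGER,
            materiele_toelichting TEXT
        )
        """
    )
    return conn


def store_wave(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Write one wave of rows in a single transaction; return the table's row count."""
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(UPSERT_SQL, rows)
    return conn.execute("SELECT COUNT(*) FROM extremity_scores_all").fetchone()[0]
```

Calling `store_wave` twice with the same rows leaves the row count unchanged, which is the property that makes re-running a wave safe.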
### 6. Handling placeholder motions
Many motions in the database have only an outcome label ("Aangenomen." / "Verworpen.", i.e. "Adopted." / "Rejected.") with no text or layman explanation. Score these (1, 1), and have the scoring subagent detect and report them. Do not try to infer scores from metadata such as controversy scores; that defeats the purpose of LLM-based scoring.
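One way to cross-check the subagents' placeholder reports is a small text-only test like the sketch below. It is an assumption-laden illustration: the function name `is_placeholder` and its parameters `text` and `layman_explanation` are hypothetical, and it looks only at the motion text, never at metadata.

```python
from typing import Optional

# Outcome labels quoted in the text; a motion whose only text is one of these is a placeholder.
OUTCOME_LABELS = {"aangenomen", "verworpen"}


def is_placeholder(text: Optional[str], layman_explanation: Optional[str] = None) -> bool:
    """True when a motion has no scorable content beyond an outcome label."""
    if layman_explanation and layman_explanation.strip():
        return False  # an explanation exists, so there is something to score
    normalized = (text or "").strip().rstrip(".").strip().lower()
    return normalized == "" or normalized in OUTCOME_LABELS
```

A motion flagged here but returned with a score other than (1, 1) is a candidate for manual review or re-scoring.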
## Why This Matters
- **Cost**: Subagent-based scoring via deepseek v4 flash is ~$2-3 for 30K motions vs. $50-100+ via OpenRouter API at comparable scale
- **Resumability**: Wave-by-wave DB storage means a timeout or crash loses at most one wave (~400-500 motions)
- **Prompt agility**: SKILL.md changes propagate immediately to the next wave — no pipeline restart needed
- **Independence**: The style and material impact dimensions remain only moderately correlated (r ≈ 0.43) at scale, which suggests they capture related but separable signals
## Examples
**Failed approach**: single monolithic subagent scoring all 30K motions. Times out, loses all progress.
**Working approach**: 1,184 batch files, ~80 waves of 5-8 subagents each, DB stored after each wave. 3-day pipeline, resumable, $3 total cost.
## Related
- `.opencode/skills/score-extremity/SKILL.md` — the scoring prompt and subagent workflow
- `analysis/right_wing/extremity_score_all.py` — batch generation and orchestrator
- `docs/solutions/best-practices/overton-extended-analysis-methodology-2026-05-26.md` — 2D scoring in Overton context