| title | date | category | module | problem_type | component | severity | applies_when | tags |
|---|---|---|---|---|---|---|---|---|
| Motion category classification via parallel subagent pipeline | 2026-06-15 | best-practices | analysis/right_wing | best_practice | development_workflow | medium | classifying thousands of items into policy categories using LLMs; sequential LLM batch pipelines time out or run too slowly; a classification taxonomy can be derived from a sample rather than predefined; items are independently classifiable with no cross-item state | motion-classification, subagent-dispatch, parallelism, duckdb, category-taxonomy |
Motion category classification via parallel subagent pipeline
Context
The right_wing_motions table in data/motions.db had a category column that was 100% NULL across all 3,030 classified motions, which blocked the downstream Overton analysis that splits centrist support by policy domain. The existing derive_categories.py script used OpenRouter's chat_completion_json_parallel to classify motions in sequential batches, but at this scale it consistently timed out after 10 minutes without classifying anything. A different approach was needed.
Guidance
1. Derive taxonomy from a sample first
Have a sub-agent read a random sample (e.g., 60 motions) and infer natural categories from the data. This produces categories grounded in the actual motion content rather than a preconceived list:
- The sample ensures categories reflect real distribution (migration-heavy, stikstof-driven, etc.)
- The sub-agent returns a concise taxonomy with descriptions for each category
- Include a catch-all "overig" category for edge cases
For this project the taxonomy yielded 10 categories: asiel/vreemdelingen, landbouw/natuur, veiligheid/justitie, zorg/gezondheid, economie, energie/klimaat, buitenland/europa, onderwijs/wetenschap, verkeer/infrastructuur, overig.
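As a minimal sketch of the sampling step, the snippet below draws a reproducible 60-motion sample to hand to the taxonomy sub-agent. The helper name `sample_motions`, the seed, and the placeholder motion records are assumptions made for illustration; only the sample size and the ten category names come from this project.

```python
import random

# The ten categories derived for this project (names taken from the text above).
TAXONOMY = [
    "asiel/vreemdelingen", "landbouw/natuur", "veiligheid/justitie",
    "zorg/gezondheid", "economie", "energie/klimaat", "buitenland/europa",
    "onderwijs/wetenschap", "verkeer/infrastructuur", "overig",
]


def sample_motions(motions, k=60, seed=0):
    """Hypothetical helper: return up to k motions chosen without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(motions, min(k, len(motions)))


# Placeholder records standing in for rows dumped from the database.
motions = [
    {"motion_id": i, "title": f"Motion {i}", "body_text": "..."}
    for i in range(3030)
]
sample = sample_motions(motions)
```

Fixing the seed is a design choice, not something the original workflow is known to have done; it makes the taxonomy derivation repeatable if the sample needs to be inspected later.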
2. Chunk data into independent batches
Dump motions from the DB to JSON, then split into small chunks (~38 motions each) that fit comfortably within a single sub-agent's context window. Each chunk is a standalone JSON file containing motion_id, title, and body_text.
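The chunking step could look roughly like the sketch below. The function name `write_chunks`, the `chunk_NNN.json` file naming, and the temporary output directory are assumptions; the three fields and the chunk size of 38 come from the description above.

```python
import json
import tempfile
from pathlib import Path


def write_chunks(motions, out_dir, chunk_size=38):
    """Hypothetical helper: write motions to standalone JSON chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(motions), chunk_size)):
        # Keep only the fields a classification sub-agent needs.
        chunk = [
            {"motion_id": m["motion_id"], "title": m["title"], "body_text": m["body_text"]}
            for m in motions[start:start + chunk_size]
        ]
        path = out_dir / f"chunk_{n:03d}.json"  # assumed naming scheme
        path.write_text(json.dumps(chunk, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder records standing in for the motions dumped from the database.
motions = [
    {"motion_id": i, "title": f"Motion {i}", "body_text": "..."}
    for i in range(3030)
]
chunk_paths = write_chunks(motions, tempfile.mkdtemp())
```

With 3,030 motions and 38 per chunk this yields 80 files, the last holding the remaining 28 motions.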
3. Dispatch parallel classification sub-agents
Spawn one sub-agent per chunk simultaneously (up to 80 in this case). Each receives:
- The chunk of motions to classify
- The taxonomy with category descriptions
- A strict JSON output format:
[{"motion_id": ..., "category": ..., "category_explanation": ...}]
- An instruction to read both title and body_text before deciding on a category
All 80 agents run in parallel, finishing in minutes rather than hours.
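The dispatch itself depends on the agent tooling in use, so the sketch below covers only the part that can be shown generically: assembling the instructions each sub-agent receives. The function name `build_prompt`, the exact wording of the prompt, and the placeholder category descriptions are assumptions, not the prompt actually used.

```python
import json


def build_prompt(chunk, taxonomy):
    """Hypothetical helper: build the instruction text for one sub-agent."""
    category_lines = "\n".join(f"- {name}: {desc}" for name, desc in taxonomy.items())
    return (
        "Classify each motion into exactly one of these categories:\n"
        f"{category_lines}\n\n"
        "Read both title and body_text before deciding on a category.\n"
        "Return only a JSON array in this format:\n"
        '[{"motion_id": ..., "category": ..., "category_explanation": ...}]\n\n'
        "Motions:\n"
        f"{json.dumps(chunk, ensure_ascii=False)}"
    )


# Placeholder descriptions; the real ones come from the taxonomy sub-agent.
taxonomy = {
    "asiel/vreemdelingen": "(description from the taxonomy step)",
    "overig": "(catch-all for edge cases)",
}
chunk = [{"motion_id": 1, "title": "Example title", "body_text": "Example body"}]
prompt = build_prompt(chunk, taxonomy)
```

Repeating the output format verbatim in every prompt keeps each sub-agent independent of the others, which is what allows all chunks to run at once.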
4. Merge results and update the database
Collect all result files. Validate each for correct structure (some may use non-standard key names). Then update the DB:
UPDATE right_wing_motions
SET category = ?, category_explanation = ?
WHERE motion_id = ?
Verify by counting non-NULL rows.
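A minimal sketch of the update and verification, using Python's built-in sqlite3 module with an in-memory database as a stand-in. Whether data/motions.db is SQLite or DuckDB is not stated here, so the driver is an assumption; the helper names and the sample rows are also illustrative only.

```python
import sqlite3

# In-memory stand-in for data/motions.db (assumption: the real driver may differ).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE right_wing_motions ("
    "motion_id INTEGER PRIMARY KEY, title TEXT, "
    "category TEXT, category_explanation TEXT)"
)
conn.executemany(
    "INSERT INTO right_wing_motions (motion_id, title) VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C")],
)

# Illustrative merged results; motion 3 is deliberately left unclassified.
results = [
    {"motion_id": 1, "category": "economie", "category_explanation": "example"},
    {"motion_id": 2, "category": "overig", "category_explanation": "example"},
]


def apply_results(conn, results):
    """Write categories back using the parameterized UPDATE shown above."""
    rows = [(r["category"], r["category_explanation"], r["motion_id"]) for r in results]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "UPDATE right_wing_motions "
            "SET category = ?, category_explanation = ? "
            "WHERE motion_id = ?",
            rows,
        )


def count_classified(conn):
    """Return (non-NULL category rows, total rows)."""
    # COUNT(category) skips NULLs, COUNT(*) counts every row.
    return conn.execute(
        "SELECT COUNT(category), COUNT(*) FROM right_wing_motions"
    ).fetchone()


apply_results(conn, results)
classified, total = count_classified(conn)
```

Comparing the two counts shows at a glance whether any motion was missed; here one of the three placeholder rows remains NULL.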
5. Integrate into downstream analysis
Once the category column is populated, update analysis scripts and charts to use it. For the Overton QMD report this meant:
- A Plotly dropdown filter on the main centrist support chart to toggle between categories
- A category delta bar chart showing pre/post centrist support change per domain
- Quarterly domain trajectory charts for the 5 largest categories
Why This Matters
- Speed: 80 parallel agents classified 3,030 motions in minutes vs. a sequential script that never finished at all
- Simplicity: No timeout handling, retry logic, or batch management needed — each agent is a fire-and-forget independent unit
- Quality: Classification is grounded in reasoning (reading title + full text), not keyword matching or vector similarity
- Discoverability: The derived taxonomy (10 categories) emerges naturally from the data rather than being imposed upfront
When to Apply
- You have thousands of items needing per-item LLM processing
- Each item is independently classifiable
- The task fits in a sub-agent's context window when batched at ~30-50 items
- Parallel dispatch infrastructure is available (e.g., the task tool)
Examples
The pipeline was applied to 3,030 Dutch right-wing motions. The taxonomy was derived from a 60-motion sample by a single sub-agent, then 80 parallel sub-agents classified ~38 motions each. Final distribution was: landbouw/natuur 487, economie 470, asiel/vreemdelingen 423, buitenland/europa 386, veiligheid/justitie 359, zorg/gezondheid 348, energie/klimaat 174, overig 159, verkeer/infrastructuur 138, onderwijs/wetenschap 86.
Two chunks needed minor fixes (used category_label / predicted_category instead of category). A quick validation script caught these before the DB update.
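The validation script itself is not reproduced here; a check along the following lines would catch the two key-name variants mentioned. The names `KEY_ALIASES`, `REQUIRED`, and `normalize` are hypothetical, and rejecting categories outside the taxonomy is an extra safeguard added in this sketch rather than something the original script is known to do.

```python
TAXONOMY = {
    "asiel/vreemdelingen", "landbouw/natuur", "veiligheid/justitie",
    "zorg/gezondheid", "economie", "energie/klimaat", "buitenland/europa",
    "onderwijs/wetenschap", "verkeer/infrastructuur", "overig",
}

# Non-standard key names observed in two chunks, mapped to the expected key.
KEY_ALIASES = {"category_label": "category", "predicted_category": "category"}
REQUIRED = ("motion_id", "category", "category_explanation")


def normalize(record):
    """Hypothetical helper: rename known aliases and validate one result record."""
    fixed = {KEY_ALIASES.get(key, key): value for key, value in record.items()}
    missing = [key for key in REQUIRED if key not in fixed]
    if missing:
        raise ValueError(f"record {record!r} is missing keys: {missing}")
    if fixed["category"] not in TAXONOMY:
        raise ValueError(f"unknown category: {fixed['category']!r}")
    # Drop any extra keys so every record has the same shape.
    return {key: fixed[key] for key in REQUIRED}


raw = {"motion_id": 7, "predicted_category": "economie", "category_explanation": "example"}
clean = normalize(raw)
```

Raising on anything unexpected, instead of silently skipping it, means a malformed chunk is noticed before the database update runs.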
Related
- docs/solutions/best-practices/large-scale-subagent-2d-extremity-scoring-2026-06-05.md: parallel subagent pattern for numeric extremity scoring (same infrastructure, different task)
- analysis/right_wing/derive_categories.py: the original sequential script that timed out
- docs/solutions/best-practices/domain-decomposition-overton-analysis.md: why category-split analysis matters for Overton interpretation
- docs/solutions/best-practices/overton-narrative-architecture-2026-06-06.md: QMD report structure that consumed the categories