---
title: "Motion category classification via parallel subagent pipeline"
date: 2026-06-15
category: best-practices
module: analysis/right_wing
problem_type: best_practice
component: development_workflow
severity: medium
applies_when:
  - "classifying thousands of items into policy categories using LLMs"
  - "sequential LLM batch pipelines time out or run too slowly"
  - "a classification taxonomy can be derived from a sample rather than predefined"
  - "items are independently classifiable with no cross-item state"
tags:
  - motion-classification
  - subagent-dispatch
  - parallelism
  - duckdb
  - category-taxonomy
---

# Motion category classification via parallel subagent pipeline

## Context

The `right_wing_motions` table in `data/motions.db` had a `category` column that was 100% NULL across 3,030 classified motions, blocking downstream Overton analysis that splits centrist support by policy domain. The existing `derive_categories.py` script used OpenRouter's `chat_completion_json_parallel` to classify motions in sequential batches, but consistently timed out after 10 minutes without classifying anything at scale. A different approach was needed.

## Guidance

### 1. Derive taxonomy from a sample first

Have a sub-agent read a random sample (e.g., 60 motions) and infer natural categories from the data. This produces categories grounded in the actual motion content rather than a preconceived list:

- The sample ensures categories reflect real distribution (migration-heavy, stikstof-driven, etc.)
- The sub-agent returns a concise taxonomy with descriptions for each category
- Include a catch-all "overig" category for edge cases

For this project the taxonomy yielded 10 categories: asiel/vreemdelingen, landbouw/natuur, veiligheid/justitie, zorg/gezondheid, economie, energie/klimaat, buitenland/europa, onderwijs/wetenschap, verkeer/infrastructuur, overig.

### 2. Chunk data into independent batches

Dump motions from the DB to JSON, then split into small chunks (~38 motions each) that fit comfortably within a single sub-agent's context window. Each chunk is a standalone JSON file containing motion_id, title, and body_text.

### 3. Dispatch parallel classification sub-agents

Spawn one sub-agent per chunk simultaneously (up to 80 in this case). Each receives:

- The chunk of motions to classify
- The taxonomy with category descriptions
- A strict JSON output format: `[{"motion_id": ..., "category": ..., "category_explanation": ...}]`
- An instruction to read both title and body_text before deciding on a category

All 80 agents run in parallel, finishing in minutes rather than hours.

### 4. Merge results and update the database

Collect all result files. Validate each for correct structure (some may use non-standard key names). Then update the DB:

```sql
UPDATE right_wing_motions
SET category = ?, category_explanation = ?
WHERE motion_id = ?
```

Verify by counting non-NULL rows.

### 5. Integrate into downstream analysis

Once the category column is populated, update analysis scripts and charts to use it. For the Overton QMD report this meant:

- A Plotly dropdown filter on the main centrist support chart to toggle between categories
- A category delta bar chart showing pre/post centrist support change per domain
- Quarterly domain trajectory charts for the 5 largest categories

## Why This Matters

- **Speed**: 80 parallel agents classified 3,030 motions in minutes vs. a sequential script that never finished at all
- **Simplicity**: No timeout handling, retry logic, or batch management needed: each agent is a fire-and-forget independent unit
- **Quality**: Classification is grounded in reasoning (reading title + full text), not keyword matching or vector similarity
- **Discoverability**: The derived taxonomy (10 categories) emerges naturally from the data rather than being imposed upfront

## When to Apply

- You have thousands of items needing per-item LLM processing
- Each item is independently classifiable
- The task fits in a sub-agent's context window when batched at ~30-50 items
- Parallel dispatch infrastructure is available (e.g., the `task` tool)

## Examples

The pipeline was applied to 3,030 Dutch right-wing motions. The taxonomy was derived from a 60-motion sample by a single sub-agent, then 80 parallel sub-agents classified ~38 motions each. Final distribution was: landbouw/natuur 487, economie 470, asiel/vreemdelingen 423, buitenland/europa 386, veiligheid/justitie 359, zorg/gezondheid 348, energie/klimaat 174, overig 159, verkeer/infrastructuur 138, onderwijs/wetenschap 86.

Two chunks needed minor fixes (used `category_label` / `predicted_category` instead of `category`). A quick validation script caught these before the DB update.

## Related

- `docs/solutions/best-practices/large-scale-subagent-2d-extremity-scoring-2026-06-05.md`: parallel subagent pattern for numeric extremity scoring (same infrastructure, different task)
- `analysis/right_wing/derive_categories.py`: the original sequential script that timed out
- `docs/solutions/best-practices/domain-decomposition-overton-analysis.md`: why category-split analysis matters for Overton interpretation
- `docs/solutions/best-practices/overton-narrative-architecture-2026-06-06.md`: QMD report structure that consumed the categories
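
As a rough illustration of Guidance steps 2 and 4 (chunking, then validating and merging results), the sketch below shows one possible shape for the glue code. It is not the script used in this project: the helper names `chunk`, `normalize`, and `apply_results` are hypothetical, the alias list only covers the two stray key names reported under Examples, and the stdlib `sqlite3` module stands in for whatever driver actually opens `data/motions.db`.

```python
import sqlite3

CHUNK_SIZE = 38  # chunk size reported in this document

# The 10 categories listed in Guidance step 1.
ALLOWED = {
    "asiel/vreemdelingen", "landbouw/natuur", "veiligheid/justitie",
    "zorg/gezondheid", "economie", "energie/klimaat", "buitenland/europa",
    "onderwijs/wetenschap", "verkeer/infrastructuur", "overig",
}

# Assumption: these are the only key variants a sub-agent may return.
KEY_ALIASES = ("category", "category_label", "predicted_category")


def chunk(items, size=CHUNK_SIZE):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def normalize(record):
    """Return (category, explanation, motion_id), accepting alias keys.

    Raises ValueError when motion_id is missing or the category is not
    part of the taxonomy, so bad chunks surface before the DB update.
    """
    category = next((record[k] for k in KEY_ALIASES if k in record), None)
    if "motion_id" not in record or category not in ALLOWED:
        raise ValueError(f"invalid record: {record!r}")
    return (category, record.get("category_explanation", ""), record["motion_id"])


def apply_results(conn, records):
    """Validate all records, run the UPDATE, and return the non-NULL count."""
    rows = [normalize(r) for r in records]  # fails fast, before any write
    conn.executemany(
        "UPDATE right_wing_motions "
        "SET category = ?, category_explanation = ? "
        "WHERE motion_id = ?",
        rows,
    )
    conn.commit()
    count = conn.execute(
        "SELECT COUNT(*) FROM right_wing_motions WHERE category IS NOT NULL"
    ).fetchone()[0]
    return count
```

Validating every record before the first write means a malformed chunk stops the merge without leaving the table partially updated, and the returned count can be compared directly against the expected number of motions.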