7.5 KiB

Raw Blame History

title	date	category	module	problem_type	component	severity	applies_when	tags
Parallel API batching for AI motion summarization with adaptive throughput	2026-05-05	workflow-issues	summarizer	workflow_issue	service_object	medium	[Backfilling large numbers of AI-generated summaries via an API Rate limits or slow throughput bottlenecking batch processing Need to process 10,000+ items with an LLM API]	[parallelization batching openrouter mistral throughput cost-optimization]

Parallel API Batching for AI Motion Summarization

Context

Generating layman-friendly explanations for 29,000+ parliamentary motions via an LLM API. Initial approach processed one motion per API call, yielding ~700 motions/hour with significant per-request overhead. At this rate, the full backfill would take ~40 hours and cost ~$15-20. The budget tracking was also inaccurate — estimated $5.00 cap but actual API spend was only ~$1.78 when the cap was hit.

Guidance

1. Application-level batching (not native API batching)

OpenAI's /chat/completions endpoint does not support multiple independent conversations in one request. Instead, pack 10-20 motions into a single prompt and request structured JSON output:

# Build one prompt with N motions
prompt = f"""Je krijgt {len(motions)} moties.
Schrijf voor ELKE motie 2-3 zinnen uitleg.
Geef antwoord als JSON: {{"motion_id": "uitleg", ...}}

{motions_block}
"""

# Request JSON mode
payload = {
    "model": model,
    "messages": messages,
    "response_format": {"type": "json_object"}
}

This eliminates 90%+ of per-request HTTP overhead and context-window overhead.

2. Parallel API requests with ThreadPoolExecutor

When the API supports concurrent requests, use ThreadPoolExecutor to saturate the connection:

from concurrent.futures import ThreadPoolExecutor

def chat_completion_json_parallel(
    message_batches, model=None, max_workers=3
):
    def _fetch_one(messages):
        return chat_completion_json(messages, model=model)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(_fetch_one, batch) for batch in message_batches]
        return [f.result() for f in futures]

With 3 parallel workers × 15 motions per batch = 45 motions per chunk. This yielded 2.0x speedup (0.5 → 1.0 motions/sec). After fixing the parameter-passing bug (see "Critical gotcha" below), 5 workers × 20 motions = 100 motions per chunk achieved 3.4x speedup (1.1 → 3.7 motions/sec).

3. Model selection for language quality

Tested three models on the same problematic motions:

Model	Quality	Speed	Issues
`google/gemma-4-26b-a4b-it`	Good	Slow (~25s/batch)	Occasional English words
`mistralai/mistral-small-2603`	Excellent	Medium	None observed
`mistralai/mistral-small-3.2-24b-instruct`	Good	Medium	One blank output

Recommendation: mistralai/mistral-small-2603 for Dutch-language tasks. Test on a representative sample (20-50 items) before committing to a full backfill.

4. Post-processing pipeline

Always add a post-processing step to catch model failures:

def _post_process_summary(self, text: str) -> str:
    # 1. Remove lines that are mostly non-Latin (Chinese, Arabic, etc.)
    # 2. Replace known English words with Dutch equivalents
    # 3. Fix common typos (e.g., "lageinkomen" → "laag inkomen")
    # 4. Reject if >10 common English words remain
    # 5. Remove metadata fragments like "(45-102)"
    # 6. Normalize whitespace and punctuation

This caught: Arabic script hallucinations, English words like "filthy", Dutch typos like "formuliernoten", and metadata leaking from titles.

5. Adaptive backfill script

Create a backfill script that monitors throughput and adjusts parameters dynamically:

# Start conservative
api_batch_size = 15
parallel_batches = 3
chunk_size = api_batch_size * parallel_batches  # 45

# After each chunk, measure time
if chunk_time < avg_chunk_time * 0.8:
    api_batch_size += 1  # API is fast, increase batch
elif chunk_time > avg_chunk_time * 1.5:
    api_batch_size -= 1  # API is slow, decrease batch

    # On repeated failures, back off
    if failures_in_row >= 2:
        api_batch_size -= 2
        delay_between_chunks += 1.0

Critical gotcha: pass parallelism config all the way through

When adding parallel_batches to the orchestration layer, make sure it actually reaches the API call. A common bug is adding the parameter to the outer script but leaving hardcoded defaults in the inner method:

# summarizer.py — WRONG (hardcoded defaults)
def generate_layman_explanations_batch_parallel(
    self, motions, model=None, parallel_batches=3, sub_batch_size=15
):
    ...

def update_motion_summaries(
    self, ..., api_batch_size=15, parallel_batches=3  # still hardcoded!
):
    ...
    summaries = self.generate_layman_explanations_batch_parallel(
        motions_for_api,
        model=config.QWEN_MODEL,
        parallel_batches=3,        # <-- BUG: ignores parameter
        sub_batch_size=15,         # <-- BUG: ignores parameter
    )

Fix: pass the parameters through and compute sub_batch_size dynamically:

# summarizer.py — CORRECT
def update_motion_summaries(
    self, ..., api_batch_size=15, parallel_batches=3
):
    ...
    summaries = self.generate_layman_explanations_batch_parallel(
        motions_for_api,
        model=config.QWEN_MODEL,
        parallel_batches=parallel_batches,
        sub_batch_size=api_batch_size // parallel_batches,
    )

Impact of this bug: With parallel_batches=3 hardcoded, increasing backfill.py to 5 workers had zero effect. After fixing: speed jumped from 1.1 → 3.7 motions/sec (3.4x).

6. Accurate cost tracking

Update cost estimates when switching models. The old estimate assumed Qwen at $0.60/M tokens, but Mistral Small is much cheaper:

# Mistral Small 2603 pricing (OpenRouter)
TOKEN_PRICE_PER_MILLION = 0.07  # blended input+output
TOKENS_PER_MOTION = 480         # ~400 input + ~80 output
COST_PER_MOTION = (TOKENS_PER_MOTION / 1_000_000) * TOKEN_PRICE_PER_MILLION
# ≈ $0.000034 per motion

Why This Matters

Speed: From ~700/hour (single) to ~13,320/hour (parallel batch, 5 workers) — 19x faster
Cost: From ~$15-20 estimated to ~$1-2 actual for 30,000 motions
Quality: Model testing on edge cases prevents garbage-in-garbage-out at scale
Reliability: Post-processing catches ~5% of outputs that would degrade UX

When to Apply

Processing 1,000+ items through any LLM API
API charges per-request (not per-token) or has high latency per request
Output quality is critical and model hallucinations are unacceptable
Running overnight/background backfills where throughput matters more than latency

Examples

Before (single-motion API calls)

for motion in motions:
    summary = ai.chat_completion(build_prompt(motion))
    # 50 motions = 50 API calls, ~250s total

After (parallel batching)

# Split into 3 batches of 15
batches = [motions[i:i+15] for i in range(0, 50, 15)]
message_batches = [build_batch_prompt(batch) for batch in batches]
results = chat_completion_json_parallel(message_batches, max_workers=3)
# 50 motions = 3 API calls, ~25s total

ai_provider.py — chat_completion_json_parallel() implementation
summarizer.py — MotionSummarizer with batch and parallel methods
backfill.py — Adaptive backfill script with dynamic parameter tuning

7.5 KiB Raw Blame History