---
title: "Parallel API batching for AI motion summarization with adaptive throughput"
date: "2026-05-05"
category: workflow-issues
module: summarizer
problem_type: workflow_issue
component: service_object
severity: medium
applies_when:
  - "Backfilling large numbers of AI-generated summaries via an API"
  - "Rate limits or slow throughput bottlenecking batch processing"
  - "Need to process 10,000+ items with an LLM API"
tags:
  - parallelization
  - batching
  - openrouter
  - mistral
  - throughput
  - cost-optimization
---

# Parallel API Batching for AI Motion Summarization

## Context

The task was to generate layman-friendly explanations for 29,000+ parliamentary motions via an LLM API. The initial approach processed one motion per API call, yielding ~700 motions/hour with significant per-request overhead. At this rate, the full backfill would take ~40 hours and cost ~$15-20. Budget tracking was also inaccurate: the script stopped at its estimated $5.00 cap when actual API spend was only ~$1.78.

## Guidance

### 1. Application-level batching (not native API batching)

The OpenAI-compatible `/chat/completions` endpoint (used here via OpenRouter) does not support multiple independent conversations in one request. Instead, pack 10-20 motions into a single prompt and request structured JSON output:

```python
# Build one prompt with N motions. The prompt is Dutch because the summaries
# are Dutch; in English: "You get N motions. Write 2-3 sentences of
# explanation for EACH motion. Answer as JSON: {"motion_id": "explanation"}".
prompt = f"""Je krijgt {len(motions)} moties.
Schrijf voor ELKE motie 2-3 zinnen uitleg.
Geef antwoord als JSON: {{"motion_id": "uitleg", ...}}

{motions_block}
"""

# Request JSON mode
payload = {
    "model": model,
    "messages": messages,
    "response_format": {"type": "json_object"}
}
```
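
With 10-20 motions per request, the number of API calls drops by 90-95%, which removes most of the per-request HTTP overhead and avoids resending the instruction part of the prompt for every motion.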
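
Batched JSON output needs validation, because the model can drop a motion or return an empty explanation. A minimal sketch, assuming the reply is a JSON object keyed by motion ID as requested in the prompt; `parse_batch_response` and its arguments are hypothetical names, not part of the original code:

```python
import json
from typing import Dict, List, Tuple


def parse_batch_response(raw: str, expected_ids: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (summaries, missing_ids) for one batched JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable reply: every motion in the batch needs a retry
        return {}, list(expected_ids)
    if not isinstance(data, dict):
        return {}, list(expected_ids)

    summaries: Dict[str, str] = {}
    missing: List[str] = []
    for motion_id in expected_ids:
        text = data.get(motion_id)
        if isinstance(text, str) and text.strip():
            summaries[motion_id] = text.strip()
        else:
            # Absent, empty, or non-string explanation
            missing.append(motion_id)
    return summaries, missing
```

Motions that end up in `missing_ids` can be queued for a retry in a later, possibly smaller, batch.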

### 2. Parallel API requests with ThreadPoolExecutor

When the API supports concurrent requests, use `ThreadPoolExecutor` to keep several requests in flight at once:

```python
from concurrent.futures import ThreadPoolExecutor

def chat_completion_json_parallel(
    message_batches, model=None, max_workers=3
):
    def _fetch_one(messages):
        return chat_completion_json(messages, model=model)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(_fetch_one, batch) for batch in message_batches]
        # Results are collected in submission order
        return [f.result() for f in futures]
```

With 3 parallel workers × 15 motions per batch, each chunk holds 45 motions. This yielded a **2.0x speedup** (0.5 → 1.0 motions/sec). After fixing the parameter-passing bug (see "Critical gotcha" below), 5 workers × 20 motions = 100 motions per chunk achieved a **3.4x speedup** (1.1 → 3.7 motions/sec).
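
One caveat with collecting `f.result()` directly is that a single failed request raises and discards the results of the whole chunk. A sketch of a more tolerant variant, assuming failed batches should come back as `None` so the caller can retry them; `run_batches_parallel` and the `fetch` callable are hypothetical stand-ins for the real API wrapper:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batches_parallel(fetch, message_batches, max_workers=3):
    """Call fetch(batch) concurrently and return results in input order.

    A batch whose call raises is reported as None instead of aborting
    the whole chunk.
    """
    def _safe_fetch(batch):
        try:
            return fetch(batch)
        except Exception:  # broad on purpose: any failure marks the batch
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the order of message_batches
        return list(executor.map(_safe_fetch, message_batches))
```

The caller can then pair each `None` with its input batch and resubmit only those batches.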

### 3. Model selection for language quality

Tested three models on the same problematic motions:

| Model | Quality | Speed | Issues |
|-------|---------|-------|--------|
| `google/gemma-4-26b-a4b-it` | Good | Slow (~25s/batch) | Occasional English words |
| `mistralai/mistral-small-2603` | **Excellent** | Medium | None observed |
| `mistralai/mistral-small-3.2-24b-instruct` | Good | Medium | One blank output |

**Recommendation**: `mistralai/mistral-small-2603` for Dutch-language tasks. Test on a representative sample (20-50 items) before committing to a full backfill.

### 4. Post-processing pipeline

Always add a post-processing step to catch model failures:

```python
def _post_process_summary(self, text: str) -> str:
    # 1. Remove lines that are mostly non-Latin (Chinese, Arabic, etc.)
    # 2. Replace known English words with Dutch equivalents
    # 3. Fix common typos (e.g., "lageinkomen" → "laag inkomen")
    # 4. Reject if >10 common English words remain
    # 5. Remove metadata fragments like "(45-102)"
    # 6. Normalize whitespace and punctuation
    ...  # step bodies omitted in this outline
```

This caught Arabic-script hallucinations, stray English words such as "filthy", Dutch typos such as "formuliernoten", and metadata leaking from titles.
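
The six steps in the outline can be sketched as a standalone function. This is an illustration under assumptions, not the project's implementation: the word tables are hypothetical placeholders, the 50% non-Latin threshold is a guess, and rejection is signalled by returning `None`.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical example tables; real lists would come from observed failures.
ENGLISH_TO_DUTCH = {"filthy": "vies"}
TYPO_FIXES = {"lageinkomen": "laag inkomen"}
COMMON_ENGLISH = {"the", "and", "with", "which", "this", "that", "from", "have"}


def _mostly_non_latin(line: str, threshold: float = 0.5) -> bool:
    """True if more than `threshold` of the letters are not Latin script."""
    letters = [ch for ch in line if ch.isalpha()]
    if not letters:
        return False
    non_latin = sum(
        1 for ch in letters if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters) > threshold


def post_process_summary(text: str, max_english: int = 10) -> Optional[str]:
    """Clean one summary; return None when it should be rejected."""
    # 1. Drop lines dominated by non-Latin script
    lines = [ln for ln in text.splitlines() if not _mostly_non_latin(ln)]
    text = " ".join(lines)
    # 2 + 3. Whole-word replacements for known English words and typos
    for wrong, right in {**ENGLISH_TO_DUTCH, **TYPO_FIXES}.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    # 4. Reject if too many common English words remain
    words = re.findall(r"[a-z]+", text.lower())
    if sum(1 for w in words if w in COMMON_ENGLISH) > max_english:
        return None
    # 5. Remove metadata fragments such as "(45-102)"
    text = re.sub(r"\(\d+-\d+\)", "", text)
    # 6. Collapse whitespace and remove spaces before punctuation
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text.strip()
```

Returning `None` rather than a best-effort string lets the caller re-queue the motion instead of storing a summary that is mostly in the wrong language.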

### 5. Adaptive backfill script

Create a backfill script that monitors throughput and adjusts parameters dynamically:

```python
# Start conservative
api_batch_size = 15
parallel_batches = 3
chunk_size = api_batch_size * parallel_batches  # 45

# After each chunk, compare its duration with the running average
# (chunk_time and avg_chunk_time are maintained by the surrounding loop)
if chunk_time < avg_chunk_time * 0.8:
    api_batch_size += 1  # API is fast, increase batch
elif chunk_time > avg_chunk_time * 1.5:
    api_batch_size = max(1, api_batch_size - 1)  # API is slow, decrease batch

# On repeated failures, back off
if failures_in_row >= 2:
    api_batch_size = max(1, api_batch_size - 2)  # never drop below 1
    delay_between_chunks += 1.0

# Recompute the chunk size after any adjustment
chunk_size = api_batch_size * parallel_batches
```
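
The fragment above leaves out the bookkeeping around it. A self-contained sketch of the same rules follows, assuming a running average of chunk times and hard bounds on the batch size; the `AdaptiveTuner` class, its bounds, and its smoothing factor are illustrative assumptions rather than the contents of `backfill.py`:

```python
from dataclasses import dataclass


@dataclass
class AdaptiveTuner:
    api_batch_size: int = 15
    parallel_batches: int = 3
    min_batch: int = 5       # assumed lower bound
    max_batch: int = 25      # assumed upper bound
    delay_between_chunks: float = 0.0
    avg_chunk_time: float = 0.0
    failures_in_row: int = 0

    @property
    def chunk_size(self) -> int:
        return self.api_batch_size * self.parallel_batches

    def _clamp(self, size: int) -> int:
        return max(self.min_batch, min(self.max_batch, size))

    def record_success(self, chunk_time: float) -> None:
        self.failures_in_row = 0
        if self.avg_chunk_time == 0.0:
            self.avg_chunk_time = chunk_time  # first sample seeds the average
            return
        if chunk_time < self.avg_chunk_time * 0.8:
            self.api_batch_size = self._clamp(self.api_batch_size + 1)
        elif chunk_time > self.avg_chunk_time * 1.5:
            self.api_batch_size = self._clamp(self.api_batch_size - 1)
        # Exponential moving average with an assumed smoothing factor of 0.3
        self.avg_chunk_time = 0.7 * self.avg_chunk_time + 0.3 * chunk_time

    def record_failure(self) -> None:
        self.failures_in_row += 1
        if self.failures_in_row >= 2:
            self.api_batch_size = self._clamp(self.api_batch_size - 2)
            self.delay_between_chunks += 1.0
```

Deriving `chunk_size` as a property keeps it in step with `api_batch_size`, so an adjustment cannot be forgotten when the next chunk is fetched.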

### Critical gotcha: pass parallelism config all the way through

When adding `parallel_batches` to the orchestration layer, make sure it actually reaches the API call. A common bug is adding the parameter to the outer script but leaving hardcoded values in the inner method's call:

```python
# summarizer.py: WRONG (hardcoded values)
def generate_layman_explanations_batch_parallel(
    self, motions, model=None, parallel_batches=3, sub_batch_size=15
):
    ...

def update_motion_summaries(
    self, ..., api_batch_size=15, parallel_batches=3  # accepted here, ignored below
):
    ...
    summaries = self.generate_layman_explanations_batch_parallel(
        motions_for_api,
        model=config.QWEN_MODEL,
        parallel_batches=3,   # <-- BUG: ignores parameter
        sub_batch_size=15,    # <-- BUG: ignores parameter
    )
```

**Fix**: pass the parameters through and compute `sub_batch_size` dynamically:

```python
# summarizer.py: CORRECT
def update_motion_summaries(
    self, ..., api_batch_size=15, parallel_batches=3
):
    ...
    summaries = self.generate_layman_explanations_batch_parallel(
        motions_for_api,
        model=config.QWEN_MODEL,
        parallel_batches=parallel_batches,
        # The division assumes api_batch_size counts motions per chunk; if it
        # counts motions per request (as in section 5), pass it unchanged.
        sub_batch_size=api_batch_size // parallel_batches,
    )
```

**Impact of this bug**: With `parallel_batches=3` hardcoded, raising the worker count in `backfill.py` to 5 had zero effect. After the fix, speed jumped from **1.1 → 3.7 motions/sec** (3.4x).
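
A small regression test can guard against this class of bug by checking that the values given to the outer method are the ones the inner method receives. A sketch using a stripped-down stand-in class: `FakeSummarizer` is hypothetical and mirrors only the two method names from the snippets above, and for simplicity it forwards `api_batch_size` unchanged.

```python
class FakeSummarizer:
    """Stand-in that records what the inner method was called with."""

    def __init__(self):
        self.calls = []

    def generate_layman_explanations_batch_parallel(
        self, motions, model=None, parallel_batches=3, sub_batch_size=15
    ):
        self.calls.append((parallel_batches, sub_batch_size))
        return {}

    def update_motion_summaries(self, motions, api_batch_size=15, parallel_batches=3):
        # Forward the caller's values instead of repeating literals
        return self.generate_layman_explanations_batch_parallel(
            motions,
            parallel_batches=parallel_batches,
            sub_batch_size=api_batch_size,
        )


def test_parallelism_config_reaches_api_layer():
    summarizer = FakeSummarizer()
    summarizer.update_motion_summaries([], api_batch_size=20, parallel_batches=5)
    # Non-default values make a hardcoded 3/15 show up as a failure
    assert summarizer.calls == [(5, 20)]
```

The test deliberately uses values that differ from the defaults; with 3 and 15 a hardcoded call would pass unnoticed. Against the real class, the inner method could be replaced with a recording stub in the same way.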

### 6. Accurate cost tracking

Update cost estimates when switching models. The old estimate assumed Qwen at $0.60/M tokens, but Mistral Small is much cheaper:

```python
# Mistral Small 2603 pricing (OpenRouter)
TOKEN_PRICE_PER_MILLION = 0.07  # USD, blended input+output
TOKENS_PER_MOTION = 480  # ~400 input + ~80 output
COST_PER_MOTION = (TOKENS_PER_MOTION / 1_000_000) * TOKEN_PRICE_PER_MILLION
# ≈ $0.000034 per motion
```
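
Plugging these constants into the backfill gives a quick sanity check on budget and duration, and counting the tokens the API reports avoids the gap between estimated and actual spend described in the Context section. A sketch, assuming each response carries an OpenAI-style `usage` object with a `total_tokens` field; `BudgetTracker` and `project_backfill` are hypothetical helpers, and the 30,000-motion and 3.7 motions/sec figures are taken from this document:

```python
from typing import Tuple

TOKEN_PRICE_PER_MILLION = 0.07   # USD, blended input+output
TOKENS_PER_MOTION = 480


def project_backfill(n_motions: int, motions_per_sec: float) -> Tuple[float, float]:
    """Return (estimated cost in USD, estimated duration in hours)."""
    cost = n_motions * TOKENS_PER_MOTION / 1_000_000 * TOKEN_PRICE_PER_MILLION
    hours = n_motions / motions_per_sec / 3600
    return cost, hours


class BudgetTracker:
    """Accumulate spend from reported token usage rather than estimates."""

    def __init__(self, cap_usd: float, price_per_million: float = TOKEN_PRICE_PER_MILLION):
        self.cap_usd = cap_usd
        self.price_per_million = price_per_million
        self.tokens_used = 0

    def record(self, response: dict) -> None:
        # Assumption: response["usage"]["total_tokens"] is present;
        # a response without it contributes zero tokens.
        self.tokens_used += response.get("usage", {}).get("total_tokens", 0)

    @property
    def spent_usd(self) -> float:
        return self.tokens_used / 1_000_000 * self.price_per_million

    def exhausted(self) -> bool:
        return self.spent_usd >= self.cap_usd


cost, hours = project_backfill(30_000, 3.7)
print(f"~${cost:.2f}, ~{hours:.1f} h")  # ~$1.01, ~2.3 h
```

Applying one blended price to `total_tokens` is itself an approximation; if input and output tokens are billed at different rates, the tracker would need to price them separately.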

## Why This Matters

- **Speed**: From ~700 motions/hour (single) to ~13,320 motions/hour (parallel batch, 5 workers), which is **19x faster**
- **Cost**: From ~$15-20 estimated to ~$1-2 actual for 30,000 motions
- **Quality**: Model testing on edge cases prevents garbage-in-garbage-out at scale
- **Reliability**: Post-processing catches ~5% of outputs that would degrade UX

## When to Apply

- Processing 1,000+ items through any LLM API
- The API charges per request (not per token) or has high per-request latency
- Output quality is critical and model hallucinations are unacceptable
- Running overnight/background backfills where throughput matters more than latency

## Examples

### Before (single-motion API calls)
```python
for motion in motions:
    summary = ai.chat_completion(build_prompt(motion))
# 50 motions = 50 API calls, ~250s total
```

### After (parallel batching)
```python
# Split into batches of up to 15 motions: 50 motions -> 15 + 15 + 15 + 5
batches = [motions[i:i+15] for i in range(0, len(motions), 15)]
message_batches = [build_batch_prompt(batch) for batch in batches]
results = chat_completion_json_parallel(message_batches, max_workers=3)
# 50 motions = 4 API calls, ~25s total
```

## Related

- `ai_provider.py`: `chat_completion_json_parallel()` implementation
- `summarizer.py`: `MotionSummarizer` with batch and parallel methods
- `backfill.py`: adaptive backfill script with dynamic parameter tuning