docs(solutions): document best practice for deriving blog numbers from pipeline outputs

4 months ago · be4375b303
parent 3a240fd907
commit be4375b303
1 changed file with 92 additions and 0 deletions
--- a/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
+++ b/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
@@ -0,0 +1,92 @@
+---
+title: Always Derive Blog Numbers from Pipeline Outputs, Not Memory
+date: 2026-04-16
+category: docs/solutions/best-practices
+module: documentation
+problem_type: best_practice
+component: documentation
+severity: medium
+applies_when:
+  - Writing or updating a data-driven blog post
+  - Adding EVR percentages, vote counts, or any quantitative claims
+  - Referencing pipeline components (embeddings, fusion, similarity) in public-facing docs
+tags: [blog, pipeline, evr, svd, canonical-outputs, data-driven-docs]
+---
+
+# Always Derive Blog Numbers from Pipeline Outputs, Not Memory
+
+## Context
+
+The political compass blog post was written with hardcoded numbers (explained variance ratio, EVR, of ~32%/~21%; 38 windows) that drifted from the pipeline's actual outputs as the data and methodology evolved. Bringing every figure back in sync required a dedicated maintenance session, which also generated supporting visuals and stripped references to pipeline components not yet deployed to production.
+
+## Guidance
+
+**Pull every quantitative claim directly from the canonical pipeline functions:**
+
+| Claim | Canonical source |
+|-------|-----------------|
+| EVR percentages | `analysis.political_axis.compute_svd_spectrum(window_ids=[...])` |
+| Vote/motion counts | `SELECT COUNT(*) FROM motions` and `SELECT COUNT(*) FROM mp_votes`, run against `data/motions.db` |
+| Window count | `analysis.political_axis` — count of aligned windows |
+| Party agreement | `analysis.explorer_data` or direct SQL on `mp_votes` |
+
+**Never reference pipeline components that are not in production.** If `fused_embeddings` rows exist in the DB but the fusion pipeline is not yet in active use, do not describe it as part of the current workflow in blog copy.
+
+**Generate supporting visuals programmatically** with matplotlib, save them under `docs/research/`, and embed them by relative path in the blog HTML. Regenerating them is then trivial when the numbers change.
+
+## Why This Matters
+
+Hardcoded numbers in blog copy inevitably drift from reality as:
+- More parliamentary windows are added (38 → 41 → …)
+- SVD methodology changes (e.g., Procrustes alignment, window selection)
+- Pipeline components are added or removed from production
+
+When numbers drift, the post loses credibility and requires an expensive archaeology pass to fix. Generating them from the pipeline makes each update a single script run.
+
+## When to Apply
+
+- Before publishing or updating any post that cites quantitative pipeline outputs
+- When the pipeline has changed (new windows, new methodology) and existing posts reference old numbers
+- When removing or adding a pipeline stage — audit all docs for references to that stage
+
+## Examples
+
+**Before (hardcoded, stale):**
+```html
+<p>PC1 explains ~32% of the variance and PC2 explains ~21% — together ~52%.</p>
+```
+
+**After (derived from pipeline, accurate):**
+```python
+# scripts/generate_blog_assets.py
+from analysis.political_axis import compute_svd_spectrum
+
+evr = compute_svd_spectrum(window_ids=["current_parliament"])
+# evr[0] = 0.290, evr[1] = 0.1146 → PC1~29%, PC2~11.5%, total~41%
+```
+```html
+<p>PC1 explains ~29% of the variance and PC2 explains ~11.5% — together ~41%.</p>
+```
+
+**Multi-window EVR (Procrustes-aligned across all 41 windows):**
+```python
+evr_multi = compute_svd_spectrum()  # no window_ids → all windows
+# evr_multi[0] = 0.1463, evr_multi[1] = 0.1310
+```
+
+**Party agreement for a specific window:**
+```python
+import duckdb
+
+con = duckdb.connect("data/motions.db")
+
+# Share of (GroenLinks MP, PvdA MP) vote pairs that match on the same motion,
+# restricted to motions in one quarterly window
+sql = """
+  SELECT AVG(CASE WHEN a.vote = b.vote THEN 1.0 ELSE 0.0 END)
+  FROM mp_votes a JOIN mp_votes b USING (motion_id)
+  WHERE a.party = 'GroenLinks' AND b.party = 'PvdA'
+    AND a.motion_id IN (SELECT id FROM motions WHERE window_id = '2023-Q3')
+"""
+agreement = con.execute(sql).fetchone()[0]  # None if no matching votes exist
+con.close()
+
+## Related
+
+- `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md` — companion guidance on keeping SVD axis *labels* aligned with voting data rather than semantic assumptions
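
The guidance in this file stops at computing the values; the blog sentence itself is still typed by hand in the Examples section. A minimal sketch of rendering that sentence from the spectrum values follows. The names `pct` and `render_evr_sentence`, the rounding rules, and the sentence template are assumptions for illustration and are not part of the repository; the input is assumed to be a sequence of explained-variance ratios in the shape the examples show for `compute_svd_spectrum`.

```python
from typing import Sequence


def pct(fraction: float, digits: int = 1) -> str:
    """Format a fraction such as 0.1463 as '14.6' (a trailing '.0' is dropped)."""
    return f"{round(fraction * 100, digits):g}"


def render_evr_sentence(evr: Sequence[float]) -> str:
    """Build the blog sentence from the first two explained-variance ratios."""
    if len(evr) < 2:
        raise ValueError("need at least two components")
    pc1, pc2 = evr[0], evr[1]
    # Sum the unrounded ratios so the total is not distorted by double rounding.
    total = round((pc1 + pc2) * 100)
    return (
        f"<p>PC1 explains ~{pct(pc1)}% of the variance "
        f"and PC2 explains ~{pct(pc2)}%, together ~{total}%.</p>"
    )


if __name__ == "__main__":
    # Multi-window values quoted in the Examples section; in a real script these
    # would come from compute_svd_spectrum() rather than being typed in.
    print(render_evr_sentence([0.1463, 0.1310]))
```

Computing the total from the unrounded ratios, rather than adding the two displayed percentages, keeps the three figures in one sentence consistent with each other whenever the pipeline output changes.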