docs(solutions): document best practice for deriving blog numbers from pipeline outputs

4 months ago · be4375b303
parent 3a240fd907
commit be4375b303
1 changed files with 92 additions and 0 deletions
--- a/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
+++ b/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
@@ -0,0 +1,92 @@
 ---
 title: Always Derive Blog Numbers from Pipeline Outputs, Not Memory
 date: 2026-04-16
 category: docs/solutions/best-practices
 module: documentation
 problem_type: best_practice
 component: documentation
 severity: medium
 applies_when:
  - Writing or updating a data-driven blog post
  - Adding EVR percentages, vote counts, or any quantitative claims
  - Referencing pipeline components (embeddings, fusion, similarity) in public-facing docs
 tags: [blog, pipeline, evr, svd, canonical-outputs, data-driven-docs]
 ---
 # Always Derive Blog Numbers from Pipeline Outputs, Not Memory
 ## Context
 The political compass blog post was written with hardcoded numbers (EVR ~32%/~21%, 38 windows) that drifted from the pipeline's actual outputs as the data and methodology evolved. A maintenance session was required to bring every figure back in sync, generate supporting visuals, and strip references to pipeline components not yet deployed to production.
 ## Guidance
 **Pull every quantitative claim directly from the canonical pipeline functions:**
 | Claim | Canonical source |
 |-------|-----------------|
 | EVR percentages | `analysis.political_axis.compute_svd_spectrum(window_ids=[...])` |
 | Vote/motion counts | `SELECT COUNT(*) FROM mp_votes` / `SELECT COUNT(*) FROM motions`, run against `data/motions.db` |
 | Window count | `analysis.political_axis` — count of aligned windows |
 | Party agreement | `analysis.explorer_data` or direct SQL on `mp_votes` |
 **Never reference pipeline components that are not in production.** If `fused_embeddings` rows exist in the DB but the fusion pipeline is not yet in active use, do not describe it as part of the current workflow in blog copy.
 **Generate supporting visuals programmatically** (matplotlib → `docs/research/`) and embed them by relative path in the blog HTML. This makes regeneration trivial when numbers change.
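A regenerable visual could be produced along the following lines. This is a minimal sketch, not the project's actual asset script: `scree_labels`, `save_scree_plot`, and the output file name are hypothetical, and the values passed in would in practice come from `compute_svd_spectrum`.

```python
from pathlib import Path


def scree_labels(evr: list[float]) -> list[str]:
    """Return axis labels such as 'PC1 (29.0%)' for each explained-variance ratio."""
    return [f"PC{i} ({value * 100:.1f}%)" for i, value in enumerate(evr, start=1)]


def save_scree_plot(evr: list[float], out_path: Path) -> Path:
    """Draw a bar chart of explained-variance ratios and write it to `out_path`."""
    # Imported here so the labels helper stays usable without matplotlib installed.
    import matplotlib

    matplotlib.use("Agg")  # headless backend: no display needed
    import matplotlib.pyplot as plt

    out_path.parent.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots(figsize=(6, 4))
    # One bar per component, labelled with its share of the variance.
    ax.bar(scree_labels(evr), [value * 100 for value in evr])
    ax.set_ylabel("Explained variance (%)")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path
```

Calling `save_scree_plot(evr, Path("docs/research/evr_scree.png"))` (a hypothetical file name) after each pipeline run would keep the image in step with the numbers quoted next to it.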
 ## Why This Matters
 Hardcoded numbers in blog copy inevitably drift from reality as:
 - More parliamentary windows are added (38 → 41 → …)
 - SVD methodology changes (e.g., Procrustes alignment, window selection)
 - Pipeline components are added or removed from production
 When numbers drift, the post loses credibility and requires an expensive archaeology pass to fix. Generating them from the pipeline makes each update a single script run.
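Drift can also be caught mechanically by comparing the percentages printed in the blog copy against freshly computed values. The sketch below only illustrates the idea: `find_stale_percentages`, the `~NN%` regex convention, and the tolerance are assumptions, and the `fresh` fractions are passed in by hand here so the example runs without the pipeline.

```python
import re

# Matches figures such as "~32%" or "~11.5%" in blog copy (assumed convention).
PERCENT_RE = re.compile(r"~(\d+(?:\.\d+)?)%")


def find_stale_percentages(html: str, fresh: list[float], tol: float = 0.5) -> list[float]:
    """Return published percentages that match no fresh value within `tol` points.

    `fresh` holds fractions (0.29 means 29%), e.g. values returned by the pipeline.
    """
    fresh_pct = [value * 100 for value in fresh]
    published = [float(match) for match in PERCENT_RE.findall(html)]
    return [p for p in published if all(abs(p - f) > tol for f in fresh_pct)]


stale_copy = "<p>PC1 explains ~32% of the variance and PC2 explains ~21%.</p>"
print(find_stale_percentages(stale_copy, fresh=[0.290, 0.1146]))  # prints [32.0, 21.0]
```

A check like this flags derived figures (such as a combined total) unless the matching fresh value is passed in as well, so it works best when every published number has a corresponding pipeline value.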
 ## When to Apply
 - Before publishing or updating any post that cites quantitative pipeline outputs
 - When the pipeline has changed (new windows, new methodology) and existing posts reference old numbers
 - When removing or adding a pipeline stage — audit all docs for references to that stage
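The audit in the last bullet can be scripted. The following is a rough sketch using only the standard library; `find_stage_references` and the choice of file suffixes are hypothetical, and a plain substring match is assumed to be sufficient for stage names such as `fused_embeddings`.

```python
from pathlib import Path


def find_stage_references(
    root: Path, stage: str, suffixes: tuple[str, ...] = (".md", ".html")
) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line text) for each line mentioning `stage`."""
    hits = []
    for path in sorted(root.rglob("*")):
        # Only inspect documentation-like files.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if stage in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

For example, `find_stage_references(Path("docs"), "fused_embeddings")` would list every doc line to review before the fusion stage is described as live or removed.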
 ## Examples
 **Before (hardcoded, stale):**
 ```html
 <p>PC1 explains ~32% of the variance and PC2 explains ~21% — together ~52%.</p>
 ```
 **After (derived from pipeline, accurate):**
 ```python
 # scripts/generate_blog_assets.py
 from analysis.political_axis import compute_svd_spectrum
 evr = compute_svd_spectrum(window_ids=["current_parliament"])
 # evr[0] = 0.290, evr[1] = 0.1146 → PC1 ~29%, PC2 ~11.5%, total ~40.5%
 ```
 ```html
 <p>PC1 explains ~29% of the variance and PC2 explains ~11.5%, together ~40.5%.</p>
 ```
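The example above still leaves the sentence itself to be typed by hand. A small rendering helper could close that gap, so the copy is produced from the same values the script computes. This is a sketch under assumptions: `format_pct` and `render_evr_paragraph` are hypothetical helpers, and the list passed in stands in for the return value of `compute_svd_spectrum`.

```python
def format_pct(fraction: float) -> str:
    """Format a fraction as a percentage with at most one decimal place."""
    # "29.0" becomes "29"; "11.5" stays "11.5".
    return f"{fraction * 100:.1f}".rstrip("0").rstrip(".")


def render_evr_paragraph(evr: list[float]) -> str:
    """Build the blog sentence from the first two explained-variance ratios."""
    pc1, pc2 = evr[0], evr[1]
    return (
        f"<p>PC1 explains ~{format_pct(pc1)}% of the variance and "
        f"PC2 explains ~{format_pct(pc2)}%, together ~{format_pct(pc1 + pc2)}%.</p>"
    )


# Placeholder input; in practice pass the pipeline output directly.
print(render_evr_paragraph([0.290, 0.1146]))
```

Computing the total from the unrounded ratios, rather than adding the rounded figures, avoids a combined number that disagrees with the pipeline.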
 **Multi-window EVR (Procrustes-aligned across all 41 windows):**
 ```python
 from analysis.political_axis import compute_svd_spectrum
 evr_multi = compute_svd_spectrum()  # no window_ids → all windows
 # evr_multi[0] = 0.1463, evr_multi[1] = 0.1310
 ```
 **Party agreement for a specific window:**
 ```python
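 import duckdb
 con = duckdb.connect("data/motions.db")
 # Agreement between two parties in a quarter, averaged over
 # every pair of votes (one from each party) on the same motion
 sql = """
  SELECT AVG(CASE WHEN a.vote = b.vote THEN 1.0 ELSE 0.0 END)
  FROM mp_votes a JOIN mp_votes b USING (motion_id)
  WHERE a.party = 'GroenLinks' AND b.party = 'PvdA'
    AND a.motion_id IN (SELECT id FROM motions WHERE window_id = '2023-Q3')
 """
 # Run the query; the result is a fraction in [0, 1], or None if no rows match
 agreement = con.execute(sql).fetchone()[0]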
 ```
 ## Related
 - `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md` — companion guidance on keeping SVD axis *labels* aligned with voting data rather than semantic assumptions