From be4375b30357424a0aa0db3afa242fd74c9a0032 Mon Sep 17 00:00:00 2001
From: Sven Geboers
Date: Thu, 16 Apr 2026 18:52:53 +0200
Subject: [PATCH] docs(solutions): document best practice for deriving blog numbers from pipeline outputs

---
 ...umbers-from-pipeline-outputs-2026-04-16.md | 92 +++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md

diff --git a/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md b/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
new file mode 100644
index 0000000..ad01a7d
--- /dev/null
+++ b/docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md
@@ -0,0 +1,92 @@
+---
+title: Always Derive Blog Numbers from Pipeline Outputs, Not Memory
+date: 2026-04-16
+category: docs/solutions/best-practices
+module: documentation
+problem_type: best_practice
+component: documentation
+severity: medium
+applies_when:
+  - Writing or updating a data-driven blog post
+  - Adding EVR percentages, vote counts, or any quantitative claims
+  - Referencing pipeline components (embeddings, fusion, similarity) in public-facing docs
+tags: [blog, pipeline, evr, svd, canonical-outputs, data-driven-docs]
+---
+
+# Always Derive Blog Numbers from Pipeline Outputs, Not Memory
+
+## Context
+
+The political compass blog post was written with hardcoded numbers (EVR ~32%/~21%, 38 windows) that drifted from the pipeline's actual outputs as the data and methodology evolved. A maintenance session was required to bring every figure back in sync, generate supporting visuals, and strip references to pipeline components not yet deployed to production.
+
+## Guidance
+
+**Pull every quantitative claim directly from the canonical pipeline functions:**
+
+| Claim | Canonical source |
+|-------|-----------------|
+| EVR percentages | `analysis.political_axis.compute_svd_spectrum(window_ids=[...])` |
+| Vote/motion counts | `SELECT COUNT(*)` on the `motions` and `mp_votes` tables in `data/motions.db` |
+| Window count | `analysis.political_axis`: count of aligned windows |
+| Party agreement | `analysis.explorer_data` or direct SQL on `mp_votes` |
+
+**Never reference pipeline components that are not in production.** If `fused_embeddings` rows exist in the DB but the fusion pipeline is not yet in active use, do not describe it as part of the current workflow in blog copy.
+
+**Generate supporting visuals programmatically** (matplotlib → `docs/research/`) and embed them by relative path in the blog HTML. This makes regeneration trivial when numbers change.
+
+## Why This Matters
+
+Hardcoded numbers in blog copy inevitably drift from reality as:
+- More parliamentary windows are added (38 → 41 → …)
+- SVD methodology changes (e.g., Procrustes alignment, window selection)
+- Pipeline components are added or removed from production
+
+When numbers drift, the post loses credibility and requires an expensive archaeology pass to fix. Generating them from the pipeline makes each update a single script run.
+
+## When to Apply
+
+- Before publishing or updating any post that cites quantitative pipeline outputs
+- When the pipeline has changed (new windows, new methodology) and existing posts reference old numbers
+- When removing or adding a pipeline stage: audit all docs for references to that stage
+
+## Examples
+
+**Before (hardcoded, stale):**
+```html
+PC1 explains ~32% of the variance and PC2 explains ~21%, together ~52%.
+```
+
+**After (derived from pipeline, accurate):**
+```python
+# scripts/generate_blog_assets.py
+from analysis.political_axis import compute_svd_spectrum
+
+evr = compute_svd_spectrum(window_ids=["current_parliament"])
+# evr[0] = 0.290, evr[1] = 0.1146 → PC1~29%, PC2~11.5%, total~41%
+```
+```html
+PC1 explains ~29% of the variance and PC2 explains ~11.5%, together ~41%.
+```
+
+**Multi-window EVR (Procrustes-aligned across all 41 windows):**
+```python
+evr_multi = compute_svd_spectrum()  # no window_ids → all windows
+# evr_multi[0] = 0.1463, evr_multi[1] = 0.1310
+```
+
+**Party agreement for a specific window:**
+```python
+import duckdb
+con = duckdb.connect("data/motions.db")
+# Agreement between two parties in a quarter
+sql = """
+    SELECT AVG(CASE WHEN a.vote = b.vote THEN 1.0 ELSE 0.0 END)
+    FROM mp_votes a JOIN mp_votes b USING (motion_id)
+    WHERE a.party = 'GroenLinks' AND b.party = 'PvdA'
+      AND a.motion_id IN (SELECT id FROM motions WHERE window_id = '2023-Q3')
+"""
+# Run the query; the single result row holds the agreement fraction
+agreement = con.execute(sql).fetchone()[0]
+```
+
+## Related
+
+- `docs/solutions/best-practices/svd-labels-voting-patterns-not-semantics.md`: companion guidance on keeping SVD axis *labels* aligned with voting data rather than semantic assumptions
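
The "After" example in the patch still copies the EVR values into the HTML by hand. One way to close that gap is to generate the sentence itself from the values the pipeline returns. The following is a minimal sketch under stated assumptions: `EVR_SENTENCE` and `render_evr_sentence` are hypothetical names that do not exist in the repository, the one-decimal rounding is an arbitrary choice, and the input is assumed to be a sequence of explained-variance ratios such as the one `compute_svd_spectrum` appears to return.

```python
from string import Template

# Hypothetical sentence template; the placeholder names are illustrative.
EVR_SENTENCE = Template(
    "PC1 explains ~${pc1}% of the variance and PC2 explains ~${pc2}%, "
    "together ~${total}%."
)


def render_evr_sentence(evr):
    """Format the first two explained-variance ratios as blog copy."""
    if len(evr) < 2:
        raise ValueError("need at least two explained-variance ratios")
    pc1, pc2 = evr[0] * 100, evr[1] * 100
    # The total is computed from the same values rather than typed by hand.
    return EVR_SENTENCE.substitute(
        pc1=f"{pc1:.1f}", pc2=f"{pc2:.1f}", total=f"{pc1 + pc2:.1f}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only, not real pipeline output.
    # In a real script these would come from compute_svd_spectrum(...).
    print(render_evr_sentence([0.25, 0.125]))
```

A script such as `scripts/generate_blog_assets.py` could call this helper with the pipeline's output and write the result into the blog HTML, so that the per-component figures and the total are always produced by the same run.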