---
title: "Fusion pipeline: vector dimension inconsistency causes padding"
date: 2026-03-23
module: pipeline
problem_type: best_practice
component: fusion-pipeline
severity: low
tags:
  - fusion
  - embeddings
  - vector-dimensions
  - pipeline
  - data-quality
---

# Fusion Pipeline: Vector Dimension Inconsistency Causes Padding

## Context

During a fusion + similarity pipeline run on 2026-03-23, three windows (win-002, win-003, win-005) contained vectors of inconsistent dimensions. The pipeline padded those vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.

## Pipeline Run Summary

| Metric | Value |
|--------|-------|
| Start | 2026-03-23T15:30:00Z |
| End | 2026-03-23T16:47:04Z |
| Duration | 1h 17m 4s |
| Embeddings processed | 28,172 |
| Fused embeddings | 40,524 |
| Similarity rows | 405,216 |

## Per-Window Warnings

| Window | Inserted | Warnings | Issue |
|--------|----------|----------|-------|
| win-002 | 2,048 | 1 | Padded vectors due to dim mismatch |
| win-003 | 4,096 | 2 | Padded vectors due to dim mismatch |
| win-005 | 15,344 | 3 | Padded vectors due to dim mismatch |

**Note:** win-001 and win-004 had no warnings (consistent dimensions).

## Why This Happens

Vector dimensions can become inconsistent across windows when:

1. **Embedding model changes** between window processing runs
2. **Text truncation** produces different effective lengths
3. **Pipeline restarts** after partial failures create mixed batches
4. **Different window sizes** (annual vs quarterly) aggregate different numbers of motions

## Impact

- **Fused embeddings are padded**, not truncated: the original values are preserved and the missing dimensions are filled with zeros
- **Similarity scores** may shift slightly for padded vectors, because the zero-filled dimensions contribute nothing to the dot product
- **No data loss**, but quality degradation in affected windows

## Prevention

1. **Validate dimensions before fusion**

   ```python
   # Before calling fusion, assert that every vector in every window has the
   # same dimension. window_vectors maps window_id -> list of vectors.
   # Note: assert statements are skipped when Python runs with -O.
   dims = {len(v) for vectors in window_vectors.values() for v in vectors}
   assert len(dims) == 1, f"Dimension mismatch: {dims}"
   ```

2. **Re-embed with consistent model/settings** if dimensions differ
   - Don't mix embeddings from different model versions
   - Re-run the full embedding pipeline if the model changes

3. **Window-level dimension checks** in the pipeline:

   ```python
   # In pipeline/fusion.py or equivalent
   for window_id, vectors in window_vectors.items():
       # A set of lengths also handles an empty window without an IndexError
       dims = {len(v) for v in vectors}
       if len(dims) > 1:
           raise ValueError(
               f"Window {window_id}: inconsistent vector dimensions {sorted(dims)}"
           )
   ```

4. **QA sampling after fusion**
   - Perform sample similarity lookups across 20 to 50 items
   - Validate fused vectors against source embeddings
   - Check for anomalies in similarity scores for affected windows

## When to Apply

- Before running the fusion pipeline
- After re-running the embedding pipeline with a new model or new settings
- When adding new windows to an existing fused embedding set
- During QA of similarity cache results

## Related

- `docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md`: Canonical pipeline output sources
- `pipeline/fusion.py`: Fusion pipeline implementation
- `data/motions.db`: `fused_embeddings` and `similarity_cache` tables
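
## Example: Effect of Zero-Padding on Cosine Similarity

The Impact section notes that padding preserves data but may shift similarity scores. The sketch below illustrates why, under two assumptions that are not confirmed by the pipeline source: that padding appends zeros on the right, and that similarity is cosine similarity. The helpers `pad_to` and `cosine` and the toy vectors are hypothetical and exist only for illustration.

```python
import math
from typing import Sequence


def pad_to(vector: Sequence[float], target_dim: int) -> list[float]:
    """Right-pad a vector with zeros up to target_dim (assumed padding scheme)."""
    if len(vector) > target_dim:
        raise ValueError(f"vector has {len(vector)} dims, target is {target_dim}")
    return list(vector) + [0.0] * (target_dim - len(vector))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data, not taken from the pipeline run.
short_a = [1.0, 2.0]
short_b = [2.0, 1.0]
long_c = [1.0, 2.0, 2.0]

# Two vectors padded by the same amount keep their similarity:
# the added zeros change neither the dot product nor the norms.
same_before = cosine(short_a, short_b)
same_after = cosine(pad_to(short_a, 3), pad_to(short_b, 3))

# A padded vector compared with a natively longer one ignores the
# extra component in the dot product but not in the longer vector's norm.
mixed = cosine(pad_to(short_a, 3), long_c)
```

In this toy case `same_before` and `same_after` are equal, while `mixed` is lower than it would be if only the two shared dimensions were compared. That is consistent with scores being only slightly affected within a uniformly padded window and more affected when padded and native vectors are compared with each other.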
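
## Example: Per-Window Dimension Report

The window-level check under Prevention stops at the first inconsistent window. For QA it can be more convenient to see every affected window at once, similar to the Per-Window Warnings table. The following is a minimal sketch of such a report, assuming `window_vectors` is a mapping from window ID to a list of vectors. The function names and the sample data are hypothetical and are not part of `pipeline/fusion.py`.

```python
from collections import Counter
from typing import Mapping, Sequence

Vector = Sequence[float]


def dimension_report(
    window_vectors: Mapping[str, Sequence[Vector]],
) -> dict[str, Counter]:
    """Count how many vectors of each dimension appear in every window."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


def inconsistent_windows(report: Mapping[str, Counter]) -> list[str]:
    """Return the IDs of windows that contain more than one dimension."""
    return sorted(wid for wid, counts in report.items() if len(counts) > 1)


# Illustrative data only; window names and values are made up.
sample = {
    "win-a": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "win-b": [[0.1, 0.2, 0.3], [0.4, 0.5]],
    "win-c": [],
}

report = dimension_report(sample)
flagged = inconsistent_windows(report)

for window_id in flagged:
    print(f"{window_id}: dimensions seen = {dict(report[window_id])}")
```

Returning counts instead of raising lets the caller decide whether to abort, re-embed only the flagged windows, or log and continue.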