---
title: "Fusion pipeline: vector dimension inconsistency causes padding"
date: 2026-03-23
module: pipeline
problem_type: best_practice
component: fusion-pipeline
severity: low
tags:
  - fusion
  - embeddings
  - vector-dimensions
  - pipeline
  - data-quality
---

# Fusion Pipeline: Vector Dimension Inconsistency Causes Padding

## Context

During the fusion + similarity pipeline run on 2026-03-23, three windows (win-002, win-003, and win-005) contained vectors with inconsistent dimensions. The pipeline padded the affected vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.

## Pipeline Run Summary

| Metric | Value |
|--------|-------|
| Start | 2026-03-23T15:30:00Z |
| End | 2026-03-23T16:47:04Z |
| Duration | 1h 17m 4s |
| Embeddings processed | 28,172 |
| Fused embeddings | 40,524 |
| Similarity rows | 405,216 |

## Per-Window Warnings

| Window | Inserted | Warnings | Issue |
|--------|----------|----------|-------|
| win-002 | 2,048 | 1 | Padded vectors due to dim mismatch |
| win-003 | 4,096 | 2 | Padded vectors due to dim mismatch |
| win-005 | 15,344 | 3 | Padded vectors due to dim mismatch |

**Note:** win-001 and win-004 had no warnings (consistent dimensions).

## Why This Happens

Vector dimensions can become inconsistent across windows when:
1. **Embedding model changes** between window processing runs
2. **Text truncation** produces different effective lengths
3. **Pipeline restarts** after partial failures create mixed batches
4. **Different window sizes** (annual vs quarterly) aggregate different numbers of motions
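
To narrow down which of these causes applies, it can help to see which dimensions actually occur in each window. The following is a minimal sketch, assuming the vectors are available as a dict mapping window IDs to lists of vectors (the same shape used in the Prevention section). The function names and the toy data are hypothetical and not part of `pipeline/fusion.py`.

```python
from collections import Counter


def dimension_histogram(window_vectors):
    """Return {window_id: Counter({dimension: count})} for a dict of vector lists."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


def mixed_windows(window_vectors):
    """Return only the windows that contain more than one vector dimension."""
    return {
        window_id: counts
        for window_id, counts in dimension_histogram(window_vectors).items()
        if len(counts) > 1
    }


# Hypothetical toy data, not taken from the pipeline run.
example = {
    "win-a": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "win-b": [[0.1, 0.2, 0.3], [0.4, 0.5]],
}
report = mixed_windows(example)
for window_id, counts in report.items():
    print(window_id, dict(counts))
```

A window with one dominant dimension and a few outliers may point to a restart or mixed batch, while an even split may point to a model change. This is a heuristic reading, not something the pipeline reports itself.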

## Impact

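- **Fused embeddings are padded**, not truncated. The original values are preserved, and shorter vectors are extended with zeros.
- **Similarity scores** involving padded vectors may be slightly affected, because the zero-filled components contribute nothing to the dot product or to the vector norm.
- **No data loss**, but similarity quality may degrade in the affected windows.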
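
The effect of zero-padding on similarity can be illustrated with a small worked example. This is a sketch under assumptions: it assumes zeros are appended at the end of the shorter vector and that similarity is cosine similarity, neither of which is confirmed here for `pipeline/fusion.py`. The helper names `pad_to` and `cosine` are hypothetical.

```python
import math


def pad_to(vector, dim):
    """Right-pad a vector with zeros up to `dim` (assumed padding scheme)."""
    if len(vector) > dim:
        raise ValueError(f"Vector of length {len(vector)} exceeds target {dim}")
    return list(vector) + [0.0] * (dim - len(vector))


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short = [1.0, 2.0]          # hypothetical 2-dimensional vector
full = [1.0, 2.0, 2.0]      # hypothetical 3-dimensional vector
padded = pad_to(short, 3)   # [1.0, 2.0, 0.0]

# The third component of `full` is ignored by the dot product,
# but it still counts toward the norm of `full`.
score = cosine(padded, full)

# Two vectors padded by the same amount keep their original cosine similarity.
same_padding = cosine(pad_to([1.0, 2.0], 3), pad_to([2.0, 4.0], 3))
```

In this toy case the padded vector matches `full` on every shared component, yet the score falls below 1 because `full` carries extra magnitude in a dimension the padded vector cannot represent. Comparisons between vectors padded by the same amount are unaffected.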

## Prevention

1. **Validate dimensions before fusion**
   ```python
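   # Before calling fusion, check that all vectors share one dimension.
   # `vectors` is the flat list of vectors about to be fused.
   dims = {len(v) for v in vectors}
   if len(dims) != 1:
       # Raise instead of assert: assertions are skipped under `python -O`.
       raise ValueError(f"Dimension mismatch: {sorted(dims)}")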
   ```

2. **Re-embed with consistent model/settings** if dimensions differ
   - Don't mix embeddings from different model versions
   - Re-run the full embedding pipeline if the model changes

3. **Window-level dimension checks** in the pipeline:
   ```python
   # In pipeline/fusion.py or equivalent
   for window_id, vectors in window_vectors.items():
       if not vectors:
           continue  # an empty window has nothing to check
       dim = len(vectors[0])
       if not all(len(v) == dim for v in vectors):
           raise ValueError(f"Window {window_id}: inconsistent vector dimensions")
   ```

4. **QA sampling after fusion**
   - Perform sample similarity lookups across N=20-50 items
   - Validate fused vectors against source embeddings
   - Check for anomalies in similarity scores for affected windows
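
One way to approach the sampling step is to draw a reproducible sample of fused vectors and flag those that end in exact zeros, which is a rough indicator of padding. This is a heuristic sketch only: it assumes padding is appended at the end of the vector and that fused vectors are available as a dict keyed by item ID. The function names, parameters, and toy data are hypothetical.

```python
import random


def trailing_zero_count(vector):
    """Count how many consecutive exact zeros a vector ends with."""
    count = 0
    for value in reversed(vector):
        if value != 0.0:
            break
        count += 1
    return count


def sample_padding_suspects(fused, sample_size=20, seed=0, min_trailing_zeros=1):
    """Sample up to `sample_size` items and return {item_id: trailing zeros}
    for the sampled vectors that look padded."""
    rng = random.Random(seed)  # seeded so the QA sample is reproducible
    ids = sorted(fused)
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    suspects = {}
    for item_id in chosen:
        zeros = trailing_zero_count(fused[item_id])
        if zeros >= min_trailing_zeros:
            suspects[item_id] = zeros
    return suspects


# Hypothetical toy data, not taken from the pipeline run.
fused_example = {
    "a": [0.1, 0.2, 0.3],
    "b": [0.1, 0.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
suspects = sample_padding_suspects(fused_example, sample_size=3)
```

A genuine embedding can also end in a zero, so flagged items are candidates for manual review rather than confirmed padding. Raising `min_trailing_zeros` reduces false positives.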

## When to Apply

- Before running the fusion pipeline
- After re-running the embedding pipeline with new model/settings
- When adding new windows to an existing fused embedding set
- During QA of similarity cache results

## Related

- `docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md` — Canonical pipeline output sources
- `pipeline/fusion.py` — Fusion pipeline implementation
- `data/motions.db` — `fused_embeddings` and `similarity_cache` tables