---
title: "Fusion pipeline: vector dimension inconsistency causes padding"
date: 2026-03-23
module: pipeline
problem_type: best_practice
component: fusion-pipeline
severity: low
tags:
- fusion
- embeddings
- vector-dimensions
- pipeline
- data-quality
---
# Fusion Pipeline: Vector Dimension Inconsistency Causes Padding
## Context
During a fusion + similarity pipeline run (2026-03-23), three of the five windows (win-002, win-003, win-005) had inconsistent vector dimensions. The pipeline padded the affected vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.
## Pipeline Run Summary
| Metric | Value |
|--------|-------|
| Start | 2026-03-23T15:30:00Z |
| End | 2026-03-23T16:47:04Z |
| Duration | 1h 17m 4s |
| Embeddings processed | 28,172 |
| Fused embeddings | 40,524 |
| Similarity rows | 405,216 |
## Per-Window Warnings
| Window | Inserted | Warnings | Issue |
|--------|----------|----------|-------|
| win-002 | 2,048 | 1 | Padded vectors due to dim mismatch |
| win-003 | 4,096 | 2 | Padded vectors due to dim mismatch |
| win-005 | 15,344 | 3 | Padded vectors due to dim mismatch |
**Note:** win-001 and win-004 had no warnings (consistent dimensions).
## Why This Happens
Vector dimensions can become inconsistent across windows when:
1. **Embedding model changes** between window processing runs
2. **Text truncation** produces different effective lengths
3. **Pipeline restarts** after partial failures create mixed batches
4. **Different window sizes** (annual vs quarterly) aggregate different numbers of motions
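To narrow down which of these causes applies, it can help to count how many vectors of each dimension a window contains. The following is a minimal sketch, not the pipeline's own code: `dimension_report` and `mixed_windows` are hypothetical helpers, and `window_vectors` is assumed to map a window ID to a list of vectors.

```python
from collections import Counter


def dimension_report(window_vectors):
    """Return {window_id: Counter({dimension: count})} for every window."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


def mixed_windows(report):
    """Keep only the windows that contain more than one distinct dimension."""
    return {w: dict(counts) for w, counts in report.items() if len(counts) > 1}


# Hypothetical toy data; real vectors would come from the embedding store.
window_vectors = {
    "win-001": [[0.1] * 4, [0.2] * 4],
    "win-002": [[0.1] * 4, [0.2] * 3],
}

report = dimension_report(window_vectors)
mixed = mixed_windows(report)
for window_id, counts in mixed.items():
    print(window_id, counts)
```

A window whose vectors split cleanly into two dimensions would be consistent with a model change or a restart that mixed batches; the counts alone do not prove the cause.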
## Impact
- **Fused embeddings are padded, not truncated**: the original values are preserved, with zero-padding up to the common dimension
- **Similarity scores** may be slightly affected wherever a padded vector is compared
- **No data loss**, but some quality degradation is possible in the affected windows
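The effect of padding on similarity can be illustrated with a small sketch. It assumes cosine similarity and trailing zero-padding, neither of which is confirmed for this pipeline; `pad` and `cosine` are hypothetical helpers. Under those assumptions, padding two vectors of the same original dimension leaves their score unchanged, while a padded vector compared against a natively longer one yields a score that ignores the longer vector's extra coordinates in the dot product.

```python
import math


def pad(vector, dim):
    """Append zeros so the vector has exactly `dim` entries (assumed padding scheme)."""
    return list(vector) + [0.0] * (dim - len(vector))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two vectors with the same original dimension (3), padded to 5.
a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 1.0]
before = cosine(a, b)
after = cosine(pad(a, 5), pad(b, 5))
print("same-dimension pair unchanged:", math.isclose(before, after))

# A padded 3-dimensional vector against a natively 5-dimensional one.
c = [1.0, 2.0, 3.0, 4.0, 5.0]
cross_dim = cosine(pad(a, 5), c)
print("cross-dimension score:", round(cross_dim, 4))
```

The cross-dimension score is well defined numerically, but if the two vectors come from different models or settings, their coordinates may not be comparable at all, which is why re-embedding is preferable to relying on padding.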
## Prevention
1. **Validate dimensions before fusion**
```python
# Before calling fusion, assert all vectors have the same dimension.
# Assumes window_vectors maps window_id -> list of vectors.
dims = {len(v) for vectors in window_vectors.values() for v in vectors}
assert len(dims) == 1, f"Dimension mismatch: {dims}"
```
2. **Re-embed with consistent model/settings** if dimensions differ
- Don't mix embeddings from different model versions
- Re-run the full embedding pipeline if the model changes
3. **Window-level dimension checks** in the pipeline:
```python
# In pipeline/fusion.py or equivalent
for window_id, vectors in window_vectors.items():
    if not vectors:
        continue  # nothing to check in an empty window
    dim = len(vectors[0])
    if not all(len(v) == dim for v in vectors):
        raise ValueError(f"Window {window_id}: inconsistent vector dimensions")
```
4. **QA sampling after fusion**
- Perform sample similarity lookups across N=20-50 items
- Validate fused vectors against source embeddings
- Check for anomalies in similarity scores for affected windows
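The QA sampling step could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's QA tooling: `qa_sample` and `trailing_zeros` are hypothetical helpers, `fused` is assumed to map an item ID to its fused vector, and trailing zeros are treated only as a hint that a vector may have been padded.

```python
import math
import random


def trailing_zeros(vector):
    """Count consecutive zeros at the end of a vector."""
    count = 0
    for x in reversed(vector):
        if x != 0.0:
            break
        count += 1
    return count


def qa_sample(fused, expected_dim, n=20, seed=0):
    """Sample up to n fused vectors and return (item_id, finding) pairs."""
    rng = random.Random(seed)  # fixed seed so a QA run can be repeated
    ids = sorted(fused)
    findings = []
    for item_id in rng.sample(ids, min(n, len(ids))):
        vec = fused[item_id]
        if len(vec) != expected_dim:
            findings.append((item_id, "unexpected dimension"))
        elif not all(math.isfinite(x) for x in vec):
            findings.append((item_id, "non-finite value"))
        elif trailing_zeros(vec) > 0:
            findings.append((item_id, "possible zero-padding"))
    return findings


# Hypothetical toy data standing in for rows of the fused embedding set.
fused = {
    "m1": [0.3, 0.1, 0.4, 0.2],
    "m2": [0.5, 0.2, 0.0, 0.0],
    "m3": [0.1, 0.9, 0.3],
}
findings = qa_sample(fused, expected_dim=4, n=20)
for item_id, finding in sorted(findings):
    print(item_id, finding)
```

Items flagged as possibly padded are good candidates for the manual similarity lookups described above, since a genuine embedding can also end in zeros.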
## When to Apply
- Before running the fusion pipeline
- After re-running the embedding pipeline with new model/settings
- When adding new windows to an existing fused embedding set
- During QA of similarity cache results
## Related
- `docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md` — Canonical pipeline output sources
- `pipeline/fusion.py` — Fusion pipeline implementation
- `data/motions.db` (`fused_embeddings` and `similarity_cache` tables)