---
title: "Fusion pipeline: vector dimension inconsistency causes padding"
date: 2026-03-23
module: pipeline
problem_type: best_practice
component: fusion-pipeline
severity: low
tags:
  - fusion
  - embeddings
  - vector-dimensions
  - pipeline
  - data-quality
---

# Fusion Pipeline: Vector Dimension Inconsistency Causes Padding

## Context

During a fusion + similarity pipeline run (2026-03-23), three windows (win-002, win-003, and win-005) contained vectors of inconsistent dimensions. The pipeline zero-padded the vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.

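The padding behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pipeline/fusion.py`: the helper name `pad_to_common_dim`, the logger name, and the choice of the largest dimension in the window as the target are all assumptions.

```python
import logging

logger = logging.getLogger("fusion")  # hypothetical logger name


def pad_to_common_dim(window_id, vectors):
    """Zero-pad every vector in a window to the largest dimension present.

    Returns (padded_vectors, padded_count). Hypothetical helper for
    illustration only.
    """
    if not vectors:
        return [], 0
    target = max(len(v) for v in vectors)
    padded, count = [], 0
    for v in vectors:
        missing = target - len(v)
        if missing:
            count += 1
        # Original components are kept; zeros are appended at the end.
        padded.append(list(v) + [0.0] * missing)
    if count:
        logger.warning(
            "Window %s: padded %d of %d vectors to dim %d",
            window_id, count, len(vectors), target,
        )
    return padded, count
```

Calling `pad_to_common_dim("win-x", [[1.0, 2.0], [3.0]])` pads only the second vector and emits one warning for the window.
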
## Pipeline Run Summary

| Metric | Value |
|--------|-------|
| Start | 2026-03-23T15:30:00Z |
| End | 2026-03-23T16:47:04Z |
| Duration | 1h 17m 4s |
| Embeddings processed | 28,172 |
| Fused embeddings | 40,524 |
| Similarity rows | 405,216 |

## Per-Window Warnings

| Window | Inserted | Warnings | Issue |
|--------|----------|----------|-------|
| win-002 | 2,048 | 1 | Padded vectors due to dim mismatch |
| win-003 | 4,096 | 2 | Padded vectors due to dim mismatch |
| win-005 | 15,344 | 3 | Padded vectors due to dim mismatch |

**Note:** win-001 and win-004 had no warnings (consistent dimensions).

## Why This Happens

Vector dimensions can become inconsistent across windows when:
1. **Embedding model changes** between window processing runs
2. **Text truncation** produces different effective lengths
3. **Pipeline restarts** after partial failures create mixed batches
4. **Different window sizes** (annual vs quarterly) aggregate different numbers of motions

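To narrow down which of the causes above applies, it can help to look at how many vectors of each dimension a window actually holds. The sketch below assumes `window_vectors` is an in-memory mapping of window id to a list of vectors; the function names are hypothetical.

```python
from collections import Counter


def dimension_histogram(window_vectors):
    """Map each window id to a Counter of {dimension: vector count}."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


def mixed_windows(window_vectors):
    """Return only the windows that contain more than one dimension."""
    return {
        window_id: dict(hist)
        for window_id, hist in dimension_histogram(window_vectors).items()
        if len(hist) > 1
    }
```

A window that shows two large groups of distinct dimensions may point to a model change or a restart with different settings, while a handful of outliers may point to a partial batch. Either reading is a starting point for investigation, not a diagnosis.
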
## Impact

- **Fused embeddings are padded, not truncated**: every original component is preserved, with zeros appended up to the common dimension
- **Similarity scores** that involve padded vectors may be slightly affected
- **No data loss**, but similarity quality may degrade in the affected windows

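A small worked example shows why padding preserves the data yet can still shift scores. It assumes cosine similarity, which this note does not confirm as the pipeline's metric, and uses made-up toy vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_pad(v, dim):
    """Append zeros so that v has length dim."""
    return list(v) + [0.0] * (dim - len(v))


short_a, short_b = [1.0, 2.0], [2.0, 1.0]
full = [1.0, 2.0, 2.0, 0.0]  # a vector that is natively 4-dimensional

# Padding both short vectors leaves their mutual similarity unchanged.
same = cosine(zero_pad(short_a, 4), zero_pad(short_b, 4))

# Against a natively 4-dimensional vector, the extra components of `full`
# add to its norm but have nothing to match in the padded vector.
mixed = cosine(zero_pad(short_a, 4), full)
```

Here `same` equals the similarity of the unpadded pair, while `mixed` falls below 1 even though the first two components of `short_a` and `full` are identical. Scores between padded and natively longer vectors are therefore the ones worth checking first.
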
## Prevention

1. **Validate dimensions before fusion**
   ```python
   # Before calling fusion, assert all vectors have the same dimension.
   # window_vectors maps window_id -> list of vectors.
   dims = {len(v) for vectors in window_vectors.values() for v in vectors}
   assert len(dims) == 1, f"Dimension mismatch: {dims}"
   ```

2. **Re-embed with consistent model/settings** if dimensions differ
   - Don't mix embeddings from different model versions
   - Re-run the full embedding pipeline if the model changes

3. **Window-level dimension checks** in the pipeline:
   ```python
   # In pipeline/fusion.py or equivalent
   for window_id, vectors in window_vectors.items():
       if not vectors:
           continue  # nothing to check in an empty window
       dim = len(vectors[0])
       if not all(len(v) == dim for v in vectors):
           raise ValueError(f"Window {window_id}: inconsistent vector dimensions")
   ```

4. **QA sampling after fusion**
   - Perform sample similarity lookups across 20 to 50 items
   - Validate fused vectors against source embeddings
   - Check for anomalies in similarity scores for affected windows

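The QA sampling in step 4 could look something like the sketch below. It makes several assumptions: the fused vectors and cached scores are already loaded into plain dictionaries (the real data lives in `data/motions.db`, whose schema is not shown here), the cached metric is cosine similarity, and the function name and tolerance are illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_similarity_check(fused, cached_scores, n=20, tol=1e-6, seed=0):
    """Recompute similarity for up to n random cached pairs.

    fused:         dict of item id -> fused vector (assumed layout)
    cached_scores: dict of (id_a, id_b) -> cached score (assumed layout)
    Returns a list of (id_a, id_b, cached, recomputed) for mismatches.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pairs = list(cached_scores)
    picked = rng.sample(pairs, min(n, len(pairs)))
    mismatches = []
    for a, b in picked:
        recomputed = cosine(fused[a], fused[b])
        if abs(recomputed - cached_scores[(a, b)]) > tol:
            mismatches.append((a, b, cached_scores[(a, b)], recomputed))
    return mismatches
```

An empty result means the sampled cache rows agree with the fused vectors. Any mismatch in a padded window is a reason to re-embed that window instead of trusting its cached scores.
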
## When to Apply

- Before running the fusion pipeline
- After re-running the embedding pipeline with new model/settings
- When adding new windows to an existing fused embedding set
- During QA of similarity cache results

## Related

- `docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md`: Canonical pipeline output sources
- `pipeline/fusion.py`: Fusion pipeline implementation
- `data/motions.db`: `fused_embeddings` and `similarity_cache` tables