| title | date | module | problem_type | component | severity | tags |
|---|---|---|---|---|---|---|
| Fusion pipeline: vector dimension inconsistency causes padding | 2026-03-23 | pipeline | best_practice | fusion-pipeline | low | [fusion embeddings vector-dimensions pipeline data-quality] |
Fusion Pipeline: Vector Dimension Inconsistency Causes Padding
Context
During a fusion + similarity pipeline run (2026-03-23), three of the five windows (win-002, win-003, win-005) had inconsistent vector dimensions. The pipeline padded vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.
Pipeline Run Summary
| Metric | Value |
|---|---|
| Start | 2026-03-23T15:30:00Z |
| End | 2026-03-23T16:47:04Z |
| Duration | 1h 17m 4s |
| Embeddings processed | 28,172 |
| Fused embeddings | 40,524 |
| Similarity rows | 405,216 |
Per-Window Warnings
| Window | Inserted | Warnings | Issue |
|---|---|---|---|
| win-002 | 2,048 | 1 | Padded vectors due to dim mismatch |
| win-003 | 4,096 | 2 | Padded vectors due to dim mismatch |
| win-005 | 15,344 | 3 | Padded vectors due to dim mismatch |
Note: win-001 and win-004 had no warnings (consistent dimensions).
Why This Happens
Vector dimensions can become inconsistent across windows when:
- Embedding model changes between window processing runs
- Text truncation produces different effective lengths
- Pipeline restarts after partial failures create mixed batches
- Different window sizes (annual vs quarterly) aggregate different numbers of motions
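To narrow down which of these causes applies, it can help to count how many vectors of each dimension a window holds. The following is a minimal sketch, assuming `window_vectors` is a dict mapping window IDs to lists of vectors; the helper name `dimension_census` and the sample data are hypothetical and not part of the pipeline.

```python
from collections import Counter


def dimension_census(window_vectors):
    """Return {window_id: Counter({dimension: count})} for every window."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


# Illustrative data only, not taken from the run described above.
example = {
    "win-a": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "win-b": [[0.1, 0.2, 0.3], [0.4, 0.5]],
}

census = dimension_census(example)
# A window is "mixed" when it contains more than one distinct dimension.
mixed = [window_id for window_id, counts in census.items() if len(counts) > 1]
print(mixed)
```

The per-dimension counts show how many vectors of each size a mixed window contains, which is a useful starting point when checking the causes listed above.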
Impact
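- Fused embeddings are padded, not truncated: the original components are preserved and the extra positions are filled with zeros
- Similarity scores may shift slightly when a padded vector is compared with a natively longer one, because zero-filled positions contribute nothing to dot-product-based measures such as cosine similarity
- No data is lost, but similarity quality may degrade in the affected windows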
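A small sketch of why padding keeps the data intact yet can still move similarity scores. It assumes zeros are appended at the end of the shorter vector and that similarity is cosine similarity; the source confirms neither detail, and `pad_to` and `cosine` are hypothetical helpers.

```python
import math


def pad_to(vector, dim):
    """Append zeros until the vector has `dim` components (assumed padding scheme)."""
    return list(vector) + [0.0] * (dim - len(vector))


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_a = [1.0, 0.0]
short_b = [1.0, 1.0]
long_c = [1.0, 1.0, 1.0, 1.0]

# Padding both short vectors leaves their mutual similarity unchanged.
before = cosine(short_a, short_b)
after = cosine(pad_to(short_a, 4), pad_to(short_b, 4))

# short_b matches the first two components of long_c exactly, yet the padded
# comparison scores below 1 because the zero-filled positions add nothing.
prefix_only = cosine(short_b, long_c[:2])
padded_vs_long = cosine(pad_to(short_b, 4), long_c)
```

In this toy case `before` and `after` are equal, while `padded_vs_long` is lower than `prefix_only`. Under these assumptions, the degradation comes from comparing padded vectors with natively longer ones rather than from the padding itself.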
Prevention
- Validate dimensions before fusion:

      # Before calling fusion, assert all vectors have the same dimension
      dims = {len(v) for v in window_vectors}
      assert len(dims) == 1, f"Dimension mismatch: {dims}"

- Re-embed with consistent model/settings if dimensions differ
  - Don't mix embeddings from different model versions
  - Re-run the full embedding pipeline if the model changes
- Window-level dimension checks in the pipeline:

      # In pipeline/fusion.py or equivalent
      for window_id, vectors in window_vectors.items():
          dim = len(vectors[0])
          if not all(len(v) == dim for v in vectors):
              raise ValueError(f"Window {window_id}: inconsistent vector dimensions")

- QA sampling after fusion
  - Perform sample similarity lookups across N=20-50 items
  - Validate fused vectors against source embeddings
  - Check for anomalies in similarity scores for affected windows
When to Apply
- Before running the fusion pipeline
- After re-running the embedding pipeline with new model/settings
- When adding new windows to an existing fused embedding set
- During QA of similarity cache results
Related
- docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md: Canonical pipeline output sources
- pipeline/fusion.py: Fusion pipeline implementation
- data/motions.db: fused_embeddings and similarity_cache tables