
title	date	module	problem_type	component	severity	tags
Fusion pipeline: vector dimension inconsistency causes padding	2026-03-23	pipeline	best_practice	fusion-pipeline	low	[fusion embeddings vector-dimensions pipeline data-quality]

Fusion Pipeline: Vector Dimension Inconsistency Causes Padding

Context

During the fusion and similarity pipeline run on 2026-03-23, three windows (win-002, win-003, and win-005) contained vectors of inconsistent dimensions. The pipeline padded those vectors to a common dimension so that fusion and similarity processing could proceed, and logged one or more warnings for each affected window.
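The padding behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in pipeline/fusion.py: it assumes right-side zero-padding to the longest vector in the window, and the helper name `pad_to_common_dim`, the logger name, and the warning text are all hypothetical.

```python
import logging

logger = logging.getLogger("fusion")  # hypothetical logger name


def pad_to_common_dim(window_id, vectors):
    """Right-pad every vector with zeros to the longest length in the window.

    Hypothetical helper: the name, signature, and log message are assumptions.
    """
    if not vectors:
        return []
    target = max(len(v) for v in vectors)
    short = sum(1 for v in vectors if len(v) < target)
    if short:
        # One warning per padded window, mirroring the per-window warnings table.
        logger.warning(
            "Window %s: padded %d of %d vectors to dim %d",
            window_id, short, len(vectors), target,
        )
    return [list(v) + [0.0] * (target - len(v)) for v in vectors]


padded = pad_to_common_dim("win-002", [[1.0, 2.0, 3.0], [4.0, 5.0]])
print(padded)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]
```

Padding on the right keeps the original components at their original indices, so a padded vector still lines up with unpadded ones on the dimensions they share.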

Pipeline Run Summary

Metric	Value
Start	2026-03-23T15:30:00Z
End	2026-03-23T16:47:04Z
Duration	1h 17m 4s
Embeddings processed	28,172
Fused embeddings	40,524
Similarity rows	405,216

Per-Window Warnings

Window	Inserted	Warnings	Issue
win-002	2,048	1	Padded vectors due to dim mismatch
win-003	4,096	2	Padded vectors due to dim mismatch
win-005	15,344	3	Padded vectors due to dim mismatch

Note: win-001 and win-004 had no warnings (consistent dimensions).

Why This Happens

Vector dimensions can become inconsistent across windows when:

Embedding model changes between window processing runs
Text truncation produces different effective lengths
Pipeline restarts after partial failures create mixed batches
Different window sizes (annual vs quarterly) aggregate different numbers of motions
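To narrow down which of these causes applies, it can help to count how many vectors of each dimension a window contains. The sketch below assumes the vectors are available as a dict mapping window ID to a list of vectors; `dimension_report` and the toy data are hypothetical.

```python
from collections import Counter


def dimension_report(window_vectors):
    """Map each window ID to a Counter of {dimension: vector count}."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


# Toy data; real vectors would come from the embedding store.
window_vectors = {
    "win-001": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "win-002": [[0.1, 0.2, 0.3], [0.4, 0.5]],
}

for window_id, counts in dimension_report(window_vectors).items():
    status = "ok" if len(counts) <= 1 else "MIXED"
    print(window_id, dict(counts), status)
```

A window split cleanly between two dimensions may point to a model change or a restart that mixed batches, while many distinct dimensions would suggest that the vector length depends on the input itself.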

Impact

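Fused embeddings are padded, not truncated: the original values are preserved and the missing dimensions are filled with zeros
Similarity scores may shift slightly for padded vectors, because the zero-filled dimensions contribute nothing to a dot product
No data is lost, but similarity quality may be degraded in the affected windows (win-002, win-003, win-005)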
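A small worked example shows why zero-padding can move a similarity score. It assumes cosine similarity, which this note does not confirm as the pipeline's metric, and the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short = [1.0, 2.0]        # embedding that arrived with 2 dimensions
full = [1.0, 2.0, 2.0]    # embedding with the full 3 dimensions
padded = short + [0.0]    # zero-padded to match

# On the two shared dimensions the vectors agree exactly.
shared = cosine(short, full[:2])
# After padding, the third component of `full` still enlarges its norm,
# but it is multiplied by zero in the dot product.
mixed = cosine(padded, full)

print(round(shared, 4))  # 1.0
print(round(mixed, 4))   # 0.7454
```

The padded vector loses nothing, yet it scores lower against a full-length vector than the shared dimensions alone would suggest.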

Prevention

Validate dimensions before fusion

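# Before calling fusion, check that all vectors in a window share one dimension.
# `vectors` is the list of embedding vectors for a single window.
dims = {len(v) for v in vectors}
if len(dims) != 1:  # explicit raise: assert statements are skipped under python -O
    raise ValueError(f"Dimension mismatch: {sorted(dims)}")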

Re-embed with consistent model/settings if dimensions differ
- Don't mix embeddings from different model versions
- Re-run the full embedding pipeline if the model changes

Window-level dimension checks in the pipeline:

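# In pipeline/fusion.py or equivalent
# `window_vectors` maps each window ID to its list of embedding vectors.
for window_id, vectors in window_vectors.items():
    dims = {len(v) for v in vectors}  # a set also handles an empty window safely
    if len(dims) > 1:
        raise ValueError(f"Window {window_id}: inconsistent vector dimensions {sorted(dims)}")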

QA sampling after fusion
- Perform sample similarity lookups across N=20-50 items
- Validate fused vectors against source embeddings
- Check for anomalies in similarity scores for affected windows
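One way to make the QA sampling concrete is to draw a random sample of fused vectors and flag those that end in a run of exact zeros, which is a heuristic sign of padding. This is a sketch under assumptions: `fused` is a hypothetical dict of item ID to fused vector, both function names are invented for illustration, and the check would miss padding applied anywhere other than the end.

```python
import random


def trailing_zero_count(vector):
    """Number of consecutive exact zeros at the end of a vector."""
    count = 0
    for value in reversed(vector):
        if value != 0.0:
            break
        count += 1
    return count


def sample_suspected_padding(fused, sample_size=20, seed=0):
    """Return {item_id: trailing zeros} for sampled vectors that end in zeros.

    `fused` maps a hypothetical item ID to its fused vector.
    """
    rng = random.Random(seed)  # fixed seed keeps the QA sample reproducible
    ids = sorted(fused)
    picked = rng.sample(ids, min(sample_size, len(ids)))
    return {
        item_id: trailing_zero_count(fused[item_id])
        for item_id in picked
        if trailing_zero_count(fused[item_id]) > 0
    }


fused = {
    "m1": [0.3, 0.1, 0.7, 0.2],
    "m2": [0.5, 0.4, 0.0, 0.0],  # looks padded from dim 2 to dim 4
    "m3": [0.9, 0.0, 0.6, 0.1],  # interior zero only, not flagged
}
print(sample_suspected_padding(fused, sample_size=3))  # {'m2': 2}
```

Flagged items are candidates for the manual similarity lookups listed above, not proof of a problem.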

When to Apply

Before running the fusion pipeline
After re-running the embedding pipeline with new model/settings
When adding new windows to an existing fused embedding set
During QA of similarity cache results

Related

docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md — Canonical pipeline output sources
pipeline/fusion.py — Fusion pipeline implementation
data/motions.db — fused_embeddings and similarity_cache tables
