title: Fusion pipeline: vector dimension inconsistency causes padding
date: 2026-03-23
module: pipeline
problem_type: best_practice
component: fusion-pipeline
severity: low
tags: [fusion, embeddings, vector-dimensions, pipeline, data-quality]

Fusion Pipeline: Vector Dimension Inconsistency Causes Padding

Context

During a fusion + similarity pipeline run (2026-03-23), three windows (win-002, win-003, win-005) contained vectors with inconsistent dimensions. The pipeline zero-padded the vectors to a common dimension so that fusion and similarity processing could proceed, and logged warnings for each affected window.
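
The padding behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in pipeline/fusion.py: the helper name `pad_to_common_dim` and the logger name are hypothetical, and it assumes vectors are plain lists of floats that are padded with trailing zeros up to the largest dimension found in the window.

```python
import logging

logger = logging.getLogger("fusion")  # hypothetical logger name


def pad_to_common_dim(window_id, vectors):
    """Zero-pad every vector in a window to the largest dimension present.

    Hypothetical helper: assumes each vector is a list of floats and that
    padding means appending trailing zeros.
    """
    if not vectors:
        return []
    target = max(len(v) for v in vectors)
    short = sum(1 for v in vectors if len(v) < target)
    if short:
        # One warning per affected window, mirroring the run described above.
        logger.warning(
            "Window %s: padded %d of %d vectors to dim %d",
            window_id, short, len(vectors), target,
        )
    return [list(v) + [0.0] * (target - len(v)) for v in vectors]


padded = pad_to_common_dim("win-demo", [[1.0, 2.0, 3.0], [4.0, 5.0]])
print(padded)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 0.0]]
```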

Pipeline Run Summary

Metric                Value
Start                 2026-03-23T15:30:00Z
End                   2026-03-23T16:47:04Z
Duration              1h 17m 4s
Embeddings processed  28,172
Fused embeddings      40,524
Similarity rows       405,216

Per-Window Warnings

Window   Inserted  Warnings  Issue
win-002  2,048     1         Padded vectors due to dim mismatch
win-003  4,096     2         Padded vectors due to dim mismatch
win-005  15,344    3         Padded vectors due to dim mismatch

Note: win-001 and win-004 had no warnings (consistent dimensions).

Why This Happens

Vector dimensions can become inconsistent across windows when:

  1. Embedding model changes between window processing runs
  2. Text truncation produces different effective lengths
  3. Pipeline restarts after partial failures create mixed batches
  4. Different window sizes (annual vs quarterly) aggregate different numbers of motions
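
To narrow down which of these causes applies, it can help to count how many vectors of each dimension a window holds. The sketch below assumes `window_vectors` is a dict mapping a window ID to a list of vectors; the function name `dimension_profile` and the toy data are hypothetical.

```python
from collections import Counter


def dimension_profile(window_vectors):
    """Return {window_id: Counter({dimension: count})}."""
    return {
        window_id: Counter(len(v) for v in vectors)
        for window_id, vectors in window_vectors.items()
    }


# Toy data for illustration, not taken from the pipeline run.
sample = {
    "win-a": [[0.1, 0.2], [0.3, 0.4]],
    "win-b": [[0.1, 0.2], [0.3, 0.4, 0.5]],
}
profile = dimension_profile(sample)

# Windows with more than one distinct dimension need attention.
mixed = [w for w, counts in profile.items() if len(counts) > 1]
print(mixed)  # ['win-b']
```

A window with two cleanly separated dimension groups may point to a model change or a restart, but the counts alone do not prove either.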

Impact

  • Fused embeddings are zero-padded, not truncated, so the original vector components are preserved
  • Similarity scores involving padded vectors may be slightly affected
  • No data is lost, but similarity quality may degrade in the affected windows
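
A small worked example shows where the effect on scores comes from. It assumes cosine similarity, which this document does not confirm as the pipeline's metric, and the helper names are hypothetical. Padding two vectors of the same original length leaves their cosine similarity unchanged; comparing a padded vector with one that natively has the larger dimension does not, because the extra components add to one norm but contribute nothing to the dot product.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_pad(v, dim):
    """Append trailing zeros until the vector has `dim` components."""
    return list(v) + [0.0] * (dim - len(v))


short_a = [1.0, 2.0]
short_b = [2.0, 1.0]
long_c = [1.0, 2.0, 2.0]  # natively 3-dimensional

# Both vectors padded the same way: similarity is unchanged.
same = cosine(short_a, short_b)
padded_same = cosine(zero_pad(short_a, 3), zero_pad(short_b, 3))

# Padded vector against a native 3-dim vector: the third component of
# long_c enlarges its norm but adds nothing to the dot product.
mixed = cosine(zero_pad(short_a, 3), long_c)

print(math.isclose(same, padded_same))  # True
print(mixed < 1.0)                      # True
```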

Prevention

  1. Validate dimensions before fusion

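    # Before calling fusion, check that all vectors in a window share one dimension.
    # Here window_vectors is the list of vectors for a single window.
    dims = {len(v) for v in window_vectors}
    if len(dims) != 1:  # explicit raise: assert is skipped under python -O
        raise ValueError(f"Dimension mismatch: {sorted(dims)}")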
    
  2. Re-embed with consistent model/settings if dimensions differ

    • Don't mix embeddings from different model versions
    • Re-run the full embedding pipeline if the model changes
  3. Window-level dimension checks in the pipeline:

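    # In pipeline/fusion.py or equivalent
    # Here window_vectors maps each window ID to its list of vectors.
    for window_id, vectors in window_vectors.items():
        if not vectors:
            continue  # empty window: nothing to compare
        dim = len(vectors[0])
        if not all(len(v) == dim for v in vectors):
            raise ValueError(f"Window {window_id}: inconsistent vector dimensions")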
    
  4. QA sampling after fusion

    • Perform sample similarity lookups across N=20-50 items
    • Validate fused vectors against source embeddings
    • Check for anomalies in similarity scores for affected windows
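
The QA sampling step could be sketched along these lines. This is a hedged illustration, not the project's QA procedure: the function `qa_sample`, its parameters, and the specific checks (dimension, finite values, all-zero vectors) are assumptions, and `fused` is assumed to map an item ID to its fused vector as a list of floats.

```python
import math
import random


def qa_sample(fused, expected_dim, n=20, seed=0):
    """Sample up to n fused vectors and report basic anomalies.

    Hypothetical helper: returns {item_id: [issue, ...]} for sampled
    items that fail at least one check.
    """
    rng = random.Random(seed)  # fixed seed so a QA run is repeatable
    ids = sorted(fused)
    chosen = rng.sample(ids, min(n, len(ids)))
    problems = {}
    for item_id in chosen:
        vec = fused[item_id]
        issues = []
        if len(vec) != expected_dim:
            issues.append(f"dim {len(vec)} != {expected_dim}")
        if not all(math.isfinite(x) for x in vec):
            issues.append("non-finite value")
        if vec and all(x == 0.0 for x in vec):
            issues.append("all-zero vector")
        if issues:
            problems[item_id] = issues
    return problems


# Toy data for illustration only.
toy = {"m1": [0.1, 0.2, 0.3], "m2": [0.0, 0.0, 0.0], "m3": [0.5, 0.5]}
report = qa_sample(toy, expected_dim=3, n=20)
```

Here `n` defaults to the lower end of the 20-50 range suggested above. Comparing fused vectors against their source embeddings would need a project-specific check that this sketch leaves out.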

When to Apply

  • Before running the fusion pipeline
  • After re-running the embedding pipeline with new model/settings
  • When adding new windows to an existing fused embedding set
  • During QA of similarity cache results

Related

  • docs/solutions/best-practices/blog-numbers-from-pipeline-outputs-2026-04-16.md: Canonical pipeline output sources
  • pipeline/fusion.py: Fusion pipeline implementation
  • data/motions.db: fused_embeddings and similarity_cache tables