title: Embeddings Similarity Pipeline
category: patterns

Embeddings Similarity Pipeline

Rules

  • Batch embedding calls where possible; fall back to per-item attempts when a batch fails persistently.
  • Store raw embeddings, SVD vectors, and fused_embeddings separately; a fused embedding is typically the concatenation [svd + text], SVD vector first.
  • Compute similarity as cosine similarity on L2-normalized, zero-padded vectors; record each entity's top-k neighbors in similarity_cache.
  • Open DuckDB connections as read_only in compute workers so that several workers can run in parallel.

Examples

pipeline/ai_provider_wrapper.py - Batched embed + fallback

# Send texts to the provider in batches of batch_size.
for start in range(0, len(texts), batch_size):
    chunk = texts[start : start + batch_size]
    resp = _post_with_retries("/embeddings", json={"model": model, "input": chunk})
...
# On persistent batch failure, retry each text in the failed range on its own.
for j in range(i, end):
    t = texts[j]
    single, single_exc = _attempt_batch([t], j)
    if single:
        results[j] = single[0]
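
The batch-then-fallback flow above can be sketched end to end as follows. This is a minimal illustration, not the wrapper's actual code: `embed_with_fallback` and the injected `embed_batch` callable are hypothetical names standing in for the provider call, and failed items are simply left as `None`.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_fallback(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],  # hypothetical provider call
    batch_size: int = 64,
) -> List[Optional[Vector]]:
    """Embed texts in batches; retry item by item when a batch fails."""
    results: List[Optional[Vector]] = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        chunk = texts[start:end]
        try:
            vectors = embed_batch(chunk)
            if len(vectors) != len(chunk):
                raise ValueError("provider returned the wrong number of vectors")
            results[start:end] = vectors
        except Exception:
            # The batch failed: try each text alone so that one bad input
            # does not discard the whole batch.
            for j in range(start, end):
                try:
                    results[j] = embed_batch([texts[j]])[0]
                except Exception:
                    results[j] = None  # leave the slot empty for the caller
    return results
```

Keeping `results` index-aligned with `texts` means a caller can tell which inputs still lack an embedding without any extra bookkeeping.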

pipeline/fusion.py - Concatenation and storage

# Skip entities whose stored SVD vector is not valid JSON.
try:
    svd_vec = json.loads(svd_json)
except Exception:
    _logger.exception("Invalid SVD vector JSON for entity %s", entity_id)
    skipped_missing_svd += 1
    continue
...
# The fused vector is the SVD vector followed by the text vector.
fused = list(svd_vec) + list(text_vec)
res = db.store_fused_embedding(
    int(entity_id),
    window_id,
    fused,
    svd_dims=len(svd_vec),
    text_dims=len(text_vec),
)
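
Because `svd_dims` and `text_dims` are stored with the fused vector, the two halves can presumably be recovered later. A minimal sketch, assuming the [svd + text] order shown above; `split_fused` is a hypothetical helper, not part of fusion.py:

```python
from typing import List, Sequence, Tuple


def split_fused(
    fused: Sequence[float], svd_dims: int, text_dims: int
) -> Tuple[List[float], List[float]]:
    """Split a fused vector back into its SVD part and its text part."""
    if len(fused) != svd_dims + text_dims:
        raise ValueError(
            f"expected {svd_dims + text_dims} values, got {len(fused)}"
        )
    # The SVD part comes first, matching the concatenation order.
    return list(fused[:svd_dims]), list(fused[svd_dims:])
```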

similarity/compute.py - Normalized cosine similarity

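# Normalize rows to unit length; zero rows get norm 1.0 to avoid division by zero
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
norms[norms == 0] = 1.0
normalized = matrix / norms
# Rows are unit length, so their dot products are cosine similarities
sim = normalized @ normalized.T
...
# pick top-k neighbors and write to similarity_cache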
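
The top-k step is elided in the excerpt above. One way it could be done is sketched below, assuming `sim` is the square similarity matrix and that an entity should not count as its own neighbor; `top_k_neighbors` is a hypothetical name, and writing to similarity_cache is left out.

```python
from typing import List, Tuple

import numpy as np


def top_k_neighbors(sim: np.ndarray, k: int) -> List[List[Tuple[int, float]]]:
    """Return, for each row, the k most similar other rows as (index, score)."""
    n = sim.shape[0]
    scores = sim.astype(float, copy=True)
    np.fill_diagonal(scores, -np.inf)  # exclude self-matches
    k = min(k, n - 1)
    neighbors: List[List[Tuple[int, float]]] = []
    for i in range(n):
        if k <= 0:
            neighbors.append([])
            continue
        # argpartition finds the k largest scores without a full sort ...
        idx = np.argpartition(-scores[i], k - 1)[:k]
        # ... and only those k candidates are then sorted, best first.
        idx = idx[np.argsort(-scores[i][idx])]
        neighbors.append([(int(j), float(scores[i, j])) for j in idx])
    return neighbors
```

For small matrices a full `np.argsort` per row would be just as good; the partition only pays off when the number of rows is much larger than k.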

Anti-Patterns

Bad: Assuming consistent vector length

Problem: Stacking vectors into a matrix without checking that they all have the same length leads to shape errors.

Remediation: Detect inconsistent lengths, pad with zeros, and log a warning (as seen in compute.py).
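
The remediation can be sketched as follows. This is an illustrative helper, not the code in compute.py: `pad_to_common_length` is a hypothetical name, and it assumes that shorter vectors should be right-padded with zeros up to the longest length seen.

```python
import logging
from typing import List, Sequence

_logger = logging.getLogger(__name__)


def pad_to_common_length(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Right-pad vectors with zeros so they all share the longest length."""
    if not vectors:
        return []
    lengths = {len(v) for v in vectors}
    target = max(lengths)
    if len(lengths) > 1:
        _logger.warning(
            "Inconsistent vector lengths %s; zero-padding to %d",
            sorted(lengths),
            target,
        )
    return [list(v) + [0.0] * (target - len(v)) for v in vectors]
```

Zero-padding leaves each vector's norm unchanged, so the cosine similarity is computed as if the missing trailing dimensions were zero.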

Bad: Inline heavy computation in UI

Problem: Recomputing heavy pipelines inline in UI requests makes every request wait on the full computation.

Remediation: Schedule heavy work in scripts or subprocesses, and have the UI read precomputed results.
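
A minimal sketch of that split, under loud assumptions: the precomputed results are modeled here as a JSON file rather than the similarity_cache table, and `read_precomputed` and `schedule_recompute` are hypothetical names. The UI side only reads; the heavy work runs in a separate Python process.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Any, List, Optional


def read_precomputed(cache_path: Path) -> Optional[Any]:
    """UI side: return cached results, or None if nothing has been computed yet."""
    if not cache_path.exists():
        return None
    with cache_path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def schedule_recompute(script_args: List[str]) -> subprocess.Popen:
    """Start the heavy job in its own interpreter and return without waiting."""
    return subprocess.Popen([sys.executable, *script_args])
```

A request handler would call `read_precomputed` and, when it returns `None`, show a "not ready" state instead of running the pipeline itself.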