date: 2026-03-22
topic: Embedding-Based Motion Similarity Cache
status: validated

Problem Statement

We have text embeddings and fused (SVD + text) embeddings stored for motions, but no usable similarity search. The current database.search_similar() is a full Python scan — it SELECTs all embeddings, parses JSON one by one, and computes cosine similarity with zip in pure Python. This is O(N) per query with no vectorized math, no indexing, and no caching. The similarity cache migration (2026-03-22-add-similarity-cache.sql) is a commented-out placeholder with no executable SQL.

Additionally, several infrastructure gaps block a working similarity system:

  • The embeddings table is not created by _init_database() (only exists via migration file)
  • The fusion pipeline has an N+1 query pattern (per SVD row queries embeddings separately)
  • ai_provider._post_with_retries does not retry on 429 (rate limit) responses

Constraints

  • DuckDB only — no pgvector, no external vector store
  • Vectors stored as JSON text columns (existing format, not changing)
  • DuckDB connections are short-lived (open/close per method)
  • Do not modify app.py or scheduler.py
  • Tests must be offline (monkeypatch network calls)
  • Functional style, Python, uv
  • Logging via getLogger, no print()

Approach

Precomputed similarity cache — batch-compute top-K nearest neighbors per motion and store results in a cache table. The UI reads the cache with a simple indexed lookup.

Rationale: the motion corpus changes slowly (new motions trickle in from parliament), so computing nearest neighbors at query time is wasteful. One offline O(N^2) pass via numpy matrix multiplication gives O(1) lookups until the next recompute.

Alternatives rejected:

  • DuckDB vss extension (HNSW): experimental, requires vector format migration away from JSON text, overkill for ~thousands of motions
  • Real-time numpy search: better than pure-Python zip, but still O(N) per query; caching eliminates repeated work
  • FAISS/Annoy ANN index: designed for millions of vectors, unnecessary complexity at our scale

Architecture

New files:
  similarity/
    __init__.py
    compute.py      -- batch pairwise cosine, extract top-K, write cache
    lookup.py       -- read cached results for a motion

Modified files:
  database.py       -- add similarity_cache + embeddings to _init_database,
                       add store/read/clear helpers, deprecate old search_similar
  migrations/2026-03-22-add-similarity-cache.sql  -- uncomment and finalize
  ai_provider.py    -- add 429 to retry branch
  pipeline/fusion.py -- fix N+1 with bulk JOIN

Components

similarity_cache table

similarity_cache (
    id              INTEGER DEFAULT nextval('similarity_cache_id_seq'),
    source_motion_id INTEGER NOT NULL,
    target_motion_id INTEGER NOT NULL,
    score           REAL NOT NULL,
    vector_type     TEXT NOT NULL,      -- 'text', 'fused', 'svd'
    window_id       TEXT,               -- NULL for text-only, set for fused/SVD
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)

Composite index on (source_motion_id, vector_type, window_id) for fast lookups.
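
A plausible shape for the finalized migration, mirroring the schema above; the sequence and index names are assumptions, not the real file's contents:

```sql
-- 2026-03-22-add-similarity-cache.sql (sketch)
CREATE SEQUENCE IF NOT EXISTS similarity_cache_id_seq;

CREATE TABLE IF NOT EXISTS similarity_cache (
    id               INTEGER DEFAULT nextval('similarity_cache_id_seq'),
    source_motion_id INTEGER NOT NULL,
    target_motion_id INTEGER NOT NULL,
    score            REAL NOT NULL,
    vector_type      TEXT NOT NULL,      -- 'text', 'fused', 'svd'
    window_id        TEXT,               -- NULL for text-only
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_similarity_cache_lookup
    ON similarity_cache (source_motion_id, vector_type, window_id);
```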

similarity/compute.py

  • Load all vectors of a given type into a numpy matrix in one query (parse JSON, stack into ndarray)
  • Normalize rows to unit length
  • Compute full cosine similarity matrix via normalized @ normalized.T
  • Extract top-K per row (excluding self-similarity)
  • Bulk-insert results into similarity_cache
  • Idempotent: clear_similarity_cache(vector_type, window_id) then insert within the same connection scope

Public function: compute_similarities(vector_type='fused', window_id=None, top_k=10, db_path=None)
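
The numpy core of compute.py can be sketched as follows (a minimal sketch: JSON parsing, DB I/O, and the cache write are omitted, and `top_k_neighbors` is an illustrative internal name, not part of the public API):

```python
import numpy as np

def top_k_neighbors(vectors: np.ndarray, top_k: int = 10) -> list[list[tuple[int, float]]]:
    """For each row, return the top_k most cosine-similar other rows
    as (row_index, score) pairs, best first."""
    # Normalize rows to unit length so a dot product is cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    normalized = vectors / np.clip(norms, 1e-12, None)
    sim = normalized @ normalized.T          # full O(N^2) similarity matrix
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    results = []
    for row in sim:
        k = min(top_k, len(row) - 1)
        if k <= 0:
            results.append([])
            continue
        idx = np.argpartition(row, -k)[-k:]      # top-k, unordered (O(N))
        idx = idx[np.argsort(row[idx])[::-1]]    # order those k by score desc
        results.append([(int(i), float(row[i])) for i in idx])
    return results
```

`argpartition` keeps the per-row extraction at O(N) rather than a full O(N log N) sort, which matters once the corpus grows.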

similarity/lookup.py

  • get_similar_motions(motion_id, vector_type='fused', window_id=None, top_k=10, db_path=None) — SELECT from cache ordered by score DESC
  • Returns list of dicts: {motion_id, score}
  • Optionally join motion metadata (title, layman_explanation) for richer results
  • Graceful degradation: empty cache returns empty list

database.py changes

  1. Add embeddings table creation to _init_database() — matches migration schema
  2. Add similarity_cache table + sequence creation to _init_database()
  3. New helpers:
    • store_similarity_batch(rows: list[dict]) — bulk INSERT
    • get_cached_similarities(source_motion_id, vector_type, window_id=None, top_k=10) — read
    • clear_similarity_cache(vector_type, window_id=None) — DELETE for idempotent recompute
  4. Deprecate search_similar() — mark with a log warning pointing to similarity.lookup

ai_provider.py fix

  • Add HTTP 429 to the retry branch in _post_with_retries
  • If Retry-After header is present, use it as the backoff delay; otherwise fall back to existing exponential backoff
  • This is a single-line condition change plus header parsing
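
A sketch of the retry loop after the fix. The function and parameter names are illustrative, not the real ai_provider signatures, and `post` is injected so the example stays offline (tests monkeypatch it); which non-429 statuses are retryable is likewise assumed:

```python
import time

def post_with_retries(post, url, payload, max_attempts=4, base_delay=1.0):
    """POST with retries; honors Retry-After on 429, else exponential backoff."""
    for attempt in range(max_attempts):
        resp = post(url, json=payload)
        # 429 now joins the retryable statuses (the illustrative 5xx set is assumed).
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        if attempt == max_attempts - 1:
            return resp  # retries exhausted; return the last response
        retry_after = resp.headers.get("Retry-After")
        if resp.status_code == 429 and retry_after is not None:
            delay = float(retry_after)           # server-specified delay
        else:
            delay = base_delay * (2 ** attempt)  # existing exponential backoff
        time.sleep(delay)
    return resp
```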

pipeline/fusion.py fix

  • Replace the per-row SELECT from embeddings with a single bulk query: JOIN svd_vectors with latest embeddings per motion_id in one SQL statement
  • Loop over joined results and concatenate in Python
  • Eliminates N+1 query pattern
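
A sketch of the shape of the fix; the table and column names in the query are assumptions (and selecting the *latest* embedding per motion would need an extra subquery or window function in the real statement):

```python
import json
import numpy as np

# One bulk statement replaces the per-SVD-row SELECTs (names illustrative).
BULK_QUERY = """
    SELECT s.motion_id, s.vector AS svd_json, e.vector AS text_json
    FROM svd_vectors s
    JOIN embeddings e ON e.motion_id = s.motion_id
    WHERE s.window_id = ?
"""

def fuse_rows(rows):
    """Concatenate SVD and text vectors per motion.

    `rows` is the bulk JOIN result: (motion_id, svd_json, text_json) tuples,
    with vectors stored as JSON text per the existing format."""
    fused = {}
    for motion_id, svd_json, text_json in rows:
        svd = np.asarray(json.loads(svd_json), dtype=float)
        text = np.asarray(json.loads(text_json), dtype=float)
        fused[motion_id] = np.concatenate([svd, text])
    return fused
```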

Data Flow

  1. Existing pipeline runs: extract MP votes → SVD → text embeddings → fusion
  2. After fusion completes, similarity/compute.py loads all fused vectors for the window into a numpy matrix
  3. Computes pairwise cosine similarity matrix, extracts top-K per motion
  4. Bulk-inserts results into similarity_cache (clearing previous cache for that batch first)
  5. Separately, text-only similarity can be computed across all motions (no window dependency)
  6. UI calls similarity/lookup.py for a direct indexed read — instant response

Error Handling

  • Missing vectors: motions without embeddings are excluded from the similarity matrix; not an error
  • Empty matrix: if no vectors exist for a vector_type/window, log warning and skip (don't write empty cache)
  • DB write failures: wrap cache writes in try/except, log error, don't crash the pipeline; similarity is non-critical
  • Stale cache: cache is fully replaced on each recompute (delete + insert in same connection scope); if recompute fails partway, old cache remains valid
  • Dimension mismatch: vectors with inconsistent dimensions are padded or excluded with a warning (following existing clustering.py pattern)

Testing Strategy

  • Unit: compute.py — create known vectors with predictable cosine similarities (e.g., identical vectors → score 1.0, orthogonal → 0.0), verify matrix math produces correct top-K ordering
  • Unit: lookup.py — seed cache table in temp DB, verify queries return correct ordered results, verify empty cache returns empty list
  • Unit: database helpers — test store_similarity_batch / get_cached_similarities / clear_similarity_cache round-trip
  • Unit: ai_provider 429 retry — monkeypatch requests.post to return 429, verify retry with backoff
  • Unit: fusion bulk join — verify N+1 elimination produces same results as original
  • Migration test — apply updated similarity_cache migration on temp DuckDB, verify schema matches expected columns
  • Integration test — insert fake embeddings → run compute → verify cache populated → lookup returns expected results
  • All tests offline: in-memory DuckDB, monkeypatched network calls
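
The first unit bullet might look like this as a pytest-style test; `cosine_matrix` is an illustrative helper standing in for the compute internals:

```python
import numpy as np

def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Normalize-then-matmul, the same math the compute step relies on."""
    normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normalized @ normalized.T

def test_known_cosine_values():
    vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    sim = cosine_matrix(vecs)
    assert np.isclose(sim[0, 1], 1.0)   # identical vectors -> 1.0
    assert abs(sim[0, 2]) < 1e-12       # orthogonal vectors -> 0.0
```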

Open Questions

None blocking. Future enhancements (not in scope):

  • MP-to-MP similarity from SVD vectors (explorer UI is motion-focused for now)
  • Real-time similarity for newly ingested motions before next batch recompute