date: 2026-03-22
topic: Embedding-Based Motion Similarity Cache
status: validated

Problem Statement

We have text embeddings and fused (SVD + text) embeddings stored for motions, but no usable similarity search. The current database.search_similar() is a full Python scan — it SELECTs all embeddings, parses JSON one by one, and computes cosine similarity with zip in pure Python. This is O(N) per query with no vectorized math, no indexing, and no caching. The similarity cache migration (2026-03-22-add-similarity-cache.sql) is a commented-out placeholder with no executable SQL.

Additionally, several infrastructure gaps block a working similarity system:

  • The embeddings table is not created by _init_database() (only exists via migration file)
  • The fusion pipeline has an N+1 query pattern (per SVD row queries embeddings separately)
  • ai_provider._post_with_retries does not retry on 429 (rate limit) responses

Constraints

  • DuckDB only — no pgvector, no external vector store
  • Vectors stored as JSON text columns (existing format, not changing)
  • DuckDB connections are short-lived (open/close per method)
  • Do not modify app.py or scheduler.py
  • Tests must be offline (monkeypatch network calls)
  • Functional style, Python, uv
  • Logging via getLogger, no print()

Approach

Precomputed similarity cache — batch-compute top-K nearest neighbors per motion and store results in a cache table. The UI reads the cache with a simple indexed lookup.

Rationale: the motion corpus changes slowly (new motions trickle in from parliament), so computing nearest neighbors at query time is wasteful. One offline O(N^2) pass via numpy matrix multiplication gives O(1) lookups until the next recompute.

Alternatives rejected:

  • DuckDB vss extension (HNSW): experimental, requires vector format migration away from JSON text, overkill for ~thousands of motions
  • Real-time numpy search: better than pure-Python zip, but still O(N) per query; caching eliminates repeated work
  • FAISS/Annoy ANN index: designed for millions of vectors, unnecessary complexity at our scale

Architecture

New files:
  similarity/
    __init__.py
    compute.py      -- batch pairwise cosine, extract top-K, write cache
    lookup.py       -- read cached results for a motion

Modified files:
  database.py       -- add similarity_cache + embeddings to _init_database,
                       add store/read/clear helpers, deprecate old search_similar
  migrations/2026-03-22-add-similarity-cache.sql  -- uncomment and finalize
  ai_provider.py    -- add 429 to retry branch
  pipeline/fusion.py -- fix N+1 with bulk JOIN

Components

similarity_cache table

similarity_cache (
    id              INTEGER DEFAULT nextval('similarity_cache_id_seq'),
    source_motion_id INTEGER NOT NULL,
    target_motion_id INTEGER NOT NULL,
    score           REAL NOT NULL,
    vector_type     TEXT NOT NULL,      -- 'text', 'fused', 'svd'
    window_id       TEXT,               -- NULL for text-only, set for fused/SVD
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)

Composite index on (source_motion_id, vector_type, window_id) for fast lookups.
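
A plausible shape for the finalized migration, mirroring the schema above; the sequence and index names are assumptions, not the real file's contents:

```sql
-- 2026-03-22-add-similarity-cache.sql (sketch)
CREATE SEQUENCE IF NOT EXISTS similarity_cache_id_seq;

CREATE TABLE IF NOT EXISTS similarity_cache (
    id               INTEGER DEFAULT nextval('similarity_cache_id_seq'),
    source_motion_id INTEGER NOT NULL,
    target_motion_id INTEGER NOT NULL,
    score            REAL NOT NULL,
    vector_type      TEXT NOT NULL,      -- 'text', 'fused', 'svd'
    window_id        TEXT,               -- NULL for text-only
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_similarity_cache_lookup
    ON similarity_cache (source_motion_id, vector_type, window_id);
```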

similarity/compute.py

  • Load all vectors of a given type into a numpy matrix in one query (parse JSON, stack into ndarray)
  • Normalize rows to unit length
  • Compute full cosine similarity matrix via normalized @ normalized.T
  • Extract top-K per row (excluding self-similarity)
  • Bulk-insert results into similarity_cache
  • Idempotent: clear_similarity_cache(vector_type, window_id) then insert within the same connection scope

Public function: compute_similarities(vector_type='fused', window_id=None, top_k=10, db_path=None)
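
The numpy core of compute.py can be sketched as follows (a minimal sketch: JSON parsing, DB I/O, and the cache write are omitted, and `top_k_neighbors` is an illustrative internal name, not part of the public API):

```python
import numpy as np

def top_k_neighbors(vectors: np.ndarray, top_k: int = 10) -> list[list[tuple[int, float]]]:
    """For each row, return the top_k most cosine-similar other rows
    as (row_index, score) pairs, best first."""
    # Normalize rows to unit length so a dot product is cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    normalized = vectors / np.clip(norms, 1e-12, None)
    sim = normalized @ normalized.T          # full O(N^2) similarity matrix
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    results = []
    for row in sim:
        k = min(top_k, len(row) - 1)
        if k <= 0:
            results.append([])
            continue
        idx = np.argpartition(row, -k)[-k:]      # top-k, unordered (O(N))
        idx = idx[np.argsort(row[idx])[::-1]]    # order those k by score desc
        results.append([(int(i), float(row[i])) for i in idx])
    return results
```

`argpartition` keeps the per-row extraction at O(N) rather than a full O(N log N) sort, which matters once the corpus grows.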

similarity/lookup.py

  • get_similar_motions(motion_id, vector_type='fused', window_id=None, top_k=10, db_path=None) — SELECT from cache ordered by score DESC
  • Returns list of dicts: {motion_id, score}
  • Optionally join motion metadata (title, layman_explanation) for richer results
  • Graceful degradation: empty cache returns empty list

database.py changes

  1. Add embeddings table creation to _init_database() — matches migration schema
  2. Add similarity_cache table + sequence creation to _init_database()
  3. New helpers:
    • store_similarity_batch(rows: list[dict]) — bulk INSERT
    • get_cached_similarities(source_motion_id, vector_type, window_id=None, top_k=10) — read
    • clear_similarity_cache(vector_type, window_id=None) — DELETE for idempotent recompute
  4. Deprecate search_similar() — mark with a log warning pointing to similarity.lookup

ai_provider.py fix

  • Add HTTP 429 to the retry branch in _post_with_retries
  • If Retry-After header is present, use it as the backoff delay; otherwise fall back to existing exponential backoff
  • This is a single-line condition change plus header parsing
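
A sketch of the retry loop after the fix. The function and parameter names are illustrative, not the real ai_provider signatures, and `post` is injected so the example stays offline (tests monkeypatch it); which non-429 statuses are retryable is likewise assumed:

```python
import time

def post_with_retries(post, url, payload, max_attempts=4, base_delay=1.0):
    """POST with retries; honors Retry-After on 429, else exponential backoff."""
    for attempt in range(max_attempts):
        resp = post(url, json=payload)
        # 429 now joins the retryable statuses (the illustrative 5xx set is assumed).
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        if attempt == max_attempts - 1:
            return resp  # retries exhausted; return the last response
        retry_after = resp.headers.get("Retry-After")
        if resp.status_code == 429 and retry_after is not None:
            delay = float(retry_after)           # server-specified delay
        else:
            delay = base_delay * (2 ** attempt)  # existing exponential backoff
        time.sleep(delay)
    return resp
```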

pipeline/fusion.py fix

  • Replace the per-row SELECT from embeddings with a single bulk query: JOIN svd_vectors with latest embeddings per motion_id in one SQL statement
  • Loop over joined results and concatenate in Python
  • Eliminates N+1 query pattern
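
A sketch of the shape of the fix; the table and column names in the query are assumptions (and selecting the *latest* embedding per motion would need an extra subquery or window function in the real statement):

```python
import json
import numpy as np

# One bulk statement replaces the per-SVD-row SELECTs (names illustrative).
BULK_QUERY = """
    SELECT s.motion_id, s.vector AS svd_json, e.vector AS text_json
    FROM svd_vectors s
    JOIN embeddings e ON e.motion_id = s.motion_id
    WHERE s.window_id = ?
"""

def fuse_rows(rows):
    """Concatenate SVD and text vectors per motion.

    `rows` is the bulk JOIN result: (motion_id, svd_json, text_json) tuples,
    with vectors stored as JSON text per the existing format."""
    fused = {}
    for motion_id, svd_json, text_json in rows:
        svd = np.asarray(json.loads(svd_json), dtype=float)
        text = np.asarray(json.loads(text_json), dtype=float)
        fused[motion_id] = np.concatenate([svd, text])
    return fused
```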

Data Flow

  1. Existing pipeline runs: extract MP votes → SVD → text embeddings → fusion
  2. After fusion completes, similarity/compute.py loads all fused vectors for the window into a numpy matrix
  3. Computes pairwise cosine similarity matrix, extracts top-K per motion
  4. Bulk-inserts results into similarity_cache (clearing previous cache for that batch first)
  5. Separately, text-only similarity can be computed across all motions (no window dependency)
  6. UI calls similarity/lookup.py for a direct indexed read — instant response

Error Handling

  • Missing vectors: motions without embeddings are excluded from the similarity matrix; not an error
  • Empty matrix: if no vectors exist for a vector_type/window, log warning and skip (don't write empty cache)
  • DB write failures: wrap cache writes in try/except, log error, don't crash the pipeline; similarity is non-critical
  • Stale cache: cache is fully replaced on each recompute (delete + insert in same connection scope); if recompute fails partway, old cache remains valid
  • Dimension mismatch: vectors with inconsistent dimensions are padded or excluded with a warning (following existing clustering.py pattern)

Testing Strategy

  • Unit: compute.py — create known vectors with predictable cosine similarities (e.g., identical vectors → score 1.0, orthogonal → 0.0), verify matrix math produces correct top-K ordering
  • Unit: lookup.py — seed cache table in temp DB, verify queries return correct ordered results, verify empty cache returns empty list
  • Unit: database helpers — test store_similarity_batch / get_cached_similarities / clear_similarity_cache round-trip
  • Unit: ai_provider 429 retry — monkeypatch requests.post to return 429, verify retry with backoff
  • Unit: fusion bulk join — verify N+1 elimination produces same results as original
  • Migration test — apply updated similarity_cache migration on temp DuckDB, verify schema matches expected columns
  • Integration test — insert fake embeddings → run compute → verify cache populated → lookup returns expected results
  • All tests offline: in-memory DuckDB, monkeypatched network calls
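
The first unit bullet might look like this as a pytest-style test; `cosine_matrix` is an illustrative helper standing in for the compute internals:

```python
import numpy as np

def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Normalize-then-matmul, the same math the compute step relies on."""
    normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normalized @ normalized.T

def test_known_cosine_values():
    vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    sim = cosine_matrix(vecs)
    assert np.isclose(sim[0, 1], 1.0)   # identical vectors -> 1.0
    assert abs(sim[0, 2]) < 1e-12       # orthogonal vectors -> 0.0
```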

Open Questions

None blocking. Future enhancements (not in scope):

  • MP-to-MP similarity from SVD vectors (explorer UI is motion-focused for now)
  • Real-time similarity for newly ingested motions before next batch recompute