| date | topic | status |
|---|---|---|
| 2026-03-22 | Embedding-Based Motion Similarity Cache | validated |
Problem Statement
We have text embeddings and fused (SVD + text) embeddings stored for motions, but no usable similarity search. The current database.search_similar() is a full Python scan — it SELECTs all embeddings, parses JSON one by one, and computes cosine similarity with zip in pure Python. This is O(N) per query with no vectorized math, no indexing, and no caching. The similarity cache migration (2026-03-22-add-similarity-cache.sql) is a commented-out placeholder with no executable SQL.
Additionally, several infrastructure gaps block a working similarity system:
- The `embeddings` table is not created by `_init_database()` (it only exists via the migration file)
- The fusion pipeline has an N+1 query pattern (each SVD row queries embeddings separately)
- `ai_provider._post_with_retries` does not retry on HTTP 429 (rate limit) responses
Constraints
- DuckDB only — no pgvector, no external vector store
- Vectors stored as JSON text columns (existing format, not changing)
- DuckDB connections are short-lived (open/close per method)
- Do not modify `app.py` or `scheduler.py`
- Tests must be offline (monkeypatch network calls)
- Functional style, Python, uv
- Logging via `getLogger`, no `print()`
Approach
Precomputed similarity cache — batch-compute top-K nearest neighbors per motion and store results in a cache table. The UI reads the cache with a simple indexed lookup.
Rationale: the motion corpus changes slowly (new motions trickle in from parliament). Computing nearest neighbors at query time is wasteful. One offline O(N^2) pass via numpy matrix multiplication gives us O(1) lookups forever until the next recompute.
Alternatives rejected:
- DuckDB vss extension (HNSW): experimental, requires vector format migration away from JSON text, overkill for ~thousands of motions
- Real-time numpy search: better than pure-Python zip, but still O(N) per query; caching eliminates repeated work
- FAISS/Annoy ANN index: designed for millions of vectors, unnecessary complexity at our scale
Architecture
New files:
similarity/
__init__.py
compute.py -- batch pairwise cosine, extract top-K, write cache
lookup.py -- read cached results for a motion
Modified files:
database.py -- add similarity_cache + embeddings to _init_database,
add store/read/clear helpers, deprecate old search_similar
migrations/2026-03-22-add-similarity-cache.sql -- uncomment and finalize
ai_provider.py -- add 429 to retry branch
pipeline/fusion.py -- fix N+1 with bulk JOIN
Components
similarity_cache table
similarity_cache (
id INTEGER DEFAULT nextval('similarity_cache_id_seq'),
source_motion_id INTEGER NOT NULL,
target_motion_id INTEGER NOT NULL,
score REAL NOT NULL,
vector_type TEXT NOT NULL, -- 'text', 'fused', 'svd'
window_id TEXT, -- NULL for text-only, set for fused/SVD
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
Composite index on (source_motion_id, vector_type, window_id) for fast lookups.
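A hedged sketch of the DDL this implies, kept as Python string constants so `_init_database()` and the migration can share them. The sequence name matches the `DEFAULT nextval(...)` shown above; the index name is an assumption.

```python
# Candidate DDL for the similarity cache. DuckDB has no AUTOINCREMENT,
# so the id column draws from an explicit sequence, as in the schema above.
# The index name idx_simcache_lookup is illustrative, not from the design.
SIMILARITY_CACHE_DDL = [
    "CREATE SEQUENCE IF NOT EXISTS similarity_cache_id_seq",
    """
    CREATE TABLE IF NOT EXISTS similarity_cache (
        id INTEGER DEFAULT nextval('similarity_cache_id_seq'),
        source_motion_id INTEGER NOT NULL,
        target_motion_id INTEGER NOT NULL,
        score REAL NOT NULL,
        vector_type TEXT NOT NULL,   -- 'text', 'fused', 'svd'
        window_id TEXT,              -- NULL for text-only similarity
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """,
    "CREATE INDEX IF NOT EXISTS idx_simcache_lookup "
    "ON similarity_cache (source_motion_id, vector_type, window_id)",
]

def init_similarity_cache(conn) -> None:
    """Apply the cache DDL on an open DuckDB connection."""
    for stmt in SIMILARITY_CACHE_DDL:
        conn.execute(stmt)
```

Keeping the statements `IF NOT EXISTS` makes both the migration and `_init_database()` safe to run repeatedly.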
similarity/compute.py
- Load all vectors of a given type into a numpy matrix in one query (parse JSON, stack into ndarray)
- Normalize rows to unit length
- Compute the full cosine similarity matrix via `normalized @ normalized.T`
- Extract top-K per row (excluding self-similarity)
- Bulk-insert results into `similarity_cache`
- Idempotent: `clear_similarity_cache(vector_type, window_id)` then insert within the same connection scope
Public function: `compute_similarities(vector_type='fused', window_id=None, top_k=10, db_path=None)`
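The matrix step above can be sketched as follows. The helper name `top_k_cosine` and its signature are illustrative; the real `compute_similarities` wraps this between the DB load and the bulk insert.

```python
import numpy as np

def top_k_cosine(vectors: np.ndarray, ids: list[int], top_k: int = 10):
    """Pairwise cosine top-K. vectors: (N, D) array; ids: motion id per row.
    Returns (source_id, target_id, score) tuples ready for bulk insert."""
    if vectors.shape[0] < 2:
        return []  # nothing to compare against
    # Normalize rows to unit length; guard against zero vectors.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = vectors / norms
    sim = unit @ unit.T                 # full N x N cosine matrix
    np.fill_diagonal(sim, -np.inf)      # exclude self-similarity
    k = min(top_k, sim.shape[0] - 1)
    # argpartition finds the k largest columns per row without a full sort.
    part = np.argpartition(sim, -k, axis=1)[:, -k:]
    rows = []
    for i, cols in enumerate(part):
        order = cols[np.argsort(sim[i, cols])[::-1]]   # sort just the top-k
        rows.extend((ids[i], ids[j], float(sim[i, j])) for j in order)
    return rows
```

`argpartition` keeps the per-row cost near O(N) instead of O(N log N), which matters little at thousands of motions but is free to do correctly.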
similarity/lookup.py
- `get_similar_motions(motion_id, vector_type='fused', window_id=None, top_k=10, db_path=None)` — SELECT from cache ordered by score DESC
- Returns a list of dicts: `{motion_id, score}`
- Optionally join motion metadata (title, layman_explanation) for richer results
- Graceful degradation: an empty cache returns an empty list
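A sketch of the lookup, written against a generic DB-API connection for brevity; the real function would open a short-lived DuckDB connection from `db_path` per the constraints. The NULL-matching clause handles text-only rows where `window_id` is NULL.

```python
def get_similar_motions(conn, motion_id, vector_type="fused",
                        window_id=None, top_k=10):
    """Read cached neighbours; an empty cache or unknown motion yields []."""
    # window_id is NULL for text-only similarity, so match NULL explicitly:
    # `window_id = NULL` is never true in SQL.
    sql = (
        "SELECT target_motion_id, score FROM similarity_cache "
        "WHERE source_motion_id = ? AND vector_type = ? "
        "AND (window_id = ? OR (window_id IS NULL AND ? IS NULL)) "
        "ORDER BY score DESC LIMIT ?"
    )
    cur = conn.execute(sql, (motion_id, vector_type, window_id, window_id, top_k))
    return [{"motion_id": t, "score": s} for t, s in cur.fetchall()]
```

Because the WHERE clause matches the composite index prefix `(source_motion_id, vector_type, window_id)`, this is the indexed O(1)-ish read the approach section promises.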
database.py changes
- Add `embeddings` table creation to `_init_database()` — matches the migration schema
- Add `similarity_cache` table + sequence creation to `_init_database()`
- New helpers:
  - `store_similarity_batch(rows: list[dict])` — bulk INSERT
  - `get_cached_similarities(source_motion_id, vector_type, window_id=None, top_k=10)` — read
  - `clear_similarity_cache(vector_type, window_id=None)` — DELETE for idempotent recompute
- Deprecate `search_similar()` — mark with a log warning pointing to `similarity.lookup`
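The write-side helpers might look like this, again shown against a generic DB-API connection (the real helpers open a short-lived DuckDB connection). `executemany` gives the bulk insert in one round trip.

```python
def clear_similarity_cache(conn, vector_type, window_id=None):
    """Delete the previous batch for this vector_type/window so recompute is idempotent."""
    conn.execute(
        "DELETE FROM similarity_cache WHERE vector_type = ? "
        "AND (window_id = ? OR (window_id IS NULL AND ? IS NULL))",
        (vector_type, window_id, window_id),
    )

def store_similarity_batch(conn, rows):
    """Bulk INSERT precomputed rows; each row is a dict per the helper signature above."""
    conn.executemany(
        "INSERT INTO similarity_cache "
        "(source_motion_id, target_motion_id, score, vector_type, window_id) "
        "VALUES (?, ?, ?, ?, ?)",
        [(r["source_motion_id"], r["target_motion_id"], r["score"],
          r["vector_type"], r.get("window_id")) for r in rows],
    )
```

Running clear-then-store inside one connection scope gives the delete-plus-insert replacement the error-handling section relies on.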
ai_provider.py fix
- Add HTTP 429 to the retry branch in `_post_with_retries`
- If a `Retry-After` header is present, use it as the backoff delay; otherwise fall back to the existing exponential backoff
- This is a single-line condition change plus header parsing
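A sketch of the resulting retry loop under assumed internals (the actual `_post_with_retries` signature and retryable-status set live in `ai_provider.py`). `post` and `sleep` are injectable so tests stay offline.

```python
import time

# Status codes assumed retryable; the real set comes from ai_provider.py.
RETRYABLE = {429, 500, 502, 503, 504}

def post_with_retries(post, url, payload, max_attempts=4,
                      base_delay=1.0, sleep=time.sleep):
    """post(url, json=...) must return an object with .status_code and .headers.
    Honors Retry-After on 429; otherwise exponential backoff."""
    for attempt in range(max_attempts):
        resp = post(url, json=payload)
        if resp.status_code not in RETRYABLE or attempt == max_attempts - 1:
            return resp
        retry_after = resp.headers.get("Retry-After")
        # Prefer the server's hint (seconds); fall back to exponential backoff.
        delay = float(retry_after) if retry_after else base_delay * (2 ** attempt)
        sleep(delay)
    return resp
```

Note `Retry-After` may also be an HTTP-date; this sketch only handles the seconds form, which is what most rate limiters send.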
pipeline/fusion.py fix
- Replace the per-row SELECT from `embeddings` with a single bulk query: JOIN `svd_vectors` with the latest `embeddings` row per motion_id in one SQL statement
- Loop over the joined results and concatenate in Python
- Eliminates the N+1 query pattern
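The bulk query could look like the sketch below. Table and column names (`svd_vectors.vector`, `embeddings.vector`, "latest" meaning the row with the highest `id`) are assumptions about the existing schema, not confirmed by this document.

```python
import json

# Hypothetical schema: the real column names live in the migrations.
BULK_FUSION_SQL = """
SELECT s.motion_id, s.vector AS svd_vec, e.vector AS text_vec
FROM svd_vectors AS s
JOIN embeddings AS e
  ON e.motion_id = s.motion_id
 AND e.id = (SELECT MAX(id) FROM embeddings e2 WHERE e2.motion_id = s.motion_id)
WHERE s.window_id = ?
"""

def load_fusion_inputs(conn, window_id):
    """One query instead of one SELECT per SVD row; concatenate in Python."""
    out = {}
    for motion_id, svd_vec, text_vec in conn.execute(BULK_FUSION_SQL, (window_id,)).fetchall():
        # Vectors are stored as JSON text (existing format); concat SVD + text.
        out[motion_id] = json.loads(svd_vec) + json.loads(text_vec)
    return out
```

The correlated `MAX(id)` subquery is the portable way to pick the latest embedding per motion; a window function would also work in DuckDB.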
Data Flow
- Existing pipeline runs: extract MP votes → SVD → text embeddings → fusion
- After fusion completes, `similarity/compute.py` loads all fused vectors for the window into a numpy matrix
- Computes the pairwise cosine similarity matrix, extracts top-K per motion
- Bulk-inserts results into `similarity_cache` (clearing the previous cache for that batch first)
- Separately, text-only similarity can be computed across all motions (no window dependency)
- UI calls `similarity/lookup.py` for a direct indexed read — instant response
Error Handling
- Missing vectors: motions without embeddings are excluded from the similarity matrix; not an error
- Empty matrix: if no vectors exist for a vector_type/window, log warning and skip (don't write empty cache)
- DB write failures: wrap cache writes in try/except, log error, don't crash the pipeline; similarity is non-critical
- Stale cache: cache is fully replaced on each recompute (delete + insert in same connection scope); if recompute fails partway, old cache remains valid
- Dimension mismatch: vectors with inconsistent dimensions are padded or excluded with a warning (following existing clustering.py pattern)
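For the dimension-mismatch case, the exclusion variant might be sketched as below (the padding variant would follow whatever `clustering.py` already does; `filter_consistent` is an illustrative name).

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)

def filter_consistent(vectors_by_id: dict) -> dict:
    """Keep only vectors of the dominant dimension; warn about the rest."""
    if not vectors_by_id:
        return {}
    dims = Counter(len(v) for v in vectors_by_id.values())
    target_dim = dims.most_common(1)[0][0]   # the most frequent dimension wins
    kept = {mid: v for mid, v in vectors_by_id.items() if len(v) == target_dim}
    dropped = len(vectors_by_id) - len(kept)
    if dropped:
        logger.warning("excluded %d vectors with dim != %d", dropped, target_dim)
    return kept
```

This keeps the similarity matrix rectangular without crashing the batch, consistent with "similarity is non-critical" above.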
Testing Strategy
- Unit: compute.py — create known vectors with predictable cosine similarities (e.g., identical vectors → score 1.0, orthogonal → 0.0), verify matrix math produces correct top-K ordering
- Unit: lookup.py — seed cache table in temp DB, verify queries return correct ordered results, verify empty cache returns empty list
- Unit: database helpers — test store_similarity_batch / get_cached_similarities / clear_similarity_cache round-trip
- Unit: ai_provider 429 retry — monkeypatch requests.post to return 429, verify retry with backoff
- Unit: fusion bulk join — verify N+1 elimination produces same results as original
- Migration test — apply updated similarity_cache migration on temp DuckDB, verify schema matches expected columns
- Integration test — insert fake embeddings → run compute → verify cache populated → lookup returns expected results
- All tests offline: in-memory DuckDB, monkeypatched network calls
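The first unit-test bullet amounts to fixtures like these, checking the cosine identities the implementation must reproduce (exact values hold because the fixture vectors are already unit length):

```python
import numpy as np

def test_known_cosine_values():
    # Identical unit vectors -> exactly 1.0; orthogonal -> exactly 0.0.
    a = np.array([1.0, 0.0])
    b = np.array([1.0, 0.0])
    c = np.array([0.0, 1.0])
    cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    assert cos(a, b) == 1.0
    assert cos(a, c) == 0.0

test_known_cosine_values()
```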
Open Questions
None blocking. Future enhancements (not in scope):
- MP-to-MP similarity from SVD vectors (explorer UI is motion-focused for now)
- Real-time similarity for newly ingested motions before next batch recompute