---
date: 2026-03-22
topic: "Embedding-Based Motion Similarity Cache"
status: validated
---

## Problem Statement

We have text embeddings and fused (SVD + text) embeddings stored for motions, but no usable similarity search. The current `database.search_similar()` is a full Python scan — it SELECTs all embeddings, parses JSON one by one, and computes cosine similarity with `zip` in pure Python. This is O(N) per query with no vectorized math, no indexing, and no caching. The similarity cache migration (`2026-03-22-add-similarity-cache.sql`) is a commented-out placeholder with no executable SQL.
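
For illustration, the scan is roughly the shape sketched below; the query and identifiers are paraphrased from the description above, not the verbatim implementation.

```
import json
import math

def search_similar(query_vector, conn):
    # Pulls every stored vector and does the math in pure Python: O(N) per query.
    results = []
    for motion_id, raw in conn.execute("SELECT motion_id, vector FROM embeddings").fetchall():
        vec = json.loads(raw)
        dot = sum(a * b for a, b in zip(query_vector, vec))
        norm = math.sqrt(sum(a * a for a in query_vector)) * math.sqrt(sum(b * b for b in vec))
        results.append((motion_id, dot / norm if norm else 0.0))
    return sorted(results, key=lambda r: r[1], reverse=True)
```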

Additionally, several infrastructure gaps block a working similarity system:

- The `embeddings` table is not created by `_init_database()` (it exists only via the migration file)
- The fusion pipeline has an N+1 query pattern (each SVD row triggers a separate query against `embeddings`)
- `ai_provider._post_with_retries` does not retry on 429 (rate limit) responses

## Constraints

- DuckDB only — no pgvector, no external vector store
- Vectors stored as JSON text columns (existing format, not changing)
- DuckDB connections are short-lived (open/close per method)
- Do not modify `app.py` or `scheduler.py`
- Tests must be offline (monkeypatch network calls)
- Functional style, Python, uv
- Logging via `getLogger`, no `print()`

## Approach

**Precomputed similarity cache** — batch-compute top-K nearest neighbors per motion and store results in a cache table. The UI reads the cache with a simple indexed lookup.

Rationale: the motion corpus changes slowly (new motions trickle in from parliament). Computing nearest neighbors at query time is wasteful. One offline O(N^2) pass via numpy matrix multiplication gives us O(1) lookups until the next recompute.

Alternatives rejected:

- **DuckDB vss extension (HNSW)**: experimental, requires a vector-format migration away from JSON text, and is overkill for ~thousands of motions
- **Real-time numpy search**: better than the pure-Python `zip` scan, but still O(N) per query; caching eliminates the repeated work
- **FAISS/Annoy ANN index**: designed for millions of vectors, unnecessary complexity at our scale

## Architecture

```
New files:
  similarity/
    __init__.py
    compute.py   -- batch pairwise cosine, extract top-K, write cache
    lookup.py    -- read cached results for a motion

Modified files:
  database.py   -- add similarity_cache + embeddings to _init_database,
                   add store/read/clear helpers, deprecate old search_similar
  migrations/2026-03-22-add-similarity-cache.sql -- uncomment and finalize
  ai_provider.py     -- add 429 to retry branch
  pipeline/fusion.py -- fix N+1 with bulk JOIN
```

## Components

### similarity_cache table

```
similarity_cache (
    id INTEGER DEFAULT nextval('similarity_cache_id_seq'),
    source_motion_id INTEGER NOT NULL,
    target_motion_id INTEGER NOT NULL,
    score REAL NOT NULL,
    vector_type TEXT NOT NULL,  -- 'text', 'fused', 'svd'
    window_id TEXT,             -- NULL for text-only, set for fused/SVD
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
```

Composite index on `(source_motion_id, vector_type, window_id)` for fast lookups.
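
A sketch of how that index might be created via the duckdb Python API; the index name is an assumption, and the real DDL belongs in the finalized migration.

```
import duckdb

def ensure_similarity_index(db_path):
    # Hypothetical helper; idx_similarity_lookup is an illustrative name.
    conn = duckdb.connect(db_path)
    try:
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_similarity_lookup "
            "ON similarity_cache (source_motion_id, vector_type, window_id)"
        )
    finally:
        conn.close()
```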

### similarity/compute.py

- Load all vectors of a given type into a numpy matrix in one query (parse JSON, stack into ndarray)
- Normalize rows to unit length
- Compute full cosine similarity matrix via `normalized @ normalized.T`
- Extract top-K per row (excluding self-similarity)
- Bulk-insert results into `similarity_cache`
- Idempotent: `clear_similarity_cache(vector_type, window_id)` then insert within the same connection scope

Public function: `compute_similarities(vector_type='fused', window_id=None, top_k=10, db_path=None)`
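
A minimal sketch of that flow, assuming JSON-encoded vectors in an `embeddings` table and the database helpers described below; query, column names, and the default db path are illustrative, not verified code.

```
import json
import logging

import duckdb
import numpy as np

from database import clear_similarity_cache, store_similarity_batch

logger = logging.getLogger(__name__)

def compute_similarities(vector_type="fused", window_id=None, top_k=10, db_path="app.db"):
    conn = duckdb.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT motion_id, vector FROM embeddings WHERE vector_type = ?",
            [vector_type],
        ).fetchall()
    finally:
        conn.close()
    if len(rows) < 2:
        logger.warning("not enough %s vectors to compare; skipping", vector_type)
        return

    ids = [r[0] for r in rows]
    # Degenerate (zero-norm) vectors are assumed filtered upstream.
    matrix = np.array([json.loads(r[1]) for r in rows], dtype=np.float32)
    # Unit-normalize rows so the matrix product below is cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = matrix @ matrix.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarity

    k = min(top_k, len(ids) - 1)
    # argpartition picks the top-k per row without sorting the whole row.
    top = np.argpartition(sims, -k, axis=1)[:, -k:]
    batch = []
    for i, neighbors in enumerate(top):
        for j in neighbors[np.argsort(sims[i, neighbors])[::-1]]:
            batch.append({
                "source_motion_id": ids[i],
                "target_motion_id": ids[int(j)],
                "score": float(sims[i, int(j)]),
                "vector_type": vector_type,
                "window_id": window_id,
            })
    # Idempotent rewrite: clear the previous batch, then bulk-insert.
    clear_similarity_cache(vector_type, window_id)
    store_similarity_batch(batch)
```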

### similarity/lookup.py

- `get_similar_motions(motion_id, vector_type='fused', window_id=None, top_k=10, db_path=None)` — SELECT from cache ordered by score DESC
- Returns list of dicts: `{motion_id, score}`
- Optionally join motion metadata (title, layman_explanation) for richer results
- Graceful degradation: empty cache returns empty list
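
A sketch of the read path against the schema above; the default db path is illustrative (the design resolves it from `db_path=None` elsewhere).

```
import duckdb

def get_similar_motions(motion_id, vector_type="fused", window_id=None,
                        top_k=10, db_path="app.db"):
    conn = duckdb.connect(db_path)
    try:
        rows = conn.execute(
            """
            SELECT target_motion_id, score
            FROM similarity_cache
            WHERE source_motion_id = ?
              AND vector_type = ?
              AND window_id IS NOT DISTINCT FROM ?  -- NULL-safe for text-only rows
            ORDER BY score DESC
            LIMIT ?
            """,
            [motion_id, vector_type, window_id, top_k],
        ).fetchall()
    finally:
        conn.close()
    # An empty cache degrades gracefully to an empty list.
    return [{"motion_id": r[0], "score": r[1]} for r in rows]
```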

### database.py changes

1. Add `embeddings` table creation to `_init_database()` — matches migration schema
2. Add `similarity_cache` table + sequence creation to `_init_database()`
3. New helpers (sketched below):
   - `store_similarity_batch(rows: list[dict])` — bulk INSERT
   - `get_cached_similarities(source_motion_id, vector_type, window_id=None, top_k=10)` — read
   - `clear_similarity_cache(vector_type, window_id=None)` — DELETE for idempotent recompute
4. Deprecate `search_similar()` — mark with a log warning pointing to `similarity.lookup`
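
Minimal sketches of the write and clear helpers, following the short-lived connection convention from Constraints; the default path constant is illustrative.

```
import duckdb

DB_PATH = "app.db"  # illustrative; the real default lives in database.py

def store_similarity_batch(rows):
    # Bulk INSERT on a short-lived connection.
    conn = duckdb.connect(DB_PATH)
    try:
        conn.executemany(
            "INSERT INTO similarity_cache "
            "(source_motion_id, target_motion_id, score, vector_type, window_id) "
            "VALUES (?, ?, ?, ?, ?)",
            [(r["source_motion_id"], r["target_motion_id"], r["score"],
              r["vector_type"], r["window_id"]) for r in rows],
        )
    finally:
        conn.close()

def clear_similarity_cache(vector_type, window_id=None):
    # DELETE scoped to one batch so recomputes stay idempotent.
    conn = duckdb.connect(DB_PATH)
    try:
        conn.execute(
            "DELETE FROM similarity_cache "
            "WHERE vector_type = ? AND window_id IS NOT DISTINCT FROM ?",
            [vector_type, window_id],
        )
    finally:
        conn.close()
```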

### ai_provider.py fix

- Add HTTP 429 to the retry branch in `_post_with_retries`
- If a `Retry-After` header is present, use it as the backoff delay; otherwise fall back to the existing exponential backoff
- This is a single-line condition change plus header parsing; a sketch follows
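
An illustrative shape of the changed branch; the real `_post_with_retries` signature, backoff constants, and the assumption that the existing branch already retries 5xx are all unverified.

```
import logging
import time

import requests

logger = logging.getLogger(__name__)

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}  # 429 newly added

def _post_with_retries(url, payload, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
            return resp
        # Prefer the server-provided Retry-After delay when present (429s often set it).
        retry_after = resp.headers.get("Retry-After")
        try:
            delay = float(retry_after) if retry_after is not None else base_delay * 2 ** attempt
        except ValueError:
            delay = base_delay * 2 ** attempt  # non-numeric header; fall back
        logger.warning("HTTP %s from %s; retrying in %.1fs", resp.status_code, url, delay)
        time.sleep(delay)
```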

### pipeline/fusion.py fix

- Replace the per-row SELECT from `embeddings` with a single bulk query:
  JOIN `svd_vectors` with the latest `embeddings` row per motion_id in one SQL statement
- Loop over joined results and concatenate in Python
- Eliminates the N+1 query pattern (see the sketch below)
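
One way to express the bulk query, assuming `svd_vectors` and `embeddings` carry `motion_id`, `vector`, and `created_at` columns; the table and column names are assumptions based on the file names above.

```
import duckdb

def load_fusion_inputs(window_id, db_path="app.db"):
    conn = duckdb.connect(db_path)
    try:
        # One round trip: each SVD row joined to the latest embedding per motion.
        return conn.execute(
            """
            SELECT s.motion_id, s.vector AS svd_vector, e.vector AS text_vector
            FROM svd_vectors s
            JOIN (
                SELECT motion_id, vector,
                       ROW_NUMBER() OVER (
                           PARTITION BY motion_id ORDER BY created_at DESC
                       ) AS rn
                FROM embeddings
            ) e ON e.motion_id = s.motion_id AND e.rn = 1
            WHERE s.window_id = ?
            """,
            [window_id],
        ).fetchall()
    finally:
        conn.close()
```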

## Data Flow

1. Existing pipeline runs: extract MP votes → SVD → text embeddings → fusion
2. After fusion completes, `similarity/compute.py` loads all fused vectors for the window into a numpy matrix
3. Computes the pairwise cosine similarity matrix, extracts top-K per motion
4. Bulk-inserts results into `similarity_cache` (clearing the previous cache for that batch first)
5. Separately, text-only similarity can be computed across all motions (no window dependency)
6. UI calls `similarity/lookup.py` for a direct indexed read — instant response

## Error Handling

- **Missing vectors**: motions without embeddings are excluded from the similarity matrix; not an error
- **Empty matrix**: if no vectors exist for a vector_type/window, log a warning and skip (don't write an empty cache)
- **DB write failures**: wrap cache writes in try/except, log the error, don't crash the pipeline; similarity is non-critical
- **Stale cache**: the cache is fully replaced on each recompute (delete + insert in the same connection scope); if a recompute fails partway, the old cache remains valid
- **Dimension mismatch**: vectors with inconsistent dimensions are padded or excluded with a warning (following the existing clustering.py pattern)

## Testing Strategy

- **Unit: compute.py** — create known vectors with predictable cosine similarities (e.g., identical vectors → score 1.0, orthogonal → 0.0), verify the matrix math produces correct top-K ordering
- **Unit: lookup.py** — seed the cache table in a temp DB, verify queries return correctly ordered results, verify an empty cache returns an empty list
- **Unit: database helpers** — test the store_similarity_batch / get_cached_similarities / clear_similarity_cache round-trip
- **Unit: ai_provider 429 retry** — monkeypatch requests.post to return 429, verify retry with backoff
- **Unit: fusion bulk join** — verify the N+1 elimination produces the same results as the original per-row queries
- **Migration test** — apply the updated similarity_cache migration on a temp DuckDB, verify the schema matches the expected columns
- **Integration test** — insert fake embeddings → run compute → verify the cache is populated → lookup returns expected results
- **All tests offline**: in-memory DuckDB, monkeypatched network calls; a sketch of the 429 test shape follows
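
For example, the 429 retry test might look like this; `FakeResponse` and the patch targets are assumptions about module layout.

```
import ai_provider

class FakeResponse:
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}

def test_retries_on_429(monkeypatch):
    calls = []

    def fake_post(url, **kwargs):
        calls.append(url)
        # First attempt is rate-limited, second succeeds.
        if len(calls) == 1:
            return FakeResponse(429, {"Retry-After": "0"})
        return FakeResponse(200)

    monkeypatch.setattr("ai_provider.requests.post", fake_post)
    monkeypatch.setattr("ai_provider.time.sleep", lambda s: None)  # keep the test fast
    resp = ai_provider._post_with_retries("https://api.example/v1", {})
    assert resp.status_code == 200
    assert len(calls) == 2
```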

## Open Questions

None blocking. Future enhancements (not in scope):

- MP-to-MP similarity from SVD vectors (the explorer UI is motion-focused for now)
- Real-time similarity for newly ingested motions before the next batch recompute