motion content enrichment — implementation plan
Goal: Implement the prioritized incremental hardening from the design (2026-03-23) so the SyncFeed → embedding → fusion → similarity pipeline is more robust, observable, and testable. Break the work into small, independent micro-tasks (one file + its test per task) so many implementers can work in parallel.
Design doc: thoughts/shared/designs/2026-03-23-motion-content-enrichment-next-steps-design.md
Architecture summary (what I'll implement)
- Add a small audit API on MotionDatabase so code can record per-item failures in a stable place (or fall back to a ledger file if DuckDB is not present).
- Add a dedicated ai_provider retry/fallback wrapper that:
- retries failed batches (exponential backoff),
- on persistent failure retries missing items with smaller batch sizes,
- returns aligned embedding results (None for failed items),
- records persistent failures to audit_events (using MotionDatabase.append_audit_event).
- Wire text embedding pipeline to use the wrapper and return failed ids (so rerun script can retry them).
- Add a `--max-body-workers` CLI option to scripts/sync_motion_content.py, reduce the default to 10, and add per-request retries.
- Add `--retry-missing` to scripts/rerun_embeddings.py: rerun missing failed items with smaller batches.
- Add a DB-side safety filter in similarity.compute to avoid inserting trivial 1.0 matches for very-short identical titles.
- Add a small QA script scripts/qa_similarity.py that samples windows/motions and writes a short JSON ledger for manual review.
- Add focused unit tests for the new behaviours (ai retry wrapper, DB audit append, sync body fetch retries, rerun retry mode, similarity filter, QA script).
Decisions / gap filling (why these concrete choices)
- Audit recording: implement MotionDatabase.append_audit_event that writes to audit_events table if present, else appends to thoughts/ledgers/audit_events.json. Rationale: migration SQL is a commented placeholder; making DB write optional keeps tests and CI safe; writing to ledgers is actionable and durable for triage.
- ai retry backoff params: default retries=3, initial_backoff=0.5s, jitter ±10%, fallback smaller_batch_size = max(1, batch_size // 2). Rationale: conservative defaults that map to design and are implementable/testable.
- fetch_body_text retries: 3 attempts per ext_id with small exponential backoff (0.5s). Use requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10) to limit pool size and avoid pool warnings. Default max workers lowered to 10.
- Interface changes: ensure_text_embeddings will return an extended result with failed_ids as a 5th element: (stored, skipped_existing, skipped_no_text, errors, failed_ids). I will update rerun_embeddings and its tests accordingly. Rationale: rerun needs failed ids; propagating as return value is simplest and testable.
- All new code uses logging.getLogger(__name__) (no print in library modules) to obey constraints.
- Tests will use monkeypatching/mocks to avoid network/DB dependencies.
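The retry defaults above (retries=3, initial_backoff=0.5s, ±10% jitter, halved fallback batches) can be sketched as two small helpers; the names are illustrative, not the final API:

```python
import random

def backoff_delay(attempt: int, initial: float = 0.5, jitter: float = 0.1) -> float:
    """Delay before retry N (0-based): 0.5s, 1.0s, 2.0s, ... with +/-10% jitter."""
    return initial * (2 ** attempt) * (1 + random.uniform(-jitter, jitter))

def fallback_batch_size(batch_size: int) -> int:
    """Halve the batch on persistent failure, never dropping below 1."""
    return max(1, batch_size // 2)
```

Keeping these as pure functions makes the backoff schedule trivially unit-testable without sleeping in tests.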
Dependency graph (high level)
Batch 1 (foundation, parallel): tasks 1.1–1.4 (no interdeps except where noted).
Batch 2 (core, parallel): tasks 2.1–2.3 (depend on Batch 1).
Batch 3 (safety & QA, parallel): task 3.1 (depends on Batch 2 and Batch 1).
Batch 1: Foundation (parallel - 4 implementers)
All tasks in this batch have NO (external) dependencies except where noted.
Task 1.1: MotionDatabase.append_audit_event
Owner: implementer (author)
Estimate: 2 hours
Depends: none
Description: Add an append_audit_event(...) helper to database.MotionDatabase. This method will attempt to INSERT a row into an audit_events table (if the table exists). If DuckDB is not available or the table does not exist, append the event to a JSON file under thoughts/ledgers/audit_events.json. This provides a stable place to record per-item failures without forcing a migration to run during tests/CI.
File: database.py (modify: add method)
Test: tests/test_database_audit.py (new)
Implementation notes (decisions):
- Signature: append_audit_event(actor_id: str | None, action: str, target_type: str | None = None, target_id: str | None = None, metadata: dict | None = None) -> bool
- Behavior:
- If duckdb is None: write (append) to thoughts/ledgers/audit_events.json as list of event objects (create file/dir as needed).
- If duckdb present: run "INSERT INTO audit_events (... )" wrapped in try/except; if table missing or INSERT fails, fall back to writing to the ledger file.
- Do not raise; log at appropriate levels and return True if recorded somewhere, False otherwise.
- Use uuid.uuid4() for id and UTC timestamp for created_at.
- Use logging.getLogger(__name__) for messages.
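A minimal sketch of the helper under the decisions above. The audit_events column list is an assumption, and the DuckDB connection and ledger path are injected here so the fallback path can be exercised without DuckDB; the real method would live on MotionDatabase and use its own connection:

```python
import json
import logging
import uuid
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger(__name__)
DEFAULT_LEDGER = Path("thoughts/ledgers/audit_events.json")

def append_audit_event(actor_id, action, target_type=None, target_id=None,
                       metadata=None, conn=None,
                       ledger_path: Path = DEFAULT_LEDGER) -> bool:
    """Record an audit event to DuckDB if possible, else to a JSON ledger.

    Never raises; returns True if the event was recorded somewhere.
    `conn` is an open DuckDB connection (or None when duckdb is unavailable).
    """
    event = {
        "id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "actor_id": actor_id,
        "action": action,
        "target_type": target_type,
        "target_id": target_id,
        "metadata": metadata or {},
    }
    if conn is not None:
        try:
            conn.execute(
                "INSERT INTO audit_events "
                "(id, created_at, actor_id, action, target_type, target_id, metadata) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (event["id"], event["created_at"], actor_id, action,
                 target_type, target_id, json.dumps(event["metadata"])),
            )
            return True
        except Exception:
            # table missing or INSERT failed: fall through to the ledger file
            logger.warning("audit_events INSERT failed; falling back to ledger",
                           exc_info=True)
    try:
        ledger_path.parent.mkdir(parents=True, exist_ok=True)
        events = json.loads(ledger_path.read_text()) if ledger_path.exists() else []
        events.append(event)
        ledger_path.write_text(json.dumps(events, indent=2))
        return True
    except OSError:
        logger.error("could not write audit ledger %s", ledger_path, exc_info=True)
        return False
```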
Test (complete list):
- tests/test_database_audit.py
- Case A (duckdb=None emulation): monkeypatch database.duckdb = None; verify the ledger file is created and its content contains the event.
- Case B (duckdb present but table insertion raises): monkeypatch duckdb.connect to a MagicMock that raises on execute -> verify fallback to ledger file.
- Verify method returns True when written to ledger, and that JSON is valid.
Verify:
- pytest -q tests/test_database_audit.py
Commit message suggestion:
- feat(db): add append_audit_event helper to MotionDatabase (ledger fallback)
Task 1.2: ai provider retry/fallback wrapper
Owner: implementer
Estimate: 3 hours
Depends: 1.1 (uses MotionDatabase.append_audit_event)
Description: Add a small module that wraps ai_provider.get_embeddings_batch to provide robust retries and fallback to smaller batch sizes. The wrapper returns a list of embeddings aligned with inputs; for items that permanently fail we return None in-place and record an audit event via MotionDatabase.append_audit_event.
File: pipeline/ai_provider_wrapper.py (new)
Test: tests/test_ai_provider_wrapper.py (new)
Implementation details:
- Provide function get_embeddings_with_retry(texts: list[str], motion_ids: list[int] | None = None, model: str | None = None, batch_size: int = 50, retries: int = 3) -> list[Optional[list[float]]]
- Approach:
- Iterate inputs in chunks of batch_size.
- For each chunk:
- Try ai_provider.get_embeddings_batch(chunk, model=model, batch_size=batch_size) up to `retries` times with exponential backoff (initial_backoff=0.5s, jitter).
- If a chunk continuously fails, split the chunk into subchunks (smaller_batch_size = max(1, batch_size // 2)) and retry the subchunks with the same logic.
- If an individual text still fails, mark the corresponding index result as None and record an audit event via MotionDatabase.append_audit_event with action='embedding_failed' and metadata including model, exception message, and attempts.
- Return a results list of the same length as inputs (embedding lists or None).
- Use MotionDatabase(db_path=...) only if a db_path is provided in env/config or via optional parameter — by default use database.db (existing module-level db instance) to call append_audit_event.
- Keep function pure enough to be unit-tested by monkeypatching ai_provider.get_embeddings_batch and MotionDatabase.append_audit_event.
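A simplified sketch of the wrapper logic. The provider call and failure hook are injected as callables so the sketch is self-contained and network-free; the real module would call ai_provider.get_embeddings_batch and database.db.append_audit_event directly and also accept model/motion_ids as described above:

```python
import logging
import random
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)

def get_embeddings_with_retry(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 50,
    retries: int = 3,
    initial_backoff: float = 0.5,
    on_failure: Optional[Callable[[int, str], None]] = None,
) -> list[Optional[list[float]]]:
    """Return embeddings aligned with texts; None marks items that permanently failed."""
    results: list[Optional[list[float]]] = [None] * len(texts)

    def try_chunk(start: int, chunk: list[str], size: int) -> None:
        for attempt in range(retries):
            try:
                results[start:start + len(chunk)] = embed_batch(chunk)
                return
            except Exception:
                logger.warning("chunk at %d failed (attempt %d/%d)",
                               start, attempt + 1, retries)
                if attempt + 1 < retries:
                    # exponential backoff with +/-10% jitter
                    time.sleep(initial_backoff * (2 ** attempt)
                               * (1 + random.uniform(-0.1, 0.1)))
        if size > 1:
            # persistent failure: split into smaller sub-chunks and retry those
            half = max(1, size // 2)
            for i in range(0, len(chunk), half):
                try_chunk(start + i, chunk[i:i + half], half)
        elif on_failure is not None:
            # single item still failing: leave None in place and report it
            on_failure(start, chunk[0])

    for i in range(0, len(texts), batch_size):
        try_chunk(i, texts[i:i + batch_size], batch_size)
    return results
```

Injecting `embed_batch` and `on_failure` is exactly what makes the monkeypatch-based tests below straightforward.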
Test cases:
- test successful batch returns embeddings aligned to inputs
- test simulated transient failure where first attempt fails and second succeeds (observed retry)
- test persistent chunk failure triggers fallback to smaller chunks and eventual audit appended for failing items (verify append_audit_event called with expected metadata)
- tests use monkeypatch to stub ai_provider.get_embeddings_batch behavior and MotionDatabase.append_audit_event
Verify:
- pytest -q tests/test_ai_provider_wrapper.py
Commit:
- feat(pipeline): ai provider retry/fallback wrapper
Task 1.3: QA script — scripts/qa_similarity.py
Owner: implementer
Estimate: 2 hours
Depends: none
Description: Add a small script that samples N motions across windows, runs similarity lookup for each sampled motion, asserts simple heuristics (e.g., top-5 are not all score==1.0 except identical IDs), and writes a JSON summary into thoughts/ledgers/qa_similarity_{timestamp}.json. This script is meant to be run manually/CI for a quick QA check.
File: scripts/qa_similarity.py (new)
Test: tests/test_qa_similarity.py (new)
Implementation notes:
- CLI: --db-path, --sample-size (default 50), --top-k (default 5)
- Implementation uses MotionDatabase to select a small set of motions and similarity.get_cached_similarities (or MotionDatabase.get_cached_similarities) to evaluate neighbors.
- The script returns a dict summary which is also written to a uniquely named JSON under thoughts/ledgers/.
- For tests, monkeypatch MotionDatabase to return deterministic samples and similarities; verify the script produces the expected JSON summary and returns reasonable pass/fail flags.
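The core pass/fail heuristic can be kept as a pure function (name hypothetical), which is what makes deterministic testing easy:

```python
def neighbors_look_sane(motion_id: int,
                        neighbors: list[tuple[int, float]]) -> bool:
    """Heuristic check: fail if every non-self neighbor scores a perfect 1.0.

    `neighbors` is a list of (motion_id, score) pairs, e.g. the top-5 from
    the cached similarity lookup for the sampled motion.
    """
    other_scores = [score for nid, score in neighbors if nid != motion_id]
    return not other_scores or any(score < 1.0 for score in other_scores)
```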
Verify:
- pytest -q tests/test_qa_similarity.py
- Run manually: python scripts/qa_similarity.py --db-path data/motions.db --sample-size 10
Commit:
- feat(scripts): add QA similarity sampling script and ledger writer
Task 1.4: sync_motion_content — reduce concurrency, add per-ext_id retry, add CLI flag
Owner: implementer
Estimate: 3 hours
Depends: 1.1 (write failures to audit via MotionDatabase.append_audit_event)
Description: Harden the body text fetcher:
- Add a CLI flag `--max-body-workers` (default reduced to 10).
- Use requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10) when creating the requests.Session in sync_motion_content.
- Implement per-ext_id retry in _fetch_body_text: try up to 3 times with exponential backoff on network errors/5xx/429.
- When a body_text fetch permanently fails, call MotionDatabase.append_audit_event(action='body_fetch_failed', target_type='document', target_id=ext_id, metadata=...) so failures are recorded.
File: scripts/sync_motion_content.py (modify)
Test: tests/test_sync_motion_content.py (new)
Implementation details:
- Add parser.add_argument("--max-body-workers", type=int, default=10, help=...) in CLI
- When creating the session: mount an HTTPAdapter with pool_maxsize equal to max_body_workers, i.e. session.mount("https://", requests.adapters.HTTPAdapter(pool_maxsize=max_body_workers)).
- Modify _fetch_body_text(ext_id, session) to attempt up to 3 tries and return None on exhaustion; log appropriately; call db.append_audit_event when permanently failing (db from database.db).
- Update fetch_body_texts to pass max_workers param through as already implemented, but default constant MAX_BODY_WORKERS should be set to 10 at top of file.
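The retry loop in _fetch_body_text might look like the following sketch. The HTTP call is injected as a callable so the example stays network-free; the real function would wrap session.get and translate connection errors, 5xx responses, and 429 into the retryable case:

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)

class RetryableError(Exception):
    """Network error, 5xx, or 429 - worth retrying."""

def fetch_body_text(ext_id: str,
                    get: Callable[[str], str],
                    attempts: int = 3,
                    backoff: float = 0.5) -> Optional[str]:
    """Try up to `attempts` times with exponential backoff; None on exhaustion.

    On exhaustion the real code would also call
    db.append_audit_event(action='body_fetch_failed', ...).
    """
    for attempt in range(attempts):
        try:
            return get(ext_id)
        except RetryableError:
            logger.warning("fetch of %s failed (attempt %d/%d)",
                           ext_id, attempt + 1, attempts)
            if attempt + 1 < attempts:
                time.sleep(backoff * (2 ** attempt))
    return None
```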
Test plan:
- Test that _fetch_body_text retries: monkeypatch session.get to fail on the first call (raise requests.ConnectionError) and succeed on the second; verify the expected text is returned and that exactly the expected number of attempts occurred.
- Test permanent failure case: monkeypatch session.get to always raise and verify MotionDatabase.append_audit_event was called (monkeypatch database.db.append_audit_event).
- Test fetch_body_texts respects max_workers param by running small set and monkeypatching ThreadPoolExecutor to observe max_workers argument (or call with small size and assert function returns mapped results).
Verify:
- pytest -q tests/test_sync_motion_content.py
- Manual run: python scripts/sync_motion_content.py --db-path data/motions.db --max-body-workers 10
Commit:
- feat(sync): add per-ext_id retries and --max-body-workers flag (defaults to 10), record failures to audit
Batch 2: Core modules (parallel - 3 implementers)
These tasks depend on Batch 1 (ai wrapper and audit method must be present).
Task 2.1: text_pipeline — use ai wrapper & return failed_ids
Owner: implementer
Estimate: 3 hours
Depends: 1.2 (ai_provider_wrapper) and 1.1 (audit)
Description: Modify pipeline/text_pipeline.py to call the new ai_provider_wrapper.get_embeddings_with_retry instead of ai_provider.get_embeddings_batch. Extend ensure_text_embeddings to collect indexes/ids of motions which failed to get embeddings and return them as a fifth element: (stored, skipped_existing, skipped_no_text, errors, failed_ids). Keep logging behavior similar but include a log line reporting failed_ids for the run.
File: pipeline/text_pipeline.py (modify)
Test: tests/test_text_pipeline_retry.py (new)
Implementation details:
- Replace the ai_provider.get_embeddings_batch(batch_texts, ...) call with wrapper.get_embeddings_with_retry(batch_texts, batch_ids, model=model, batch_size=batch_size, retries=3).
- The wrapper returns list aligned with batch_texts containing either embedding list or None. For each None, increment errors and append motion_id to failed_ids.
- At the end of ensure_text_embeddings, return stored, skipped_existing, skipped_no_text, errors, failed_ids.
- Also ensure docstring updated.
- Keep existing counting and logging; existing callers will be updated in Task 2.2.
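The bookkeeping for the new failed_ids return value reduces to an index-aligned walk over the wrapper's output; a sketch, with a hypothetical helper name:

```python
from typing import Optional

def tally_batch(batch_ids: list[int],
                embeddings: list[Optional[list[float]]]) -> tuple[int, int, list[int]]:
    """Split the wrapper's aligned output into (stored, errors, failed_ids)."""
    stored, errors, failed_ids = 0, 0, []
    for motion_id, emb in zip(batch_ids, embeddings):
        if emb is None:
            errors += 1
            failed_ids.append(motion_id)
        else:
            stored += 1  # the real code would call db.store_embedding(motion_id, emb) here
    return stored, errors, failed_ids
```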
Test plan:
- Unit test that ensure_text_embeddings:
- when wrapper returns embeddings for all batch items, stored increments as expected.
- when wrapper returns None for some items, those motion_ids included in failed_ids and errors counts reflect them.
- Use monkeypatch to stub pipeline.ai_provider_wrapper.get_embeddings_with_retry and database.db.store_embedding.
Verify:
- pytest -q tests/test_text_pipeline_retry.py
Commit:
- feat(pipeline): use ai_provider wrapper for robust embeddings and return failed ids
Task 2.2: rerun_embeddings — add --retry-missing mode and wire re-run
Owner: implementer
Estimate: 2.5 hours
Depends: 2.1 (ensure_text_embeddings new return)
Description: Add a CLI flag --retry-missing to scripts/rerun_embeddings.py. When set, after the main ensure_text_embeddings call, if the returned failed_ids list is non-empty, re-run embedding for just those motion ids with a smaller batch size (half) via a new text_pipeline helper, ensure_text_embeddings_for_ids (see implementation notes). The script should record audit events for items that still fail after the retry.
File: scripts/rerun_embeddings.py (modify)
Test: tests/test_rerun_embeddings.py (modify — existing test)
Implementation notes:
- Add parser.add_argument("--retry-missing", action="store_true", help=...).
- After the first ensure_text_embeddings call, expect a 5-tuple. If retry_missing is set and failed_ids is non-empty, run a second short pass: text_pipeline.ensure_text_embeddings_for_ids(db_path=db_path, ids=failed_ids, model=model, batch_size=max(1, batch_size // 2)), a new helper added to text_pipeline for this purpose (preferred over threading an optional motion_ids parameter through ensure_text_embeddings).
- Update tests/test_rerun_embeddings.py to monkeypatch the new text_pipeline helper and simulate that first call returns failed_ids and second call resolves them; assert rerun called accordingly and summary contains expected fields.
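The second pass reduces to a thin wrapper around the new helper; sketched here with ensure_text_embeddings_for_ids injected as a callable that returns the ids still failing:

```python
from typing import Callable

def retry_missing(failed_ids: list[int], batch_size: int,
                  embed_for_ids: Callable[[list[int], int], list[int]]) -> list[int]:
    """Re-run embeddings for failed ids at half the batch size.

    `embed_for_ids` stands in for the proposed
    text_pipeline.ensure_text_embeddings_for_ids; it returns the ids that
    still failed. An empty input short-circuits to [].
    """
    if not failed_ids:
        return []
    smaller_batch = max(1, batch_size // 2)
    return embed_for_ids(failed_ids, smaller_batch)
```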
Test changes:
- Update tests/test_rerun_embeddings.py to reflect that text_pipeline.ensure_text_embeddings returns five values and to simulate --retry-missing behavior.
- Keep the existing expectations in the test (we will extend them to include failed_ids handling).
Verify:
- pytest -q tests/test_rerun_embeddings.py
- Manual run: python scripts/rerun_embeddings.py --db-path data/motions.db --retry-missing
Commit:
- feat(scripts): add --retry-missing to rerun_embeddings and retry failed items with smaller batches
Task 2.3: similarity.compute — DB-side safety filter to avoid trivial 1.0 matches
Owner: implementer
Estimate: 3 hours
Depends: none (reads existing DB)
Description: Add a small DB-side filter before inserting similarity rows that filters out suspicious 1.0 matches between different motions when the titles are extremely short (heuristic: identical titles with length < 12 characters). Add diagnostic logging for filtered pairs.
File: similarity/compute.py (modify)
Test: tests/test_similarity_compute_filter.py (new)
Implementation details:
- After building rows_to_insert (list of dicts with source/target ids & score), perform:
- If score == 1.0 (or very near 1.0 with tolerance e.g., > 0.999999), fetch titles for the set of involved ids (single query: SELECT id, title FROM motions WHERE id IN (...)).
- For each candidate row with perfect/near-perfect score, if motion titles are equal and len(title.strip()) < 12, skip insertion and log debug/info that pair was filtered due to trivial short identical title.
- The threshold of 12 is chosen conservatively (document the choice in the commit message).
- Keep inserted count and return behavior unchanged.
- Make sure DB connections are opened/closed per method.
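The filter itself is a pure function over the candidate rows plus a title lookup; a sketch, with the row shape assumed from the description above:

```python
import logging

logger = logging.getLogger(__name__)

def filter_trivial_matches(rows: list[dict], titles: dict[int, str],
                           tolerance: float = 0.999999,
                           min_title_len: int = 12) -> list[dict]:
    """Drop near-1.0 pairs between different motions with identical short titles."""
    kept = []
    for row in rows:
        if row["score"] > tolerance:
            a = titles.get(row["source_id"], "").strip()
            b = titles.get(row["target_id"], "").strip()
            if a and a == b and len(a) < min_title_len:
                logger.info("filtered trivial match %s -> %s (title=%r)",
                            row["source_id"], row["target_id"], a)
                continue  # trivial short-title match: skip insertion
        kept.append(row)
    return kept
```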
Test plan:
- Construct a minimal in-memory or duckdb-mocked scenario where two different motion ids have identical short title and their vectors produce 1.0 similarity. Monkeypatch duckdb.connect to return rows such that compute_similarities will produce rows_to_insert including a 1.0. Verify store_similarity_batch is not called for that row (monkeypatch MotionDatabase.store_similarity_batch or spy on db.store_similarity_batch calls).
Verify:
- pytest -q tests/test_similarity_compute_filter.py
Commit:
- fix(similarity): filter trivial 1.0 matches for very-short identical titles
Batch 3: Observability / Integration (parallel - 1-2 implementers)
These are small finishing tasks (audit/ledgers, small extras).
Task 3.1: Tests & CI adjustments, docs, ledger examples
Owner: reviewer (PR reviewer)
Estimate: 2 hours
Depends: all tasks above (1.1–2.3)
Description: After the code is in, run full test suite, fix any flaky tests, add short README note in thoughts/ledgers/ about how to run QA script and how audit_events fallback works. Add a small example ledger created by QA script if helpful.
Files: (changes/additions)
- thoughts/shared/plans/2026-03-23-motion-content-enrichment-plan.md (this plan — created)
- thoughts/ledgers/README_motion_enrichment.md (new, optional)
- No dedicated unit test for this task; it's a reviewer/integration task.
Verification:
- Run full tests: pytest
- Run QA script locally: python scripts/qa_similarity.py --db-path data/motions.db --sample-size 10
- Inspect thoughts/ledgers/qa_similarity_*.json and audit_events ledger file.
Commit:
- docs(ledgers): document QA and audit fallback behavior
Test / Verification summary (per-task commands)
- Task 1.1
- pytest -q tests/test_database_audit.py
- Task 1.2
- pytest -q tests/test_ai_provider_wrapper.py
- Task 1.3
- pytest -q tests/test_qa_similarity.py
- python scripts/qa_similarity.py --db-path data/motions.db --sample-size 10
- Task 1.4
- pytest -q tests/test_sync_motion_content.py
- python scripts/sync_motion_content.py --db-path data/motions.db --max-body-workers 10 --skip-body-text (dry run)
- Task 2.1
- pytest -q tests/test_text_pipeline_retry.py
- Task 2.2
- pytest -q tests/test_rerun_embeddings.py
- python scripts/rerun_embeddings.py --db-path data/motions.db --retry-missing
- Task 2.3
- pytest -q tests/test_similarity_compute_filter.py
Full suite verification:
- pytest -q