6.8 KiB

Raw Blame History

date	topic	status
2026-03-23	Test refactor: replace monkeypatching with real implementations	validated

Test Refactor — Remove Monkeypatching, Use Real Implementations

Problem Statement

The new test suite (added during the motion content enrichment hardening sprint) relies on:

sys.modules injection of a fake duckdb module
monkeypatch.setattr to replace pipeline functions wholesale
Exception-swallowing smoke tests that verify nothing meaningful

These tests give false confidence. They patch away the very code paths we want to exercise. The goal is to replace them with tests that run real production code paths using in-memory DuckDB and injected fake callables.

Constraints

Do NOT modify app.py or scheduler.py
Keep the module-level db = MotionDatabase() singleton in database.py — do not remove it (production code depends on it)
Production-code changes must be minimal and backwards-compatible (all existing callers continue to work with default parameters)
Tests run with .venv/bin/python -m pytest -q
No print() in library modules — use logging.getLogger(__name__)

Approach: Dependency Injection + In-Memory DuckDB

Thread an optional db: MotionDatabase = None parameter through the three pipeline functions that currently hard-import the singleton. Pass MotionDatabase(":memory:") in tests. Replace fake ai_provider calls with a real FakeEmbedder callable injected as a parameter.

Rejected alternatives:

Repository/Protocol pattern — major refactor of database.py and all callers; YAGNI for current test coverage needs
tmp_path real DB files — slower, requires full schema migration in fixtures, still needs singleton patching

Architecture

Production Changes (minimal, backwards-compatible)

pipeline/ai_provider_wrapper.py

Add db: MotionDatabase = None param to get_embeddings_with_retry
Add embedder=None param (callable replacing get_embeddings_batch)
If None, fall back to imported singleton / real provider

pipeline/text_pipeline.py

ensure_text_embeddings: already accepts db_path; also accept db: MotionDatabase = None as override (takes precedence over db_path)
ensure_text_embeddings_for_ids: same db override param
_select_text: accept db: MotionDatabase (already does); when db.db_path == ":memory:" the function returns [] (no motions seeded at that level) — this is fine for retry-path tests

similarity/compute.py

compute_similarities: already constructs MotionDatabase(db_path) locally; add db: MotionDatabase = None override param

Test Infrastructure (`tests/conftest.py`)

New fixtures added alongside existing ones:

mem_db fixture (function-scoped)

Creates MotionDatabase(":memory:") — triggers _init_database() which creates all tables/sequences in-memory
Seeds a minimal set of motions rows for tests that need them
Returned to test; closed after test

fake_embedder fixture

Returns a FakeEmbedder instance — a real callable class with signature (texts, model=None, batch_size=50) -> list[list[float]]
Returns deterministic embeddings: [[0.1] * 1536] per text
Accepts fail_ids: list[int] kwarg to simulate per-item permanent failures (raises RuntimeError for those items)

Components

`FakeEmbedder` (tests/conftest.py)

Real class, not a Mock
__call__(texts, model=None, batch_size=50) — returns list of float vectors
fail_ids set at construction time: any call where the batch index is in fail_ids raises RuntimeError
Tracks call_count for assertions

`mem_db` fixture

MotionDatabase(":memory:") — schema initialized by __init__
Optionally seeds motions via db.insert_motion(...) for tests that need real data
No filesystem side effects

Data Flow

Retry test (replaces `test_rerun_embeddings_retry.py`)

mem_db fixture creates in-memory DB
FakeEmbedder(fail_ids=[101, 102]) created
Call ensure_text_embeddings(db=mem_db, embedder=fake_embedder)
Wrapper internally retries failed items
Assert fake_embedder.call_count > 1 (retries happened)
Assert failed_ids returned contains [101, 102]

Audit event test (replaces `test_database_audit.py`)

mem_db fixture — in-memory DB with audit_events table
Call mem_db.append_audit_event(None, "test_action", ...)
Query audit_events table directly: assert row exists
No filesystem writes

Similarity filter test (replaces `test_similarity_compute_filter.py`)

mem_db seeded with 2 motions with identical short titles ("Aangenomen.")
Embeddings seeded for both motions (identical vectors → score 1.0)
Call compute_similarities(vector_type="fused", window_id=None, db=mem_db)
Assert store_similarity_batch stored 0 pairs (filtered out)

Embedding failure + audit event test (new, `test_ai_provider_wrapper.py`)

mem_db fixture
FakeEmbedder(fail_ids=[5]) — motion 5 permanently fails
Call get_embeddings_with_retry(texts=[...], motion_ids=[5], db=mem_db, embedder=fake_embedder)
Assert return contains failure sentinel for motion 5
Query mem_db audit_events table: assert row with action="embedding_failed", target_id=5

Error Handling Strategy

_select_text with in-memory DB returns [] (no rows) — callers treat empty list as "no motions to embed", which is correct
FakeEmbedder.fail_ids raises RuntimeError to exercise the retry/backoff path without sleeping — the wrapper's retry loop catches Exception, so this exercises the real retry path
Audit event writes go to the in-memory audit_events table; the JSON ledger fallback path is tested separately if needed

Testing Strategy

All tests use .venv/bin/python -m pytest -q. No sys.modules tricks. No monkeypatch.setattr on production module attributes.

Test file	What it exercises
`test_database_audit.py`	`append_audit_event` → in-memory DB row
`test_ai_provider_wrapper.py`	Empty input; transient retry; permanent fail + audit event
`test_rerun_embeddings_retry.py`	Real `ensure_text_embeddings` → retry path → `ensure_text_embeddings_for_ids` called
`test_similarity_compute_filter.py`	Real `compute_similarities` → filter removes perfect-score identical-title pairs

Open Questions

Does MotionDatabase(":memory:").__init__ successfully create the audit_events table? (Needs verification — if _init_database conditionally skips it, the fixture needs to run it explicitly.)
Does _select_text need the DB connection passed directly rather than reconstructed from db.db_path? For :memory: DBs, a new duckdb.connect(":memory:") opens an empty DB, not the seeded one. Resolution: ensure_text_embeddings should pass the seeded connection rather than the path, OR accept that _select_text returns [] for in-memory DBs (sufficient for retry-path tests).

6.8 KiB Raw Blame History