parent
fd73da3752
commit
0bbda408fb
@ -0,0 +1,146 @@ |
||||
# Parliamentary Embedding Pipeline (Late Fusion) Implementation Plan |
||||
|
||||
Goal: Implement an MVP late-fusion pipeline that (1) extracts MP-level votes from the existing motions.voting_results JSON, (2) builds aligned SVD representations per time window, (3) ensures every motion has a text embedding, and (4) fuses SVD motion vectors with text embeddings into a fused_embeddings table. All storage uses DuckDB and all computation runs in Python. |
||||
|
||||
Design reference: thoughts/shared/designs/2026-03-21-parliamentary-embedding-pipeline-design.md |
||||
|
||||
--- |
||||
|
||||
## Dependency Graph |
||||
|
||||
``` |
||||
Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4 [foundation - no deps outside the batch; 1.3 follows 1.2] |
||||
Batch 2 (parallel): 2.1, 2.2-spike, 2.3, 2.4, 2.5 [core - depends on batch 1] |
||||
Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2] |
||||
``` |
||||
|
||||
--- |
||||
|
||||
## Batch 1: Foundation (parallel) |
||||
|
||||
### Task 1.1: Add scientific dependencies |
||||
- File to modify: pyproject.toml |
||||
- Test: tests/test_pyproject_deps.py |
||||
- Hours: 1.0 | Priority: high | Depends: none |
||||
- Acceptance: scipy>=1.11, umap-learn>=0.5, plotly>=5.0 present in pyproject.toml |
||||
|
||||
### Task 1.2: Add migration file placeholders |
||||
- Files to create: migrations/2026_03_21__create_mp_votes.sql, migrations/2026_03_21__create_mp_metadata.sql, migrations/2026_03_21__create_svd_vectors.sql, migrations/2026_03_21__create_fused_embeddings.sql |
||||
- Test: tests/test_migration_pipeline_tables.py (follows existing pattern in tests/test_migration_embeddings.py) |
||||
- Hours: 1.5 | Priority: high | Depends: none |
||||
- Acceptance: Migration files exist; test applies them to temp DuckDB and asserts expected tables/columns |
||||
|
||||
### Task 1.3: Extend database.py with new tables + helpers |
||||
- File to modify: database.py (_init_database + new helpers) |
||||
- Test: tests/test_database_schema_and_helpers.py |
||||
- Hours: 3.5 | Priority: highest | Depends: 1.2 |
||||
- Helpers to add: mp_votes_exists_for_motion, insert_mp_vote, upsert_mp_metadata, store_svd_vector, store_fused_embedding |
||||
- Acceptance: Tables created; helpers tested against a temp DuckDB via round-trip insert/select; helpers use logging, not print statements |
||||
|
||||
### Task 1.4: Add test fixtures |
||||
- File to create: tests/fixtures/sample_voting_results.json (5–10 motions with mixed party + MP keys) |
||||
- Hours: 0.5 | Priority: medium | Depends: none |
||||
|
||||
--- |
||||
|
||||
## Batch 2: Core Pipeline (parallel, depends on Batch 1) |
||||
|
||||
### Task 2.1: pipeline/extract_mp_votes.py |
||||
- Extract MP rows from the voting_results JSON: a key containing a comma is an MP name; any other key is a party and is skipped |
||||
- Test: tests/test_extract_mp_votes.py |
||||
- Hours: 4.0 | Priority: highest | Depends: 1.3, 1.4 |
||||
- Acceptance: Idempotent; correct MP rows inserted; party keys ignored; re-run produces no duplicates |
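The comma-in-key rule and the idempotency requirement of Task 2.1 can be sketched as follows. This is a minimal illustration, not the planned module: it assumes voting_results is a flat JSON object mapping a name to a vote string, the vote values "voor"/"tegen" and the sample names are made up, and a plain dict stands in for the mp_votes table. The function names `extract_mp_rows` and `insert_new_rows` are hypothetical.

```python
import json


def extract_mp_rows(motion_id, voting_results_json):
    """Return (motion_id, mp_name, vote) tuples for MP-level keys only."""
    results = json.loads(voting_results_json)
    rows = []
    for key, vote in results.items():
        # Rule from Task 2.1: a comma in the key marks an MP name
        # (e.g. "Jansen, A."); keys without a comma are party rows.
        if "," not in key:
            continue
        rows.append((motion_id, key.strip(), vote))
    return rows


def insert_new_rows(store, rows):
    """Insert rows into a dict standing in for mp_votes; skip existing keys.

    Keying on (motion_id, mp_name) is an assumption about what makes a
    vote row unique; it is what makes a re-run a no-op here.
    """
    inserted = 0
    for motion_id, mp_name, vote in rows:
        key = (motion_id, mp_name)
        if key in store:
            continue
        store[key] = vote
        inserted += 1
    return inserted


# Hypothetical sample payload: one party key and two MP keys.
sample = json.dumps({"VVD": "voor", "Jansen, A.": "voor", "de Vries, B.": "tegen"})
store = {}
rows = extract_mp_rows(101, sample)
first_run = insert_new_rows(store, rows)
second_run = insert_new_rows(store, rows)  # re-run inserts nothing new
```

In the real pipeline the uniqueness check would presumably go through mp_votes_exists_for_motion or a database constraint rather than an in-memory dict.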
||||
|
||||
### Task 2.2 (spike): pipeline/spike_fetch_kamerlid.py |
||||
- Validate whether OData /Kamerlid endpoint provides party affiliation + tenure dates |
||||
- Test: tests/test_spike_fetch_kamerlid.py (monkeypatched network) |
||||
- Hours: 2.0 | Priority: highest (spike) | Depends: 1.3 |
||||
- Acceptance: Spike result documented; test asserts field extraction from mocked response; full fetch (2.2b) scheduled or fallback heuristic decided |
||||
|
||||
### Task 2.3: pipeline/text_pipeline.py |
||||
- Ensure every motion has a text embedding; delegates to existing ai_provider.get_embedding |
||||
- Text priority: body_text > layman_explanation > description |
||||
- Test: tests/test_text_pipeline.py (monkeypatch ai_provider) |
||||
- Hours: 3.0 | Priority: high | Depends: 1.3, 1.1 |
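The text priority in Task 2.3 amounts to picking the first usable field in a fixed order. A minimal sketch, assuming a motion is available as a dict and that empty or whitespace-only strings count as missing (the plan does not say so explicitly); `select_motion_text` is a hypothetical name:

```python
TEXT_PRIORITY = ("body_text", "layman_explanation", "description")


def select_motion_text(motion):
    """Return (field_name, text) for the highest-priority non-empty field.

    Returns (None, None) when no field has usable text, so the caller can
    skip the motion and log it instead of embedding an empty string.
    """
    for field in TEXT_PRIORITY:
        value = motion.get(field)
        # Assumption: None, "" and whitespace-only values are all "missing".
        if value and value.strip():
            return field, value.strip()
    return None, None


# Hypothetical motions for illustration.
full = {"body_text": "Full text", "layman_explanation": "Plain summary"}
partial = {"body_text": "  ", "layman_explanation": None, "description": "Short title"}
empty = {"description": ""}

chosen_full = select_motion_text(full)
chosen_partial = select_motion_text(partial)
chosen_empty = select_motion_text(empty)
```

The selected text would then be passed to the existing ai_provider.get_embedding; returning the field name as well makes it easy to log which source was used.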
||||
|
||||
### Task 2.4: pipeline/svd_pipeline.py |
||||
- Per-window: build sparse MP×Motion csr_matrix → scipy svds → Procrustes alignment → store svd_vectors |
||||
- CRITICAL: enforce k < min(n_mps, n_motions); reduce k dynamically if needed; test this path |
||||
- Procrustes: log the disparity score; flag the window as HIGH_DISPARITY if overlap < 30% |
||||
- Test: tests/test_svd_pipeline.py (synthetic 5×6 matrix, k reduction test, alignment test) |
||||
- Hours: 6.0 | Priority: highest | Depends: 1.3 |
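The k constraint in Task 2.4 exists because scipy's svds with its default ARPACK solver only accepts 1 <= k < min(matrix shape). A minimal sketch of the dynamic reduction, kept free of scipy so it runs anywhere; `effective_k` is a hypothetical helper name and the warning text is illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def effective_k(requested_k, n_mps, n_motions):
    """Clamp the SVD rank so that 1 <= k < min(n_mps, n_motions)."""
    max_k = min(n_mps, n_motions) - 1
    if max_k < 1:
        # A window with fewer than two MPs or two motions cannot be decomposed.
        raise ValueError(
            f"window too small for SVD: {n_mps} MPs x {n_motions} motions"
        )
    if requested_k > max_k:
        logger.warning("reducing k from %d to %d", requested_k, max_k)
        return max_k
    return requested_k


# The synthetic 5x6 matrix from the test plan forces a reduction from 50 to 4.
k_small_window = effective_k(50, 5, 6)
k_unchanged = effective_k(2, 5, 6)
```

The clamped value would be what gets passed as `k` to svds, and the number of dimensions actually used may be worth recording alongside the stored vectors so that fusion can report svd_dims correctly.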
||||
|
||||
### Task 2.5: pipeline/fusion.py |
||||
- For each motion in window: fetch SVD motion vector + text embedding → concatenate → store fused_embeddings |
||||
- Skip the motion and log it if either vector is missing |
||||
- Test: tests/test_fusion.py (verify vector length = svd_dims + text_dims) |
||||
- Hours: 3.0 | Priority: high | Depends: 2.3, 2.4 |
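Late fusion in Task 2.5 is a plain concatenation of the two vectors. A minimal sketch, assuming vectors arrive as Python lists and are stored as JSON text (per the key assumptions); the function name `fuse` and the returned dict layout are hypothetical, with keys chosen to mirror the fused_embeddings columns:

```python
import json
import logging

logger = logging.getLogger(__name__)


def fuse(motion_id, window_id, svd_vector, text_vector):
    """Concatenate SVD and text vectors into a fused_embeddings-style record.

    Returns None (after logging) when either input is missing, matching the
    "skip and log" rule in the task.
    """
    if svd_vector is None or text_vector is None:
        logger.info("skipping motion %s in %s: missing vector", motion_id, window_id)
        return None
    fused = list(svd_vector) + list(text_vector)
    return {
        "motion_id": motion_id,
        "window_id": window_id,
        "vector": json.dumps(fused),  # stored as JSON text
        "svd_dims": len(svd_vector),
        "text_dims": len(text_vector),
    }


# Hypothetical inputs: a 2-dimensional SVD vector and the 16-dimensional
# fake embedding used in the fast tests.
record = fuse(101, "2026-Q1", [0.5, -0.25], [0.01] * 16)
skipped = fuse(102, "2026-Q1", None, [0.01] * 16)
```

Storing svd_dims and text_dims next to the vector lets a reader split the fused vector back into its two parts without knowing which k was in effect for the window.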
||||
|
||||
--- |
||||
|
||||
## Batch 3: Integration & CI (depends on Batch 2) |
||||
|
||||
### Task 3.1: tests/integration/test_pipeline_end_to_end.py |
||||
- Apply migrations → seed motions → monkeypatch ai_provider → run extract → SVD → text → fuse |
||||
- Assert fused_embeddings rows and vector dimensions |
||||
- Hours: 4.0 | Priority: highest | Depends: 2.1, 2.3, 2.4, 2.5 |
||||
- Use numpy.random.seed(0); dataset ≤50 motions for CI speed |
||||
|
||||
### Task 3.2: tests/conftest.py (fixtures + test helpers) |
||||
- Fixtures: temp_duckdb_path, apply_migrations, monkeypatch_ai_provider, mock_odata_client |
||||
- Add tests/README.md section on monkeypatching strategy |
||||
- Hours: 2.0 | Priority: high | Depends: 1.3 |
||||
|
||||
--- |
||||
|
||||
## Migration filenames |
||||
- migrations/2026_03_21__create_mp_votes.sql — columns: id, motion_id, mp_name, party, vote, date, created_at |
||||
- migrations/2026_03_21__create_mp_metadata.sql — columns: mp_name (PK), party, entry_date, exit_date, source_id |
||||
- migrations/2026_03_21__create_svd_vectors.sql — columns: window_id, entity_type, entity_id, vector, model, created_at |
||||
- migrations/2026_03_21__create_fused_embeddings.sql — columns: motion_id, window_id, vector, svd_dims, text_dims, created_at |
||||
|
||||
--- |
||||
|
||||
## CI / Test instructions |
||||
|
||||
- Run all tests: pytest -q |
||||
- Run unit tests only: pytest -q tests/ --ignore=tests/integration |
||||
- Run integration test: pytest -q tests/integration/test_pipeline_end_to_end.py |
||||
- Monkeypatch ai_provider.get_embedding with a function returning [0.01]*16 for fast tests |
||||
- Monkeypatch OData/API calls via requests-mock or monkeypatch.setattr on TweedeKamerAPIClient methods |
||||
- Temp DuckDB: use pytest tmp_path fixture; apply migration SQL files at test setup |
||||
- Determinism: numpy.random.seed(0) in all tests calling scipy/numpy |
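The embedding monkeypatch described above can be sketched with the standard library alone. This is an illustration only: a SimpleNamespace stands in for the real ai_provider module, whose get_embedding signature is assumed here to take a single text argument, and `embed_motion` is a hypothetical caller. In the actual test suite, pytest's monkeypatch.setattr would play the role of mock.patch.object.

```python
from types import SimpleNamespace
from unittest import mock


def _real_get_embedding(text):
    # Stand-in for the real provider call; tests must never reach it.
    raise RuntimeError("network call attempted in a test")


# Hypothetical stand-in for the project's ai_provider module.
ai_provider = SimpleNamespace(get_embedding=_real_get_embedding)


def fake_get_embedding(text):
    """Deterministic 16-dimensional embedding for fast tests."""
    return [0.01] * 16


def embed_motion(text):
    # Look the function up at call time so the patch takes effect.
    return ai_provider.get_embedding(text)


with mock.patch.object(ai_provider, "get_embedding", fake_get_embedding):
    patched = embed_motion("example motion text")
```

Patching the attribute on the module object, rather than a name imported into the pipeline module, only works if the pipeline calls `ai_provider.get_embedding(...)` at call time as shown; a `from ai_provider import get_embedding` in the pipeline would need the patch applied to the pipeline module instead.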
||||
|
||||
--- |
||||
|
||||
## 3-Sprint Schedule (2-week sprints) |
||||
|
||||
Sprint 1 (Weeks 1–2): Tasks 1.1, 1.2, 1.3, 1.4, 2.2-spike |
||||
- Deliverables: DB schema extended, migrations present, spike result documented |
||||
|
||||
Sprint 2 (Weeks 3–4): Tasks 2.1, 2.3, 2.4, 2.5 (+ 2.2b if spike succeeded) |
||||
- Deliverables: All pipeline modules implemented with passing unit tests |
||||
|
||||
Sprint 3 (Weeks 5–6): Tasks 3.1, 3.2 |
||||
- Deliverables: Integration test passing end-to-end; CI docs written |
||||
|
||||
--- |
||||
|
||||
## Key assumptions |
||||
|
||||
1. Vectors stored as JSON (consistent with existing embeddings table) |
||||
2. Use existing ai_provider.get_embedding for text embeddings — no new model calls |
||||
3. SVD k enforced dynamically (k < min(n_mps, n_motions)); tests cover this path |
||||
4. Procrustes rotation matrices NOT persisted in MVP (aligned vectors stored directly) |
||||
5. mp_metadata: try OData spike first; fallback to majority-party heuristic if unavailable |
||||
6. Time windows default to quarterly but are parameterized so that annual windows can be validated in Sprint 2 |
||||
7. All new helpers go into existing database.py MotionDatabase class (not a new module) |
||||
8. Analysis/visualization (UMAP, Plotly plots) is a follow-up sprint, NOT included here |
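Assumption 6 implies a function that maps a motion date to a window under a configurable granularity. A minimal sketch; the name `window_id_for` and the identifier format ("2026-Q1" for quarterly, "2026" for annual) are hypothetical, since the plan does not define how window_id values look:

```python
from datetime import date


def window_id_for(d, granularity="quarterly"):
    """Map a date to a window identifier for the given granularity."""
    if granularity == "quarterly":
        quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
        return f"{d.year}-Q{quarter}"
    if granularity == "annual":
        return str(d.year)
    raise ValueError(f"unsupported granularity: {granularity!r}")


quarterly_id = window_id_for(date(2026, 3, 21))
year_end_id = window_id_for(date(2026, 12, 31))
annual_id = window_id_for(date(2026, 3, 21), granularity="annual")
```

Keeping the granularity a parameter of one function means the SVD and fusion steps can be run under both settings without code changes, which is what the annual stability check needs.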
||||
|
||||
## Open questions |
||||
|
||||
1. Does OData /Kamerlid expose party affiliation + tenure dates? (Sprint 1 spike answers this) |
||||
2. Should Procrustes rotation matrices be persisted? (MVP: no; revisit after the MVP) |
||||
3. Time-window granularity: annual first for stability validation, then quarterly? |
||||
4. What k should the production SVD use? The default is 50, but it must be validated against real data sizes |
||||
5. Who runs migrations in production, and how? (Out of scope for MVP) |
||||