update plan: replace spike with confirmed FractieZetelPersoon fetch task

main
Sven Geboers 1 month ago
parent 0bbda408fb
commit c498c3467e
  1. 31
      thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md

@ -10,7 +10,7 @@ Design reference: thoughts/shared/designs/2026-03-21-parliamentary-embedding-pip
``` ```
Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4 [foundation - no deps] Batch 1 (parallel): 1.1, 1.2, 1.3, 1.4 [foundation - no deps]
Batch 2 (parallel): 2.1, 2.2-spike, 2.3, 2.4, 2.5 [core - depends on batch 1] Batch 2 (parallel): 2.1, 2.2, 2.3, 2.4, 2.5 [core - depends on batch 1]
Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2] Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2]
``` ```
@ -34,7 +34,7 @@ Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2]
- File to modify: database.py (_init_database + new helpers) - File to modify: database.py (_init_database + new helpers)
- Test: tests/test_database_schema_and_helpers.py - Test: tests/test_database_schema_and_helpers.py
- Hours: 3.5 | Priority: highest | Depends: 1.2 - Hours: 3.5 | Priority: highest | Depends: 1.2
- Helpers to add: mp_votes_exists_for_motion, insert_mp_vote, upsert_mp_metadata, store_svd_vector, store_fused_embedding - Helpers to add: mp_votes_exists_for_motion, insert_mp_vote, upsert_mp_metadata(mp_name, party, van, tot_en_met, persoon_id), store_svd_vector, store_fused_embedding
- Acceptance: Tables created, helpers tested against temp DuckDB via round-trip insert/select; logging not prints - Acceptance: Tables created, helpers tested against temp DuckDB via round-trip insert/select; logging not prints
### Task 1.4: Add test fixtures ### Task 1.4: Add test fixtures
@ -51,11 +51,16 @@ Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2]
- Hours: 4.0 | Priority: highest | Depends: 1.3, 1.4 - Hours: 4.0 | Priority: highest | Depends: 1.3, 1.4
- Acceptance: Idempotent; correct MP rows inserted; party keys ignored; re-run produces no duplicates - Acceptance: Idempotent; correct MP rows inserted; party keys ignored; re-run produces no duplicates
### Task 2.2 (spike): pipeline/spike_fetch_kamerlid.py ### Task 2.2: pipeline/fetch_mp_metadata.py
- Validate whether OData /Kamerlid endpoint provides party affiliation + tenure dates - Fetch MP party membership and tenure from OData using confirmed endpoints (spike resolved: Persoon + FractieZetelPersoon are available)
- Test: tests/test_spike_fetch_kamerlid.py (monkeypatched network) - OData query: `/FractieZetelPersoon?$filter=Verwijderd eq false&$expand=Persoon($select=Id,Achternaam,Initialen,Tussenvoegsel),FractieZetel($expand=Fractie($select=NaamNL))`
- Hours: 2.0 | Priority: highest (spike) | Depends: 1.3 - Key fields: FractieZetelPersoon.Van (entry_date), FractieZetelPersoon.TotEnMet (exit_date, null=active), Persoon.Achternaam, Persoon.Initialen, Persoon.Tussenvoegsel, Fractie.NaamNL (party name)
- Acceptance: Spike result documented; test asserts field extraction from mocked response; full fetch (2.2b) scheduled or fallback heuristic decided - Name normalization: reconstruct ActorNaam format from Persoon fields: `"{Tussenvoegsel} {Achternaam}, {Initialen}".strip()` (must match keys in voting_results JSON, e.g. "Yesilgöz-Zegerius, D.")
- Persoon.Id stored as source_id (GUID) for deduplication
- Stores via MotionDatabase.upsert_mp_metadata; idempotent on re-run
- Test: tests/test_fetch_mp_metadata.py — monkeypatch requests.get with canned FractieZetelPersoon+Persoon response; assert name normalization and DB rows
- Hours: 3.5 | Priority: highest | Depends: 1.3
- Acceptance: mp_metadata rows correct; name normalization tested for tussenvoegsel variants; TotEnMet=null handled correctly; re-run idempotent
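
The name normalization and field extraction described for Task 2.2 can be sketched as below. This is a minimal illustration, not the contents of pipeline/fetch_mp_metadata.py: the function names (`normalize_actor_naam`, `record_to_row`, `parse_response`) are hypothetical, the canned payload is made up, and the nesting of `Persoon` and `FractieZetel.Fractie` is assumed from the `$expand` clause in the query above. A null `Tussenvoegsel` is treated as an empty prefix so the result does not start with a stray space.

```python
from typing import Any, Dict, List, Optional


def normalize_actor_naam(
    achternaam: str, initialen: str, tussenvoegsel: Optional[str] = None
) -> str:
    """Build "{Tussenvoegsel} {Achternaam}, {Initialen}" per the plan's template."""
    prefix = (tussenvoegsel or "").strip()  # None or "" means no prefix
    surname = f"{prefix} {achternaam.strip()}".strip()
    return f"{surname}, {initialen.strip()}"


def record_to_row(record: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map one FractieZetelPersoon record to the mp_metadata columns."""
    persoon = record["Persoon"]
    fractie = record["FractieZetel"]["Fractie"]
    return {
        "mp_name": normalize_actor_naam(
            persoon["Achternaam"], persoon["Initialen"], persoon.get("Tussenvoegsel")
        ),
        "party": fractie["NaamNL"],
        "van": record["Van"],
        "tot_en_met": record.get("TotEnMet"),  # None means still active
        "persoon_id": persoon["Id"],
    }


def parse_response(payload: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
    """OData JSON responses carry the result set under the "value" key."""
    return [record_to_row(r) for r in payload.get("value", [])]


# Canned, made-up payload of the shape a monkeypatched requests.get could return.
canned = {
    "value": [
        {
            "Van": "2021-03-31",
            "TotEnMet": None,
            "Persoon": {
                "Id": "00000000-0000-0000-0000-000000000001",
                "Achternaam": "Yesilgöz-Zegerius",
                "Initialen": "D.",
                "Tussenvoegsel": None,
            },
            "FractieZetel": {"Fractie": {"NaamNL": "Voorbeeldpartij A"}},
        },
        {
            "Van": "2019-01-01",
            "TotEnMet": "2021-03-30",
            "Persoon": {
                "Id": "00000000-0000-0000-0000-000000000002",
                "Achternaam": "Berg",
                "Initialen": "A.B.",
                "Tussenvoegsel": "van der",
            },
            "FractieZetel": {"Fractie": {"NaamNL": "Voorbeeldpartij B"}},
        },
    ]
}

rows = parse_response(canned)
```

A test along the lines of tests/test_fetch_mp_metadata.py could assert on `rows` directly, covering both the no-tussenvoegsel case and the `TotEnMet=null` case.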
### Task 2.3: pipeline/text_pipeline.py ### Task 2.3: pipeline/text_pipeline.py
- Ensure every motion has a text embedding; delegates to existing ai_provider.get_embedding - Ensure every motion has a text embedding; delegates to existing ai_provider.get_embedding
@ -95,7 +100,7 @@ Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2]
## Migration filenames ## Migration filenames
- migrations/2026_03_21__create_mp_votes.sql — columns: id, motion_id, mp_name, party, vote, date, created_at - migrations/2026_03_21__create_mp_votes.sql — columns: id, motion_id, mp_name, party, vote, date, created_at
- migrations/2026_03_21__create_mp_metadata.sql — columns: mp_name (PK), party, entry_date, exit_date, source_id - migrations/2026_03_21__create_mp_metadata.sql — columns: mp_name (PK), party, van (entry_date), tot_en_met (exit_date, nullable), persoon_id (GUID source_id)
- migrations/2026_03_21__create_svd_vectors.sql — columns: window_id, entity_type, entity_id, vector, model, created_at - migrations/2026_03_21__create_svd_vectors.sql — columns: window_id, entity_type, entity_id, vector, model, created_at
- migrations/2026_03_21__create_fused_embeddings.sql — columns: motion_id, window_id, vector, svd_dims, text_dims, created_at - migrations/2026_03_21__create_fused_embeddings.sql — columns: motion_id, window_id, vector, svd_dims, text_dims, created_at
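
To illustrate the idempotent upsert that the mp_metadata columns above are meant to support, here is a rough sketch. It uses the standard-library `sqlite3` module purely as a stand-in so the example runs without extra dependencies; the plan targets DuckDB, whose DDL and conflict-handling syntax may differ. The column types and the free-standing `upsert_mp_metadata` function are assumptions (the plan places the helper on the MotionDatabase class), and the inserted values are made up.

```python
import sqlite3

# Assumed column types; the plan only names the columns.
DDL = """
CREATE TABLE IF NOT EXISTS mp_metadata (
    mp_name    TEXT PRIMARY KEY,
    party      TEXT,
    van        TEXT,      -- entry_date
    tot_en_met TEXT,      -- exit_date, NULL while the MP is active
    persoon_id TEXT       -- GUID source_id
)
"""

UPSERT = """
INSERT INTO mp_metadata (mp_name, party, van, tot_en_met, persoon_id)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(mp_name) DO UPDATE SET
    party = excluded.party,
    van = excluded.van,
    tot_en_met = excluded.tot_en_met,
    persoon_id = excluded.persoon_id
"""


def upsert_mp_metadata(conn, mp_name, party, van, tot_en_met, persoon_id):
    """Insert a row, or overwrite the existing row with the same mp_name."""
    conn.execute(UPSERT, (mp_name, party, van, tot_en_met, persoon_id))


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# Running the same upsert twice must not create a duplicate row.
for _ in range(2):
    upsert_mp_metadata(
        conn,
        "Yesilgöz-Zegerius, D.",
        "Voorbeeldpartij A",
        "2021-03-31",
        None,
        "00000000-0000-0000-0000-000000000001",
    )

row_count = conn.execute("SELECT COUNT(*) FROM mp_metadata").fetchone()[0]
stored = conn.execute("SELECT party, tot_en_met FROM mp_metadata").fetchone()
```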
@ -115,10 +120,10 @@ Batch 3 (parallel): 3.1, 3.2 [integration & CI - depends on batch 2]
## 3-Sprint Schedule (2-week sprints) ## 3-Sprint Schedule (2-week sprints)
Sprint 1 (Weeks 1–2): Tasks 1.1, 1.2, 1.3, 1.4, 2.2-spike Sprint 1 (Weeks 1–2): Tasks 1.1, 1.2, 1.3, 1.4, 2.2
- Deliverables: DB schema extended, migrations present, spike result documented - Deliverables: DB schema extended, migrations present, mp_metadata fetch implemented and tested
Sprint 2 (Weeks 3–4): Tasks 2.1, 2.3, 2.4, 2.5 (+ 2.2b if spike succeeded) Sprint 2 (Weeks 3–4): Tasks 2.1, 2.3, 2.4, 2.5
- Deliverables: All pipeline modules implemented with passing unit tests - Deliverables: All pipeline modules implemented with passing unit tests
Sprint 3 (Weeks 5–6): Tasks 3.1, 3.2 Sprint 3 (Weeks 5–6): Tasks 3.1, 3.2
@ -132,14 +137,14 @@ Sprint 3 (Weeks 5–6): Tasks 3.1, 3.2
2. Use existing ai_provider.get_embedding for text embeddings — no new model calls 2. Use existing ai_provider.get_embedding for text embeddings — no new model calls
3. SVD k enforced dynamically (k < min(n_mps, n_motions)); tests cover this path 3. SVD k enforced dynamically (k < min(n_mps, n_motions)); tests cover this path
4. Procrustes rotation matrices NOT persisted in MVP (aligned vectors stored directly) 4. Procrustes rotation matrices NOT persisted in MVP (aligned vectors stored directly)
5. mp_metadata: try OData spike first; fallback to majority-party heuristic if unavailable 5. mp_metadata: fetch from OData FractieZetelPersoon endpoint (confirmed available); Van/TotEnMet give tenure windows
6. Default quarterly time windows, but parameterized for Annual validation in Sprint 2 6. Default quarterly time windows, but parameterized for Annual validation in Sprint 2
7. All new helpers go into existing database.py MotionDatabase class (not a new module) 7. All new helpers go into existing database.py MotionDatabase class (not a new module)
8. Analysis/visualization (UMAP, Plotly plots) is a follow-up sprint, NOT included here 8. Analysis/visualization (UMAP, Plotly plots) is a follow-up sprint, NOT included here
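
Decision 3 (SVD k enforced dynamically) amounts to clamping the requested rank to the size of the vote matrix. A minimal sketch of that constraint, assuming a hypothetical helper name `effective_svd_k` and the default of 50 mentioned under the open questions:

```python
def effective_svd_k(requested_k: int, n_mps: int, n_motions: int) -> int:
    """Return the largest usable k satisfying k < min(n_mps, n_motions)."""
    if requested_k < 1:
        raise ValueError("requested_k must be at least 1")
    upper = min(n_mps, n_motions) - 1  # strict inequality: k < min(...)
    if upper < 1:
        raise ValueError("vote matrix too small for SVD: need at least 2 MPs and 2 motions")
    return min(requested_k, upper)


# A small window with 30 motions forces k below the default of 50.
k_small_window = effective_svd_k(50, n_mps=150, n_motions=30)
# A large window leaves the requested k untouched.
k_large_window = effective_svd_k(50, n_mps=150, n_motions=2000)
```

Raising on a degenerate window, rather than silently returning 0, is a design choice made for this sketch; the pipeline could equally skip such windows.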
## Open questions ## Open questions
1. Does OData /Kamerlid expose party affiliation + tenure dates? (Sprint 1 spike answers this) 1. [RESOLVED] OData FractieZetelPersoon confirmed available with Van/TotEnMet tenure dates; Stemming.ActorFractie gives party for each individual vote; name normalization from Persoon.Achternaam+Initialen+Tussenvoegsel confirmed feasible
2. Should Procrustes rotation matrices be persisted? (MVP: no; revisit after) 2. Should Procrustes rotation matrices be persisted? (MVP: no; revisit after)
3. Time-window granularity: annual first for stability validation, then quarterly? 3. Time-window granularity: annual first for stability validation, then quarterly?
4. Production k value for SVD: default 50 but must be validated against real data sizes 4. Production k value for SVD: default 50 but must be validated against real data sizes
