Sven Geboers
aef7c45074
Refactor tests: replace sys.modules hacks with real DI + in-memory DB
- Add db=None, embedder=None params to ai_provider_wrapper, text_pipeline, compute_similarities
- New conftest.py: FakeEmbedder, mem_db (in-memory DuckDB), fake_embedder fixtures
- Rewrite test_ai_provider_wrapper (4 tests), test_rerun_embeddings_retry (2 tests), test_similarity_compute_filter (1 test) with real implementations
- Fix rerun_embeddings tests hanging on _get_all_windows by patching it alongside _clear_embeddings
- All 53 tests pass (2 skipped), 0 sys.modules hacks in refactored files
1 month ago
Sven Geboers
2891e9ee70
feat: add StemAtlas Streamlit app, explorer, Docker deployment, blog charts
1 month ago
Sven Geboers
daa22c5e2b
feat: complete parliamentary embedding pipeline with full historical coverage
- Add fused (SVD + text) embedding pipeline for annual windows 2016-2026
- Fix store_fused_embedding duplicate bug: DELETE before INSERT (idempotent)
- Add --text-batch-size CLI flag to run_pipeline.py (default 200)
- Add explicit --start-date/--end-date to download_past_year.py
- Backfill mp_votes for all motions (party-level votes, 111k new rows)
- Add similarity cache recompute: 212k rows across 9 annual windows
- Improve ai_provider retry logic, text_pipeline batching
- Improve analysis/political_axis PCA handling and visualizations
- Add diagnostic/utility scripts: compare_svd, generate_compass, inspect_axis, etc.
- Untrack data/motions.db (3.6GB binary), add to .gitignore with outputs/
- Update continuity ledger with full session state
1 month ago
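The DELETE-before-INSERT fix for `store_fused_embedding` can be sketched as follows. This is not the project's code: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so the example runs without extra dependencies, and the table layout, column names, and the function name `store_fused_embedding_sketch` are assumptions made for illustration.

```python
import json
import sqlite3


def store_fused_embedding_sketch(conn, motion_id, window_id, vector):
    """Idempotent store: remove any existing row for the key, then insert."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "DELETE FROM fused_embeddings WHERE motion_id = ? AND window_id = ?",
            (motion_id, window_id),
        )
        conn.execute(
            "INSERT INTO fused_embeddings (motion_id, window_id, vector) "
            "VALUES (?, ?, ?)",
            (motion_id, window_id, json.dumps(vector)),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fused_embeddings (motion_id INTEGER, window_id TEXT, vector TEXT)"
)

# Re-running the pipeline for the same motion and window must not add rows.
for _ in range(3):
    store_fused_embedding_sketch(conn, 1, "2024", [0.1, 0.2])

row_count = conn.execute("SELECT COUNT(*) FROM fused_embeddings").fetchone()[0]
```

Running the store three times for the same key leaves a single row, which is the property the commit calls idempotent.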
Sven Geboers
a78bee9b0a
feat(similarity): add precomputed similarity cache, fix fusion N+1, add 429 retry
- Add similarity/ package (compute.py, lookup.py) with numpy-based
pairwise cosine similarity and cached lookup
- database.py: create embeddings + similarity_cache tables in _init_database(),
add store_similarity_batch/get_cached_similarities/clear_similarity_cache helpers
- pipeline/fusion.py: replace N+1 per-motion embedding SELECT with single
bulk JOIN using DuckDB QUALIFY window function
- ai_provider.py: retry HTTP 429 with Retry-After header support
- migrations/2026-03-22-add-similarity-cache.sql: make executable
- Add tests for similarity compute, db helpers, and 429 retry (34 pass, 2 skip)
1 month ago
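The 429 retry with `Retry-After` support mentioned for ai_provider.py might look roughly like the sketch below. The names `call_with_429_retry` and `FakeResponse` and the default values are hypothetical. The sketch only handles the delay-in-seconds form of the header and falls back to a fixed delay when the header is missing or uses the HTTP-date form.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    """Minimal response object for the sketch (hypothetical)."""

    status_code: int
    headers: dict = field(default_factory=dict)


def call_with_429_retry(send, max_retries=3, default_delay=1.0, sleep=time.sleep):
    """Call send() and retry while the server answers HTTP 429.

    send is any zero-argument callable returning an object with
    .status_code and .headers. sleep is injectable so tests do not wait.
    """
    for attempt in range(max_retries + 1):
        response = send()
        # Return on success, on any non-429 error, or when retries run out.
        if response.status_code != 429 or attempt == max_retries:
            return response
        retry_after = response.headers.get("Retry-After")
        try:
            delay = float(retry_after) if retry_after is not None else default_delay
        except ValueError:
            delay = default_delay  # HTTP-date form is not parsed in this sketch
        sleep(max(delay, 0.0))
```

Passing `sleep` as a parameter keeps the retry test fast, in the same spirit as the dependency injection used elsewhere in the test suite.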
Sven Geboers
aa2f66ac9f
feat(analysis): fetch real MP metadata, fix anchor axis for party-level actors
- fetch_mp_metadata: use real OData URL with pagination (1200 records, 5 pages);
match abbreviations on Fractie.Afkorting, not NaamNL;
skip Verwijderd=true records
- upsert_mp_metadata: keep most recent membership (prefer active over ended,
then higher Van date) so current party affiliations are not overwritten by historical ones
- compute_anchor_axis: anchor directly on party-level SVD entities (GroenLinks-PvdA etc)
before falling back to mp_metadata individual MP lookup
- test_fetch_mp_metadata: fix mock for timeout kwarg + pagination + Afkorting field
- Generated anchor axis HTML for 2025-Q2 through 2026-Q1 in outputs/
1 month ago
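The membership rule in upsert_mp_metadata (prefer active over ended, then the higher Van date) can be expressed as a single sort key. The sketch below is an illustration only: the dictionary layout, the `ended` field, and the function name `pick_current_membership` are assumptions, not the project's schema.

```python
from datetime import date
from typing import Optional, TypedDict


class Membership(TypedDict):
    party: str
    van: date              # start date of the membership ("Van")
    ended: Optional[date]  # hypothetical field: None means still active


def pick_current_membership(memberships: list[Membership]) -> Membership:
    """Pick the membership that should win when several exist for one MP.

    The key is a tuple: active memberships (True) sort above ended ones
    (False), and ties are broken by the later start date.
    """
    return max(memberships, key=lambda m: (m["ended"] is None, m["van"]))
```

With this ordering, an active membership that started in 2021 wins over an ended one that started in 2023, so a historical record cannot overwrite the current affiliation.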
Sven Geboers
847b783877
fix(pipeline): fix API pagination, add skip_details fast path, bulk mp_votes insert
- _get_voting_records returns (records, besluit_meta) tuple; paginate via Besluit?expand=Stemming (469/mo vs 8400)
- get_motions(skip_details=True) bypasses per-motion detail chain (3 HTTP calls/motion)
- extract_mp_votes rewritten: bulk DataFrame insert (80k rows in 1.9s), includes party-level actors
- run_pipeline.py: pass db_path instead of db; handle dict/int return types
- download_past_year.py: skip_details=True default, limit-per-chunk default 50000
1 month ago
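The commit does not say how the pagination fix works. As one possible shape, assuming the service returns OData v4 style pages with a `value` list and an optional `@odata.nextLink`, a paginating loop could look like the sketch below. `fetch_all_pages` and the injected `get_json` callable are hypothetical names.

```python
def fetch_all_pages(first_url, get_json):
    """Collect records from every page of a paginated OData-style response.

    get_json is any callable that takes a URL and returns the decoded JSON
    body as a dict; injecting it keeps the loop testable without HTTP.
    """
    records = []
    url = first_url
    while url:
        page = get_json(url)
        records.extend(page.get("value", []))
        # Assumption: the last page simply omits the next-link key.
        url = page.get("@odata.nextLink")
    return records
```

In a test, `get_json` can be a dictionary lookup over canned pages, which makes it easy to check that no page is skipped.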
Sven Geboers
f2a831dfcf
feat(pipeline): add orchestrator CLI, analysis modules, and ActorFractie ingestion
- pipeline/run_pipeline.py: CLI orchestrator for all 5 pipeline phases with
--dry-run, --skip-*, --window-size, --svd-k, --start/end-date flags
- analysis/{political_axis,trajectory,clustering,visualize}.py: PCA/anchor
ideological axis, MP drift trajectories, UMAP + KMeans clustering, Plotly HTML output
- api_client.py: capture ActorFractie per individual MP vote (comma in ActorNaam)
into mp_vote_parties dict on each motion
- database.insert_motion: auto-insert mp_votes rows with party affiliation for
newly ingested motions when mp_vote_parties is present
- Add scikit-learn to pyproject.toml for KMeans clustering
- tests/test_run_pipeline.py: window generation, dry-run, skip-all paths
- tests/test_analysis.py: PCA axis, anchor axis, trajectory drift, KMeans
Ref: thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md
1 month ago
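The window generation covered by tests/test_run_pipeline.py is not spelled out in the commit. Purely as an illustration, and assuming quarterly windows labelled like the `2025-Q2` identifiers used elsewhere in this history, a generator could be sketched as below. `quarterly_windows` is a hypothetical name and the real `--window-size` semantics may differ.

```python
from datetime import date, timedelta


def quarterly_windows(start: date, end: date):
    """Return (label, window_start, window_end) tuples covering start..end.

    The first and last windows are clipped to the requested date range.
    """
    windows = []
    year, quarter = start.year, (start.month - 1) // 3 + 1
    while True:
        q_start = date(year, 3 * (quarter - 1) + 1, 1)
        if q_start > end:
            break
        # Work out the following quarter to find this quarter's last day.
        next_year, next_quarter = (year + 1, 1) if quarter == 4 else (year, quarter + 1)
        q_end = date(next_year, 3 * (next_quarter - 1) + 1, 1) - timedelta(days=1)
        windows.append((f"{year}-Q{quarter}", max(q_start, start), min(q_end, end)))
        year, quarter = next_year, next_quarter
    return windows
```

For example, `quarterly_windows(date(2025, 4, 1), date(2026, 3, 31))` yields four windows labelled 2025-Q2 through 2026-Q1.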
Sven Geboers
a36e6cba4e
feat(pipeline): implement parliamentary embedding pipeline MVP
- Add 4 migration files: mp_votes, mp_metadata, svd_vectors, fused_embeddings
- Extend database.py with 5 new helper methods and table init
- Add pipeline/ package: extract_mp_votes, fetch_mp_metadata, text_pipeline,
svd_pipeline (with Procrustes alignment), fusion
- Add full test suite (17 tests) covering all pipeline modules and migrations
- Fix Procrustes alignment bug: scipy scale is a norm value, not a multiplier
- Fix DuckDB date type handling in test assertions (datetime.date vs string)
- Remove duckdb.py shim; tests now run against real duckdb + scipy via uv
Ref: thoughts/shared/plans/2026-03-21-parliamentary-embedding-pipeline-plan.md
1 month ago
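The Procrustes note ("scale is a norm value, not a multiplier") can be made concrete with a small NumPy sketch. It assumes the pipeline uses something equivalent to `scipy.linalg.orthogonal_procrustes`, whose second return value is documented as the sum of the singular values of `A.T @ B`. The helper `orthogonal_procrustes_np` re-implements that computation for illustration, and the "wrong" line shows an assumed form of the bug, not the project's actual code.

```python
import numpy as np


def orthogonal_procrustes_np(A, B):
    """Return (R, scale): R is the orthogonal matrix minimising ||A @ R - B||_F.

    scale is the sum of the singular values of A.T @ B, mirroring what
    SciPy's orthogonal_procrustes reports as its second return value.
    """
    U, singular_values, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt, singular_values.sum()


rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
B = A @ Q                                     # B is A in a rotated basis

R, scale = orthogonal_procrustes_np(A, B)
aligned = A @ R                    # correct: apply the rotation only
wrongly_scaled = scale * (A @ R)   # assumed form of the bug: scale used as a factor
```

Because `B` is an exact rotation of `A` here, `aligned` recovers `B`, while `scale` equals the squared Frobenius norm of `A` and would inflate the vectors if multiplied in.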