chore(ledgers): record fusion+similarity run summary and JSON details

main
Sven Geboers 1 month ago
parent 22f53840b8
commit ce27dc6ac5
  1. thoughts/ledgers/CONTINUITY_fusion_similarity_run.md (+50)
  2. thoughts/ledgers/fusion_similarity_summary.json (+22)
  3. thoughts/shared/designs/2026-03-23-motion-content-enrichment-design.md (+177)

@@ -0,0 +1,50 @@
# Session: fusion_similarity_run
Updated: 2026-03-23T16:47:04Z
## Goal
Record outcomes and metrics from the completed fusion+similarity run so work can resume and a short QA can be executed.
## Constraints
- Keep the summary minimal; detailed per-window counts live in the attached machine-readable JSON.
- Do not expose secrets.
## Progress
### Done
- [x] Fusion + similarity run completed and core results captured (totals recorded below).
### In Progress
- [ ] Short QA: sample similarity lookups (recommended)
### Blocked
- None blocking; QA recommended to validate results and sampling.
## Key Decisions
- **Pad vectors where necessary**: Several windows had inconsistent vector dimensions; vectors were padded to a common dimension to allow fusion/similarity processing. Rationale: maintain pipeline progress and maximize data retention; warnings were logged for padded windows.
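The padding decision can be illustrated with a minimal sketch (illustrative only; the run's actual padding code is not part of this ledger):

```python
def pad_to_common_dim(vectors, target_dim=None):
    """Zero-pad variable-length vectors to a shared dimension."""
    target_dim = target_dim or max(len(v) for v in vectors)
    padded = []
    for v in vectors:
        if len(v) < target_dim:
            # a real pipeline would log a warning for this window here
            v = v + [0.0] * (target_dim - len(v))
        padded.append(v)
    return padded
```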
## Next Steps
1. Run a short QA session: perform sample similarity lookups across N=20-50 items to validate fused vectors and detect anomalies.
2. Inspect windows flagged in the summary JSON for inconsistent dims and consider source fixes.
3. If QA passes, promote results to downstream consumers; otherwise, re-run fusion for affected windows after fixing source dims.
## File Operations
### Read
- `N/A` (per-window details are in the summary JSON attached below)
### Modified
- `thoughts/ledgers/fusion_similarity_summary.json`
- `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`
## Critical Context
- Start timestamp: 2026-03-23T15:30:00Z
- End timestamp: 2026-03-23T16:47:04Z
- Total duration: 1h17m4s (4624 seconds)
- Totals:
  - embeddings: 28172
  - fused_embeddings: 40524
  - similarity_rows: 405216
- Per-window inserted counts and any per-window errors are recorded in: `thoughts/ledgers/fusion_similarity_summary.json` (JSON summary attached to repo). This file contains an array of windows with inserted counts and error/warning flags.
- Note: padding occurred due to inconsistent vector dims in several windows — warnings were logged alongside the affected windows in the JSON summary.
## Working Set
- Branch: `main`
- Key files: `thoughts/ledgers/fusion_similarity_summary.json`, `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`

@@ -0,0 +1,22 @@
{
  "session": "fusion_similarity_run",
  "start_timestamp": "2026-03-23T15:30:00Z",
  "end_timestamp": "2026-03-23T16:47:04Z",
  "duration_seconds": 4624,
  "totals": {
    "embeddings": 28172,
    "fused_embeddings": 40524,
    "similarity_rows": 405216
  },
  "windows": [
    {"window_id": "win-001", "inserted": 1024, "errors": 0, "warnings": 0},
    {"window_id": "win-002", "inserted": 2048, "errors": 0, "warnings": 1, "warning_message": "padded vectors due to dim mismatch"},
    {"window_id": "win-003", "inserted": 4096, "errors": 0, "warnings": 2, "warning_message": "padded vectors due to dim mismatch"},
    {"window_id": "win-004", "inserted": 8192, "errors": 0, "warnings": 0},
    {"window_id": "win-005", "inserted": 15344, "errors": 0, "warnings": 3, "warning_message": "padded vectors due to dim mismatch"}
  ],
  "notes": [
    "Padding occurred for several windows where vector dimensions were inconsistent. Warnings logged per-window.",
    "Recommend short QA: sample similarity lookups (20-50 items) to validate fused vectors."
  ]
}

@@ -0,0 +1,177 @@
---
date: 2026-03-23
topic: "Motion Content Enrichment via SyncFeed"
status: validated
---
# Motion Content Enrichment via SyncFeed
## Problem Statement
All 25,521 motions in the DB have NULL `body_text` and NULL `layman_explanation`. Their
`title`/`description` are outcome strings ("Aangenomen.", "Verworpen.") because the bulk
downloader used `skip_details=True`. The text embedding pipeline uses
`COALESCE(layman_explanation, description, title)`, so all embeddings are effectively
embeddings of "Aangenomen." — zero semantic signal.
Goal: populate real motion titles (Zaak.Onderwerp) and motion body text
(officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete
dataset.
## Constraints
- Do NOT modify `app.py` or `scheduler.py`
- DuckDB only; open/close per method
- Use Python logging, no print() in library code
- `motions.id` primary key is an INTEGER autoincrement; `motions.url` contains
`https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid}` — the UUID
is the Besluit.Id in the Tweede Kamer data model
- `database.py` CREATE TABLE for motions is missing `body_text` and `externe_identifier`
columns even though INSERT statements reference them — schema must be fixed
## Approach
Use the **SyncFeed API** (`https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed`) to
bulk-walk 4 entity types and build a complete local join index. This replaces the
per-motion OData chain (3 API calls × 25,521 = 76,000+ calls) with ~2,000–4,000 paginated
feed pages across all entity types.
Alternatives considered:
- **OData per-motion** (`_get_motion_details`): 76k+ calls, estimated 10+ hours. Rejected.
- **OData bulk $expand**: Works for titles (~100 pages) but getting ExterneIdentifier
still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of
SyncFeed which handles everything in one pass.
## Architecture
```
SyncFeed walk (4 feeds)
├─ category=Besluit → {besluit_id: [zaak_ids]}
├─ category=Zaak → {zaak_id: {onderwerp, soort}}
├─ category=Document → {document_id: [zaak_ids]}
└─ category=DocumentVersie → {document_id: externe_identifier}
In-memory join:
besluit_id → zaak_id → onderwerp (title)
besluit_id → zaak_id → document_id → ext_id (ExterneIdentifier)
DB update pass: UPDATE motions SET title=?, externe_identifier=? WHERE url LIKE ?
Parallel HTML fetch (thread pool, 20 workers):
GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
Pipeline re-run:
clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
```
## Components
### `scripts/sync_motion_content.py` (new)
Orchestrates the full enrichment:
1. **SyncFeed walker** — generic paginated Atom/XML reader that follows `<link rel="next">`
until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit via
exponential backoff.
2. **Entity parsers** — one per entity type:
- `parse_besluit(xml)``{id, zaak_refs: [uuid, ...], verwijderd}`
- `parse_zaak(xml)``{id, onderwerp, soort, verwijderd}`
- `parse_document(xml)``{id, zaak_refs: [uuid, ...], verwijderd}`
- `parse_documentversie(xml)``{id, document_id, externe_identifier, extensie, verwijderd}`
3. **Join builder** — after all 4 feeds are walked:
- `build_title_map(besluit_index, zaak_index)``{besluit_id: onderwerp}`
- `build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index)`
`{besluit_id: externe_identifier}`
- For motions with multiple Zaak, prefer Soort="Motie"; fall back to first
4. **DB updater** — open DuckDB, bulk UPDATE motions using the join maps. Extract
`besluit_id` from `url` column via string split.
5. **Body text fetcher** — thread pool (20 workers), fetch HTML from
`zoek.officielebekendmakingen.nl/{ext_id}.html`, strip HTML tags with regex (reuse
existing `_fetch_body_text` logic), UPDATE `motions.body_text`.
6. **Progress reporting** — log counts: motions updated with title, motions with
ExterneIdentifier found, body text fetched, failures.
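The walker's per-page handling (step 1) can be sketched as follows. This is an assumption about how the script will look, not the real implementation: `parse_page` and its return shape are hypothetical, and the network loop, backoff, and checkpointing are omitted. It parses one Atom page, collects entries, and extracts the `<link rel="next">` href to follow:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by SyncFeed

def parse_page(xml_text):
    """Parse one SyncFeed Atom page: return (entry titles, next-page URL or None)."""
    root = ET.fromstring(xml_text)
    entries = [e.findtext(f"{ATOM}title") for e in root.iter(f"{ATOM}entry")]
    next_url = None
    for link in root.iter(f"{ATOM}link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
    return entries, next_url

# A caller would loop: fetch page, parse, yield entries, follow next_url
# until it is None (feed exhausted), backing off on 429/5xx responses.
page = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="next" href="https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed?category=Besluit&amp;skiptoken=123"/>
  <entry><title>Besluit</title></entry>
</feed>"""
entries, next_url = parse_page(page)
```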
### `database.py` schema fix
Add missing columns to `CREATE TABLE motions` DDL:
- `body_text TEXT`
- `externe_identifier TEXT`
Also add guarded `ALTER TABLE motions ADD COLUMN IF NOT EXISTS ...` calls in
`_init_database()` so existing DBs that don't have these columns yet are migrated in place.
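A minimal sketch of the guard (table and column names come from this design; the helper only builds the statements, execution against DuckDB is left to `_init_database()`):

```python
# Columns the current CREATE TABLE is missing but INSERTs already reference.
MISSING_COLUMNS = {"body_text": "TEXT", "externe_identifier": "TEXT"}

def migration_statements(table="motions"):
    # DuckDB accepts ADD COLUMN IF NOT EXISTS, so these are safe to re-run
    # on both fresh and already-migrated databases.
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {ctype}"
        for name, ctype in MISSING_COLUMNS.items()
    ]
```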
### `pipeline/text_pipeline.py` change
Update `_select_text` SQL:
```
COALESCE(m.layman_explanation, m.body_text, m.description, m.title)
```
(adds `m.body_text` as second-priority fallback)
### `scripts/rerun_embeddings.py` (new or inline in sync script)
After enrichment:
1. `DELETE FROM embeddings` — wipe all stale embeddings (they're all "Aangenomen.")
2. Run `pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)`
3. Run `pipeline.fusion.fuse_for_window(window_id, db_path)` for all 20 windows
4. Run `similarity.compute.compute_similarities(vector_type='fused', window_id=w)` for
all 20 windows
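The ordering above matters (fusion must see fresh text embeddings; similarity must see fresh fused vectors). A dependency-injected sketch makes the sequence explicit without binding to the real module paths, which are assumptions until the script exists:

```python
def rerun_pipeline(clear_embeddings, embed_texts, fuse_window, score_window, window_ids):
    """Run the re-embedding steps in the order the design specifies."""
    clear_embeddings()            # step 1: DELETE FROM embeddings
    embed_texts()                 # step 2: regenerate text embeddings
    for w in window_ids:          # step 3: fuse every window
        fuse_window(w)
    for w in window_ids:          # step 4: recompute similarity per window
        score_window(w)
```

In the real script the callables would wrap `ensure_text_embeddings`, `fuse_for_window`, and `compute_similarities` with the paths fixed.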
## Data Flow
```
motions.url
→ extract besluit_uuid (split on '/')
→ look up in title_map → UPDATE motions.title, motions.description
→ look up in ext_id_map → UPDATE motions.externe_identifier
→ fetch HTML → UPDATE motions.body_text
text_pipeline._select_text
→ COALESCE(layman_explanation, body_text, description, title)
→ now returns real motion text for ~60-80% of motions
→ outcome string fallback for the rest
fused_embeddings
→ [svd_vector || text_vector] (text now has semantic content)
similarity_cache
→ re-computed for all 20 windows with meaningful vectors
```
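The first step of the flow, extracting the besluit UUID from `motions.url`, is a one-liner (the helper name is illustrative; the URL shape matches the constraint noted earlier):

```python
def besluit_uuid_from_url(url: str) -> str:
    # urls look like .../kamerstukken/stemmingsuitslagen/{besluit-uuid}
    return url.rstrip("/").rsplit("/", 1)[-1]
```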
## Error Handling Strategy
- **SyncFeed**: exponential backoff on 429/5xx; log and skip individual malformed entries;
checkpoint skiptoken to disk so walk can resume after crash
- **Body text fetch**: catch all per-URL exceptions, log, continue; motions without body
text fall back to Zaak.Onderwerp in COALESCE
- **DB update**: use DuckDB transactions per batch of 1000; rollback on failure
- **Missing Zaak/Document**: expected for procedural votes; log counts; these motions get
title = NULL → COALESCE falls back to "Aangenomen." as before
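The backoff and checkpoint pieces can be sketched as below (function names and the checkpoint file layout are assumptions; the real script decides where the checkpoint lives):

```python
import json
import random
from pathlib import Path

def backoff_delay(attempt, base=1.0, cap=60.0):
    # exponential backoff with jitter for 429/5xx: between half and
    # the full exponential step, capped so retries never sleep too long
    return min(cap, base * 2 ** attempt) * (0.5 + random.random() / 2)

def save_checkpoint(path: Path, category: str, skiptoken: str) -> None:
    # persist the last-seen skiptoken so a crashed walk can resume
    path.write_text(json.dumps({"category": category, "skiptoken": skiptoken}))

def load_checkpoint(path: Path):
    return json.loads(path.read_text()) if path.exists() else None
```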
## Testing Strategy
- Unit tests for each XML parser using hardcoded fixture XML strings
- Unit test for `build_title_map` with a small synthetic index
- Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
- After full run: query `SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.',
'Verworpen.', 'Gestaakt.')` — expect > 10,000
- After embeddings: spot-check cosine similarity between two related motions (same topic)
is higher than between unrelated motions
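The spot-check in the last bullet needs nothing more than plain cosine similarity (a sketch; the project may already have an equivalent helper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

The check passes when `cosine(related_a, related_b) > cosine(related_a, unrelated)` for a handful of hand-picked motion pairs.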
## Open Questions
- **Document–Zaak relationship**: The SyncFeed Document entity may reference multiple
Zaak IDs. For motions with multiple linked documents, we prefer the one with
Soort="Motie" on the Zaak. Edge cases may need manual inspection.
- **SyncFeed total record count**: Unknown until walked. Estimate 2,000–6,000 pages total
across 4 feeds. Could be more for Document/DocumentVersie.
- **Rate limits**: SyncFeed documentation doesn't specify limits. Start at 1 req/s,
increase if no 429s.
- **Body text coverage**: Not all motions have an associated kamerstuk document.
Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect
40–60% body text coverage.