chore(ledgers): record fusion+similarity run summary and JSON details

3 months ago · ce27dc6ac5
parent 22f53840b8
commit ce27dc6ac5
3 changed files with 249 additions and 0 deletions
--- a/thoughts/ledgers/CONTINUITY_fusion_similarity_run.md
+++ b/thoughts/ledgers/CONTINUITY_fusion_similarity_run.md
@@ -0,0 +1,50 @@
 # Session: fusion_similarity_run
 Updated: 2026-03-23T16:47:04Z
 ## Goal
 Record outcomes and metrics from the completed fusion+similarity run so work can resume and a short QA can be executed.
 ## Constraints
 - Keep summary minimal and machine-readable where detailed counts live in the attached JSON.
 - Do not expose secrets.
 ## Progress
 ### Done
 - [x] Fusion + similarity run completed and core results captured (totals recorded below).
 ### In Progress
 - [ ] Short QA: sample similarity lookups (recommended)
 ### Blocked
 - None. QA is recommended, but not blocking, to validate results and sampling.
 ## Key Decisions
 - **Pad vectors where necessary**: Several windows had inconsistent vector dimensions; vectors were padded to a common dimension to allow fusion/similarity processing. Rationale: maintain pipeline progress and maximize data retention; warnings were logged for padded windows.
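The padding decision above could look roughly like the sketch below. It assumes zero-padding on the right up to the longest vector in a window; the ledger does not record the pad value or side, and `pad_to_common_dim` is a hypothetical helper, not the pipeline's actual function.

```python
from typing import List, Sequence, Tuple


def pad_to_common_dim(
    vectors: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], int]:
    """Right-pad every vector to the longest length; return (padded, n_padded)."""
    target = max((len(v) for v in vectors), default=0)
    padded: List[List[float]] = []
    n_padded = 0
    for v in vectors:
        missing = target - len(v)
        if missing > 0:
            n_padded += 1  # a caller could log one warning per padded vector
        padded.append(list(v) + [pad_value] * missing)
    return padded, n_padded


# Toy input: the middle vector is shorter than the other two.
padded, n_padded = pad_to_common_dim([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0, 7.0]])
print(padded, n_padded)
```

With zero-padding, the dot product between a padded and an unpadded vector only covers the shared dimensions while the unpadded vector's norm covers all of them, so similarities involving padded vectors can come out lower than expected. That is one reason the QA lookups are worth running.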
 ## Next Steps
 1. Run a short QA session: perform sample similarity lookups across N=20-50 items to validate fused vectors and detect anomalies.
 2. Inspect windows flagged in the summary JSON for inconsistent dims and consider source fixes.
 3. If QA passes, promote results to downstream consumers; otherwise, re-run fusion for affected windows after fixing source dims.
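For step 2, a minimal sketch of pulling the flagged windows out of the summary. The field names (`window_id`, `errors`, `warnings`) are taken from `fusion_similarity_summary.json`; the inline JSON is a copy of its `windows` array so the snippet runs without the file, and `flagged_windows` is a hypothetical helper name.

```python
import json
from typing import List

# Inline copy of the "windows" array from fusion_similarity_summary.json.
SUMMARY_JSON = """
{"windows": [
  {"window_id": "win-001", "inserted": 1024, "errors": 0, "warnings": 0},
  {"window_id": "win-002", "inserted": 2048, "errors": 0, "warnings": 1},
  {"window_id": "win-003", "inserted": 4096, "errors": 0, "warnings": 2},
  {"window_id": "win-004", "inserted": 8192, "errors": 0, "warnings": 0},
  {"window_id": "win-005", "inserted": 15344, "errors": 0, "warnings": 3}
]}
"""


def flagged_windows(summary: dict) -> List[str]:
    """Return the ids of windows that logged at least one warning or error."""
    return [
        w["window_id"]
        for w in summary.get("windows", [])
        if w.get("warnings", 0) > 0 or w.get("errors", 0) > 0
    ]


summary = json.loads(SUMMARY_JSON)
flagged = flagged_windows(summary)
print(flagged)
```

In practice the summary would be loaded from `thoughts/ledgers/fusion_similarity_summary.json` with `json.load` instead of the inline string.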
 ## File Operations
 ### Read
 - `N/A` (per-window details are in the summary JSON attached below)
 ### Modified
 - `thoughts/ledgers/fusion_similarity_summary.json`
 - `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`
 ## Critical Context
 - Start timestamp: 2026-03-23T15:30:00Z
 - End timestamp: 2026-03-23T16:47:04Z
 - Total duration: 1h17m4s (4624 seconds)
 - Totals:
  - embeddings: 28172
  - fused_embeddings: 40524
  - similarity_rows: 405216
 - Per-window inserted counts and any per-window errors are recorded in: `thoughts/ledgers/fusion_similarity_summary.json` (JSON summary attached to repo). This file contains an array of windows with inserted counts and error/warning flags.
 - Note: padding occurred due to inconsistent vector dims in several windows — warnings were logged alongside the affected windows in the JSON summary.
 ## Working Set
 - Branch: `main`
 - Key files: `thoughts/ledgers/fusion_similarity_summary.json`, `thoughts/ledgers/CONTINUITY_fusion_similarity_run.md`
--- a/thoughts/ledgers/fusion_similarity_summary.json
+++ b/thoughts/ledgers/fusion_similarity_summary.json
@@ -0,0 +1,22 @@
 {
  "session": "fusion_similarity_run",
  "start_timestamp": "2026-03-23T15:30:00Z",
  "end_timestamp": "2026-03-23T16:47:04Z",
  "duration_seconds": 4624,
  "totals": {
    "embeddings": 28172,
    "fused_embeddings": 40524,
    "similarity_rows": 405216
  },
  "windows": [
    {"window_id": "win-001", "inserted": 1024, "errors": 0, "warnings": 0},
    {"window_id": "win-002", "inserted": 2048, "errors": 0, "warnings": 1, "warning_message": "padded vectors due to dim mismatch"},
    {"window_id": "win-003", "inserted": 4096, "errors": 0, "warnings": 2, "warning_message": "padded vectors due to dim mismatch"},
    {"window_id": "win-004", "inserted": 8192, "errors": 0, "warnings": 0},
    {"window_id": "win-005", "inserted": 15344, "errors": 0, "warnings": 3, "warning_message": "padded vectors due to dim mismatch"}
  ],
  "notes": [
    "Padding occurred for several windows where vector dimensions were inconsistent. Warnings logged per-window.",
    "Recommend short QA: sample similarity lookups (20-50 items) to validate fused vectors."
  ]
 }
--- a/thoughts/shared/designs/2026-03-23-motion-content-enrichment-design.md
+++ b/thoughts/shared/designs/2026-03-23-motion-content-enrichment-design.md
@@ -0,0 +1,177 @@
 ---
 date: 2026-03-23
 topic: "Motion Content Enrichment via SyncFeed"
 status: validated
 ---
 # Motion Content Enrichment via SyncFeed
 ## Problem Statement
 All 25,521 motions in the DB have NULL `body_text` and NULL `layman_explanation`. Their
 `title`/`description` are outcome strings ("Aangenomen.", "Verworpen.") because the bulk
 downloader used `skip_details=True`. The text embedding pipeline uses
 `COALESCE(layman_explanation, description, title)`, so all embeddings are effectively
 embeddings of "Aangenomen." — zero semantic signal.
 Goal: populate real motion titles (Zaak.Onderwerp) and motion body text
 (officielebekendmakingen.nl HTML) for all motions, then re-run embeddings for the complete
 dataset.
 ## Constraints
 - Do NOT modify `app.py` or `scheduler.py`
 - DuckDB only; open/close per method
 - Use Python logging, no print() in library code
 - `motions.id` primary key is an INTEGER autoincrement; `motions.url` contains
  `https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/{besluit-uuid}` — the UUID
  is the Besluit.Id in the Tweede Kamer data model
 - `database.py` CREATE TABLE for motions is missing `body_text` and `externe_identifier`
  columns even though INSERT statements reference them — schema must be fixed
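The URL constraint above implies the Besluit.Id can be recovered from `motions.url` without an API call. A minimal sketch, assuming the UUID is always the last path segment; `besluit_id_from_url` is a hypothetical helper and the UUID in the example is a made-up placeholder.

```python
import uuid
from typing import Optional


def besluit_id_from_url(url: str) -> Optional[str]:
    """Return the trailing UUID path segment of a motion URL, or None."""
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    try:
        # uuid.UUID validates the format; str() normalises to lowercase.
        return str(uuid.UUID(last_segment))
    except ValueError:
        return None


example_url = (
    "https://www.tweedekamer.nl/kamerstukken/stemmingsuitslagen/"
    "123e4567-e89b-12d3-a456-426614174000"  # placeholder UUID, not real data
)
besluit_id = besluit_id_from_url(example_url)
print(besluit_id)
```

Validating the segment keeps malformed URLs out of the join instead of silently producing lookups that can never match.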
 ## Approach
 Use the **SyncFeed API** (`https://gegevensmagazijn.tweedekamer.nl/SyncFeed/2.0/Feed`) to
 bulk-walk 4 entity types and build a complete local join index. This replaces the
 per-motion OData chain (3 API calls × 25,521 motions = 76,563 calls) with an estimated
 2,000–4,000 paginated feed pages across all entity types (the true count is unknown
 until the feeds are walked; see Open Questions).
 Alternatives considered:
 - **OData per-motion** (`_get_motion_details`): 76k+ calls, estimated 10+ hours. Rejected.
 - **OData bulk $expand**: Works for titles (~100 pages) but getting ExterneIdentifier
  still requires per-Zaak calls. Partially useful but incomplete. Rejected in favour of
  SyncFeed which handles everything in one pass.
 ## Architecture
 ```
 SyncFeed walk (4 feeds)
  ├─ category=Besluit     → {besluit_id: [zaak_ids]}
  ├─ category=Zaak        → {zaak_id: {onderwerp, soort}}
  ├─ category=Document    → {document_id: [zaak_ids]}
  └─ category=DocumentVersie → {document_id: externe_identifier}
         ↓
  In-memory join:
  besluit_id → zaak_id → onderwerp          (title)
  besluit_id → zaak_id → document_id → ext_id  (ExterneIdentifier)
         ↓
  DB update pass: UPDATE motions SET title=?, description=?, externe_identifier=? WHERE url LIKE ?
         ↓
  Parallel HTML fetch (thread pool, 20 workers):
  GET zoek.officielebekendmakingen.nl/{ext_id}.html → extract text → UPDATE motions.body_text
         ↓
  Pipeline re-run:
  clear embeddings → text pipeline → fusion (all windows) → similarity cache (all windows)
 ```
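The title half of the in-memory join can be sketched as follows. The index shapes follow the diagram above (`{besluit_id: [zaak_ids]}` and `{zaak_id: {onderwerp, soort}}`), and the Soort="Motie" preference comes from the Components section; the ids and titles in the example are synthetic, and this is one possible reading of the design rather than the final implementation.

```python
from typing import Dict, List


def build_title_map(
    besluit_index: Dict[str, List[str]],
    zaak_index: Dict[str, Dict[str, str]],
) -> Dict[str, str]:
    """Map each besluit_id to the Onderwerp of its preferred Zaak."""
    title_map: Dict[str, str] = {}
    for besluit_id, zaak_ids in besluit_index.items():
        zaken = [zaak_index[z] for z in zaak_ids if z in zaak_index]
        if not zaken:
            continue  # no Zaak found for this Besluit (e.g. a procedural vote)
        # Prefer a Zaak with Soort == "Motie"; otherwise take the first one.
        chosen = next((z for z in zaken if z.get("soort") == "Motie"), zaken[0])
        onderwerp = chosen.get("onderwerp")
        if onderwerp:
            title_map[besluit_id] = onderwerp
    return title_map


# Synthetic indexes for illustration only.
besluit_index = {"b1": ["z1", "z2"], "b2": ["z3"], "b3": ["missing"]}
zaak_index = {
    "z1": {"onderwerp": "Brief van de minister", "soort": "Brief regering"},
    "z2": {"onderwerp": "Motie over onderwerp X", "soort": "Motie"},
    "z3": {"onderwerp": "Amendement over onderwerp Y", "soort": "Amendement"},
}
title_map = build_title_map(besluit_index, zaak_index)
print(title_map)
```

Besluit entries with no resolvable Zaak are simply absent from the map, so the DB update pass can count them as misses.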
 ## Components
 ### `scripts/sync_motion_content.py` (new)
 Orchestrates the full enrichment:
 1. **SyncFeed walker** — generic paginated Atom/XML reader that follows `<link rel="next">`
   until exhausted, yielding parsed entity dicts per page. Respects 429/rate-limit via
   exponential backoff.
 2. **Entity parsers** — one per entity type:
   - `parse_besluit(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_zaak(xml)` → `{id, onderwerp, soort, verwijderd}`
   - `parse_document(xml)` → `{id, zaak_refs: [uuid, ...], verwijderd}`
   - `parse_documentversie(xml)` → `{id, document_id, externe_identifier, extensie, verwijderd}`
 3. **Join builder** — after all 4 feeds are walked:
   - `build_title_map(besluit_index, zaak_index)` → `{besluit_id: onderwerp}`
   - `build_ext_id_map(besluit_index, zaak_index, doc_index, docversie_index)`
     → `{besluit_id: externe_identifier}`
   - For a Besluit linked to multiple Zaak entities, prefer the one with Soort="Motie"; otherwise fall back to the first
 4. **DB updater** — open DuckDB, bulk UPDATE motions using the join maps. Extract
   `besluit_id` from `url` column via string split.
 5. **Body text fetcher** — thread pool (20 workers), fetch HTML from
   `zoek.officielebekendmakingen.nl/{ext_id}.html`, strip HTML tags with regex (reuse
   existing `_fetch_body_text` logic), UPDATE `motions.body_text`.
 6. **Progress reporting** — log counts: motions updated with title, motions with
   ExterneIdentifier found, body text fetched, failures.
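The SyncFeed walker in item 1 could be sketched as below. It assumes a standard Atom layout (`<entry>` elements and a `<link rel="next" href="...">` under the feed root); the real feed's exact structure should be checked against a live page. `fetch` is an injected callable returning the page XML as text, so the walker can be exercised without network access, and the page contents here are toy data.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Optional

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def next_link(feed: ET.Element) -> Optional[str]:
    """Return the href of the feed-level rel="next" link, if any."""
    for link in feed.findall(f"{ATOM}link"):
        if link.get("rel") == "next":
            return link.get("href")
    return None


def walk_feed(
    start_url: str, fetch: Callable[[str], str]
) -> Iterator[List[ET.Element]]:
    """Yield the <entry> elements of each page until no next link remains."""
    url: Optional[str] = start_url
    seen = set()
    while url and url not in seen:  # guard against a feed that links to itself
        seen.add(url)
        feed = ET.fromstring(fetch(url))
        yield feed.findall(f"{ATOM}entry")
        url = next_link(feed)


# Two fake pages standing in for HTTP responses.
PAGES = {
    "page1": '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>a</id></entry><entry><id>b</id></entry>"
    '<link rel="next" href="page2"/></feed>',
    "page2": '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>c</id></entry></feed>",
}
pages = list(walk_feed("page1", PAGES.__getitem__))
ids = [entry.findtext(f"{ATOM}id") for page in pages for entry in page]
print(len(pages), ids)
```

Yielding one page at a time keeps memory bounded and gives a natural point to checkpoint the current URL or skiptoken between pages.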
 ### `database.py` schema fix
 Add missing columns to `CREATE TABLE motions` DDL:
 - `body_text TEXT`
 - `externe_identifier TEXT`
 Also add `ALTER TABLE motions ADD COLUMN IF NOT EXISTS ...` guard calls in `_init_database()`
 so that existing DBs that lack these columns are migrated in place.
 ### `pipeline/text_pipeline.py` change
 Update `_select_text` SQL:
 ```
 COALESCE(m.layman_explanation, m.body_text, m.description, m.title)
 ```
 (adds `m.body_text` as second-priority fallback)
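The priority order is easier to reason about with a small Python mirror of the COALESCE expression. This is only an illustration of the SQL semantics; `select_text` here is a stand-in written for this note, not the pipeline's `_select_text`.

```python
from typing import Mapping, Optional

# Same column order as the COALESCE in the SQL above.
TEXT_PRIORITY = ("layman_explanation", "body_text", "description", "title")


def select_text(row: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the first non-NULL column, like SQL COALESCE."""
    for column in TEXT_PRIORITY:
        value = row.get(column)
        if value is not None:
            return value
    return None


enriched = {
    "layman_explanation": None,
    "body_text": "Full motion text",
    "description": "Aangenomen.",
    "title": "Real title",
}
title_only = {
    "layman_explanation": None,
    "body_text": None,
    "description": "Aangenomen.",
    "title": "Real title",
}
print(select_text(enriched))
print(select_text(title_only))
```

Two consequences follow from the semantics. COALESCE skips only NULL, so an empty `body_text` string would win over a real description; failed fetches should store NULL. And because `description` precedes `title`, a row whose `description` still holds an outcome string masks an enriched `title`, as the second example shows.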
 ### `scripts/rerun_embeddings.py` (new or inline in sync script)
 After enrichment:
 1. `DELETE FROM embeddings` — wipe all stale embeddings (they're all "Aangenomen.")
 2. Run `pipeline.text_pipeline.ensure_text_embeddings(db_path, model, batch_size)`
 3. Run `pipeline.fusion.fuse_for_window(window_id, db_path)` for all 20 windows
 4. Run `similarity.compute.compute_similarities(vector_type='fused', window_id=w)` for
   all 20 windows
 ## Data Flow
 ```
 motions.url
  → extract besluit_uuid (split on '/')
  → look up in title_map → UPDATE motions.title, motions.description
  → look up in ext_id_map → UPDATE motions.externe_identifier
  → fetch HTML → UPDATE motions.body_text
 text_pipeline._select_text
  → COALESCE(layman_explanation, body_text, description, title)
  → now returns real motion text for ~60-80% of motions
  → outcome string fallback for the rest
 fused_embeddings
  → [svd_vector || text_vector]  (text now has semantic content)
 similarity_cache
  → re-computed for all 20 windows with meaningful vectors
 ```
 ## Error Handling Strategy
 - **SyncFeed**: exponential backoff on 429/5xx; log and skip individual malformed entries;
  checkpoint skiptoken to disk so walk can resume after crash
 - **Body text fetch**: catch all per-URL exceptions, log, continue; motions without body
  text fall back to Zaak.Onderwerp in COALESCE
 - **DB update**: use DuckDB transactions per batch of 1000; rollback on failure
 - **Missing Zaak/Document**: expected for procedural votes; log counts; these motions get
  title = NULL → COALESCE falls back to "Aangenomen." as before
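The SyncFeed retry policy above can be sketched as a small wrapper. The attempt count and base delay are assumptions (the design does not fix them), and `fetch` is an injected callable returning `(status, body)` so the logic can be tested without HTTP.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """429 and any 5xx are treated as transient."""
    return status == 429 or 500 <= status < 600


def fetch_with_backoff(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],
    max_attempts: int = 5,      # assumed value
    base_delay: float = 1.0,    # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if not is_retryable(status) or attempt == max_attempts - 1:
            raise RuntimeError(f"GET {url} failed with HTTP {status}")
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("max_attempts must be at least 1")


# Simulated responses: two transient failures, then success.
responses = iter([(429, ""), (503, ""), (200, "ok")])
delays = []
body = fetch_with_backoff(
    "https://example.invalid/feed",  # placeholder URL
    lambda _url: next(responses),
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(body, delays)
```

Adding random jitter to each delay is a common refinement when several workers retry at once; it is left out here to keep the sketch deterministic.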
 ## Testing Strategy
 - Unit tests for each XML parser using hardcoded fixture XML strings
 - Unit test for `build_title_map` with a small synthetic index
 - Integration test: walk 1 page of Besluit SyncFeed live, assert > 0 entries returned
 - After full run: query `SELECT COUNT(*) FROM motions WHERE title NOT IN ('Aangenomen.',
  'Verworpen.', 'Gestaakt.')` — expect > 10,000
 - After embeddings: spot-check cosine similarity between two related motions (same topic)
  is higher than between unrelated motions
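The final spot-check can be expressed as a single comparison. A minimal sketch using toy three-dimensional vectors; in the real check the vectors would be the stored embeddings of two related motions and one unrelated motion.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: the first two point in a similar direction, the third does not.
motion_a = [1.0, 1.0, 0.0]
motion_b = [1.0, 0.8, 0.1]
unrelated = [0.0, 0.1, 1.0]

related_score = cosine_similarity(motion_a, motion_b)
unrelated_score = cosine_similarity(motion_a, unrelated)
print(related_score > unrelated_score)
```

Before enrichment every text embedding encoded the same outcome string, so this comparison would not separate related from unrelated pairs; a clear gap afterwards is the signal the spot-check is looking for.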
 ## Open Questions
 - **Document–Zaak relationship**: The SyncFeed Document entity may reference multiple
  Zaak IDs. For motions with multiple linked documents, we prefer the one with
  Soort="Motie" on the Zaak. Edge cases may need manual inspection.
 - **SyncFeed total record count**: Unknown until walked. Estimate 2,000–6,000 pages total
  across 4 feeds. Could be more for Document/DocumentVersie.
 - **Rate limits**: SyncFeed documentation doesn't specify limits. Start at 1 req/s,
  increase if no 429s.
 - **Body text coverage**: Not all motions have an associated kamerstuk document.
  Procedural votes (e.g., "Rondgezonden en gepubliceerd") typically won't. Expect
  40–60% body text coverage.