date	topic	status
2026-03-21	Reuse motions as a guided policy explorer	draft

Problem Statement

We want to repurpose existing "motions" data so it becomes a lightweight, discovery-driven way for users to explore policy positions and discover related content. This is not a full proposal system; it's a guided exploration and bookmarking flow that leverages our existing ingestion, summarization, embeddings, and session voting work.

Why now: We already ingest motions, generate layman explanations, compute embeddings, and store per-session votes. Reusing those building blocks gives high user value with modest effort.

Constraints

Non-negotiables and technical limits:

Use the existing database schema where possible (motions table, embeddings table, user_sessions). Do not require a new external vector DB for MVP.
Keep the Streamlit UI model (app.py) and session-based votes intact for the initial rollout.
Avoid breaking migrations: rely on existing migrations and add new ones when necessary (no forced drops).
Respect current error-handling posture: network calls can fail; system must degrade gracefully.

Chosen Approach

I'm choosing a "Guided Policy Explorer" approach because it reuses the highest-value existing pieces (summaries, embeddings, session voting) and delivers a clear UX that fits the current codebase. This gives immediate product value with low risk.

Core idea: present curated short sessions and motion detail pages that combine the existing layman explanation, party-match results, and semantic "related motions" powered by stored embeddings.

Alternatives considered:

"Motion-as-Proposal platform": full lifecycle (draft → comment → vote). Rejected for MVP due to high complexity and data model changes.
"Motion Digest / Research Assistant": read-only pages and newsletters. Lower effort, but less interactive and reuses fewer of our current session features.

Architecture

High-level view (pieces are existing unless marked "New"):

Ingest: api_client.py + scraper.py gather motions and create motion records in the DB.
Persist: database.py stores motions, embeddings, and user_sessions.
Enrichment: summarizer.py + ai_provider.py generate layman explanations and embeddings.
Background jobs: scheduler.py runs ingest, summarization, and periodic clustering.
UI: app.py current Streamlit session flow — extend with "Explore" and "Motion detail" pages.
New: small clusterer / similarity API to compute and cache related-motion lists per motion.

Key Components & Responsibilities

Motion Ingest (existing): keep ingest as-is; add metadata flags (e.g., curated, candidate).
Motion Store (existing): motions table + embeddings table; add an events/audit table for user actions and important state transitions.
Summarizer / Embedding Worker (existing): scheduled job that ensures motions have layman_explanation and embeddings; add retry/backoff and logging.
Similarity service (new): computes nearest neighbors using stored vectors in-process for MVP and caches results in a small table. Swap to a vector index later if needed.
Session & Voting (existing): continue using user_sessions JSON blob for individual sessions; add optional event log entries for each vote.
UI (update): add "Explore" landing, motion detail view with layman text, party-match snapshot, related motions, and bookmark/flag actions. Reuse Streamlit components.
Admin tooling (new): migration scripts, a CLI to recompute embeddings/similarity, and an audit query helper.
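
As a rough illustration of the in-process similarity service described above, the following is a minimal sketch of a brute-force cosine nearest-neighbor lookup. It assumes embeddings are already loaded into a dict keyed by motion id; the function names `cosine_similarity` and `related_motions` are hypothetical and do not exist in the codebase.

```python
import heapq
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_motions(
    motion_id: Hashable,
    embeddings: Dict[Hashable, Sequence[float]],
    k: int = 5,
) -> List[Tuple[Hashable, float]]:
    """Return up to k (motion_id, score) pairs, most similar first.

    Brute force: compares the query vector against every other stored
    vector. The query motion itself is excluded from the result.
    """
    query = embeddings[motion_id]
    scored = (
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != motion_id
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Each lookup is linear in the number of stored motions, which is why the design caches the result per motion rather than computing it on every page view.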

Data Flow

Ingest job (api_client/scraper) produces motion records and calls db.insert_motion.
Summarizer worker picks up motions without layman_explanation or embeddings, calls ai_provider, and writes layman_explanation + embeddings.
Clusterer/similarity job computes related-motion lists using stored embeddings and writes them to a cache table.
UI "Explore" shows curated motion lists; "Motion detail" reads motion, layman_explanation, party-match snapshot, and cached related motions.
User vote actions update user_sessions and also append an event to the audit table for traceability.
Background analytics (optional) reuses the event/audit table and embeddings for offline insights.
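
The vote step in the flow above could look roughly like this. It is a sketch using `sqlite3` and a deliberately simplified schema: the column layout of `user_sessions`, the `user_events` table, and the `record_vote` helper are all assumptions for illustration, not the existing code.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical minimal schema; the real user_sessions and audit tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS user_sessions (
    session_id TEXT PRIMARY KEY,
    data       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS user_events (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    motion_id  INTEGER NOT NULL,
    payload    TEXT NOT NULL,
    created_at TEXT NOT NULL
);
"""


def record_vote(conn: sqlite3.Connection, session_id: str, motion_id: int, vote: str) -> None:
    """Update the session JSON blob and append a 'vote' event in one transaction."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits both writes together, rolls back if either fails
        row = conn.execute(
            "SELECT data FROM user_sessions WHERE session_id = ?", (session_id,)
        ).fetchone()
        data = json.loads(row[0]) if row else {}
        data.setdefault("votes", {})[str(motion_id)] = vote
        conn.execute(
            "INSERT OR REPLACE INTO user_sessions (session_id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        # The event row is never updated afterwards: the table is append-only.
        conn.execute(
            "INSERT INTO user_events (session_id, event_type, motion_id, payload, created_at) "
            "VALUES (?, 'vote', ?, ?, ?)",
            (session_id, motion_id, json.dumps({"vote": vote}), now),
        )
```

The blob keeps only the latest vote per motion, while the event table retains every vote in order, which is what makes the audit trail useful.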

Error Handling Strategy

External calls: add retries with exponential backoff for AI provider and external APIs. Failures set a marker (e.g., summary_missing) and the system continues.
Missing embeddings: UI gracefully disables "related motions" and offers "compute on demand".
Idempotency: make insert_motion idempotent by URL/external id check at DB layer; use optimistic handling for duplicates.
Concurrency: avoid read-modify-write races by writing user events (append-only) and deriving session state from events when race-prone updates are detected.
Observability: replace prints with structured logging (module-level logger) and add basic metrics for worker errors, API failures, and queue lag.
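
One way to get the idempotent insert mentioned above is to let the database enforce uniqueness and ignore duplicate rows. The sketch below assumes a `UNIQUE` constraint on an `external_id` column and a simplified `insert_motion` signature; the real schema and function signature in database.py may differ.

```python
import sqlite3
from typing import Optional

# Hypothetical, simplified motions table; only the UNIQUE constraint matters here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS motions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    external_id TEXT NOT NULL UNIQUE,
    title       TEXT NOT NULL,
    url         TEXT
);
"""


def insert_motion(
    conn: sqlite3.Connection,
    external_id: str,
    title: str,
    url: Optional[str] = None,
) -> int:
    """Insert a motion if it is new and return its row id either way.

    Calling this twice with the same external_id leaves exactly one row,
    so a retried or overlapping ingest run does not create duplicates.
    """
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO motions (external_id, title, url) VALUES (?, ?, ?)",
            (external_id, title, url),
        )
    row = conn.execute(
        "SELECT id FROM motions WHERE external_id = ?", (external_id,)
    ).fetchone()
    return row[0]
```

Note that this variant keeps the first-seen row unchanged; if re-ingested motions should overwrite earlier fields, an upsert would be needed instead.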

Testing Strategy

Unit tests: DB helpers (insert_motion, store_embedding, similarity cache), summarizer functions (mock ai_provider), and session vote logic.
Migration tests: follow the existing pattern of applying migration SQL in a temp DB and asserting schema.
Integration tests: end-to-end ingest → summarize → embedding → similarity → UI-read path in CI (use monkeypatch for AI calls).
Load tests: simulate a few thousand embedding-search calls against the in-process search to validate performance assumptions for MVP.
Acceptance: confirm UX flows: Explore session, Motion detail, Vote -> party match, Related motions populated.
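
To illustrate the "mock ai_provider" unit tests, here is a small sketch using `unittest.mock.Mock` from the standard library (pytest's `monkeypatch` could be used to swap in the same fake on the real module). The helper `ensure_layman_explanation` and the `provider.summarize` method are hypothetical stand-ins, since the actual summarizer.py and ai_provider.py interfaces are not shown here.

```python
from unittest.mock import Mock


def ensure_layman_explanation(motion: dict, provider) -> dict:
    """Hypothetical stand-in for a summarizer helper.

    Fills in layman_explanation via the provider, or sets the
    summary_missing marker if the provider call fails.
    """
    if motion.get("layman_explanation"):
        return motion  # already summarized: do not call the provider again
    try:
        motion["layman_explanation"] = provider.summarize(motion["body_text"])
        motion["summary_missing"] = False
    except Exception:
        motion["layman_explanation"] = None
        motion["summary_missing"] = True
    return motion


def test_summary_written_on_success():
    provider = Mock()
    provider.summarize.return_value = "Plain-language summary."
    motion = ensure_layman_explanation({"body_text": "Long legal text"}, provider)
    assert motion["layman_explanation"] == "Plain-language summary."
    assert motion["summary_missing"] is False
    provider.summarize.assert_called_once_with("Long legal text")


def test_marker_set_when_provider_fails():
    provider = Mock()
    provider.summarize.side_effect = TimeoutError("provider down")
    motion = ensure_layman_explanation({"body_text": "Long legal text"}, provider)
    assert motion["summary_missing"] is True
    assert motion["layman_explanation"] is None


def test_existing_summary_skips_provider():
    provider = Mock()
    motion = {"body_text": "Long legal text", "layman_explanation": "Done."}
    ensure_layman_explanation(motion, provider)
    provider.summarize.assert_not_called()
```

Because the provider is a plain object passed in as an argument, the tests never touch the network and can run in CI without credentials.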

High-level Plan & Estimates

Assumptions: one full-stack engineer (Python + Streamlit) and one part-time reviewer. All estimates are rough.

Milestone 0 — Validate & quick discovery (1 day)

Locate the user's added markdown plan and extract the exact requirements. (Assumption: the file exists in thoughts/shared; if it does not, validate by searching the repository.)

Milestone 1 — MVP (8–12 engineer days)

Add similarity cache table and migration.
Summarizer: make embedding generation robust with retries and store vectors.
Clusterer job: compute and cache related motions.
UI: Explore landing, Motion detail page, related motion UI, bookmark/flag button.
Add event/audit table and write events on user votes and bookmarks.
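
For the similarity cache table in this milestone, a minimal sketch might look like the following. The table name `motion_similarity_cache`, its columns, and the helper functions are assumptions for illustration; the related list is stored as JSON to avoid introducing a second table for MVP.

```python
import json
import sqlite3
from typing import List, Sequence, Tuple

# Hypothetical migration; CREATE TABLE IF NOT EXISTS keeps it safe to re-run.
MIGRATION = """
CREATE TABLE IF NOT EXISTS motion_similarity_cache (
    motion_id   INTEGER PRIMARY KEY,
    related     TEXT NOT NULL,
    computed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def apply_migration(conn: sqlite3.Connection) -> None:
    """Create the cache table if it does not exist yet (no drops)."""
    conn.executescript(MIGRATION)


def cache_related(
    conn: sqlite3.Connection,
    motion_id: int,
    related: Sequence[Tuple[int, float]],
) -> None:
    """Store (or overwrite) the related-motion list for one motion as JSON."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO motion_similarity_cache (motion_id, related) "
            "VALUES (?, ?)",
            (motion_id, json.dumps(list(related))),
        )


def get_related(conn: sqlite3.Connection, motion_id: int) -> List[Tuple[int, float]]:
    """Return cached (motion_id, score) pairs, or an empty list on a cache miss."""
    row = conn.execute(
        "SELECT related FROM motion_similarity_cache WHERE motion_id = ?",
        (motion_id,),
    ).fetchone()
    if row is None:
        return []
    return [(int(mid), float(score)) for mid, score in json.loads(row[0])]
```

An empty list on a cache miss lets the UI disable the "related motions" panel without special-casing errors.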

Milestone 2 — Hardening & instrumentation (3–5 engineer days)

Replace prints with structured logging across touched modules.
Add migration tests and CI integration tests (mock AI).
Add health metrics & basic alerting for worker failures.

Milestone 3 — Polish & UX feedback (3–5 engineer days)

UX tweaks, performance tuning, compute on-demand fallback for embeddings, documentation, admin CLI.

Total MVP + polish: ~2–3 weeks of focused work.

Risks & Mitigations

Risk: Naive in-process embedding search will not scale. Mitigation: cache nearest neighbors per motion and plan a migration path to a vector index.
Risk: AI provider flakiness. Mitigation: retries, timeouts, and clear UI fallback. Tests must mock provider in CI.
Risk: Race conditions on session votes. Mitigation: append-only event log and derive authoritative session view from events when needed.
Risk: Schema drift and missing migrations. Mitigation: add migration tests and document required migrations in repo.
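
The retry mitigation for provider flakiness can be sketched as a small wrapper with exponential backoff and jitter. The name `call_with_retries` and its default values are assumptions chosen for illustration, not tuned settings; the `sleep` parameter exists so tests can run without real delays.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff.

    The delay doubles after each failed attempt (capped at max_delay) and
    gets random jitter added so concurrent workers do not retry in lockstep.
    The last failure is re-raised so the caller can set a fallback marker.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

A worker would wrap only the network call in this helper and, when it finally raises, record a marker such as summary_missing and move on to the next motion.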

Open Questions

Which exact user journeys do we want first (single-session discover vs. persistent account/bookmarking)?
Do we want bookmarks persisted globally or per-session only? (Privacy implications.)
What's acceptable latency for "related motions" — precomputed nightly vs. near-real-time?
Any policy/legal ban on storing full body_text or on long-term retention of user votes?

I'm proceeding to create the design doc file at thoughts/shared/designs/2026-03-21-motions-guided-explorer-design.md and will spawn the implementation planner next. Interrupt if you want changes to the approach or scope now.
