---
title: Agent-Native Architecture Plan for Stemwijzer
type: refactor
status: active
date: 2026-05-01
origin: STRATEGY.md (agent-native architecture track)
---
# Agent-Native Architecture Plan for Stemwijzer
## Overview
Stemwijzer is a data-heavy analytical application with three surfaces: a Streamlit voting UI, a data pipeline (OData ingestion → DuckDB → SVD/embedding computation), and an analytics explorer. The agent-native architecture track aims to let an agent do anything a human operator can, whether that is running the pipeline, diagnosing drift, or answering research questions about parliamentary voting patterns.

**Current state:** The codebase is human-operated. Scripts are run manually, pipeline status is checked by eye, and analysis requires writing Python/DuckDB queries.

**Target state:** An agent with access to atomic primitives can run the pipeline, diagnose issues, generate reports, and answer open-ended questions about the data, operating in a loop until the desired outcome is achieved.

---
## Problem Frame
- **Pipeline operators** need to know when data is stale, why SVD vectors look wrong, or whether the similarity cache is healthy. Currently this requires manually running scripts and interpreting output.
- **Analysts/researchers** want to ask questions like "Which parties shifted most on economic axes between 2020 and 2024?" Currently this requires writing DuckDB queries and Python analysis code.
- **Developers** need to understand pipeline state, verify data integrity, and troubleshoot ingestion issues. Currently this requires reading logs and running diagnostics manually.
- **Content maintainers** need to verify that SVD labels match actual voting patterns, check motion coverage, and validate layman explanations. Currently these checks are done ad hoc.
---
## Requirements Trace
- R1. The agent can achieve anything a pipeline operator can achieve (parity)
- R2. The agent can answer open-ended analytical questions about parliamentary data (emergent capability)
- R3. The agent can diagnose pipeline health and suggest remediation (self-service operations)
- R4. The agent can generate and validate content (SVD labels, motion summaries)
- R5. New capabilities can be added by writing prompts, not code (composability)
---
## Scope Boundaries
- **In scope:** Agent primitives for data operations, pipeline control, analysis, and diagnostics
- **Deferred:** Real-time agent UI inside Streamlit (future phase: add a chat interface to the explorer)
- **Deferred:** Autonomous pipeline scheduling (`scheduler.py` exists, but agent control of it is deferred to v2)
- **Not working on:** Natural language to SQL for end users (this plan targets agent operators, not voter-facing features)
---
## Key Technical Decisions
- **Files as universal interface:** DuckDB is already file-based (`data/motions.db`). The agent's workspace is the repo itself. Logs, reports, and analysis outputs are files the agent writes and the human reads.
- **Database tools over file tools for structured data:** For querying motions, votes, and embeddings, the agent needs query primitives that wrap DuckDB SQL (U1), not raw file operations.
- **Pipeline as state machine:** The pipeline has discrete stages (ingestion → vote extraction → SVD → text embeddings → fusion → similarity). The agent needs stage-aware tools, not just "run everything."
- **Shared workspace:** Agent and human operate on the same `data/motions.db`, the same `thoughts/explorer/` outputs, the same `docs/solutions/` knowledge base.
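
A minimal sketch of the stage-aware model, assuming the six stages above run strictly in sequence and each depends only on its predecessor. The stage identifiers and both helper functions are hypothetical, not taken from `pipeline/run_pipeline.py`.

```python
from __future__ import annotations

# Hypothetical stage identifiers, in the order listed in the plan. This assumes
# a strictly linear pipeline where each stage depends only on the one before it.
STAGES = (
    "ingestion",
    "vote_extraction",
    "svd",
    "text_embeddings",
    "fusion",
    "similarity",
)


def next_runnable(completed: set[str]) -> str | None:
    """Return the first stage that has not completed, or None when all are done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None


def stages_to_rerun(stale_stage: str) -> list[str]:
    """Return the stale stage plus every downstream stage that consumes its output."""
    if stale_stage not in STAGES:
        raise ValueError(f"unknown stage {stale_stage!r}; expected one of {STAGES}")
    return list(STAGES[STAGES.index(stale_stage):])
```

For example, `stages_to_rerun("svd")` yields `["svd", "text_embeddings", "fusion", "similarity"]`, which is the re-run list an agent would propose when SVD vectors are stale. If text embeddings turn out not to depend on SVD, the tuple would need to become an explicit dependency graph.
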
---
## Implementation Units
- [ ] U1. **Database query primitives**
  - **Goal:** Give the agent structured access to the DuckDB database
  - **Requirements:** R1, R2, R4
  - **Dependencies:** None
  - **Files:**
    - Create: `agent_tools/database.py`
    - Test: `tests/agent_tools/test_database_tools.py`
  - **Approach:** Wrap DuckDB queries as atomic tools:
    - `query_motions(filter, limit, order)` → returns motion rows as JSON
    - `query_votes(motion_id, party)` → returns vote counts
    - `query_svd_vectors(window_id, entity_type)` → returns vectors
    - `query_party_positions(window_id)` → returns party axis scores
    - `query_pipeline_status()` → returns freshness metrics from health checks
  - **Patterns to follow:** `health/checks.py` already has DB query patterns; `analysis/explorer_data.py` has read-only query patterns
  - **Test scenarios:**
    - Happy path: query returns valid JSON for known filters
    - Edge case: empty result set returns `[]`, not an error
    - Error path: invalid SQL/filter returns structured error with suggestion
  - **Verification:** Agent can answer "How many motions were there in 2024?" using only the tool
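
A minimal sketch of `query_motions`, assuming a hypothetical `motions` table with `id`, `title`, `date`, and `party` columns; the plan's `filter` argument is named `filters` here to avoid shadowing the Python builtin. It accepts any DB-API-style connection, so production code could pass `duckdb.connect("data/motions.db", read_only=True)` while tests pass an in-memory SQLite connection. The allow-list, error wording, and default `limit` are assumptions.

```python
from __future__ import annotations

import json
from typing import Any

# Hypothetical allow-list of columns the agent may filter and sort on.
ALLOWED_COLUMNS = frozenset({"id", "title", "date", "party"})


def _error(message: str, suggestion: str, retryable: bool) -> str:
    """Serialize the structured error shape used by all agent tools."""
    return json.dumps({"error": message, "suggestion": suggestion, "retryable": retryable})


def query_motions(con, filters: dict[str, Any] | None = None,
                  limit: int = 50, order: str = "date") -> str:
    """Return matching rows from a hypothetical `motions` table as a JSON array."""
    filters = filters or {}
    # Identifiers cannot be bound as parameters, so validate them before building SQL.
    unknown = sorted((set(filters) | {order}) - ALLOWED_COLUMNS)
    if unknown:
        return _error(f"unknown column(s): {', '.join(unknown)}",
                      f"use one of: {', '.join(sorted(ALLOWED_COLUMNS))}", retryable=True)
    if not isinstance(limit, int) or limit < 1:
        return _error(f"invalid limit {limit!r}", "pass a positive integer", retryable=True)

    sql = "SELECT * FROM motions"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    sql += f" ORDER BY {order} LIMIT ?"

    try:
        # Values are always bound through `?` placeholders, never interpolated.
        cur = con.execute(sql, [*filters.values(), limit])
        columns = [d[0] for d in cur.description]
        rows = [dict(zip(columns, row)) for row in cur.fetchall()]
    except Exception as exc:  # report DB failures as data the agent can reason about
        return _error(str(exc), "check filter values and that the database is reachable",
                      retryable=False)
    # An empty result serializes to "[]"; default=str handles dates and decimals.
    return json.dumps(rows, default=str)
```

Column names are checked against the allow-list because identifiers cannot be bound as SQL parameters; filter values always go through `?` placeholders, which keeps agent-supplied input from being executed as SQL.
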
- [ ] U2. **Pipeline control primitives**
  - **Goal:** Let the agent run, monitor, and diagnose pipeline stages
  - **Requirements:** R1, R3
  - **Dependencies:** U1
  - **Files:**
    - Create: `agent_tools/pipeline.py`
    - Test: `tests/agent_tools/test_pipeline_tools.py`
  - **Approach:** Stage-aware pipeline tools:
    - `pipeline_run_stage(stage, window_id, dry_run)` → runs one stage, returns status
    - `pipeline_run_full(dry_run)` → orchestrates all stages with dependency ordering
    - `pipeline_check_health()` → returns health report (reuses `health/` module)
    - `pipeline_get_logs(stage, lines)` → returns recent logs for a stage
    - `pipeline_validate_output(stage)` → checks that the stage's output exists and passes basic sanity checks
  - **Patterns to follow:** `pipeline/run_pipeline.py` has the stage orchestration; `scripts/health_check.py` has the CLI pattern
  - **Test scenarios:**
    - Happy path: dry-run returns planned actions without executing
    - Integration: running `pipeline_run_stage("svd", "2024", dry_run=False)` produces expected `svd_vectors` rows
    - Error path: running a stage with missing dependencies returns a clear error
  - **Verification:** Agent can diagnose "Why are SVD vectors stale?" by checking health, reading logs, and suggesting which stage to re-run
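
A minimal sketch of `pipeline_run_stage` covering the dry-run and missing-dependency scenarios above. It assumes the linear stage order listed under Key Technical Decisions and takes two hypothetical keyword arguments: `completed`, the stages already done for the window, and `runner`, a callable that actually executes a stage. Injecting `runner` keeps the tool testable without touching `data/motions.db`.

```python
from __future__ import annotations

from collections.abc import Callable, Collection

# Hypothetical stage identifiers, assuming a strictly linear pipeline.
STAGES = ("ingestion", "vote_extraction", "svd", "text_embeddings", "fusion", "similarity")


def pipeline_run_stage(stage: str, window_id: str, dry_run: bool = True, *,
                       completed: Collection[str] = (),
                       runner: Callable[[str, str], dict] | None = None) -> dict:
    """Plan or run one stage and return a JSON-serializable status or error dict."""
    if stage not in STAGES:
        return {"error": f"unknown stage {stage!r}",
                "suggestion": f"use one of: {', '.join(STAGES)}", "retryable": True}

    # Under the linear assumption, every earlier stage must be complete for this window.
    missing = [s for s in STAGES[:STAGES.index(stage)] if s not in completed]
    if missing:
        return {"error": f"{stage!r} for window {window_id!r} is missing: {', '.join(missing)}",
                "suggestion": f"run {missing[0]!r} first", "retryable": True}

    if dry_run:
        # Describe the planned action without executing anything.
        return {"status": "planned", "stage": stage, "window_id": window_id}

    if runner is None:
        return {"error": "no runner configured",
                "suggestion": "call with dry_run=True or pass a runner", "retryable": False}
    return {"status": "done", "stage": stage, "window_id": window_id,
            "result": runner(stage, window_id)}
```

In a real module, `completed` would presumably be derived from `pipeline_check_health()` or `query_pipeline_status()`, and `runner` would call into `pipeline/run_pipeline.py`; both wirings are assumptions.
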
- [ ] U3. **Analysis and report generation primitives**
  - **Goal:** Let the agent perform analytical tasks and write reports
  - **Requirements:** R2, R4
  - **Dependencies:** U1
  - **Files:**
    - Create: `agent_tools/analysis.py`
    - Create: `agent_tools/reports.py`
    - Test: `tests/agent_tools/test_analysis_tools.py`
  - **Approach:**
    - `analyze_party_shift(party, window_start, window_end, metric)` → computes and returns shift data
    - `analyze_axis_stability(component, windows)` → returns stability scores
    - `generate_report(type, parameters, output_path)` → writes markdown report to `reports/`
    - `validate_svd_labels(component)` → compares theme labels to actual party positions
  - **Patterns to follow:** `analysis/political_axis.py`, `scripts/motion_drift.py`, `scripts/validate_svd_themes.py`
  - **Test scenarios:**
    - Happy path: `analyze_party_shift` returns structured data for a known party
    - Integration: `generate_report("drift", {"windows": ["2020", "2024"]})` produces valid markdown
    - Edge case: requesting analysis for a nonexistent window returns an empty result
  - **Verification:** Agent can answer "Which parties shifted most on economic axes?" by running analysis and summarizing results
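
A minimal sketch of the shift computation behind `analyze_party_shift`, plus a hypothetical `rank_party_shifts` helper for the "which parties shifted most" question. It assumes party positions arrive as a dict of window → party → list of axis scores (an assumed shape for `query_party_positions` output) and implements two assumed metrics: Euclidean distance and the largest single-axis move.

```python
from __future__ import annotations

import math


def analyze_party_shift(positions: dict, party: str, window_start: str,
                        window_end: str, metric: str = "euclidean") -> dict:
    """Measure how far `party` moved between two windows.

    `positions` maps window_id -> party -> list of axis scores (hypothetical
    shape). Returns {} when either window or the party is missing.
    """
    start = positions.get(window_start, {}).get(party)
    end = positions.get(window_end, {}).get(party)
    if start is None or end is None:
        return {}
    if len(start) != len(end):
        return {"error": "windows have different numbers of axes",
                "suggestion": "compare windows computed with the same SVD rank",
                "retryable": False}

    deltas = [b - a for a, b in zip(start, end)]
    if metric == "euclidean":
        shift = math.dist(start, end)
    elif metric == "max_axis":
        shift = max((abs(d) for d in deltas), default=0.0)
    else:
        return {"error": f"unknown metric {metric!r}",
                "suggestion": "use 'euclidean' or 'max_axis'", "retryable": True}
    return {"party": party, "window_start": window_start, "window_end": window_end,
            "metric": metric, "shift": shift, "axis_deltas": deltas}


def rank_party_shifts(positions: dict, window_start: str, window_end: str,
                      top_n: int = 5) -> list[dict]:
    """Return the `top_n` parties present in both windows, largest shift first."""
    parties = positions.get(window_start, {}).keys() & positions.get(window_end, {}).keys()
    results = [analyze_party_shift(positions, p, window_start, window_end) for p in parties]
    results = [r for r in results if "shift" in r]
    return sorted(results, key=lambda r: r["shift"], reverse=True)[:top_n]
```

The comparison is only meaningful if both windows' scores use already-aligned axes: independently computed SVDs can flip the sign of, or rotate, components between windows, which is the kind of issue `analyze_axis_stability` would need to rule out first.
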
- [ ] U4. **Content validation primitives**
  - **Goal:** Let the agent validate and suggest content improvements
  - **Requirements:** R4
  - **Dependencies:** U1, U3
  - **Files:**
    - Create: `agent_tools/content.py`
    - Test: `tests/agent_tools/test_content_tools.py`
  - **Approach:**
    - `validate_motion_coverage(start_date, end_date)` → returns coverage gaps
    - `validate_layman_explanations(sample_size)` → samples motions, checks explanation quality
    - `suggest_svd_label(component, top_n_motions)` → analyzes top motions, suggests label
    - `check_embedding_quality(window_id)` → returns coverage stats for fused embeddings
  - **Patterns to follow:** `summarizer.py` for explanation logic; `scripts/validate_svd_themes.py` for theme validation
  - **Test scenarios:**
    - Happy path: `validate_motion_coverage` returns accurate gap list
    - Edge case: all motions covered returns empty gaps
  - **Verification:** Agent can run weekly content quality checks and produce a report
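
A minimal sketch of `validate_motion_coverage`, under the assumption that a coverage gap means a run of calendar days with no ingested motions. The `motion_dates` input and the `max_gap_days` threshold are hypothetical additions to the planned signature; in practice the dates would come from `query_motions`.

```python
from __future__ import annotations

from datetime import date, timedelta


def validate_motion_coverage(motion_dates: list[date], start_date: date,
                             end_date: date, max_gap_days: int = 30) -> list[dict]:
    """Return stretches inside [start_date, end_date] with no motions for more
    than `max_gap_days` consecutive days. An empty list means no gaps."""
    days = sorted({d for d in motion_dates if start_date <= d <= end_date})
    # Sentinels just outside the range so leading and trailing gaps are also found.
    bounds = [start_date - timedelta(days=1), *days, end_date + timedelta(days=1)]
    gaps = []
    for prev, nxt in zip(bounds, bounds[1:]):
        empty_days = (nxt - prev).days - 1  # days strictly between two boundaries
        if empty_days > max_gap_days:
            gaps.append({"from": (prev + timedelta(days=1)).isoformat(),
                         "to": (nxt - timedelta(days=1)).isoformat(),
                         "days": empty_days})
    return gaps
```

Parliamentary recesses will show up as gaps too, so the threshold (or a list of known recess periods) would need tuning before a weekly report treats every gap as a problem.
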
- [ ] U5. **System prompt and context injection**
  - **Goal:** Define agent behavior and inject runtime context
  - **Requirements:** R1, R2, R3, R4, R5
  - **Dependencies:** U1-U4
  - **Files:**
    - Create: `agent_tools/SYSTEM_PROMPT.md`
    - Create: `agent_tools/context.py`
  - **Approach:**
    - `SYSTEM_PROMPT.md`: Defines agent identity ("You are the Stemwijzer pipeline operator"), available tools, decision criteria, and output conventions
    - `context.py`: Injects runtime context (current pipeline status, latest SVD window, known issues from `docs/solutions/`, active party list)
    - `context.md` pattern: Agent maintains `agent_tools/context.md` with accumulated learnings about the pipeline
  - **Patterns to follow:** `ce-agent-native-architecture` context.md pattern; `AGENTS.md` for project conventions
  - **Test scenarios:**
    - Context injection produces valid markdown with current DB stats
    - System prompt loads and parses without errors
  - **Verification:** Agent session starts with full context of pipeline state
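
A minimal sketch of the rendering step in `context.py`, assuming the status, known issues, and party list have already been fetched (for example via U1 and U2 tools). The function name, the status keys, and the `max_issues` cap are hypothetical; the cap reflects the plan's aim of injecting minimal relevant context rather than full database dumps.

```python
from __future__ import annotations


def build_context(status: dict, known_issues: list[str], parties: list[str],
                  max_issues: int = 5) -> str:
    """Render a compact markdown block to prepend to an agent session."""
    lines = ["# Runtime context", "", "## Pipeline status"]
    lines += [f"- {key}: {status[key]}" for key in sorted(status)] or ["- (unknown)"]
    lines += ["", "## Active parties", ", ".join(sorted(parties)) or "(none)"]
    lines += ["", "## Known issues"]
    # Cap the list so the context stays small; point to the source for the rest.
    lines += [f"- {issue}" for issue in known_issues[:max_issues]] or ["- (none)"]
    if len(known_issues) > max_issues:
        lines.append(f"- ...and {len(known_issues) - max_issues} more in docs/solutions/")
    return "\n".join(lines) + "\n"
```

Because the function is pure, the "valid markdown with current DB stats" test scenario can feed it fixed inputs and assert on the output without a database.
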
- [ ] U6. **Agent-native testing and parity verification**
  - **Goal:** Ensure agent can do everything humans can do
  - **Requirements:** R1
  - **Dependencies:** U1-U5
  - **Files:**
    - Create: `tests/agent_tools/test_parity.py`
    - Modify: `tests/conftest.py` (add agent tool fixtures)
  - **Approach:**
    - Parity tests: For each human action (run pipeline, check health, generate report), verify the agent tool achieves the same outcome
    - Integration tests: Agent runs a full diagnostic loop (check health → identify issue → run fix → verify)
    - `test_parity.py`: Matrix of human action → agent tool → expected outcome
  - **Test scenarios:**
    - Parity: "Human runs health check CLI" vs "Agent calls `pipeline_check_health()`" → same result
    - Integration: Agent detects stale data, runs pipeline, verifies freshness
  - **Verification:** All parity tests pass
---
## Output Structure
```
agent_tools/ # New directory
├── __init__.py
├── SYSTEM_PROMPT.md # Agent behavior definition
├── context.py # Runtime context injection
├── context.md # Accumulated agent knowledge
├── database.py # DB query primitives
├── pipeline.py # Pipeline control primitives
├── analysis.py # Analysis primitives
├── reports.py # Report generation
└── content.py # Content validation primitives
tests/agent_tools/ # New test directory
├── __init__.py
├── test_database_tools.py
├── test_pipeline_tools.py
├── test_analysis_tools.py
├── test_content_tools.py
└── test_parity.py
reports/ # Agent-generated reports (gitignored)
```
---
## System-Wide Impact
- **Interaction graph:** Agent tools call into `database.py`, `pipeline/`, `analysis/`, `health/` modules. These modules are already well-factored and read-only where appropriate.
- **Error propagation:** Agent tools return structured errors (JSON with `error`, `suggestion`, `retryable` fields) rather than raising exceptions. This lets the agent reason about failures.
- **State lifecycle:** Agent-generated reports in `reports/` are ephemeral (gitignored). Agent updates to `context.md` are durable and committed.
- **Unchanged invariants:** The Streamlit UI, the data pipeline logic, and the SVD computation remain unchanged. Agent tools are a new surface, not a refactor.
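
A minimal sketch of the structured-error convention as a decorator. Only the `error`, `suggestion`, and `retryable` field names come from the plan; `ToolError`, `agent_tool`, and the choice to return successful results as bare JSON are assumptions.

```python
from __future__ import annotations

import functools
import json


class ToolError(Exception):
    """Raised inside a tool when the failure has a known remedy."""

    def __init__(self, message: str, suggestion: str = "", retryable: bool = False):
        super().__init__(message)
        self.suggestion = suggestion
        self.retryable = retryable


def agent_tool(fn):
    """Wrap a tool so it returns JSON and never raises: results are serialized
    as-is, failures become {"error", "suggestion", "retryable"} objects."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> str:
        try:
            return json.dumps(fn(*args, **kwargs), default=str)
        except ToolError as exc:
            payload = {"error": str(exc), "suggestion": exc.suggestion,
                       "retryable": exc.retryable}
        except Exception as exc:  # unexpected bug: report it, but don't invite blind retries
            payload = {"error": f"{type(exc).__name__}: {exc}",
                       "suggestion": "inspect recent logs with pipeline_get_logs",
                       "retryable": False}
        return json.dumps(payload)

    return wrapper
```

A tool body raises `ToolError` when it knows the remedy (for example, stale SVD vectors → re-run the `svd` stage); anything else is reported as non-retryable so the agent investigates instead of looping.
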
---
## Risks & Dependencies
| Risk | Mitigation |
|------|-----------|
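| DuckDB concurrency (read-only agent + write pipeline) | DuckDB allows either one read-write process or several read-only processes on a database file, not both at once, so an agent cannot open a read-only connection while a pipeline process holds the write lock. Agent read tools use read-only connections and return a `retryable` structured error when the file is locked by a running pipeline stage. |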
| Agent tools become stale as pipeline evolves | Tools are thin wrappers around stable module interfaces. U6 parity tests catch drift. |
| Context injection grows too large | Context is scoped to the task. `context.py` generates minimal relevant context, not full DB dumps. |
| Security: agent has DB access | Agent runs in the same trust boundary as the developer. No new security surface. |
---
## Documentation / Operational Notes
- Add `agent_tools/` to `AGENTS.md` so future agents know the capability surface exists
- Document the parity test matrix in `tests/agent_tools/README.md`
- `reports/` should be gitignored; agent reports are ephemeral outputs
---
## Sources & References
- **Origin:** STRATEGY.md (agent-native architecture track)
- **Skill:** `ce-agent-native-architecture` (parity, granularity, composability, emergent capability)
- **Related code:** `health/`, `pipeline/`, `analysis/`, `database.py`
- **Related docs:** `docs/plans/2026-04-24-ROADMAP-stemwijzer-improvements.md` (P4 tracks)