---
title: Agent-Native Architecture Plan for Stemwijzer
type: refactor
status: active
date: 2026-05-01
origin: STRATEGY.md (agent-native architecture track)
---

# Agent-Native Architecture Plan for Stemwijzer

## Overview

Stemwijzer is a data-heavy analytical application with three surfaces: a Streamlit voting UI, a data pipeline (OData ingestion → DuckDB → SVD/embedding computation), and an analytics explorer. The agent-native architecture track aims to make an agent as capable as a human operator for every operation, whether that is running the pipeline, diagnosing drift, or answering research questions about parliamentary voting patterns.

**Current state:** The codebase is human-operated. Scripts are run manually, pipeline status is checked by eye, and analysis requires writing Python/DuckDB queries.

**Target state:** An agent with access to atomic primitives can run the pipeline, diagnose issues, generate reports, and answer open-ended questions about the data—operating in a loop until outcomes are achieved.
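
The "operating in a loop" idea can be illustrated with a minimal sketch. It assumes a health tool that reports stale stages and a stage runner that reports a status; the field names `stale_stages` and `status`, and the function `run_until_healthy`, are hypothetical and not defined anywhere in this plan.

```python
from typing import Callable

# Hypothetical tool shapes: the plan names the tools but not their return values.
HealthCheck = Callable[[], dict]     # stand-in for pipeline_check_health
StageRunner = Callable[[str], dict]  # stand-in for pipeline_run_stage


def run_until_healthy(check_health: HealthCheck,
                      run_stage: StageRunner,
                      max_iterations: int = 5) -> dict:
    """Check health, re-run the first stale stage, then check again."""
    history = []
    for _ in range(max_iterations):
        report = check_health()
        stale = report.get("stale_stages", [])  # assumed field name
        if not stale:
            return {"outcome": "healthy", "actions": history}
        result = run_stage(stale[0])
        history.append({"stage": stale[0], "status": result.get("status")})
        if result.get("status") != "ok":
            # Stop and hand the failure back instead of retrying blindly.
            return {"outcome": "failed", "actions": history}
    return {"outcome": "gave_up", "actions": history}
```

The iteration cap is a design choice in this sketch: it keeps a misdiagnosed problem from turning into an endless re-run of the same stage.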

---

## Problem Frame

- **Pipeline operators** need to know when data is stale, why SVD vectors look wrong, or whether the similarity cache is healthy. Currently this requires manually running scripts and interpreting output.
- **Analysts/researchers** want to ask questions like "Which parties shifted most on economic axes between 2020 and 2024?" Currently this requires writing DuckDB queries and Python analysis code.
- **Developers** need to understand pipeline state, verify data integrity, and troubleshoot ingestion issues. Currently this requires reading logs and running diagnostics manually.
- **Content maintainers** need to verify SVD labels match actual voting patterns, check motion coverage, and validate layman explanations. Currently this is done ad hoc.

---

## Requirements Trace

- R1. The agent can achieve anything a pipeline operator can achieve (parity)
- R2. The agent can answer open-ended analytical questions about parliamentary data (emergent capability)
- R3. The agent can diagnose pipeline health and suggest remediation (self-service operations)
- R4. The agent can generate and validate content (SVD labels, motion summaries)
- R5. New capabilities can be added by writing prompts, not code (composability)

---

## Scope Boundaries

- **In scope:** Agent primitives for data operations, pipeline control, analysis, and diagnostics
- **Deferred:** Real-time agent UI inside Streamlit (future phase—add chat interface to explorer)
- **Deferred:** Autonomous pipeline scheduling (`scheduler.py` exists, but agent control over it is deferred to v2)
- **Not working on:** Natural language to SQL for end users (this plan targets agent operators, not voter-facing features)

---

## Key Technical Decisions

- **Files as universal interface:** DuckDB is already file-based (`data/motions.db`). The agent's workspace is the repo itself. Logs, reports, and analysis outputs are files the agent writes and the human reads.
- **Database tools over file tools for structured data:** For querying motions, votes, and embeddings, the agent needs `query_database` primitives that wrap DuckDB/SQL, not raw file operations.
- **Pipeline as state machine:** The pipeline has discrete stages (ingestion → vote extraction → SVD → text embeddings → fusion → similarity). The agent needs stage-aware tools, not just "run everything."
- **Shared workspace:** Agent and human operate on the same `data/motions.db`, the same `thoughts/explorer/` outputs, the same `docs/solutions/` knowledge base.
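
As a rough illustration of the "pipeline as state machine" decision, the sketch below encodes the stage order listed above and derives which stages must be re-run after a change. The stage identifiers and both helper functions are hypothetical. The sketch also assumes the order is strictly linear; if some stages are independent of each other, a dependency graph would be more accurate.

```python
from __future__ import annotations

# Stage order as listed in this plan; the identifiers themselves are assumptions.
STAGES = [
    "ingestion",
    "vote_extraction",
    "svd",
    "text_embeddings",
    "fusion",
    "similarity",
]


def stages_to_rerun(changed_stage: str) -> list[str]:
    """Return the changed stage plus every stage downstream of it."""
    if changed_stage not in STAGES:
        raise ValueError(f"unknown stage: {changed_stage!r}")
    return STAGES[STAGES.index(changed_stage):]


def next_stage(completed: set[str]) -> str | None:
    """Return the first stage not yet completed, or None when all are done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None
```

With helpers like these, a stage-aware tool can answer "what needs to run next?" without the agent having to memorize the ordering from a prompt.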

---

## Implementation Units

- [ ] U1. **Database query primitives**
  - **Goal:** Give the agent structured access to the DuckDB database
  - **Requirements:** R1, R2, R4
  - **Dependencies:** None
  - **Files:**
    - Create: `agent_tools/database.py`
    - Test: `tests/agent_tools/test_database_tools.py`
  - **Approach:** Wrap DuckDB queries as atomic tools:
    - `query_motions(filter, limit, order)` → returns motion rows as JSON
    - `query_votes(motion_id, party)` → returns vote counts
    - `query_svd_vectors(window_id, entity_type)` → returns vectors
    - `query_party_positions(window_id)` → returns party axis scores
    - `query_pipeline_status()` → returns freshness metrics from health checks
  - **Patterns to follow:** `health/checks.py` already has DB query patterns; `analysis/explorer_data.py` has read-only query patterns
  - **Test scenarios:**
    - Happy path: query returns valid JSON for known filters
    - Edge case: empty result set returns `[]` not error
    - Error path: invalid SQL/filter returns structured error with suggestion
  - **Verification:** Agent can answer "How many motions in 2024?" using only the tool

- [ ] U2. **Pipeline control primitives**
  - **Goal:** Let the agent run, monitor, and diagnose pipeline stages
  - **Requirements:** R1, R3
  - **Dependencies:** U1
  - **Files:**
    - Create: `agent_tools/pipeline.py`
    - Test: `tests/agent_tools/test_pipeline_tools.py`
  - **Approach:** Stage-aware pipeline tools:
    - `pipeline_run_stage(stage, window_id, dry_run)` → runs one stage, returns status
    - `pipeline_run_full(dry_run)` → orchestrates all stages with dependency ordering
    - `pipeline_check_health()` → returns health report (reuses `health/` module)
    - `pipeline_get_logs(stage, lines)` → returns recent logs for a stage
    - `pipeline_validate_output(stage)` → checks output exists and looks reasonable
  - **Patterns to follow:** `pipeline/run_pipeline.py` has the stage orchestration; `scripts/health_check.py` has the CLI pattern
  - **Test scenarios:**
    - Happy path: dry-run returns planned actions without executing
    - Integration: running `pipeline_run_stage("svd", "2024")` produces expected `svd_vectors` rows
    - Error path: running a stage with missing dependencies returns clear error
  - **Verification:** Agent can diagnose "Why are SVD vectors stale?" by checking health, reading logs, and suggesting which stage to re-run

- [ ] U3. **Analysis and report generation primitives**
  - **Goal:** Let the agent perform analytical tasks and write reports
  - **Requirements:** R2, R4
  - **Dependencies:** U1
  - **Files:**
    - Create: `agent_tools/analysis.py`
    - Create: `agent_tools/reports.py`
    - Test: `tests/agent_tools/test_analysis_tools.py`
  - **Approach:**
    - `analyze_party_shift(party, window_start, window_end, metric)` → computes and returns shift data
    - `analyze_axis_stability(component, windows)` → returns stability scores
    - `generate_report(type, parameters, output_path)` → writes markdown report to `reports/`
    - `validate_svd_labels(component)` → compares theme labels to actual party positions
  - **Patterns to follow:** `analysis/political_axis.py`, `scripts/motion_drift.py`, `scripts/validate_svd_themes.py`
  - **Test scenarios:**
    - Happy path: `analyze_party_shift` returns structured data for known party
    - Integration: `generate_report("drift", {"windows": ["2020", "2024"]})` produces valid markdown
    - Edge case: requesting analysis for nonexistent window returns empty result
  - **Verification:** Agent can answer "Which parties shifted most on economic axes?" by running analysis and summarizing results

- [ ] U4. **Content validation primitives**
  - **Goal:** Let the agent validate and suggest content improvements
  - **Requirements:** R4
  - **Dependencies:** U1, U3
  - **Files:**
    - Create: `agent_tools/content.py`
    - Test: `tests/agent_tools/test_content_tools.py`
  - **Approach:**
    - `validate_motion_coverage(start_date, end_date)` → returns coverage gaps
    - `validate_layman_explanations(sample_size)` → samples motions, checks explanation quality
    - `suggest_svd_label(component, top_n_motions)` → analyzes top motions, suggests label
    - `check_embedding_quality(window_id)` → returns coverage stats for fused embeddings
  - **Patterns to follow:** `summarizer.py` for explanation logic; `scripts/validate_svd_themes.py` for theme validation
  - **Test scenarios:**
    - Happy path: `validate_motion_coverage` returns accurate gap list
    - Edge case: all motions covered returns empty gaps
  - **Verification:** Agent can run weekly content quality checks and produce a report

- [ ] U5. **System prompt and context injection**
  - **Goal:** Define agent behavior and inject runtime context
  - **Requirements:** R1, R2, R3, R4, R5
  - **Dependencies:** U1-U4
  - **Files:**
    - Create: `agent_tools/SYSTEM_PROMPT.md`
    - Create: `agent_tools/context.py`
  - **Approach:**
    - `SYSTEM_PROMPT.md`: Defines agent identity ("You are the Stemwijzer pipeline operator"), available tools, decision criteria, and output conventions
    - `context.py`: Injects runtime context—current pipeline status, latest SVD window, known issues from `docs/solutions/`, active party list
    - `context.md` pattern: Agent maintains `agent_tools/context.md` with accumulated learnings about the pipeline
  - **Patterns to follow:** `ce-agent-native-architecture` context.md pattern; `AGENTS.md` for project conventions
  - **Test scenarios:**
    - Context injection produces valid markdown with current DB stats
    - System prompt loads and parses without errors
  - **Verification:** Agent session starts with full context of pipeline state

- [ ] U6. **Agent-native testing and parity verification**
  - **Goal:** Ensure agent can do everything humans can do
  - **Requirements:** R1
  - **Dependencies:** U1-U5
  - **Files:**
    - Create: `tests/agent_tools/test_parity.py`
    - Modify: `tests/conftest.py` (add agent tool fixtures)
  - **Approach:**
    - Parity tests: For each human action (run pipeline, check health, generate report), verify the agent tool achieves the same outcome
    - Integration tests: Agent runs a full diagnostic loop (check health → identify issue → run fix → verify)
    - `test_parity.py`: Matrix of human action → agent tool → expected outcome
  - **Test scenarios:**
    - Parity: "Human runs health check CLI" vs "Agent calls `pipeline_check_health()`" → same result
    - Integration: Agent detects stale data, runs pipeline, verifies freshness
  - **Verification:** All parity tests pass
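
The parity matrix in U6 could take a shape like the following minimal sketch. `ParityCase` and `check_parity` are hypothetical names, and the callables stand in for the real CLI invocation and agent tool call, whose interfaces this plan does not specify.

```python
from __future__ import annotations

import json
from typing import Callable, NamedTuple


class ParityCase(NamedTuple):
    name: str
    human_action: Callable[[], dict]  # stand-in for a CLI invocation
    agent_tool: Callable[[], dict]    # stand-in for the agent tool call


def check_parity(cases: list[ParityCase]) -> list[str]:
    """Return the names of cases where human and agent outcomes differ."""
    failures = []
    for case in cases:
        # Compare canonical JSON so key order does not affect the result.
        human = json.dumps(case.human_action(), sort_keys=True)
        agent = json.dumps(case.agent_tool(), sort_keys=True)
        if human != agent:
            failures.append(case.name)
    return failures
```

In `test_parity.py`, each row of the matrix could become one parametrized test case, so that a drifting tool shows up as a named failure rather than a single failed assertion.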

---

## Output Structure

```
agent_tools/                    # New directory
├── __init__.py
├── SYSTEM_PROMPT.md            # Agent behavior definition
├── context.py                  # Runtime context injection
├── context.md                  # Accumulated agent knowledge
├── database.py                 # DB query primitives
├── pipeline.py                 # Pipeline control primitives
├── analysis.py                 # Analysis primitives
├── reports.py                  # Report generation
└── content.py                  # Content validation primitives

tests/agent_tools/              # New test directory
├── __init__.py
├── test_database_tools.py
├── test_pipeline_tools.py
├── test_analysis_tools.py
├── test_content_tools.py
└── test_parity.py

reports/                        # Agent-generated reports (gitignored)
```

---

## System-Wide Impact

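- **Interaction graph:** Agent tools call into the existing `database.py`, `pipeline/`, `analysis/`, and `health/` modules (the existing top-level `database.py` is distinct from the new `agent_tools/database.py`). These modules are already well-factored and read-only where appropriate.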
- **Error propagation:** Agent tools return structured errors (JSON with `error`, `suggestion`, `retryable` fields) rather than raising exceptions. This lets the agent reason about failures.
- **State lifecycle:** Agent-generated reports in `reports/` are ephemeral (gitignored). Agent updates to `context.md` are durable and committed.
- **Unchanged invariants:** The Streamlit UI, the data pipeline logic, and the SVD computation remain unchanged. Agent tools are a new surface, not a refactor.
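
One possible way to implement the error-propagation rule is a decorator that converts exceptions into structured JSON. This is a sketch under assumptions: the `error`, `suggestion`, and `retryable` fields come from this plan, while the `ok` and `result` fields, the decorator name `agent_tool`, and the toy `query_party_positions` body are illustrative only.

```python
import functools
import json
from typing import Callable


def agent_tool(suggestion: str = "", retryable: tuple = ()) -> Callable:
    """Wrap a tool so exceptions become structured JSON errors."""
    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> str:
            try:
                return json.dumps({"ok": True, "result": fn(*args, **kwargs)})
            except Exception as exc:  # deliberately broad: the agent must always get JSON back
                return json.dumps({
                    "ok": False,
                    "error": f"{type(exc).__name__}: {exc}",
                    "suggestion": suggestion,
                    # Only exception types listed by the tool author count as retryable.
                    "retryable": isinstance(exc, retryable),
                })
        return wrapper
    return decorate


# Toy tool body for illustration; the real tool would query the database.
@agent_tool(suggestion="Check window_id against the known windows.",
            retryable=(TimeoutError,))
def query_party_positions(window_id: str) -> list:
    if window_id != "2024":
        raise KeyError(window_id)
    return []
```

Keeping the wrapping in one decorator means each tool body can raise ordinary exceptions, and the JSON contract is enforced in a single place.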

---

## Risks & Dependencies

| Risk | Mitigation |
|------|-----------|
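| DuckDB concurrency (read-only agent + write pipeline) | Agent uses read-only connections; the pipeline uses a read-write connection. DuckDB supports either one read-write process or multiple read-only processes on a database file, not both at once, so agent queries issued from a separate process during a pipeline run can fail to open the file. Agent tools should report this as a `retryable` structured error (or read from a snapshot copy) rather than rely on file-level sharing. |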
| Agent tools become stale as pipeline evolves | Tools are thin wrappers around stable module interfaces. U6 parity tests catch drift. |
| Context injection grows too large | Context is scoped to the task. `context.py` generates minimal relevant context, not full DB dumps. |
| Security: agent has DB access | Agent runs in the same trust boundary as the developer. No new security surface. |
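
For the concurrency risk, a read-only open from the agent can fail while another process holds the database file for writing. A minimal sketch of a retry helper with exponential backoff follows; `open_with_retry` is a hypothetical name, and the connection factory is passed in so the sketch does not depend on any particular driver.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def open_with_retry(connect: Callable[[], T],
                    attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(), retrying with exponential backoff when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # a real tool would catch only the driver's lock/IO error
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Assumed usage with the DuckDB Python client (not imported in this sketch):
#   conn = open_with_retry(lambda: duckdb.connect("data/motions.db", read_only=True))
```

If the final attempt still fails, the exception propagates, which lets the calling tool decide how to report it to the agent.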

---

## Documentation / Operational Notes

- Add `agent_tools/` to `AGENTS.md` so future agents know the capability surface exists
- Document the parity test matrix in `tests/agent_tools/README.md`
- `reports/` should be gitignored; agent reports are ephemeral outputs

---

## Sources & References

- **Origin:** STRATEGY.md (agent-native architecture track)
- **Skill:** `ce-agent-native-architecture` (parity, granularity, composability, emergent capability)
- **Related code:** `health/`, `pipeline/`, `analysis/`, `database.py`
- **Related docs:** `docs/plans/2026-04-24-ROADMAP-stemwijzer-improvements.md` (P4 tracks)