Archives 8 one-off/backfill/research scripts to scripts/archive/:
- compare_svd_exclude_parties.py (diagnostic)
- compute_test_batch.py (test utility)
- fill_mp_votes_parties.py (backfill)
- generate_compass.py (generates to deleted outputs/)
- inspect_axis.py (diagnostic)
- qa_similarity.py (QA script, references deleted thoughts/ledgers/)
- recompute_svd.py (one-off recompute)
- semantic_gravity_examples.py (research)

Deletes:
- generate_extra_charts.py (0 references, generates to deleted outputs/)
- tests/test_qa_similarity.py (test for archived script)

Adds:
- scripts/archive/README.md explaining archive purpose
- docs/plans/2026-05-01-001-scripts-audit-cleanup-plan.md

main
parent
07dd393533
commit
2c60f41f29
@ -0,0 +1,137 @@ |
||||
--- |
||||
title: Scripts Directory Audit and Cleanup Plan |
||||
type: refactor |
||||
status: active |
||||
date: 2026-05-01 |
||||
--- |
||||
|
||||
# Scripts Directory Audit and Cleanup Plan |
||||
|
||||
## Overview |
||||
|
||||
The `scripts/` directory contains 19 Python files (~6,000 lines total, per the inventory below). Many are one-off diagnostics, research utilities, or data backfill scripts from early pipeline development. Several are no longer needed, some write output to directories that have since been deleted, and a few overlap in functionality. This plan defines a three-way taxonomy (keep, archive, delete) and the steps to apply it.

---

## Current Inventory

| Script | Lines | Last Commit | References | Status |
|--------|-------|-------------|------------|--------|
| `download_past_year.py` | 295 | 2026-04-30 | 11 | **Keep** — Active data ingestion |
| `health_check.py` | 98 | 2026-05-01 | 21 | **Keep** — Active health check CLI |
| `validate_svd_themes.py` | 343 | 2026-04-30 | 13 | **Keep** — Active validation |
| `generate_svd_json.py` | 594 | 2026-04-13 | 12 | **Keep** — Generates `thoughts/explorer/top_svd_top_motions.json` |
| `motion_drift.py` | 1,207 | 2026-04-05 | 42 | **Keep** — Referenced in active plans |
| `sync_motion_content.py` | 704 | 2026-03-23 | 8 | **Keep** — Content enrichment pipeline |
| `rerun_embeddings.py` | 233 | 2026-03-23 | 15 | **Keep** — Embedding rebuild utility |
| `derive_svd_labels.py` | 423 | 2026-04-13 | 5 | **Keep** — SVD label derivation |
| `diagnose_trajectories_cli.py` | 234 | 2026-03-31 | 5 | **Keep** — Diagnostic utility |
| `svd_diagnostics.py` | 214 | 2026-03-22 | 9 | **Keep** — SVD diagnostics |
| `recompute_svd.py` | 172 | 2026-04-16 | 2 | **Archive** — One-off recompute |
| `semantic_gravity_examples.py` | 286 | 2026-04-05 | 6 | **Archive** — Research script |
| `qa_similarity.py` | 150 | 2026-03-23 | 4 | **Archive** — QA script (references deleted `thoughts/ledgers/`) |
| `fill_mp_votes_parties.py` | 277 | 2026-03-22 | 2 | **Archive** — Backfill script |
| `inspect_axis.py` | 137 | 2026-03-22 | 3 | **Archive** — Diagnostic |
| `compare_svd_exclude_parties.py` | 204 | 2026-03-22 | 1 | **Archive** — Diagnostic |
| `generate_compass.py` | 157 | 2026-03-22 | 2 | **Archive** — Generates to deleted `outputs/` |
| `compute_test_batch.py` | 128 | 2026-03-20 | 3 | **Archive** — Test batch |
| `generate_extra_charts.py` | 172 | 2026-03-22 | 0 | **Delete** — Generates to deleted `outputs/`, 0 references |

---

## Categorization Rules

### Keep (10 scripts)
Scripts that are:
- Imported or invoked by active code/tests
- Referenced in active plans (docs/plans/)
- Run regularly as part of pipeline or diagnostics
- Updated recently (April 2026+)

### Archive (8 scripts)
Scripts that are:
- One-off diagnostics or backfill utilities
- Research/exploration scripts with no active plan references
- Superseded by pipeline code but kept for historical reference
- Generate outputs to `outputs/` (deleted) or `thoughts/ledgers/` (deleted)

**Archive location:** `scripts/archive/` — not imported, not tested, preserved for reference.

### Delete (1 script)
Scripts that are:
- Completely orphaned (0 references)
- Superseded with no unique value
- Generate outputs to non-existent directories
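
The three rules above can be read as a small decision function. The sketch below is one possible encoding, not the procedure actually used for the audit: `ScriptInfo`, `categorize`, and the `in_active_use` flag are hypothetical names, and `in_active_use` stands in for the manual judgment "imported, invoked, run regularly, or cited by an active plan".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScriptInfo:
    """Audit facts about one script (hypothetical structure)."""

    name: str
    references: int      # repo-wide reference count, as in the inventory table
    in_active_use: bool  # manual judgment: imported, invoked, or cited by an active plan


def categorize(info: ScriptInfo) -> str:
    """Return 'keep', 'archive', or 'delete' for a script."""
    if info.in_active_use:
        return "keep"
    if info.references == 0:
        # Completely orphaned: nothing would notice if it disappeared.
        return "delete"
    # Still mentioned somewhere, but not part of the active pipeline.
    return "archive"


if __name__ == "__main__":
    for info in (
        ScriptInfo("health_check.py", 21, True),
        ScriptInfo("recompute_svd.py", 2, False),
        ScriptInfo("generate_extra_charts.py", 0, False),
    ):
        print(f"{info.name}: {categorize(info)}")
```

Under this encoding, a script that writes to a deleted directory but still has references (such as `generate_compass.py`) lands in the archive rather than being deleted, which matches the inventory.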

---

## Implementation Units

- [ ] U1. **Create `scripts/archive/` directory**
  - Files: `scripts/archive/` (new directory)
  - Verification: Directory exists

- [ ] U2. **Move archive scripts to `scripts/archive/`**
  - Files to move:
    - `scripts/recompute_svd.py`
    - `scripts/semantic_gravity_examples.py`
    - `scripts/qa_similarity.py`
    - `scripts/fill_mp_votes_parties.py`
    - `scripts/inspect_axis.py`
    - `scripts/compare_svd_exclude_parties.py`
    - `scripts/generate_compass.py`
    - `scripts/compute_test_batch.py`
  - Verification: Scripts are in `scripts/archive/`, not in `scripts/`

- [ ] U3. **Delete orphaned scripts**
  - Files to delete:
    - `scripts/generate_extra_charts.py`
  - Verification: File no longer exists

- [ ] U4. **Update `.gitignore` for archive**
  - Add: `scripts/archive/` (optional — if we don't want to track archived scripts)
  - Or add README in archive explaining purpose
  - Verification: Archive is handled appropriately

- [ ] U5. **Run test suite**
  - Command: `uv run pytest tests/ -q`
  - Verification: All tests pass, no import errors from moved scripts
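
Units U1 to U3 amount to a directory creation, eight moves, and one deletion. A minimal sketch of those steps follows, assuming plain filesystem operations; `apply_cleanup`, `ARCHIVE`, and `DELETE` are hypothetical names, and a real run inside the repository would more likely use `git mv` and `git rm` so the changes are staged directly.

```python
import shutil
from pathlib import Path

# File names taken from U2 and U3 of the plan.
ARCHIVE = [
    "recompute_svd.py",
    "semantic_gravity_examples.py",
    "qa_similarity.py",
    "fill_mp_votes_parties.py",
    "inspect_axis.py",
    "compare_svd_exclude_parties.py",
    "generate_compass.py",
    "compute_test_batch.py",
]
DELETE = ["generate_extra_charts.py"]


def apply_cleanup(
    scripts_dir: Path, archive: list[str], delete: list[str]
) -> dict[str, list[str]]:
    """Move `archive` files into scripts_dir/archive and remove `delete` files.

    Files that are already gone are skipped, so the function can be re-run.
    """
    archive_dir = scripts_dir / "archive"
    archive_dir.mkdir(exist_ok=True)  # U1

    moved: list[str] = []
    for name in archive:  # U2
        src = scripts_dir / name
        if src.is_file():
            shutil.move(str(src), str(archive_dir / name))
            moved.append(name)

    removed: list[str] = []
    for name in delete:  # U3
        target = scripts_dir / name
        if target.is_file():
            target.unlink()
            removed.append(name)

    return {"moved": moved, "removed": removed}
```

Calling `apply_cleanup(Path("scripts"), ARCHIVE, DELETE)` from the repository root would perform the moves; the returned dictionary doubles as the verification record for U2 and U3.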

---

## Risks

| Risk | Mitigation |
|------|-----------|
| A test imports an archived script | Check all test imports before moving |
| A plan references an archived script | Plans already checked — none reference archive candidates exclusively |
| Future need for archived script | Git history preserves everything; archive is just convenience |

---

## Post-Cleanup State

```
scripts/
├── archive/                    # 8 archived scripts (reference only)
│   ├── compare_svd_exclude_parties.py
│   ├── compute_test_batch.py
│   ├── fill_mp_votes_parties.py
│   ├── generate_compass.py
│   ├── inspect_axis.py
│   ├── qa_similarity.py
│   ├── recompute_svd.py
│   └── semantic_gravity_examples.py
├── download_past_year.py
├── health_check.py
├── derive_svd_labels.py
├── diagnose_trajectories_cli.py
├── generate_svd_json.py
├── motion_drift.py
├── rerun_embeddings.py
├── sync_motion_content.py
├── svd_diagnostics.py
└── validate_svd_themes.py
```

**Result:** 10 active scripts + 8 archived. ~1,700 lines removed from active directory.
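
The "~1,700 lines" figure can be checked against the line counts in the Current Inventory table. A quick sketch of that arithmetic, with the counts copied from the table and the dictionary names chosen here for illustration:

```python
# Line counts copied from the "Current Inventory" table.
KEPT = {
    "download_past_year.py": 295,
    "health_check.py": 98,
    "validate_svd_themes.py": 343,
    "generate_svd_json.py": 594,
    "motion_drift.py": 1207,
    "sync_motion_content.py": 704,
    "rerun_embeddings.py": 233,
    "derive_svd_labels.py": 423,
    "diagnose_trajectories_cli.py": 234,
    "svd_diagnostics.py": 214,
}
ARCHIVED = {
    "recompute_svd.py": 172,
    "semantic_gravity_examples.py": 286,
    "qa_similarity.py": 150,
    "fill_mp_votes_parties.py": 277,
    "inspect_axis.py": 137,
    "compare_svd_exclude_parties.py": 204,
    "generate_compass.py": 157,
    "compute_test_batch.py": 128,
}
DELETED = {"generate_extra_charts.py": 172}

# Lines that leave the top level of scripts/ (archived plus deleted).
removed = sum(ARCHIVED.values()) + sum(DELETED.values())
# Lines that stay in the active directory.
remaining = sum(KEPT.values())

print(f"{len(KEPT)} active, {len(ARCHIVED)} archived, {len(DELETED)} deleted")
print(f"{removed} lines removed from the active directory, {remaining} remain")
```

The archived and deleted scripts sum to 1,683 lines, which rounds to the ~1,700 quoted in the result line.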

@@ -0,0 +1,7 @@
# Archived scripts
#
# These scripts are preserved for reference but are no longer actively
# maintained or run. They include one-off diagnostics, backfill utilities,
# and research scripts from early pipeline development.
#
# Git history preserves everything; this directory is just a convenience.
@ -1,172 +0,0 @@ |
||||
"""Generate additional blog charts: controversy trend + party alignment heatmap.""" |
||||
|
||||
from __future__ import annotations |
||||
import os, sys |
||||
|
||||
ROOT = os.path.dirname(os.path.abspath(__file__)) |
||||
if ROOT not in sys.path: |
||||
sys.path.insert(0, ROOT) |
||||
|
||||
import duckdb |
||||
import plotly.graph_objects as go |
||||
import plotly.express as px |
||||
import numpy as np |
||||
|
||||
DB = "data/motions.db" |
||||
OUT = "outputs/blog-charts" |
||||
os.makedirs(OUT, exist_ok=True) |
||||
|
||||
con = duckdb.connect(DB, read_only=True) |
||||
|
||||
# ─── 1. Controversy trend (bar chart, 2019-2026, quarterly) ────────────────── |
||||
rows = con.execute(""" |
||||
SELECT |
||||
YEAR(date) || '-Q' || QUARTER(date) as wid, |
||||
YEAR(date) as yr, |
||||
QUARTER(date) as q, |
||||
COUNT(*) as n, |
||||
ROUND(AVG(controversy_score), 3) as avg_c, |
||||
COUNT(*) FILTER (WHERE controversy_score >= 0.7) as high_c |
||||
FROM motions |
||||
WHERE controversy_score IS NOT NULL |
||||
AND date >= '2019-01-01' AND date < '2026-04-01' |
||||
GROUP BY wid, yr, q |
||||
ORDER BY yr, q |
||||
""").fetchall() |
||||
|
||||
windows = [r[0] for r in rows] |
||||
avg_c = [r[4] for r in rows] |
||||
high_pct = [round(100.0 * r[5] / r[3], 1) if r[3] else 0 for r in rows] |
||||
|
||||
fig = go.Figure() |
||||
fig.add_trace( |
||||
go.Bar( |
||||
x=windows, |
||||
y=high_pct, |
||||
name="% highly contested (score ≥ 0.7)", |
||||
marker_color="#00d9a3", |
||||
opacity=0.85, |
||||
) |
||||
) |
||||
fig.add_trace( |
||||
go.Scatter( |
||||
x=windows, |
||||
y=[v * 100 for v in avg_c], |
||||
name="avg controversy × 100", |
||||
mode="lines+markers", |
||||
line=dict(color="#e6edf3", width=2), |
||||
marker=dict(size=4), |
||||
) |
||||
) |
||||
fig.update_layout( |
||||
title="Political controversy per quarter (Tweede Kamer, 2019–2026)", |
||||
xaxis_title="Quarter", |
||||
yaxis_title="% of motions", |
||||
plot_bgcolor="#161b22", |
||||
paper_bgcolor="#0d1117", |
||||
font=dict(color="#e6edf3", family="Inter, system-ui"), |
||||
legend=dict(bgcolor="rgba(0,0,0,0)", bordercolor="#30363d", borderwidth=1), |
||||
xaxis=dict(tickangle=-45, gridcolor="#30363d"), |
||||
yaxis=dict(gridcolor="#30363d", range=[0, 55]), |
||||
bargap=0.15, |
||||
) |
||||
out1 = os.path.join(OUT, "controversy_trend.html") |
||||
fig.write_html(out1, include_plotlyjs="cdn", full_html=True) |
||||
print(f"Wrote {out1}") |
||||
|
||||
# ─── 2. Party alignment heatmap ────────────────────────────────────────────── |
||||
# Only include major parties with sufficient data |
||||
MAJOR = [ |
||||
"VVD", |
||||
"PVV", |
||||
"D66", |
||||
"CDA", |
||||
"PvdA", |
||||
"GroenLinks", |
||||
"SP", |
||||
"ChristenUnie", |
||||
"SGP", |
||||
"FVD", |
||||
"BBB", |
||||
"PvdD", |
||||
"Volt", |
||||
"GroenLinks-PvdA", |
||||
"Nieuw Sociaal Contract", |
||||
"DENK", |
||||
"JA21", |
||||
] |
||||
|
||||
rows = con.execute(""" |
||||
WITH pv AS ( |
||||
SELECT motion_id, party, |
||||
CASE |
||||
WHEN SUM(CASE WHEN vote='voor' THEN 1 ELSE 0 END) > SUM(CASE WHEN vote='tegen' THEN 1 ELSE 0 END) THEN 'voor' |
||||
WHEN SUM(CASE WHEN vote='tegen' THEN 1 ELSE 0 END) > SUM(CASE WHEN vote='voor' THEN 1 ELSE 0 END) THEN 'tegen' |
||||
ELSE 'split' |
||||
END as pv |
||||
FROM mp_votes WHERE party IS NOT NULL AND vote IN ('voor','tegen') |
||||
GROUP BY motion_id, party |
||||
), |
||||
d AS (SELECT * FROM pv WHERE pv != 'split') |
||||
SELECT a.party, b.party, |
||||
COUNT(*) as shared, |
||||
ROUND(100.0 * SUM(CASE WHEN a.pv = b.pv THEN 1 ELSE 0 END) / COUNT(*), 1) as pct |
||||
FROM d a JOIN d b ON a.motion_id = b.motion_id AND a.party != b.party |
||||
GROUP BY a.party, b.party |
||||
HAVING COUNT(*) >= 100 |
||||
""").fetchall() |
||||
|
||||
# Build matrix |
||||
agree = {} |
||||
for a, b, _, pct in rows: |
||||
agree[(a, b)] = pct |
||||
|
||||
# Filter to parties that have data |
||||
present = set() |
||||
for a, b in agree: |
||||
if a in MAJOR: |
||||
present.add(a) |
||||
if b in MAJOR: |
||||
present.add(b) |
||||
parties = [p for p in MAJOR if p in present] |
||||
|
||||
n = len(parties) |
||||
matrix = np.full((n, n), np.nan) |
||||
for i, a in enumerate(parties): |
||||
matrix[i, i] = 100.0 |
||||
for j, b in enumerate(parties): |
||||
if i != j and (a, b) in agree: |
||||
matrix[i, j] = agree[(a, b)] |
||||
|
||||
fig2 = go.Figure( |
||||
data=go.Heatmap( |
||||
z=matrix, |
||||
x=parties, |
||||
y=parties, |
||||
colorscale=[[0, "#6e40c9"], [0.5, "#30363d"], [1, "#00d9a3"]], |
||||
zmid=70, |
||||
zmin=35, |
||||
zmax=100, |
||||
text=[[f"{v:.0f}%" if not np.isnan(v) else "" for v in row] for row in matrix], |
||||
texttemplate="%{text}", |
||||
textfont=dict(size=9), |
||||
hoverongaps=False, |
||||
showscale=True, |
||||
colorbar=dict(title="Agreement %", tickfont=dict(color="#e6edf3")), |
||||
) |
||||
) |
||||
fig2.update_layout( |
||||
title="Cross-party vote alignment (all years combined)", |
||||
plot_bgcolor="#161b22", |
||||
paper_bgcolor="#0d1117", |
||||
font=dict(color="#e6edf3", family="Inter, system-ui", size=11), |
||||
xaxis=dict(tickangle=-45, side="bottom", gridcolor="#30363d"), |
||||
yaxis=dict(autorange="reversed", gridcolor="#30363d"), |
||||
height=600, |
||||
) |
||||
out2 = os.path.join(OUT, "party_alignment.html") |
||||
fig2.write_html(out2, include_plotlyjs="cdn", full_html=True) |
||||
print(f"Wrote {out2}") |
||||
|
||||
con.close() |
||||
print("Done.") |
||||
@ -1,51 +0,0 @@ |
||||
import json |
||||
from pathlib import Path |
||||
|
||||
|
||||
def test_qa_similarity_creates_ledger(tmp_path, monkeypatch): |
||||
# Prepare monkeypatched database.db |
||||
class DummyDB: |
||||
def sample_motions(self, sample_size): |
||||
assert sample_size == 2 |
||||
return [1, 2] |
||||
|
||||
def get_cached_similarities(self, motion_id, top_k): |
||||
# return deterministic neighbors |
||||
return [ |
||||
{"id": motion_id * 10 + i, "score": 1.0 - i * 0.1} for i in range(top_k) |
||||
] |
||||
|
||||
dummy = DummyDB() |
||||
|
||||
# Monkeypatch the database module to provide .db — use monkeypatch.setitem |
||||
# so the override is active for this test and auto-reverts after. |
||||
import types |
||||
|
||||
fake_db_module = types.SimpleNamespace(db=dummy) |
||||
|
||||
import sys |
||||
|
||||
monkeypatch.setitem(sys.modules, "database", fake_db_module) |
||||
|
||||
# Ensure thoughts/ledgers inside tmp_path |
||||
base = tmp_path |
||||
(base / "thoughts" / "ledgers").mkdir(parents=True) |
||||
|
||||
# Monkeypatch cwd so ledger writes to tmp_path/thoughts |
||||
monkeypatch.chdir(base) |
||||
|
||||
from scripts.qa_similarity import main |
||||
|
||||
summary = main(db_path=":memory:", sample_size=2, top_k=3) |
||||
|
||||
assert summary["sample_size"] == 2 |
||||
assert summary["top_k"] == 3 |
||||
assert 1 in summary["motions"] |
||||
assert 2 in summary["motions"] |
||||
|
||||
ledger_path = Path(summary["ledger_path"]) |
||||
assert ledger_path.exists() |
||||
|
||||
data = json.loads(ledger_path.read_text(encoding="utf-8")) |
||||
assert "motions" in data |
||||
assert len(data["motions"]) == 2 |
||||