# Recomputing Similarity (Admin)
This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.
## What this does
- Recompute similarity vectors/scores for existing records in the database.
- (Optionally) run the clusterer job that groups similar items based on the recomputed vectors.
These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.
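
To make the "top similar items per record" idea concrete, here is a minimal sketch. It assumes cosine similarity over in-memory embedding vectors, which this document does not confirm; the function names `cosine` and `top_n_similar` are hypothetical and are not part of the project's code.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(
    vectors: Dict[str, List[float]], top_n: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For each record id, keep the top_n most similar other records."""
    result: Dict[str, List[Tuple[str, float]]] = {}
    for record_id, vec in vectors.items():
        # Score every other record against this one (assumption: brute force).
        scored = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in vectors.items()
            if other_id != record_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[record_id] = scored[:top_n]
    return result
```

The brute-force comparison is quadratic in the number of records, which is one reason a real job would likely process records in batches.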
## Migration filenames
When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like `recompute_similarity` or `clusterer` in the filename, for example:
- `20260101_001_recompute_similarity.py`
- `20260215_002_clusterer_migration.py`
Check your migrations folder for the exact filenames used in your environment.
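
As a quick way to find the relevant files, the following sketch filters a list of filenames by the keywords above. The `<YYYYMMDD>_<NNN>_<name>.py` pattern is inferred from the two example names and may not match your project exactly; `similarity_migrations` is a hypothetical helper, not an existing utility.

```python
import re
from typing import Iterable, List

# Assumed filename pattern, inferred from the examples: date, sequence, name.
MIGRATION_RE = re.compile(r"^(?P<date>\d{8})_(?P<seq>\d{3})_(?P<name>[a-z0-9_]+)\.py$")
KEYWORDS = ("recompute_similarity", "clusterer")


def similarity_migrations(filenames: Iterable[str]) -> List[str]:
    """Return migration filenames whose name part contains a similarity keyword."""
    matches = []
    for filename in filenames:
        m = MIGRATION_RE.match(filename)
        if m and any(keyword in m.group("name") for keyword in KEYWORDS):
            matches.append(filename)
    # The date prefix makes a lexical sort chronological.
    return sorted(matches)
```

For example, you could pass it `(p.name for p in pathlib.Path("migrations").iterdir())`, where the `migrations` path is an assumption about your checkout.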
## Environment variables
When running the CLI locally you may need to set the following environment variables.
- `TEST_DB_URL`: connection string for a test/development database, used by local runs so you don't touch production data.
- `AI_PROVIDER_MOCK`: when set to a truthy value (e.g. `1`, `true`, `yes`), the AI/embedding provider is mocked so no real API calls are made during development. Any non-empty value is treated as truthy, so leave the variable unset to use the real provider.
- `SIMILARITY_TOP_N`: default number of top similar items to compute/keep for each record. The CLI `--top-n` flag overrides this value for the duration of the run.
Examples:
- Export in a shell (persistent for your session):

      export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
      export AI_PROVIDER_MOCK="true"
      export SIMILARITY_TOP_N="50"

- Inline for a single command (non-persistent):

      TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100
Notes:

- The `--top-n` CLI flag takes precedence over `SIMILARITY_TOP_N` when both are provided.
- Set `AI_PROVIDER_MOCK` to a truthy value (e.g. `1`, `true`, `yes`) to avoid real external AI calls during local runs.
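
The precedence and truthiness rules above can be sketched as follows. This is an illustration only: `mock_enabled`, `resolve_top_n`, and the `fallback` default of 10 are hypothetical and are not taken from the project's code.

```python
import os
from typing import Mapping, Optional


def mock_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Any non-empty AI_PROVIDER_MOCK value counts as truthy."""
    return bool(env.get("AI_PROVIDER_MOCK", ""))


def resolve_top_n(
    cli_top_n: Optional[int],
    env: Mapping[str, str] = os.environ,
    fallback: int = 10,  # assumption: the real default is not documented here
) -> int:
    """Pick top-n: the CLI flag wins, then SIMILARITY_TOP_N, then the fallback."""
    if cli_top_n is not None:
        return cli_top_n
    raw = env.get("SIMILARITY_TOP_N", "")
    return int(raw) if raw else fallback
```

Passing the environment as a mapping keeps the helpers easy to test without mutating `os.environ`.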
## Running locally (development)
The CLI lives under `src/cli`. Use Python's module runner (`python -m`) to execute the recompute script. Example commands:

Run a dry run that doesn't persist changes:
```
python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run
```
Run for real (writes results to the DB):
```
python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
```
Common flags:

- `--top-n`: override `SIMILARITY_TOP_N` for this run.
- `--batch-size`: number of records to process per batch.
- `--dry-run`: inspect what would be changed without writing to the DB.
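
For orientation, the flags above could be declared with `argparse` roughly like this. It is a sketch under assumptions, not the actual contents of `src/cli/recompute_similarity.py`: the default batch size of 100 and the `build_parser` and `batches` helpers are hypothetical.

```python
import argparse
from typing import Iterator, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented flags."""
    parser = argparse.ArgumentParser(prog="recompute_similarity")
    parser.add_argument("--top-n", type=int, default=None,
                        help="override SIMILARITY_TOP_N for this run")
    parser.add_argument("--batch-size", type=int, default=100,  # assumed default
                        help="number of records to process per batch")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would change without writing to the DB")
    return parser


def batches(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

With this shape, a dry run would go through the same batching loop and skip only the final write step.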
Notes:

- Always point `TEST_DB_URL` at a non-production database when experimenting.
- Use `AI_PROVIDER_MOCK=true` to skip external calls and speed up local dev.
- If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.
If you need help or encounter mismatches between the migration files and the CLI, check the migrations folder and speak with the team member who authored the change.