Recomputing Similarity (Admin)
This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.
What this does
- Recompute similarity vectors/scores for existing records in the database.
- (Optionally) run the clusterer job that groups similar items based on recomputed vectors.
These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.
Migration filenames
When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like recompute_similarity or clusterer in the filename, for example:
20260101_001_recompute_similarity.py
20260215_002_clusterer_migration.py
Check your migrations folder for the exact filenames used in your environment.
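One way to find the relevant files is to filter the migrations folder by the keywords mentioned above. The following is a minimal sketch, not part of the project: the function name `similarity_migrations` and the folder path passed to it are hypothetical, and it assumes migrations are `.py` files in a single directory.

```python
from pathlib import Path
from typing import List

# Keywords taken from the filename pattern described above.
KEYWORDS = ("recompute_similarity", "clusterer")


def similarity_migrations(folder: Path) -> List[str]:
    """Return the sorted names of .py files in `folder` whose name contains a keyword."""
    return sorted(
        path.name
        for path in folder.glob("*.py")
        if any(keyword in path.name for keyword in KEYWORDS)
    )


if __name__ == "__main__":
    # "migrations" is an assumed location; adjust it to your repository layout.
    for name in similarity_migrations(Path("migrations")):
        print(name)
```

Because the filenames start with a date prefix, sorting by name also lists them in chronological order.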
Environment variables
When running the CLI locally you may need to set the following environment variables.
- TEST_DB_URL: connection string for a test/development database (used by local runs when you don't want to touch production data).
- AI_PROVIDER_MOCK: when set to a truthy value (1, true, yes) the AI/embedding provider is mocked so you don't make real API calls during development. Treat any non-empty value of AI_PROVIDER_MOCK as truthy.
- SIMILARITY_TOP_N: default number of top similar items to compute/keep for each record. The CLI --top-n flag overrides this value for the duration of the run.
Examples:
- Export in a shell (persistent for your session):
  export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
  export AI_PROVIDER_MOCK="true"
  export SIMILARITY_TOP_N="50"
- Inline for a single command (non-persistent):
  TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100
Notes:
- The --top-n CLI flag takes precedence over SIMILARITY_TOP_N when both are provided.
- AI_PROVIDER_MOCK should be set to a truthy value (e.g. 1, true, yes) to avoid real external AI calls during local runs.
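The precedence and truthiness rules above can be sketched in a few lines of Python. This is an illustration only, not the CLI's actual code: the helper names `mock_enabled` and `resolve_top_n` and the fallback `DEFAULT_TOP_N` are hypothetical, and the real default used when neither the flag nor the variable is set is not documented here.

```python
import os
from typing import Mapping, Optional

# Hypothetical fallback for when neither --top-n nor SIMILARITY_TOP_N is set.
DEFAULT_TOP_N = 10


def mock_enabled(env: Mapping[str, str]) -> bool:
    """Treat any non-empty AI_PROVIDER_MOCK value as truthy; unset or empty is falsy."""
    return bool(env.get("AI_PROVIDER_MOCK"))


def resolve_top_n(cli_top_n: Optional[int], env: Mapping[str, str]) -> int:
    """Prefer the --top-n value, then SIMILARITY_TOP_N, then the assumed default."""
    if cli_top_n is not None:
        return cli_top_n
    raw = env.get("SIMILARITY_TOP_N", "")
    return int(raw) if raw else DEFAULT_TOP_N


if __name__ == "__main__":
    print("mock provider:", mock_enabled(os.environ))
    print("top-n:", resolve_top_n(None, os.environ))
```

Note that under the "any non-empty value" rule, values such as 0 or false would also enable the mock, so unset the variable entirely when you want real provider calls.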
Running locally (development)
The CLI lives under src/cli. Use the module runner to execute the recompute script. Example commands:
Run a dry-run that doesn't persist changes:
python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run
Run for real (writes results to the DB):
python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
Common flags
- --top-n: override SIMILARITY_TOP_N for this run.
- --batch-size: number of records to process per batch.
- --dry-run: inspect what would be changed without writing to the DB.
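To make the interaction between these flags concrete, here is a rough sketch of how such a CLI could parse them and walk records in batches. It is an assumption-laden illustration rather than the contents of src/cli/recompute_similarity: the functions `build_parser`, `batches`, and `run`, the default batch size of 100, and the in-memory list of record IDs are all hypothetical, and the database write is replaced by a counter.

```python
import argparse
from typing import Iterator, List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags described above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(prog="recompute_similarity")
    parser.add_argument("--top-n", type=int, default=None)
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def batches(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` record IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def run(argv: Optional[List[str]] = None, record_ids: Sequence[int] = ()) -> int:
    """Process records batch by batch and return how many would have been written."""
    args = build_parser().parse_args(argv)
    written = 0
    for batch in batches(record_ids, args.batch_size):
        if args.dry_run:
            # Report only; nothing is persisted in a dry run.
            print(f"[dry-run] would update {len(batch)} records")
        else:
            # Stand-in for the real database write.
            written += len(batch)
    return written


if __name__ == "__main__":
    print(run(record_ids=list(range(250))))
```

A smaller batch size keeps each unit of work short at the cost of more round trips; the dry-run branch follows the same batching path so its output reflects what a real run would touch.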
Notes
- Always point TEST_DB_URL at a non-production database when experimenting.
- Use AI_PROVIDER_MOCK=true to skip external calls and speed up local dev.
- If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.
If you need help or find mismatches between the migration files and the CLI, check the migrations folder and speak with the team member who authored the change.