Recomputing Similarity (Admin)

This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.

What this does

Recompute similarity vectors/scores for existing records in the database.
(Optionally) run the clusterer job that groups similar items based on recomputed vectors.

These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.

Migration filenames

When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like recompute_similarity or clusterer in the filename, for example:

20260101_001_recompute_similarity.py
20260215_002_clusterer_migration.py

Check your migrations folder for the exact filenames used in your environment.
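As a quick way to do that check, the following is a minimal sketch that filters filenames by the two keywords mentioned above. The folder name `migrations` and both function names are assumptions for illustration, not part of the project.

```python
from pathlib import Path

# Keywords that similarity-related migration filenames typically contain.
KEYWORDS = ("recompute_similarity", "clusterer")


def similarity_migrations(filenames):
    """Return the filenames that contain a similarity keyword, sorted by name."""
    return sorted(
        name for name in filenames
        if any(keyword in name for keyword in KEYWORDS)
    )


def list_similarity_migrations(folder="migrations"):
    """Scan a folder for matching .py files ("migrations" is a hypothetical path)."""
    return similarity_migrations(p.name for p in Path(folder).glob("*.py"))
```

Because the filenames start with a date and sequence number, sorting by name also gives the order in which the migrations were created.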

Environment variables

When running the CLI locally you may need to set the following environment variables.

TEST_DB_URL — connection string for a test/development database (used by local runs when you don't want to touch production data).
AI_PROVIDER_MOCK — when set to a truthy value (1, true, yes) the AI/embedding provider is mocked so you don't make real API calls during development. Treat any non-empty value of AI_PROVIDER_MOCK as truthy.
SIMILARITY_TOP_N — default number of top similar items to compute/keep for each record. The CLI --top-n flag overrides this value for the duration of the run.

Examples:

Export in a shell (persistent for your session):

export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
export AI_PROVIDER_MOCK="true"
export SIMILARITY_TOP_N="50"

Inline for a single command (non-persistent):

TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100

Notes:

The --top-n CLI flag takes precedence over SIMILARITY_TOP_N when both are provided.
Set AI_PROVIDER_MOCK to a truthy value (e.g. 1, true, yes) to avoid real external AI calls during local runs. Because any non-empty value counts as truthy, leave the variable unset when you do want real calls.
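The precedence and truthiness rules above can be sketched in a few lines. This is an illustration of the documented behaviour only. The helper names `mock_enabled` and `resolve_top_n` are hypothetical, and the real CLI may resolve these values differently.

```python
import os


def mock_enabled(env=os.environ):
    """Any non-empty AI_PROVIDER_MOCK value enables the mock provider."""
    return bool(env.get("AI_PROVIDER_MOCK", ""))


def resolve_top_n(cli_top_n, env=os.environ):
    """Prefer the --top-n value; otherwise fall back to SIMILARITY_TOP_N.

    Returns None when neither is set, since this document does not state
    a built-in default.
    """
    if cli_top_n is not None:
        return cli_top_n
    raw = env.get("SIMILARITY_TOP_N")
    return int(raw) if raw else None
```

Passing the environment as a parameter keeps both helpers easy to test without touching the real process environment.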

Running locally (development)

The CLI lives under src/cli. Use Python's module runner (python -m) to execute the recompute script. Example commands:

Run a dry-run that doesn't persist changes:

python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run

Run for real (writes results to the DB):

python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
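To make the difference between the two runs concrete, here is a rough sketch of batched processing with a dry-run switch. It is not the CLI's actual implementation, which this document does not show. The `write` callback stands in for whatever persists results to the database, and both function names are hypothetical.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def recompute(records, batch_size, dry_run, write):
    """Visit every record in batches; persist each batch unless dry_run is set.

    Returns the number of records visited.
    """
    visited = 0
    for batch in batched(records, batch_size):
        visited += len(batch)
        if not dry_run:
            write(batch)  # hypothetical persistence callback
    return visited
```

In this sketch a dry run walks the same batches as a real run and only skips the write step, so the reported record count is identical in both modes.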

Common flags

--top-n — override SIMILARITY_TOP_N for this run.
--batch-size — number of records to process per batch.
--dry-run — inspect what would be changed without writing to the DB.
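For readers extending the CLI, the three flags above could be declared with the standard library's argparse roughly as follows. This is a sketch, not the contents of src/cli. The function name `build_parser` and the default batch size of 100 are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser exposing the three flags described above."""
    parser = argparse.ArgumentParser(prog="recompute_similarity")
    parser.add_argument(
        "--top-n", type=int, default=None,
        help="override SIMILARITY_TOP_N for this run",
    )
    parser.add_argument(
        "--batch-size", type=int, default=100,  # assumed default
        help="number of records to process per batch",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="inspect what would change without writing to the DB",
    )
    return parser
```

Leaving `--top-n` as None when it is omitted lets the caller tell "flag not given" apart from an explicit value, which is what allows the fallback to SIMILARITY_TOP_N.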

Notes

Always point TEST_DB_URL at a non-production database when experimenting.
Use AI_PROVIDER_MOCK=true to skip external calls and speed up local dev.
If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.

If you need help or encounter mismatches between migration files and the CLI, check the migrations folder and speak with the team member who authored the change.
