Recomputing Similarity (Admin)

This document explains the admin CLI and developer workflows for recomputing similarity scores and running clustering jobs locally.

What this does

  • Recompute similarity vectors/scores for existing records in the database.
  • (Optionally) run the clusterer job that groups similar items based on recomputed vectors.

These operations are typically run as admin/maintenance tasks after changing the embedding/similarity logic or restoring a database snapshot.

Migration filenames

When adding or running migrations related to similarity or clustering, follow the project's migration filename pattern. Migration files touching similarity will typically include keywords like recompute_similarity or clusterer in the filename, for example:

  • 20260101_001_recompute_similarity.py
  • 20260215_002_clusterer_migration.py

Check your migrations folder for the exact filenames used in your environment.
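As a rough illustration of the naming pattern above, the following sketch filters a list of filenames for the two keywords mentioned. The function name `similarity_migrations` and the `KEYWORDS` tuple are hypothetical and not part of the project; the keywords themselves are taken from the examples in this section.

```python
# Keywords taken from the example filenames above; adjust to your project.
KEYWORDS = ("recompute_similarity", "clusterer")


def similarity_migrations(filenames):
    """Return the .py filenames that mention a similarity/clustering keyword.

    The result is sorted by name, which matches chronological order when
    filenames start with a date prefix such as 20260101_001_.
    """
    return sorted(
        name
        for name in filenames
        if name.endswith(".py") and any(keyword in name for keyword in KEYWORDS)
    )


if __name__ == "__main__":
    example = [
        "20260215_002_clusterer_migration.py",
        "20251201_001_add_users.py",  # hypothetical unrelated migration
        "20260101_001_recompute_similarity.py",
    ]
    for name in similarity_migrations(example):
        print(name)
```

To apply it to a real checkout, pass the names from your migrations folder (for example via `os.listdir`), whose location depends on your environment.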

Environment variables

When running the CLI locally you may need to set the following environment variables.

  • TEST_DB_URL — connection string for a test/development database (used by local runs when you don't want to touch production data).
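  • AI_PROVIDER_MOCK — when set, the AI/embedding provider is mocked so that no real API calls are made during development. Any non-empty value counts as truthy; 1, true, and yes are the conventional choices. Because of this, leave the variable unset (rather than setting it to 0 or false) when you want the real provider.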
  • SIMILARITY_TOP_N — default number of top similar items to compute/keep for each record. The CLI --top-n flag overrides this value for the duration of the run.

Examples:

  • Export in a shell (persistent for your session):

    export TEST_DB_URL="postgresql://user:pass@localhost:5432/devdb"
    export AI_PROVIDER_MOCK="true"
    export SIMILARITY_TOP_N="50"

  • Inline for a single command (non-persistent):

    TEST_DB_URL="postgresql://user:pass@localhost/devdb" AI_PROVIDER_MOCK=1 python -m src.cli.recompute_similarity --batch-size 100

Notes:

  • The --top-n CLI flag takes precedence over SIMILARITY_TOP_N when both are provided.
  • AI_PROVIDER_MOCK should be set to a truthy value (e.g. 1, true, yes) to avoid real external AI calls during local runs.
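The two rules in these notes can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the project's actual code: `env_flag` and `resolve_top_n` are hypothetical helper names, and returning `None` when neither the flag nor the variable is set is an assumption, since this document does not say what the fallback is.

```python
import os


def env_flag(name, environ=None):
    """Return True when the variable is set to any non-empty value."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(name, ""))


def resolve_top_n(cli_top_n, environ=None):
    """Pick the effective top-N: the CLI value wins over SIMILARITY_TOP_N.

    Returns None when neither source provides a value (assumed behaviour).
    """
    environ = os.environ if environ is None else environ
    if cli_top_n is not None:
        return cli_top_n
    raw = environ.get("SIMILARITY_TOP_N")
    return int(raw) if raw else None


if __name__ == "__main__":
    env = {"AI_PROVIDER_MOCK": "true", "SIMILARITY_TOP_N": "50"}
    print(env_flag("AI_PROVIDER_MOCK", env))  # mocking enabled
    print(resolve_top_n(10, env))             # CLI value overrides the env
    print(resolve_top_n(None, env))           # falls back to the env value
```

Passing the environment as a dictionary keeps the helpers easy to test without touching the real process environment.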

Running locally (development)

The CLI lives under src/cli. Use the module runner to execute the recompute script. Example commands:

Do a dry run that doesn't persist changes:

python -m src.cli.recompute_similarity --top-n 10 --batch-size 100 --dry-run

Run for real (writes results to the DB):

python -m src.cli.recompute_similarity --top-n 50 --batch-size 500
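The difference between a dry run and a real run, combined with batching, might look roughly like the sketch below. It is a simplified model under stated assumptions: `batched`, `recompute`, and the `compute` and `write` callables are hypothetical stand-ins for the project's scoring and persistence code, which this document does not describe.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def recompute(records, batch_size, dry_run, compute, write):
    """Process records batch by batch and return how many were processed.

    compute and write are hypothetical callables. On a dry run the results
    are still computed, but write is never called.
    """
    processed = 0
    for batch in batched(records, batch_size):
        results = [compute(record) for record in batch]
        if not dry_run:
            write(results)
        processed += len(batch)
    return processed


if __name__ == "__main__":
    stored = []
    count = recompute(range(5), 2, dry_run=True,
                      compute=lambda r: r * 2, write=stored.extend)
    print(count, stored)
```

Keeping the write step behind a single `dry_run` check makes it easy to confirm that a dry run cannot modify the database.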

Common flags

  • --top-n — override SIMILARITY_TOP_N for this run.
  • --batch-size — number of records to process per batch.
  • --dry-run — inspect what would be changed without writing to the DB.
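For orientation, the three flags above could be declared with the standard `argparse` module as in the following sketch. This is an assumption about how such a CLI might be wired, not a copy of src/cli; `build_parser` is a hypothetical name, and the defaults of `None` are placeholders because the real defaults are not documented here.

```python
import argparse


def build_parser():
    """Build a parser exposing the three flags described above."""
    parser = argparse.ArgumentParser(prog="recompute_similarity")
    parser.add_argument(
        "--top-n", type=int, default=None,
        help="override SIMILARITY_TOP_N for this run",
    )
    parser.add_argument(
        "--batch-size", type=int, default=None,
        help="number of records to process per batch",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="inspect what would change without writing to the DB",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--top-n", "10", "--batch-size", "100", "--dry-run"]
    )
    print(args.top_n, args.batch_size, args.dry_run)
```

Note that `argparse` maps `--top-n` to the attribute `top_n`, and likewise for the other dashed flags.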

Notes

  • Always point TEST_DB_URL at a non-production database when experimenting.
  • Use AI_PROVIDER_MOCK=true to skip external calls and speed up local dev.
  • If you change the embedding or similarity algorithm, re-run the recompute job and re-index/cluster as needed.

If you need help or find mismatches between the migration files and the CLI, check the migrations folder and talk to the team member who authored the change.