# Evals

## Why evals exist
Gobii evals are executable product checks for agent behavior. They answer questions that unit tests cannot: did the agent choose the right tool, stop at the right time, ask for approval, avoid duplicate replies, and keep behavior stable when we change prompts, tools, routing profiles, or models?
The canonical eval system lives under `api/evals` and is launched with `uv run python manage.py run_evals`. Do not add one-off management commands for feature evals. Add scenarios and suites to the canonical registry instead.
## What to run
| Run type | Purpose | Command shape | Counts as |
|---|---|---|---|
| Unit tests | Prove Python code, scoring helpers, registration, and local setup behavior. | `uv run python manage.py test ... --settings=config.test_settings` | Unit test evidence only. |
| Simulated evals | Deterministic offline eval path for scenarios that declare simulation support. | `run_evals --simulated --settings=config.eval_local_settings` | Eval runner smoke, not live model quality. |
| Live evals | Run scenarios against a real model through an LLM routing profile. | `run_evals --routing-profile <profile> --settings=config.eval_local_settings` | Model behavior evidence. |
| Official evals | Durable comparison runs for tracked releases or model changes. | Add `--official` or `--run-type official`. | Official trend data. |
Keep those categories separate in reports. Passing unit tests does not mean evals passed, and a simulated eval does not mean a model passed.
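For example, an official live run might look like the following; the repeat count here is illustrative, and `<profile>` is whichever seeded routing profile you are tracking:

```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 3 \
  --routing-profile <profile> \
  --official \
  --settings=config.eval_local_settings
```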
## Architecture
| Piece | Location | Role |
|---|---|---|
| `EvalScenario` | `api/evals/base.py` | Base class for one behavioral scenario. |
| `ScenarioTask` | `api/evals/base.py` | The visible assertions/tasks recorded under each run. |
| `ScenarioRegistry` | `api/evals/registry.py` | Global scenario registry. |
| `SuiteRegistry` | `api/evals/suites.py` | Named groups of scenario slugs, plus the dynamic `all` suite. |
| `EvalRunner` | `api/evals/runner.py` | Executes one `EvalRun`, creates `EvalRunTask` rows, captures fingerprints and model metadata. |
| `EvalSuiteRun` | `api/models.py` | One suite invocation and its requested repeats, run type, launch config, and routing-profile snapshot. |
| `EvalRun` | `api/models.py` | One scenario execution for one agent. |
| `EvalRunTask` | `api/models.py` | One scored task/assertion inside a run. |
| `run_evals` | `api/management/commands/run_evals.py` | The only CLI entry point for canonical evals. |
`api/evals/loader.py` imports scenario modules and registers built-in suites. `run_evals` loads that registry directly, so a clean CLI invocation does not depend on test imports.
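A sketch of that pattern, assuming the loader relies on scenario-module imports for their registration side effects and registers suites with `SuiteRegistry`; the import paths and the registration call shape are assumptions, so check the real file:

```python
# Illustrative shape of api/evals/loader.py, not its actual contents.
from api.evals.scenarios import meta_gobii  # noqa: F401  (importing registers scenarios)
from api.evals.suites import EvalSuite, SuiteRegistry  # import path assumed

SuiteRegistry.register(  # registration call shape assumed
    EvalSuite(
        slug="meta_gobii",
        description="Meta Gobii manager behavior checks.",
        scenario_slugs=["meta_gobii_negative_content_task"],  # partial list, illustrative
    )
)
```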
## Local setup
For local eval development, use `config.eval_local_settings`. It creates and migrates `.local/eval-local.sqlite3`, runs Celery work eagerly, disables browser task execution, and seeds local LLM routing profiles. It is explicit and scoped to the eval-local SQLite database.
Do not use `config.test_settings` for live evals. Test settings mock and isolate runtime behavior for unit tests; live evals need the eval-local or real deployment settings path.
Run one eval-local command at a time. The local SQLite database is intentionally simple and can lock if multiple eval-local commands try to migrate or seed profiles concurrently.
List suites and scenarios without touching the database:
```bash
uv run python manage.py run_evals --list
```
Create the local SQLite schema, seed routing profiles, and list available non-snapshot profiles:
```bash
uv run python manage.py run_evals --list-routing-profiles --settings=config.eval_local_settings
```
The eval-local profiles store env var names only. They do not store API key values.
## Common commands
Run one simulated scenario:
```bash
uv run python manage.py run_evals \
  --scenario meta_gobii_negative_content_task \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Run one simulated suite:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Run every registered scenario:
```bash
uv run python manage.py run_evals \
  --suite all \
  --sync \
  --n-runs 1 \
  --settings=config.eval_local_settings
```
For live all-suite runs, pass `--routing-profile <profile>` and expect a larger bill and a longer run. Start with one scenario or one suite before running `all`.
## Live model runs
Source key files without printing values:
```bash
set -a; source /Users/andrew/.env-openrouter >/dev/null; set +a
```
OpenRouter DeepSeek V4 Flash:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-deepseek-v4-flash \
  --settings=config.eval_local_settings
```
OpenRouter Qwen:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-qwen \
  --settings=config.eval_local_settings
```
OpenAI:
```bash
set -a; source /path/to/your/openai.env >/dev/null; set +a
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openai-gpt-4-1-mini \
  --settings=config.eval_local_settings
```
Custom LiteLLM model string:
```bash
set -a; source /path/to/your/provider.env >/dev/null; set +a
EVAL_LOCAL_CUSTOM_MODEL="anthropic/claude-sonnet-4-20250514" \
EVAL_LOCAL_CUSTOM_API_KEY_ENV_VAR="ANTHROPIC_API_KEY" \
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile custom-litellm \
  --settings=config.eval_local_settings
```
For OpenAI-compatible local or proxy endpoints, also set `EVAL_LOCAL_CUSTOM_API_BASE` and use the model string expected by that endpoint.
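A minimal sketch, assuming a hypothetical OpenAI-compatible proxy on localhost; the model string, key env var name, and port below are placeholders, not values the eval system requires:

```bash
set -a; source /path/to/your/provider.env >/dev/null; set +a
# All three values below are illustrative; match them to your proxy's config.
EVAL_LOCAL_CUSTOM_MODEL="openai/my-proxy-model" \
EVAL_LOCAL_CUSTOM_API_KEY_ENV_VAR="MY_PROXY_API_KEY" \
EVAL_LOCAL_CUSTOM_API_BASE="http://localhost:4000/v1" \
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile custom-litellm \
  --settings=config.eval_local_settings
```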
## Reading results
The CLI prints each created `EvalSuiteRun`, each scheduled `EvalRun`, task status lines, pass rate, suite status, and audit links for created eval agents.
Inspect the latest local suite run:
```bash
uv run python manage.py shell --settings=config.eval_local_settings -c "\
from api.models import EvalSuiteRun; \
r = EvalSuiteRun.objects.prefetch_related('runs__tasks').latest('created_at'); \
print(r.id, r.suite_slug, r.status, r.run_type); \
[print(run.scenario_slug, run.status, run.primary_model, run.tasks.filter(status='passed').count(), '/', run.tasks.count()) for run in r.runs.all()]"
```
Compare model experiments by running the same scenario or suite with different `--routing-profile` values. `EvalRunner` snapshots the routing profile onto the suite run and records `EvalRun.primary_model`, `scenario_fingerprint`, `code_version`, and `code_branch`.
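A quick way to eyeball those fields across recent runs; this sketch assumes `EvalRun` also carries a `created_at` timestamp like `EvalSuiteRun` does:

```bash
uv run python manage.py shell --settings=config.eval_local_settings -c "\
from api.models import EvalRun; \
[print(run.primary_model, run.scenario_slug, run.status, run.code_branch) \
 for run in EvalRun.objects.order_by('-created_at')[:20]]"
```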
## Adding scenarios
Add new eval behavior under `api/evals/scenarios/`. A normal scenario should:

- Subclass `EvalScenario`.
- Define a stable `slug`, `version`, `description`, and `tasks`.
- Implement `run(self, run_id: str, agent_id: str)`.
- Use `ScenarioExecutionTools` when it needs to inject messages, trigger processing, or record task results.
- Register itself with `ScenarioRegistry.register(...)`.
- Ensure the module is imported by `api/evals/loader.py`.
- Add targeted unit tests tagged with an existing eval batch tag, usually `@tag("batch_eval_fingerprint")` unless a more specific registered tag fits.
Keep scenarios deterministic where practical. If a scenario supports offline execution, set `supports_simulation = True` and make `--simulated` use deterministic local data. If it needs a real model, do not pretend the simulated path measures model quality.
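A minimal sketch of such a scenario; the `ScenarioTask` constructor arguments and the exact `ScenarioRegistry.register(...)` signature are assumptions here, so check `api/evals/base.py` and `api/evals/registry.py` for the real API:

```python
# Hypothetical module at api/evals/scenarios/my_feature.py.
from api.evals.base import EvalScenario, ScenarioTask
from api.evals.registry import ScenarioRegistry


class MyFeatureHappyPath(EvalScenario):
    slug = "my_feature_happy_path"
    version = 1
    description = "Agent picks the expected tool and replies exactly once."
    # Field names on ScenarioTask are illustrative.
    tasks = [
        ScenarioTask(slug="tool_choice", description="Agent called the expected tool."),
        ScenarioTask(slug="single_reply", description="Agent did not duplicate its response."),
    ]
    supports_simulation = True  # the --simulated path uses deterministic local data

    def run(self, run_id: str, agent_id: str):
        # Inject messages, trigger processing, and record per-task results
        # through ScenarioExecutionTools here.
        ...


ScenarioRegistry.register(MyFeatureHappyPath)
```

Remember to import the module from `api/evals/loader.py` so the registration runs.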
## Adding suites
Suites are named groups in `api/evals/loader.py`:
```python
EvalSuite(
    slug="my_feature",
    description="Focused checks for my feature.",
    scenario_slugs=["my_feature_happy_path", "my_feature_guardrail"],
)
```
Use a focused suite when developers should run a related group repeatedly. The dynamic `all` suite already covers every registered scenario.
## Scoring and stop policies
Prefer small, explicit `ScenarioTask` records over one broad pass/fail. A good scenario says which behavior failed: discovery, tool choice, approval policy, output safety, stop condition, or final response quality.
For direct scoring helpers, keep pure scoring functions close to the scenario data and unit-test them. The Meta Gobii evals are the worked example: `api/evals/meta_gobii.py` defines cases and scoring, while `api/evals/scenarios/meta_gobii.py` executes simulated or live model calls through the canonical runner.
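An illustrative pure scoring helper in that spirit; the function name and case shape are made up here, not taken from `api/evals/meta_gobii.py`:

```python
def score_no_duplicate_reply(replies: list[str]) -> bool:
    """Pass when the agent never sends the same reply twice in a row."""
    return all(a != b for a, b in zip(replies, replies[1:]))


# Unit-test the helper directly, next to the scenario data:
assert score_no_duplicate_reply(["enabled skill", "task created"]) is True
assert score_no_duplicate_reply(["task created", "task created"]) is False
```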
For agent-processing evals, use `api/evals/stop_policy.py` through `ScenarioExecutionTools.trigger_processing(..., eval_stop_policy=...)` when the run should stop after a terminal tool call, all expected tool calls, a tracked human-input request, or an unexpected relevant tool. Keep stop policies narrow so unrelated tool noise does not hide a regression.
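A rough sketch of that call inside a scenario's `run(...)`; only the module path and the `eval_stop_policy` keyword come from the text above, so the policy class name and its constructor are placeholders:

```python
# StopAfterExpectedToolCalls is a placeholder name; see api/evals/stop_policy.py
# for the real policy classes and their constructors.
from api.evals.stop_policy import StopAfterExpectedToolCalls

# tools is the ScenarioExecutionTools instance for this run.
tools.trigger_processing(
    eval_stop_policy=StopAfterExpectedToolCalls(
        expected_tools=["meta_gobii_create_task"],  # hypothetical tool slug
    ),
)
```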
## Meta Gobii worked example
Meta Gobii lives in the canonical suite `meta_gobii`. Its scenarios check that a manager Gobii discovers and enables the Meta Gobii system skill only when needed, plans direct Meta Gobii tools, requires confirmation for mutations, avoids legacy `spawn_agent`, handles contact output safely, and does not duplicate the same response.
Use simulated Meta Gobii runs for quick local confidence:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Then run the same suite against a live routing profile before reporting model behavior:
```bash
set -a; source /Users/andrew/.env-openrouter >/dev/null; set +a
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-deepseek-v4-flash \
  --settings=config.eval_local_settings
```
## Best practices
- Use `api/evals` and `manage.py run_evals`; do not add standalone feature-specific eval commands.
- Keep unit tests, simulated evals, live evals, and official evals separate in code reviews and final reports.
- Never hardcode provider calls inside a feature eval. Route live models through `LLMRoutingProfile`.
- Never print, commit, or store inference key values. Refer to env var names only.
- Do not use `config.test_settings` for live `run_evals`.
- Do not conflate simulated and live model results.
- Prefer one scenario or one focused suite while iterating; run `all` only when the change needs whole-registry confidence.
- Keep eval-local setup explicit. `config.eval_local_settings` is for local SQLite evals, not production traffic.