# Evals

## Why evals exist
Gobii evals are executable product checks for agent behavior. They answer questions that unit tests cannot: did the agent choose the right tool, stop at the right time, ask for approval, avoid duplicate replies, and keep behavior stable when we change prompts, tools, routing profiles, or models?
The canonical eval system lives under `api/evals` and is launched with `uv run python manage.py run_evals`. Do not add one-off management commands for feature evals. Add scenarios and suites to the canonical registry instead.
## What to run
| Run type | Purpose | Command shape | Counts as |
|---|---|---|---|
| Unit tests | Prove Python code, scoring helpers, registration, and local setup behavior. | `uv run python manage.py test ... --settings=config.test_settings` | Unit test evidence only. |
| Simulated evals | Deterministic offline eval path for scenarios that declare simulation support. | `run_evals --simulated --settings=config.eval_local_settings` | Eval runner smoke, not live model quality. |
| Live evals | Run scenarios against a real model through an LLM routing profile. | `run_evals --routing-profile <profile> --settings=config.eval_local_settings` | Model behavior evidence. |
| Official evals | Durable comparison runs for tracked releases or model changes. | Add `--official` or `--run-type official`. | Official trend data. |
Keep those categories separate in reports. Passing unit tests does not mean evals passed, and a simulated eval does not mean a model passed.
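For example, an official live run might look like the following; the repeat count here is illustrative, and `<profile>` is whichever seeded routing profile you are tracking:

```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 3 \
  --routing-profile <profile> \
  --official \
  --settings=config.eval_local_settings
```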
## Architecture
| Piece | Location | Role |
|---|---|---|
| `EvalScenario` | `api/evals/base.py` | Base class for one behavioral scenario. |
| `ScenarioTask` | `api/evals/base.py` | The visible assertions/tasks recorded under each run. |
| `ScenarioRegistry` | `api/evals/registry.py` | Global scenario registry. |
| `SuiteRegistry` | `api/evals/suites.py` | Named groups of scenario slugs, plus the dynamic `all` suite. |
| `EvalRunner` | `api/evals/runner.py` | Executes one `EvalRun`, creates `EvalRunTask` rows, captures fingerprints and model metadata. |
| `EvalSuiteRun` | `api/models.py` | One suite invocation and its requested repeats, run type, launch config, and routing-profile snapshot. |
| `EvalRun` | `api/models.py` | One scenario execution for one agent. |
| `EvalRunTask` | `api/models.py` | One scored task/assertion inside a run. |
| `run_evals` | `api/management/commands/run_evals.py` | The only CLI entry point for canonical evals. |
`api/evals/loader.py` imports scenario modules and registers built-in suites. `run_evals` loads that registry directly, so a clean CLI invocation does not depend on test imports.
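A sketch of that pattern, assuming the loader relies on scenario-module imports for their registration side effects and registers suites with `SuiteRegistry`; the import paths and the registration call shape are assumptions, so check the real file:

```python
# Illustrative shape of api/evals/loader.py, not its actual contents.
from api.evals.scenarios import meta_gobii  # noqa: F401  (importing registers scenarios)
from api.evals.suites import EvalSuite, SuiteRegistry  # import path assumed

SuiteRegistry.register(  # registration call shape assumed
    EvalSuite(
        slug="meta_gobii",
        description="Meta Gobii manager behavior checks.",
        scenario_slugs=["meta_gobii_negative_content_task"],  # partial list, illustrative
    )
)
```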
## Local setup
For local eval development, use `config.eval_local_settings`. It creates and migrates `.local/eval-local.sqlite3`, runs Celery work eagerly, disables browser task execution, and seeds local LLM routing profiles. It is explicit and scoped to the eval-local SQLite database.
Do not use `config.test_settings` for live evals. Test settings mock and isolate runtime behavior for unit tests; live evals need the eval-local or real deployment settings path.
Run one eval-local command at a time. The local SQLite database is intentionally simple and can lock if multiple eval-local commands try to migrate or seed profiles concurrently.
List suites and scenarios without touching the database:
```bash
uv run python manage.py run_evals --list
```
Create the local SQLite schema, seed routing profiles, and list available non-snapshot profiles:
```bash
uv run python manage.py run_evals --list-routing-profiles --settings=config.eval_local_settings
```
The eval-local profiles store env var names only. They do not store API key values.
## Common commands
Run one simulated scenario:
```bash
uv run python manage.py run_evals \
  --scenario meta_gobii_negative_content_task \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Run one simulated suite:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Run every registered scenario:
```bash
uv run python manage.py run_evals \
  --suite all \
  --sync \
  --n-runs 1 \
  --settings=config.eval_local_settings
```
For live all-suite runs, pass `--routing-profile <profile>` and expect a larger bill and a longer run. Start with one scenario or one suite before running `all`.
## Live model runs
Source key files without printing values:
```bash
set -a; source /Users/andrew/.env-openrouter >/dev/null; set +a
```
OpenRouter DeepSeek V4 Flash:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-deepseek-v4-flash \
  --settings=config.eval_local_settings
```
OpenRouter Qwen:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-qwen \
  --settings=config.eval_local_settings
```
OpenAI:
```bash
set -a; source /path/to/your/openai.env >/dev/null; set +a
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openai-gpt-4-1-mini \
  --settings=config.eval_local_settings
```
Custom LiteLLM model string:
```bash
set -a; source /path/to/your/provider.env >/dev/null; set +a
EVAL_LOCAL_CUSTOM_MODEL="anthropic/claude-sonnet-4-20250514" \
EVAL_LOCAL_CUSTOM_API_KEY_ENV_VAR="ANTHROPIC_API_KEY" \
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile custom-litellm \
  --settings=config.eval_local_settings
```
For OpenAI-compatible local or proxy endpoints, also set `EVAL_LOCAL_CUSTOM_API_BASE` and use the model string expected by that endpoint.
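A minimal sketch, assuming a hypothetical OpenAI-compatible proxy on localhost; the model string, key env var name, and port below are placeholders, not values the eval system requires:

```bash
set -a; source /path/to/your/provider.env >/dev/null; set +a
# All three values below are illustrative; match them to your proxy's config.
EVAL_LOCAL_CUSTOM_MODEL="openai/my-proxy-model" \
EVAL_LOCAL_CUSTOM_API_KEY_ENV_VAR="MY_PROXY_API_KEY" \
EVAL_LOCAL_CUSTOM_API_BASE="http://localhost:4000/v1" \
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile custom-litellm \
  --settings=config.eval_local_settings
```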
## Reading results
The CLI prints each created `EvalSuiteRun`, each scheduled `EvalRun`, task status lines, pass rate, suite status, and audit links for created eval agents.
Inspect the latest local suite run:
```bash
uv run python manage.py shell --settings=config.eval_local_settings -c "\
from api.models import EvalSuiteRun; \
r = EvalSuiteRun.objects.prefetch_related('runs__tasks').latest('created_at'); \
print(r.id, r.suite_slug, r.status, r.run_type); \
[print(run.scenario_slug, run.status, run.primary_model, run.tasks.filter(status='passed').count(), '/', run.tasks.count()) for run in r.runs.all()]"
```
Compare model experiments by running the same scenario or suite with different `--routing-profile` values. `EvalRunner` snapshots the routing profile onto the suite run and records `EvalRun.primary_model`, `scenario_fingerprint`, `code_version`, and `code_branch`.
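A quick way to eyeball those fields across recent runs; this sketch assumes `EvalRun` also carries a `created_at` timestamp like `EvalSuiteRun` does:

```bash
uv run python manage.py shell --settings=config.eval_local_settings -c "\
from api.models import EvalRun; \
[print(run.primary_model, run.scenario_slug, run.status, run.code_branch) \
 for run in EvalRun.objects.order_by('-created_at')[:20]]"
```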
## Adding scenarios
Add new eval behavior under `api/evals/scenarios/`. A normal scenario should:

- Subclass `EvalScenario`.
- Define a stable `slug`, `version`, `description`, and `tasks`.
- Implement `run(self, run_id: str, agent_id: str)`.
- Use `ScenarioExecutionTools` when it needs to inject messages, trigger processing, or record task results.
- Register itself with `ScenarioRegistry.register(...)`.
- Ensure the module is imported by `api/evals/loader.py`.
- Add targeted unit tests tagged with an existing eval batch tag, usually `@tag("batch_eval_fingerprint")` unless a more specific registered tag fits.
Keep scenarios deterministic where practical. If a scenario supports offline execution, set `supports_simulation = True` and make `--simulated` use deterministic local data. If it needs a real model, do not pretend the simulated path measures model quality.
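A minimal sketch of such a scenario; the `ScenarioTask` constructor arguments and the exact `ScenarioRegistry.register(...)` signature are assumptions here, so check `api/evals/base.py` and `api/evals/registry.py` for the real API:

```python
# Hypothetical module at api/evals/scenarios/my_feature.py.
from api.evals.base import EvalScenario, ScenarioTask
from api.evals.registry import ScenarioRegistry


class MyFeatureHappyPath(EvalScenario):
    slug = "my_feature_happy_path"
    version = 1
    description = "Agent picks the expected tool and replies exactly once."
    # Field names on ScenarioTask are illustrative.
    tasks = [
        ScenarioTask(slug="tool_choice", description="Agent called the expected tool."),
        ScenarioTask(slug="single_reply", description="Agent did not duplicate its response."),
    ]
    supports_simulation = True  # the --simulated path uses deterministic local data

    def run(self, run_id: str, agent_id: str):
        # Inject messages, trigger processing, and record per-task results
        # through ScenarioExecutionTools here.
        ...


ScenarioRegistry.register(MyFeatureHappyPath)
```

Remember to import the module from `api/evals/loader.py` so the registration runs.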
## Adding suites
Suites are named groups in `api/evals/loader.py`:
```python
EvalSuite(
    slug="my_feature",
    description="Focused checks for my feature.",
    scenario_slugs=["my_feature_happy_path", "my_feature_guardrail"],
)
```
Use a focused suite when developers should run a related group repeatedly. The dynamic `all` suite already covers every registered scenario.
## Scoring and stop policies
Prefer small, explicit `ScenarioTask` records over one broad pass/fail. A good scenario says which behavior failed: discovery, tool choice, approval policy, output safety, stop condition, or final response quality.
For direct scoring helpers, keep pure scoring functions close to the scenario data and unit-test them. The Meta Gobii evals are the worked example: `api/evals/meta_gobii.py` defines cases and scoring, while `api/evals/scenarios/meta_gobii.py` executes simulated or live model calls through the canonical runner.
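An illustrative pure scoring helper in that spirit; the function name and case shape are made up here, not taken from `api/evals/meta_gobii.py`:

```python
def score_no_duplicate_reply(replies: list[str]) -> bool:
    """Pass when the agent never sends the same reply twice in a row."""
    return all(a != b for a, b in zip(replies, replies[1:]))


# Unit-test the helper directly, next to the scenario data:
assert score_no_duplicate_reply(["enabled skill", "task created"]) is True
assert score_no_duplicate_reply(["task created", "task created"]) is False
```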
For agent-processing evals, use `api/evals/stop_policy.py` through `ScenarioExecutionTools.trigger_processing(..., eval_stop_policy=...)` when the run should stop after a terminal tool call, all expected tool calls, a tracked human-input request, or an unexpected relevant tool. Keep stop policies narrow so unrelated tool noise does not hide a regression.
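A rough sketch of that call inside a scenario's `run(...)`; only the module path and the `eval_stop_policy` keyword come from the text above, so the policy class name and its constructor are placeholders:

```python
# StopAfterExpectedToolCalls is a placeholder name; see api/evals/stop_policy.py
# for the real policy classes and their constructors.
from api.evals.stop_policy import StopAfterExpectedToolCalls

# tools is the ScenarioExecutionTools instance for this run.
tools.trigger_processing(
    eval_stop_policy=StopAfterExpectedToolCalls(
        expected_tools=["meta_gobii_create_task"],  # hypothetical tool slug
    ),
)
```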
## Meta Gobii worked example
Meta Gobii lives in the canonical suite `meta_gobii`. Its scenarios check that a manager Gobii discovers and enables the Meta Gobii system skill only when needed, plans direct Meta Gobii tools, requires confirmation for mutations, avoids legacy `spawn_agent`, handles contact output safely, and does not duplicate the same response.
Use simulated Meta Gobii runs for quick local confidence:
```bash
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --simulated \
  --settings=config.eval_local_settings
```
Then run the same suite against a live routing profile before reporting model behavior:
```bash
set -a; source /Users/andrew/.env-openrouter >/dev/null; set +a
uv run python manage.py run_evals \
  --suite meta_gobii \
  --sync \
  --n-runs 1 \
  --routing-profile openrouter-deepseek-v4-flash \
  --settings=config.eval_local_settings
```
## Best practices
- Use `api/evals` and `manage.py run_evals`; do not add standalone feature-specific eval commands.
- Keep unit tests, simulated evals, live evals, and official evals separate in code reviews and final reports.
- Never hardcode provider calls inside a feature eval. Route live models through `LLMRoutingProfile`.
- Never print, commit, or store inference key values. Refer to env var names only.
- Do not use `config.test_settings` for live `run_evals`.
- Do not conflate simulated and live model results.
- Prefer one scenario or one focused suite while iterating; run `all` only when the change needs whole-registry confidence.
- Keep eval-local setup explicit. `config.eval_local_settings` is for local SQLite evals, not production traffic.