Evaluation infrastructure / release operations

AI Eval Control Tower

A cloneable model-evaluation system that turns quality, safety, cost, latency, and score provenance into a PM-readable launch decision.

Models: 7 registered
Domains: 3 workflows
Finding: Safety floor fail
Use: Release gate


Quick read 01 / operating layer

Most AI teams still lack a clear operating layer for model decisions.

Prompts sit in one spreadsheet, model experiments in a notebook, latency in a separate chart, and the launch debate somewhere else again, so the evidence stays fragmented. This tool makes the decision chain inspectable.

7.64 safety score

The current SproutRoute production model fell below the 8.0 safety floor.

87.8 top quality

Sonnet 4.6 led the field, but with latency and cost tradeoffs.

19x cost spread

The matrix makes "good enough and cheaper" a provable product decision.

Manual live evals

PR checks run without secrets, so they stay safe; live OpenRouter runs are triggered only through a manual workflow.
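The split between secret-free PR checks and manual live runs can be sketched as a small guard. This is a minimal illustration, not the tool's actual code: it assumes the manual workflow is a GitHub Actions `workflow_dispatch` run (detected through the standard `GITHUB_EVENT_NAME` variable) and that the OpenRouter key arrives in an environment variable named `OPENROUTER_API_KEY`. The function name `live_evals_allowed` is hypothetical.

```python
import os
from typing import Mapping


def live_evals_allowed(env: Mapping[str, str]) -> bool:
    """Allow live model calls only on a manually triggered run with a key.

    Assumptions (not confirmed by the page): the manual workflow is a GitHub
    Actions `workflow_dispatch` event, and the key is exposed as
    OPENROUTER_API_KEY.
    """
    manual = env.get("GITHUB_EVENT_NAME") == "workflow_dispatch"
    has_key = bool(env.get("OPENROUTER_API_KEY"))
    return manual and has_key


if __name__ == "__main__":
    # PR checks land in "offline" mode because they carry no secret
    # and are not triggered by workflow_dispatch.
    mode = "live" if live_evals_allowed(os.environ) else "offline"
    print(f"eval mode: {mode}")
```

Requiring both conditions means a pull request cannot trigger paid or secret-bearing calls even if a key were accidentally exposed to it.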

Benchmarks 02 / decision

Seven models, one rubric, one decision.

Quality leader

Sonnet 4.6

Quality 87.8, safety 9.04, high latency and cost.

Best tradeoff

Gemini 2.5 Flash

Cleared safety, faster than production, lower cost.

Blocked path

Haiku 4.5

The production model scored 7.64 on safety, below the 8.0 floor.

Evidence

Judge citations

The judge's citations surfaced car seat law, venue closure, and infeasible routing errors.
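The safety-floor part of the release gate can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the tool's implementation: `EvalResult` and `release_verdict` are hypothetical names, a score exactly at the floor is assumed to pass, and the 7.64 score is attributed to Haiku 4.5 on the reading that it is the production model described above. Only safety scores quoted on this page are used.

```python
from dataclasses import dataclass

SAFETY_FLOOR = 8.0  # safety floor quoted on this page


@dataclass(frozen=True)
class EvalResult:
    """One model's rubric result (hypothetical shape, safety only)."""
    model: str
    safety: float


def release_verdict(result: EvalResult, floor: float = SAFETY_FLOOR) -> str:
    """Return "blocked" when safety is below the floor, else "eligible".

    Assumption: a score exactly equal to the floor is treated as passing.
    """
    return "blocked" if result.safety < floor else "eligible"


results = [
    EvalResult("Sonnet 4.6", safety=9.04),
    EvalResult("Haiku 4.5", safety=7.64),
]
for r in results:
    print(f"{r.model}: {release_verdict(r)}")
# prints:
# Sonnet 4.6: eligible
# Haiku 4.5: blocked
```

A fuller gate would weigh quality, latency, and cost only among models that clear the floor. Those figures are left out here because the page does not state them per model.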

Artifact 03 / live build

The working tool is part of the case study.

In production, this page should link to the live `/evals/` build, the GitHub runbook, the evidence chain, and the current release verdict. The portfolio claim becomes inspectable.

Open eval tool