The current SproutRoute production model fell below the 8.0 safety floor.
Evaluation infrastructure / release operations
AI Eval Control Tower
A cloneable model-evaluation system that turns quality, safety, cost, latency, and score provenance into a PM-readable launch decision.
- Models: 7 registered
- Domains: 3 workflows
- Finding: Safety floor fail
- Use: Release gate
Most AI teams still lack a clear operating layer for model decisions.
Prompts live in one spreadsheet, model experiments in a notebook, latency in a separate chart, and the launch call in yet another debate, so the evidence ends up fragmented. This tool makes the decision chain inspectable.
Sonnet 4.6 led the field, but with latency and cost tradeoffs.
The matrix makes "good enough and cheaper" a provable product decision.
PR checks run without secrets, so they stay safe; live OpenRouter runs happen only through a manually triggered workflow.
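The split between secret-free PR checks and manual live runs could be sketched as below. This is a minimal illustration, not the tool's actual code: it assumes the workflow runs on GitHub Actions (where `GITHUB_EVENT_NAME` is set to `workflow_dispatch` for manual triggers), and the `OPENROUTER_API_KEY` variable name and the `select_eval_mode` helper are hypothetical.

```python
import os
from typing import Mapping, Optional


def select_eval_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "live" only for a manual run that has an API key, else "offline".

    Assumptions (not confirmed by the source): the CI is GitHub Actions,
    and the key is exposed as OPENROUTER_API_KEY.
    """
    env = os.environ if env is None else env
    # GitHub Actions sets GITHUB_EVENT_NAME; manual runs use "workflow_dispatch".
    manual = env.get("GITHUB_EVENT_NAME") == "workflow_dispatch"
    has_key = bool(env.get("OPENROUTER_API_KEY"))
    # PR checks never reach the live path, even if a key leaks into the env.
    return "live" if (manual and has_key) else "offline"
```

Under this sketch, a pull request always evaluates against recorded fixtures, and a live run needs both a manual trigger and a configured key.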
Seven models, one rubric, one decision.
Sonnet 4.6
Quality 87.8, safety 9.04, high latency and cost.
Gemini 2.5 Flash
Cleared safety, faster than production, lower cost.
Haiku 4.5
Production model scored below the safety floor.
Judge citations
Surfaced errors involving car seat law, venue closures, and infeasible routing.
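The release gate and the "good enough and cheaper" choice described above can be sketched roughly as follows. Only the 8.0 safety floor comes from this page; the `EvalResult` fields, the function names, the cost unit, and the `model-a`/`model-b`/`model-c` rows are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

SAFETY_FLOOR = 8.0  # the safety floor stated on this page


@dataclass(frozen=True)
class EvalResult:
    model: str
    quality: float       # rubric quality score
    safety: float        # rubric safety score
    cost_per_run: float  # hypothetical unit, for illustration only


def passes_gate(result: EvalResult, safety_floor: float = SAFETY_FLOOR) -> bool:
    """A model clears the gate only if its safety score meets the floor."""
    return result.safety >= safety_floor


def cheapest_good_enough(
    results: Iterable[EvalResult],
    min_quality: float,
    safety_floor: float = SAFETY_FLOOR,
) -> Optional[EvalResult]:
    """Pick the cheapest model that clears both the safety floor and a quality bar."""
    eligible = [
        r for r in results
        if passes_gate(r, safety_floor) and r.quality >= min_quality
    ]
    return min(eligible, key=lambda r: r.cost_per_run, default=None)


# Placeholder rows with made-up numbers, NOT results from the eval matrix.
EXAMPLE_RESULTS = [
    EvalResult("model-a", quality=88.0, safety=9.0, cost_per_run=3.0),
    EvalResult("model-b", quality=82.0, safety=8.4, cost_per_run=0.5),
    EvalResult("model-c", quality=80.0, safety=7.6, cost_per_run=0.4),
]
```

With these placeholder rows, `cheapest_good_enough(EXAMPLE_RESULTS, min_quality=80.0)` selects `model-b`: `model-c` is cheaper but fails the safety floor, and `model-a` passes but costs more.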
The working tool is part of the case study.
In production, this page should link to the live `/evals/` build, the GitHub runbook, the evidence chain, and the current release verdict. The portfolio claim becomes inspectable.
Open eval tool