Claude Haiku 4.5 scored 7.64 on safety, below the 8.0 hard floor. Haiku 4.5 is the model actually serving SproutRoute traffic. The gate flagged it automatically, which is exactly the job the tool is supposed to do.
AI Eval Control Tower
This project exists because shipping the model is only the start. The harder question is knowing when the model is getting worse, when a cheaper one is good enough, and when a release should be blocked. The live tool now starts with SproutRoute because it caught a real safety-floor failure, then extends the same operating model to seller-growth recommendations and SproutMath content authoring.
Best read for AI quality, model economics, and release decisions treated as one operating system.
Most AI teams still lack a clear operating layer for model decisions.
Claude Sonnet 4.6 leads the 7-model field on quality at 87.8. GPT-5 Mini trails by under one point. Gemini 2.5 Flash sits lower on quality but wins on speed and cost.
Cost per inference ranges from $0.0009 (Grok 4 Fast) to $0.016 (Sonnet 4.6), roughly an 18× spread. The tool makes the “is the cheaper model good enough?” decision provable instead of vibes-based.
AI quality is usually managed with fragments.
Teams often have one spreadsheet for prompts, one notebook for model experiments, another chart for latency, and a separate conversation about release criteria. That makes it hard to treat model performance like an operational system.
- N-way LLM-as-judge evaluation — any number of models through a single OpenRouter API key.
- Domain-weighted rubric: SproutRoute prompts scored on safety and logistical feasibility, seller prompts on accuracy and actionability, and SproutMath authoring on answer validity, grade fit, accessibility, and child safety.
- A live launch-review dashboard that connects the decision memo, evidence chain, candidate board, failure modes, operating envelope, rollout plan, and local runbook.
- Cloneable GitHub workflow: copy .env.example, add OPENROUTER_API_KEY, run a dataset-specific eval, then apply the release gate.
- Release gate with dataset-specific overrides: a 15-second latency ceiling for trip plans, 3 seconds for seller Q&A.
- Secret-free PR checks run install, tests, and build; live OpenRouter evals run only from a manual GitHub Actions workflow.
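A minimal sketch of how a domain-weighted rubric like the one above could fold per-dimension judge scores into one quality number. The dimension names come from the list above. The numeric weights, the 0-10 judge scale, the 0-100 output scale, and the names RUBRIC_WEIGHTS and weighted_quality are all assumptions made for illustration, not the tool's actual configuration.

```python
from typing import Mapping

# Hypothetical weights per dataset. Dimension names follow the rubric described
# above; the numbers themselves are illustrative assumptions.
RUBRIC_WEIGHTS: dict[str, dict[str, float]] = {
    "sproutroute": {"safety": 0.6, "logistical_feasibility": 0.4},
    "seller": {"accuracy": 0.5, "actionability": 0.5},
    "sproutmath": {
        "answer_validity": 0.4,
        "grade_fit": 0.2,
        "accessibility": 0.2,
        "child_safety": 0.2,
    },
}


def weighted_quality(dataset: str, scores: Mapping[str, float]) -> float:
    """Combine 0-10 judge scores into a single 0-100 quality score."""
    weights = RUBRIC_WEIGHTS[dataset]
    missing = set(weights) - set(scores)
    if missing:
        # Refuse to score a response the judge only partially graded.
        raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted = sum(weights[dim] * scores[dim] for dim in weights)
    # Normalize by total weight, then rescale from 0-10 to 0-100.
    return round(10 * weighted / total_weight, 1)


if __name__ == "__main__":
    print(weighted_quality("seller", {"accuracy": 8.0, "actionability": 6.0}))
```

Keeping the weights in one table per dataset means the same judge output can be rescored if a team later decides that, say, safety deserves more weight for trip plans.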
This is evaluation as product operations, not a data-science exercise. When my production SproutRoute model came back at 7.64 safety — below the 8.0 floor — the tool flagged it before a user would have encountered a wrong car seat law or a closed venue. The dashboard now makes the trust chain visible too: dataset, judge, score method, limitations, and local rerun path.
- 7 models benchmarked: Claude Sonnet 4.6 & Haiku 4.5, GPT-5 Mini & Nano, Grok 4 Fast, Gemini 2.5 Flash, DeepSeek-V3.2.
- Three product scenarios in the dashboard order: SproutRoute itinerary generation with safety tips, seller-growth AM recommendations, and SproutMath content authoring.
- Seven dashboard views: decision memo, evidence chain, candidate board, rubric and failure modes, operating envelope, rollout plan, and run locally.
- 15 unit tests on judge and gate logic; mobile-responsive UI.
Seven models, one rubric, one decision.
Claude Sonnet 4.6
Top of the field on accuracy, actionability, and nuanced safety reasoning. The price you pay is a 26-second p95 latency and roughly 6× the cost of Gemini 2.5 Flash (about 18× the cheapest option, Grok 4 Fast).
- Quality: 87.8
- Safety: 9.04 ✓
- Cost per inference: $0.016
- P95 latency: 26.1s
Gemini 2.5 Flash
Cleared the safety floor, 2.3× faster than the current production model, and half the cost — the model I would actually promote after this run. The matrix makes that trade explicit rather than narrative.
- Quality: 78.6
- Safety: 8.16 ✓
- Cost per inference: $0.0025
- P95 latency: 9.8s
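The trade between the two cards above can be worked out directly from their published numbers. This is a small arithmetic sketch; the dictionary layout and the tradeoff helper are hypothetical names, and only the figures themselves come from the cards.

```python
# Figures copied from the two model cards above.
SONNET = {"quality": 87.8, "cost_usd": 0.016, "p95_s": 26.1}
GEMINI = {"quality": 78.6, "cost_usd": 0.0025, "p95_s": 9.8}


def tradeoff(premium: dict[str, float], budget: dict[str, float]) -> dict[str, float]:
    """Summarize what the premium model buys and what it costs."""
    return {
        "quality_gap": round(premium["quality"] - budget["quality"], 1),
        "cost_ratio": round(premium["cost_usd"] / budget["cost_usd"], 1),
        "latency_ratio": round(premium["p95_s"] / budget["p95_s"], 2),
    }


if __name__ == "__main__":
    print(tradeoff(SONNET, GEMINI))
```

On these numbers, Sonnet 4.6 buys about 9.2 quality points over Gemini 2.5 Flash at 6.4× the cost per inference and 2.66× the p95 latency.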
Claude Haiku 4.5, the model currently serving SproutRoute production traffic, scored 7.64 on safety, below the 8.0 hard floor. Grok 4 Fast failed the same check at 7.8. The other five (Sonnet 4.6, GPT-5 Mini, GPT-5 Nano, Gemini 2.5 Flash, DeepSeek-V3.2) cleared the safety floor, but only Gemini paired that pass with sub-15-second p95 latency and lower cost than the current model. The release gate therefore exits NO-GO for the current production path. The manual live-eval workflow preserves the evidence without exposing API keys to PRs, and a PM has a concrete path: promote Gemini, or explicitly document why another safety-passing model is worth its latency or cost tradeoff.
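The gate logic described here can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: the names Candidate, gate_failures, and decide are hypothetical, the cost comparison against the current model is left out because the current model's cost is not given here, and a missing latency is simply skipped, where a production gate would more likely treat it as a failure. The safety floor, latency ceilings, and model scores are the ones quoted in this write-up.

```python
from dataclasses import dataclass
from typing import Optional

SAFETY_FLOOR = 8.0  # hard floor quoted in the write-up
# Per-dataset p95 latency ceilings in seconds, as quoted in the write-up.
P95_CEILING_S = {"sproutroute": 15.0, "seller": 3.0}


@dataclass(frozen=True)
class Candidate:
    name: str
    safety: float
    p95_latency_s: Optional[float] = None  # None when not reported here


def gate_failures(candidate: Candidate, dataset: str) -> list[str]:
    """Return the reasons a candidate fails the gate; empty means it passes."""
    failures: list[str] = []
    if candidate.safety < SAFETY_FLOOR:
        failures.append(f"safety {candidate.safety} < floor {SAFETY_FLOOR}")
    ceiling = P95_CEILING_S.get(dataset)
    latency = candidate.p95_latency_s
    # Assumption: an unreported latency is skipped rather than failed.
    if ceiling is not None and latency is not None and latency > ceiling:
        failures.append(f"p95 {latency}s > ceiling {ceiling}s")
    return failures


def decide(candidate: Candidate, dataset: str) -> str:
    return "NO-GO" if gate_failures(candidate, dataset) else "GO"


# Scores taken from the results above; latencies only where they are stated.
FIELD = [
    Candidate("Claude Haiku 4.5", safety=7.64),
    Candidate("Grok 4 Fast", safety=7.8),
    Candidate("Claude Sonnet 4.6", safety=9.04, p95_latency_s=26.1),
    Candidate("Gemini 2.5 Flash", safety=8.16, p95_latency_s=9.8),
]

if __name__ == "__main__":
    for model in FIELD:
        reasons = gate_failures(model, "sproutroute")
        print(model.name, decide(model, "sproutroute"), reasons)
```

Returning the list of reasons instead of a bare boolean keeps the NO-GO explainable: the same output can feed a CI exit code and a decision memo.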
The judge cited real errors in the failing responses: a California car seat law misstatement, Splash Mountain recommended after its 2023 closure, and an impossible overnight Paris→Barcelona train routing. These are the kinds of low-frequency safety bugs that slip past eyeball QA, which is why the gate is rubric-weighted, not vibes-weighted.
Good eval tooling doesn’t just score models. It tells a team which one to ship — and why.
The dashboard matters because it keeps quality, safety, cost, latency, content gates, score provenance, and rollout ownership in the same room. That’s the only way the tradeoffs stay honest enough to drive a launch decision. The biggest move is turning each eval into a PM-readable decision memo with an inspectable evidence chain and a GitHub runbook, not another chart collection.
Why it matters
- I think about AI products after the first launch, not just before it.
- I know how to turn evaluation into operating logic a team can act on — including my own.
- I ran the tool against my own shipped product and it found a real regression. That is the acceptance test for any governance system.
The working tool is part of the case study.
/evals/.