Evaluation infrastructure / release operations

AI Eval Control Tower

A cloneable model-evaluation system that turns quality, safety, cost, latency, and score provenance into a PM-readable launch decision.

Artifact replay / Browser-local result import

Models: 7 registered
Domains: 3 workflows
Finding: Safety floor fail
Use: Release gate

Decision: Combine model quality, safety, latency, cost, and provenance in one release verdict.
Why: Separate notebooks and spreadsheets left launch owners without a comparable evidence chain.
Result: The benchmark exposed a production-model safety-floor failure and identified a stronger routing option.

What this proves / Operating mechanism

AI quality should be an operating mechanism, not a subjective demo review.

This case shows how I build that mechanism: eval rubrics, safety floors, latency/cost tradeoffs, score provenance, and explicit launch decisions. The Control Tower serves as the enterprise platform's decision plane.

See its role in the platform operating model

Quick read / Operating layer

Most AI teams still lack a clear operating layer for model decisions.

Prompts live in one spreadsheet, model experiments in a notebook, latency in a separate chart, and the launch call in its own debate, so the evidence is fragmented. This tool makes the decision chain inspectable.

7.64 safety score

The current SproutRoute production model fell below the 8.0 safety floor.

87.8 top quality

Sonnet 4.6 led the field, but with latency and cost tradeoffs.

19x cost spread

The matrix makes "good enough and cheaper" a provable product decision.

Manual live evals

Secret-free PR checks stay safe; live OpenRouter runs happen only by manual workflow.
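
The safety-floor finding above comes down to a simple gate. Here is a minimal sketch, assuming a 0-10 safety scale and a hypothetical `CandidateScore` shape; the field names and the `checkSafetyFloor` helper are illustrative, not the project's actual schema or code. Only the 8.0 floor and the 7.64 and 9.04 scores come from the case study.

```typescript
// Hypothetical shape: field names are assumptions, not the project's schema.
interface CandidateScore {
  model: string;
  safety: number; // assumed 0-10 scale, implied by the 8.0 floor
}

// Floor value taken from the case study.
const SAFETY_FLOOR = 8.0;

interface GateResult {
  model: string;
  passed: boolean;
  margin: number; // safety minus floor; negative means blocked
}

function checkSafetyFloor(c: CandidateScore, floor: number = SAFETY_FLOOR): GateResult {
  // Round to two decimals so float noise does not leak into the report.
  const margin = Math.round((c.safety - floor) * 100) / 100;
  return { model: c.model, passed: c.safety >= floor, margin };
}

// Scores reported in the case study.
const production = checkSafetyFloor({ model: "SproutRoute production", safety: 7.64 });
const leader = checkSafetyFloor({ model: "Sonnet 4.6", safety: 9.04 });

console.log(production); // passed: false, margin: -0.36
console.log(leader);     // passed: true,  margin: 1.04
```

Treating the floor as a hard gate rather than one weighted term means a strong quality score cannot buy back a safety miss.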

Benchmarks / Decision

Seven models, one rubric, one decision.

Quality leader

Sonnet 4.6

Quality 87.8, safety 9.04, high latency and cost.

Best tradeoff

Gemini 2.5 Flash

Cleared safety, faster than production, lower cost.

Blocked path

Haiku 4.5

Production model scored below the safety floor.

Evidence

Judge citations

The judge's citations surfaced car seat law, venue closure, and infeasible routing errors.
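
The "best tradeoff" reading (cleared safety, faster than production, lower cost) can be expressed as a filter over the benchmark matrix. The sketch below is one way to do that under stated assumptions: the `Candidate` shape, the `betterTradeoffs` and `costSpread` helpers, and every number in the sample data are hypothetical placeholders, not benchmark results.

```typescript
// Hypothetical row shape for the benchmark matrix.
interface Candidate {
  model: string;
  quality: number;
  safety: number;
  latencyMs: number;
  costUnits: number; // arbitrary cost units, for illustration only
}

// Candidates that clear the floor and beat the baseline on latency and cost,
// best quality first.
function betterTradeoffs(cs: Candidate[], baseline: Candidate, safetyFloor: number): Candidate[] {
  return cs
    .filter((c) => c.model !== baseline.model)
    .filter((c) => c.safety >= safetyFloor)
    .filter((c) => c.latencyMs < baseline.latencyMs && c.costUnits < baseline.costUnits)
    .sort((a, b) => b.quality - a.quality);
}

// Ratio of the most expensive to the cheapest candidate.
function costSpread(cs: Candidate[]): number {
  const costs = cs.map((c) => c.costUnits);
  return Math.max(...costs) / Math.min(...costs);
}

// Placeholder data: NOT the benchmark's numbers.
const baseline: Candidate = { model: "baseline", quality: 78, safety: 7.5, latencyMs: 2000, costUnits: 2 };
const field: Candidate[] = [
  baseline,
  { model: "candidate-a", quality: 90, safety: 9.0, latencyMs: 4000, costUnits: 8 },
  { model: "candidate-b", quality: 82, safety: 8.5, latencyMs: 1200, costUnits: 1 },
  { model: "candidate-c", quality: 75, safety: 7.0, latencyMs: 900, costUnits: 0.5 },
];

const options = betterTradeoffs(field, baseline, 8.0);
console.log(options.map((c) => c.model)); // ["candidate-b"]
console.log(costSpread(field));           // 16
```

In this placeholder data the quality leader drops out because it is slower and more expensive than the baseline, and the cheapest model drops out on safety.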

Artifact / Live build

The evidence explorer is part of the case study.

The explorer derives candidates, dimensions, verdicts, latency, cost, safety floors, judge metadata, and recommendations from selected JSON artifacts. A validated local import supports new evidence without transmitting the file.
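
A validated local import can be sketched as a pure function over the file's text, so nothing needs to leave the browser. This is an illustration, not the explorer's actual contract: the `generatedAt`, `safetyFloor`, and `candidates` fields and both helper names are assumptions.

```typescript
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of contract errors; an empty list means the artifact is usable.
// Field names below are assumed for illustration.
function validateArtifact(raw: unknown): string[] {
  if (!isRecord(raw)) return ["artifact must be a JSON object"];
  const errors: string[] = [];

  const generatedAt = raw["generatedAt"];
  if (typeof generatedAt !== "string" || Number.isNaN(Date.parse(generatedAt))) {
    errors.push("generatedAt must be a parseable date string");
  }

  if (typeof raw["safetyFloor"] !== "number") {
    errors.push("safetyFloor must be a number");
  }

  const candidates = raw["candidates"];
  if (!Array.isArray(candidates) || candidates.length === 0) {
    errors.push("candidates must be a non-empty array");
  } else {
    candidates.forEach((c: unknown, i: number) => {
      const ok =
        isRecord(c) &&
        typeof c["model"] === "string" &&
        typeof c["quality"] === "number" &&
        typeof c["safety"] === "number";
      if (!ok) errors.push(`candidates[${i}] is missing model, quality, or safety`);
    });
  }
  return errors;
}

// Parses and validates in-process; no network call is involved.
function importArtifact(text: string): string[] {
  try {
    return validateArtifact(JSON.parse(text));
  } catch {
    return ["file is not valid JSON"];
  }
}
```

In a browser, the text could come from `await file.text()` on a `File` chosen through a file input, which reads the file locally.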

Evidence contract / Artifact replay

The frontend contains no competing metric source.

Run npm test && npm run eval:sproutroute:full && npm run eval:seller:v3 in the source branch. The explorer validates the result contract, calculates artifact age, enforces the recorded policy-violation floor, and refuses to score an import with missing evidence.
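
Two of those checks, artifact age and refusing to score incomplete evidence, might look like the sketch below. It assumes a hypothetical `ImportedResult` shape with optional `safety`, `policyViolationFloor`, and `judgeCitations` fields; the explorer's real contract may differ.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between the artifact timestamp and `now`.
function artifactAgeDays(generatedAt: string, now: Date): number {
  const t = Date.parse(generatedAt);
  if (Number.isNaN(t)) throw new Error("generatedAt is not a valid date");
  return Math.floor((now.getTime() - t) / MS_PER_DAY);
}

// Hypothetical import shape: every evidence field may be absent.
interface ImportedResult {
  model: string;
  safety?: number;
  policyViolationFloor?: number; // floor recorded in the artifact itself
  judgeCitations?: string[];
}

type Outcome =
  | { status: "refused"; missing: string[] }
  | { status: "scored"; passed: boolean };

function scoreImport(r: ImportedResult): Outcome {
  const { safety, policyViolationFloor: floor, judgeCitations } = r;
  const missing: string[] = [];
  if (safety === undefined) missing.push("safety");
  if (floor === undefined) missing.push("policyViolationFloor");
  if (!judgeCitations || judgeCitations.length === 0) missing.push("judgeCitations");

  // Refuse rather than guess: no default floor, no partial score.
  if (safety === undefined || floor === undefined || missing.length > 0) {
    return { status: "refused", missing };
  }
  return { status: "scored", passed: safety >= floor };
}
```

Returning a `refused` outcome with the list of missing fields, instead of a score of zero, keeps an incomplete import from being read as a failed model.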

Evidence manifest ↗ / SproutRoute artifact ↗ / Seller artifact ↗