Live Last run · 2026-07-18 17:59 UTC · 37 models · 8 providers · 5 tasks

State of the Fleet

A living public benchmark of the AI models behind a multi-agent fleet: versioned, objectively graded, and documented in the open. The question isn't “which model is best?” It's “which model do I route each job to?”

This page is a live experiment, documented in public. Every result is fingerprinted to its task, grader, and harness version; nothing is hand-curated. When a grader gets stricter, old results are superseded and the audit trail records it.

fleet.arnao.ai · Author: Byron Arnao · Generated 2026-07-18 17:59 UTC

The stack, in motion

How the fleet reaches its models

There is no central router: each agent holds its own provider credentials and a tiered fallback chain, connecting directly to the model APIs it is allowed to use. The bubbles below are live request/response traffic along each agent's primary model path — this is the wiring the rest of the page benchmarks.

Reading top-to-bottom: Hardware (Mac Mini, Windows PC) hosts the agent containers, and every agent connects directly to its providers. Gia (Fable 5 primary) carries the full deck: direct Anthropic, Google, OpenAI, xAI and DeepSeek keys, plus OpenRouter as a routed path (how kimi-k3 and glm are reached) and local Ollama. Mia runs Gemini-flash with Anthropic and Ollama fallbacks; Tia is local-first and capped at T2 (Haiku / Gemini-flash); Cia is local-only by design — no cloud keys at all; Zia runs Sonnet on the Windows box. Solid edges are primary model paths, dashed grey edges are direct fallback tiers.
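
A minimal sketch of what a per-agent tiered fallback chain could look like, assuming each agent tries its tiers in order and stops at the first success. The chain contents are paraphrased from the description above; the tier labels, the `send` callable, and the `AllTiersFailed` error are hypothetical names for illustration, not the fleet's actual configuration.

```python
# Hypothetical per-agent chains. Primary tier first, fallbacks after it.
# The fallback labels are placeholders, not real model identifiers.
FALLBACK_CHAINS = {
    "mia": ["gemini-flash", "anthropic-fallback", "ollama-local"],
    "cia": ["ollama-local"],  # local-only by design: no cloud keys
}


class AllTiersFailed(RuntimeError):
    """Raised when every tier in an agent's chain has failed."""


def call_with_fallback(agent, prompt, send):
    """Try each tier in order and return (model_used, response)."""
    errors = []
    for model in FALLBACK_CHAINS[agent]:
        try:
            return model, send(model, prompt)
        except Exception as exc:  # a real harness would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise AllTiersFailed("; ".join(errors))


def fake_send(model, prompt):
    """Stand-in transport: the primary tier is down, every other tier answers."""
    if model == "gemini-flash":
        raise TimeoutError("primary unavailable")
    return f"[{model}] ok"


print(call_with_fallback("mia", "ping", fake_send))
```

Because the chain lives with the agent rather than in a shared service, an agent such as Cia simply has a one-element chain and can never reach a cloud tier.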

The fleet right now

Winners — current run

Auto-computed from the latest non-superseded results only. “Free” means a model that truly runs at $0 on the local node — grok variants are pricing-unconfirmed and are never labeled free.

Top model

gemini-3.1-flash-lite

Best overall — fastest-correct across 4 graded tasks (avg speed-rank 2.0).

Best free / local

llama3.1:8b

Truly free (runs on the local node). 3 correct across tested tasks (math, code, summarize), $0 marginal cost.

Best vision

grok-4.20-non-reasoning

Fastest correct image read: 0.91s (xAI).

Lowest cost

gpt-4.1-nano

Correct at $0.000013/call — the cloud price floor (code task).

Most efficient

gpt-4.1-nano

Best correct-per-dollar-second: $0.000013 in 0.86s (code).
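
The two derived figures on these cards can be reproduced from the results tables under stated assumptions. This sketch assumes "avg speed-rank" means a model's rank by median time among the models that passed each graded task, and that "correct-per-dollar-second" means minimising cost multiplied by latency among priced, correct models. Both definitions are guesses; the dashboard's actual formulas are not published here. The rows are the fastest few entries of each table.

```python
# (model, median_seconds, correct) copied from the fastest rows of each table.
RESULTS = {
    "math": [("gemini-2.5-flash-lite", 0.65, True),
             ("gemini-3.1-flash-lite", 0.91, True),
             ("gpt-4o", 1.19, True)],
    "code": [("grok-4.20-non-reasoning", 0.66, True),
             ("gemini-3.1-flash-lite", 0.71, True),
             ("gpt-5.4-mini", 0.75, True)],
    "summarize": [("grok-4.20-non-reasoning", 0.62, False),
                  ("gpt-4.1-nano", 0.76, True),
                  ("gemini-3.1-flash-lite", 0.79, True)],
    "vision": [("grok-4.20-non-reasoning", 0.91, True),
               ("gemini-3.1-flash-lite", 1.36, True),
               ("gpt-4o", 1.72, True)],
}


def avg_speed_rank(model, results):
    """Average rank by median time, counting only models that were correct."""
    ranks = []
    for rows in results.values():
        correct = sorted((r for r in rows if r[2]), key=lambda r: r[1])
        names = [r[0] for r in correct]
        if model in names:
            ranks.append(names.index(model) + 1)
    return sum(ranks) / len(ranks) if ranks else None


# (model, median_seconds, cost_per_call) for priced, correct code-task rows.
CODE_PRICED = [("gemini-3.1-flash-lite", 0.71, 0.000031),
               ("gpt-5.4-mini", 0.75, 0.000105),
               ("gpt-4.1-nano", 0.86, 0.000013),
               ("gpt-4.1-mini", 1.08, 0.000026)]

most_efficient = min(CODE_PRICED, key=lambda r: r[1] * r[2])

print(avg_speed_rank("gemini-3.1-flash-lite", RESULTS))  # 2.0
print(most_efficient[0])                                 # gpt-4.1-nano
```

Ranking among correct models only is what makes the summarize task count as rank 2 for gemini-3.1-flash-lite: the fastest row in that table failed the grader.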

Executive overview

This cycle ran 37 models across 8 providers through 5 task types under one harness. The headline: correctness has largely collapsed as a differentiator (math is solved fleet-wide and code is near-saturated at 34/36), so the real signal now lives in speed, cost, and instruction-following discipline. The most discriminating task was the 50-word summary: only 21/36 models landed within the 50 ±2-word budget. That is where the ability to obey a hard constraint, not raw intelligence, shows up.

Best overall this run was gemini-3.1-flash-lite; the cautionary tail was phi4:latest, which missed 3 graded tasks. My routing read, by objective: optimize for raw cost → gpt-4.1-nano at $0.000013/call; for speed → gemini-3.1-flash-lite; for vision → grok-4.20-non-reasoning (0.91s); for local / $0 → llama3.1:8b; for best price-per-correct-second → gpt-4.1-nano. That is the actual decision I encode in each agent's model chain: not a single ‘best’ model, but the right model dispatched per job type.
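
The routing read above can be written down as a small lookup table. This is only an illustrative sketch: the objective keys and the `route` helper are hypothetical names, and the model choices are the ones named in this run's summary.

```python
# Objective -> model, taken from this run's routing read.
ROUTE_BY_OBJECTIVE = {
    "cost": "gpt-4.1-nano",
    "speed": "gemini-3.1-flash-lite",
    "vision": "grok-4.20-non-reasoning",
    "local": "llama3.1:8b",
    "efficiency": "gpt-4.1-nano",
}


def route(objective):
    """Return the model for an objective, failing loudly on unknown ones."""
    try:
        return ROUTE_BY_OBJECTIVE[objective]
    except KeyError:
        known = ", ".join(sorted(ROUTE_BY_OBJECTIVE))
        raise ValueError(f"unknown objective {objective!r}; expected one of: {known}")


print(route("vision"))
```

Since the winners are recomputed on every run, a table like this would need to be regenerated from the result store rather than edited by hand.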

I run this in public as a deliberate, controlled experiment — same prompts, versioned graders, results fingerprinted and superseded when a grader tightens. The ‘experiment’ framing is intentional: it lets me probe a fast-moving market through my own infrastructure without overfitting to any vendor’s benchmark theater. The goal isn’t to crown a winner — it’s a defensible, reproducible map of where each model earns its slot.

Responsible-AI lens: as correctness converges, governance — not capability — becomes the differentiator. Objective graders, a full audit trail, and per-job routing are what observable, well-governed agent deployment looks like: every answer is attributable to a known model, grader, and harness version, and a stricter test retroactively invalidates stale results instead of quietly inflating a score. That auditability is the prerequisite for trusting an autonomous fleet with real work.

37 models, validated live

8 providers

5 task types

567 total results logged

Per-task scorecard

How each task shook out

A compact read per task: how many models were tested, how many passed the objective grader, and the fastest & cheapest model that got it right.

Math · 37 models

37/37 correct

▲ gemini-2.5-flash-lite · 0.65s

$ gpt-4.1-nano · $0.000027

37/37 correct — math is effectively solved across the fleet; speed & cost decide.

Code · 36 models

34/36 correct

▲ grok-4.20-non-reasoning · 0.66s

$ gpt-4.1-nano · $0.000013

34/36 passed all 6 unit tests. The 2 misses are real logic/edge-case failures.

Summarize · 36 models

21/36 correct

▲ gpt-4.1-nano · 0.76s

$ gpt-4.1-nano · $0.000016

Hard target: only 21/36 models hit the 50 ±2-word count; the other 15 over- or undershoot.

Creative · 37 models

qualitative

▲ —

$ — (no priced-correct model)

Qualitative task — 37 models produced a poem; correctness not objectively scored.

Vision · 18 models

17/18 correct

▲ grok-4.20-non-reasoning · 0.91s

$ gemini-2.5-flash · $0.000028

17/18 read the image correctly. Only image-capable multimodal models were tested.

Visual read

Cost, correctness, latency

Lightweight, dependency-free charts rendered inline.

Math: cost vs latency

Correct cloud models on the math task. Down-and-left is cheaper and faster.

Correctness by provider

Share of gradable tasks answered correctly, aggregated per provider.

Code task: latency spread (cloud)

Median (filled) vs p95 (hollow) wall-clock per cloud model. Red = failed unit tests.

Full results

Every model, every task

Current (non-superseded) results, sorted by median time. In the Cost/call column, "local $0" marks truly free models and "pricing TBD" marks unconfirmed grok pricing. The time column shows the median, followed by p95 where multiple trials were run.

Math — 37 models, ranked by speed

Model	Provider	Median time	Cost/call	Correct
`gemini-2.5-flash-lite`	Google	0.65s	$0.000037	✓
`gemini-3.1-flash-lite`	Google	0.91s	$0.000039	✓
`gpt-4o`	OpenAI	1.19s / p95 1.63s	$0.000965	✓
`gpt-5.4-mini`	OpenAI	1.24s / p95 2.54s	$0.000154	✓
`grok-4.20-non-reasoning`	xAI	1.26s / p95 1.41s	$0.0019	✓
`gpt-4.1-nano`	OpenAI	1.32s / p95 1.81s	$0.000027	✓
`claude-haiku-4-5`	Anthropic	1.52s / p95 1.58s	$0.000613	✓
`gpt-4.1-mini`	OpenAI	1.64s / p95 2.87s	$0.000038	✓
`gpt-5.4`	OpenAI	1.65s / p95 1.71s	$0.0014	✓
`gpt-4o-mini`	OpenAI	1.74s / p95 1.79s	$0.000053	✓
`gpt-4.1`	OpenAI	1.76s / p95 2.20s	$0.0014	✓
`claude-opus-4-8`	Anthropic	2.42s / p95 2.54s	$0.0085	✓
`o3-mini`	OpenAI	2.45s / p95 2.55s	$0.0012	✓
`grok-4.3`	xAI	2.62s / p95 3.56s	$0.0023	✓
`o4-mini`	OpenAI	3.27s / p95 3.67s	$0.0013	✓
`gemini-2.5-flash`	Google	3.47s / p95 4.00s	$0.000031	✓
`gpt-5.5`	OpenAI	3.62s / p95 4.18s	$0.0033	✓
`claude-sonnet-4-6`	Anthropic	3.64s / p95 4.14s	$0.0026	✓
`gpt-5.6-sol`	OpenAI	3.64s / p95 3.84s	$0.0041	✓
`o3`	OpenAI	4.80s / p95 5.84s	$0.0098	✓
`claude-fable-5`	Anthropic	5.08s / p95 5.33s	$0.0117	✓
`grok-4.20-reasoning`	xAI	5.61s / p95 5.77s	$0.0024	✓
`glm-5.2`	Local	5.83s / p95 11.05s	$0.0024	✓
`inkling`	Thinking Machines	7.03s / p95 7.81s	$0.0019	✓
`llama3.1:8b`	Local	7.07s	$0	✓
`gemma3:27b`	Local	7.85s	$0	✓
`gemini-3.1-pro-preview`	Google	8.18s / p95 8.91s	$0.0019	✓
`gemini-2.5-pro`	Google	8.21s / p95 8.93s	$0.000852	✓
`kimi-k3`	Moonshot	11.83s / p95 21.38s	$0.0046	✓
`gemma3:12b`	Local	12.62s	$0	✓
`gemma4:latest`	Local	13.26s	$0	✓
`qwen3-coder:30b`	Local	16.23s	$0	✓
`qwen2.5-coder:32b`	Local	21.38s	$0	✓
`phi4:latest`	Local	22.55s	$0	✓
`deepseek-r1:14b`	Local	23.75s	$0	✓
`mistral-small3.2`	Local	24.62s	$0	✓
`qwen3.7-max`	Alibaba	34.37s / p95 37.87s	$0.0079	✓

Code — 36 models, ranked by speed

Model	Provider	Median time	Cost/call	Correct
`grok-4.20-non-reasoning`	xAI	0.66s / p95 0.72s	pricing TBD	✓
`gemini-3.1-flash-lite`	Google	0.71s / p95 0.82s	$0.000031	✓
`gpt-5.4-mini`	OpenAI	0.75s / p95 0.77s	$0.000105	✓
`gpt-4.1-nano`	OpenAI	0.86s / p95 1.37s	$0.000013	✓
`gpt-4o`	OpenAI	0.97s / p95 1.63s	$0.00075	✓
`gpt-4.1-mini`	OpenAI	1.08s / p95 1.09s	$0.000026	✓
`gpt-4.1`	OpenAI	1.14s / p95 1.15s	$0.00056	✓
`gpt-4o-mini`	OpenAI	1.15s / p95 2.01s	$0.000043	✓
`gpt-5.4`	OpenAI	1.25s / p95 1.42s	$0.0012	✓
`claude-haiku-4-5`	Anthropic	1.76s / p95 2.04s	$0.0014	✓
`claude-sonnet-4-6`	Anthropic	1.98s / p95 4.32s	$0.0028	✓
`o3-mini`	OpenAI	2.09s / p95 2.14s	$0.0015	✓
`claude-opus-4-8`	Anthropic	2.48s / p95 5.54s	$0.0073	✓
`gpt-5.6-sol`	OpenAI	2.53s / p95 2.92s	$0.0018	✓
`o3`	OpenAI	2.67s / p95 4.17s	$0.0089	✓
`gpt-5.5`	OpenAI	2.92s / p95 3.03s	$0.0021	✓
`o4-mini`	OpenAI	3.27s / p95 3.33s	$0.0018	✓
`glm-5.2`	Local	3.33s / p95 4.91s	$0.0016	✓
`gemini-2.5-flash`	Google	3.48s / p95 4.26s	$0.000161	✓
`claude-fable-5`	Anthropic	4.08s / p95 4.24s	$0.005	✓
`grok-4.20-reasoning`	xAI	4.60s / p95 4.79s	pricing TBD	✓
`grok-4.3`	xAI	4.83s / p95 10.30s	pricing TBD	✓
`inkling`	Thinking Machines	5.18s / p95 7.52s	$0.001	✓
`gemini-3.1-pro-preview`	Google	5.85s / p95 5.97s	$0.0012	✓
`phi4:latest`	Local	9.48s	local $0	✗
`gemini-2.5-pro`	Google	10.38s / p95 12.98s	$0.0013	✓
`gemma4:latest`	Local	12.04s	local $0	✓
`kimi-k3`	Moonshot	12.05s / p95 37.43s	$0.0033	✓
`gemma3:12b`	Local	13.90s	local $0	✓
`qwen3-coder:30b`	Local	15.42s	local $0	✓
`qwen2.5-coder:32b`	Local	18.05s	local $0	✓
`llama3.1:8b`	Local	19.01s	local $0	✓
`mistral-small3.2`	Local	20.10s	local $0	✓
`qwen3.7-max`	Alibaba	22.55s / p95 24.85s	$0.0037	✓
`gemma3:27b`	Local	25.08s	local $0	✓
`deepseek-r1:14b`	Local	66.31s	local $0	✗

Summarize — 36 models, ranked by speed

Model	Provider	Median time	Cost/call	Correct
`grok-4.20-non-reasoning`	xAI	0.62s / p95 0.67s	pricing TBD	✗
`gpt-4.1-nano`	OpenAI	0.76s / p95 0.88s	$0.000016	✓
`gemini-3.1-flash-lite`	Google	0.79s / p95 0.81s	$0.000031	✓
`gpt-5.4-mini`	OpenAI	1.00s / p95 1.09s	$0.000144	✓
`gpt-4.1`	OpenAI	1.00s / p95 1.40s	$0.000634	✓
`claude-haiku-4-5`	Anthropic	1.04s / p95 2.85s	$0.000396	✗
`gpt-4o`	OpenAI	1.24s / p95 1.40s	$0.000733	✗
`gpt-4o-mini`	OpenAI	1.26s / p95 1.40s	$0.000046	✓
`gpt-5.4`	OpenAI	1.46s / p95 1.47s	$0.0013	✓
`gpt-4.1-mini`	OpenAI	1.55s / p95 2.65s	$0.000031	✗
`claude-opus-4-8`	Anthropic	2.45s / p95 2.61s	$0.0108	✓
`o3-mini`	OpenAI	2.66s / p95 4.28s	$0.0038	✓
`claude-sonnet-4-6`	Anthropic	2.81s / p95 3.08s	$0.0015	✗
`o3`	OpenAI	3.26s / p95 3.32s	$0.022	✓
`gemini-2.5-flash`	Google	4.15s / p95 7.20s	$0.000022	✗
`grok-4.20-reasoning`	xAI	4.70s / p95 15.68s	pricing TBD	✓
`gpt-5.5`	OpenAI	4.77s / p95 5.45s	$0.0065	✓
`o4-mini`	OpenAI	4.80s / p95 8.80s	$0.0024	✓
`gpt-5.6-sol`	OpenAI	5.08s / p95 5.88s	$0.0089	✓
`llama3.1:8b`	Local	6.20s	local $0	✓
`claude-fable-5`	Anthropic	7.58s / p95 8.99s	$0.0297	✓
`grok-4.3`	xAI	8.98s / p95 9.03s	pricing TBD	✗
`phi4:latest`	Local	9.51s	local $0	✗
`gemma3:12b`	Local	10.52s	local $0	✗
`inkling`	Thinking Machines	14.93s / p95 104.58s	$0.0245	✓
`qwen3-coder:30b`	Local	15.53s	local $0	✗
`gemini-2.5-pro`	Google	15.66s / p95 16.77s	$0.000663	✗
`deepseek-r1:14b`	Local	17.40s	local $0	✗
`glm-5.2`	Local	18.32s / p95 22.97s	$0.0061	✗
`gemma4:latest`	Local	18.62s	local $0	✗
`gemma3:27b`	Local	19.89s	local $0	✗
`qwen2.5-coder:32b`	Local	19.94s	local $0	✓
`mistral-small3.2`	Local	21.11s	local $0	✓
`gemini-3.1-pro-preview`	Google	28.49s / p95 33.88s	$0.0011	✓
`kimi-k3`	Moonshot	35.45s / p95 64.25s	$0.0172	✓
`qwen3.7-max`	Alibaba	56.06s / p95 79.23s	$0.0136	✓

Creative — 37 models, ranked by speed

Model	Provider	Median time	Cost/call	Correct
`gemini-2.5-flash-lite`	Google	0.55s	$0.000018	—
`gpt-4.1-nano`	OpenAI	0.67s / p95 0.82s	$0.000009	—
`gemini-3.1-flash-lite`	Google	0.87s / p95 0.94s	$0.00002	—
`grok-4.20-non-reasoning`	xAI	0.87s / p95 0.94s	$0.001	—
`gpt-4o-mini`	OpenAI	1.00s / p95 1.39s	$0.000031	—
`gpt-4.1-mini`	OpenAI	1.00s / p95 1.14s	$0.00002	—
`gpt-5.4-mini`	OpenAI	1.08s / p95 1.34s	$0.000094	—
`claude-haiku-4-5`	Anthropic	1.23s / p95 1.28s	$0.000279	—
`gpt-4.1`	OpenAI	1.26s / p95 1.53s	$0.000294	—
`gpt-4o`	OpenAI	1.34s / p95 1.38s	$0.000508	—
`gpt-5.4`	OpenAI	2.20s / p95 2.34s	$0.00092	—
`claude-sonnet-4-6`	Anthropic	2.56s / p95 2.60s	$0.0011	—
`claude-opus-4-8`	Anthropic	3.34s / p95 4.35s	$0.0063	—
`o3`	OpenAI	4.97s / p95 9.65s	$0.0206	—
`grok-4.3`	xAI	5.40s / p95 7.30s	$0.000972	—
`llama3.1:8b`	Local	5.73s	$0	—
`o3-mini`	OpenAI	7.09s / p95 7.69s	$0.0066	—
`gemini-2.5-flash`	Google	7.20s / p95 8.11s	$0.000014	—
`phi4:latest`	Local	8.21s	$0	—
`gemma3:12b`	Local	10.00s	$0	—
`o4-mini`	OpenAI	10.08s / p95 11.15s	$0.0066	—
`claude-fable-5`	Anthropic	10.49s / p95 17.10s	$0.0255	—
`gpt-5.5`	OpenAI	11.24s / p95 11.39s	$0.0115	—
`qwen3-coder:30b`	Local	14.80s	$0	—
`qwen2.5-coder:32b`	Local	15.69s	$0	—
`gemini-2.5-pro`	Google	15.74s / p95 22.97s	$0.000381	—
`grok-4.20-reasoning`	xAI	16.01s / p95 17.39s	$0.000909	—
`gpt-5.6-sol`	OpenAI	16.33s / p95 26.15s	$0.0171	—
`mistral-small3.2`	Local	17.96s	$0	—
`gemma3:27b`	Local	18.10s	$0	—
`gemma4:latest`	Local	24.63s	$0	—
`deepseek-r1:14b`	Local	29.19s	$0	—
`glm-5.2`	Local	35.79s / p95 52.97s	$0.006	—
`inkling`	Thinking Machines	68.44s / p95 88.28s	$0.0222	—
`gemini-3.1-pro-preview`	Google	95.62s / p95 104.65s	$0.000633	—
`qwen3.7-max`	Alibaba	148.44s / p95 161.79s	$0.0176	—
`kimi-k3`	Moonshot	191.79s / p95 274.26s	$0.065	—

Vision — 18 models, ranked by speed

Model	Provider	Median time	Cost/call	Correct
`grok-4.20-non-reasoning`	xAI	0.91s / p95 0.93s	$0.0038	✓
`gemini-3.1-flash-lite`	Google	1.36s / p95 1.41s	$0.000118	✓
`gpt-4o`	OpenAI	1.72s / p95 1.82s	$0.003	✓
`gpt-5.6-sol`	OpenAI	2.02s / p95 2.60s	$0.0072	✓
`grok-4.3`	xAI	2.15s / p95 2.28s	$0.0042	✓
`gemini-2.5-flash`	Google	2.29s	$0.000028	✓
`gpt-5.5`	OpenAI	2.45s / p95 16.35s	$0.0068	✓
`grok-4.20-reasoning`	xAI	2.62s / p95 2.72s	$0.0038	✓
`inkling`	Thinking Machines	2.80s / p95 3.43s	$0.0019	✓
`claude-fable-5`	Anthropic	3.62s / p95 3.66s	$0.0163	✓
`gemini-2.5-pro`	Google	5.46s / p95 5.82s	$0.000538	✓
`gemini-3.1-pro-preview`	Google	6.75s / p95 8.56s	$0.003	✓
`phi4:latest`	Local	8.28s	$0	✗
`gemma3:12b`	Local	10.54s	$0	✓
`gemma3:27b`	Local	20.08s	$0	✓
`kimi-k3`	Moonshot	24.89s / p95 27.09s	$0.0078	✓
`gemma4:latest`	Local	27.24s	$0	✓
`mistral-small3.2`	Local	29.85s	$0	✓

Methodology

The protocol

Every model sees the same prompt per task. Cloud models are timed over N=3 trials (median reported, p95 where available); local models are single-trial and their latency includes cold model-load. Grading is objective wherever possible.
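
To make the timing summary concrete, here is a minimal sketch of reducing a handful of trial latencies to a median and a p95. It assumes the nearest-rank percentile method; the harness's actual percentile method is not stated, and the sample values are made up for illustration.

```python
import math
import statistics


def summarize_latency(samples):
    """Return (median, p95) for a list of wall-clock times in seconds."""
    ordered = sorted(samples)
    median = statistics.median(ordered)
    # Nearest-rank p95: the smallest sample with at least 95% of samples <= it.
    rank = math.ceil(0.95 * len(ordered))
    p95 = ordered[rank - 1]
    return median, p95


# Three hypothetical trials, matching the N=3 protocol.
print(summarize_latency([1.0, 0.9, 1.4]))  # (1.0, 1.4)
```

With only three samples, nearest-rank p95 is simply the slowest trial, which is why a three-trial p95 says little about the true tail.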

Graders, in plain English

Math: answer string contains '19' (apples-and-coffee budget problem).
Code: extracted Python is AST-parsed and executed against 6 unit tests; pass = 6/6.
Summarize: word count must land within 50 ±2 words (48 to 52).
Creative: qualitative only (a 4-line metered poem, no clichés); not objectively graded.
Vision: model output must contain '3' or 'three', plus 'red' and 'fleet 19' (reads a generated image).
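
The three string-based graders above are simple enough to sketch directly. This is an illustration under assumptions, not the page's grader code: the function names are hypothetical, words are counted by splitting on whitespace, and the vision check is treated as case-insensitive.

```python
def grade_math(answer):
    """Pass if the answer text contains the expected value 19."""
    return "19" in answer


def grade_summary(text, target=50, tolerance=2):
    """Pass if the whitespace-delimited word count is within target +/- tolerance."""
    return abs(len(text.split()) - target) <= tolerance


def grade_vision(output):
    """Pass if the output mentions the count (digit or word), 'red', and 'fleet 19'."""
    text = output.lower()
    has_count = "3" in text or "three" in text
    return has_count and "red" in text and "fleet 19" in text


print(grade_math("She has $19 left."))
print(grade_summary(" ".join(["word"] * 51)))
print(grade_vision("Three red trucks, labelled FLEET 19"))
```

Substring checks are cheap and deterministic, but they are permissive: a check like this would also accept an answer containing 190 or 2019, so a stricter version might match the number as a whole token.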

Hypotheses under test

Is correctness still a differentiator, or has it collapsed (esp. on math)?
What is the real cloud price floor for a correct answer?
Are local models a viable free tier, or does latency disqualify them?
Which provider/model should each job-type route to?

Limitations — stated openly

Grok pricing unconfirmed. xAI variants are excluded from cost rankings and never labeled “free.” Speed is real; economics are not yet validated.
Local latency includes cold-load. “$0” ignores wall-clock (often 20–110s) and local compute.
N=3 cloud trials is enough for a stable median, not a tight tail estimate.
Creative is not objectively graded — it's qualitative and excluded from correctness math.
Vision tested only multimodal-capable models; text-only models were skipped by design, not failed.

Versioning & audit trail

Nothing is hand-curated

Each result carries a run_key = fingerprint of model + task + task_version + grader_version + harness_version. A newer, higher-version result for the same model+task supersedes the old one. 142 of 567 logged results are currently superseded — preserved for audit, excluded from the dashboard.
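
A rough sketch of the fingerprint and supersession rule described above. It assumes integer versions compared as a (task, grader, harness) tuple and a SHA-256 digest truncated to 16 hex characters; the `Result` class, its field types, and `current_results` are hypothetical names, and the two sample records are invented for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    task: str
    task_version: int
    grader_version: int
    harness_version: int
    correct: bool

    @property
    def run_key(self):
        """Fingerprint of model + task + the three version numbers."""
        parts = (self.model, self.task, self.task_version,
                 self.grader_version, self.harness_version)
        raw = "|".join(str(p) for p in parts)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

    @property
    def version(self):
        return (self.task_version, self.grader_version, self.harness_version)


def current_results(log):
    """Keep one result per (model, task): the highest version, latest on ties."""
    latest = {}
    for result in log:  # the log is append-only, oldest first
        key = (result.model, result.task)
        if key not in latest or result.version >= latest[key].version:
            latest[key] = result
    return list(latest.values())


old = Result("gpt-4o", "vision", 1, 1, 3, correct=False)  # hypothetical grader-v1 record
new = Result("gpt-4o", "vision", 1, 2, 3, correct=True)   # hypothetical re-grade under v2
print([r.run_key for r in current_results([old, new])] == [new.run_key])
```

Nothing is deleted in this scheme: the superseded record stays in the log and is only filtered out when the current view is computed.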

What changed — harness

harness v1 · 2026-06-11 — Initial harness: Anthropic + Google callers, serial, single-trial.
harness v2 · 2026-06-13 — Added OpenAI/xAI/Ollama callers, parallel HTTP, local preflight. Fixed gpt-5 max_completion_tokens param.
harness v3 · 2026-06-13 — Full output capture (no truncation), inline objective grading, multi-trial cloud timing (median + p95). Found claude-fable-5 needs OpenRouter route (404 on direct Anthropic).
harness v3 · 2026-06-14 — Bugfix (no version bump): --trials value was leaking into task list causing KeyError crash before append (lost in-memory results). Parser now consumes the flag value and validates task names against manifest.
harness v3 · 2026-06-28 — Delta run: claude-fable-5 permanently unavailable (Anthropic returns 404; US export-control order 2026-06-13; message redirects to Opus 4.8). glm-5.2 blocked this run: OpenRouter key credit cap reached (limit_remaining=0); glm-5.2 also lacks a vision endpoint on OpenRouter. 0 new valid results appended; no deploy.
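
The 2026-06-14 bugfix entry describes a flag value leaking into the task list. A minimal sketch of the fixed behaviour, assuming an `argparse`-based command line: `--trials` consumes its own value, and task names are validated against a manifest before any work starts. The manifest contents, the lowercase task names, and the `parse_args` helper are assumptions; the real harness's CLI is not shown on this page.

```python
import argparse

# Hypothetical manifest of known task names.
MANIFEST = {"math", "code", "summarize", "creative", "vision"}


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="harness")
    # type=int makes argparse consume the flag's value instead of
    # leaving it behind to be mistaken for a task name.
    parser.add_argument("--trials", type=int, default=1)
    parser.add_argument("tasks", nargs="*", default=sorted(MANIFEST))
    args = parser.parse_args(argv)
    unknown = [t for t in args.tasks if t not in MANIFEST]
    if unknown:
        # Fail before running anything, so no in-memory results can be lost.
        parser.error(f"unknown task(s): {', '.join(unknown)}")
    return args


args = parse_args(["--trials", "3", "math", "code"])
print(args.trials, args.tasks)  # 3 ['math', 'code']
```

Validating up front turns the late `KeyError` into an immediate usage error.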

What changed — graders

code grader v1 · 2026-06-11 — Qualitative eyeball only — NOT rigorous.
summarize grader v1 · 2026-06-11 — Qualitative eyeball only — NOT rigorous.
code grader v2 · 2026-06-13 — Execute code vs 6 unit tests. Requires full output (harness v3). Supersedes all code results from harness v1/v2.
summarize grader v2 · 2026-06-13 — Word-count check (50 ±2). Requires full output (harness v3). Supersedes all summarize results from harness v1/v2.
vision grader v1 · 2026-06-13 — Strict digit match '3' — INVALID, miscounted word-numbers.
vision grader v2 · 2026-06-13 — Accept word-numbers ('three'). Supersedes all grader-v1 vision results.
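
The code grader v2 entry above (parse, execute, check against unit tests) could look roughly like the sketch below. The benchmark's real coding task and its 6 tests are not published on this page, so `clamp` and the cases here are stand-ins, and `grade_code` is a hypothetical name.

```python
import ast


def grade_code(source, func_name, cases):
    """Pass only if the source parses, defines func_name, and passes every case."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    namespace = {}
    try:
        exec(compile(tree, "<model-output>", "exec"), namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False  # runtime errors and missing functions count as failures


# Six stand-in unit tests for a hypothetical clamp(x, lo, hi) task.
CASES = [((5, 0, 10), 5), ((-1, 0, 10), 0), ((11, 0, 10), 10),
         ((0, 0, 10), 0), ((10, 0, 10), 10), ((3, 3, 3), 3)]

GOOD = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
BAD = "def clamp(x, lo, hi):\n    return min(lo, max(x, hi))\n"

print(grade_code(GOOD, "clamp", CASES), grade_code(BAD, "clamp", CASES))  # True False
```

Running model-generated code with `exec` is only reasonable inside an isolated process or container with a timeout; this sketch omits that isolation for brevity.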

Invalidation triggers

Vision grader v1→v2 (accept word-numbers like “three”): superseded 28 vision results that were mis-scored on digit-only matching.
Code & Summarize grader v2 + harness v3 (objective execution / exact word-count, full-output capture): superseded 59 single-shot/eyeball results across code and summarize.

Run log

Date	Records	Tasks run
2026-07-18	21	Code, Creative, Math, Summarize, Vision (2 since superseded)
2026-07-09	92	Code, Creative, Math, Summarize
2026-07-01	107	Code, Creative, Math, Summarize, Vision
2026-06-21	105	Code, Creative, Math, Summarize, Vision
2026-06-14	56	Creative, Math, Vision
2026-06-13	158	Code, Creative, Math, Summarize, Vision (114 since superseded)
2026-06-11	28	Code, Creative, Math, Summarize (26 since superseded)
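
As a quick consistency check, the run log can be reconciled against the totals quoted in the versioning section (567 logged results, 142 superseded). The tuples below are transcribed from the table above; the variable names are illustrative.

```python
# (date, records, superseded_since) transcribed from the run log.
RUN_LOG = [
    ("2026-07-18", 21, 2),
    ("2026-07-09", 92, 0),
    ("2026-07-01", 107, 0),
    ("2026-06-21", 105, 0),
    ("2026-06-14", 56, 0),
    ("2026-06-13", 158, 114),
    ("2026-06-11", 28, 26),
]

total_records = sum(records for _, records, _ in RUN_LOG)
total_superseded = sum(superseded for _, _, superseded in RUN_LOG)

print(total_records, total_superseded)  # 567 142
```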

Security posture

Documented in public, redacted by policy

This is a live experiment run on real infrastructure and shared openly. Infrastructure details (IP addresses, hostnames) are redacted with functional tags — <host-a>, <local-node> — so the methodology is fully reproducible without exposing the network. Benchmark data, prompts, graders, and version history are public; the wiring is not.
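
One way such a redaction pass could work is sketched below, assuming a hand-maintained map from real identifiers to functional tags and a final check that fails if any IPv4-looking string survives. The tag names `<host-a>` and `<local-node>` come from the text above; the `redact` helper, the example hostname, and the documentation-range IP addresses are illustrative only.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical mapping: real identifier -> functional tag.
TAGS = {
    "192.0.2.10": "<host-a>",
    "mini.example.lan": "<local-node>",
}


def redact(text, tags):
    """Replace known identifiers with tags, then refuse to publish leftovers."""
    for real, tag in tags.items():
        text = text.replace(real, tag)
    leftover = IPV4.findall(text)
    if leftover:
        raise ValueError(f"unredacted address(es): {leftover}")
    return text


print(redact("agent on 192.0.2.10 calls mini.example.lan", TAGS))
```

Failing closed on unknown addresses means a newly added host cannot leak simply because nobody remembered to add it to the map.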

The earlier write-ups

The original June 11 fleet analysis (architecture, tokenomics, 7-model v1 benchmark) and the June 13 v2 evolution narrative are preserved for continuity.


The v1 page asked “can the models do the math?” (7 cloud models, single prompt) and documented the fleet architecture and tokenomics. The v2 evolution expanded to 37+ models across 8 providers and added a vision task. Both have been superseded by this living, versioned page — which recomputes every winner directly from the append-only result store rather than from hand-written prose. The full prior narrative remains in version control.

Key v2 findings that still hold: math is solved across the fleet; the cloud price floor is effectively zero (sub-$0.0001 correct answers); local models are correct but slow (cold-load latency dominates); verbosity, not sticker price, drives cost.