Fleet Analysis · 2026-06-01

What a 5-agent AI fleet actually costs to run

A working analysis of architecture, tokenomics, and the framework behind specialization

Live: fleet.arnao.ai · June 2026 snapshot · Author: Byron Arnao

Executive summary

This is a working analysis of a personal AI agent fleet: five agents distributed across two hosts, running four distinct models (two agents share Sonnet 4.6), each playing a specialized role. It documents what the fleet actually costs to run, what each model is genuinely good at, and the framework behind specialization. It is written so that a reader without a deep AI infrastructure background can follow it.

The thesis is simple: the number of agents is not the value. Specialization is. Five general-purpose assistants are worse than two specialized ones, because every interaction starts with "wait, which one did I tell about X?" and their memories drift apart. The value of a fleet goes up when each agent has a distinct job, a distinct cost profile, and a distinct trust boundary.


The architecture

Two physical hosts. Five Docker containers. Four distinct models across five agents. Three external model providers. One local model service. Inter-agent communication is the next layer up: the A2A protocol is the public standard that makes this kind of fleet legible to other systems.

Fleet architecture: Apple Silicon Mac Mini running four agents (Apex/Opus 4.8, Operator/Sonnet 4.6, Concierge/Gemini 2.5 Flash, Sentinel/Gemma4 local) and Windows x64 PC running one agent (Mobile/Sonnet 4.6). External providers Anthropic, Google AI, OpenRouter connect from above. Local Ollama service serves the Sentinel agent. Inter-Agent Fabric (A2A protocol) at the bottom.

Functional names used throughout this analysis. Real handles, ports, and network details are intentionally not shown.

The fleet

Each agent has a role, a model tier, and a cost shape. The roles map to the five archetypes worth keeping in mind for any future fleet design: best, workhorse, mobile, restricted/public-facing, and always-on/private.

Best

Apex

Opus 4.8

Heavy synthesis, published content, hard reasoning. The frontier model for tasks where quality matters more than cost or speed.

~$15/$75 per M tok

Workhorse

Operator

Sonnet 4.6

Daily-driver ops. Mid-cost, mid-speed, competent across most tasks. The agent that runs the most actual work.

~$3/$15 per M tok

Mobile

Mobile

Sonnet 4.6

Mobile-facing front-end. Lives on a Windows host because the messaging stack pairs to one device. Same model class as Operator but distinct purpose.

~$3/$15 per M tok

Fast / restricted

Concierge

Gemini 2.5 Flash

Public-facing secretary. Scheduling, screening, FAQ. Hard restrictions: no email, no web, no exec, no browser.

~$0.075/$0.30 per M tok

Always-on / free

Sentinel

Gemma4 (local)

Local-only, free, always-up. Runs against a local Ollama service. Backup when paid quotas die; baseline analyst for non-urgent jobs.

$0 / $0 per M tok

Design rule

The agents are spread across different model classes and providers, not because they had to be, but so the fleet has redundancy. If Anthropic has a bad day, the Google-backed and local-backed agents keep working. If quotas die, the local-backed one keeps working. Provider diversity is a feature, not an accident.

Methodology

Every comparison in this page is from a serial run — one model call at a time, with a 2-second gap between calls. Parallel runs were avoided so concurrent load couldn't confound the timing numbers.
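A minimal sketch of what such a serial harness could look like. The `call_model` callable is a hypothetical stand-in for whatever client actually reaches each provider; only the one-at-a-time ordering and the 2-second gap come from the description above.

```python
import time
from typing import Callable, Dict, List


def run_serial(
    models: List[str],
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> reply text
    gap_s: float = 2.0,                     # idle gap between calls, per the methodology
) -> List[Dict[str, object]]:
    """Call each model one at a time and record wall-clock time per call."""
    results: List[Dict[str, object]] = []
    for i, model in enumerate(models):
        start = time.perf_counter()
        reply = call_model(model, prompt)
        elapsed = time.perf_counter() - start
        results.append({"model": model, "seconds": round(elapsed, 2), "reply": reply})
        if i < len(models) - 1:
            time.sleep(gap_s)  # never start the next call while another could be in flight
    return results
```

Because the next call only starts after the previous one has returned and the gap has elapsed, each timing reflects a single request with no concurrent load from the harness itself.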

The standard prompt

For the model benchmark and the host-level comparison, all calls used the same prompt — chosen because it has one correct numerical answer and requires brief reasoning prose:

A store sells apples for $0.85 each. I have $20 and want to buy as
many as possible while leaving enough to buy a $3.50 coffee with the
change. How many apples can I buy? Show your reasoning in 2-3 sentences.

Correct answer: 19 apples ($16.15 spent, $3.85 left → covers the $3.50 coffee).

Why this prompt: it tests arithmetic, constraint reasoning, and writing — all comparable across models. The answer is unambiguous (a model either says 19 or it doesn't), so "correctness" isn't a judgment call.
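The reference answer can be checked in a few lines. This is only a verification of the arithmetic, done in integer cents to sidestep floating-point rounding; the variable names are illustrative.

```python
# Work in integer cents so the division is exact.
budget_cents = 2000  # $20.00
coffee_cents = 350   # $3.50 that must be left over
apple_cents = 85     # $0.85 per apple

apples = (budget_cents - coffee_cents) // apple_cents
spent_cents = apples * apple_cents
left_cents = budget_cents - spent_cents

print(apples, spent_cents / 100, left_cents / 100)  # 19 16.15 3.85
```

A twentieth apple would bring spending to $17.00 and leave only $3.00, which no longer covers the coffee, so 19 is the maximum.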

How cost was computed

Each provider returns token counts in its response (input_tokens, output_tokens, or the equivalent). Cost is computed against the public pricing in effect as of June 2026:

Model | Input ($/M tok) | Output ($/M tok)
anthropic/claude-opus-4-8 | 15.00 | 75.00
anthropic/claude-opus-4-7 | 15.00 | 75.00
anthropic/claude-sonnet-4-6 | 3.00 | 15.00
anthropic/claude-haiku-4-5 | 1.00 | 5.00
google/gemini-2.5-pro | 1.25 | 10.00
google/gemini-2.5-flash | 0.075 | 0.30
google/gemini-2.5-flash-lite | 0.10 | 0.40
ollama/gemma4 (local) | 0.00 | 0.00
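The per-call cost follows directly from the token counts and the per-million prices. A small sketch, assuming the prices listed above; the function name and the example token counts are illustrative, not taken from the fleet's code.

```python
# USD per million tokens as (input, output), copied from the pricing list above.
PRICES = {
    "anthropic/claude-opus-4-8": (15.00, 75.00),
    "anthropic/claude-opus-4-7": (15.00, 75.00),
    "anthropic/claude-sonnet-4-6": (3.00, 15.00),
    "anthropic/claude-haiku-4-5": (1.00, 5.00),
    "google/gemini-2.5-pro": (1.25, 10.00),
    "google/gemini-2.5-flash": (0.075, 0.30),
    "google/gemini-2.5-flash-lite": (0.10, 0.40),
    "ollama/gemma4": (0.00, 0.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, given the token counts the provider reported."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Illustrative call: 1,000 input and 500 output tokens on Sonnet 4.6.
print(f"{call_cost('anthropic/claude-sonnet-4-6', 1000, 500):.6f}")  # 0.010500
```

Output tokens dominate the bill on every paid model in the list, since the output price is several times the input price.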

Hardware setup

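Two hosts: an Apple Silicon Mac Mini (four agent containers plus the local Ollama service) and a Windows x64 PC (one agent container).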

Both hosts have the same Anthropic and Google API keys available to their containers, so the model availability matrix is identical except for local Ollama (Mac Mini only).

Model benchmark

Same prompt, every reachable model, run serially. This is the headline comparison: speed, cost, and correctness on a representative reasoning task.

Model | Time (s) | Input tok | Output tok | Cost (USD) | Correct?
anthropic/claude-opus-4-8 | 1.98 | 83 | 92 | $0.008145 |
anthropic/claude-opus-4-7 | 2.63 | 88 | 94 | $0.008370 |
anthropic/claude-sonnet-4-6 | 3.43 | 68 | 122 | $0.002034 |
anthropic/claude-haiku-4-5 | 1.53 | 68 | 110 | $0.000618 |
google/gemini-2.5-pro | 7.25 | 58 | 109 | $0.001163 |
google/gemini-2.5-flash | 4.63 | 58 | 89 | $0.000031 |
google/gemini-2.5-flash-lite | 0.57 | 58 | 83 | $0.000039 |
ollama/gemma4 (local) | 12.28 | 74 | 587 | $0.000000 |
ollama/nemotron-mini:4b (local) | 8.26 | 66 | 360 | $0.000000 | ✗ (said 24)
google/gemini-3-pro-preview | | | | | ✗ (404, retired)

Key takeaways

Host-level comparison

Same prompt, same models, but each run is initiated from inside a different agent's container. This isolates host hardware, network path, and runtime overhead — useful for deciding which host should run which workload.

Caveat for this run

During the host-level benchmark, the Anthropic credit balance hit zero across the fleet. Every Anthropic call returned HTTP 400: credit balance too low. This page reports the Google-side and local-Ollama data only for the host comparison. The original model benchmark above ran while Anthropic credits were live — that data is intact.

Host (agent's container) | Gemini 2.5 Flash | Gemini 2.5 Flash Lite | Gemma4 (local Ollama)
Mac Mini · Apex container | 3.40s / $0.000033 / ✓ | 0.60s / $0.000044 / ✓ | 15.11s / $0 / ✓
Mac Mini · Concierge container | 3.12s / $0.000029 / ✓ | 0.65s / $0.000040 / ✓ | 14.49s / $0 / ✓
Mac Mini · Sentinel container | 2.99s / $0.000030 / ✓ | 0.63s / $0.000039 / ✓ | 25.53s / $0 / ✓
Mac Mini · Operator container | 3.89s / $0.000030 / ✓ | 0.56s / $0.000041 / ✓ | 20.85s / $0 / ✓*
Windows · Mobile container | 3.51s / $0.000034 / ✓ | 0.69s / $0.000042 / ✓ | (no local Ollama)

* Mac Mini's local Ollama service is effectively single-tenant — first parallel attempt dropped a connection; the timing above is from a serial retry.

What this tells us

Task-type comparison

Speed and cost only tell you half the story. Different models are good at different shapes of work — math/logic, code generation, summarization, creative writing. The plan was four models × four task types. Anthropic's credit exhaustion limited what we could measure.

Caveat for this run

Anthropic models (Opus 4.8, Sonnet 4.6, Haiku 4.5) all returned credit-balance errors during this benchmark. Only Gemini 2.5 Flash returned usable data. The Anthropic rows will be filled in when credits are restored.

Model | Math / Logic | Code Generation | Summarization (50 words) | Creative (4-line poem)
Gemini 2.5 Flash | 4.00s / $0.000028 / ✓ correct (19) | 5.61s / $0.000036 / ✓ typed return, handles edge case | 6.03s / $0.000028 / ✓ exactly 50 words, faithful | 9.00s / $0.000015 / ✓ avoided cliches, 4 metered lines
Opus 4.8 | pending (Anthropic credits depleted during this benchmark)
Sonnet 4.6 | pending (Anthropic credits depleted during this benchmark)
Haiku 4.5 | pending (Anthropic credits depleted during this benchmark)

The accidental finding

The credit-exhaustion failure during the benchmark is itself instructive. The fleet kept working because Gemini was still available and the Sentinel agent's local Gemma4 was free. The provider-diversity design rule did more than survive contact with the bill: the billing failure is what proved it. A fleet that depends on a single provider can't tolerate a billing event. A fleet with three independent providers (Anthropic + Google + local) degrades gracefully.

This is the case study for the "design for provider diversity" principle in the Lessons section. We didn't plan for the credit to run out during a public benchmark — but the architecture absorbed it.
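One way to picture graceful degradation is an ordered fallback chain. The sketch below is an assumption about how such routing could be written, not a description of the fleet's actual code; the provider callables are hypothetical stand-ins, and the error string mirrors the one reported during the benchmark.

```python
from typing import Callable, List, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (label, function taking a prompt)


def ask_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (label, reply) from the first that succeeds."""
    errors = []
    for label, call in providers:
        try:
            return label, call(prompt)
        except Exception as exc:  # e.g. a billing, quota, or outage error
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real clients.
def anthropic_call(prompt: str) -> str:
    raise RuntimeError("HTTP 400: credit balance too low")


def google_call(prompt: str) -> str:
    return "19"


label, reply = ask_with_fallback(
    "How many apples?", [("anthropic", anthropic_call), ("google", google_call)]
)
print(label, reply)  # google 19
```

Placing the local model last in the list gives the chain a final step that cannot fail for a billing reason, which is the property the fleet relied on here.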

Tokenomics — the real cost of running this

The interesting cost isn't a single API call. It's the recurring work — heartbeats, dreaming, inbox classification, polling — running unattended at 3am, every 30 minutes, every 10 minutes. That's where the bill compounds.

Per-job weekly cost (before redistribution)

Assuming each cron firing uses ~500 input + 1500 output tokens (a defensible midpoint for the prompts in production):

Job | Cadence | Firings / week | Was running on | Weekly cost
Inter-agent progress watch | every 10 min | 1,008 | Opus 4.8 | $117.18
Unified inbox check | every 30 min | 336 | Opus 4.8 | $39.06
High-frequency email check | every 4 hr | 42 | Opus 4.8 | $4.88
Memory dreaming (×5 agents) | nightly, 3 phases | 105 | Opus 4.8 default | $12.21
Heartbeats (mobile + Windows) | every 30 min | 672 | gpt-4o-mini (DEAD) | $0 (silently failing)
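The firing counts follow from the cadences, and the cost per firing from the token assumption. A rough sketch of that arithmetic; the function names are illustrative, and reading the heartbeat row as two jobs (one per host) is an assumption based on its count of 672.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def firings_per_week(every_minutes: int, copies: int = 1) -> int:
    """How many times a job fires in a week, across `copies` identical jobs."""
    return MINUTES_PER_WEEK // every_minutes * copies


def cost_per_firing(input_tokens: int, output_tokens: int,
                    price_in: float, price_out: float) -> float:
    """USD for one firing, with prices given per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


watch = firings_per_week(10)                 # every 10 min
inbox = firings_per_week(30)                 # every 30 min
email = firings_per_week(4 * 60)             # every 4 hr
heartbeats = firings_per_week(30, copies=2)  # assumed: one heartbeat job per host
dreaming = 5 * 3 * 7                         # 5 agents x 3 phases x 7 nights

print(watch, inbox, email, heartbeats, dreaming)  # 1008 336 42 672 105

# Opus 4.8 at $15 / $75 with the stated 500 input + 1,500 output tokens.
opus_firing = cost_per_firing(500, 1500, 15.00, 75.00)
print(f"{opus_firing:.4f}")  # 0.1200
```

At exactly 500 input and 1,500 output tokens a firing on Opus comes to $0.12, while the weekly costs in the table work out to roughly $0.116 per firing ($117.18 / 1,008), so the token midpoint is best read as approximate.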

Per-job weekly cost (after redistribution)

The same jobs, routed to the model class that actually fits the work:

Job | Now running on | Weekly cost | Saved / week
Inter-agent progress watch | Gemma4 local | $0.00 | $117.18
Unified inbox check | Haiku 4.5 | $2.55 | $36.51
High-frequency email check | Haiku 4.5 | $0.32 | $4.56
Memory dreaming (×5 agents) | Haiku 4.5 / Gemma4 | $0.80 | $11.41
Heartbeats | Gemma4 local + Gemini Flash Lite | $0.01 | (restored function)
Weekly savings: ~$170
Monthly savings: ~$680
Annual savings: ~$8,200
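The headline savings can be reproduced from the per-job "Saved / week" figures. This sketch assumes a 4-week month and a 12-month year, which is the convention the rounded figures appear to follow (170 × 4 = 680, 680 × 12 = 8,160); that convention is inferred, not stated in the source.

```python
# "Saved / week" values from the redistribution table (heartbeats excluded:
# that job was silently failing before, so it had no prior cost to save).
saved_per_week = {
    "Inter-agent progress watch": 117.18,
    "Unified inbox check": 36.51,
    "High-frequency email check": 4.56,
    "Memory dreaming": 11.41,
}

weekly = sum(saved_per_week.values())
monthly = weekly * 4    # assumed 4-week month
annual = monthly * 12   # assumed 12 such months

print(f"{weekly:.2f} {monthly:.2f} {annual:.2f}")  # 169.66 678.64 8143.68
```

Multiplying the weekly figure by 52 instead would give a somewhat higher annual number, so the published figure is, if anything, on the conservative side.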

The guardrail — what was NOT moved off Opus

Cost reduction without quality loss is the whole point. These remain on Opus 4.8 because the output is published, brand-shaping, or genuinely complex:

Lessons learned · a framework for fleet design

After a quarter of running this in production, five principles keep proving themselves out.

1. Specialize by role, not by handle

The most honest organizing principle is the one you'd describe to a stranger in one sentence per agent: "this one is the best; this one is the cheapest; this one is the always-on; this one is restricted." That's the right level of abstraction. Naming agents after pets or letters doesn't survive contact with a real workload — roles do.

2. Build for provider diversity

Quotas die. Models get deprecated. A whole provider can have an outage. When the fleet spans Anthropic + Google + local Ollama, no single provider failure takes the fleet down. The "free + always-on" agent is structurally the most important one because it's the only one that cannot fail for a billing reason.

3. The recurring work is where the bill lives

One off-the-cuff "explain this to me" request is a rounding error. A 10-minute cron job firing 144 times a day silently into a frontier model is where the budget evaporates. The audit is worth doing, but the better discipline is to set the model per job at job creation time and never default to "the smartest thing available."
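One way to enforce "pick the model at job creation time" is to make the model a required field with no default. A minimal sketch, assuming jobs are described by a small record type; `CronJob` and its fields are hypothetical, not the fleet's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CronJob:
    name: str
    every_minutes: int
    model: str  # deliberately no default: whoever creates the job must choose


progress_watch = CronJob(
    name="inter-agent progress watch",
    every_minutes=10,
    model="ollama/gemma4",
)
print(progress_watch.model)  # ollama/gemma4
```

Omitting `model` raises a `TypeError` at construction, so a job can never silently inherit a frontier model by default.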

4. Treat the model whitelist as a runtime fact, not a static doc

This fleet was offline for a full day because a model documentation file said "Opus 4.8 works ✅" — but the runtime hadn't been upgraded to recognize it. The lesson: agent skills that change configuration should run a live probe before persisting, and a runtime upgrade should be part of the model-adoption workflow.
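The "probe before persisting" rule could be sketched as follows. The JSON config layout and the `probe` callable are assumptions for illustration; in practice the probe would make one small real request through the same runtime the agent uses.

```python
import json
from pathlib import Path
from typing import Callable


def set_model(config_path: Path, agent: str, model: str,
              probe: Callable[[str], bool]) -> bool:
    """Persist a model change only if a live probe of that model succeeds."""
    if not probe(model):
        return False  # leave the existing configuration untouched
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config[agent] = model
    config_path.write_text(json.dumps(config, indent=2))
    return True
```

With this shape, a documentation file claiming a model works has no effect on its own: the configuration only changes after the running system has actually answered a call with that model.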

5. Functional names beat real names for public artifacts

The functional names used throughout this page (Apex / Operator / Mobile / Concierge / Sentinel) are not just for security — they force you to write about what each agent does, which is the thing that actually changes over time. The handles, ports, and bot tokens are implementation details.

What's next

This page is the first published artifact. It will be updated as the fleet evolves — particularly as host-level and task-type benchmarks complete, and as the inter-agent A2A protocol layer matures from primitive to production.