This is a working analysis of a personal AI agent fleet — five distinct agents distributed across two hosts, each running a different model, each playing a specialized role. The artifact documents what it actually costs to run, what each model is genuinely good at, and the framework behind specialization. It is intentionally written so a reader without deep AI infrastructure background can follow it.
The thesis is simple: headcount of agents is not the value. Specialization is. Five general-purpose assistants are worse than two specialized ones: every interaction starts with "wait, which one did I tell about X?" and their memories drift apart. The value of a fleet goes up when each agent has a distinct job, a distinct cost profile, and a distinct trust boundary.
Two physical hosts. Five Docker containers. Five different models. Three external model providers. One local model service. Inter-agent communication is the next layer up — the A2A protocol is the public standard that makes this kind of fleet legible to other systems.
Functional names are used throughout this analysis. Real handles, ports, and network details are intentionally not shown.
Each agent has a role, a model tier, and a cost shape. The roles map to the five archetypes worth keeping in mind for any future fleet design: best, workhorse, mobile, restricted/public-facing, and always-on/private.
| Agent | Role | Approx. price (input / output) |
|---|---|---|
| Apex | Heavy synthesis, published content, hard reasoning. The frontier model for tasks where quality matters more than cost or speed. | ~$15 / $75 per M tok |
| Operator | Daily-driver ops. Mid-cost, mid-speed, competent across most tasks. The agent that runs the most actual work. | ~$3 / $15 per M tok |
| Mobile | Mobile-facing front-end. Lives on a Windows host because the messaging stack pairs to one device. Same model class as Operator but distinct purpose. | ~$3 / $15 per M tok |
| Concierge | Public-facing secretary. Scheduling, screening, FAQ. Hard restrictions: no email, no web, no exec, no browser. | ~$0.075 / $0.30 per M tok |
| Sentinel | Local-only, free, always-up. Runs against a local Ollama service. Backup when paid quotas die; baseline analyst for non-urgent jobs. | $0 / $0 per M tok |
The agents span different model classes and providers, not because they had to, but so the fleet has redundancy across providers. If Anthropic has a bad day, the Google-backed and local-backed agents keep working. If quotas die, the local-backed one keeps working. Provider diversity is a feature, not an accident.
Every comparison in this page is from a serial run — one model call at a time, with a 2-second gap between calls. Parallel runs were avoided so concurrent load couldn't confound the timing numbers.
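The serial procedure described above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used: `run_serial` and the `call_model` callable are hypothetical names, and the real client code for each provider is assumed to live behind `call_model`.

```python
import time
from typing import Callable, Dict, List


def run_serial(models: List[str],
               call_model: Callable[[str, str], str],
               prompt: str,
               gap_seconds: float = 2.0) -> List[Dict[str, object]]:
    """Call each model one at a time, pausing between calls."""
    results: List[Dict[str, object]] = []
    for index, model in enumerate(models):
        if index > 0:
            # Idle gap so that no two calls ever overlap.
            time.sleep(gap_seconds)
        start = time.perf_counter()
        reply = call_model(model, prompt)  # hypothetical provider client
        elapsed = time.perf_counter() - start
        results.append({"model": model, "seconds": elapsed, "reply": reply})
    return results
```

Timing only the call itself, and sleeping outside the timed region, keeps the gap from leaking into the reported latency.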
For the model benchmark and the host-level comparison, all calls used the same prompt — chosen because it has one correct numerical answer and requires brief reasoning prose:
> A store sells apples for $0.85 each. I have $20 and want to buy as
> many as possible while leaving enough to buy a $3.50 coffee with the
> change. How many apples can I buy? Show your reasoning in 2-3 sentences.
Correct answer: 19 apples ($16.15 spent, $3.85 left → covers the $3.50 coffee).
Why this prompt: it tests arithmetic, constraint reasoning, and writing — all comparable across models. The answer is unambiguous (a model either says 19 or it doesn't), so "correctness" isn't a judgment call.
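The expected answer can be checked with a few lines of integer arithmetic. This is only a sanity check on the prompt, working in cents to avoid floating-point rounding; the function name `max_apples` is illustrative.

```python
def max_apples(budget_cents: int, apple_cents: int, reserve_cents: int) -> int:
    """Largest apple count that still leaves at least reserve_cents."""
    return (budget_cents - reserve_cents) // apple_cents


apples = max_apples(budget_cents=2000, apple_cents=85, reserve_cents=350)
spent = apples * 85
left = 2000 - spent
print(apples, spent, left)  # 19 1615 385
```

A twentieth apple would bring spending to $17.00 and leave $3.00, which no longer covers the coffee.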
Each provider returns token counts in their response (input_tokens, output_tokens, or the equivalent). Cost is computed against the public pricing in effect as of June 2026:
| Model | Input ($/M tok) | Output ($/M tok) |
|---|---|---|
| anthropic/claude-opus-4-8 | 15.00 | 75.00 |
| anthropic/claude-opus-4-7 | 15.00 | 75.00 |
| anthropic/claude-sonnet-4-6 | 3.00 | 15.00 |
| anthropic/claude-haiku-4-5 | 1.00 | 5.00 |
| google/gemini-2.5-pro | 1.25 | 10.00 |
| google/gemini-2.5-flash | 0.075 | 0.30 |
| google/gemini-2.5-flash-lite | 0.10 | 0.40 |
| ollama/gemma4 (local) | 0.00 | 0.00 |
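As a rough sketch of the cost calculation, assuming the prices in the table above and token counts taken from the provider's response, the per-call cost is just a weighted sum. The names `PRICES` and `call_cost` are illustrative, not part of any provider SDK.

```python
from typing import Dict, Tuple

# USD per million tokens: (input, output), copied from the pricing table.
PRICES: Dict[str, Tuple[float, float]] = {
    "anthropic/claude-opus-4-8": (15.00, 75.00),
    "anthropic/claude-opus-4-7": (15.00, 75.00),
    "anthropic/claude-sonnet-4-6": (3.00, 15.00),
    "anthropic/claude-haiku-4-5": (1.00, 5.00),
    "google/gemini-2.5-pro": (1.25, 10.00),
    "google/gemini-2.5-flash": (0.075, 0.30),
    "google/gemini-2.5-flash-lite": (0.10, 0.40),
    "ollama/gemma4": (0.00, 0.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, given the token counts the provider reported."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# 83 input and 92 output tokens on Opus 4.8:
print(f"{call_cost('anthropic/claude-opus-4-8', 83, 92):.6f}")  # 0.008145
```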
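Two hosts: a Mac Mini, which runs the Apex, Concierge, Sentinel, and Operator containers plus the local Ollama service, and a Windows machine, which runs the Mobile container.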
Both hosts have the same Anthropic and Google API keys available to their containers, so the model availability matrix is identical except for local Ollama (Mac Mini only).
Same prompt, every reachable model, run serially. This is the headline comparison: speed, cost, and correctness on a representative reasoning task.
| Model | Time (s) | Input tok | Output tok | Cost (USD) | Correct? |
|---|---|---|---|---|---|
| anthropic/claude-opus-4-8 | 1.98 | 83 | 92 | $0.008145 | ✓ |
| anthropic/claude-opus-4-7 | 2.63 | 88 | 94 | $0.008370 | ✓ |
| anthropic/claude-sonnet-4-6 | 3.43 | 68 | 122 | $0.002034 | ✓ |
| anthropic/claude-haiku-4-5 | 1.53 | 68 | 110 | $0.000618 | ✓ |
| google/gemini-2.5-pro | 7.25 | 58 | 109 | $0.001163 | ✓ |
| google/gemini-2.5-flash | 4.63 | 58 | 89 | $0.000031 | ✓ |
| google/gemini-2.5-flash-lite | 0.57 | 58 | 83 | $0.000039 | ✓ |
| ollama/gemma4 (local) | 12.28 | 74 | 587 | $0.000000 | ✓ |
| ollama/nemotron-mini:4b (local) | 8.26 | 66 | 360 | $0.000000 | ✗ (said 24) |
| google/gemini-3-pro-preview | — | — | — | — | ✗ (404 — retired) |
Same prompt, same models, but each run is initiated from inside a different agent's container. This isolates host hardware, network path, and runtime overhead — useful for deciding which host should run which workload.
During the host-level benchmark, the Anthropic credit balance hit zero across the fleet. Every Anthropic call returned HTTP 400: credit balance too low. For the host comparison, this page therefore reports only the Google-side and local-Ollama data. The original model benchmark above ran while Anthropic credits were live, so that data is intact.
| Host (agent's container) | Gemini 2.5 Flash | Gemini 2.5 Flash Lite | Gemma4 (local Ollama) |
|---|---|---|---|
| Mac Mini · Apex container | 3.40s / $0.000033 / ✓ | 0.60s / $0.000044 / ✓ | 15.11s / $0 / ✓ |
| Mac Mini · Concierge container | 3.12s / $0.000029 / ✓ | 0.65s / $0.000040 / ✓ | 14.49s / $0 / ✓ |
| Mac Mini · Sentinel container | 2.99s / $0.000030 / ✓ | 0.63s / $0.000039 / ✓ | 25.53s / $0 / ✓ |
| Mac Mini · Operator container | 3.89s / $0.000030 / ✓ | 0.56s / $0.000041 / ✓ | 20.85s / $0 / ✓* |
| Windows · Mobile container | 3.51s / $0.000034 / ✓ | 0.69s / $0.000042 / ✓ | — (no local Ollama) |
* The Mac Mini's local Ollama service is effectively single-tenant: the first parallel attempt dropped a connection, so the timing above is from a serial retry.
Speed and cost only tell you half the story. Different models are good at different shapes of work — math/logic, code generation, summarization, creative writing. The plan was four models × four task types. Anthropic's credit exhaustion limited what we could measure.
Anthropic models (Opus 4.8, Sonnet 4.6, Haiku 4.5) all returned credit-balance errors during this benchmark. Only Gemini 2.5 Flash returned usable data. The Anthropic rows will be filled in when credits are restored.
| Model | Math / Logic | Code Generation | Summarization (50 words) | Creative (4-line poem) |
|---|---|---|---|---|
| Gemini 2.5 Flash | 4.00s / $0.000028 ✓ correct (19) | 5.61s / $0.000036 ✓ typed return, handles edge case | 6.03s / $0.000028 ✓ exactly 50 words, faithful | 9.00s / $0.000015 ✓ avoided cliches, 4 metered lines |
| Opus 4.8 | pending (Anthropic credits depleted during this benchmark) | pending | pending | pending |
| Sonnet 4.6 | pending (Anthropic credits depleted during this benchmark) | pending | pending | pending |
| Haiku 4.5 | pending (Anthropic credits depleted during this benchmark) | pending | pending | pending |
The credit-exhaustion failure during the benchmark is itself instructive. The fleet kept working because Gemini was still available and the Sentinel agent's local Gemma4 was free. The provider-diversity design rule did more than survive contact with the bill: the billing failure is what validated it. A fleet that depends on a single provider can't tolerate a billing event. A fleet with three independent providers (Anthropic + Google + local) degrades gracefully.
This is the case study for the "design for provider diversity" principle in the Lessons section. We didn't plan for the credit to run out during a public benchmark — but the architecture absorbed it.
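One way to picture the graceful degradation described here is an ordered fallback chain. The sketch below is an assumption about how such routing could look, not a description of the fleet's actual code: `ProviderError`, `call_with_fallback`, and the two stand-in provider functions are all hypothetical.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Hypothetical error for a billing failure, exhausted quota, or outage."""


def call_with_fallback(prompt: str,
                       providers: List[Tuple[str, Callable[[str], str]]]
                       ) -> Tuple[str, str]:
    """Try providers in order and return (provider_name, reply) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this one failed
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def depleted(prompt: str) -> str:
    # Stand-in for a paid provider whose credits have run out.
    raise ProviderError("credit balance too low")


def local_model(prompt: str) -> str:
    # Stand-in for the free local model.
    return "19"


used, reply = call_with_fallback(
    "How many apples?", [("anthropic", depleted), ("local", local_model)]
)
print(used, reply)  # local 19
```

Placing the local model last means a billing failure can raise latency but never takes the fleet down outright.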
The interesting cost isn't a single API call. It's the recurring work — heartbeats, dreaming, inbox classification, polling — running unattended at 3am, every 30 minutes, every 10 minutes. That's where the bill compounds.
Assuming each cron firing uses ~500 input + 1500 output tokens (a defensible midpoint for the prompts in production):
| Job | Cadence | Firings / week | Was running on | Weekly cost |
|---|---|---|---|---|
| Inter-agent progress watch | every 10 min | 1,008 | Opus 4.8 | $117.18 |
| Unified inbox check | every 30 min | 336 | Opus 4.8 | $39.06 |
| High-frequency email check | every 4 hr | 42 | Opus 4.8 | $4.88 |
| Memory dreaming (×5 agents) | nightly, 3 phases | 105 | Opus 4.8 default | $12.21 |
| Heartbeats (mobile + Windows) | every 30 min | 672 | gpt-4o-mini (DEAD) | $0 (silently failing) |
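The firing counts follow directly from the cadences, and the weekly cost from the per-firing token assumption. The sketch below reproduces that arithmetic; the function names are illustrative. Note that exactly 500 input and 1,500 output tokens at the Opus 4.8 rates come to $0.12 per firing, or $120.96 per week for the 10-minute job, whereas the table's $117.18 works out to about $0.116 per firing. The table therefore appears to rest on slightly lower token counts than the round midpoint, so treat the output here as an estimate.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def firings_per_week(every_minutes: int, instances: int = 1) -> int:
    """Number of firings per week for a job that runs every `every_minutes`."""
    return instances * MINUTES_PER_WEEK // every_minutes


def weekly_cost(firings: int, input_tokens: int, output_tokens: int,
                price_in: float, price_out: float) -> float:
    """Weekly USD cost given per-firing token counts and $/M-token prices."""
    per_firing = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return firings * per_firing


watch = firings_per_week(10)                    # 1008
inbox = firings_per_week(30)                    # 336
email = firings_per_week(4 * 60)                # 42
heartbeats = firings_per_week(30, instances=2)  # 672
dreaming = 5 * 3 * 7                            # 5 agents x 3 phases x 7 nights = 105

# Opus 4.8 at $15 / $75 per million tokens, round 500 + 1500 token assumption.
print(round(weekly_cost(watch, 500, 1500, 15.00, 75.00), 2))  # 120.96
```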
The same jobs, routed to the model class that actually fits the work:
| Job | Now running on | Weekly cost | Saved / week |
|---|---|---|---|
| Inter-agent progress watch | Gemma4 local | $0.00 | $117.18 |
| Unified inbox check | Haiku 4.5 | $2.55 | $36.51 |
| High-frequency email check | Haiku 4.5 | $0.32 | $4.56 |
| Memory dreaming (×5 agents) | Haiku 4.5 / Gemma4 | $0.80 | $11.41 |
| Heartbeats | Gemma4 local + Gemini Flash Lite | $0.01 | (restored function) |
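Summing the two tables gives the fleet-wide picture. The snippet below only adds up the weekly figures already shown, using `Decimal` so that the cents stay exact; the dictionary keys are shorthand labels.

```python
from decimal import Decimal

# Weekly cost per job, copied from the "before" and "after" tables.
before = {
    "progress watch": Decimal("117.18"),
    "inbox check": Decimal("39.06"),
    "email check": Decimal("4.88"),
    "memory dreaming": Decimal("12.21"),
    "heartbeats": Decimal("0.00"),  # was silently failing
}
after = {
    "progress watch": Decimal("0.00"),
    "inbox check": Decimal("2.55"),
    "email check": Decimal("0.32"),
    "memory dreaming": Decimal("0.80"),
    "heartbeats": Decimal("0.01"),
}

total_before = sum(before.values())
total_after = sum(after.values())
print(total_before, total_after, total_before - total_after)  # 173.33 3.68 169.65
```

The net weekly saving is roughly 98% of the original recurring bill, and the heartbeat row adds one cent while restoring a job that had been failing silently.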
Cost reduction without quality loss is the whole point. Some jobs remain on Opus 4.8 because the output is published, brand-shaping, or genuinely complex.
After a quarter of running this in production, three principles keep proving themselves out.
The most honest organizing principle is the one you'd describe to a stranger in one sentence per agent: "this one is the best; this one is the cheapest; this one is the always-on; this one is restricted." That's the right level of abstraction. Naming agents after pets or letters doesn't survive contact with a real workload — roles do.
Quotas die. Models get deprecated. A whole provider can have an outage. When the fleet spans Anthropic + Google + local Ollama, no single provider failure takes the fleet down. The "free + always-on" agent is structurally the most important one because it's the only one that cannot fail for a billing reason.
One off-the-cuff "explain this to me" request is rounding error. A 10-minute cron job firing 144 times a day silently into a frontier model is where the budget evaporates. The audit is worth doing — but the better discipline is to set per-job model at job creation time, never default to "the smartest thing available."
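One way to enforce "set the model at job creation time" is to make the model a required field with no default. The `CronJob` class below is a hypothetical sketch of that idea, not the fleet's real job schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CronJob:
    """Hypothetical job definition in which the model must be chosen explicitly."""
    name: str
    every_minutes: int
    model: str  # deliberately no default: the author has to pick one

    def __post_init__(self) -> None:
        if not self.model.strip():
            raise ValueError(f"job {self.name!r} must name its model explicitly")


job = CronJob(name="progress-watch", every_minutes=10, model="ollama/gemma4")
print(job.model)  # ollama/gemma4
```

Omitting `model` raises a `TypeError` at construction, so a job can never silently inherit the most expensive option.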
This fleet was offline for a full day because a model documentation file said "Opus 4.8 works ✅" — but the runtime hadn't been upgraded to recognize it. The lesson: agent skills that change configuration should run a live probe before persisting, and a runtime upgrade should be part of the model-adoption workflow.
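The "live probe before persisting" rule can be sketched as a guard around the configuration change. Everything here is an assumption for illustration: `adopt_model` is a hypothetical helper, and `probe` stands for whatever real test call the runtime can make against the candidate model.

```python
from typing import Callable, Dict


def adopt_model(config: Dict[str, str], agent: str, model: str,
                probe: Callable[[str], bool]) -> Dict[str, str]:
    """Return an updated config only if a live probe of `model` succeeds."""
    if not probe(model):
        # Refuse the change; the caller keeps the existing, working config.
        raise RuntimeError(f"probe failed for {model!r}; config left unchanged")
    updated = dict(config)  # copy so the original is never half-modified
    updated[agent] = model
    return updated


current = {"apex": "anthropic/claude-opus-4-7"}
new_config = adopt_model(current, "apex", "anthropic/claude-opus-4-8",
                         probe=lambda m: True)  # stand-in for a real test call
print(new_config["apex"])  # anthropic/claude-opus-4-8
```

Because the probe runs through the same runtime that will serve the agent, a model the runtime does not yet recognize fails here rather than in production.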
The functional names used throughout this page (Apex / Operator / Mobile / Concierge / Sentinel) are not just for security — they force you to write about what each agent does, which is the thing that actually changes over time. The handles, ports, and bot tokens are implementation details.
This page is the first published artifact. It will be updated as the fleet evolves — particularly as host-level and task-type benchmarks complete, and as the inter-agent A2A protocol layer matures from primitive to production.