March 30, 2026 · 8 min read · Infrastructure

Local LLMs vs Cloud APIs: The Real Cost Math

We run a 70B parameter model on a Mac Studio for $0/month in inference costs. Here is when local beats cloud, when it does not, and the exact breakeven calculation.

Our Actual Setup

We run Llama 3.3 70B quantized to 4-bit on a Mac Studio M4 Max with 128GB of unified memory. The model occupies roughly 95GB of that memory and stays loaded with a 24-hour keep-alive, so the first request of the day is slow while the weights load and every request after that starts without a load delay. No per-token billing. No rate limits. No data leaving the machine.
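The article does not name the serving runtime. As a minimal sketch, assuming an Ollama server on its default local port, the 24-hour keep-alive could be requested per call through the `keep_alive` field of the generate endpoint. The model tag, the helper name `build_generate_request`, and the prompt are illustrative assumptions, and the snippet only builds the request body rather than sending it.

```python
import json

# Assumption: an Ollama server at its default address. The article does not
# say which runtime is actually used.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.3:70b", keep_alive="24h"):
    """Build a request body that asks the server to keep the model loaded.

    keep_alive="24h" asks the server to hold the model in memory for
    24 hours after this request, so later requests skip the load step.
    """
    return {
        "model": model,            # hypothetical tag for the 4-bit 70B model
        "prompt": prompt,
        "stream": False,           # return one JSON object, not a stream
        "keep_alive": keep_alive,
    }


payload = build_generate_request("Summarize this ticket in one sentence.")
body = json.dumps(payload)  # this string would be POSTed to OLLAMA_GENERATE_URL
```

Sending `body` with any HTTP client is enough to refresh the keep-alive window, since the timer restarts on each request that carries the field.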

That sounds perfect. It is not. It is perfect for our workload, which is where the real question lives.

The Naive Comparison (And Why It Is Wrong)

If you just price tokens, the cloud APIs look great. GPT-4-class output at a few dollars per million tokens feels cheap compared to buying a $6,000 Mac Studio. But "cheap per token" is only the right frame when you use tokens the way the pricing page assumes you do: a bursty, human-in-the-loop chat pattern.

Autonomous agents do not look like that. They look like a cron job that wakes up every 5 minutes, pulls a queue, runs a 4-step chain, writes to a database, and goes back to sleep. Multiply that by 31 agents running 24 hours a day and you are inside an entirely different pricing regime.
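To see how quickly a schedule like that adds up, here is a rough volume estimate. The agent count, the 5-minute interval, and the 4-step chain come from the paragraph above. The 500 tokens per step is purely an assumed placeholder, so substitute a measured figure from your own logs.

```python
def monthly_tokens_millions(agents, interval_minutes, steps_per_run,
                            tokens_per_step, days=30):
    """Estimate monthly token volume (in millions) for scheduled agents.

    Assumes every wake-up runs the full chain, which overstates volume
    if the queue is often empty.
    """
    runs_per_day = (24 * 60) // interval_minutes
    total_runs = agents * runs_per_day * days
    total_tokens = total_runs * steps_per_run * tokens_per_step
    return total_tokens / 1_000_000


# 31 agents, every 5 minutes, 4 steps per run (from the article);
# 500 tokens per step (input + output) is an assumption.
volume = monthly_tokens_millions(31, 5, 4, 500)
print(f"{volume:.2f} million tokens per month")  # about 536 million
```

Even with a modest per-step token count, the schedule alone puts the workload in the hundreds of millions of tokens per month.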

The Breakeven Formula

Here is the honest math. For a given workload:

Monthly tokens (M): How many million tokens per month does your system actually consume, input + output?
Cloud cost per million (C): Your provider's blended rate. For agents that send long prompts and produce short outputs, this is usually close to the input-token price.
Hardware amortization (H): Cost of the machine divided by however many months you plan to keep it. A $6,000 Mac Studio over 36 months is about $167/month before electricity.
Electricity (E): A Mac Studio under sustained inference load draws around 200W. At typical US residential rates that works out to $15 to $30/month depending on how hot you keep it running.

Breakeven hits when M × C exceeds H + E. With H at about $167 and E between $15 and $30, the local box costs roughly $182 to $197 per month. At a blended rate of $3/million tokens, that means the local box wins once you cross roughly 61 to 66 million tokens per month, which is trivial for any serious agent system.
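The same arithmetic as a small calculator, so you can plug in your own numbers. The hardware price, 36-month lifetime, 200W draw, and $3/million rate are the article's figures. The $0.15/kWh electricity price and the function names are assumptions for illustration.

```python
def electricity_cost(watts, price_per_kwh, hours_per_day=24.0, days=30):
    """Monthly electricity cost E for a machine under sustained load."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


def breakeven_tokens_millions(hardware_cost, months, electricity, cloud_rate):
    """Monthly volume M (millions of tokens) at which M * C equals H + E."""
    amortization = hardware_cost / months          # H
    return (amortization + electricity) / cloud_rate


# 200W around the clock at an assumed $0.15/kWh: 144 kWh, about $21.60/month.
E = electricity_cost(200, 0.15)

# $6,000 machine over 36 months against a $3/million blended cloud rate.
M = breakeven_tokens_millions(6000, 36, E, 3.0)
print(f"E = ${E:.2f}/month, breakeven = {M:.1f} million tokens/month")
```

Above that volume the local box is cheaper and below it the cloud is. Changing the lifetime or the blended rate moves the breakeven proportionally, which is why it is worth rerunning with your provider's actual prices.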

What Changes the Math

Three things that the per-token view hides and that matter more than the sticker price:

Rate limits. Cloud providers throttle you. Local models do not. If your workflow bursts (say, fan-out research across 40 topics at once), the local box runs them in parallel while the cloud API either rate-limits you or charges the premium tier.
Data gravity. If your prompts include customer data, revenue data, or anything you do not want sitting in a vendor log, local inference stops being a cost question and becomes a compliance question.
Iteration speed. Local lets you burn tokens carelessly while you iterate on a prompt. Cloud makes every experiment cost money, so you unconsciously run fewer experiments and ship worse prompts.

When Cloud Still Wins

Local is not always right. Cloud still wins in these cases:

You need frontier capability. If your workflow only works with the absolute top of the benchmark — multimodal reasoning, long-context retrieval, tool-use at GPT-4.5 or Claude Opus quality — you do not yet get that locally, period.
Your traffic is spiky and low. A few thousand calls per month to summarize support tickets will never come close to paying off a dedicated machine. Pay-per-token is fine.
You do not want to be a sysadmin. Running a local model means managing updates, keep-alive, memory pressure, thermal throttling, and the occasional reboot. If that is not your team's strength, the hourly cost of running the box is hidden and real.

The Hybrid Pattern That Actually Works

Most of our production systems are hybrid. Local Llama 3.3 70B handles the 80% of traffic that is bulk, structured, repetitive work — summarization, tagging, routing, data extraction, rewriting, simple reasoning. When a task needs frontier capability, the workflow promotes it to Claude or GPT through an AI gateway so we can swap providers by config instead of code.
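A minimal sketch of that routing rule, not the actual gateway: the `Task` type, the `needs_frontier` flag, and the stub backends are hypothetical names introduced for illustration. The idea is that the decision lives in one function and the backends are passed in as configuration, so swapping a provider means changing a mapping rather than the workflow code.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Task kinds the article lists as bulk work suited to the local model.
LOCAL_TASKS = {"summarization", "tagging", "routing",
               "extraction", "rewriting", "simple_reasoning"}


@dataclass
class Task:
    kind: str
    prompt: str
    needs_frontier: bool = False  # hypothetical flag set by the workflow


def choose_backend(task: Task) -> str:
    """Local is the default; cloud is the escalation."""
    if task.needs_frontier or task.kind not in LOCAL_TASKS:
        return "cloud"
    return "local"


def run(task: Task, backends: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch the prompt to whichever backend the rule selects."""
    return backends[choose_backend(task)](task.prompt)


# Stub backends standing in for a local model server and a cloud gateway.
backends = {
    "local": lambda prompt: f"[local] {prompt}",
    "cloud": lambda prompt: f"[cloud] {prompt}",
}

print(run(Task("tagging", "Tag this ticket"), backends))
print(run(Task("tagging", "Hard edge case", needs_frontier=True), backends))
```

Unknown task kinds escalate to the cloud by default here, which trades some cost for safety. Defaulting unknown kinds to local would be the cheaper choice if quality matters less than spend.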

The mental model: local is the default, cloud is the escalation. Cost floor is basically zero, and quality ceiling is still the frontier.

The Honest Answer

If you run a real agent workload — meaning something that wakes up on its own and does work without a human in the loop — local wins at a smaller volume than the cloud provider pricing pages want you to believe. The breakeven is measured in tens of millions of tokens per month, not billions.

If you are running a chatbot that 50 people poke at a few times a day, pay the cloud bill and do not think about it.

Everyone else: do the math with your actual token volume, not the press-release one.