
When it pays to run AI models on your own hardware
Cloud APIs are the right call more often than not. For some workloads, though, owning the hardware is cheaper, more private, and more predictable. Here is how we decide which is which.

The default answer to where a model should run is a cloud API, and most of the time the default is right. You get the strongest models, no capital outlay, and someone else loses sleep over the hardware. But the default is not always the cheapest option, and it is not always the safest one. A new class of compact machines is quietly making the alternative worth a second look.
What actually changed
NVIDIA's DGX Spark is an AI computer about the size of a small desktop, carrying 128 GB of unified memory. On its own it can run sizeable open-weight models locally. Connect two of them over a 200 Gb/s link and you have 256 GB to work with, enough to host quantised models in the hundreds of billions of parameters without a single byte leaving the room. A few years ago that capability lived in a data centre and was priced to match. Now it fits under a desk.
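To see why 128 GB and 256 GB are the numbers that matter, it helps to run the rough arithmetic. The sketch below estimates the memory taken by model weights alone as parameter count times bytes per weight. The function names, the 80% headroom figure, and the example model sizes are our own illustrative assumptions, not vendor specifications, and real deployments also need room for the KV cache and activations.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billions * bytes_per_weight


def fits(params_billions: float, bits_per_weight: int,
         memory_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction of total memory.

    The headroom value is an assumption: it reserves space for the
    KV cache, activations, and the operating system.
    """
    return weight_memory_gb(params_billions, bits_per_weight) <= memory_gb * headroom


# Hypothetical model sizes, checked against one box (128 GB) and two (256 GB).
for params, bits in [(70, 16), (70, 8), (400, 4)]:
    need = weight_memory_gb(params, bits)
    print(f"{params}B at {bits}-bit: ~{need:.0f} GB of weights, "
          f"one box: {fits(params, bits, 128)}, two boxes: {fits(params, bits, 256)}")
```

The point of the exercise is that quantisation, not just raw memory, decides what fits: the same model can be out of reach at 16 bits per weight and comfortable at 8 or 4.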
Two honest reasons to own the box
The first is privacy. When a model runs on hardware you control, the sensitive context stays on the device, and you decide exactly what the agent can see and what it can never reach. For a client working with health records, legal files, or customer financials, the line 'the data never leaves the building' is not a phrase for a brochure. It is often the difference between a project that can go ahead and one that legal will not sign off on.
The second is cost, and it is less obvious. Cloud inference is billed per token. For spiky, exploratory work, that pricing is a gift. For steady, high-volume jobs that run the same shape of task all day, per-token pricing slowly turns into the most expensive line in the whole system. Owning the hardware converts most of that variable bill into a fixed one: the purchase price spread over the machine's working life, plus power and upkeep. Once the monthly API spend for a job would exceed that figure, the machine pays for itself and then keeps earning.
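The break-even point is simple enough to sketch. The code below compares a per-token cloud bill with the amortised monthly cost of an owned machine and solves for the volume at which they meet. Every figure in the example is a placeholder we made up for illustration, not a real price, and the function names are our own. The model also ignores two things that matter in practice: whether the machine can physically serve that many tokens a month, and whether the local model is good enough for the job.

```python
def monthly_cloud_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Cloud bill for a month, given a flat price per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million


def monthly_owned_cost(hardware_price: float, lifetime_months: int,
                       running_cost_per_month: float) -> float:
    """Purchase price spread evenly over the machine's life, plus power and upkeep."""
    return hardware_price / lifetime_months + running_cost_per_month


def break_even_tokens_per_month(hardware_price: float, lifetime_months: int,
                                running_cost_per_month: float,
                                price_per_million: float) -> float:
    """Monthly token volume at which the cloud bill equals the owned cost."""
    owned = monthly_owned_cost(hardware_price, lifetime_months, running_cost_per_month)
    return owned / price_per_million * 1_000_000


# Placeholder numbers only: a 7,200 machine kept for 36 months,
# 40 a month to run, against a cloud price of 2.00 per million tokens.
volume = break_even_tokens_per_month(7200, 36, 40, 2.0)
print(f"Break-even at about {volume:,.0f} tokens per month")
```

Below the break-even volume the cloud is cheaper; above it, each extra token widens the gap in favour of the owned machine.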
Where the cloud still wins, clearly
We are not romantic about local hardware. Cloud APIs give you the best frontier models, scale on demand, and zero maintenance. If your volume is low, if every call needs the strongest possible reasoning, or if your workload is unpredictable, renting wins and it is not close. The mistake is not choosing the cloud. The mistake is assuming one answer fits every workload in the building.
How we make the call
When we scope a system, where each part runs is part of the design, not a detail we sort out later. We weigh three things. Sensitivity: does this data have to stay on premises? Volume: is the job steady enough that a fixed cost beats paying per token? And the model itself: does the task genuinely need a frontier model, or will a smaller open one do it well enough that nobody can tell the difference?
In practice the answer is almost always a mix. A frontier model in the cloud takes the open-ended reasoning. A smaller model on owned hardware takes the high-volume, sensitive, repetitive work. The user sees one system. Underneath, each job runs wherever it is cheapest and safest to run it.
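Sensitivity, volume, and model need can be written down as a small routing rule. What follows is a minimal sketch rather than our production logic: the `Job` fields, the `route` function, and the rule ordering are all hypothetical names and choices made for illustration. In particular, treating sensitivity as a hard constraint that overrides everything else is an assumption; a sensitive job that also needs a frontier model is a conflict a real project would have to resolve by hand.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    CLOUD = "cloud"
    LOCAL = "local"


@dataclass(frozen=True)
class Job:
    """One unit of work, described by the three questions we ask (hypothetical fields)."""
    name: str
    must_stay_on_premises: bool
    steady_high_volume: bool
    needs_frontier_model: bool


def route(job: Job) -> Target:
    """Pick where a job runs. The rule order is an illustrative assumption."""
    # Sensitivity is treated as a hard constraint: such data never leaves.
    if job.must_stay_on_premises:
        return Target.LOCAL
    # Open-ended reasoning goes to the strongest available model.
    if job.needs_frontier_model:
        return Target.CLOUD
    # Steady, repetitive volume is where a fixed cost beats per-token pricing.
    if job.steady_high_volume:
        return Target.LOCAL
    # Low or unpredictable volume: renting wins.
    return Target.CLOUD


jobs = [
    Job("summarise patient notes", True, True, False),
    Job("draft strategy memo", False, False, True),
    Job("classify support tickets", False, True, False),
    Job("occasional ad hoc query", False, False, False),
]
for job in jobs:
    print(f"{job.name}: {route(job).value}")
```

Keeping the rule in one place is what lets the user see a single system while each job quietly runs in a different location.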
That split only works if you design for it early. Bolting local hardware onto a system built entirely around one cloud API is slow and painful. Drawing the boundary up front is simple. It is one of the first questions we ask a client, because the answer shapes almost everything that comes after it.
