
How AI infrastructure works: local vs. managed

A breakdown of where and how LLMs actually run: weights, inference, managed vs. local, and why any of it matters.

Illustration comparing local hardware setups with managed cloud AI

What is managed AI?

Managed AI means a cloud provider hosts and operates the model infrastructure for you. You don't touch the servers, you don't worry about scaling, you don't deal with uptime. You call an endpoint and get a response back.

The "managed" part is everything that keeps that endpoint alive: scaling up when traffic spikes, patching hardware, guaranteeing uptime SLAs. All the ops work you'd otherwise have to do yourself.
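From the client's side, all of that ops work collapses into a single request. A minimal sketch of what "calling an endpoint" amounts to, with a hypothetical URL and field names (not any specific provider's API):

```python
import json

# Hypothetical managed-AI endpoint -- the URL and payload shape are
# illustrative, not a real provider's API.
ENDPOINT = "https://api.example-provider.com/v1/generate"

def build_request(prompt: str, max_tokens: int = 256) -> str:
    """Everything the client sends: a prompt and a few knobs.
    Scaling, uptime, and hardware are the provider's problem."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens})

# The entire "integration": POST this body to ENDPOINT, read the response.
body = build_request("Explain unified memory in one sentence.")
print(body)
```

That asymmetry is the whole product: a one-line request on your side, a fleet of GPUs and an ops team on theirs.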


Local vs. cloud-managed

Running a model locally means you own and operate everything. The model lives on your hardware, nothing leaves your machine, you're responsible for making it work.

Managed AI is the opposite: you're calling out to someone else's servers running the model at scale. Same basic idea: prompt goes in, output comes back. You're just not the one keeping the lights on.

The full spectrum:

  • Local: you run everything. Fully private. Limited by your hardware.
  • Direct API: someone else runs the model, you just call it.
  • Managed cloud: same as above, but wrapped inside your existing cloud provider's ecosystem with their compliance certs, billing, and IAM on top.

AWS vs. Anthropic under the hood

It might be AWS hardware either way: Anthropic likely runs its own API on AWS or a mix of cloud providers under the hood. When you use Bedrock, AWS wraps that access and says "this is our product now, with our SLAs, our billing, our compliance certs."

The actual GPU doing the math might be in the same data center either way. It's a white-label product: same factory, different brand on the box.


Model weights

Model weights are a giant list of numbers (literally billions of them) that encode everything the model learned during training. When you download a model to run locally, those numbers are what you're downloading. That file is the model.


Inference

Inference is what happens when you actually use the model. You give it a prompt, and it runs your input through those billions of numbers, a long chain of matrix operations, to produce a response.

  • Training: the process that originally produced the numbers. The AI company does this.
  • Weights: the resulting file of numbers from training.
  • Inference: running your input through those numbers to get an output.
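The three-step relationship above can be sketched with a toy one-parameter "model" (y = w * x) standing in for billions of weights. Everything here is illustrative: real training is gradient descent, not a closed-form fit, and real weights files are tensors, not JSON.

```python
import json
import os
import tempfile

def train(examples):
    """'Training': least-squares fit for a single weight w.
    Real training is gradient descent over billions of parameters."""
    num = sum(x * y for x, y in examples)
    den = sum(x * x for x, _ in examples)
    return {"w": num / den}

def infer(weights, x):
    """'Inference': run an input through the stored numbers."""
    return weights["w"] * x

# Train once (the AI company's job) and save the weights to disk.
weights = train([(1, 2), (2, 4), (3, 6)])
path = os.path.join(tempfile.gettempdir(), "toy_model.json")
with open(path, "w") as f:
    json.dump(weights, f)  # this file *is* the model

# Later, anyone with the file can load it and run inference.
with open(path) as f:
    loaded = json.load(f)
print(infer(loaded, 10))  # → 20.0
```

Note that training happened exactly once; inference is cheap to repeat against the saved file, which is why downloading weights is enough to run a model.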

Why weights load into RAM

The weights file sitting on disk is inert; the processor can't do math directly on a file. To run inference, the weights need to be in fast-access memory so the processor can read them in real time. Disk is orders of magnitude too slow. Loading the model just means moving the weights from disk into RAM.
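A minimal sketch of that disk-to-RAM step, assuming a made-up weights file of packed floats (real formats like GGUF or safetensors add headers and metadata, but the principle is the same):

```python
import array
import os
import struct
import tempfile

# Fabricate a tiny "weights file": just packed 32-bit floats on disk.
path = os.path.join(tempfile.gettempdir(), "weights.bin")
weights_on_disk = [0.1, -0.5, 2.0, 0.75]
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(weights_on_disk)}f", *weights_on_disk))

# "Loading the model": copy bytes from disk into a RAM-resident array.
# Only now can the processor index the weights fast enough for inference.
in_ram = array.array("f")
with open(path, "rb") as f:
    in_ram.frombytes(f.read())

print(len(in_ram))  # 4 weights now addressable in memory
```

Scale those four floats up to tens of billions and you get the familiar rule of thumb: the model needs roughly as much RAM as its weights file is large.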


CPU vs. GPU

A CPU has a small number of very powerful cores designed to handle complex logic sequentially (good for general-purpose tasks, branching decisions, running your OS).

A GPU has thousands of smaller, simpler cores designed to do the same math operation on many things simultaneously. Originally built for rendering pixels, which is massively parallel work.

Inference is also massively parallel. Matrix multiplication across billions of numbers is exactly what a GPU is built for. A CPU can do it, just much slower.
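The reason it parallelizes so well: a matrix-vector multiply breaks into independent dot products, one per output row, and no row depends on another. A rough analogy using a thread pool (a GPU does this with thousands of cores instead of a handful of threads):

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, vec):
    """One output element = one independent dot product."""
    return sum(r * v for r, v in zip(row, vec))

matrix = [[1, 2], [3, 4], [5, 6]]
vec = [10, 1]

# Sequential (CPU-style): one row at a time.
sequential = [dot(row, vec) for row in matrix]

# Parallel (GPU-style): all rows at once, since none depends on another.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(dot, matrix, [vec] * len(matrix)))

print(sequential == parallel)  # same answer either way
```

The answers match regardless of execution order, which is exactly the property a GPU exploits: with billions of weights, there are billions of independent multiply-adds to hand out.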


Unified memory

Traditionally, a computer has two separate memory pools: RAM for the CPU and VRAM for the GPU. Data has to get copied between them over a bus, which is slow.

Apple Silicon unified these into a single pool that both the CPU and GPU share directly. So when you have 24GB of unified memory, there's no copying; the GPU reads the weights from the same memory the CPU uses. For local inference, that's a meaningful advantage.


Closed vs. open weights

Closed weights means the weights are kept private and never released. There's no file to download; the only way to access the model is through an API or product. Claude, GPT-4, and Gemini are all closed weights.

Open weights means the weights are publicly downloadable. You can pull them down and run them on your own hardware. Kimi k2, Meta's Llama, Mistral, and DeepSeek are all open weights.

Open weights is not the same as open source. Open weights means you have the file: you can run it, fine-tune it, modify it. Open source would mean you also have the training code, data pipelines, and fine-tuning recipes. Almost no serious lab releases everything.

The reason companies keep weights closed is straightforward: the weights are the product. If Anthropic released Claude's weights, anyone could run it for free, strip out the safety training, or fork it into a competitor.