A modern graphics card spends most of its life doing one thing: multiplying enormous tables of numbers together, trillions of operations per second, to render a frame on your screen.
An AI model, at its core, also spends most of its time doing one thing: multiplying enormous tables of numbers together, trillions of operations per second, to produce a word.
They are, at the math level, almost the same problem. So you would expect them to just click together. Load a model, point it at the GPU, go.
It almost never works that simply. And the reason is a layer of software that most tutorials skip right past.
What "Running a Model" Actually Means
Three terms come up constantly when people talk about local AI. They are worth getting straight early, because everything else builds on them.
Model. A large file, typically several gigabytes, that contains millions or billions of numbers called weights. These weights are what the model "knows." They were produced by a training run over enormous amounts of text, and they encode patterns that let the model predict what comes next in a sequence. You do not retrain a model when you use it; you just load it and run it.
Inference. The process of actually using a loaded model. You feed it an input — your question — and it does math on the weights to produce an output. Each time you get a response, inference happened. The word comes from logic ("drawing an inference"), but in practice it just means "running the model on new input."
Token. The unit the model works in. A token is roughly a word fragment — "running" might be two tokens, "AI" one, punctuation its own. Speed is measured in tokens per second, which maps loosely to words per second. A comfortable reading speed is around 3–4 tokens per second; a fast local model on a good GPU can do 30–60.
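To make "tokens per second" concrete, here is a small back-of-the-envelope sketch. The 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text, not a figure from this article), and the 300-token reply length is just an illustrative choice.

```python
# Assumption: roughly 0.75 English words per token (rule of thumb only;
# the real ratio depends on the tokenizer and the text).
WORDS_PER_TOKEN = 0.75


def words_per_minute(tokens_per_second: float,
                     words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a generation speed in tokens/s to approximate words/min."""
    return tokens_per_second * words_per_token * 60


def seconds_for_reply(reply_tokens: int, tokens_per_second: float) -> float:
    """How long a reply of the given length takes to generate."""
    return reply_tokens / tokens_per_second


# Compare a reading-pace speed with the faster speeds quoted above.
for tps in (4, 30, 60):
    print(f"{tps:>2} tok/s ~ {words_per_minute(tps):.0f} words/min, "
          f"300-token reply in {seconds_for_reply(300, tps):.0f} s")
```

Under that assumption, 4 tokens per second works out to about 180 words per minute, which is why it feels like reading pace, while 30 tokens per second finishes the same reply in seconds.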
Inference is the expensive part. It is a chain of matrix multiplications — the exact kind of math a GPU was designed to parallelize. But "designed for similar math" and "can be handed this arbitrary workload" are two different things, and the gap between them is what this article is about.
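As a rough illustration of "a chain of matrix multiplications," here is a toy two-layer pass in plain Python. The weights, sizes, and the ReLU step between layers are all made up for the example; a real model has billions of weights and many more layers, but the shape of the work is the same.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    cols_b = list(zip(*b))
    # Every output cell is an independent dot product, which is exactly
    # the kind of work a GPU can spread across thousands of cores.
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def relu(m):
    """Zero out negative values (a common step between layers)."""
    return [[max(0.0, v) for v in row] for row in m]


# Toy "weights" (illustrative values only).
w1 = [[0.5, -1.0, 0.25],
      [1.5, 0.5, -0.5]]      # 2 x 3
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [2.0, -1.0]]           # 3 x 2

x = [[1.0, 2.0]]             # 1 x 2 input

hidden = relu(matmul(x, w1))   # first multiplication
output = matmul(hidden, w2)    # second multiplication, fed by the first
print(output)
```

Nothing here is GPU-specific. The point is only that the workload is a long series of independent multiply-and-add steps, which is why the hardware is a good fit in principle.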
The Layer Nobody Talks About
Here is the actual path between your question and your GPU:
Most of the confusion about local AI sits in the middle of that path: the compute layer. This is the software that translates "neural network operations" into "instructions your specific chip understands." Without a working compute layer for your hardware, a model has no way to talk to your GPU.
How the Compute Layer Evolved
For most of the deep-learning era, this translation layer had one good implementation: CUDA, a programming toolkit NVIDIA released for its own chips in 2007. It arrived early, matured fast, and the entire AI research ecosystem built on top of it. When someone writes an AI framework (PyTorch, TensorFlow, whatever comes next), CUDA support is assumed. Everything else is optional.
This is not a story about any company being malicious. It is just how ecosystems grow. CUDA was there, it was good, and developers used it. The result is that "GPU support" and "NVIDIA GPU support" became synonymous in most AI tooling — not by design, but by the weight of accumulated defaults.
Two other paths exist and matter:
ROCm is AMD's equivalent: a compute layer AMD built for its own hardware, designed to run the same AI code that CUDA handles. For years it was aimed at datacenter hardware, the big, expensive GPUs in server rooms. Getting it to work on the Radeon cards in consumer PCs was a separate, slower project that is only now nearing completion for gaming cards.
Vulkan is a low-level, cross-vendor graphics and compute API that runs on almost any GPU made in the last decade, regardless of brand. It was designed with games in mind rather than AI math, so it is typically slower for inference, but it works almost everywhere and usually needs nothing beyond the regular graphics driver. Think of it as the universal fallback when nothing better is available.
| Hardware | Compute layer | Status for AI |
|---|---|---|
| NVIDIA GPU (GTX 10xx+) | CUDA | Best-supported, most tools assume it |
| AMD Radeon RX 5000–9000 | ROCm | Now working on consumer cards — the subject of this series |
| Any modern GPU | Vulkan | Broad compatibility, slower for AI workloads |
| AMD Ryzen AI / NPU | FLM | Dedicated AI chip, fast for specific workloads |
| No GPU | CPU | Works everywhere, significantly slower |
The practical consequence: if you had an AMD gaming card a couple of years ago and tried to run a local AI tool, the tool would silently route everything through your CPU. Not because your GPU was incapable — it was doing the exact same math for games every day — but because the compute layer table did not have an entry for your chip family.
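The silent fallback described above can be pictured as a table lookup with a default. This is a deliberately simplified sketch, not how any particular tool is implemented: the table, the chip-family names, and `pick_backend` are all hypothetical.

```python
# Hypothetical support table: chip family -> compute layer.
# The keys are placeholders, not real architecture codes.
SUPPORTED = {
    "nvidia-example": "CUDA",
    "amd-datacenter-example": "ROCm",
}


def pick_backend(chip_family: str, table: dict) -> str:
    """Return the compute layer for a chip, defaulting to the CPU.

    A missing entry raises no error and prints no warning, which is
    why the fallback is easy to miss.
    """
    return table.get(chip_family, "CPU")


# A capable gaming card with no table entry quietly lands on the CPU.
print(pick_backend("amd-gaming-example", SUPPORTED))

# In this toy model, one new entry changes the outcome.
patched = {**SUPPORTED, "amd-gaming-example": "ROCm"}
print(pick_backend("amd-gaming-example", patched))
```

The sketch only shows why the hardware's capability never enters the decision: the lookup happens before any math does.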
Part 2 of this series is the story of exactly that gap, how it was found, and a pull request that closed it.
Why Models Got Small Enough to Care
The compute layer getting better is only half the story. The other half is that models got small enough to fit in gaming hardware.
Two years ago, the models worth running required enormous amounts of GPU memory — 40, 80 gigabytes — the kind of hardware found in research labs. The models that fit on a gaming card were too limited to be genuinely useful.
That changed. A few things happened at once: better training techniques, better compression methods (called quantization — a way of reducing a model's file size by storing its numbers less precisely, trading a small amount of quality for a large reduction in memory use), and a wave of open-source models that punched well above their size.
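To show what "storing its numbers less precisely" can look like, here is a minimal round-to-nearest sketch. It is an illustration under simplifying assumptions (one shared scale factor, made-up weights); the quantization formats used by real models are more elaborate.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]          # illustrative values

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q} worst error={worst:.4f}")
```

Fewer bits means smaller integers to store and a larger rounding error per weight, which is the quality-for-memory trade described above.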
A current 7B or 14B model — where the B stands for billion parameters, a rough measure of model size and capability — holds its own against early commercial AI assistants on many practical tasks. And these models fit comfortably in 8–16 GB of VRAM (the memory on your graphics card — faster and more directly accessible to the GPU than system RAM).
8–16 GB of VRAM is normal gaming hardware. An RX 7900 XT ships with 20 GB. A lot of people already own the hardware this needs.
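The arithmetic behind those sizes is simple enough to sketch: parameters times bits per weight, divided by eight, gives bytes. The `headroom_gb` allowance for context and runtime buffers is an illustrative guess, not a measured figure, and real usage varies by tool and context length.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Rough check: weights plus an assumed headroom must fit in VRAM."""
    return weight_gb(params_billions, bits_per_weight) + headroom_gb <= vram_gb


for params in (7, 14):
    for bits in (16, 8, 4):
        size = weight_gb(params, bits)
        print(f"{params}B at {bits}-bit: ~{size:.1f} GB of weights, "
              f"fits in 8 GB: {fits(params, bits, 8)}, "
              f"fits in 16 GB: {fits(params, bits, 16)}")
```

At 16 bits per weight a 7B model needs about 14 GB for its weights alone; at 4 bits it needs about 3.5 GB, and a 14B model about 7 GB. That is the difference between research-lab hardware and a gaming card.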
What This Series Covers
The pieces are now in place to tell the full story. Here is where the rest of the series goes:
Part 2 — "Every Chip Has a Secret Name" digs into what GPU architecture codes are, why they caused a matchmaking failure between Radeon cards and ROCm, and the specific pull request that fixed it. The story is technical but grounded — no prior GPU knowledge required.
Part 3 — Hands-on is the walkthrough: install Lemonade, pull a model sized for your hardware, get a first response. Proof that the foundation from parts 1 and 2 works in practice.
Part 4 — Lemonade Server covers what the stack can actually do once the GPU path is working: multimodal models that can chat, generate images, transcribe speech, and speak — all running locally through a single server.
Part 5 — Contributing is a different kind of article. It is about contributing to the project that makes this work, what it is like to use AI assistance to write real open-source code, and why reviewing every line carefully is a better long-term strategy than trusting that it looks right.
The quick-start tutorial is also available if you just want something running now.
Part 1 of the Local AI Stack series. Part 2 →