How to Run AI on Your GPU Locally, Explained

A modern graphics card spends most of its life doing one thing: multiplying enormous tables of numbers together, billions of operations per second, to render a frame on your screen.

An AI model, at its core, also spends most of its time doing one thing: multiplying enormous tables of numbers together, billions of operations per second, to produce a word.

They are, at the math level, almost the same problem. So you would expect them to click together easily. Load a model, point it at the GPU, go.

There is one more piece sitting between them: a layer of software that most tutorials move past quickly. Once you understand it, the whole picture clicks into place.

What "Running a Model" Actually Means

Three terms come up constantly when people talk about local AI. They are worth getting straight early, because everything else builds on them.

Model. A large file, typically several gigabytes, that contains millions or billions of numbers called weights. These weights are what the model "knows." They were produced by a training process that worked through enormous amounts of text, and they encode patterns that let the model predict what comes next in a sequence. You do not retrain a model when you use it; you just load it and run it.

Inference. The process of actually using a loaded model. You feed it an input (your question), and it does math with the weights to produce an output. Each time you get a response, inference has happened. The word comes from logic ("drawing an inference"), but in practice it just means "running the model on new input."

Token. The unit the model works in. A token is roughly a word fragment. "running" might be two tokens, "AI" one, punctuation its own. Speed is measured in tokens per second, which maps loosely to words per second. A comfortable reading speed is around 3 to 4 tokens per second, and a fast local model on a good GPU can do 30 to 60.
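To make tokens per second concrete, here is a small back-of-the-envelope sketch. The 300-token response length is an assumption chosen for illustration; the 30 and 60 tokens-per-second rates are the ones quoted above.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a response of num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# A hypothetical 300-token answer (an assumed length, a few paragraphs of text)
# at the two ends of the range quoted for a fast local model.
at_30 = seconds_to_generate(300, 30)
at_60 = seconds_to_generate(300, 60)
print(f"{at_30:.1f}s at 30 tok/s, {at_60:.1f}s at 60 tok/s")
```

Real generation speed is not perfectly steady, so treat this as a rough estimate rather than a benchmark.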

Inference is the expensive part. It is a chain of matrix multiplications, the exact kind of math a GPU was designed to parallelize. But "designed for similar math" and "can be handed this arbitrary workload" are two different things, and the space between them is what this article is about.
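A toy version of that chain, as a sketch only: the weights below are made up, the "model" has two tiny layers, and real models use far larger matrices and more kinds of operations. The point is just that producing a token is multiply, then multiply again, then pick the highest score.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Zero out negative values, a common step between layers."""
    return [max(0.0, v) for v in vector]


# Made-up weights for illustration; a real model has billions of these.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]]   # 3x2: input -> hidden
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]      # 2x3: hidden -> one score per token


def next_token_scores(x):
    """One pass through the toy model: two matrix multiplications in a row."""
    hidden = relu(matvec(W1, x))
    return matvec(W2, hidden)


scores = next_token_scores([1.0, 2.0])
# The token with the highest score is the model's prediction.
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, "-> token", best)
```

Every multiplication inside `matvec` is independent of the others in the same row, which is why this kind of work spreads so well across thousands of GPU cores.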

The Layer Nobody Talks About

Here is the actual path between your question and your GPU:

Your prompt: "What is machine learning?"
↓
AI framework: tokenizes input, runs the inference loop
↓
Compute layer (the focus of this series): CUDA / ROCm / Vulkan, translating AI ops to hardware instructions
↓
GPU driver: schedules work on the physical chip
↓
GPU silicon: matrix math across thousands of parallel cores
↑
Response token: result travels back up and decodes to text

The invisible stack. Every response you get from a local AI model travels this path.

Most questions about local AI come back to the middle of that stack: the compute layer. This is the software that translates "neural network operations" into "instructions your specific chip understands." With a working compute layer for your hardware, a model can run on your GPU; without one, it cannot.

How the Compute Layer Evolved

For most of AI's history, this translation layer had one good implementation: CUDA, a programming toolkit NVIDIA built for their own chips in 2007. It arrived early, matured fast, and the entire AI research ecosystem built on top of it. When someone writes an AI framework, whether PyTorch, TensorFlow, or whatever came next, CUDA support is assumed. Everything else is optional.

This is simply how ecosystems grow. Over time, "GPU support" and "NVIDIA GPU support" came to mean much the same thing in most AI tooling. Not by design, but through the natural weight of accumulated defaults.

Two other paths exist and matter:

ROCm is AMD's equivalent: a compute layer AMD built for its own hardware, designed to run the same AI code that CUDA handles. For years it was aimed at datacenter hardware, the big GPUs in server rooms. Bringing it to the Radeon cards in consumer PCs was a separate project, and it has recently reached the gaming cards too.

Vulkan is a lower-level graphics and compute standard that runs on almost any GPU made in the last decade, regardless of brand. It was designed with games in mind rather than AI math, so it is typically slower for inference, but it works almost everywhere and needs nothing beyond an up-to-date GPU driver. Think of it as the dependable universal option that is always there when you need it.

Hardware	Compute layer	Status for AI
NVIDIA GPU (GTX 10xx+)	CUDA	Best-supported, most tools assume it
AMD Radeon RX 5000 to 9000	ROCm	Now working on consumer cards, the subject of this series
Any modern GPU	Vulkan	Broad compatibility, slower for AI workloads
AMD Ryzen AI / NPU	FLM	Dedicated AI chip, fast for specific workloads
No GPU	CPU	Works everywhere, significantly slower

The practical consequence: if you had an AMD gaming card a couple of years ago and tried to run a local AI tool, the tool would route everything through your CPU. Not because your GPU lacked the power, since it was doing the exact same math for games every day, but simply because the compute layer table did not yet list your chip family. That entry has since been added.
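The fallback described above can be pictured as a simple table lookup. This is a minimal sketch under stated assumptions: the support table, the chip family names, and the `pick_backend` function are all hypothetical, and no real tool's data structures or API are implied.

```python
# Hypothetical support table: which chip families each compute layer recognizes.
# The family names are placeholders, not real architecture codes.
SUPPORTED = {
    "cuda": {"nvidia-family-x"},
    "rocm": {"amd-datacenter-y"},
}


def pick_backend(chip_family, supported, has_vulkan=False):
    """Return the first backend that lists chip_family, else fall back."""
    for backend, families in supported.items():
        if chip_family in families:
            return backend
    # Nothing recognized the chip: use Vulkan if the tool offers it, else the CPU.
    return "vulkan" if has_vulkan else "cpu"


# A gaming card whose family is missing from the table ends up on the CPU...
before = pick_backend("amd-gaming-z", SUPPORTED)

# ...until someone adds the entry. The hardware itself has not changed.
SUPPORTED["rocm"].add("amd-gaming-z")
after = pick_backend("amd-gaming-z", SUPPORTED)

print(before, "->", after)
```

Nothing about the card's capability enters the decision; only the table does.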

Part 2 of this series is the story of exactly that gap, how it was found, and a pull request that closed it.

Why Models Got Small Enough to Care

The compute layer getting better is only half the story. The other half is that models got small enough to fit in gaming hardware.

Two years ago, the models worth running required enormous amounts of GPU memory, 40 or 80 gigabytes, the kind of hardware found in research labs. The models that fit on a gaming card were still finding their footing.

That changed. A few things happened at once: better training techniques, better compression methods, and a wave of open-source models that punched well above their size. The compression is called quantization: a way of reducing a model's file size by storing its numbers less precisely, trading a small amount of quality for a large reduction in memory use.
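The idea behind quantization fits in a few lines. The sketch below uses a deliberately simple scheme, one shared scale mapping every weight to an 8-bit integer, with made-up weight values. Formats used by real local-AI tools are more elaborate (smaller bit widths, per-block scales), so this illustrates the principle rather than any particular format.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.82, -0.41, 0.05, -1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each integer needs 1 byte instead of the 2 or 4 bytes of a typical float.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"max rounding error: {max_error:.4f}")
```

The restored values are close to the originals but not identical, which is the "small amount of quality" traded for the smaller file.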

A current 7B or 14B model holds its own against early commercial AI assistants on many practical tasks. (The B stands for billion parameters, a rough measure of model size and capability.) Once quantized, these models fit comfortably in 8 to 16 GB of VRAM, the memory on your graphics card, which is faster and more directly accessible to the GPU than system RAM.

8 to 16 GB of VRAM is normal gaming hardware. An RX 7900 XT ships with 20 GB. A lot of people already own the hardware this needs.
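The sizing arithmetic is easy to check. A rough sketch: it counts only the weights (parameters times bits per weight) and ignores the extra memory inference needs for context and intermediate results, so actual usage is somewhat higher. The bit widths shown are common choices, picked here for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


for params in (7, 14):
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params}B model at {bits}-bit: about {size:.1f} GB of weights")
```

At 16 bits a 14B model needs roughly 28 GB for its weights alone; at 4 bits the same model needs about 7 GB, which is what brings it within reach of a gaming card.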

What This Series Covers

The pieces are now in place to tell the full story. Here is where the rest of the series goes:

Part 2: "Every Chip Has a Secret Name" digs into what GPU architecture codes are, why they once kept Radeon cards and ROCm from recognizing each other, and the specific pull request that brought them together. The story is technical but grounded, with no prior GPU knowledge required.

Part 3: Hands-on is the walkthrough: install Lemonade, pull a model sized for your hardware, get a first response. Proof that the foundation from Parts 1 and 2 works in practice.

Part 4: Lemonade Server covers what the stack can actually do once the GPU path is working: multimodal models that can chat, generate images, transcribe speech, and speak, all running locally through a single server.

Part 5: Contributing is a different kind of article. It is about contributing to the project that makes this work, what it is like to use AI assistance to write real open-source code, and why reviewing every line carefully is a better long-term strategy than trusting that it looks right.

The quick-start tutorial is also available if you just want something running now.

Part 1 of the Local AI Stack series.