Local LLM Hardware Guide: VRAM, Quantization, and What You Can Actually Run
Why Running Locally Is in Fashion
Over the last few months, running models locally has become a hot topic across AI communities. A few things drove this:
- The emergence of AI agents has captured the public's imagination. For the first time, agents that go out into the real world and do work for you are a practical reality, and running a local model for at least some of those tasks is simply more cost-efficient.
- Models have become extremely efficient. Something like Qwen 2.5 can now run on a consumer-grade GPU and produce genuinely useful results.
- Growing distrust of closed-model AI labs. A meaningful number of users feel uncomfortable sending all their private data to OpenAI or Anthropic.
I recently made the jump into self-hosting by purchasing an NVIDIA DGX Spark. I had always played with the idea, but the agent use case finally pushed me over the edge. Here is a plain-English guide to help you make better decisions as you evaluate your own setup.
Why VRAM Is Everything
A lot of AI hobbyists and power users are trying to figure out what models will run on their hardware. The biggest constraint is VRAM — Video RAM — the high-speed memory integrated directly into your graphics card. Originally designed for rendering video, it turns out VRAM is also excellent at the parallel processing that LLMs require.
During inference (using a trained model to generate output), VRAM is consumed by three components:
- Model weights — the bulk of memory usage
- KV cache — grows with context length; more on this below
- Activations — temporary space the model needs while processing
When VRAM can't hold all of a model's layers, the remainder spills over into system RAM. This can cause 5–20× slower performance, the difference between a usable experience and watching a cursor blink.
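In practice, inference engines such as llama.cpp let you choose how many layers live on the GPU (the `--n-gpu-layers` flag) while the rest run from system RAM. Here's a minimal sketch of that split, assuming layers are roughly equal in size; the function name and the reserve figure are illustrative, not from any particular tool:

```python
def gpu_layers(vram_gb: float, num_layers: int, model_gb: float,
               reserve_gb: float = 1.5) -> int:
    """How many of a model's layers fit in VRAM; the rest spill to system RAM.

    reserve_gb keeps a little VRAM free for the KV cache and activations.
    """
    per_layer_gb = model_gb / num_layers          # assume roughly equal layer sizes
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(num_layers, int(usable_gb / per_layer_gb))

# A 70B model at Q4 (~40 GB across ~80 layers) on a 24 GB card:
print(gpu_layers(24, 80, 40))   # 45 layers on the GPU, 35 spilled to RAM
```

The more layers end up on the RAM side of that split, the closer you get to the 5–20× slowdown described above.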
What the Hell Is Quantization?
Think of packing a suitcase. Each parameter in a model is an item you want to pack, and quantization determines how tightly you pack each one.
- FP16 — Large box, plenty of space. Takes up the most VRAM.
- Q8 — Medium box. Roughly half the space of FP16.
- Q4 — Small box. Roughly a quarter the space of FP16.
The smaller the box, the lower the VRAM requirement. The tradeoff: the tighter you pack, the more quality you sacrifice. Quantization is the compression technique that makes this possible — it's essentially packing the model more tightly by reducing the precision of each weight.
The rule of thumb for VRAM requirements: model size in billions of parameters × bytes per weight (FP16 ≈ 2 bytes, Q8 ≈ 1 byte, Q4 ≈ 0.5 bytes).
A more realistic estimate that accounts for overhead: multiply that figure by roughly 1.2 to cover the KV cache and activations.
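The two estimates above fold into a few lines of Python. This is a back-of-envelope sketch; the function name and the 1.2 overhead factor are assumptions for illustration, chosen to match the figures used later in this guide:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Weights = parameters x bytes per weight, times ~1.2 for KV cache and activations."""
    weights_gb = params_billions * (bits_per_weight / 8)
    return weights_gb * overhead

print(round(vram_estimate_gb(7, 4), 1))    # 4.2  -> a 7B model at Q4
print(round(vram_estimate_gb(70, 4), 1))   # 42.0 -> a 70B model at Q4
print(round(vram_estimate_gb(7, 16), 1))   # 16.8 -> the same 7B at FP16
```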
For consumer hardware, look for Q4_K_M and Q5_K_M quantizations. Q4_K_M compresses the original model to roughly a quarter of its original size with only minor quality degradation — an excellent compromise. You could consider it the gold standard for local inference.
A Q8 model is nearly indistinguishable from the original. Moving from Q8 to Q4, the difference becomes observable on complex tasks. One useful rule: a larger model at heavier quantization will usually outperform a smaller model at lighter quantization. A 70B at Q4 beats a 13B at Q8 on reasoning tasks.
What Can We Actually Run Locally?
Here's where theory meets reality. The simple math at Q4_K_M is approximately 0.5–0.6 GB per billion parameters including overhead. A 7B model needs roughly 4–5 GB; a 70B model needs around 40 GB.
| VRAM | Example Hardware | Max Model (Q4_K_M) | Best Use Case |
|---|---|---|---|
| 8GB | RTX 4060 | 7–8B | Chatbot, simple tasks |
| 12–16GB | RTX 4060 Ti, RTX 4070 | 13–14B | Coding assistant, writing |
| 24GB | RTX 4090, RTX 3090 | 30–34B (70B with offload) | Serious inference, reasoning |
| 48GB+ | Dual 4090, DGX Spark | 70B+ cleanly | Frontier models, research |
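Inverting the 0.5–0.6 GB-per-billion rule gives a quick sanity check on the table. A sketch, with assumed values for `gb_per_billion` and the reserve; raw arithmetic lands a bit above the table's figures because the table deliberately leaves headroom for long context:

```python
def max_params_billions(vram_gb: float, gb_per_billion: float = 0.55,
                        reserve_gb: float = 1.5) -> float:
    """Largest Q4_K_M model (in billions of parameters) a card can plausibly hold."""
    return (vram_gb - reserve_gb) / gb_per_billion

print(round(max_params_billions(24)))   # 41 -> vs the table's 30-34B; the gap is context headroom
print(round(max_params_billions(8)))    # 12 -> similarly optimistic vs the 7-8B row
```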
Why Apple Silicon Is So Hot Right Now
The main advantage of Apple Silicon is its unified memory architecture — the CPU and GPU share the same memory pool. A Mac with 64GB of unified memory gives the GPU access to all 64GB, compared to a discrete GPU's hard VRAM ceiling.
You can roughly compare a 64GB MacBook Pro's unified memory to a 64GB GPU when it comes to inference. That said, NVIDIA still wins on raw inference speed for smaller models — the memory bandwidth on Apple Silicon is lower than a top-tier discrete GPU. Apple's advantage is on large 70B models, where unified memory eliminates the painful PCIe offloading penalty. Quiet operation is another genuine benefit for home office setups.
AMD: The Black Sheep
Let's give credit where it's due. AMD has made real progress running LLMs on newer cards, driven largely by the open-source community rather than AMD itself. AMD cards are meaningfully cheaper than NVIDIA equivalents — important for budget-conscious builders.
The honest downside: AMD's ROCm software stack requires Linux, extensive command-line setup, and has real compatibility gaps with some inference frameworks. It is not plug-and-play. If you want the smoothest local LLM experience, NVIDIA or Apple Silicon is the safer bet. AMD is for builders who don't mind the friction.
Practical Takeaways
The silent killer: context length. This is where "it should fit" estimations break down. The KV cache grows with every token in your context window. A model that fits fine with a short prompt can run out of VRAM mid-conversation if you're using a 128K context window. Always leave headroom.
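To see why long context is a silent killer, you can estimate the cache directly: every layer stores one K and one V vector per token. A sketch using Llama-3-8B-style geometry (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache; treat these numbers as illustrative):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: one K and one V vector per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-3-8B-style geometry: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(round(kv_cache_gb(32, 8, 128, 2_048), 2))     # 0.27 GB for a 2K context
print(round(kv_cache_gb(32, 8, 128, 131_072), 1))   # 17.2 GB for a 128K context
```

A model whose weights fit comfortably in 8 GB can still blow past a 24 GB card once the cache for a 128K window is added, which is exactly the headroom problem described above.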
- At Q4_K_M, estimate 0.5–0.6 GB per billion parameters including overhead — 7B ≈ 4GB, 70B ≈ 40GB.
- When VRAM runs out, the model spills to system RAM. This works, but expect a significant performance hit.
- A larger model at heavier quantization beats a smaller model at lighter quantization for most reasoning tasks.
- Q4_K_M is the gold standard for consumer hardware. Q5_K_M if you have headroom.
Quick Buying Guide
- 8GB VRAM — Absolute entry point. 7B models only.
- 12–16GB VRAM — Sufficient for most people. 13B models comfortably.
- 24GB VRAM — Solid local inference for 30B+. Recommended if budget allows.
- 48GB+ VRAM — Workstation territory. Full 70B models without compromise.
What I can tell you from personal experience is that there is something magical about running your own model locally. It feels like you are more in control of your own destiny. Which probably explains why I'm already thinking about buying my second DGX Spark.
Related Guides
- Full GPU comparison with benchmark numbers — NVIDIA, AMD, and where each tier makes sense.
- How Much RAM Do You Need to Run Llama 3? Exact memory requirements for every Llama 3 model size, with and without quantization.
- Mac for Local LLMs: Complete Apple Silicon Guide. Every Apple chip from M1 to M5 Max — which models fit, real performance, and best tools.
- Best Hardware for Claude-Distilled Models. Hardware shortlist for the newer GGUF distill wave — 7B through 70B.