Local LLM Hardware Guide: VRAM, Quantization, and What You Can Actually Run

Why Running Locally Is in Fashion

Over the last few months, running models locally has become a hot topic across AI communities.

I recently made the jump into self-hosting by purchasing an NVIDIA DGX Spark. I had always played with the idea, but the agent use case finally pushed me over the edge. Here is a plain-English guide to help you make better decisions as you evaluate your own setup.


Why VRAM Is Everything

A lot of AI hobbyists and power users are trying to figure out what models will run on their hardware. The biggest constraint is VRAM — Video RAM — the high-speed memory integrated directly into your graphics card. Originally designed for rendering video, it turns out VRAM is also excellent at the parallel processing that LLMs require.

During inference (using a trained model to generate output), VRAM is consumed by three components:

- Model weights: the parameters themselves, usually the largest share
- KV cache: per-token attention state that grows with context length
- Activations and overhead: working buffers for the inference framework

When VRAM doesn't cover all the layers of a model, it spills over into system RAM. This can cause 5–20× slower performance — a meaningful difference between a usable experience and watching a cursor blink.


What the Hell Is Quantization?

Think of packing a suitcase. Each parameter in a model is an item you want to pack, and quantization determines how tightly you fold each one.

The tighter you pack, the lower the VRAM requirement; the tradeoff is that you sacrifice quality along the way. Quantization is the compression technique that makes this possible: it reduces the numeric precision of each weight, say from 16-bit floats down to 4-bit values, so the whole model occupies a fraction of the space.

The rule of thumb for VRAM requirements:

(Parameters in billions × Bits per parameter) / 8 = VRAM in GB

A more realistic estimate that accounts for overhead:

((Parameters in billions × Bits per parameter) × 1.2) / 8 = VRAM in GB
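As a sanity check, that overhead-adjusted rule of thumb fits in a few lines of Python. The 1.2 multiplier is the same rough overhead factor as above; real usage will vary with runtime and context length:

```python
def vram_gb(params_billion: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for loading a model's weights.

    params_billion: model size in billions of parameters
    bits_per_param: quantization width (4 for Q4, 8 for Q8, 16 for FP16)
    overhead: fudge factor for runtime buffers (~20% is a common rule of thumb)
    """
    return params_billion * bits_per_param * overhead / 8

print(round(vram_gb(7, 4), 1))    # 7B at Q4:   ~4.2 GB
print(round(vram_gb(70, 4), 1))   # 70B at Q4:  ~42 GB
print(round(vram_gb(7, 16), 1))   # 7B at FP16: ~16.8 GB
```

By the same math, a 7B model at full FP16 needs around 16.8 GB, which is why quantization is what makes consumer GPUs viable at all.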

For consumer hardware, look for Q4_K_M and Q5_K_M quantizations. Q4_K_M compresses a model to roughly a quarter of its 16-bit size with only minor quality degradation, an excellent compromise. You could consider it the gold standard for local inference.

A Q8 model is nearly indistinguishable from the original. Moving from Q8 to Q4, the difference becomes observable on complex tasks. One useful rule: a larger model at heavier quantization will usually outperform a smaller model at lighter quantization. A 70B at Q4 beats a 13B at Q8 on reasoning tasks.


What Can We Actually Run Locally?

Here's where theory meets reality. The simple math at Q4_K_M is approximately 0.5–0.6 GB per billion parameters including overhead. A 7B model needs roughly 4–5 GB; a 70B model needs around 40 GB.

8GB VRAM (RTX 4060, RTX 3070)
Handles 7–8B models at Q4_K_M at 40+ tokens/second. Good for chatbots and simple tasks. Tight, but functional for the most popular small models.

12–16GB VRAM (RTX 4060 Ti 16GB, RTX 4070, RTX 5080)
Comfortably runs 13B models at Q4_K_M, with headroom for Q8 on smaller models. Good for coding assistants, long-form writing, and general-purpose chat. Stretches to some 14B models.

24–32GB VRAM (RTX 3090, RTX 4090, RTX 5090 32GB)
The sweet spot for serious local inference. Comfortably fits 30–34B models at Q4_K_M. With the 4090 or 5090, you can push into 70B territory with some CPU offloading, with real reasoning and coding skills close to commercially available APIs.

48–80GB VRAM (Dual RTX 4090 at 48GB combined, workstation GPUs, NVIDIA DGX Spark)
Full 70B models without compromise. Frontier-class open-source models like Llama 3.1 70B at Q8 quality. This is where local inference stops feeling like a compromise.
| VRAM | Example Hardware | Max Model (Q4_K_M) | Best Use Case |
|---|---|---|---|
| 8GB | RTX 4060 | 7–8B | Chatbot, simple tasks |
| 12–16GB | RTX 4060 Ti, RTX 4070 | 13–14B | Coding assistant, writing |
| 24GB | RTX 4090, RTX 3090 | 30–34B (70B with offload) | Serious inference, reasoning |
| 48GB+ | Dual 4090, DGX Spark | 70B+ cleanly | Frontier models, research |

Why Apple Silicon Is So Hot Right Now

The main advantage of Apple Silicon is its unified memory architecture — the CPU and GPU share the same memory pool. A Mac with 64GB of unified memory gives the GPU access to all 64GB, compared to a discrete GPU's hard VRAM ceiling.

You can roughly compare a 64GB MacBook Pro's unified memory to a 64GB GPU for inference, with one caveat: macOS reserves a portion of unified memory for the system, so only around 70–75% is typically available to the GPU by default. NVIDIA still wins on raw inference speed for smaller models, since Apple Silicon's memory bandwidth is lower than a top-tier discrete GPU's. Apple's advantage shows on large 70B models, where unified memory eliminates the painful PCIe offloading penalty. Quiet operation is another genuine benefit for home office setups.


AMD: The Black Sheep

Let's give credit where it's due. AMD has made real progress running LLMs on newer cards, driven largely by the open-source community rather than AMD itself. AMD cards are meaningfully cheaper than NVIDIA equivalents — important for budget-conscious builders.

The honest downside: AMD's ROCm software stack works best on Linux, demands extensive command-line setup, and has real compatibility gaps with some inference frameworks. It is not plug-and-play. If you want the smoothest local LLM experience, NVIDIA or Apple Silicon is the safer bet. AMD is for builders who don't mind the friction.


Practical Takeaways

The silent killer: context length. This is where "it should fit" estimations break down. The KV cache grows with every token in your context window. A model that fits fine with a short prompt can run out of VRAM mid-conversation if you're using a 128K context window. Always leave headroom.
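To make that growth concrete, here is a rough sketch of KV-cache size for a hypothetical Llama-3-8B-style configuration. The 32 layers, 8 KV heads (grouped-query attention), head dimension of 128, and FP16 cache are illustrative assumptions; exact numbers vary by model and runtime:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: keys + values stored per layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token * context_tokens / 2**30

# Assumed Llama-3-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
print(kv_cache_gb(32, 8, 128, 4_096))     # short chat:       ~0.5 GiB
print(kv_cache_gb(32, 8, 128, 131_072))   # full 128K window: ~16 GiB
```

At a full 128K window, the cache alone is around 16 GiB, more than the entire 8GB tier above before a single model weight is loaded.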

Quick Buying Guide


What I can tell you from personal experience is that there is something magical about running your own model locally. It feels like you are more in control of your own destiny. Which probably explains why I'm already thinking about buying my second DGX Spark.

