Local LLM Hardware Guide: VRAM, Quantization, and What You Can Actually Run
Why Running Locally Is in Fashion
Over the last few months, running models locally has become a hot topic across AI communities. A few things drove this:
- The emergence of AI agents has captured the public's imagination. For the first time, agents that go out into the real world and do work for you are a practical reality, and running a local model for at least some of those tasks is simply more cost-efficient.
- Models have become extremely efficient. Something like Qwen 2.5 can now run on a consumer-grade GPU and produce genuinely useful results.
- Growing distrust of closed-model AI labs. A meaningful number of users feel uncomfortable sending all their private data to OpenAI or Anthropic.
I recently made the jump into self-hosting by purchasing an NVIDIA DGX Spark. I had always played with the idea, but the agent use case finally pushed me over the edge. Here is a plain-English guide to help you make better decisions as you evaluate your own setup.
Why VRAM Is Everything
A lot of AI hobbyists and power users are trying to figure out what models will run on their hardware. The biggest constraint is VRAM — Video RAM — the high-speed memory integrated directly into your graphics card. Originally designed for rendering video, it turns out VRAM is also excellent at the parallel processing that LLMs require.
During inference (using a trained model to generate output), VRAM is consumed by three components:
- Model weights — the bulk of memory usage
- KV cache — grows with context length; more on this below
- Activations — temporary space the model needs while processing
When VRAM can't hold all of a model's layers, the remainder spills over into system RAM. This can cause 5–20× slower performance, the difference between a usable experience and watching a cursor blink.
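In practice, inference engines such as llama.cpp let you choose how many layers live on the GPU (the `--n-gpu-layers` flag) while the rest run from system RAM. Here's a minimal sketch of that split, assuming layers are roughly equal in size; the function name and the reserve figure are illustrative, not from any particular tool:

```python
def gpu_layers(vram_gb: float, num_layers: int, model_gb: float,
               reserve_gb: float = 1.5) -> int:
    """How many of a model's layers fit in VRAM; the rest spill to system RAM.

    reserve_gb keeps a little VRAM free for the KV cache and activations.
    """
    per_layer_gb = model_gb / num_layers          # assume roughly equal layer sizes
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(num_layers, int(usable_gb / per_layer_gb))

# A 70B model at Q4 (~40 GB across ~80 layers) on a 24 GB card:
print(gpu_layers(24, 80, 40))   # 45 layers on the GPU, 35 spilled to RAM
```

The more layers end up on the RAM side of that split, the closer you get to the 5–20× slowdown described above.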
What the Hell Is Quantization?
Think of packing a suitcase. Each parameter in a model is an item you want to pack, and quantization determines how tightly you pack each one.
- FP16 — Large box, plenty of space. Takes up the most VRAM.
- Q8 — Medium box. Roughly half the space of FP16.
- Q4 — Small box. Roughly a quarter the space of FP16.
The smaller the box, the lower the VRAM requirement. The tradeoff: the tighter you pack, the more quality you sacrifice. Quantization is the compression technique that makes this possible — it's essentially packing the model more tightly by reducing the precision of each weight.
The rule of thumb for VRAM requirements: model size in billions of parameters × bytes per weight (FP16 ≈ 2 bytes, Q8 ≈ 1 byte, Q4 ≈ 0.5 bytes).
A more realistic estimate that accounts for overhead: multiply that figure by roughly 1.2 to cover the KV cache and activations.
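The two estimates above fold into a few lines of Python. This is a back-of-envelope sketch; the function name and the 1.2 overhead factor are assumptions for illustration, chosen to match the figures used later in this guide:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Weights = parameters x bytes per weight, times ~1.2 for KV cache and activations."""
    weights_gb = params_billions * (bits_per_weight / 8)
    return weights_gb * overhead

print(round(vram_estimate_gb(7, 4), 1))    # 4.2  -> a 7B model at Q4
print(round(vram_estimate_gb(70, 4), 1))   # 42.0 -> a 70B model at Q4
print(round(vram_estimate_gb(7, 16), 1))   # 16.8 -> the same 7B at FP16
```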
For consumer hardware, look for Q4_K_M and Q5_K_M quantizations. Q4_K_M compresses the original model to roughly a quarter of its original size with only minor quality degradation — an excellent compromise. You could consider it the gold standard for local inference.
A Q8 model is nearly indistinguishable from the original. Moving from Q8 to Q4, the difference becomes observable on complex tasks. One useful rule: a larger model at heavier quantization will usually outperform a smaller model at lighter quantization. A 70B at Q4 beats a 13B at Q8 on reasoning tasks.
What Can We Actually Run Locally?
Here's where theory meets reality. The simple math at Q4_K_M is approximately 0.5–0.6 GB per billion parameters including overhead. A 7B model needs roughly 4–5 GB; a 70B model needs around 40 GB.
| VRAM | Example Hardware | Max Model (Q4_K_M) | Best Use Case |
|---|---|---|---|
| 8GB | RTX 4060 | 7–8B | Chatbot, simple tasks |
| 12–16GB | RTX 4060 Ti, RTX 4070 | 13–14B | Coding assistant, writing |
| 24GB | RTX 4090, RTX 3090 | 30–34B (70B with offload) | Serious inference, reasoning |
| 48GB+ | Dual 4090, DGX Spark | 70B+ cleanly | Frontier models, research |
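Inverting the 0.5–0.6 GB-per-billion rule gives a quick sanity check on the table. A sketch, with assumed values for `gb_per_billion` and the reserve; raw arithmetic lands a bit above the table's figures because the table deliberately leaves headroom for long context:

```python
def max_params_billions(vram_gb: float, gb_per_billion: float = 0.55,
                        reserve_gb: float = 1.5) -> float:
    """Largest Q4_K_M model (in billions of parameters) a card can plausibly hold."""
    return (vram_gb - reserve_gb) / gb_per_billion

print(round(max_params_billions(24)))   # 41 -> vs the table's 30-34B; the gap is context headroom
print(round(max_params_billions(8)))    # 12 -> similarly optimistic vs the 7-8B row
```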
Why Apple Silicon Is So Hot Right Now
The main advantage of Apple Silicon is its unified memory architecture — the CPU and GPU share the same memory pool. A Mac with 64GB of unified memory gives the GPU access to all 64GB, compared to a discrete GPU's hard VRAM ceiling.
You can roughly compare a 64GB MacBook Pro's unified memory to a 64GB GPU when it comes to inference. That said, NVIDIA still wins on raw inference speed for smaller models — the memory bandwidth on Apple Silicon is lower than a top-tier discrete GPU. Apple's advantage is on large 70B models, where unified memory eliminates the painful PCIe offloading penalty. Quiet operation is another genuine benefit for home office setups.
AMD: The Black Sheep
Let's give credit where it's due. AMD has made real progress running LLMs on newer cards, driven largely by the open-source community rather than AMD itself. AMD cards are meaningfully cheaper than NVIDIA equivalents — important for budget-conscious builders.
The honest downside: AMD's ROCm software stack requires Linux, extensive command-line setup, and has real compatibility gaps with some inference frameworks. It is not plug-and-play. If you want the smoothest local LLM experience, NVIDIA or Apple Silicon is the safer bet. AMD is for builders who don't mind the friction.
Practical Takeaways
The silent killer: context length. This is where "it should fit" estimations break down. The KV cache grows with every token in your context window. A model that fits fine with a short prompt can run out of VRAM mid-conversation if you're using a 128K context window. Always leave headroom.
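To see why long context is a silent killer, you can estimate the cache directly: every layer stores one K and one V vector per token. A sketch using Llama-3-8B-style geometry (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache; treat these numbers as illustrative):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: one K and one V vector per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-3-8B-style geometry: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(round(kv_cache_gb(32, 8, 128, 2_048), 2))     # 0.27 GB for a 2K context
print(round(kv_cache_gb(32, 8, 128, 131_072), 1))   # 17.2 GB for a 128K context
```

A model whose weights fit comfortably in 8 GB can still blow past a 24 GB card once the cache for a 128K window is added, which is exactly the headroom problem described above.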
- At Q4_K_M, estimate 0.5–0.6 GB per billion parameters including overhead — 7B ≈ 4GB, 70B ≈ 40GB.
- When VRAM runs out, the model spills to system RAM. This works, but expect a significant performance hit.
- A larger model at heavier quantization beats a smaller model at lighter quantization for most reasoning tasks.
- Q4_K_M is the gold standard for consumer hardware. Q5_K_M if you have headroom.
Quick Buying Guide
- 8GB VRAM — Absolute entry point. 7B models only.
- 12–16GB VRAM — Sufficient for most people. 13B models comfortably.
- 24GB VRAM — Solid local inference for 30B+. Recommended if budget allows.
- 48GB+ VRAM — Workstation territory. Full 70B models without compromise.
What I can tell you from personal experience is that there is something magical about running your own model locally. It feels like you are more in control of your own destiny. Which probably explains why I'm already thinking about buying my second DGX Spark.
Related Guides
- Full GPU comparison with benchmark numbers — NVIDIA, AMD, and where each tier makes sense.
- How Much RAM Do You Need to Run Llama 3? Exact memory requirements for every Llama 3 model size, with and without quantization.
- Mac for Local LLMs: Complete Apple Silicon Guide. Every Apple chip from M1 to M5 Max — which models fit, real performance, and best tools.
- Best Hardware for Claude-Distilled Models. Hardware shortlist for the newer GGUF distill wave — 7B through 70B.