How Much RAM Do You Need to Run Llama 3?
TL;DR
- Llama 3 8B needs ~16GB RAM for CPU inference or a 12GB GPU with Q4 quantization
- Llama 3 70B requires approximately 42–48GB RAM at Q4_K_M (64GB recommended for headroom) or dual RTX 3090s (48GB VRAM) for quantized GPU inference
- Llama 3 405B is enterprise territory — budget 256GB RAM or 8x H100s
- Q4 quantization cuts memory requirements by ~70% with only a modest quality tradeoff — use it
Running Llama 3 locally is genuinely exciting. Meta's open-weight models punch well above their weight class, and with the right hardware, you can run a capable AI assistant entirely offline. But "right hardware" is doing a lot of work in that sentence.
The RAM and VRAM requirements for Llama 3 vary wildly depending on which model variant you're targeting and how you plan to run it. Get this wrong and you'll either be waiting 45 seconds per response or watching your system crash entirely.
This guide gives you the exact numbers you need, no hand-waving.
The Memory Math Behind LLMs
Before jumping into specific configurations, you need to understand why these models eat so much memory. It's not magic — it's arithmetic.
Every parameter in a language model needs to be stored in memory during inference. The amount of space each parameter takes depends on numerical precision:
- FP32 (full precision): 4 bytes per parameter
- FP16/BF16 (half precision): 2 bytes per parameter
- Q4 quantization: ~0.5 bytes per parameter
So for Llama 3 8B at BF16 precision: 8 billion × 2 bytes = 16GB just for weights. Add ~20% overhead for activations and the KV cache, and you're looking at 18-19GB minimum.
That's the formula: Parameters × Bytes per weight × 1.2 ≈ Minimum memory needed
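As a quick sanity check, the formula is easy to script (a minimal sketch; the 1.2 factor is the rough ~20% overhead estimate above):

```python
def min_memory_gb(params_billion: float, bytes_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Weights plus ~20% headroom for activations and the KV cache."""
    return params_billion * bytes_per_weight * overhead

# Llama 3 8B at BF16: 8 * 2.0 * 1.2
print(round(min_memory_gb(8, 2.0), 1))   # → 19.2
# Llama 3 70B at Q4 (~0.5 bytes/weight): 70 * 0.5 * 1.2
print(round(min_memory_gb(70, 0.5), 1))  # → 42.0
```

Note how the 70B figure lands right at the 42GB floor quoted in the TL;DR.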
VRAM vs. RAM: Why It Matters
GPU VRAM and system RAM are not interchangeable. VRAM is fast — an RTX 4090 has ~1TB/s of memory bandwidth. System DDR5 RAM tops out around 50-100GB/s. Running a model on CPU with system RAM instead of GPU VRAM means you're trading speed for accessibility. You'll get functional inference, but "functional" might mean 2-5 tokens per second instead of 40+.
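That speed gap follows directly from bandwidth: during decoding, each generated token streams the full set of weights through the processor once, so bandwidth divided by model size gives a hard ceiling on tokens per second. A back-of-envelope sketch (the bandwidth and model-size figures here are rough assumptions):

```python
def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode tokens/sec; real throughput lands below this."""
    return bandwidth_gb_s / model_size_gb

# Llama 3 8B Q4_K_M weighs roughly 5GB:
print(round(decode_ceiling_tps(60, 5.0)))    # dual-channel DDR5, ~12 tok/s ceiling
print(round(decode_ceiling_tps(1000, 5.0)))  # RTX 4090-class VRAM, ~200 tok/s ceiling
```

The CPU ceiling of ~12 tok/s lines up with the 5-15 tok/s real-world numbers quoted later for the 8B model.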
For serious use, GPU VRAM is the goal. System RAM is the fallback.
Llama 3 8B: The Sweet Spot for Most People
The 8B model is where most local AI enthusiasts should start. It's capable, reasonably fast, and actually fits on consumer hardware.
GPU Requirements (8B)
| Precision | VRAM Needed | Minimum GPU |
|---|---|---|
| BF16 | 15-16GB | RTX 4080 (16GB) |
| Q4_K_M | 5-6GB | RTX 3060 (12GB) |
| Q6_K | ~7GB | RTX 3060 (12GB) |
The Q4_K_M quantized version is the practical choice for most people. At 5-6GB VRAM, it runs comfortably on an RTX 3060 12GB or RTX 4070 12GB. On an RTX 4070 Ti SUPER (16GB), you're looking at 40+ tokens per second — fast enough that it genuinely feels like a real-time conversation.
If you have an RTX 4080 or 4090 and want maximum quality, run BF16. The difference in output quality is noticeable, especially for complex reasoning tasks.
CPU/RAM Requirements (8B)
If you're going CPU-only with llama.cpp:
- Minimum: 16GB RAM (tight, expect slowdowns)
- Recommended: 32GB RAM
- Performance: Expect 5-15 tokens/sec on a modern CPU like Ryzen 9 7950X
It works. It's not fast. For casual use or overnight batch processing, it's a legitimate option. For interactive chat, it gets frustrating quickly.
Llama 3 70B: Where Things Get Serious
The 70B model is a different beast. At BF16, you need 140GB+ of VRAM — that's multiple A100s or H100s. For most people, that means quantization is not optional, it's mandatory.
GPU Requirements (70B)
| Precision | VRAM Needed | Hardware |
|---|---|---|
| BF16 | 140GB+ | 2x A100 80GB minimum |
| Q4_K_M | 35-40GB | 2x RTX 3090 (48GB total) |
| Q6_K | ~55GB | 3x RTX 3090 or 2x RTX 6000 Ada |
The dual RTX 3090 setup is the community-tested sweet spot for 70B Q4. Two used RTX 3090s will run you around $1,600 total, giving you 48GB of combined VRAM. You'll get roughly 8 tokens per second on Q4_K_M — slower than the 8B, but the output quality jump is substantial for complex tasks.
Multi-GPU inference requires some setup (llama.cpp splits a model across GPUs reasonably well), but it's not rocket science.
RAM Requirements (70B)
For CPU offloading or pure CPU inference:
- Minimum: 48GB RAM
- Recommended: 64GB RAM (128GB for comfortable headroom with long contexts)
- Performance: 1-3 tokens/sec on CPU-only — technically functional, practically painful
If you're doing CPU-only 70B inference, you need DDR5 RAM. DDR4's lower bandwidth makes an already slow process noticeably worse. This isn't a recommendation to run 70B on CPU — it's a warning that if you must, at least use fast RAM.
Llama 3 405B: Enterprise Hardware Only
Let's be direct: Llama 3 405B is not a consumer model. The numbers don't lie.
| Configuration | Requirement |
|---|---|
| BF16 GPU | ~810GB VRAM (two 8x H100 80GB nodes) |
| Q4 GPU | ~200GB VRAM (3-4x A100 80GB) |
| CPU RAM | 256GB+ DDR5 |
Unless you're running a research lab or a well-funded startup, this model isn't for your home setup. If you're curious about 405B-class performance without the hardware, use the API. For everyone else, 70B Q4 gets you 80% of the capability at a fraction of the cost.
Quantization: Your Best Friend for Local Inference
Quantization is the technique that makes local LLM inference practical. It reduces the precision of model weights, shrinking memory requirements dramatically.
The Main Quantization Formats
Q4_K_M — This is the one you want for most use cases. It reduces the 70B model from 140GB to ~35GB, a 75% reduction. The quality loss is real but modest — roughly 10-15% degradation on benchmarks, and often imperceptible in everyday use.
Q6_K — Higher quality, higher memory. Good choice if you have the VRAM headroom and want to minimize quality loss. The 8B Q6_K fits in 7GB, making it viable on a 12GB GPU with room to spare.
Q2_K — Extreme compression. Memory requirements drop dramatically, but quality takes a serious hit. Only worth considering if you're severely memory-constrained and understand the tradeoffs.
Q8_0 — Near-lossless quantization. Uses about half the memory of BF16 (8.5 bits per weight) yet is faster than FP16 on some hardware. Niche use case.
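Approximate bits per weight for these formats translate directly into file size. The figures below are rough averages I'm assuming from community GGUF sizes (K-quants mix precisions across tensors, so treat them as estimates, not exact values):

```python
# Rough average bits per weight; K-quants mix precisions across tensors
BPW = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5, "BF16": 16.0}

def weights_gb(params_billion: float, fmt: str) -> float:
    """Approximate size of the weights alone, excluding KV cache/activations."""
    return params_billion * BPW[fmt] / 8

for fmt in BPW:
    print(f"8B {fmt}: ~{weights_gb(8, fmt):.1f} GB")
```

The 8B numbers this prints (roughly 4.9, 6.6, 8.5, and 16.0 GB) match the VRAM table above once you add context overhead.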
Tools for Running Quantized Models
- llama.cpp — The workhorse. Runs on CPU and GPU, supports all major quantization formats, actively maintained. Start here.
- Ollama — Wraps llama.cpp in a cleaner interface. Great for getting started quickly, less flexibility for advanced configs.
- GPTQ — GPU-optimized quantization, integrates well with Hugging Face transformers.
- AWQ — Activation-aware quantization, generally produces better quality than GPTQ at the same bit width.
For most people: download a Q4_K_M GGUF from Hugging Face, run it with Ollama or llama.cpp, done.
Hardware Configurations by Budget
Budget Build: Under $1,000
Target: Llama 3 8B Q4_K_M at 40+ tokens/sec
- GPU: RTX 4060 Ti 16GB (~$450)
- RAM: 32GB DDR5 (~$100)
- Why it works: 16GB VRAM handles 8B Q4 with room for context, DDR5 keeps CPU offloading viable
This is the minimum setup where local inference actually feels good. The 16GB variant of the 4060 Ti is specifically worth the premium over the 8GB version — that extra VRAM headroom matters.
Mid-Range Build: ~$2,000
Target: Llama 3 8B BF16 or 70B Q4_K_M
- GPU: RTX 3090 24GB (~$800 used) or RTX 4070 Ti SUPER 16GB (~$800 new)
- RAM: 64GB DDR5 (~$180)
- Why it works: 24GB VRAM handles 8B at full precision comfortably; 64GB RAM enables CPU offloading for 70B
The RTX 3090 remains one of the best value propositions in the used GPU market specifically because of its 24GB VRAM. For local LLM work, VRAM capacity matters more than raw compute.
High-End Build: $10,000+
Target: Llama 3 70B Q4 or Q6 with fast inference
- GPUs: 2x RTX 6000 Ada (48GB VRAM each, 96GB total) or 2x RTX A6000 (Ampere, 48GB each)
- RAM: 256GB DDR5 ECC
- CPU: Threadripper PRO 7000 series
- Why it works: 96GB combined VRAM runs 70B at Q6 quality with headroom; Threadripper PRO supports massive RAM configs
At this level, you're looking at professional workstation territory. The dual RTX 6000 Ada setup delivers ~8 tokens/sec on 70B Q4 with significantly better sustained performance than consumer GPUs under load.
Optimization Techniques Worth Knowing
Layer Offloading
When your model doesn't quite fit in VRAM, you can offload some transformer layers to system RAM. llama.cpp's --n-gpu-layers flag controls this. Offloading 30 of 32 layers to GPU while keeping 2 in RAM is much faster than full CPU inference.
Experiment with this flag to find the maximum layers your VRAM can handle — you'll often find a sweet spot that's 80% as fast as full GPU inference at a fraction of the VRAM requirement.
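To ballpark a starting point before experimenting, divide your usable VRAM by the per-layer size. This is a rough sketch under two simplifying assumptions: layers are uniform in size, and a fixed chunk of VRAM is reserved for the KV cache and activations:

```python
def gpu_layers_that_fit(total_layers: int, model_gb: float,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate a starting value for --n-gpu-layers, reserving some VRAM
    for the KV cache and activations."""
    per_layer = model_gb / total_layers
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable / per_layer))

# 8B Q4_K_M (~5.5GB, 32 transformer layers) on a 6GB card:
print(gpu_layers_that_fit(32, 5.5, 6.0))  # → 26
```

Start there, then nudge the value up until you hit an out-of-memory error and back off by one or two.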
FlashAttention
If your GPU supports it (RTX 30- and 40-series, A100, H100), enabling flash attention can boost throughput by roughly 30% and reduce VRAM usage during inference. In llama.cpp, it's enabled with --flash-attn (short form -fa). It's a free performance upgrade; always enable it if your hardware supports it. (FlashAttention-3 specifically targets H100-class Hopper GPUs, but llama.cpp's implementation covers consumer cards too.)
Context Length and Memory
Longer context windows eat more memory. A 128K context window requires significantly more VRAM than 4K. If you're running tight on memory, reducing your context length (--ctx-size in llama.cpp) is an easy way to free up headroom. For most chat use cases, 8K-16K context is plenty.
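The KV cache is the culprit: its size scales linearly with context length. A sketch for Llama 3 8B, using its published attention config as defaults (32 layers, 8 KV heads via grouped-query attention, head dimension 128; treat these as assumptions if you're running a different model):

```python
def kv_cache_gib(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache size in GiB: one K and one V tensor per layer."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_len * per_token_bytes / 2**30

print(kv_cache_gib(8_192))    # → 1.0  (an 8K context costs ~1 GiB)
print(kv_cache_gib(131_072))  # → 16.0 (a 128K context costs ~16 GiB)
```

That 16x jump from 8K to 128K is why trimming --ctx-size frees up so much headroom.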
Docker Memory Limits
If you're running multiple models or hosting Ollama as a service, set explicit memory limits in Docker to prevent any single model from consuming all available resources. This is especially important on shared machines or when running inference alongside other workloads.
Quick Reference: Llama 3 Memory Requirements
| Model | BF16 VRAM | Q4 VRAM | CPU RAM (Q4) | Min GPU |
|---|---|---|---|---|
| 8B | 15-16GB | 5-6GB | 16GB | RTX 3060 12GB |
| 70B | 140GB+ | 35-40GB | 64GB | 2x RTX 3090 |
| 405B | ~800GB | ~200GB | 256GB+ | 8x H100 |
Bottom Line
For 95% of people, the answer is Llama 3 8B Q4_K_M.
It runs on a $450 GPU, delivers 40+ tokens/sec, and handles the vast majority of tasks you'd throw at a local LLM — coding assistance, writing, summarization, Q&A. The quality difference between 8B Q4 and 70B is real, but it's not the difference between useful and useless.
If you specifically need 70B capability — complex multi-step reasoning, nuanced analysis, tasks where 8B consistently falls short — budget for two used RTX 3090s and 64GB of DDR5. That's the cheapest path to 70B Q4 inference that actually works.
Avoid these mistakes:
- Buying a GPU with less than 12GB VRAM for any Llama 3 work
- Running 70B on CPU-only unless you have extreme patience and DDR5
- Chasing BF16 precision when Q4_K_M gets you 85-90% of the quality at a fraction of the memory cost
The local LLM ecosystem has matured to the point where this stuff actually works well. Pick the right model size for your hardware, use Q4_K_M quantization, and run it through Ollama or llama.cpp. You'll have a capable local AI assistant running in an afternoon.
Related Memory & Hardware Guides
If you're still choosing hardware, these are the best top-down guides on the site:
- Best Hardware for Claude-Distilled Models: a hardware shortlist for the newer GGUF distill wave.
- RTX 4090 vs RTX 4080 for LLMs: the cleanest breakdown if you're stuck deciding whether 16GB VRAM is enough.
- How to Run Llama on a Mac: for Apple Silicon buyers who want to compare unified memory against discrete GPUs.