How to Run Llama on Mac Apple Silicon: The Complete Setup Guide

TL;DR
- Apple Silicon's unified memory architecture makes Macs genuinely great for running local LLMs — no discrete GPU required
- You need at least 16GB of RAM for 8B models and roughly 40GB for 70B models at 4-bit quantization, with 64GB recommended for headroom
- Ollama is the fastest way to get started; llama.cpp gives you the most control and best raw performance
- An M2 MacBook Air 16GB ($1,299) is the sweet spot for most people running local AI


Running a large language model locally on your Mac used to feel like a science project. Compile this, patch that, pray to the Metal gods. In 2025, it's genuinely straightforward — and Apple Silicon hardware has become one of the best platforms for local LLM inference you can buy at any price point.

This guide walks you through exactly how to run Llama on Mac Apple Silicon, from picking the right hardware to getting your first model responding in under 10 minutes. Whether you're on an M2 MacBook Air or an M3 Max MacBook Pro, there's a setup here that works for you.


Why Apple Silicon Is Surprisingly Good at This

Before we get into setup, it's worth understanding why this works so well. Most people assume you need a beefy NVIDIA GPU to run LLMs locally. That's true if you're training models — but for inference (actually using a model to generate text), the bottleneck is memory bandwidth, not raw compute.

Here's the key insight: Apple Silicon uses unified memory architecture. The CPU and GPU share the same memory pool. On a discrete GPU setup, you're limited to whatever VRAM is soldered onto the card — typically 8-24GB on consumer hardware. On a Mac Studio with 64GB or more of unified memory, the GPU can access that full pool. That's why you can run a 70B parameter model on a Mac that costs less than a single high-end GPU.


The numbers back this up. Metal acceleration on Apple Silicon delivers 3-5x faster inference than CPU-only Intel Macs, and the M-series chips handle sustained workloads without thermal throttling, especially the Mac Studio and Mac mini, whose active cooling gives them more headroom than the fanless MacBook Air chassis.


Real-world benchmarks:
- M1 Max (64GB): ~33 tokens/sec on Llama 3.1 8B via Core ML
- M2 Pro (16GB): ~27 tokens/sec on Llama 3 8B Q4 via Jan.ai
- M3 Max (128GB): 30-40 tokens/sec on 70B models at 4-bit quantization

Those are usable speeds. 27 tokens/sec is already faster than most people can read; 33+ tokens/sec feels instant.
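To put those numbers in context, here's a back-of-envelope conversion from tokens/sec to words-per-minute. The 0.75 words-per-token figure is a rough rule of thumb for English text, not something measured in these benchmarks:

```shell
# Convert a tokens/sec figure to approximate words-per-minute.
# Assumes ~0.75 English words per token (a rough heuristic).
tok_to_wpm() {
  awk -v t="$1" 'BEGIN { printf "%.0f\n", t * 0.75 * 60 }'
}

tok_to_wpm 27   # roughly 1215 wpm, several times typical reading speed (~250 wpm)
tok_to_wpm 33   # roughly 1485 wpm
```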


Hardware Requirements: What You Actually Need

Let me be direct: 8GB of RAM is not enough. If you have a base M1 or M2 Mac with 8GB, you're limited to tiny 3B models that aren't particularly useful. Don't waste your time trying to squeeze a 7B model into 8GB — the constant memory swapping will make it painfully slow.

Here's the honest breakdown:

Entry-Level: 16GB RAM

Best for: Llama 3 8B, Mistral 7B, Phi-3 Mini
Devices: M2/M3 MacBook Air 16GB, M2/M3 Mac mini 16GB
Experience: Solid. 8B models at 4-bit quantization run comfortably with room to spare. This is the minimum I'd recommend for anyone serious about local AI.

Mid-Range: 32-48GB RAM

Best for: Llama 2 13B, Qwen2.5 14B, Code Llama 34B
Devices: M2/M3 Max MacBook Pro, Mac Studio M2 Max
Experience: Excellent. You can run 13B models at full quality and experiment with larger quantized models.

High-End: 64GB+ RAM

Best for: Llama 3 70B, Mixtral 8x22B, full-precision 13B models
Devices: M2/M3 Ultra Mac Studio, Mac Pro
Experience: This is where local LLMs become genuinely competitive with cloud APIs. A 70B model at 4-bit quantization needs ~40GB, which fits comfortably in 64GB.

Quick note on quantization: 4-bit quantized models reduce RAM requirements by roughly 60%. A 70B model that would normally need 128GB+ of memory runs in ~40-64GB at Q4 precision. You lose a small amount of quality, but the tradeoff is almost always worth it.
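The memory math behind that rule of thumb is simple: weight memory is roughly parameter count times bits per weight, divided by 8. This sketch ignores KV cache and runtime overhead (which add a few more GB), and the 4.8 bits/weight for Q4_K_M is an approximate effective rate, not an exact figure:

```shell
# Rough weight-memory estimate in GB: params (billions) * bits / 8.
# KV cache and runtime overhead (a few extra GB) are not included.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 70 4     # 70B at 4-bit: ~35 GB of weights
estimate_gb 8 4.8    # 8B at Q4_K_M (~4.8 bits/weight effective): ~4.8 GB
estimate_gb 70 16    # 70B at FP16: 140 GB, hence the need to quantize
```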


Method 1: Ollama (Easiest Way to Start)

Ollama is the easiest way to run Llama on Mac Apple Silicon. It handles model downloads, Metal acceleration, and serving, all through a clean command-line interface. Start here unless you have a specific reason not to.

Installation

brew install ollama

Or download the DMG directly from ollama.com if you prefer a GUI installer. The DMG version adds a menu bar icon, which is handy.

Running Your First Model

# Download and run Llama 3.1 8B
ollama pull llama3.1:8b
ollama run llama3.1:8b

That's it. Ollama automatically detects Apple Silicon and enables Metal acceleration. The first run downloads the model (around 4.7GB for the 8B Q4 version), then drops you into an interactive chat.

Useful Ollama Commands

# List available models
ollama list

# Run a specific quantization
ollama run llama3.1:8b-instruct-q4_K_M

# Run as an API server (useful for integrations)
ollama serve

# Check what's running
ollama ps

Integrations Worth Knowing

Once Ollama is running as a server (ollama serve), it exposes a local API at http://localhost:11434. This means you can point chat front-ends like Open WebUI at it, wire it into editor extensions such as Continue, or call it from your own scripts via its OpenAI-compatible /v1 endpoints.
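As a sketch of what a script-side call looks like, the snippet below builds a request body for Ollama's /api/generate endpoint. The curl line is shown as a comment because it only works with the server actually running; the model name is whichever model you've pulled:

```shell
# Build a JSON request body for Ollama's /api/generate endpoint.
# "stream": false returns one JSON object instead of a token stream.
payload='{"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": false}'

# With `ollama serve` running, this prints the completion:
#   curl -s http://localhost:11434/api/generate -d "$payload"
echo "$payload"
```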

Ollama is the right default choice. It's actively maintained, the model library is extensive, and the Metal support is solid out of the box.


Method 2: llama.cpp (Best Raw Performance)

If you want maximum performance and control, llama.cpp is the tool. It's more involved to set up, but it gives you direct access to Metal acceleration flags, quantization options, and fine-grained memory controls.

Building with Metal Support

# Clone the repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build with CMake (Metal is enabled by default on Apple Silicon)
cmake -B build
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

Current llama.cpp uses CMake; the old LLAMA_METAL=1 make path has been removed. On Apple Silicon, Metal support (GGML_METAL) is on by default, so a plain build already targets the GPU. The binaries land in build/bin.

Getting GGUF Models

llama.cpp uses the GGUF format. The best source is Hugging Face — search for "GGUF" plus your model name. bartowski is a reliable and active uploader; TheBloke's older quantizations are no longer updated but remain widely used.

For Llama 3 8B, you want something like Meta-Llama-3-8B-Instruct-Q4_K_M.gguf. The Q4_K_M suffix means 4-bit quantization with medium quality — the best balance of size and performance for most use cases.

Running a Model

./build/bin/llama-cli \
  -m models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -n 512 \
  --temp 0.7 \
  -ngl 99

The -ngl 99 flag offloads up to 99 layers to the GPU (Metal); Llama 3 8B has 32 layers, so in practice that means all of them. This is what makes it fast. If you run into memory issues, reduce the number to offload fewer layers.
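If you do hit memory limits, one simple heuristic is to offload a fraction of layers proportional to how much of the model fits in free memory. pick_ngl below is a hypothetical helper for illustration, not a llama.cpp feature:

```shell
# Heuristic layer-offload picker (hypothetical helper, not part of
# llama.cpp): if the whole model fits in free memory, offload every
# layer; otherwise offload a proportional fraction.
pick_ngl() {
  awk -v l="$1" -v m="$2" -v f="$3" 'BEGIN {
    if (f >= m) { print l; exit }
    n = int(l * f / m)
    if (n < 0) n = 0
    print n
  }'
}

pick_ngl 32 4.7 16   # 8B Q4 fits in 16 GB free: all 32 layers
pick_ngl 80 40 24    # 70B Q4 with only 24 GB free: 48 of 80 layers
```

Pass the result to -ngl, then watch GPU History in Activity Monitor to confirm the offload is actually working.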

When to Choose llama.cpp Over Ollama

Reach for llama.cpp when you need a quantization that isn't in Ollama's library, want day-one support for brand-new model architectures, or want fine-grained control over sampling, context, and GPU offload flags. If you just want a model running with minimal fuss, stay with Ollama.

Method 3: LM Studio (Best for Non-Technical Users)

If command lines aren't your thing, LM Studio is a polished desktop app that wraps llama.cpp in a GUI. You get a model browser, drag-and-drop loading, and a chat interface that looks like a proper application.

Download it from lmstudio.ai, install it like any Mac app, search for models in the built-in browser, and click download. Metal acceleration is enabled automatically on Apple Silicon.

LM Studio is genuinely good software. The main tradeoff is that it's slightly less flexible than direct llama.cpp access and adds a layer of abstraction you can't always see through. But for someone who just wants to run local models without touching a terminal, it's the right choice.


Performance Optimization: Getting More Speed

Once you have a model running, here are the tweaks that actually matter.

Confirm Metal Is Active (Ollama)

Ollama enables Metal automatically on Apple Silicon; there is no flag you need to set. To confirm a loaded model is actually on the GPU, check the PROCESSOR column of ollama ps, which should read 100% GPU:

ollama ps

If it shows a CPU percentage instead, the model didn't fit in available memory and some layers spilled to the CPU.

Choose the Right Quantization

Not all quantizations are equal. Here's a practical guide:

Quantization | RAM Usage (8B) | Quality | Use When
------------ | -------------- | ------- | --------
Q2_K | Lowest | Noticeably worse | Tight on RAM
Q4_K_M | ~4.7GB | Very good | Default choice
Q5_K_M | ~5.7GB | Excellent | Have RAM to spare
Q8_0 | ~8.5GB | Near-perfect | Max quality

Q4_K_M is the sweet spot for most people. The quality difference between Q4 and Q8 is small; the RAM difference is significant.

Free Up Memory

This sounds obvious, but it matters. Before running large models, quit memory-hungry apps (browsers with dozens of tabs, Docker Desktop, virtual machines) and check Activity Monitor's Memory tab to make sure memory pressure is green.

On a 16GB machine, freeing 4-6GB of RAM can be the difference between a model running smoothly and constant swapping.
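One way to check your headroom before loading a model is to parse vm_stat, which reports macOS memory in pages. The function below reads vm_stat output from stdin (demonstrated against a trimmed sample so the arithmetic is visible); counting free plus inactive pages as "available" is a rough heuristic, not an exact measure:

```shell
# Estimate available memory in GB from vm_stat output on stdin.
# Treats free + inactive pages as reclaimable (a rough heuristic).
avail_gb() {
  awk '
    /page size of/   { pagesize = $8 }
    /Pages free/     { free = $3 }
    /Pages inactive/ { inactive = $3 }
    END {
      gsub(/\./, "", free); gsub(/\./, "", inactive)
      printf "%.1f\n", (free + inactive) * pagesize / 1e9
    }'
}

# Trimmed sample of vm_stat output, for illustration:
sample='Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              100000.
Pages inactive:                           50000.'

echo "$sample" | avail_gb   # -> 2.5
# On a real Mac: vm_stat | avail_gb
```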

Context Length

Longer context = more RAM. If you're running tight on memory, reduce the context window:

# In llama.cpp
./llama-cli -m model.gguf -c 2048  # Instead of default 4096

In Ollama, you can set this in a Modelfile.
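A minimal Modelfile for that looks like the sketch below. num_ctx is Ollama's context-length parameter; the FROM tag and the created model name are just examples, so substitute whatever base model you've pulled:

```shell
# Write a Modelfile that pins the context window to 2048 tokens.
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 2048
EOF

cat Modelfile
# Build and run the reduced-context variant (requires Ollama installed):
#   ollama create llama3.1-short-ctx -f Modelfile
#   ollama run llama3.1-short-ctx
```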


Troubleshooting Common Issues

Model Won't Load / Out of Memory

First, check how much RAM the model actually needs versus what you have free. Run ollama ps to see active models and their memory usage. If you're running multiple models, stop the ones you're not using.

Metal Errors on macOS Sonoma

Some users hit Metal-related crashes on Sonoma. Check for Ollama updates first, since most Sonoma-specific issues have been patched in recent releases. As a stopgap, you can fall back to CPU-only inference by setting the num_gpu parameter to 0: type /set parameter num_gpu 0 in an interactive session, or add PARAMETER num_gpu 0 to a Modelfile. Slower, but stable.

Slow Download / Failed Model Downloads

Llama models are large. A 70B Q4 model is ~40GB. If downloads are failing, check that you have enough free disk space (Ollama stores models under ~/.ollama/models by default), retry the pull, since ollama pull resumes partial downloads, and prefer a stable or wired connection for the biggest files.

Verify Metal Is Actually Being Used

system_profiler SPDisplaysDataType

This shows your GPU details. You can also watch GPU utilization in Activity Monitor (Window → GPU History) while running inference. If the GPU graph isn't moving, Metal isn't engaged.


Model Storage: Don't Overlook This

A collection of GGUF models adds up fast. A few 70B models can easily consume 150GB+. Your Mac's internal SSD is fast enough for model loading, but if you're building a serious local AI setup, an external NVMe SSD is worth considering.

The Samsung T7 Shield 4TB is a solid choice — fast enough for model loading (1,050 MB/s read), durable, and reasonably priced. Store your model library there and point llama.cpp or LM Studio at the external drive.

For sustained inference workloads on a MacBook, a cooling pad can help maintain consistent performance. The MacBook Air in particular has no fan, and while it handles inference well, it can throttle slightly under very long continuous sessions. A passive cooling pad helps dissipate heat.


Bottom Line: What Should You Actually Do?

Here's my direct recommendation based on your situation:

If you're just getting started: Install Ollama, run ollama pull llama3.1:8b, and start chatting. You'll be up and running in 10 minutes. Upgrade to llama.cpp later if you need more control.

If you want maximum performance: Build llama.cpp with Metal support and download Q4_K_M GGUF models from Hugging Face. The extra setup time pays off in speed and flexibility.

If you hate terminals: LM Studio. No shame in it — it's good software.

Hardware recommendations:
- Best value: M2 MacBook Air 16GB (~$1,299) — handles 8B models well, portable, great battery life
- Prosumer pick: Mac Studio with 64GB+ unified memory — runs 70B models comfortably, avoids laptop thermal limits, and makes a strong desk setup for sustained local inference
- Hard pass: Any Mac with 8GB RAM for LLM work — save yourself the frustration

The local AI experience on Apple Silicon has crossed a threshold. It's not a compromise anymore. A 70B model running at 30-40 tokens/sec on a Mac Studio is genuinely useful for real work — coding assistance, document analysis, private conversations you don't want going to a cloud API. The hardware is there. The software is mature. There's no reason to wait.

Part of our Apple Silicon Guide

- Mac for Local LLMs: Complete Apple Silicon Guide →
- How Much RAM Do You Need to Run Llama 3? → Use this next if you want exact 8B and 70B memory math before buying a Mac.
- RTX 4090 vs Mac Studio M4 Max 128GB → For readers deciding between Apple Silicon convenience and a desktop GPU build.
- Mac Studio M4 Max: 64GB vs 128GB for LLMs → The best follow-up if you’re stuck on whether unified memory upgrades are worth it.
- Best Hardware for Claude-Distilled Models → A good bridge if you want model-specific hardware advice instead of general Mac setup help.

Not ready to buy hardware? Try RunPod for instant access to powerful GPUs. →