Benchmarking the NVIDIA DGX Spark for Local LLMs — Part 1
With everyone rushing to buy Mac Studios to run models locally, I wanted to see what alternatives were out there. I landed on the NVIDIA DGX Spark: 128GB of LPDDR5X and an ARM processor. What I found appealing was that the same machine can also fine-tune my own models. That is possible on a Mac Studio or Mac Mini via Apple's MLX framework, but it locks you into one framework and usually takes longer.
I noticed during my research that every DGX Spark blog out there repeats NVIDIA's specs, but few actually run real-world tests. So after setting up the box in my home lab, which was very straightforward, I ran my first experiments. I wanted to capture angles that don't exist anywhere on r/LocalLLaMA yet.
Here are the five areas I tested.
1. Unified Memory Scaling
The Concept
As described in our Local LLM Hardware Guide: VRAM, Quantization, and What You Can Actually Run, VRAM is a make-or-break stat for running models locally. If the VRAM isn't sufficiently large, the model will either not fit or slow down significantly — this is known as hitting the "VRAM cliff."
What's Special
The DGX Spark has a single 128GB pool that both the GPU and CPU can tap into. That eliminates the VRAM cliff where GPU memory spills into RAM and drastically slows down the system. You'll still need to be mindful of the 128GB ceiling, but performance stays consistent right up until you hit it.
What We Measured
We ran four models across three sizes: two 32B models (QwQ and Qwen2.5 Coder), one 70B (DeepSeek R1), and one 122B (Qwen3.5). For each, we measured how fast tokens were generated (tok/s).
What We Found
As expected, speed dropped going from the 32B models to the 70B model. The surprise was that the 122B Qwen3.5 ran at 15 tok/s, well ahead of the 70B DeepSeek at 4 tok/s. But the main takeaway is that nothing crashed and nothing failed to load. Every model ran, including one you'd never expect to fit on a consumer GPU.
Note on Qwen3.5-122B: This is a Mixture-of-Experts (MoE) model — 122B total parameters, but only ~10B are active per token. That's why throughput is higher than the dense 32B models despite the larger memory footprint. It still requires ~81GB of RAM to load, which is why it only runs on hardware like the Spark — but the compute cost per token is closer to a 10B model.
| Model | Params | Mean tok/s | Min | Max |
|---|---|---|---|---|
| QwQ-32B | 32B | 8.51 | 5.73 | 9.90 |
| Qwen2.5-Coder-32B | 32B | 7.34 | 3.61 | 9.40 |
| DeepSeek-R1-70B | 70B | 4.11 | 3.21 | 4.59 |
| Qwen3.5-122B | 122B | 15.11 | 5.61 | 19.88 |
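These throughput numbers came from timing completions against Ollama's OpenAI-compatible endpoint. Here's a minimal sketch of one such measurement, assuming Ollama's default port and that the response carries OpenAI-style `usage` token counts (the model tag is illustrative):

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # assumed default Ollama endpoint

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: tokens produced per wall-clock second."""
    return n_tokens / elapsed_s

def benchmark(model: str, prompt: str) -> float:
    """Time one non-streaming completion and derive tok/s from its usage stats."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)
```

One caveat: wall-clock timing like this includes prompt processing, so it slightly understates pure decode speed; streaming the response and timing only the generated chunks would be more precise.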
2. Long Context Degradation
The Concept
Context is how much the model can hold in working memory — your inputs such as raw text or documents, plus the model's own responses. I'm sure you've been in a situation where you've been going back and forth with an LLM and it suddenly forgets something you already discussed. The context window is measured in tokens, where 1 token is roughly equivalent to ¾ of a word.
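With that ¾-of-a-word rule of thumb, you can ballpark how much text a given window holds. A rough sketch (the 0.75 ratio is a heuristic that varies by tokenizer and language):

```python
def tokens_to_words(n_tokens: int, words_per_token: float = 0.75) -> int:
    """Approximate word count a token budget covers (rule-of-thumb ratio)."""
    return round(n_tokens * words_per_token)

def words_to_tokens(n_words: int, words_per_token: float = 0.75) -> int:
    """Approximate tokens needed for a body of text."""
    return round(n_words / words_per_token)

print(tokens_to_words(64_000))  # → 48000: a 64K window holds roughly a short novel
```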
Why It Matters
For RAG (Retrieval Augmented Generation) use cases, the context window is extremely important — the model needs to hold large chunks of information in memory to answer questions about them. The larger the context, the more memory and compute the GPU requires.
What We Measured
DeepSeek-R1:70B was given an identical task with 3 different context sizes — 4K, 32K, and 64K tokens — and we measured how fast the response came back each time.
What We Found
As expected, speed dropped as context size increased:
- 4K context: 3.42 tok/s
- 32K context: 2.74 tok/s (−20%)
- 64K context: 2.28 tok/s (−33%)
The big takeaway: the DGX Spark handles book-length context at an acceptable speed. It doesn't fall off a cliff — it slows down gradually and gracefully.
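The percentage drops quoted above are relative to the 4K baseline; the arithmetic is simple enough to sketch:

```python
def slowdown_pct(baseline_tps: float, tps: float) -> float:
    """Percent throughput drop relative to a baseline tok/s figure."""
    return (baseline_tps - tps) / baseline_tps * 100

# DeepSeek-R1:70B figures from the test above
baseline = 3.42                              # 4K context
print(round(slowdown_pct(baseline, 2.74)))   # 32K context → 20
print(round(slowdown_pct(baseline, 2.28)))   # 64K context → 33
```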
3. Reasoning Overhead
The Concept
Reasoning models (like QwQ-32B or DeepSeek-R1) are designed to think through a response before providing an answer. You've probably seen the internal monologue these models generate before giving you an output. This is especially important for hard problems in math, science, or coding — but it obviously takes longer.
What We Measured
We compared two models equivalent in size but different in thinking style:
- QwQ-32B — the reasoning model
- Qwen2.5-Coder-32B — standard instruct model
What We Found
The reasoning model was only 6.4% slower — a surprisingly small gap. For daily use on a DGX Spark, reasoning models are fast enough that you won't feel penalized for using them.
- QwQ-32B (reasoning): 9.07 tok/s
- Qwen2.5-Coder-32B (instruct): 9.69 tok/s
Note: these speeds reflect isolated reasoning benchmark conditions and differ slightly from the unified memory scaling test above, which used different prompts and averaging windows.
4. Task-Type Throughput
The Concept
Some questions require more "thinking" from a model. Simple fact retrieval like "what is the capital of the USA?" requires less work than a complex coding task. We wanted to test how much the task type impacts the speed at which the GPU generates tokens.
Why It's Interesting
Some hardware slows down drastically as task complexity increases. We wanted to check if the DGX Spark holds up when you push it.
What We Measured
DeepSeek-R1:70B was given four different task types — factual, technical writing, coding, and math reasoning — and we measured tok/s for each.
What We Found
The DGX Spark stayed in a tight range regardless of task complexity — a great sign for a general-purpose machine. The factual task came in slightly lower at 3.72 tok/s, likely because the short answer gives us less data to average across. Technical writing, coding, and math were nearly indistinguishable.
| Task | Mean tok/s |
|---|---|
| Factual (short) | 3.72 |
| Technical writing | 4.54 |
| Coding | 4.54 |
| Math reasoning | 4.56 |
5. Memory Bandwidth Efficiency
The Concept
The DGX Spark has a theoretical maximum memory bandwidth of 273 GB/s — meaning it can move 273 gigabytes of data per second between memory and the processor. In practice, software never fully maxes out the hardware. Overhead from loading instructions, managing memory, and waiting for operations to sync always eats into that ceiling.
What We Measured
To generate each token, the GPU has to read the model's weights from memory roughly once. So if a 42GB model runs at 4 tok/s, it's reading 42GB × 4 = 168 GB of data per second. Divide that by the theoretical max of 273 GB/s and you get 61.5% utilization.
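That back-of-the-envelope estimate is easy to reproduce. A sketch using the worked example (model size here is the on-disk quantized weight file, not the raw parameter count):

```python
PEAK_BW_GBS = 273.0  # DGX Spark theoretical memory bandwidth

def effective_bandwidth(model_gb: float, tok_s: float) -> float:
    """GB/s read from memory, assuming every token touches all weights once."""
    return model_gb * tok_s

def utilization_pct(model_gb: float, tok_s: float, peak: float = PEAK_BW_GBS) -> float:
    """Share of the theoretical bandwidth ceiling actually sustained."""
    return effective_bandwidth(model_gb, tok_s) / peak * 100

print(round(utilization_pct(42.0, 4.0), 1))  # the 42GB / 4 tok/s example → 61.5
```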
What We Found
The DGX Spark is using 50–63% of its memory bandwidth for LLM inference — which is exceptional for this type of workload.
| Model | Size | tok/s | Effective BW | GB10 Utilization |
|---|---|---|---|---|
| QwQ-32B | 19.2 GB | 8.51 | 163.4 GB/s | 59.9% |
| Qwen2.5-Coder-32B | 19.2 GB | 7.34 | 140.9 GB/s | 51.6% |
| DeepSeek-R1-70B | 42.0 GB | 4.11 | 172.6 GB/s | 63.2% |
Qwen3.5-122B excluded — post-quantization model size is too difficult to estimate reliably at this scale, which would make the derived bandwidth figure misleading.
All benchmarks were measured first-party on a DGX Spark GB10 (128GB LPDDR5X) via Ollama's OpenAI-compatible API. Each condition was run 3× and averaged.