Jun 30, 2026

The Best gemma4 12B Quant for a 16GB Blackwell GPU

Benchmarking gemma4 12B quants on the cheapest 16GB Blackwell card. QAT beats Q4_K_M outright, and gemma4's FP4, FP8, and MTP speedups are all macOS-only on Ollama.

The RTX 5060 Ti 16GB is the cheapest, slowest 16GB card in the 50-series, and that makes it the most interesting one to benchmark. If a model runs well here, it runs well on every card above it. So when gemma4 12B landed, I ran every 12B variant Ollama offers through the same harness to answer one question: which one should you actually pull onto a 16GB Blackwell GPU?

The answer is not the obvious one. The quantization-aware training (QAT) build is faster, smaller, and higher quality than the plain Q4 most people default to. And the variants that should shine on Blackwell, the native FP4 and FP8 tags, do not run on Linux through Ollama at all.

Test setup

Everything ran on the same machine and the same Ollama build, so the variant is the only thing that changes.

GPU: NVIDIA GeForce RTX 5060 Ti 16GB, driver 610.47
Software: Ollama 0.30.6 in Docker, on WSL2
Server flags: flash attention, q8_0 KV cache, keep-alive set to never unload
Benchmark: num_ctx=8192, num_batch=1024, warmup plus 10 iterations over a fixed prompt set
Mode: --no-think for the cross-variant comparison

Throughput is generation tokens per second, averaged across 10 iterations with the warmup discarded. Resident VRAM comes from ollama ps with the model loaded. The harness is on GitHub if you want to reproduce any of this.
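The averaging described above can be sketched in a few lines. This is a minimal illustration, not the harness itself: it assumes each run is the final JSON object returned by Ollama's generate endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), and the sample numbers are made up. The function names are hypothetical.

```python
from statistics import mean

NANOSECONDS_PER_SECOND = 1e9


def tokens_per_second(response: dict) -> float:
    """Generation throughput for a single response."""
    return response["eval_count"] / response["eval_duration"] * NANOSECONDS_PER_SECOND


def mean_throughput(responses: list, warmup: int = 1) -> float:
    """Average tokens/sec over the measured runs, discarding the warmup runs."""
    measured = responses[warmup:]
    if not measured:
        raise ValueError("need at least one run after the warmup")
    return mean(tokens_per_second(r) for r in measured)


# Made-up sample data: one warmup run followed by two measured runs.
runs = [
    {"eval_count": 200, "eval_duration": 8_000_000_000},   # warmup, discarded
    {"eval_count": 400, "eval_duration": 10_000_000_000},  # 40 tokens/sec
    {"eval_count": 380, "eval_duration": 10_000_000_000},  # 38 tokens/sec
]
print(f"{mean_throughput(runs):.1f} tokens/sec")
```

Averaging per-run rates is one reasonable choice; dividing total tokens by total time is another, and the two can differ slightly when response lengths vary.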

One gotcha worth flagging: gemma4 is a thinking model, and you have to pass an explicit --think or --no-think. If you run the bare default, the think phase folds into time-to-first-token, thinking tokens read as zero, and the reported throughput inflates to physically impossible numbers. Always set the flag.

Results

Variant	Disk	Resident VRAM	Tokens/sec	vs Q4_K_M	Quality	Runs on Linux?
`12b-it-qat`	7.2 GB	7.7 GB	39.5	+5%	Q4 size, QAT-recovered	Yes
`12b-it-q4_K_M`	7.6 GB	8.1 GB	37.7	baseline	naive Q4	Yes
`12b-it-q8_0`	12 GB	13 GB	25.7	-32%	near-bf16	Yes
`12b-nvfp4`	10 GB	n/a	n/a	n/a	Blackwell FP4	No
`12b-mxfp8`	12 GB	n/a	n/a	n/a	FP8	No

Four findings fall out of this.

QAT is a free lunch

The headline result is that 12b-it-qat beats plain 12b-it-q4_K_M on every axis at once. It is faster (39.5 vs 37.7 tokens/sec), smaller in VRAM (7.7 vs 8.1 GB), and higher quality. The quality gap is the point of QAT: quantization-aware training fine-tunes the model with the rounding error in the loop, so the 4-bit weights recover most of the accuracy that naive post-training quantization throws away. You get a better model that also happens to run faster.

There is no reason to pull plain q4_K_M on this hardware. Make qat the default.

Q8 fits, and it stays on the GPU

The q8_0 build is the quality-max option, near bf16 fidelity at half the size of full precision. The open question on a 16GB card is whether it fits without spilling to the CPU. It does. At 13 GB resident, with the 8K q8_0 KV cache, it stays 100% on the GPU with zero offload.

The cost is throughput. At 25.7 tokens/sec it runs about a third slower than qat. That tradeoff is clean: pick q8_0 when output quality matters more than speed, and qat for everything interactive. The one caveat is headroom. At 13 GB resident there is not much room left for a larger context window before the KV cache starts to spill.
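The percentage quoted for q8_0 depends on which build you treat as the baseline: the table measures against q4_K_M, while the text above compares against qat. A quick sketch of that arithmetic using the measured throughputs; the helper name `relative_change` is only for illustration.

```python
def relative_change(value: float, baseline: float) -> float:
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100


# Measured tokens/sec on the 5060 Ti, from the results table.
tps = {"qat": 39.5, "q4_K_M": 37.7, "q8_0": 25.7}

print(f"qat vs q4_K_M:  {relative_change(tps['qat'], tps['q4_K_M']):+.0f}%")
print(f"q8_0 vs q4_K_M: {relative_change(tps['q8_0'], tps['q4_K_M']):+.0f}%")
print(f"q8_0 vs qat:    {relative_change(tps['q8_0'], tps['qat']):+.0f}%")
```

This prints +5%, -32%, and -35% respectively, so "about a third slower than qat" and the table's -32% are the same measurement seen from two baselines.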

Notice that throughput tracks weight size closely. 7.7 GB resident gives 39.5 tokens/sec, 13 GB gives 25.7: about 1.7x the bytes for about 1.5x the slowdown. That roughly inverse-proportional relationship is the signature of a memory-bandwidth-bound workload. The 5060 Ti is not compute limited on a 12B dense model; it is waiting on VRAM. Smaller weights mean fewer bytes to stream per token, which is exactly why the well-made 4-bit quant wins.
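One way to sanity-check the bandwidth-bound reading is a back-of-the-envelope ceiling: if every generated token has to stream all the weights once, throughput cannot exceed memory bandwidth divided by weight size. The sketch below assumes the 5060 Ti's nominal ~448 GB/s and uses resident VRAM as a stand-in for bytes read per token. That stand-in overstates the weights somewhat, since resident memory also includes the KV cache and runtime overhead, so treat the output as rough.

```python
def bandwidth_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_5060_TI_GB_S = 448.0  # nominal spec, an assumption about real-world peak

# variant -> (resident VRAM in GB, measured tokens/sec), from the results table
measured = {
    "12b-it-qat": (7.7, 39.5),
    "12b-it-q4_K_M": (8.1, 37.7),
    "12b-it-q8_0": (13.0, 25.7),
}

for variant, (resident_gb, tps) in measured.items():
    ceiling = bandwidth_ceiling_tps(BANDWIDTH_5060_TI_GB_S, resident_gb)
    print(f"{variant}: ceiling {ceiling:.1f} tok/s, "
          f"measured {tps} ({tps / ceiling:.0%} of ceiling)")
```

All three variants land between roughly two-thirds and three-quarters of this crude ceiling, which is what you would expect if the same memory bottleneck governs each of them.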

The Blackwell FP4 promise does not pay off

The 5060 Ti is native Blackwell silicon with hardware FP4 support, and Ollama ships nvfp4 and mxfp8 tags that should map straight onto it. They do not run. The pull fails immediately:

Error: pull model manifest: 412: this model requires macOS

Both tags are gated to macOS because they ship kernels for MLX, Apple’s machine learning framework, rather than for CUDA. So despite owning the exact hardware these formats were designed for, you cannot use Ollama’s FP4 path on Linux. Actually using Blackwell FP4 means stepping outside Ollama to vLLM or TensorRT-LLM with an NVFP4 checkpoint. The spec sheet and the software stack disagree.
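If you script pulls across several tags, it helps to detect this failure and skip the tag rather than abort the whole sweep. A rough sketch, assuming the `ollama` CLI is on your PATH and that gated tags keep failing with the message quoted above; the helper names are hypothetical and this is not how the harness does it.

```python
import subprocess


def is_macos_gated(output: str) -> bool:
    """True if pull output contains the macOS-only message quoted above."""
    return "requires macOS" in output


def try_pull(tag: str) -> bool:
    """Run `ollama pull <tag>`; return False if the tag is macOS-gated."""
    result = subprocess.run(
        ["ollama", "pull", tag], capture_output=True, text=True
    )
    if result.returncode != 0 and is_macos_gated(result.stdout + result.stderr):
        return False
    result.check_returncode()  # surface any other failure as an exception
    return True
```

A sweep could then call `try_pull` for each tag and record the gated ones as "n/a" instead of stopping.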

The MTP speedup is macOS-only too

The number formats are not the only thing locked to macOS. Gemma 4 also ships Multi-Token Prediction, a built-in speculative decoding path that Google advertises as 2 to 3x faster with no loss in quality. Ollama supports it through a DRAFT directive in the Modelfile, so I tried to wire it up on the 5060 Ti, pairing the 12B target with Google’s own MTP drafter. It refuses at create time:

Error: MTP draft safetensors require a qwen3.5 base model, got "gemma4"

The CUDA path only accepts a qwen3.5 base. Gemma 4’s MTP implementation lives entirely in Ollama’s MLX backend, Apple again. The speedup is real, but on Linux you cannot point it at gemma4 at all, at any size. That makes three acceleration paths in a row, after FP4 and FP8, that gemma4 advertises on paper and gates to macOS in practice. To run MTP on this card you would have to switch model families to qwen3.5, or leave Ollama for vLLM or Transformers.

How this scales

I ran the same three variants on an RTX 5090 to see whether the ranking survives on a much faster card. It does, and the bandwidth story gets more interesting on the way up.

Variant	5090 tokens/sec	5060 Ti tokens/sec	5090 / 5060 Ti
`12b-it-qat`	105.9	39.5	2.68x
`12b-it-q4_K_M`	102.3	37.7	2.71x
`12b-it-q8_0`	77.2	25.7	3.00x

qat wins here too. It is the fastest and smallest variant on the 5090 as well, so nothing about the recommendation changes with the hardware.

What changes is q8_0. On the 5060 Ti it cost 35% of throughput against qat. On the 5090 the same jump from roughly 8 to 13 GB of weights costs only 27%. If the 5090 were purely bandwidth bound, streaming about 70% more weight per token would drop it to around 62 tokens/sec. It holds 77.2 instead. The bigger card has bandwidth to spare on a model this size, so the heavier quant stops hurting as much.

The ratio column shows the same thing from another angle. On the 4-bit quants the 5090 is about 2.7x the 5060 Ti, but on q8_0 the gap widens to 3.0x. The 5090 pulls further ahead exactly when the model gets heavy enough to lean on its memory system. On the light quants some of that bandwidth advantage simply goes unused.
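The bandwidth-bound prediction can be made explicit. Under a strict bandwidth limit, throughput scales inversely with the bytes streamed per token, so the q8_0 figure on each card can be predicted from its qat figure. The sketch uses the resident sizes measured on the 5060 Ti (7.7 GB and 13 GB) as the weight ratio for both cards, which is an approximation.

```python
QAT_GB = 7.7   # resident VRAM for 12b-it-qat on the 5060 Ti
Q8_GB = 13.0   # resident VRAM for 12b-it-q8_0 on the 5060 Ti


def predicted_q8_tps(qat_tps: float) -> float:
    """q8_0 tokens/sec if throughput were strictly inverse to weight size."""
    return qat_tps * QAT_GB / Q8_GB


# card -> (measured qat tokens/sec, measured q8_0 tokens/sec)
cards = {
    "RTX 5060 Ti": (39.5, 25.7),
    "RTX 5090": (105.9, 77.2),
}

for card, (qat_tps, q8_tps) in cards.items():
    predicted = predicted_q8_tps(qat_tps)
    penalty = 1 - q8_tps / qat_tps
    print(f"{card}: q8_0 penalty {penalty:.0%}, "
          f"predicted {predicted:.1f} tok/s, measured {q8_tps} "
          f"({q8_tps / predicted - 1:+.0%} vs prediction)")
```

On these numbers the 5060 Ti lands about 10% above the strict prediction and the 5090 about 23% above it, which is the widening slack described above.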

So the bandwidth-bound rule from the results above looks like a small-card rule. The 5060 Ti is memory limited on every 12B variant, which is why a well-made 4-bit quant wins so cleanly on it. Move up to a 5090 and the constraint loosens. VRAM stops being scarce, the bandwidth ceiling sits out of reach for a model this size, and q8_0 becomes an easy call: near-bf16 quality for a 27% tax, with around 19 GB of headroom left over. The variant you pick still matters, but on a big card it is a different decision than on a small one.

But it is not really about the size of the card. Drop the same qat build onto a GTX 1080 Ti, an 11 GB Pascal card from 2017, and it runs at 23.7 tokens/sec.

Card	Memory bandwidth	`qat` tokens/sec
GTX 1080 Ti (Pascal)	~484 GB/s	23.7
RTX 5060 Ti 16GB (Blackwell)	~448 GB/s	39.5
RTX 5090 (Blackwell)	~1792 GB/s	105.9

The 1080 Ti has more memory bandwidth than the 5060 Ti and still runs 40% slower on byte-identical weights. If the workload were bandwidth bound everywhere, that extra bandwidth would predict about 43 tokens/sec. It does 23.7. Pascal has no tensor cores and no fast 4-bit path, so the per-token dequant and matmul that Blackwell hides behind its memory ceiling become the actual ceiling here. The card is compute bound, not bandwidth bound.
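The 43 tokens/sec figure comes from scaling the 5060 Ti result by the ratio of memory bandwidths. A small sketch of that arithmetic, using the nominal bandwidths from the table and treating the 5060 Ti as the bandwidth-bound reference point, which is a working assumption rather than something measured directly:

```python
REFERENCE_BANDWIDTH_GB_S = 448.0  # RTX 5060 Ti, nominal
REFERENCE_TPS = 39.5              # measured qat tokens/sec on the 5060 Ti


def bandwidth_scaled_tps(bandwidth_gb_s: float) -> float:
    """qat tokens/sec predicted if throughput scaled linearly with bandwidth."""
    return REFERENCE_TPS * bandwidth_gb_s / REFERENCE_BANDWIDTH_GB_S


# card -> (nominal bandwidth in GB/s, measured qat tokens/sec)
cards = {
    "GTX 1080 Ti": (484.0, 23.7),
    "RTX 5090": (1792.0, 105.9),
}

for card, (bandwidth, tps) in cards.items():
    predicted = bandwidth_scaled_tps(bandwidth)
    print(f"{card}: predicted {predicted:.1f} tok/s, "
          f"measured {tps} ({tps / predicted:.0%} of prediction)")
```

The 1080 Ti reaches only about 56% of what its bandwidth alone would predict. The 5090 also falls short of pure bandwidth scaling, at about 67% of the predicted 158 tokens/sec, which fits the observation that it has bandwidth to spare on a model this size.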

So the rule is about the architecture, not the size or price of the card. The 5060 Ti is bandwidth bound, which is why the small 4-bit quant wins so cleanly. The 5090 has bandwidth to spare. The 1080 Ti is compute bound, where a smaller quant means fewer bytes to stream but no less math to do. Pick the quant for the bottleneck the card actually has.

Bottom line

For a 16GB Blackwell GPU, in priority order:

gemma4:12b-it-qat is the default. Fastest and highest quality at a 4-bit footprint.
gemma4:12b-it-q8_0 when you want maximum fidelity and can spend about a third of your throughput to get it.
Skip the FP4 and FP8 tags, and do not expect MTP either. None of gemma4’s accelerated paths run on Linux through Ollama, hardware support notwithstanding.

The lesson generalizes past gemma4. When you are bandwidth bound, a well-trained smaller quant beats a naive larger one on speed and quality simultaneously, and the acceleration features a model advertises are only as useful as the kernels that ship for your platform.

The benchmark harness, the Modelfiles, and every raw result are in bschwabQ/ollama-load-test. Clone it and run the same sweep on your own card.