Benchmark methodology
How we test every GPU in the index. Reproducible, transparent, no marketing numbers.
How we test
Every GPU in the index is tested with the same model, the same parameters, and the same tooling. We run Qwen3 32B at three quantization levels (Q4_K_M, Q8_0, FP16) and report the median decode speed, in tokens per second, across 5 independent runs. No cherry-picking, no warm caches, no synthetic loads.
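In outline, the harness looks like the sketch below. The runner is a stub standing in for the real llama.cpp invocation (see Software), and the model filenames are illustrative, not our exact paths.

```python
import statistics

# Illustrative quant-level -> GGUF filename mapping (not our exact paths).
MODELS = {
    "Q4_K_M": "qwen3-32b-q4_k_m.gguf",
    "Q8_0": "qwen3-32b-q8_0.gguf",
    "FP16": "qwen3-32b-f16.gguf",
}
RUNS = 5  # the first run is a cold start and is kept, not discarded

def decode_speed(model_path: str) -> float:
    """Stub for one independent benchmark run; returns decode tokens/sec.
    The real measurement shells out to llama.cpp (see Software)."""
    raise NotImplementedError

def score(model_path: str) -> float:
    # Median of 5 independent runs; robust to throttled or noisy runs.
    return statistics.median(decode_speed(model_path) for _ in range(RUNS))
```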
Hardware
All benchmarks run on bare metal. No VMs, no containers, no cloud instances. This eliminates hypervisor overhead and IOMMU latency that would skew results.
Software
We use llama.cpp as the inference engine. It supports every GPU in our index through its CUDA, Metal, and Vulkan backends, and it gives us consistent, comparable numbers across vendors.
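A minimal sketch of a single run, assuming a recent llama.cpp build with the llama-bench tool on PATH; the token counts are illustrative, and the avg_ts field name assumes the current JSON output schema.

```python
import json
import subprocess

def decode_speed(model_path: str) -> float:
    """One independent llama-bench run; returns decode tokens/sec."""
    out = subprocess.run(
        [
            "llama-bench",
            "-m", model_path,  # GGUF model file
            "-p", "0",         # skip the prompt-processing test
            "-n", "128",       # decode 128 tokens (illustrative count)
            "-r", "1",         # one repetition; independence comes from separate runs
            "-o", "json",      # machine-readable output
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    # "avg_ts" (tokens/sec) is an assumption based on a recent build's
    # JSON schema; with -r 1 it is a single sample, not an average.
    return json.loads(out)[0]["avg_ts"]
```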
Parameters
Every run uses identical parameters. We measure single-stream decode speed: the real-world scenario of one user generating text, with no request batching or parallel streams.
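Concretely, decode speed counts only the generation phase, excluding prompt processing (prefill). A minimal sketch, with purely illustrative timestamps:

```python
def decode_tokens_per_sec(n_generated: int, t_prefill_done: float, t_done: float) -> float:
    """Single-stream decode speed: tokens generated divided by time spent
    decoding, with prompt processing (prefill) excluded."""
    return n_generated / (t_done - t_prefill_done)

# Illustrative numbers only: 128 tokens decoded in 3.2 s -> 40.0 tok/s.
print(decode_tokens_per_sec(128, t_prefill_done=0.9, t_done=4.1))
```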
Reporting
We run each configuration 5 times and report the median value. The median is more robust than the mean: it is not skewed by outliers caused by thermal throttling or background OS activity. The first run is always a cold start, and we don't discard it.
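A toy comparison shows why; the numbers below are made up, with one sample simulating a thermally throttled run.

```python
import statistics

# Five decode-speed samples in tokens/sec; 31.2 simulates a throttled run.
# All values are illustrative, not real measurements.
runs = [42.1, 41.8, 42.3, 31.2, 42.0]

print(statistics.mean(runs))    # 39.88 - dragged down by the one bad run
print(statistics.median(runs))  # 42.0  - unaffected by the single outlier
```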
Quantization levels
We test three quantization levels to cover the full spectrum of quality vs. speed tradeoffs; the sketch after the list shows where the size figures come from.
Q4_K_M: 4-bit quantization using llama.cpp's k-quant format (medium variant). Best speed-to-quality ratio for most users. ~70% smaller than FP16.
Q8_0: 8-bit quantization. Minimal quality loss vs. full precision. ~50% smaller than FP16. Good for tasks requiring high accuracy.
FP16: full 16-bit floating point. No quantization loss. Requires the most VRAM. Use when you need exact model fidelity.
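The size figures follow from effective bits per weight. A back-of-the-envelope sketch, assuming ~32e9 parameters for Qwen3 32B and approximate bits-per-weight values for llama.cpp's formats:

```python
# Approximate effective bits per weight; k-quants carry scale/metadata
# overhead, so Q4_K_M lands near ~4.8 bpw rather than exactly 4.0.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}
N_PARAMS = 32e9  # rough parameter count for a 32B model

for fmt, bpw in BITS_PER_WEIGHT.items():
    size_gb = N_PARAMS * bpw / 8 / 1e9
    saving = 1 - bpw / 16.0
    print(f"{fmt}: ~{size_gb:.0f} GB, ~{saving:.0%} smaller than FP16")
```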