Benchmark methodology
How we test every GPU in the index. Reproducible, transparent, no marketing numbers.
How we test
Every GPU in the index is tested with the same model, the same parameters, and the same tooling. We run Qwen3 32B at three quantization levels (Q4_K_M, Q8_0, FP16) and report the median decode speed, in tokens per second, across 5 independent runs. No cherry-picking, no warm caches, no synthetic loads.
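In outline, the harness looks like the sketch below. The runner is a stub standing in for the real llama.cpp invocation (see Software), and the model filenames are illustrative, not our exact paths.

```python
import statistics

# Illustrative quant-level -> GGUF filename mapping (not our exact paths).
MODELS = {
    "Q4_K_M": "qwen3-32b-q4_k_m.gguf",
    "Q8_0": "qwen3-32b-q8_0.gguf",
    "FP16": "qwen3-32b-f16.gguf",
}
RUNS = 5  # the first run is a cold start and is kept, not discarded

def decode_speed(model_path: str) -> float:
    """Stub for one independent benchmark run; returns decode tokens/sec.
    The real measurement shells out to llama.cpp (see Software)."""
    raise NotImplementedError

def score(model_path: str) -> float:
    # Median of 5 independent runs; robust to throttled or noisy runs.
    return statistics.median(decode_speed(model_path) for _ in range(RUNS))
```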
Hardware
All benchmarks run on bare metal. No VMs, no containers, no cloud instances. This eliminates hypervisor overhead and IOMMU latency that would skew results.
Software
We use llama.cpp as the inference engine. It supports every GPU in our index through its CUDA, Metal, and Vulkan backends, and it gives us consistent, comparable numbers across vendors.
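A minimal sketch of a single run, assuming a recent llama.cpp build with the llama-bench tool on PATH; the token counts are illustrative, and the avg_ts field name assumes the current JSON output schema.

```python
import json
import subprocess

def decode_speed(model_path: str) -> float:
    """One independent llama-bench run; returns decode tokens/sec."""
    out = subprocess.run(
        [
            "llama-bench",
            "-m", model_path,  # GGUF model file
            "-p", "0",         # skip the prompt-processing test
            "-n", "128",       # decode 128 tokens (illustrative count)
            "-r", "1",         # one repetition; independence comes from separate runs
            "-o", "json",      # machine-readable output
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    # "avg_ts" (tokens/sec) is an assumption based on a recent build's
    # JSON schema; with -r 1 it is a single sample, not an average.
    return json.loads(out)[0]["avg_ts"]
```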
Parameters
Every run uses identical parameters. We measure single-stream decode speed: the real-world scenario of one user generating text, with no request batching or parallel streams.
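Concretely, decode speed counts only the generation phase, excluding prompt processing (prefill). A minimal sketch, with purely illustrative timestamps:

```python
def decode_tokens_per_sec(n_generated: int, t_prefill_done: float, t_done: float) -> float:
    """Single-stream decode speed: tokens generated divided by time spent
    decoding, with prompt processing (prefill) excluded."""
    return n_generated / (t_done - t_prefill_done)

# Illustrative numbers only: 128 tokens decoded in 3.2 s -> 40.0 tok/s.
print(decode_tokens_per_sec(128, t_prefill_done=0.9, t_done=4.1))
```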
Reporting
We run each configuration 5 times and report the median value. The median is more robust than the mean: it is not skewed by outliers caused by thermal throttling or background OS activity. The first run is always a cold start, and we don't discard it.
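A toy comparison shows why; the numbers below are made up, with one sample simulating a thermally throttled run.

```python
import statistics

# Five decode-speed samples in tokens/sec; 31.2 simulates a throttled run.
# All values are illustrative, not real measurements.
runs = [42.1, 41.8, 42.3, 31.2, 42.0]

print(statistics.mean(runs))    # 39.88 - dragged down by the one bad run
print(statistics.median(runs))  # 42.0  - unaffected by the single outlier
```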
Quantization levels
We test three quantization levels to cover the full spectrum of quality vs. speed tradeoffs; the sketch after the list shows where the size figures come from.
Q4_K_M: 4-bit quantization using llama.cpp's k-quant format (medium variant). Best speed-to-quality ratio for most users. ~70% smaller than FP16.
Q8_0: 8-bit quantization. Minimal quality loss vs. full precision. ~50% smaller than FP16. Good for tasks requiring high accuracy.
FP16: full 16-bit floating point. No quantization loss. Requires the most VRAM. Use when you need exact model fidelity.
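The size figures follow from effective bits per weight. A back-of-the-envelope sketch, assuming ~32e9 parameters for Qwen3 32B and approximate bits-per-weight values for llama.cpp's formats:

```python
# Approximate effective bits per weight; k-quants carry scale/metadata
# overhead, so Q4_K_M lands near ~4.8 bpw rather than exactly 4.0.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}
N_PARAMS = 32e9  # rough parameter count for a 32B model

for fmt, bpw in BITS_PER_WEIGHT.items():
    size_gb = N_PARAMS * bpw / 8 / 1e9
    saving = 1 - bpw / 16.0
    print(f"{fmt}: ~{size_gb:.0f} GB, ~{saving:.0%} smaller than FP16")
```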