ViT ablation lab

Precisions to compare

fp32 is always run as the reference (cosine sim and KL are measured against it).

Pick an image, choose precisions, then hit Run all.

First run takes ~60s — the server loads and warms each model variant. Subsequent runs are fast.

Why some precisions are slower on this CPU

This server runs on a generic x86 CPU without dedicated narrow-precision SIMD (no AMX, no FP16 instructions). PyTorch's CPU kernels emulate FP16/BF16 matmul by upcasting to FP32 — so you pay an extra cast per op and get worse cache utilization, not faster math. On GPUs with Tensor Cores it's the opposite: FP16/BF16 are roughly 2× faster than FP32.

INT8 dynamic is genuinely faster here because x86 CPUs have well-optimized INT8 GEMM kernels (FBGEMM / oneDNN). It quantizes only nn.Linear weights; LayerNorm and the attention math stay FP32. That's why model size drops ~4× from FP32 while inference is only ~1.6× faster: the non-Linear ops still run in FP32.
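The gap between the size reduction and the speedup can be reasoned about with Amdahl's law: only the share of runtime spent in nn.Linear gets faster. Below is a minimal sketch of that arithmetic. The function name and both input numbers (the fraction of FP32 runtime spent in nn.Linear, and the INT8 kernel speedup) are illustrative assumptions, not measurements from this server.

```python
def overall_speedup(linear_fraction: float, linear_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the runtime is accelerated.

    linear_fraction: share of FP32 runtime spent in nn.Linear (hypothetical input).
    linear_speedup:  how much faster the INT8 kernel runs that share (hypothetical input).
    """
    if not 0.0 <= linear_fraction <= 1.0:
        raise ValueError("linear_fraction must be in [0, 1]")
    if linear_speedup <= 0.0:
        raise ValueError("linear_speedup must be positive")
    # The non-Linear share (LayerNorm, attention math) keeps its FP32 cost.
    unchanged = 1.0 - linear_fraction
    accelerated = linear_fraction / linear_speedup
    return 1.0 / (unchanged + accelerated)


# Illustrative inputs only: half the runtime in nn.Linear, INT8 kernel 4x faster.
example = overall_speedup(linear_fraction=0.5, linear_speedup=4.0)
```

With these assumed inputs the end-to-end speedup works out to about 1.6×, and even an infinitely fast Linear kernel would cap it at 2×, because the FP32 half of the runtime is untouched.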

The CLS cosine and KL from fp32 columns tell you whether a precision is numerically lossy. FP16/BF16 usually land at cosine > 0.999. INT8 dynamic is the lossy one, and it can flip the top-1 prediction outright, as you'll see on tricky images.
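For reference, here is a minimal sketch of how these two metrics can be computed. It assumes the cosine is taken between CLS embedding vectors and that "KL from fp32" means KL(fp32 || variant) over softmaxed logits; the direction of the KL, the function names, and the toy numbers are assumptions for illustration, not this server's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (e.g. CLS embeddings)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_from_reference(ref_logits, test_logits, eps=1e-12):
    """KL(p_ref || p_test) in nats; the direction is an assumption of this sketch."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must have the same length")
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def top1(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy logits (made up): two near-tied classes, slightly perturbed by quantization.
fp32_logits = [2.0, 1.9, 0.1]
int8_logits = [1.9, 2.0, 0.1]
flipped = top1(fp32_logits) != top1(int8_logits)
```

In this toy case the two logit vectors are nearly parallel and the KL is small, yet the top-1 index still changes. That is the situation to look for on tricky images: small numerical drift matters most when two classes are close to tied.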