Why some precisions are slower on this CPU
This server runs on a generic x86 CPU without dedicated narrow-precision SIMD (no AMX, no FP16 instructions). PyTorch's CPU kernels emulate FP16/BF16 matmul by upcasting to FP32 — so you pay an extra cast per op and get worse cache utilization, not faster math. On GPUs with Tensor Cores it's the opposite: FP16/BF16 are roughly 2× faster than FP32.
INT8 dynamic is genuinely faster here because x86 CPUs have well-optimized INT8 GEMM kernels (FBGEMM / oneDNN). It quantizes only nn.Linear weights; LayerNorm and the attention math stay FP32. That's why model size drops ~4× relative to FP32 but inference is only ~1.6× faster: the non-Linear ops still run in FP32.
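To make the ~4× size figure concrete, here is a minimal plain-Python sketch of symmetric per-tensor INT8 weight quantization. It is an illustration of the storage arithmetic only, not what FBGEMM or oneDNN actually run; the function names, the single per-tensor scale, and the random weights are all assumptions made for the example. In PyTorch itself the usual entry point for this mode is torch.ao.quantization.quantize_dynamic.

```python
from array import array
import random


def quantize_symmetric_int8(weights):
    """Map floats to int8 using one scale for the whole tensor (assumed scheme)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = array(
        "b",  # signed 8-bit integers, 1 byte each
        (max(-127, min(127, round(w / scale))) for w in weights),
    )
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weight tensor: 1024 values stored as 32-bit floats.
random.seed(0)
weights = array("f", (random.uniform(-1.0, 1.0) for _ in range(1024)))

q_weights, scale = quantize_symmetric_int8(weights)

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = q_weights.itemsize * len(q_weights)

# Worst-case rounding error introduced by quantization.
max_error = max(
    abs(w - r) for w, r in zip(weights, dequantize(q_weights, scale))
)

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes")
print(f"max abs error: {max_error:.6f} (scale = {scale:.6f})")
```

The weights shrink by exactly 4× because each value goes from 4 bytes to 1, and the rounding error per weight is bounded by half the scale. The speedup is smaller than the size reduction because only the Linear matmuls benefit.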
The CLS cosine and KL from fp32 columns tell you how much numerical error each precision introduces relative to the FP32 baseline. FP16/BF16 usually land at cosine > 0.999. INT8 dynamic is the lossy one, and it can flip the top-1 prediction outright, as you'll see on tricky images.
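As a rough sketch of how these two columns could be computed, the snippet below compares a pair of CLS embeddings with cosine similarity and a pair of logit vectors with KL divergence. The numbers are made up for illustration, and the direction KL(fp32 || quantized) is an assumption about how the column is defined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical outputs for one image (not real measurements).
fp32_cls = [0.50, -1.20, 0.80, 2.00]
int8_cls = [0.48, -1.10, 0.90, 1.90]
fp32_logits = [2.0, 1.9, 0.1]
int8_logits = [1.9, 2.0, 0.1]

cos = cosine_similarity(fp32_cls, int8_cls)
kl = kl_divergence(softmax(fp32_logits), softmax(int8_logits))
top1_fp32 = argmax(fp32_logits)
top1_int8 = argmax(int8_logits)

print(f"CLS cosine: {cos:.4f}")
print(f"KL from fp32: {kl:.6f}")
print(f"top-1 flipped: {top1_fp32 != top1_int8}")
```

In this made-up case the embeddings stay closely aligned and the KL is small, yet the top-1 class still changes because the two leading logits were nearly tied. That is the kind of flip to expect on tricky images.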