Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition