Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance