Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs