Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures