Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation