D4: Improving LLM Pretraining via Document De-Duplication and Diversification