Combatting Dimensional Collapse in LLM Pre-Training Data via Diversified File Selection
