Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training