Pretraining Data and Tokenizer for Indic LLM