Formalising lexical and syntactic diversity for data sampling in French