Principled Detection of Hallucinations in Large Language Models via Multiple Testing