IRM--when it works and when it doesn't: A test case of natural language inference