New open-source platform allows users to evaluate performance of AI-powered chatbots
A team of computer scientists, engineers, mathematicians and cognitive scientists has developed CheckMate, an open-source evaluation platform that lets human users interact with and rate the performance of large language models (LLMs). To test the platform, the researchers ran an experiment in which human participants used three LLMs – InstructGPT, ChatGPT and GPT-4 – as assistants for solving undergraduate-level mathematics problems, studying how well the models could help participants reach solutions. Although a chatbot's correctness generally correlated with its perceived helpfulness, the researchers also found cases where an LLM's output was incorrect yet still useful to participants – and, conversely, cases where participants mistook incorrect LLM outputs for correct ones.
Jul-1-2024, 08:56:39 GMT