TEL'M: Test and Evaluation of Language Models
Cybenko, George, Ackerman, Joshua, Lintilhac, Paul
arXiv.org Artificial Intelligence
It is assumed that readers are already familiar with Language Models (LMs) of various flavors, such as Transformer-based Language Models (currently the most promising and studied LMs) [78]; Multimodal Foundation Models such as BLIP-2 [48] and CLIP [61]; Auto-regressive Language Models [15, 51]; Recurrent Neural Network Language Models [75]; State Space Language Models [40]; and Hybrid Models [24], as well as the current and proposed use cases and the various technologies underlying them [1, 65, 70]. There is growing interest in LM performance and benchmarks [13, 16, 18, 46, 47, 64, 72, 74, 80], with recent acknowledgement that this is a hard problem [53]. Many suggestions are proposed in the commercial literature [17], and a large number of benchmark-based methods have surfaced (Big Bench [67], GLUE Benchmark, SuperGLUE Benchmark, OpenAI Moderation API, MMLU, EleutherAI LM Eval, OpenAI Evals, Adversarial NLI, LIT, ParlAI, CoQA, LAMBADA, HellaSwag, LogiQA, MultiNLI, and SQuAD, to name a few). A review of existing approaches demonstrates that they are not quantitative or rigorous enough to pass muster with respect to accepted testing requirements [3, 55]. In particular, existing uses of benchmarks do not investigate the extent to which a benchmark can predict or quantify certain properties on future prompts (that is, the statistical soundness of any conclusions), and they do not identify the factors that affect performance, as would be possible with more rigorous experimental design and test execution. LMs can be black box, gray box, or white box, according to the visibility into the architecture and training data used to create an LM (see Table 1). Remote black box LMs typically throttle the number of prompts, so sustained access for testing could be difficult unless priority access to an API is given. For example, ChatGPT limits users to a small number of free prompts but allows unlimited prompts on its subscription option.
Additionally, reproducibility may not be guaranteed because of randomness in the response generation and/or continuous adaptation of the LM platform.
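To make the point about statistical soundness concrete, the following is a minimal sketch of one way a benchmark pass rate could be reported with a confidence interval rather than as a bare score. It is an illustration only and not the method of the paper: it assumes that the test prompts are an independent random sample from the same distribution as future prompts and that each response is graded as a binary pass or fail. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    Assumption (not from the source text): each prompt is an independent
    draw from the future prompt distribution and is graded pass/fail.
    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: 80 of 100 sampled prompts were graded as passing.
    low, high = wilson_interval(80, 100)
    print(f"observed pass rate 0.80, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions, the width of the interval shows how many prompts a throttled black box API would need to serve before a claim about future performance becomes meaningful. If responses are randomized or the platform changes between test runs, the independence assumption may not hold and the interval should be read with caution.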
Apr-15-2024