TEL'M: Test and Evaluation of Language Models

Cybenko, George, Ackerman, Joshua, Lintilhac, Paul

Apr-15-2024–arXiv.org Artificial Intelligence

It is assumed that readers are already familiar with Language Models (LMs) of various flavors, such as Transformer-based Language Models (currently the most promising and most studied LMs) [78]; Multimodal Foundation Models such as Blip-2 [48] and CLIP [61]; Auto-regressive Language Models [15, 51]; Recurrent Neural Network Language Models [75]; State Space Language Models [40]; and Hybrid Models [24], as well as with the current and proposed use cases and the various technologies underlying them [1, 65, 70]. There is growing interest in LM performance and benchmarks [13, 16, 18, 46, 47, 64, 72, 74, 80], with recent acknowledgement that this is a hard problem [53]. Many suggestions have been proposed in the commercial literature [17], and a large number of benchmark-based methods have surfaced (Big Bench [67], GLUE Benchmark, SuperGLUE Benchmark, OpenAI Moderation API, MMLU, EleutherAI LM Eval, OpenAI Evals, Adversarial NLI, LIT, ParlAI, CoQA, LAMBADA, HellaSwag, LogiQA, MultiNLI and SQUAD, to name a few).

A review of existing approaches demonstrates that they are not quantitative or rigorous enough to pass muster with respect to accepted testing requirements [3, 55]. In particular, existing uses of benchmarks do not investigate the extent to which a benchmark can predict or quantify certain properties on future prompts (that is, the statistical soundness of any conclusions), and they do not identify the factors that affect performance, as would be possible with more rigorous experimental design and test execution.

LMs can be black box, gray box or white box according to the visibility into the architecture and training data used to create an LM (see Table 1). Remote black box LMs typically throttle the number of prompts, so sustained access for testing can be difficult unless priority access to an API is given. For example, ChatGPT limits users to a small number of free prompts but allows unlimited prompts on its subscription option. Additionally, reproducibility may not be guaranteed because of randomness in response generation and/or continuous adaptation of the LM platform.
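
To make the point about statistical soundness concrete, the following is a minimal sketch of one standard way to attach a confidence interval to a benchmark pass rate. It is an illustration only and is not taken from the paper: the Wilson score interval is a swapped-in textbook technique, the function name `wilson_interval` is hypothetical, and the sketch assumes that each test prompt is scored pass/fail and that test prompts are independent draws from the same distribution as future prompts.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, not from the paper).

    Assumes independent pass/fail outcomes drawn from the same distribution
    as the future prompts of interest. z = 1.96 gives roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Same observed pass rate (80%), different numbers of test prompts.
    for n in (50, 100, 1000):
        low, high = wilson_interval(int(0.8 * n), n)
        print(f"n={n}: pass rate 0.80, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of test prompts grows, which is why a throttled prompt budget directly limits how strong a claim about future prompts can be. If the LM's responses are randomized or the platform changes over time, the independence and same-distribution assumptions may not hold, and the interval should be read with corresponding caution.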



