Establishing Construct Validity in LLM Capability Benchmarks Requires Nomological Networks
Recent work in machine learning increasingly attributes human-like capabilities such as reasoning or theory of mind to large language models (LLMs) on the basis of benchmark performance. This paper examines this practice through the lens of construct validity, understood as the problem of linking theoretical capabilities to their empirical measurements. It contrasts three influential frameworks: the nomological account developed by Cronbach and Meehl, the inferential account proposed by Messick and refined by Kane, and Borsboom's causal account. I argue that the nomological account provides the most suitable foundation for current LLM capability research. It avoids the strong ontological commitments of the causal account while offering a more substantive framework for articulating construct meaning than the inferential account. I explore the conceptual implications of adopting the nomological account for LLM research through a concrete case: the assessment of reasoning capabilities in LLMs.
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Model Details
We decreased the confidence threshold to 0.1 to increase article and headline detections. This limit is binding for common words, e.g., "the". The following specifications were used: { resolution: 256, learning rate: 2e-3 }. The recognizer is trained using the Supervised Contrastive ("SupCon") loss function [7], a generalization of contrastive learning to the supervised setting; in particular, we work with the "outside" SupCon loss formulation, with 0.1 as the temperature. We use a MobileNetV3 (Small) encoder pre-trained on ImageNet1k, sourced from the timm library [19]; a MobileNetV3 (Small) model is developed in [2] for character recognition. Inputs are center-cropped, to avoid destroying too much information. If multiple article bounding boxes satisfy these rules for a given headline, then we take the highest-confidence one.
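The "outside" SupCon formulation referenced above places the average over positives outside the logarithm. A minimal NumPy sketch, assuming L2-normalized embeddings and integer class labels; the function name, shapes, and the temperature default of 0.1 are illustrative, not the authors' actual implementation:

```python
import numpy as np

def supcon_outside_loss(features, labels, temperature=0.1):
    """'Outside' SupCon loss (L^sup_out) of [7], sketched in NumPy.

    features: (N, D) array of embeddings (normalized below).
    labels:   (N,) array of integer class ids.
    """
    # L2-normalize so dot products are cosine similarities.
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature
    n = len(labels)
    # A(i): every sample except the anchor itself.
    cand_mask = ~np.eye(n, dtype=bool)
    # Numerically stable log-softmax over each anchor's candidates.
    sim_max = np.max(np.where(cand_mask, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * cand_mask
    log_prob = (sim - sim_max) - np.log(exp_sim.sum(axis=1, keepdims=True))
    # P(i): same-label samples, excluding the anchor.
    pos_mask = (labels[:, None] == labels[None, :]) & cand_mask
    # The mean over positives sits *outside* the log -- hence "outside" loss.
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1) / np.maximum(
        pos_mask.sum(axis=1), 1
    )
    return -mean_log_prob_pos.mean()
```

With perfectly clustered features (each class collapsed to one direction) and 3 positives plus 4 far negatives per anchor, the loss approaches log 3, its minimum for that batch shape; poorly clustered features score higher.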
- North America > United States (0.14)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Law (1.00)
- Information Technology (1.00)
- Government (1.00)
- Oceania > Australia (0.05)
- Asia > China (0.05)
- North America > United States > Texas (0.04)
- (6 more...)
Incorporating Geographical and Temporal Contexts into Generative Commonsense Reasoning
Recently, commonsense reasoning in text generation has attracted much attention. Generative commonsense reasoning is the task of requiring machines, given a group of keywords, to compose a single coherent sentence with commonsense plausibility. While existing datasets targeting generative commonsense reasoning focus on everyday scenarios, it is unclear how well machines reason under specific geographical and temporal contexts.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- Oceania > Australia (0.06)
- (18 more...)
- North America > United States (0.04)
- Europe > United Kingdom (0.04)
- Europe > Russia (0.04)
- (6 more...)
- Research Report > New Finding (0.67)
- Instructional Material (0.67)
- Research Report > Promising Solution (0.45)
- Law (0.97)
- Government (0.68)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)
- (2 more...)