grammaticality
Author feedback for NeurIPS paper: Learning Mutational Semantics (6754e06e46dfa419d5afe3c9781cecad-AuthorFeedback.pdf)
So, the fact that our training data comes solely from infectious virus, which would be highly probable (or "grammatical") sequences under our language model (LM), is a key feature of our approach. Importantly, however, we note that, fundamentally, CSCS is presented in generality here, so these methods are not strictly "competitor methods" in the sense that, if one were to work better, it would still be incorporable within the CSCS framework. "ℓ1 rather than Euclidean": We used ℓ1 since it has nicer properties than, e.g., ℓ2 in high-dimensional spaces (Aggarwal et al., ICDT, 2001), but other distance metrics could be empirically quantified. "theoretical detail"/"how the method works": We apologize for sparsity of detail. "choice of beta": We find good robustness of β values reasonably close to 1 (e.g., 0.5-2).
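The response references two ingredients: semantic change, measured as an ℓ1 distance between embeddings, and grammaticality, measured as probability under the LM, combined with a weight β. The following is an illustrative rank-based combination in the spirit of CSCS, not the authors' implementation; `embed` and `log_prob` are assumed interfaces.

```python
# An illustrative, rank-based combination in the spirit of CSCS; not the
# authors' implementation. `embed` and `log_prob` are assumed interfaces.
import numpy as np

def cscs_rank(wildtype, candidates, embed, log_prob, beta=1.0):
    """Order candidate sequences by combined semantic change + grammaticality.

    Semantic change is the l1 distance between a candidate's embedding and
    the wild-type embedding (l1 rather than Euclidean, per the response);
    grammaticality is the candidate's log-probability under the LM.
    """
    z_wt = embed(wildtype)
    sem_change = np.array([np.abs(embed(s) - z_wt).sum() for s in candidates])
    gram = np.array([log_prob(s) for s in candidates])
    # Rank-transform both terms so beta weighs comparable, scale-free scores;
    # beta values near 1 (e.g., 0.5-2) are reported to be robust.
    score = sem_change.argsort().argsort() + beta * gram.argsort().argsort()
    return [candidates[i] for i in np.argsort(-score)]
```

Rank-transforming both terms keeps them on a common scale, which is consistent with the reported robustness of β near 1.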
What Can String Probability Tell Us About Grammaticality?
Hu, Jennifer; Wilcox, Ethan Gotlieb; Song, Siyuan; Mahowald, Kyle; Levy, Roger P.
What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM's underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models' and humans' deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs' structural knowledge, and suggest directions for future work in LM grammatical evaluation.
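For concreteness, the core quantity in such evaluations, the within-pair probability delta, can be computed as follows (a minimal sketch, not the paper's code, assuming a Hugging Face causal LM; the model choice and sentences are illustrative):

```python
# A minimal sketch: scoring a minimal pair by string log-probability with a
# Hugging Face causal LM (model choice illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_prob(sentence: str) -> float:
    """Summed token log-probability of the string under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean next-token NLL over ids.shape[1] - 1 predictions.
    return -out.loss.item() * (ids.shape[1] - 1)

# Within-pair delta: grammatical variant minus its minimally different mate.
delta = log_prob("The keys to the cabinet are on the table.") - \
        log_prob("The keys to the cabinet is on the table.")
```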
FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization
Hong, Shibo; Ying, Jiahao; Liang, Haiyuan; Zhang, Mengdi; Kuang, Jun; Zhang, Jiazheng; Cao, Yixin
Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modalities rapidly expand. Existing "MLLM-as-a-Judge" evaluators, though promising, remain constrained to specific tasks and aspects. In this paper, we argue that, on the one hand, given the interconnected nature of aspects, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual aspects and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) we first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks; (2) based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations; (3) finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and jointly learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.
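The pairwise, aspect-level structure described in (2) might look roughly like the following (field names and values are hypothetical, not the released schema):

```python
# Hypothetical shape of one FRABench pairwise sample; field names and values
# are illustrative, not the released schema.
sample = {
    "task": "Image Generation",    # one of the four evaluation tasks
    "aspect": "object fidelity",   # a leaf of the 112-aspect taxonomy (illustrative)
    "prompt": "...",
    "output_a": "...",             # the two candidate outputs under comparison
    "output_b": "...",
    "preference": "A",             # pairwise label from human or GPT-4o annotation
}
```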
Large Language Model probabilities cannot distinguish between possible and impossible language
Leivada, Evelina; Montero, Raquel; Morosi, Paolo; Moskvina, Natalia; Serrano, Tamara; Aguilar, Marcel; Guenther, Fritz
A controversial test for Large Language Models concerns the ability to discern possible from impossible language. While some evidence attests to the models' sensitivity to what crosses the limits of possible language, this evidence has been contested on the grounds of the soundness of the testing material. We use model-internal representations to tap directly into the way Large Language Models represent the 'grammatical-ungrammatical' distinction. In a novel benchmark, we elicit probabilities from four models and compute minimal-pair surprisal differences, juxtaposing the probabilities assigned to grammatical sentences with those assigned to (i) lower-frequency grammatical sentences, (ii) ungrammatical sentences, (iii) semantically odd sentences, and (iv) pragmatically odd sentences. The prediction is that if string probabilities can function as proxies for the limits of grammar, the ungrammatical condition will stand out among the conditions that involve linguistic violations, showing a spike in surprisal. Our results do not reveal a unique surprisal signature for ungrammatical prompts, as the semantically and pragmatically odd conditions consistently show higher surprisal. We thus demonstrate that probabilities do not constitute reliable proxies for model-internal representations of syntactic knowledge. Consequently, claims about models being able to distinguish possible from impossible language need verification through a different methodology.
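The benchmark logic, comparing mean minimal-pair surprisal differences across violation conditions, can be sketched as follows (an illustrative sketch; `surprisal` is an assumed scoring function, e.g., negative LM log-probability):

```python
# Sketch of the benchmark logic: compare mean minimal-pair surprisal
# differences across violation conditions.
from statistics import mean
from typing import Callable, Dict, List, Tuple

def condition_deltas(
    pairs: Dict[str, List[Tuple[str, str]]],
    surprisal: Callable[[str], float],
) -> Dict[str, float]:
    """Mean surprisal(variant) - surprisal(baseline) per condition.

    `pairs` maps condition names (e.g., "ungrammatical", "semantically odd")
    to (baseline, variant) sentence pairs. If probability tracked
    grammaticality, "ungrammatical" would show the largest mean delta; the
    paper reports higher deltas for the odd-meaning conditions instead.
    """
    return {
        cond: mean(surprisal(v) - surprisal(b) for b, v in ps)
        for cond, ps in pairs.items()
    }
```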
The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English
We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure representation of the contemporary discourse. Each entry is labeled with its grammatical status ("well-formedness" according to syntactic formalisms), extracted from the literature, as well as its acceptability status ("intuitive goodness" as determined by native speakers), obtained through crowdsourcing under rigorous experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: we observe that grammaticality and acceptability judgments converge in about 83% of cases and that "in-betweenness" occurs frequently, corroborating existing research. We also find that while machine learning models struggle to predict grammaticality, they perform considerably better at predicting acceptability; this is a novel finding. Future work will focus on expanding the dataset.
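The reported convergence statistic amounts to an agreement rate between the two label types; a minimal sketch, assuming hypothetical field names and a thresholded acceptability score:

```python
from typing import Dict, List, Union

def convergence_rate(
    entries: List[Dict[str, Union[bool, float]]], threshold: float = 0.5
) -> float:
    """Fraction of entries whose grammaticality label agrees with a
    thresholded acceptability judgment (reported as ~83% above).

    Field names ("grammatical", "acceptability") are hypothetical.
    """
    agree = sum(
        e["grammatical"] == (e["acceptability"] >= threshold) for e in entries
    )
    return agree / len(entries)
```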
Review for NeurIPS paper: Learning Mutational Semantics
Weaknesses: There are a few weaknesses that might be helpful to address (also see comment on Correctness): clarification of the notion of "grammaticality" for this problem; further connections to similar approaches for protein modeling that could be considered; the slightly ad hoc nature of the CSCS objective; and the fact that comparisons made to similarly high-capacity deep unsupervised models in the Appendix did not use viral data. These are explained in further detail below:

Appropriateness of "grammaticality" for the viral immunological escape problem: I appreciate the trend of using massive amounts of unsupervised data to circumvent the difficulty of obtaining fitness measurements for biological sequences, which this work also advances. However, there is some degree of implicit supervision involved here, in that the amino acid sequences used (described in Lines 199-200) are explicitly from infectious viruses (rather than somehow being neutral/benign). It's also not clear that casting this observation as merely an issue of "grammaticality" makes sense: if a "grammatically correct" sequence is one that belongs to an infectious virus, what's the difference between grammaticality and semantics (which are also supposed to capture what makes a sequence infectious)? Perhaps one could claim that grammaticality in this context has to do with whether the protein is "valid", in the sense that it folds or is stable, but this is not explained, and does not absolve the first point that all the sequence data come from infectious viruses (rather than, for example, all valid protein variants that fold/are stable, which would allow for a much clearer distinction between grammaticality and semantics).
Generating a biomedical knowledge graph question answering dataset
The biomedical domain is a complex network of interconnected knowledge, encompassing genetics, diseases, drugs, and biological processes. While knowledge graphs (KGs) excel at organizing and linking this information, their complexity often makes them difficult for users to query. Ideally, users should be able to ask questions in natural language and receive precise answers directly from the KG, without needing specialized query expertise. However, enabling deep learning-based systems to query KGs using natural language remains a major challenge. Existing biomedical knowledge graph question answering (BioKGQA) datasets are small and limited in scope, typically containing only a few hundred question-answer (QA) pairs.
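An example of the kind such a dataset would contain might pair a natural-language question with a structured KG query and its answer (a hypothetical sketch; the query language, predicate, and entity identifiers are illustrative, not drawn from the dataset):

```python
# A hypothetical BioKGQA-style example; the SPARQL, predicate IRI, and
# entities are illustrative, not drawn from the dataset.
qa_pair = {
    "question": "Which drugs target the gene BRCA1?",
    "query": "SELECT ?drug WHERE { ?drug <urn:ex:targets> <urn:ex:BRCA1> . }",
    "answers": ["..."],  # entities retrieved by executing the query on the KG
}
```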