 fluency




b125999bde7e80910cbdbd323087df8f-Supplemental-Conference.pdf

Neural Information Processing Systems

For each prompt, we compare 6 pairs of models: Quark versus other baselines, as shown in Table 2. These agreement scores are moderate as a result of the subjectivity involved in ratings of text quality. PPLM (Plug and Play Language Model) uses one or more classifiers to control attributes of model generations. Figure 8: Screenshot of the Mechanical Turk interface used to gather human judgments for the sentiment evaluation. Unlikelihood represents a GPT-2 model fine-tuned with the unlikelihood objective (Eqn. 5) [79].





Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

Samuel, David, Øvrelid, Lilja, Velldal, Erik, Kutuzov, Andrey

arXiv.org Artificial Intelligence

We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.


A Definition of AGI

Hendrycks, Dan, Song, Dawn, Szegedy, Christian, Lee, Honglak, Gal, Yarin, Brynjolfsson, Erik, Li, Sharon, Zou, Andy, Levine, Lionel, Han, Bo, Fu, Jie, Liu, Ziwei, Shin, Jinwoo, Lee, Kimin, Mazeika, Mantas, Phan, Long, Ingebretsen, George, Khoja, Adam, Xie, Cihang, Salaudeen, Olawale, Hein, Matthias, Zhao, Kevin, Pan, Alexander, Duvenaud, David, Li, Bo, Omohundro, Steve, Alfour, Gabriel, Tegmark, Max, McGrew, Kevin, Marcus, Gary, Tallinn, Jaan, Schmidt, Eric, Bengio, Yoshua

arXiv.org Artificial Intelligence

The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains (including reasoning, memory, and perception) and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.


On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts

Gulzar, Kashaf, Wagner, Dominik, Bayerl, Sebastian P., Hönig, Florian, Bocklet, Tobias, Riedhammer, Korbinian

arXiv.org Artificial Intelligence

Automatic transcription of stuttered speech remains a challenge, even for modern end-to-end (E2E) automatic speech recognition (ASR) frameworks. Dysfluencies and fluency-shaping artifacts are often overlooked, resulting in non-verbatim transcriptions with limited clinical and research value. We propose a parameter-efficient adaptation method to decode dysfluencies and fluency modifications as special tokens within transcriptions, evaluated on simulated (LibriStutter, English) and natural (KSoF, German) stuttered speech datasets. To mitigate ASR performance disparities and bias towards English, we introduce a multi-step fine-tuning strategy with language-adaptive pretraining. Tokenization analysis further highlights the tokenizer's English-centric bias, which poses challenges for improving performance on German data. Our findings demonstrate the effectiveness of lightweight adaptation techniques for dysfluency-aware ASR while exposing key limitations in multilingual E2E systems.


Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation

Hamedi, Parisa, Jelodar, Hamed, Bai, Samita, Meymani, Mohammad, Razavi-Far, Roozbeh, Ghorbani, Ali A.

arXiv.org Artificial Intelligence

Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation. We assess model performance using a diverse set of metrics capturing lexical similarity (BLEU, ROUGE, and METEOR), semantic alignment (BERTScore), fluency (Perplexity), and efficiency (time prediction). Our results reveal clear trade-offs: while certain models excel in text similarity metrics, others demonstrate lower perplexity or faster inference times. We further provide qualitative analyses of typical model successes and failure cases, highlighting challenges such as control flow recovery and identifier reconstruction. Taken together, our benchmark offers actionable insights into the strengths and limitations of current large language models for program translation, establishing a foundation for future research in combining accuracy with efficiency for real-world applications.