Prominence-Aware Artifact Detection and Dataset for Image Super-Resolution
Molodetskikh, Ivan, Malyshev, Kirill, Mirgaleev, Mark, Zagainov, Nikita, Bogatyrev, Evgeney, Vatolin, Dmitriy
Generative image super-resolution (SR) is rapidly advancing in visual quality and detail restoration. As the capacity of SR models expands, however, so does their tendency to produce artifacts: incorrect, visually disturbing details that reduce perceived quality. Crucially, their perceptual impact varies: some artifacts are barely noticeable while others strongly degrade the image. We argue that artifacts should be characterized by their prominence to human observers rather than treated as uniform binary defects. Motivated by this, we present a novel dataset of 1302 artifact examples from 11 contemporary image-SR methods, where each artifact is paired with a crowdsourced prominence score. Building on this dataset, we train a lightweight regressor that produces spatial prominence heatmaps and outperforms existing methods at detecting prominent artifacts. We release the dataset and code to facilitate prominence-aware evaluation and mitigation of SR artifacts.
A Unified Theory of Language
A unified theory of language combines a Bayesian cognitive linguistic model of language processing with the proposal that language evolved by sexual selection for the display of intelligence. The theory accounts for the major facts of language, including its speed and expressivity, and data on language diversity, pragmatics, syntax and semantics. The computational element of the theory is based on Construction Grammars. These give an account of the syntax and semantics of the world's languages, using constructions and unification. Two novel elements are added to construction grammars: an account of language pragmatics, and an account of fast, precise language learning. Constructions are represented in the mind as graph-like feature structures. People use slow general inference to understand the first few examples they hear of any construction. After that it is learned as a feature structure, and is rapidly applied by unification. All aspects of language (phonology, syntax, semantics, and pragmatics) are seamlessly computed by fast unification; there is no boundary between semantics and pragmatics. This accounts for the major puzzles of pragmatics, and for detailed pragmatic phenomena. Unification is Bayesian maximum likelihood pattern matching. This gives evolutionary continuity between language processing in the human brain and Bayesian cognition in animal brains. Language is the basis of our mind-reading abilities, our cooperation, self-esteem and emotions; the foundations of human culture and society.
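As a rough illustration of what applying a construction "by unification" can mean, the sketch below unifies two feature structures represented as nested Python dictionaries. It is a minimal textbook-style unifier written for this listing, not the model described in the abstract: it omits shared (re-entrant) substructures and the Bayesian weighting, and the feature names (`cat`, `agr`, `num`, `per`) are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None if the two structures clash.
    Atomic values unify only when they are equal. None is reserved as the
    failure signal, so it must not be used as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left untouched
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # a clash anywhere fails the whole unification
                result[key] = merged
            else:
                result[key] = b_value  # feature only present in b: just add it
        return result
    return a if a == b else None


# Hypothetical feature names, chosen only for illustration.
noun_phrase = {"cat": "NP", "agr": {"num": "sg"}}
third_person = {"agr": {"per": 3}}
plural = {"agr": {"num": "pl"}}

print(unify(noun_phrase, third_person))  # compatible: features are merged
print(unify(noun_phrase, plural))        # incompatible number feature: None
```

The point of the sketch is that compatible structures merge into a more specific one while any conflicting feature makes the whole match fail, which is what makes unification usable as fast pattern matching.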
Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility
Wang, Sheng-Fu, Prevot, Laurent, Chi, Jou-an, Huang, Ri-Sheng, Hsieh, Shu-Kai
The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.
The time scale of redundancy between prosody and linguistic context
Regev, Tamar I., Ohams, Chiebuka, Xie, Shaylee, Wolf, Lukas, Fedorenko, Evelina, Warstadt, Alex, Wilcox, Ethan G., Pimentel, Tiago
In spoken language, speakers transmit information not only using words, but also via a rich array of non-verbal signals, which include prosody -- the auditory features of speech. However, previous studies have shown that prosodic features exhibit significant redundancy with both past and future words. Here, we examine the time scale of this relationship: How many words in the past (or future) contribute to predicting prosody? We find that this scale differs for past and future words. Prosody's redundancy with past words extends across approximately 3-8 words, whereas redundancy with future words is limited to just 1-2 words. These findings indicate that the prosody-future relationship reflects local word dependencies or short-scale processes such as next word prediction, while the prosody-past relationship unfolds over a longer time scale. The latter suggests that prosody serves to emphasize earlier information that may be challenging for listeners to process given limited cognitive resources in real-time communication. Our results highlight the role of prosody in shaping efficient communication.
A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings
Mondal, Anindita, Bharadwaj, Rangavajjala Sankara, Mallela, Jhansi, Vuppala, Anil Kumar, Yarra, Chiranjeevi
Automatic detection of prominence at the word and syllable levels is critical for building computer-assisted language learning systems. It has been shown that prosody embeddings learned by the current state-of-the-art (SOTA) text-to-speech (TTS) systems could generate word- and syllable-level prominence in the synthesized speech as natural as in native speech. To understand the effectiveness of prosody embeddings from TTS for prominence detection in a non-native context, a comparative analysis is conducted on the embeddings extracted from native and non-native speech considering the prominence-related embeddings: duration, energy, and pitch from a SOTA TTS named FastSpeech2. These embeddings are extracted under two conditions considering: 1) only text, 2) both speech and text. For the first condition, the embeddings are extracted directly from the TTS inference mode, whereas for the second condition, we propose to extract from the TTS under training mode. Experiments are conducted on a native speech corpus (Tatoeba) and a non-native speech corpus (ISLE). For experimentation, word-level prominence locations are manually annotated for both corpora. The highest relative improvements in word- and syllable-level prominence detection accuracies with the TTS embeddings are found to be 13.7% and 5.9%, and 16.2% and 6.9%, compared to those with the heuristic-based features and self-supervised Wav2Vec-2.0 representations, respectively.
DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
Cho, Hyowon, Ka, Soonwon, Park, Daechul, Kang, Jaewook, Seo, Minjoon, Son, Bokyung
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.
The 5th Paradigm: AI-Driven Scientific Discovery
How many times must a phenomenon occur before it graduates from a coincidence to a pattern? Usually, the answer depends on how unlikely, how far from the ordinary, and how (seemingly) inexplicable the phenomenon is. The more so, the lower the threshold. I was very surprised (and pleased) to read of this year's winners of the Nobel Prize in Physics: John Hopfield, a professor of Molecular Biology and earlier of Chemistry and Biology, together with Geoffrey Hinton, a professor of Computer Science. Their affiliations name three major scientific fields, none of them being Physics!
AuToMATo: A Parameter-Free Persistence-Based Clustering Algorithm
Huber, Marius, Kalisnik, Sara, Schnider, Patrick
We present AuToMATo, a novel parameter-free clustering algorithm based on persistent homology. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against other parameter-free clustering algorithms, but also that in many instances it significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a parameter-free clustering algorithm. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
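To make "separating significant peaks of a density from non-significant ones" concrete, here is a small sketch of 0-dimensional persistence for the peaks of a 1-D sequence, the idea that ToMATo-style clustering builds on. It is not the AuToMATo implementation: the density is a hand-written list, the neighbourhood graph is simply "adjacent indices", and the threshold `tau` is picked by hand, whereas the abstract describes choosing it through bootstrapping. The function names are illustrative assumptions.

```python
import math


def peak_persistence(values):
    """Map each local maximum (by index) to its persistence.

    Points are visited from highest to lowest value. A point with no
    already-visited neighbour starts a new peak. When a point joins two
    peaks, the lower peak "dies" there and its persistence is its height
    minus the value at the merge point. The global maximum never dies.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    parent = {}       # union-find forest; each root is the highest peak of its component
    persistence = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in order:
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        parent[i] = i
        if not roots:
            continue  # no visited neighbour: i is a new peak
        best = max(roots, key=lambda r: values[r])
        for r in roots:
            if r != best:
                persistence[r] = values[r] - values[i]  # the lower peak dies at i
            parent[r] = best
        parent[i] = best

    for i in parent:
        if find(i) == i:
            persistence[i] = math.inf  # surviving component(s)
    return persistence


def significant_peaks(values, tau):
    """Indices of peaks whose persistence is at least tau (tau chosen by hand here)."""
    return sorted(i for i, p in peak_persistence(values).items() if p >= tau)


density = [0, 3, 1, 5, 2.5, 3, 0]
print(peak_persistence(density))
print(significant_peaks(density, tau=1.0))
```

In this toy sequence the small bump at index 5 has low persistence and is merged into the main peak, while the peak at index 1 survives the threshold; in a clustering setting each surviving peak would correspond to one cluster.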
Auctions with LLM Summaries
Dubey, Kumar Avinava, Feng, Zhe, Kidambi, Rahul, Mehta, Aranyak, Wang, Di
The advent of large language model (LLM) technology has the potential to change the user experience of online services such as internet search, online recommendations (Geng et al., 2022), or shopping (Fan et al., 2023). For example, search platforms and apps, e.g., Microsoft Bing (Microsoft, 2023) and Google Search (Google, 2023), have already experimented with generative AI tools to provide augmented search summarization to facilitate users' search experience. Such summarization (e.g., based on retrieval-augmented generation, RAG (Lewis et al., 2020)) can sometimes provide an efficient way for users to gain useful information in a more condensed space. For queries of a commercial nature, search platforms respond with relevant online advertising. Online search advertising has provided a means not only to connect buyers and sellers, but also to support free internet services to users. Given the exciting potential of LLMs to summarize multiple sources of content and provide a succinct and informative output, it is natural to ask how LLM technology can help improve online advertising. In the ever-evolving landscape of online advertising, auction design has been a critical component towards improving the effectiveness and efficiency of ad delivery. A well-designed auction mechanism not only provides revenue for the platform but also ensures relevancy and value for users and advertisers alike.
Quantifying the redundancy between prosody and text
Wolf, Lukas, Pimentel, Tiago, Fedorenko, Evelina, Cotterell, Ryan, Warstadt, Alex, Wilcox, Ethan, Regev, Tamar
Prosody -- the suprasegmental component of speech, including pitch, loudness, and tempo -- carries critical aspects of meaning. However, the relationship between the information conveyed by prosody vs. by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word's prosodic information is redundant with both the word itself and the context preceding as well as following it. Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
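One way to read "how well a prosodic feature can be predicted from text" as a redundancy measure is as a drop in held-out uncertainty: compare the negative log-likelihood of the feature under a model that ignores the text with one that conditions on it. The sketch below does this with a deliberately tiny stand-in, assuming a single scalar text feature and linear-Gaussian models on synthetic data. It is not the released pipeline, and nothing here uses LLM embeddings; all names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gaussian_nll(targets, means, variance):
    """Average negative log-likelihood (in nats) under N(mean, variance)."""
    total = sum(
        0.5 * math.log(2 * math.pi * variance) + (t - m) ** 2 / (2 * variance)
        for t, m in zip(targets, means)
    )
    return total / len(targets)


def estimate_redundancy(train, test):
    """Held-out estimate of H(prosody) - H(prosody | text feature), in nats."""
    train_x, train_y = zip(*train)
    test_x, test_y = zip(*test)
    slope, intercept = fit_line(train_x, train_y)
    mean_y = sum(train_y) / len(train_y)
    # Variances of the text-blind and the text-conditioned model, fit on train.
    var_marginal = sum((y - mean_y) ** 2 for y in train_y) / len(train_y)
    var_conditional = sum(
        (y - (slope * x + intercept)) ** 2 for x, y in train
    ) / len(train)
    h_marginal = gaussian_nll(test_y, [mean_y] * len(test_y), var_marginal)
    h_conditional = gaussian_nll(
        test_y, [slope * x + intercept for x in test_x], var_conditional
    )
    return h_marginal - h_conditional


def make_pairs(n, slope, noise_sd, rng):
    """Synthetic (text feature, prosodic feature) pairs; purely illustrative."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        pairs.append((x, slope * x + rng.gauss(0.0, noise_sd)))
    return pairs


rng = random.Random(0)
dependent = estimate_redundancy(
    make_pairs(2000, 2.0, 0.5, rng), make_pairs(2000, 2.0, 0.5, rng)
)
independent = estimate_redundancy(
    make_pairs(2000, 0.0, 1.0, rng), make_pairs(2000, 0.0, 1.0, rng)
)
print(f"redundancy when prosody depends on text: {dependent:.3f} nats")
print(f"redundancy when prosody is independent:  {independent:.3f} nats")
```

A clearly positive value means the text feature removes uncertainty about the prosodic feature, and a value near zero means it does not. Swapping the scalar feature for richer representations would follow the same train-then-score-on-held-out pattern.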