meaningfulness
Aligned explanations in neural networks
Lobet, Corentin, Chiaromonte, Francesca
Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model's prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.
- Asia > Middle East > Jordan (0.04)
- North America > United States (0.04)
- Europe > Italy (0.04)
- Asia > India > Karnataka > Bengaluru (0.04)
A Variational Framework for Improving Naturalness in Generative Spoken Language Models
Chen, Li-Wei, Higuchi, Takuya, Aldeneh, Zakaria, Abdelaziz, Ahmed Hussen, Rudnicky, Alexander
The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada (0.04)
- (3 more...)
Towards a Japanese Full-duplex Spoken Dialogue System
Ohashi, Atsumoto, Iizuka, Shinya, Jiang, Jingjing, Higashinaka, Ryuichiro
Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversations such as speech overlaps and backchannels, have attracted significant attention recently. However, the study of full-duplex spoken dialogue systems for the Japanese language has been limited, and the research on their development in Japanese remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, which is built upon Moshi, a full-duplex dialogue model in English. Our model is trained through a two-stage process: pre-training on a large-scale spoken dialogue data in Japanese, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that the trained model outperforms Japanese baseline models in both naturalness and meaningfulness.
NanoVLMs: How small can we go and still make coherent Vision Language Models?
Agarwalla, Mukund, Kumar, Himanshu, Dandekar, Raj, Dandekar, Rajat, Panat, Sreedath
Vision-Language Models (VLMs), such as GPT-4V and Llama 3.2 vision, have garnered significant research attention for their ability to leverage Large Language Models (LLMs) in multimodal tasks. However, their potential is constrained by inherent challenges, including proprietary restrictions, substantial computational demands, and limited accessibility. Smaller models, such as GIT and BLIP, exhibit marked limitations, often failing to generate coherent and consistent text beyond a few tokens, even with extensive training. This underscores a pivotal inquiry: how small can a VLM be and still produce fluent and consistent text? Drawing inspiration from the exceptional learning process of 3-4 year old children, who rely heavily on visual cues for understanding and communication, we introduce two novel datasets: ShortDesc (featuring concise image descriptions) and LongDesc (containing more detailed image descriptions). These datasets consist of image-text pairs where the text is restricted to the simple vocabulary and syntax typically used by young children, generated with a scaled- down model, GPT-4o. Using these datasets, we demonstrate that it is possible to train VLMs that are significantly smaller, up to 10 times smaller than state of the art(SOTA) small VLMs while maintaining architectural simplicity. To evaluate the outputs, we leverage GPT-4o to grade the text, as if stories written by students, on creativity, meaningfulness, and consistency, assigning scores out of 10. This method addresses limitations of standard benchmarks by accommodating unstructured outputs and providing a multidimensional evaluation of the model capabilities. Our findings contribute to the development of lightweight, accessible multimodal models for resource constrained environments.
On Verbalized Confidence Scores for LLMs
Yang, Daniel, Tsai, Yao-Hung Hubert, Yamada, Makoto
The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust into their responses, but also allows LLM agents to make more informed decisions based on each other's uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at https://github.com/danielyxyang/llm-verbalized-uq .
- Europe > Italy (0.04)
- Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space
Chen, Zhonghan, Zhang, Ruiyuan, Zhao, Xi, Cheng, Xiaojun, Zhou, Xiaofang
Dense high dimensional vectors are becoming increasingly vital in fields such as computer vision, machine learning, and large language models (LLMs), serving as standard representations for multimodal data. Now the dimensionality of these vector can exceed several thousands easily. Despite the nearest neighbor search (NNS) over these dense high dimensional vectors have been widely used for retrieval augmented generation (RAG) and many other applications, the effectiveness of NNS in such a high-dimensional space remains uncertain, given the possible challenge caused by the "curse of dimensionality." To address above question, in this paper, we conduct extensive NNS studies with different distance functions, such as $L_1$ distance, $L_2$ distance and angular-distance, across diverse embedding datasets, of varied types, dimensionality and modality. Our aim is to investigate factors influencing the meaningfulness of NNS. Our experiments reveal that high-dimensional text embeddings exhibit increased resilience as dimensionality rises to higher levels when compared to random vectors. This resilience suggests that text embeddings are less affected to the "curse of dimensionality," resulting in more meaningful NNS outcomes for practical use. Additionally, the choice of distance function has minimal impact on the relevance of NNS. Our study shows the effectiveness of the embedding-based data representation method and can offer opportunity for further optimization of dense vector-related applications.
- Asia > China > Hong Kong (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Nakata, Wataru, Seki, Kentaro, Yanaka, Hitomi, Saito, Yuki, Takamichi, Shinnosuke, Saruwatari, Hiroshi
Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure hiqh-quality speech generation, the data must be spontaneous like in-wild data and must be acoustically clean with noise removed. Despite the critical need, no open-source corpus meeting all these criteria has been available. This study addresses this gap by constructing and releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly accessible. Furthermore, this paper presents a language-independent method for corpus construction and describes experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the collected data from multiple domains by our method improve the naturalness and meaningfulness of dialogue generation.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
The Two Word Test: A Semantic Benchmark for Large Language Models
Riccardi, Nicholas, Desai, Rutvik H.
Large Language Models (LLMs) have shown remarkable abilities recently, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or 'true' understanding of language, and even Artificial General Intelligence (AGI). Here, we provide a new open-source benchmark that can assess semantic abilities of LLMs using two-word phrases using a task that can be performed relatively easily by humans without advanced training. Combining multiple words into a single concept is a fundamental aspect of human language and intelligence. The test requires meaningfulness judgments of 1768 noun-noun combinations that have been rated as meaningful (e.g., baby boy) or not meaningful (e.g., goat sky). by 150 human raters. We provide versions of the task that probe meaningfulness ratings on a 0-4 scale as well as binary judgments. We conducted a series of experiments using the TWT on GPT-4, GPT-3.5, and Bard, with both versions. Results demonstrated that, compared to humans, all models perform poorly at rating meaningfulness of these phrases. GPT-3.5 and Bard are also unable to make binary discriminations between sensible and nonsense phrases as making sense. GPT-4 makes a substantial improvement in binary discrimination of combinatorial phrases but is still significantly worse than human performance. The TWT can be used to understand the limitations and weaknesses of current LLMs, and potentially improve them. The test also reminds us that caution is warranted in attributing 'true understanding' or AGI to LLMs. TWT is available at: https://github.com/NickRiccardi/two-word-test
- North America > United States > South Carolina (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > Minnesota (0.04)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.46)
INFACT: An Online Human Evaluation Framework for Conversational Recommendation
Manzoor, Ahtsham, jannach, Dietmar
Conversational recommender systems (CRS) are interactive agents that support their users in recommendation-related goals through multi-turn conversations. Generally, a CRS can be evaluated in various dimensions. Today's CRS mainly rely on offline (computational) measures to assess the performance of their algorithms in comparison to different baselines. However, offline measures can have limitations, for example, when the metrics for comparing a newly generated response with a ground truth do not correlate with human perceptions, because various alternative generated responses might be suitable too in a given dialog situation. Current research on machine learning-based CRS models therefore acknowledges the importance of humans in the evaluation process, knowing that pure offline measures may not be sufficient in evaluating a highly interactive system like a CRS. In this work, we provide a user-centric evaluation approach to conversational recommendation along with the INFACT, an onlIne humaN evaluation Framework for conversAtional reCommender sysTems, which can be used to assess the suitability of system responses in a given dialog situation. The INFACT framework is prepared to enable the crowdsourcing of the evaluation task, where various CRS can be integrated for comparison. We have successfully applied the INFACT framework for conducting a number of user studies in our previous research. We believe that our study design along with the INFACT framework can be helpful in facilitating user-centric studies in domains such as dialog systems, machine translation, or Q&A.
- North America > United States (0.04)
- Europe > Austria (0.04)
- Asia > Japan > Shikoku > Ehime Prefecture > Matsuyama (0.04)
Why the future of AI in Healthcare is Salutogenesis
With some even suggesting machines will lead the future of care. But with the risk of error still too high, and covid19 highlighting how much we need a human touch, is the future of Artificial Intelligence salutogenic rather than pathogenic? In 1979 the medical sociologist Aron Antonovsky proposed an approach termed'salutogenesis,' a theory of how and why certain people stay healthy. The word'salutogenesis' comes from the Latin word salus, meaning health, and the Greek word genesis meaning origin. As a medical approach, it focuses on factors that support human health and well-being, rather than on factors that cause disease (pathogenesis).
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.50)