ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions
Ghosh, Sreyan, Tyagi, Utkarsh, Kumar, Sonal, Evuru, C. K., Ramaneswaran, S, Sakshi, S, Manocha, Dinesh
We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset with abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, like style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. ABEX outperforms all our baselines quantitatively with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from literature in terms of context and length diversity.
Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies
RRV, Aswin, Tyagi, Nemika, Uddin, Md Nayem, Varshney, Neeraj, Baral, Chitta
This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines, users may recall fragments of misleading keywords and submit them to an LLM, hoping for a comprehensive response. Our empirical analysis of several LLMs shows the potential danger of these models amplifying misinformation when presented with misleading keywords. Additionally, we thoroughly assess four existing hallucination mitigation strategies to reduce LLMs' sycophantic behavior. Our experiments demonstrate the effectiveness of these strategies for generating factually correct statements. Furthermore, our analyses delve into knowledge-probing experiments on factual keywords and different categories of sycophancy mitigation.
Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs
Zhou, Shang, Yao, Feng, Dong, Chengyu, Wang, Zihan, Shang, Jingbo
Controlling the attribute intensity of text generation is crucial across scenarios (e.g., writing conciseness, chatting emotion, and explanation clarity). The remarkable capabilities of large language models (LLMs) have revolutionized text generation, prompting us to explore such smooth control of LLM generation. Specifically, we propose metrics to assess the range, calibration, and consistency of the generated text's attribute intensity in response to varying control values, as well as its relevance to the intended context. To quantify the attribute intensity and context relevance, we propose an effective evaluation framework leveraging the Elo rating system and GPT4, both renowned for their robust alignment with human judgment. We look into two viable training-free methods for achieving smooth control of LLMs: (1) Prompting with semantic shifters, and (2) Modifying internal model representations. The evaluations of these two methods are conducted on 5 different attributes with various models. Our code and dataset can be obtained from https://github.com/ShangDataLab/Smooth-Control.
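The Elo-based scoring mentioned in the abstract can be illustrated with a minimal sketch. It assumes that pairwise judgments ("text A shows the attribute more strongly than text B") are already available as a list of (winner, loser) index pairs; in the paper these judgments are said to come from GPT4, but here they are simply given as input. The function names, the initial rating of 1000, and the K-factor of 32 are illustrative assumptions, not values taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def rate_texts(n_texts: int, comparisons, initial: float = 1000.0, k: float = 32.0):
    """Turn (winner, loser) index pairs into one Elo rating per text.

    `comparisons` is hypothetical input: each pair says which of two
    generated texts a judge found to have the stronger attribute intensity.
    """
    ratings = [initial] * n_texts
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = update_elo(
            ratings[winner], ratings[loser], 1.0, k
        )
    return ratings


if __name__ == "__main__":
    # Three texts; text 0 beats 1 and 2, text 1 beats 2.
    print(rate_texts(3, [(0, 1), (0, 2), (1, 2)]))
```

The resulting ratings give a scalar intensity score per text, which is the kind of quantity that range, calibration, and consistency metrics could then be computed over.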
Speaker Verification in Agent-Generated Conversations
Yang, Yizhe, Achananuparp, Palakorn, Huang, Heyan, Jiang, Jing, Lim, Ee-Peng
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers to enhance their abilities to perform both general and special purpose dialogue tasks. However, the ability to personalize the generated utterances to speakers, whether conducted by a human or an LLM, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under experiment setups. We further utilize the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that the current role-playing models fail to accurately mimic speakers, primarily due to their inherent linguistic characteristics.
BIPED: Pedagogically Informed Tutoring System for ESL Education
Kwon, Soonwoo, Kim, Sojung, Park, Minju, Lee, Seunghyun, Kim, Kyuseok
As Large Language Models (LLMs) such as GPT (Achiam et al., 2023) revolutionize the field of natural language generation, both researchers and practitioners have put an increasing amount of effort into developing Conversational Intelligent Tutoring Systems (CITS) that leverage the generative capabilities of LLMs (Tack and Piech, 2022; Abdelghani et al., 2022; Park et al., 2024; Lee et al., 2023). Specifically, LLMs have the potential to teach English as a Second/Foreign Language (ESL/EFL), for they may serve as readily available tutors that can emulate native-speaking contexts (Park et al., 2024; Lee et al., 2023).
Thereafter, we analyzed the dataset post-hoc from a pedagogical viewpoint and developed a categorization of dialogue acts, which comprises 34 tutor acts and 9 student acts. Finally, we annotated the data using the defined dialogue act categories. As for the development of CITS, we employ the framework (Macina et al., 2023b; Wang et al., 2023a) whereby the LLM first chooses the suitable tutor act, then generates the corresponding utterance. We believe this approach enables the model to generate a more focused response that does not deviate from the chosen tutor intent. We consider two implementations of such CITS, one [...]
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space
Zhang, Shaolei, Yu, Tian, Feng, Yang
Large Language Models (LLMs) sometimes suffer from producing hallucinations; in particular, LLMs may generate untruthful responses despite knowing the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking the LLM's knowledge potential. In this paper, we propose TruthX, an inference-time intervention method to activate the truthfulness of an LLM by identifying and editing the features within the LLM's internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the truthfulness of the LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control an LLM to produce truthful or hallucinatory responses via editing only one vector in the LLM's internal representations.
Exploring Precision and Recall to assess the quality and diversity of LLMs
Bronnec, Florian Le, Verine, Alexandre, Negrevergne, Benjamin, Chevaleyre, Yann, Allauzen, Alexandre
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current LLMs face in generating diverse and high-quality text. We release our code and data.
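To make the idea of distribution-level Precision and Recall concrete, here is a minimal sketch of one estimator known from image generation: the k-nearest-neighbour variant, in which a sample counts as "covered" if it falls inside the k-NN ball of at least one sample from the other set. The abstract does not state which estimator the authors use, so this choice is an assumption, as are the function names and the use of plain coordinate tuples in place of real text embeddings.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k):
    """Fraction of queries lying inside the k-NN ball of some support point."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)


def precision_recall(real, generated, k=3):
    """Precision: generated samples covered by the real set (quality).
    Recall: real samples covered by the generated set (diversity).
    Inputs are lists of embedding vectors (assumed precomputed)."""
    precision = coverage(generated, real, k)
    recall = coverage(real, generated, k)
    return precision, recall


if __name__ == "__main__":
    # Toy 2-D "embeddings": real data spread over four corners,
    # generated data clustered near a single corner.
    real = [(0, 0), (10, 0), (0, 10), (10, 10)]
    generated = [(1, 1), (2, 1), (1, 2)]
    print(precision_recall(real, generated, k=1))
```

In the toy data, the generated points all sit close to one real point, so they look plausible (high precision) while leaving most of the real distribution uncovered (low recall). This is the kind of quality versus diversity trade-off the abstract describes.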
Arm CEO: Apple 'woke up the industry on the art of the possible'
As Qualcomm-powered Windows on Arm PCs begin appearing here at Computex, ushering in a generation of AI-infused Copilot laptops, it seemed appropriate to interview a major player in the push: Arm, the semiconductor design company that licenses CPUs to companies like Qualcomm, Apple, and Samsung. Arm dominates in smartphones and tablets, and now, true PC contention finally seems possible. I sat down with chief executive Rene Haas in Taipei, touching upon everything from NPUs, to how Arm solved its Windows app gap, to why Intel, AMD, and Qualcomm don't matter to the success of Windows on Arm PCs. And he has nothing but praise for Apple's M-series Macs, which he says "woke up the industry on the art of the possible" with Arm laptops. "I think Apple silicon has really proven that you could build a first-class laptop and have no compromises," Haas said. This interview has been slightly edited for length and clarity.
Graph Neural Network Enhanced Retrieval for Question Answering of LLMs
Li, Zijian, Guo, Qingyan, Shao, Jiawei, Song, Lei, Bian, Jiang, Zhang, Jun, Wang, Rui
Retrieval augmented generation has revolutionized large language model (LLM) outputs by providing factual support. Nevertheless, it struggles to capture all the necessary knowledge for complex reasoning questions. Existing retrieval methods typically divide reference documents into passages, treating them in isolation. These passages, however, are often interrelated, such as passages that are contiguous or share the same keywords. Therefore, recognizing the relatedness is crucial for enhancing the retrieval process. In this paper, we propose a novel retrieval method, called GNN-Ret, which leverages graph neural networks (GNNs) to enhance retrieval by considering the relatedness between passages. Specifically, we first construct a graph of passages by connecting passages that are structure-related and keyword-related. A graph neural network (GNN) is then leveraged to exploit the relationships between passages and improve the retrieval of supporting passages. Furthermore, we extend our method to handle multi-hop reasoning questions using a recurrent graph neural network (RGNN), named RGNN-Ret. At each step, RGNN-Ret integrates the graphs of passages from previous steps, thereby enhancing the retrieval of supporting passages. Extensive experiments on benchmark datasets demonstrate that GNN-Ret achieves higher accuracy for question answering with a single query of LLMs than strong baselines that require multiple queries, and RGNN-Ret further improves accuracy and achieves state-of-the-art performance, with up to 10.4% accuracy improvement on the 2WikiMQA dataset.
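The graph-construction step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: passages are given as (doc_id, text) pairs in document order, "structure-related" is read as "adjacent within the same document", "keyword-related" is read as "sharing at least one keyword", and keyword extraction is assumed to have happened already. The function name and input layout are hypothetical, and the GNN that operates on the graph is not shown.

```python
from itertools import combinations


def build_passage_graph(passages, keywords):
    """Build an undirected passage graph as a set of (i, j) edges with i < j.

    passages: list of (doc_id, text) tuples, in document order.
    keywords: list of sets, one keyword set per passage (assumed precomputed).
    """
    edges = set()

    # Structure-related edges: consecutive passages of the same document.
    for i in range(len(passages) - 1):
        if passages[i][0] == passages[i + 1][0]:
            edges.add((i, i + 1))

    # Keyword-related edges: any two passages sharing at least one keyword.
    for i, j in combinations(range(len(passages)), 2):
        if keywords[i] & keywords[j]:
            edges.add((i, j))

    return edges


if __name__ == "__main__":
    passages = [
        ("doc1", "Paris is the capital of France."),
        ("doc1", "France is in Western Europe."),
        ("doc2", "The Seine flows through Paris."),
    ]
    keywords = [{"paris"}, {"europe"}, {"paris", "seine"}]
    print(sorted(build_passage_graph(passages, keywords)))
```

In this toy input, passages 0 and 1 are linked because they are contiguous in the same document, and passages 0 and 2 are linked because they share a keyword, even though they come from different documents.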
VOICE: Variance of Induced Contrastive Explanations to quantify Uncertainty in Neural Network Interpretability
Prabhushankar, Mohit, AlRegib, Ghassan
In this paper, we visualize and quantify the predictive uncertainty of gradient-based post hoc visual explanations for neural networks. Predictive uncertainty refers to the variability in the network predictions under perturbations to the input. Visual post hoc explainability techniques highlight features within an image to justify a network's prediction. We theoretically show that existing evaluation strategies of visual explanatory techniques partially reduce the predictive uncertainty of neural networks. This analysis allows us to construct a plug-in approach to visualize and quantify the remaining predictive uncertainty of any gradient-based explanatory technique. We show that every image, network, prediction, and explanatory technique has a unique uncertainty. The proposed uncertainty visualization and quantification yield two key observations. Firstly, oftentimes under incorrect predictions, explanatory techniques are uncertain about the same features that they are attributing the predictions to, thereby reducing the trustworthiness of the explanation. Secondly, objective metrics of an explanation's uncertainty empirically behave similarly to epistemic uncertainty. We support these observations on two datasets, four explanatory techniques, and six neural network architectures. The code is available at https://github.com/olivesgatech/VOICE-Uncertainty.