Personal
Enhancing Role-playing Systems through Aggressive Queries: Evaluation and Improvement
Tang, Yihong, Ou, Jiao, Liu, Che, Zhang, Fuzheng, Zhang, Di, Gai, Kun
The advent of Large Language Models (LLMs) has propelled dialogue generation into new realms, particularly in the field of role-playing systems (RPSs). While enhanced with ordinary role-relevant training dialogues, existing LLM-based RPSs still struggle to align with roles when handling intricate and trapped queries in boundary scenarios. In this paper, we design the Modular ORchestrated Trap-setting Interaction SystEm (MORTISE) to benchmark and improve the role-playing LLMs' performance. MORTISE can produce highly role-relevant aggressive queries through the collaborative effort of multiple LLM-based modules, and formulate corresponding responses to create an adversarial training dataset via a consistent response generator. We select 190 Chinese and English roles to construct aggressive queries to benchmark existing role-playing LLMs. Through comprehensive evaluation, we find that existing models exhibit a general deficiency in role alignment capabilities. We further select 180 of the roles to collect an adversarial training dataset (named RoleAD) and retain the other 10 roles for testing. Experiments on models improved by RoleAD indicate that our adversarial dataset ameliorates this deficiency, with the improvements demonstrating a degree of generalizability in ordinary scenarios.
Can We Verify Step by Step for Incorrect Answer Detection?
Xu, Xin, Diao, Shizhe, Yang, Can, Wang, Yang
Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce a benchmark, R2PE, designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of information in multiple reasoning chains, we propose the process discernibility score (PDS) framework that beats the answer-checking baseline by a large margin. Concretely, this resulted in an average of 5.1% increase in the F1 score across all 45 subsets within R2PE. We further demonstrate our PDS's efficacy in advancing open-domain QA accuracy. Data and code are available at https://github.com/XinXU-USTC/R2PE.
QuRating: Selecting High-Quality Data for Training Language Models
Wettig, Alexander, Gupta, Aatmik, Malik, Saumya, Chen, Danqi
Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise judgments of texts than at rating the quality of a text directly. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity, as selecting only the highest-rated documents leads to poor results. When we sample using quality ratings as logits over documents, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.
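The abstract's phrase "sample using quality ratings as logits over documents" can be illustrated with a minimal sketch. It assumes a softmax over per-document ratings with a hypothetical `temperature` parameter, and it uses the Gumbel top-k trick to draw documents without replacement. The function names, the temperature value, and the choice of Gumbel top-k are illustrative assumptions, not details confirmed by the abstract.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn ratings into sampling probabilities (temperature is an assumed knob)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_documents(ratings, k, temperature=1.0, seed=0):
    """Draw k distinct document indices, treating ratings as logits.

    Uses the Gumbel top-k trick: add independent Gumbel noise to each
    scaled logit and keep the k largest. This is equivalent to sampling
    without replacement in proportion to softmax(ratings / temperature).
    """
    if k > len(ratings):
        raise ValueError("k cannot exceed the number of documents")
    rng = random.Random(seed)
    noisy = []
    for index, rating in enumerate(ratings):
        u = max(rng.random(), 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((rating / temperature + gumbel, index))
    noisy.sort(reverse=True)
    return [index for _, index in noisy[:k]]


if __name__ == "__main__":
    ratings = [0.1, 2.0, -1.0, 0.5]  # hypothetical quality ratings
    print(softmax(ratings, temperature=2.0))
    print(sample_documents(ratings, k=2, temperature=2.0))
```

A lower temperature concentrates the selection on the highest-rated documents, while a higher one moves it toward uniform sampling. This is one way to read the abstract's point about balancing quality and diversity.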
Diffusion Models for Audio Restoration
Lemercier, Jean-Marie, Richter, Julius, Welker, Simon, Moliner, Eloi, Välimäki, Vesa, Gerkmann, Timo
With the development of audio playback devices and fast data transmission, the demand for high sound quality is rising, for both entertainment and communications. In this quest for better sound quality, challenges emerge from distortions and interferences originating at the recording side or caused by an imperfect transmission pipeline. To address this problem, audio restoration methods aim to recover clean sound signals from the corrupted input data. We present here audio restoration algorithms based on diffusion models, with a focus on speech enhancement and music restoration tasks. Traditional approaches, often grounded in handcrafted rules and statistical heuristics, have shaped our understanding of audio signals. In the past decades, there has been a notable shift towards data-driven methods that exploit the modeling capabilities of deep neural networks (DNNs). Deep generative models, and among them diffusion models, have emerged as powerful techniques for learning complex data distributions. However, relying solely on DNN-based learning approaches carries the risk of reducing interpretability, particularly when employing end-to-end models. Nonetheless, data-driven approaches allow more flexibility in comparison to statistical model-based frameworks whose performance depends on distributional and statistical assumptions that can be difficult to guarantee. Here, we aim to show that diffusion models can combine the best of both worlds and offer the opportunity to design audio restoration algorithms with a good degree of interpretability and a remarkable performance in terms of sound quality.
Darwin Turing Dawkins: Building a General Theory of Evolution
Living things, computers, societies, and even books are part of a grand evolutionary struggle to survive. That struggle shapes nature, nations, religions, art, science, and you. What you think, feel, and do is determined by it. Darwinian evolution does not apply solely to the genes that are stored in DNA. Using the insights of Alan Turing and Richard Dawkins, we will see that it also applies to the memes we store in our brains and the information we store in our computers. The next time you run for president, fight a war, or just deal with the ordinary problems humans are heir to, perhaps this book will be of use. If you want to understand why and when you will die, or if you want to achieve greatness this book may help. If you are concerned about where the computer revolution is headed, this book may provide some answers.
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Łajszczak, Mateusz, Cámbara, Guillermo, Li, Yang, Beyhan, Fatih, van Korlaar, Arent, Yang, Fan, Joly, Arnaud, Martín-Cortinas, Álvaro, Abbas, Ammar, Michalski, Adam, Moinet, Alexis, Karlapati, Sri, Muszyńska, Ewa, Guo, Haohan, Putrycz, Bartosz, Gambino, Soledad López, Yoo, Kayeon, Sokolova, Elena, Drugman, Thomas
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
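The abstract mentions compressing speechcodes with byte-pair encoding. As a rough illustration of that general technique only, the sketch below applies textbook BPE merges to a sequence of integer codes. The function names, the toy sequence, and the ID scheme for merged tokens are assumptions made for this example and do not describe the BASE TTS tokenizer itself.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return (pair, count) for the most common adjacent pair, or None."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to num_merges BPE merges to a list of integer codes.

    Stops early once no adjacent pair occurs at least twice.
    Returns the shortened sequence and the learned merge table.
    """
    seq = list(seq)
    merges = {}
    for step in range(num_merges):
        best = most_frequent_pair(seq)
        if best is None or best[1] < 2:
            break
        pair = best[0]
        new_id = first_new_id + step  # hypothetical ID scheme for merged codes
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


if __name__ == "__main__":
    codes = [1, 2, 1, 2, 3, 1, 2]  # toy stand-in for discrete speechcodes
    print(bpe_compress(codes, num_merges=5, first_new_id=100))
```

Each merge shortens the sequence the autoregressive model has to predict, at the cost of a larger code vocabulary.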
Voices of the dead: shooting victims plead for gun reform with AI-voice messages
Six years ago today, Joaquin Oliver was killed in a hallway outside his Florida classroom, one of 17 students and staff murdered in the worst high school shooting in the US. On Wednesday, lawmakers in Washington DC will hear his voice, recreated by artificial intelligence, in phone calls demanding to know why they've done nothing to tackle the plague of gun violence. "It's been six years and you've done nothing. Not a thing to stop all the shootings that have happened since," the message from Oliver, who was 17 when he died in the 2018 Valentine's Day tragedy at Parkland's Marjory Stoneman Douglas high school, says. "I'm back today because my parents used AI to recreate my voice to call you. Other victims like me will be calling too, again and again, to demand action. How many calls will it take for you to care? How many dead voices will you hear before you finally listen?"
Long-form evaluation of model editing
Rosati, Domenic, Gonzales, Robie, Chen, Jinkun, Yu, Xuemin, Erkan, Melis, Kayani, Yahya, Chavatapalli, Satya Deepika, Rudzicz, Frank, Sajjad, Hassan
Evaluations of model editing currently only use the 'next few token' completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings including internal consistency, lexical cohesion, and locality issues.
HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation
Fang, Yihao, Thomas, Stephen W., Zhu, Xiaodan
With the widespread adoption of large language models (LLMs) in numerous applications, the challenge of factuality and the propensity for hallucinations raises significant concerns. To address this issue, particularly in retrieval-augmented in-context learning, we introduce the hierarchical graph of thoughts (HGOT), a structured, multi-layered graph approach designed to enhance the retrieval of pertinent passages during in-context learning. The framework utilizes the emergent planning capabilities of LLMs, employing the divide-and-conquer strategy to break down complex queries into manageable sub-queries. It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics to assess the quality of thoughts, linking an answer's credibility intrinsically to the thought's quality. This methodology introduces a weighted system in majority voting, prioritizing answers based on the citation quality of their thoughts. Additionally, we propose a scoring mechanism for evaluating retrieved passages, considering factors such as citation frequency and quality, self-consistency confidence, and the retrieval module's ranking. Experiments reveal that HGOT outperforms other retrieval-augmented in-context learning methods, including Demonstrate-Search-Predict (DSP), ReAct, Self-Ask, and Retrieve-then-Read on different datasets by as much as 7%, demonstrating its efficacy in enhancing the factuality of LLMs.
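The weighted majority voting described in this abstract can be sketched as follows. The abstract does not say how citation recall and precision are combined into a single quality weight, so the harmonic mean used here is an assumption, as are the function names and the toy inputs.

```python
from collections import defaultdict


def citation_quality(recall, precision):
    """Combine citation recall and precision (harmonic mean; an assumed choice)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def weighted_majority_vote(candidates):
    """Pick the answer with the largest total citation-quality weight.

    candidates: iterable of (answer, citation_recall, citation_precision),
    one entry per sampled thought. Returns (best_answer, totals_by_answer).
    """
    totals = defaultdict(float)
    for answer, recall, precision in candidates:
        totals[answer] += citation_quality(recall, precision)
    if not totals:
        raise ValueError("no candidate answers to vote on")
    best = max(totals, key=totals.get)
    return best, dict(totals)


if __name__ == "__main__":
    # Hypothetical thoughts: two weakly cited votes versus one well-cited vote.
    thoughts = [
        ("Lyon", 0.5, 0.3),
        ("Lyon", 0.5, 0.3),
        ("Paris", 1.0, 0.8),
    ]
    print(weighted_majority_vote(thoughts))
```

In this toy input an unweighted vote would choose "Lyon" two to one, whereas the weighted vote favors the single well-cited "Paris" thought.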
Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk
Meta's chief AI scientist, Yann LeCun, received another accolade to add to his long list of awards on Sunday, when he was recognized with a TIME100 Impact Award for his contributions to the world of artificial intelligence. Ahead of the award ceremony in Dubai, LeCun sat down with TIME to discuss the barriers to achieving "artificial general intelligence" (AGI), the merits of Meta's open-source approach, and what he sees as the "preposterous" claim that AI could pose an existential risk to the human race. TIME spoke with LeCun on Jan. 26. This conversation has been condensed and edited for clarity. Many people in the tech world today believe that training large language models (LLMs) on more computing power and more data will lead to artificial general intelligence.