Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
Recent work shows promising results in expanding the capabilities of large language models (LLMs) to directly understand and synthesize speech. However, an LLM-based strategy for modeling spoken dialogs remains elusive, calling for further investigation. This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. We have verified the inclusion of prosody in speech tokens that predominantly contain semantic information and have used this foundation to construct a prosody-infused speech-text model. Additionally, we propose a generalized speech-text pretraining scheme that enhances the capture of cross-modal semantics. To construct USDM, we fine-tune our speech-text model on spoken dialog data using a multi-step spoken dialog template that stimulates the chain-of-reasoning capabilities exhibited by the underlying LLM. Automatic and human evaluations on the DailyTalk dataset demonstrate that our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines. Our code and checkpoints are available at https://github.com/naverai/usdm.
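To make the multi-step idea concrete, here is a purely hypothetical sketch of what such a chain-of-reasoning dialog template could look like (input speech tokens, then a transcript, then the response text, then response speech tokens). The markers, token names, and format below are illustrative assumptions, not the released USDM template.

```python
# Hypothetical multi-step spoken dialog template: speech tokens -> transcript
# -> response text -> response speech tokens. All markers and token names
# here are assumptions for illustration only.
TEMPLATE = (
    "<speech>{input_speech_tokens}</speech>\n"
    "Transcript: {input_transcript}\n"
    "Response: {response_text}\n"
    "<speech>{response_speech_tokens}</speech>"
)

example = TEMPLATE.format(
    input_speech_tokens="<s_412><s_87>...",   # unit tokens from a speech tokenizer
    input_transcript="How are you doing today?",
    response_text="I'm doing great, thanks for asking!",
    response_speech_tokens="<s_93><s_501>...",
)
print(example)
```

Routing the generation through an intermediate transcript and text response, rather than mapping speech tokens directly to speech tokens, is what lets the underlying LLM apply its text reasoning to the spoken exchange.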
Supplementary Material
The train, test, and validation splits for SST2 [47] and SST5 [47] are used from the source itself, while the validation data for TREC6 [35, 18] is obtained using 10% of the train data. The test data for glue-SST2 [51] is obtained using 5% of the train data. A seed value of 42 is used for the generator argument of torch's random_split function. In Table 1, we summarize the number of classes and the number of instances in each split of the text datasets.
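A minimal sketch of this splitting procedure, assuming the datasets are already loaded as torch-compatible datasets; the 10% validation fraction and the seed of 42 follow the text above, while the function and variable names are placeholders.

```python
import torch
from torch.utils.data import random_split

def split_with_seed(dataset, val_fraction, seed=42):
    # Carve a validation set off the training data, seeding the generator
    # so the split is reproducible across runs.
    val_size = int(len(dataset) * val_fraction)
    train_size = len(dataset) - val_size
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)

# e.g., 10% of the TREC6 train data becomes validation
# (`trec6_train` is a placeholder for the loaded dataset):
# trec6_train, trec6_val = split_with_seed(trec6_train, val_fraction=0.10)
```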
This new AI tool changes a speaker's accent to American English in real-time - hear for yourself
Krisp, an AI startup known for its noise cancellation and transcription services, is launching a new AI tool that can convert a speaker's accent to American English in real time. The company claims the tool can help native speakers understand non-native English speakers "more easily, without changing [their] natural voice and vocal traits." Krisp is initially rolling out support for converting 17 Indian dialects into US English but plans to expand to Filipino and other accents in the future. The tool is compatible with Zoom, Microsoft Teams, Google Meet, and other meeting platforms; as long as users have access to Krisp's existing desktop app, the tool can "clarify" accents. According to Krisp's website, Indian accents were the first the company chose to work on because people from the region make up a large segment of the global workforce, especially within STEM fields.
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Supplementary Materials
Figure 5: t-SNE plots illustrating the effectiveness of random sampling for the majority species in the Fish-10K dataset. Randomly sampled images are shown as blue dots, while the remaining data points are represented by red dots. Image vector representations are obtained from a VGG-19 model pretrained on the ImageNet dataset.

We collected images of three taxonomic groups of organisms: fish, birds, and butterflies, each containing around 10K images. Images for fish (Fish-10K) were curated from the larger image collection FishAIR [1], which contains images from the Great Lakes Invasive Network Project (GLIN) [2]. We created the Fish-10K dataset by randomly sampling 10K images and preprocessing the images to crop them and remove the background. To ensure diversity within Fish-10K, we applied a targeted sampling strategy to the source collection, FishAIR [1]. Specifically, we retained all images of species with fewer than 200 images, treating these as minority or rare classes; random sampling was applied only to the majority species, i.e., those with more than 200 images per class, as sketched below. To assess potential sampling bias among the majority species, we generated feature vectors for each image in Fish-10K using a pretrained VGG-19 model. Our analysis shows that the distribution of sampled images closely mirrors the distribution of images that were not included in the dataset (denoted as "others" in the plot). This suggests that our random sampling approach provides a sufficiently accurate representation of the original distribution for the majority species.
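A minimal sketch of the targeted sampling strategy described above. Here `records` is a hypothetical list of (image_path, species) pairs from the source collection; the 200-image threshold and the 10K target size follow the text, while the function name and seed are assumptions.

```python
import random
from collections import defaultdict

def targeted_sample(records, threshold=200, target_size=10_000, seed=42):
    by_species = defaultdict(list)
    for path, species in records:
        by_species[species].append(path)

    # Keep every image of minority/rare species (fewer than `threshold` images).
    kept = [p for paths in by_species.values() if len(paths) < threshold
            for p in paths]

    # Randomly sample from the majority species to fill the remaining budget.
    majority = [p for paths in by_species.values() if len(paths) >= threshold
                for p in paths]
    rng = random.Random(seed)
    budget = max(target_size - len(kept), 0)
    kept += rng.sample(majority, min(budget, len(majority)))
    return kept
```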
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
Images are increasingly becoming the currency for documenting biodiversity on the planet, providing novel opportunities for accelerating scientific discoveries in the field of organismal biology, especially with the advent of large vision-language models (VLMs). We ask if pre-trained VLMs can aid scientists in answering a range of biologically relevant questions without any additional fine-tuning. In this paper, we evaluate the effectiveness of 12 state-of-the-art (SOTA) VLMs in the field of organismal biology using a novel dataset, VLM4Bio, consisting of 469K question-answer pairs involving 30K images from three groups of organisms: fishes, birds, and butterflies, covering five biologically relevant tasks. We also explore the effects of applying prompting techniques and tests for reasoning hallucination on the performance of VLMs, shedding new light on the capabilities of current SOTA VLMs in answering biologically relevant questions using images.
Appendix: No-regret Algorithms for Fair Resource Allocation
We provide a more comprehensive review of the fair machine learning literature in this section. Multiple definitions have been used to quantify the fairness of machine learning algorithms. Hardt et al. [2016] introduced equality of opportunity as a fairness criterion, which ensures that individuals have an equal chance of being correctly classified by machine learning algorithms, regardless of protected attributes such as race or gender. Kleinberg et al. [2017] formalized three different notions of fairness and showed that no algorithm can satisfy all of them simultaneously, demonstrating the inherent trade-offs between competing notions of fairness. Another prevalent fairness criterion is the price of fairness, introduced by Bertsimas et al. [2011], which quantifies how much the aggregate utility is reduced by enforcing fairness.
AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging these options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs based on the approximate correctness of outputs from a smaller LM. Central to AutoMix are two key technical contributions. First, a few-shot self-verification mechanism estimates the reliability of the smaller LM's outputs without requiring extensive training. Second, since self-verification can be noisy, a POMDP-based router selects an appropriately sized model based on answer confidence. Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
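A simplified sketch of confidence-based routing in the spirit of the approach above. The actual system uses few-shot self-verification and a POMDP router; here a plain confidence threshold stands in for both, and `small_lm`, `large_lm`, and `verify` are hypothetical callables.

```python
def route_query(query, small_lm, large_lm, verify, threshold=0.7):
    # Draft an answer with the cheap model, then estimate its correctness.
    draft = small_lm(query)
    confidence = verify(query, draft)  # estimated correctness in [0, 1]
    if confidence >= threshold:
        return draft           # accept the cheap answer
    return large_lm(query)     # escalate to the larger, costlier model
```

The cost savings come from answering easy queries with the small model alone and paying for the large model only when the verifier signals low confidence.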