 Large Language Model


P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting

Neural Information Processing Systems

While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate a speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than real time. Unlike the neural codec language models, we train P-Flow specifically on the LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20× faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred in human likeness and speaker similarity to its recent state-of-the-art counterparts, making P-Flow an attractive and desirable alternative. We provide audio samples on our demo page.
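To make the idea of a flow matching decoder more concrete, the following is a minimal sketch of generic conditional flow matching with a straight-line path between noise and data. It is not P-Flow's actual decoder: the function names, the `sigma_min` parameter, and the fixed-step Euler sampler are assumptions chosen for illustration, and plain Python lists stand in for mel-spectrogram frames.

```python
# Generic conditional flow matching sketch (illustrative; not the P-Flow implementation).
# All names here (interpolate, target_velocity, euler_sample, sigma_min) are hypothetical.

def interpolate(x0, x1, t, sigma_min=0.0):
    """Point at time t on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=0.0):
    """Regression target for the decoder: the derivative of interpolate() with respect to t."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a network would be regressed onto `target_velocity` at points produced by `interpolate`; at inference, `euler_sample` is called with the trained network as `velocity_fn`. A small number of integration steps is one reason samplers of this kind can be fast, although the source does not state how many steps P-Flow uses.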


e6d58fc68c0f3c36ae6e0e64478a69c0-Supplemental-Conference.pdf

Neural Information Processing Systems

It consists of an image encoder with a Vision Transformer [17] architecture, a text encoder with a similar Transformer architecture, and heads that predict bounding boxes and label scores from provided images and text queries. Input(s): An image and a list of free-text object descriptions (queries).


e6c2e85db1f1039177c4495ccd399ac4-Supplemental-Conference.pdf

Neural Information Processing Systems

A.1 Preliminary Study

The basic GPT-2 model, which has 12 transformer blocks and 12 attention heads with 768 hidden dimensions, is trained from scratch on each corpus. The Huggingface transformers [4] and PyTorch [2] toolkits are used to train the GPT-2 model in a distributed manner on an A100 GPU server. The hyper-parameters used during training are shown in Table 1.

Hyper-parameter            Value
Optimization steps         100K
Test interval              10K
Dropout rate               0.1
Grad clipping              1.0
Learning rate              5e-5
Batch size                 128
Maximum sequence length    256
Warmup steps               10K
Learning scheduler         Linear decay
Random seed                0
Number of GPUs             4
Learning objective         Cross-Entropy Loss

Table 1: The hyper-parameters during the GPT-2 training procedure.

Most of the hyper-parameters for our proposed method are the same as those in Table 1, to keep the comparison controlled. The hyper-parameters specific to our proposed method are the length of the repetitive n-gram and its repetition dropout rate p, which are set to 2 and 0.6, respectively.
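The schedule implied by the hyper-parameters above can be sketched as a small function. This is only an illustration under stated assumptions: the learning rate is read as 5e-5, the warmup is assumed to rise linearly from zero over the first 10K steps, and the linear decay is assumed to reach zero at step 100K. The source names only "Warmup steps" and "Linear decay", so the exact shape and the function name `linear_warmup_decay_lr` are hypothetical.

```python
def linear_warmup_decay_lr(step, peak_lr=5e-5, warmup_steps=10_000, total_steps=100_000):
    """Learning rate at a given optimization step.

    Assumed shape: linear warmup from 0 to peak_lr, then linear decay back to 0.
    """
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the peak rate by the fraction of post-warmup steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate peaks at step 10K and is half the peak both midway through warmup (step 5K) and midway through decay (step 55K).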


Vision Model: Frozen, GIT, CoCa, VC; Audio Model: WavCaps AC

Neural Information Processing Systems

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we seek to establish connections between multi-modality video tracks (vision, audio, and subtitle) and text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks.



A Limitations and Societal

Neural Information Processing Systems

Limitations. One limitation of our model is its potential for data bias. KOSMOS-1 is trained on a web-scale multimodal corpus, which means that it is likely to be biased towards the data that it was trained on. This could lead to the model generating text that is biased towards certain demographics or viewpoints. Another limitation of KOSMOS-1 is its relatively small size compared to other large language models. This means that the model may not be able to learn relationships between different modalities that are as complex. This could lead to the model making mistakes when it is asked to perform tasks that require a deep understanding of multiple modalities. Finally, KOSMOS-1 only supports the vision modality.



In-Context Impersonation Reveals Large Language Models' Strengths and Biases

Neural Information Processing Systems

In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts.
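The persona-prefixing step described above can be sketched as simple string construction. The template wording and the helper names `with_persona` and `persona_variants` are assumptions for illustration; the abstract does not give the exact prompt text used in the study.

```python
def with_persona(persona: str, task_prompt: str) -> str:
    """Prefix a task prompt with a persona statement (hypothetical template)."""
    return f"You are {persona}. {task_prompt}"


def persona_variants(personas, task_prompt):
    """Build one prompt per persona so that model outputs can be compared across roles."""
    return {persona: with_persona(persona, task_prompt) for persona in personas}
```

For example, `persona_variants(["a 4 year old", "a 20 year old"], "Which arm do you pull?")` yields two prompts that differ only in the persona prefix, which isolates the effect of the impersonated role on the model's behavior.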



SA3DIP: Segment Any 3D Instance with Potential 3D Priors

Neural Information Processing Systems

The proliferation of 2D foundation models has sparked research into adapting them for open-world 3D instance segmentation. Recent methods introduce a paradigm that leverages superpoints as geometric primitives and incorporates 2D multi-view masks from the Segment Anything Model (SAM) as merging guidance, achieving outstanding zero-shot instance segmentation results. However, the limited use of 3D priors restricts the segmentation performance. Previous methods calculate the 3D superpoints based solely on normals estimated from spatial coordinates, resulting in under-segmentation for instances with similar geometry. In addition, the heavy reliance on SAM and hand-crafted algorithms in 2D space leads to over-segmentation due to SAM's inherent part-level segmentation tendency. To address these issues, we propose SA3DIP, a novel method for Segmenting Any 3D Instances via exploiting potential 3D Priors.