TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Neural Information Processing Systems

There is rising research interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., pipeline frameworks that concatenate speech recognition, machine translation, and text-to-speech models. The primary challenges stem from the inherent complexity of the direct translation task and the scarcity of data. In this study, we introduce a novel model framework, TransVIP, that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separate encoders to preserve the speaker's voice characteristics and the isochrony of the source speech during translation, making the model highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
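As a rough illustration of cascaded stages combined with joint-probability inference, the sketch below scores complete hypotheses by the sum of stage log-probabilities rather than committing to each stage's single best output. The `asr`, `mt`, and `tts` callables and their beam interface are hypothetical stand-ins, not TransVIP's actual components:

```python
import math

# Minimal sketch (not the authors' code): each stage proposes candidates with
# log-probabilities, and the final hypothesis maximizes the *joint* score
# across stages instead of greedily picking each stage's top output.

def joint_decode(src_speech, asr, mt, tts, beam=4):
    """Return the target speech whose joint log-probability is maximal.

    asr/mt/tts are hypothetical callables yielding (candidate, log_prob) pairs.
    """
    best, best_score = None, -math.inf
    for text_src, lp_asr in asr(src_speech, beam):          # speech recognition
        for text_tgt, lp_mt in mt(text_src, beam):          # machine translation
            for speech_tgt, lp_tts in tts(text_tgt, beam):  # speech synthesis
                score = lp_asr + lp_mt + lp_tts             # joint log-probability
                if score > best_score:
                    best, best_score = speech_tgt, score
    return best, best_score
```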



A Appendix

Neural Information Processing Systems

Figure caption fragment. B: GPT-3's MSE averaged over trials for each task. D: GPT-3's prior expectations across tasks (blue) compared to the true task distribution (orange).

We added task similarity as a regressor on Experiment 1's and 2's respective MSE/regret regression bar plots (Figures 7A and 7B). For Experiment 1, we quantified task similarity as the average negative L2 norm of the underlying parameters (slope and intercept) relative to previous tasks. For Experiment 2, we quantified task similarity as the average difference of mean rewards relative to previous tasks.
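A minimal numpy sketch of these two regressors, under one plausible reading of the text (the aggregation and the sign convention for Experiment 2 are assumptions):

```python
import numpy as np

# Hedged sketch of the two task-similarity regressors described above; the
# exact implementation is not given in the text, so this is one plausible
# reading. `history` holds the parameters/rewards of previously seen tasks.

def similarity_exp1(current_params, history):
    """Average negative L2 norm between (slope, intercept) of the current
    task and each previous task (Experiment 1)."""
    cur = np.asarray(current_params, dtype=float)  # [slope, intercept]
    dists = [np.linalg.norm(cur - np.asarray(p, dtype=float)) for p in history]
    return -float(np.mean(dists))

def similarity_exp2(current_mean_reward, history_mean_rewards):
    """Average difference of mean rewards with previous tasks (Experiment 2)."""
    diffs = [abs(current_mean_reward - r) for r in history_mean_rewards]
    return -float(np.mean(diffs))  # sign convention assumed: larger = more similar
```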


Meta-in-context learning in large language models

Julian Coda-Forno, Marcel Binz, Matthew Botvinick

Neural Information Processing Systems

Large language models have shown tremendous performance in a variety of tasks. In-context learning - the ability to improve at a task after being provided with a number of demonstrations - is seen as one of the main contributors to their success. In the present paper, we demonstrate that the in-context learning abilities of large language models can be recursively improved via in-context learning itself.
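One plausible way to picture this setup is a prompt that concatenates several earlier tasks' demonstrations before the current task, so that in-context learning on the current task benefits from the preceding ones. The prompt template below is purely illustrative, not the authors' exact format:

```python
# Hypothetical meta-in-context prompt builder: earlier tasks (each a list of
# (x, y) demonstrations) are concatenated before the current task, and the
# model is queried for the next y. The wording is illustrative only.

def build_meta_prompt(previous_tasks, current_task_examples, query_x):
    """previous_tasks: list of lists of (x, y) pairs; returns one prompt string."""
    lines = []
    for i, task in enumerate(previous_tasks, start=1):
        lines.append(f"Task {i}:")
        lines.extend(f"x = {x}, y = {y}" for x, y in task)
    lines.append(f"Task {len(previous_tasks) + 1}:")
    lines.extend(f"x = {x}, y = {y}" for x, y in current_task_examples)
    lines.append(f"x = {query_x}, y =")  # the model completes the prediction
    return "\n".join(lines)
```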



AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web

Neural Information Processing Systems

Existing datasets for automated fact-checking have substantial limitations, such as relying on artificial claims, lacking annotations for evidence and intermediate reasoning, or including evidence published after the claim.


End-To-End Latent Variational Diffusion Models for Inverse Problems in High Energy Physics

Neural Information Processing Systems

High-energy collisions at the Large Hadron Collider (LHC) provide valuable insights into open questions in particle physics. However, detector effects must be corrected before measurements can be compared to certain theoretical predictions or to measurements from other detectors. Methods for solving this inverse problem, mapping detector observations to theoretical quantities of the underlying collision, are essential parts of many physics analyses at the LHC. We investigate and compare various generative deep learning methods for approximating this inverse mapping. We introduce a novel unified architecture, termed latent variational diffusion models, which combines the latent learning of cutting-edge generative art approaches with an end-to-end variational framework. We demonstrate the effectiveness of this approach for reconstructing global distributions of theoretical kinematic quantities and for ensuring that the learned posterior distributions adhere to known physics constraints. Our unified approach achieves a distribution-free distance to the truth over 20 times smaller than that of a non-latent state-of-the-art baseline, and 3 times smaller than that of traditional latent diffusion models.
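As a very rough sketch of the idea (not the paper's architecture), the snippet below pairs a VAE-style encoder/decoder with a latent-space denoiser trained under a single end-to-end objective. The network sizes, the linear noising schedule, and the unconditional setup are placeholder assumptions; in particular, conditioning on detector observations is omitted for brevity:

```python
import torch
import torch.nn as nn

# Placeholder sketch of a latent variational diffusion model: an encoder maps
# the target quantity into a latent space, a denoiser is trained on noised
# latents, and a decoder reconstructs the target, all under one loss.

class LatentVariationalDiffusion(nn.Module):
    def __init__(self, x_dim=64, z_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.SiLU(), nn.Linear(128, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 128), nn.SiLU(), nn.Linear(128, x_dim))
        # denoiser predicts the noise added to a latent at diffusion time t
        self.denoiser = nn.Sequential(nn.Linear(z_dim + 1, 128), nn.SiLU(), nn.Linear(128, z_dim))

    def loss(self, x):
        z = self.encoder(x)                                   # latent of the target quantity
        t = torch.rand(x.shape[0], 1)                         # random diffusion times in (0, 1)
        eps = torch.randn_like(z)
        z_t = (1 - t) * z + t * eps                           # simple linear noising schedule (assumed)
        eps_hat = self.denoiser(torch.cat([z_t, t], dim=-1))  # noise prediction
        diffusion_loss = ((eps_hat - eps) ** 2).mean()
        recon_loss = ((self.decoder(z) - x) ** 2).mean()      # reconstruction term
        return diffusion_loss + recon_loss                    # single end-to-end objective
```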