Goto

Collaborating Authors

 transcription


Melbourne psychiatrist refuses new patients who don't consent to AI note-taking

The Guardian

Digital rights experts have raised concerns about the security of the data recorded by AI in psychiatrists' sessions. Digital rights experts have raised concerns about the security of the data recorded by AI in psychiatrists' sessions. Melbourne psychiatrist refuses new patients who don't consent to AI note-taking A Melbourne psychiatrist has refused new patients unless they agree to allow her to use an AI scribe to transcribe the conversations in their sessions. AI-driven note taking tools are becoming popular within the medical industry - with two in five general practitioners now using such scribes, according to the Royal Australian College of General Practitioners (RACGP). But there have also been concerns about the security of the data and how it might be used by the AI companies, along with the accuracy of the transcriptions.


M4Singer: AMulti-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus

Neural Information Processing Systems

The lack of publicly available high-quality and accurately labeled datasets has long been a major bottleneck for singing voice synthesis (SVS). To tackle this problem, we present M4Singer, a free-to-use Multi-style, Multi-singer Mandarin singing collection with elaborately annotated Musical scores as well as its benchmarks. Specifically, 1) we construct and release a large high-quality Chinese singing voice corpus, which is recorded by 20 professional singers, covering 700 Chinese pop songs as well as all the four SATB types (i.e., soprano, alto, tenor, and bass); 2) we take extensive efforts to manually compose the musical scores for each recorded song, which is necessary to the study of the prosody modeling for SVS. 3) To facilitate the use and demonstrate the quality of M4Singer, we conduct four different benchmark experiments: score-based SVS, controllable singing voice (CSV), singing voice conversion (SVC) and automatic music transcription (AMT). Audio samples can be found at http://m4singer.github.io.


Unsupervised Learning of Spoken Language with Visual Context

Neural Information Processing Systems

Humans learn to speak before they can read or write, so why can't computers do the same? In this paper, we present a deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images. We describe the collection of our data comprised of over 120,000 spoken audio captions for the Places image dataset and evaluate our model on an image search and annotation task. We also provide some visualizations which suggest that our model is learning to recognize meaningful words within the caption spectrograms.



REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Neural Information Processing Systems

Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is a segmental structure segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.



Model Details

Neural Information Processing Systems

We decreased the confidence threshold to 0.1 to increase article and headline The following specifications were used: { resolution: 256, learning rate: 2e-3 }. This limit is binding for common words, e.g., "the". The recognizer is trained using the Supervised Contrastive ("SupCon") loss function [7], a gener-45 In particular, we work with the "outside" SupCon loss formulation We use a MobileNetV3 (Small) encoder pre-trained on ImageNet1k sourced from the timm [19] We use 0.1 as the temperature for Center Cropping, to avoid destroying too much information. C (Small) model that is developed in [2] for character recognition. If multiple article bounding boxes satisfy these rules for a given headline, then we take the highest.


ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Neural Information Processing Systems

We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks.


A file format used in the

Neural Information Processing Systems

The keywords were extracted using the procedure described in SectionC. The restricted part of the Muharaf dataset has 428 images distributed under a proprietary license.