 aed


Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Neural Information Processing Systems

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
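The decoding procedure the abstract describes (scan the embedding frames in order, emit one token per frame, stop at end-of-message) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `greedy_aligner_decode`, `toy_step`, and the `EOS` id of 0 are hypothetical, and the toy step function simply takes an argmax over per-frame scores where the real model would use a learned decoder with text-only recurrence.

```python
from typing import Callable, List, Sequence, Tuple

EOS = 0  # hypothetical end-of-message token id (assumption, not from the paper)

Frame = Sequence[float]
StepFn = Callable[[Frame, int], Tuple[int, int]]


def greedy_aligner_decode(frames: Sequence[Frame], step_fn: StepFn,
                          init_state: int = EOS, eos_id: int = EOS) -> List[int]:
    """Scan frames from the start, emitting one token per frame until EOS."""
    tokens: List[int] = []
    state = init_state
    for frame in frames:
        # The step sees only the current frame and the text-side state;
        # there is no cross-attention over the other frames.
        token, state = step_fn(frame, state)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens


def toy_step(frame: Frame, prev_token: int) -> Tuple[int, int]:
    """Toy stand-in for the decoder: argmax of the frame's scores.

    prev_token is carried as the recurrent state but ignored by this toy;
    a trained decoder would condition on it.
    """
    token = max(range(len(frame)), key=lambda i: frame[i])
    return token, token


# Four toy frames; the third one scores EOS highest, so decoding stops there.
toy_frames = [
    [0.1, 0.9, 0.0],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
    [0.0, 1.0, 0.0],
]
decoded = greedy_aligner_decode(toy_frames, toy_step)
```

Because each output position reads exactly one frame, the loop runs at most once per frame and needs no alignment search at inference time, which fits the efficiency claim in the abstract.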


Drones are delivering life-saving defibrillators to 911 calls

Popular Science

A new pilot program aims to help EMS respond faster, not act as a replacement. When they aren't baffling the public or grounding wildfire planes, drones have some pretty solid uses. Apart from unnecessarily fast same-day deliveries, the pilotless aircraft may soon become a lifesaving emergency response tool. A collaborative team of health experts, community organizations, and universities is in the middle of a pilot program using drones and automated external defibrillators (AEDs).


Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data

arXiv.org Artificial Intelligence

A joint speech and text optimization method is proposed for hybrid transducer and attention-based encoder decoder (TAED) modeling to leverage large text corpora and enhance ASR accuracy. The joint TAED (J-TAED) is trained with both speech and text input modalities together, while it takes only speech data as input during inference. The trained model can unify the internal representations from the different modalities and can be further extended to text-based domain adaptation. It can effectively alleviate data scarcity for mismatched-domain tasks, since no speech data is required. Our experiments show that J-TAED successfully integrates speech and linguistic information into one model and reduces the WER by 5.8% to 12.8% on the Librispeech dataset. The model is also evaluated on two out-of-domain datasets: one is finance-focused and the other is named-entity-focused. The text-based domain adaptation brings 15.3% and 17.8% WER reductions on those two datasets, respectively.


Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

arXiv.org Artificial Intelligence

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".


Enhancing CTC-based speech recognition with diverse modeling units

arXiv.org Artificial Intelligence

In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures such as the transformer. On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvements come from, other than the system combination effect. We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, where E2E models are trained jointly with diverse modeling units. This methodology not only aligns the strengths of both phoneme- and grapheme-based models but also reveals that using these diverse modeling units in a synergistic way can significantly enhance model accuracy. Our findings offer new insights into the optimal integration of heterogeneous modeling units in the development of more robust and accurate ASR systems.


Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future

arXiv.org Artificial Intelligence

Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore very desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns about the methods' general performance and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup including a new formalization of the annotation error detection task, evaluation protocol and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use and open source software package.


Drone saves the life of man, 71, suffering a heart attack by delivering defibrillator to his home

Daily Mail - Science & tech

A 71-year-old Swedish man who suffered a heart attack while shoveling snow in his driveway was saved by an unlikely hero - a delivery drone. Sven, a retiree who asked for his last name to be withheld, collapsed outside his home in the western town of Trollhättan in early December. Within moments of receiving the call from Sven's wife, emergency services dispatched the unmanned aerial vehicle carrying an AED, or automated external defibrillator, which arrived in less than four minutes. The system, called Emergency Medical Aerial Delivery (EMADE), was developed by Everdrones to assist patients within 10 minutes of experiencing cardiac arrest. 'Everything from the first 112 call to the drone getting the signal to start and go took about 15-30 seconds and then the whole process took about three and a half minutes,' Sven told AFP.


A New First Responder: How Drones May Revolutionize Healthcare

#artificialintelligence

A new article published last week in the European Heart Journal discusses the use of drones for delivering life-saving automated external defibrillators (AED) to out-of-hospital cardiac arrest (OHCA) patients. As the study describes, "Early treatment in line with the 'chain-of-survival' concept such as cardiopulmonary resuscitation (CPR) and defibrillation by an automated external defibrillator (AED) prior to ambulance arrival is associated with increased survival. Use of AEDs in the early-cardiac-arrest electrical phase can increase survival rates to up to 50–70%. Although hundreds of thousands of AEDs are available in high-income countries, their accessibility and use are still low." Thus, the investigators of the study designed a system to deploy drones to real-life suspected OHCA patients in order to determine whether this was a viable solution to the accessibility problem.


Reinforcement Learning for Robot Navigation with Adaptive Execution Duration (AED) in a Semi-Markov Model

arXiv.org Artificial Intelligence

Deep reinforcement learning (DRL) algorithms have proven effective in robot navigation, especially in unknown environments, by directly mapping perception inputs into robot control commands. Most existing methods adopt a uniform execution duration, with robots taking commands at fixed intervals. As such, the length of the execution duration becomes a crucial parameter of the navigation algorithm. In particular, if the duration is too short, then the navigation policy would be executed at a high frequency, with increased training difficulty and high computational cost. Meanwhile, if the duration is too long, then the policy becomes unable to handle complex situations, like those with crowded obstacles. It is thus tricky to find the "sweet" duration range; some duration values may cause a DRL model to fail to find a navigation path. In this paper, we propose to employ adaptive execution duration to overcome this problem. Specifically, we formulate the navigation task as a Semi-Markov Decision Process (SMDP) problem to handle adaptive execution duration. We also improve the distributed proximal policy optimization (DPPO) algorithm and provide its theoretical guarantee for the specified SMDP problem. We evaluate our approach both in the simulator and on an actual robot. The results show that our approach outperforms the other DRL-based method (with fixed execution duration) by 10.3% in terms of the navigation success rate.
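To illustrate why variable execution durations call for a semi-Markov formulation, the following Python sketch computes a discounted return in which each reward is discounted by the time elapsed before its action started, not by the step index. This is the textbook SMDP convention, not necessarily the exact objective used in the paper; the function name `smdp_return` and all numbers are hypothetical.

```python
from typing import Sequence


def smdp_return(rewards: Sequence[float], durations: Sequence[float],
                gamma: float = 0.99) -> float:
    """Discounted return when action k runs for durations[k] time units.

    Reward k is weighted by gamma ** (time elapsed before action k began),
    so long-running actions push later rewards further into the future.
    """
    total = 0.0
    elapsed = 0.0
    for reward, duration in zip(rewards, durations):
        total += (gamma ** elapsed) * reward
        elapsed += duration
    return total


# With unit durations this reduces to the ordinary MDP discounted return.
fixed = smdp_return([1.0, 1.0, 1.0], [1, 1, 1], gamma=0.5)

# A first action lasting 2 time units delays (and discounts) the second reward.
adaptive = smdp_return([1.0, 1.0], [2, 1], gamma=0.5)
```

With every duration set to 1, the function matches the fixed-interval setting that the abstract contrasts against.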