

Putting An End to End-to-End: Gradient-Isolated Learning of Representations

Neural Information Processing Systems

We propose a novel deep learning method for local self-supervised representation learning that requires neither labels nor end-to-end backpropagation, exploiting the natural order in data instead. Inspired by the observation that biological neural networks appear to learn without backpropagating a global error signal, we split a deep neural network into a stack of gradient-isolated modules. Each module is trained to maximally preserve the information of its inputs using the InfoNCE bound from Oord et al. [2018]. Despite this greedy training, we demonstrate that each module improves upon the output of its predecessor, and that the representations created by the top module yield highly competitive results on downstream classification tasks in the audio and visual domains. The proposal enables optimizing modules asynchronously, allowing large-scale distributed training of very deep neural networks on unlabelled datasets.
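
The core objective described here can be sketched as follows. This is a minimal, hypothetical NumPy illustration of the InfoNCE scoring step only; in the paper's setup each gradient-isolated module would minimize such a loss locally and pass activations forward with gradients blocked, so the function name and shapes below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def info_nce_loss(context, future, negatives):
    """InfoNCE: classify the true future representation among negatives.

    context:   (d,) context/prediction vector from the current module
    future:    (d,) representation of the actual next patch/frame (positive)
    negatives: (k, d) representations drawn from other positions
    """
    candidates = np.vstack([future[None, :], negatives])  # positive at index 0
    scores = candidates @ context                          # dot-product similarities
    scores = scores - scores.max()                         # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())      # log-softmax
    return -log_probs[0]                                   # NLL of the positive
```

When all candidates score equally, the loss sits at log(k+1), chance level; a context that scores the true future higher drives it toward zero, which is what each module optimizes without any global error signal.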


Reviews: Putting An End to End-to-End: Gradient-Isolated Learning of Representations

Neural Information Processing Systems

In the present manuscript the authors propose Greedy InfoMax, a greedy algorithm that allows unsupervised learning in deep neural networks with state-of-the-art performance. Specifically, the algorithm leverages implicit label information that is encoded temporally in the streaming data. Importantly, the present work rests on the shoulders and success of Contrastive Predictive Coding, but dispenses with end-to-end training entirely. Getting greedy layer-wise unsupervised learning to perform at such levels is quite impressive and will without doubt have an important impact on the community. The work is original, and the quality of the writing and figures is quite high. What I would have liked to see is a more in-depth review of the precise data generation process.


Visual Imitation Learning of Non-Prehensile Manipulation Tasks with Dynamics-Supervised Models

Mustafa, Abdullah, Hanai, Ryo, Ramirez, Ixchel, Erich, Floris, Nakajo, Ryoichi, Domae, Yukiyasu, Ogata, Tetsuya

arXiv.org Artificial Intelligence

Unlike quasi-static robotic manipulation tasks like pick-and-place, dynamic tasks such as non-prehensile manipulation pose greater challenges, especially for vision-based control. Successful control requires the extraction of features relevant to the target task. In visual imitation learning settings, these features can be learnt by backpropagating the policy loss through the vision backbone. Yet, this approach tends to learn task-specific features with limited generalizability. Alternatively, learning world models can realize more generalizable vision backbones. Utilizing the learnt features, task-specific policies are subsequently trained. Commonly, these models are trained solely to predict the next RGB state from the current state and the action taken. But RGB-only prediction might not fully capture the task-relevant dynamics. In this work, we hypothesize that direct supervision of target dynamic states (Dynamics Mapping) can learn better dynamics-informed world models. Besides the next-RGB reconstruction, the world model is also trained to directly predict the position, velocity, and acceleration of environment rigid bodies. To verify our hypothesis, we designed a non-prehensile 2D environment tailored to two tasks: "Balance-Reaching" and "Bin-Dropping". When trained on the first task, dynamics mapping enhanced the task performance under different training configurations (Decoupled, Joint, End-to-End) and policy architectures (Feedforward, Recurrent). Notably, its most significant impact was for world-model pretraining, boosting the success rate from 21% to 85%. Frozen dynamics-informed world models generalized well to a task with in-domain dynamics, but poorly to one with out-of-domain dynamics.
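
The combined objective described here, next-RGB reconstruction plus direct dynamics supervision, can be sketched as below. The function name, the 2D state layout, and the equal default weighting are illustrative assumptions, not the authors' code:

```python
import numpy as np

def world_model_loss(pred_rgb, true_rgb, pred_dyn, true_dyn, dyn_weight=1.0):
    """Dynamics-supervised world-model loss.

    pred_rgb/true_rgb: predicted and actual next RGB frames
    pred_dyn/true_dyn: (n_bodies, 6) per-body 2D dynamics targets,
                       laid out here as [px, py, vx, vy, ax, ay]
    """
    recon = np.mean((pred_rgb - true_rgb) ** 2)   # next-RGB reconstruction term
    dyn = np.mean((pred_dyn - true_dyn) ** 2)     # direct dynamics-mapping term
    return recon + dyn_weight * dyn
```

Setting `dyn_weight=0` recovers the common RGB-only baseline the abstract argues against; the dynamics term is what makes the learnt backbone "dynamics-informed".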


Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts

Koo, Ryan, Martin, Anna, Wang, Linghe, Kang, Dongyeop

arXiv.org Artificial Intelligence

Scholarly writing presents a complex space that generally follows a methodical procedure to plan and produce both rationally sound and creative compositions. Recent works involving large language models (LLMs) demonstrate considerable success in text generation and revision tasks; however, LLMs still struggle to provide structural and creative feedback on the document level that is crucial to academic writing. In this paper, we introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. We also provide ManuScript, an original dataset annotated with a simplified version of our taxonomy to show writer actions and the intentions behind them. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow and identify the distinct writer activities embedded within each higher-level process. ManuScript intends to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of writing trajectory, such that writing assistants can provide stronger feedback and suggestions on an end-to-end level. The collected writing trajectories can be viewed at https://minnesotanlp.github.io/REWARD_demo/


DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog

Zheng, Xin, Liu, Tianyu, Meng, Haoran, Wang, Xu, Jiang, Yufan, Rao, Mengliang, Lin, Binghuai, Sui, Zhifang, Cao, Yunbo

arXiv.org Artificial Intelligence

Harvesting question-answer (QA) pairs from customer service chatlog in the wild is an efficient way to enrich the knowledge base for customer service chatbots in the cold start or continuous integration scenarios. Prior work attempts to obtain 1-to-1 QA pairs from growing customer service chatlog, which fails to integrate the incomplete utterances from the dialog context for composite QA retrieval. In this paper, we propose the N-to-N QA extraction task, in which the derived questions and corresponding answers might be separated across different utterances. We introduce a suite of generative/discriminative tagging-based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets, and for the first time set up a benchmark for N-to-N DialogQAE with utterance- and session-level evaluation metrics. With a deep dive into extracted QA pairs, we find that the relations between and inside the QA pairs can be indicators to analyze the dialogue structure, e.g., information seeking, clarification, barge-in and elaboration. We also show that the proposed models can adapt to different domains and languages, and reduce the labor cost of knowledge accumulation in the real-world product dialogue platform.
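
As a toy illustration of the N-to-N pairing idea: suppose a tagging model has already labeled each utterance with a question or answer index (Q1, A1, ...). Pairing then reduces to grouping by index, even when the pieces are separated across the dialog. The tag scheme and helper below are hypothetical stand-ins, not the paper's models:

```python
from collections import defaultdict

def pair_qa(tagged_utterances):
    """Group utterances tagged Q<i>/A<i> into QA pairs.

    A question or answer may span several utterances; a shared index
    links them even when they are separated in the chatlog, and an
    unanswered question yields an empty answer string.
    """
    questions, answers = defaultdict(list), defaultdict(list)
    for text, tag in tagged_utterances:
        if tag.startswith("Q"):
            questions[tag[1:]].append(text)
        elif tag.startswith("A"):
            answers[tag[1:]].append(text)
    return {i: (" ".join(questions[i]), " ".join(answers.get(i, [""])))
            for i in questions}
```

Because several utterances can share one index and several indices can interleave, the same grouping naturally covers the N-to-N case that 1-to-1 extraction misses.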


End-to-end learning using CARLA Simulator

#artificialintelligence

End-to-end learning refers to using a single system/model to perform complex tasks instead of breaking them into smaller simple tasks. One such complex task is driving. We humans are born with state-of-the-art capabilities when it comes to vision and learning complex tasks. Therefore it might seem simple when we look at it as humans, but we only realize the complexity when we have to build a system to carry out the same task. In this blog, we will try to demonstrate how end-to-end learning can be used to solve the driving problem using convolutional neural networks.
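
A minimal sketch of the idea: a single camera frame mapped directly to one steering command, with no hand-built perception or planning stages in between. Real end-to-end driving stacks (such as a network trained in CARLA) use many convolutional layers and learned weights, so treat the single layer and shapes here as illustrative only:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution over a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def steering_from_image(image, kernel, weights, bias):
    """End-to-end mapping: raw pixels -> one steering value in [-1, 1]."""
    features = np.maximum(conv2d(image, kernel), 0.0)         # conv + ReLU
    return float(np.tanh(features.ravel() @ weights + bias))  # linear head
```

Everything between pixels and steering is learned jointly from demonstration data, which is exactly what distinguishes this from a modular perception/planning/control pipeline.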


Simple and Effective Unsupervised Speech Translation

Wang, Changhan, Inaguma, Hirofumi, Chen, Peng-Jen, Kulikov, Ilia, Tang, Yun, Hsu, Wei-Ning, Auli, Michael, Pino, Juan

arXiv.org Artificial Intelligence

The amount of labeled data available to train models for speech tasks is limited for most languages; this scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation, and speech synthesis, either in a pipeline approach or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially in low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark. On CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on the MuST-C and CVSS benchmarks.
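
The two routes contrasted here, cascading unsupervised components versus using them to pseudo-label data for an end-to-end model, can be sketched with placeholder callables. The `asr` and `mt` stand-ins below are assumptions for illustration, not the paper's systems:

```python
def cascade_translate(audio, asr, mt):
    """Cascaded ST: unsupervised ASR followed by unsupervised MT."""
    return mt(asr(audio))

def make_pseudo_labels(audios, asr, mt):
    """Pseudo-label unlabeled audio for end-to-end ST training:
    each (audio, machine translation of its transcript) pair
    becomes a synthetic training example."""
    return [(audio, mt(asr(audio))) for audio in audios]
```

The cascade is usable immediately but compounds component errors at inference time; the pseudo-label route pays that cost once, at data-generation time, to train a single direct speech-to-translation model.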


A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation

Nguyen, Linh The, Tran, Nguyen Luong, Doan, Long, Luong, Manh, Nguyen, Dat Quoc

arXiv.org Artificial Intelligence

In this paper, we introduce a high-quality and large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours, consisting of 331K triplets of (sentence-length audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments using strong baselines and find that the traditional "Cascaded" approach still outperforms the modern "End-to-End" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope both our publicly available dataset and study can serve as a starting point for future research and applications on English-Vietnamese speech translation. Our dataset is available at https://github.com/VinAIResearch/PhoST


YOLOv5, End-to-End object detector project on custom dataset

#artificialintelligence

This could be a command you give one of your drones walking in the forest. The technology we're gonna use here is so light, I'm sure this is far from a fantasy. In my previous article, I walked through a first draft to classify mushrooms using CNNs with Tensorflow libraries. I used the Fungus competition dataset available on Kaggle. Many images of this dataset contain multiple objects with a rich background.