Undirected Networks
Learning for Structured Prediction
Structured prediction is the main term for supervised machine learning techniques. Those techniques are involved predicting structured objects, instead of scalar discrete or real values. Structured prediction models are normally trained by means of observed data. In which the true value is used to regulate model parameters similar to usually used supervised learning techniques. The process of prediction using a trained model and of training the aforementioned is frequently computationally infeasible.
Masked prediction tasks: a parameter identifiability view
Liu, Bingbin, Hsu, Daniel, Ravikumar, Pradeep, Risteski, Andrej
The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream tasks to focus on -- in practice, this problem is usually resolved by competing on the benchmark dataset du jour. In this paper, we present an alternative lens: one of parameter identifiability. More precisely, we consider data coming from a parametric probabilistic model, and train a self-supervised learning predictor with a suitably chosen parametric form. Then, we ask whether we can read off the ground truth parameters of the probabilistic model from the optimal predictor. We focus on the widely used self-supervised learning method of predicting masked tokens, which is popular for both natural languages and visual data. While incarnations of this approach have already been successfully used for simpler probabilistic models (e.g. learning fully-observed undirected graphical models), we focus instead on latent-variable models capturing sequential structures -- namely Hidden Markov Models with both discrete and conditionally Gaussian observations. We show that there is a rich landscape of possibilities, out of which some prediction tasks yield identifiability, while others do not. Our results, borne of a theoretical grounding of self-supervised learning, could thus potentially beneficially inform practice. Moreover, we uncover close connections with uniqueness of tensor rank decompositions -- a widely used tool in studying identifiability through the lens of the method of moments.
what-is-the-use-of-machine-learning-handwriting-recognition
Recent Deep Learning advancements, such as the introduction of transformer topologies, have helped us accelerate our handwritten character recognition. Intelligent Character Recognition (ICR), is a term used to describe the process for recognizing handwritten content. ICR algorithms require more intelligence than ordinary OCR. This post will cover the challenges of handwritten text identification and the techniques that can be used to tackle them using deep learning and machine learning. In the healthcare/pharmaceutical industry, patient medication digitization is a serious issue. Roche processes millions of PDFs each day, processing petabytes in medical PDFs.
On the potential of Transformers in Reinforcement Learning
Summary Transformers architectures are the hottest thing in supervised and unsupervised learning, achieving SOTA results on natural language processing, vision, audio and multimodal tasks. Their key capability is to capture which elements in a long sequence are worthy of attention, resulting in great summarisation and generative skills. Can we transfer any of these skills to reinforcement learning? The answer is yes (with some caveats). I will cover how it’s possible to refactor reinforcement learning as a sequence problem and reflect on potential and limitations of this approach. Warning: This blogpost is pretty technical, it presupposes a basic understanding of deep learning and good familiarity with reinforcement learning. Previous knowledge of transformers is not required. Intro to Transformers Introduced in 2017, Transformers architectures took the deep learning scene by storm: they achieved SOTA results on nearly all benchmarks, while being simpler and faster than the previous overengineered networks and scaling well with additional data. For instance by getting rid of convolutional network (CNNs) we can go from the complex Feature Pyramid Network Base RCNN of Detectron2, which uses the old fashioned anchor boxes of classic computer vision, (Feature Pyramid Network Base RCNN Architecture, image taken from this nice blogpost) to a transformer model based on embedding vectors extracted from the image patches and few hidden layers. (ViT Architecture, all credits to the original paper) Similarly, in the NLP domain, transformers get rid of recurrent neural networks (RNNs) and are able to handle sequences as a whole, rather than sequentially. The key are the self-attention and multi-head attention layers, which roughly tell the network how elements in a long sequence correlate with each other, so that for every element (say a word, or a pixel) we know where we should pay attention to. The best example is in text translation: when translating “piace” in “Mi piace questo articolo”, the transformer needs to also pay attention to the word “Mi” to correctly translate it into “I like this article”, while there is no need to pay attention to the other words. A similar proposition such as “Le piace questo articolo” translates instead into “She likes this article”. The transformer architecture is usually divided into encoder and decoder modules. The encoder is responsible for taking the input data and building representations of its features, usually as an embedding vector. The decoder takes these embeddings as input and generates new sequences. Depending on what we need to do, we may only need one of these modules. Encoder-only architectures are used for tasks requiring understanding of the input, such as sentiment analysis and sentence classification. Deconder-only models are generative models, for instance text or image generation. Encoder-Decoder architectures are suited for text translation and summarisation. Famous networks of each category include: Encoder-only: BERT, RoBERTa, ALBERT, ViT Deconder-only: GPT2, GPT3 Encoder-Decoder: BART, T5, DETR The de-facto standard to play with transformers is the hugginface library, which contains excellent documentation which I encourage you to check. The next big-thing for transformer architectures in supervised/unsupervised learning is likely to be multimodal transformers: networks able to ingest text, audio, images and more at the same time. RL Transformers We usually model reinforcement learning problems as a Markov decision process described by sequences of actions, states, transition probabilities and rewards. At every step a typical algorithm would check only the current state and the past performances encoded in a state or state-action value function to decide which action to take next. There is no notion of “previous-action”, a Markov process doesn’t care (or better, it doesn’t remember). The key hypothesis behind RL transformers is: what if instead we take the whole sequence as the building block to base our prediction? If we do so, we mapped a 1-step Markovian problem into a sequence-to-sequence or sequence-to-action problem, which we can approach using battle tested supervised learning techniques. No value functions, no actor-critics, no bootstrapping. Sequences are usually called trajectories in RL literature, so we will use these terms interchangeably. We will assume that trajectories are already given before-hand, so we will limit ourselves to offline-reinforcement learning. These trajectories may have been produced by another RL agent or they may be expert (human) demonstrations. RL with transformers has been recently explored in the paper Offline Reinforcement Learning as One Big Sequence Modeling Problem and Decision Transformer: Reinforcement Learning via Sequence Modeling. In the former work they first use a GPT-like architecture to ingest trajectories composed by actions, states, rewards [\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots )] and to generate novel candidate trajectories. The setup is nearly identical to the one that we would use in natural language processing. Finally they apply a beam-search strategy to planning, by iteratively expanding the most promising trajectories weighted by cumulative returns until a single trajectory stands out as the most promising. To avoid being too greedy they also augment trajectories with the reward-to-go $R_t$ (future cumulative reward) at each step. [R_t = \sum_{t’ = t}^{T} \gamma^{t’-t} r_{t’}] In the latter work they get rid of rewards at each time step and only consider the reward-to-go, directly predicting the next action given the state and a target final return. [\tau = (s_1, a_1, R_1, s_2, a_2, R_2, \dots )] This very simple approach is reminiscent of upside-down reinforcement learning, that is instead of asking the agent to search for an optimal policy, just ask it to “obtain so much total reward in so much time”. In essence we are doing supervised learning, asking the agent to extrapolate based on past trajectories which trajectory will lead to the highest rewards. Actually it’s possible to condition the agent to achieve arbitrary rewards (among the ones achieved in the training dataset). This can be handy if intermediate levels of rewards are still meaningful, for instance last year I used this technique to teach a food scooping robot (before knowing that upside-down RL had a name) to pick different portion sizes, having set the exact weight of scooped food as a positive reward. Basically the robot trains to scoop as much as possible, but if we ask for a small portion it should be able to do that! From what I wrote so far you may be thinking that this is pretty much behaviour cloning on transformers steroids… and you would not be that wrong! Indeed the authors of Decision Transformer also recongnised this and benchmarked against vanilla behaviour cloning (by the way, props to how they divided their discussion session in clearly articulated questions). The conclusion is that, even though performance are often comparable, the transformers approach do outperform vanilla behaviour cloning if there is a large amount of training data. How do RL transformers perform overall? Pretty well, they are often on par or better than state of the art TD-learning model-free offline methods such as Conservative Q-Learning (CQL) on benchmarks like Atari and D4RL! On Atari they have been tested on a DQN-replay Atari dataset, performing on par or better in Breakout, Pong and Seaquest, while worst on Qbert. On D4RL (an offline RL benchmark using OpenAI Gym) they have been tried on continuous tasks such as HalfCheetah, Hopper and Walker being again on par with state of the art methods. Tests on a Key-to-Door environment, in which a key must be picked well in advance of reaching the final door, also shows very good long-term credit assignment capabilities. Tests on modified D4RL benchmarks in which the reward is given only at episode’s end shows that transformers perform well also in sparse reward settings, while the performance of CQL plummets. The Good and the Hype So, why is this whole thing interesting? The reasons are very pragmatic, there is the opportunity of transferring progress from very well funded fields (NLP, Vision for consumer applications) to many less wealthy fields (and often harder, e.g. robotics) out of the box. Battle tested architectures trained on large quantities of data can be reused, saving engineering time on RL algorithms (no need to bootstrap value functions, no actor-critic networks, etc.). These techniques may become the standard to pretrain agents. On a more technical level, RL transformers appear to be good at long-term credit assignment and in sparse rewards scenarios, which is very often what we face in real-world RL. That said, these methods are not a panacea, for instance no serious benchmark has been tried on online RL, the “normal” RL. Knowing that current deep learning techniques tend to handle very poorly discrepancies from identically and independently distributed statistics, current transformers architectures may not be enough. Also, for specific domains ultimately it boils down to the availability of large datasets of expert trajectories, which in fields like robotics are still missing. All in all, the idea is pretty fresh and there may still be large room for improvement. Onwards!
Chord-Conditioned Melody Choralization with Controllable Harmonicity and Polyphonicity
Wu, Shangda, Li, Xiaobing, Sun, Maosong
Melody choralization, i.e. generating a four-part chorale based on a user-given melody, has long been closely associated with J.S. Bach chorales. Previous neural network-based systems rarely focus on chorale generation conditioned on a chord progression, and none of them realised controllable melody choralization. To enable neural networks to learn the general principles of counterpoint from Bach's chorales, we first design a music representation that encoded chord symbols for chord conditioning. We then propose DeepChoir, a melody choralization system, which can generate a four-part chorale for a given melody conditioned on a chord progression. Furthermore, with the improved density sampling, a user can control the extent of harmonicity and polyphonicity for the chorale generated by DeepChoir. Experimental results reveal the effectiveness of our data representation and the controllability of DeepChoir over harmonicity and polyphonicity. The code and generated samples (chorales, folk songs and a symphony) of DeepChoir, and the dataset we use now are available at https://github.com/sander-wood/deepchoir.
Autonomous Vehicles on the Edge: A Survey on Autonomous Vehicle Racing
Betz, Johannes, Zheng, Hongrui, Liniger, Alexander, Rosolia, Ugo, Karle, Phillip, Behl, Madhur, Krovi, Venkat, Mangharam, Rahul
The rising popularity of self-driving cars has led to the emergence of a new research field in the recent years: Autonomous racing. Researchers are developing software and hardware for high performance race vehicles which aim to operate autonomously on the edge of the vehicles limits: High speeds, high accelerations, low reaction times, highly uncertain, dynamic and adversarial environments. This paper represents the first holistic survey that covers the research in the field of autonomous racing. We focus on the field of autonomous racecars only and display the algorithms, methods and approaches that are used in the fields of perception, planning and control as well as end-to-end learning. Further, with an increasing number of autonomous racing competitions, researchers now have access to a range of high performance platforms to test and evaluate their autonomy algorithms. This survey presents a comprehensive overview of the current autonomous racing platforms emphasizing both the software-hardware co-evolution to the current stage. Finally, based on additional discussion with leading researchers in the field we conclude with a summary of open research challenges that will guide future researchers in this field.
A Unified Perspective on Value Backup and Exploration in Monte-Carlo Tree Search
Dam, Tuan, D'Eramo, Carlo, Peters, Jan, Pajarinen, Joni
Monte-Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires the use of efficient exploration strategies for navigating the planning tree and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator and entropy regularization. We provide strong theoretical guarantees to bound convergence rate, approximation error, and regret of our methods. Moreover, we introduce a mathematical framework based on the use of the $\alpha$-divergence for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, allowing to obtain different methods by simply changing the value of $\alpha$. In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by tuning the single $\alpha$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study from basic toy problems to the complex Atari games, and including both MDP and POMDP problems.
Cooperative Solutions to Exploration Tasks Under Speed and Budget Constraints
We present a multi-agent system where agents can cooperate to solve a system of dependent tasks, with agents having the capability to explore a solution space, make inferences, as well as query for information under a limited budget. Re-exploration of the solution space takes place by an agent when an older solution expires and is thus able to adapt to dynamic changes in the environment. We investigate the effects of task dependencies, with highly-dependent graph $G_{40}$ (a well-known program graph that contains $40$ highly interlinked nodes, each representing a task) and less-dependent graphs $G_{18}$ (a program graph that contains $18$ tasks with fewer links), increasing the speed of the agents and the complexity of the problem space and the query budgets available to agents. Specifically, we evaluate trade-offs between the agent's speed and query budget. During the experiments, we observed that increasing the speed of a single agent improves the system performance to a certain point only, and increasing the number of faster agents may not improve the system performance due to task dependencies. Favoring faster agents during budget allocation enhances the system performance, in line with the "Matthew effect." We also observe that allocating more budget to a faster agent gives better performance for a less-dependent system, but increasing the number of faster agents gives a better performance for a highly-dependent system.
Fitting Sparse Markov Models to Categorical Time Series Using Regularization
Majumder, Tuhin, Lahiri, Soumendra, Martin, Donald
The major problem of fitting a higher order Markov model is the exponentially growing number of parameters. The most popular approach is to use a Variable Length Markov Chain (VLMC), which determines relevant contexts (recent pasts) of variable orders and form a context tree. A more general approach is called Sparse Markov Model (SMM), where all possible histories of order $m$ form a partition so that the transition probability vectors are identical for the histories belonging to a particular group. We develop an elegant method of fitting SMM using convex clustering, which involves regularization. The regularization parameter is selected using BIC criterion. Theoretical results demonstrate the model selection consistency of our method for large sample size. Extensive simulation studies under different set-up have been presented to measure the performance of our method. We apply this method to classify genome sequences, obtained from individuals affected by different viruses.